Introducing a separate charge specifically targeting those of your customers who choose to self-host your hilariously fragile infrastructure is certainly a choice. And one I assume is in no way tied to adoption/usage-based KPIs.
Of course, if you can just fence in your competition and charge admission, it'd be silly to invest time in building a superior product.
We've self-hosted GitHub Actions runners in the past, and it doesn't help all that much with the fragile part. With GitHub, the problem is as much triggering the actions as running them. ;) I hope the product gets some investment, because it has been unstable for so long that on the inside it must just be business as usual by now. GitHub has the worst uptime of any SaaS tool we use at the moment, and it isn't even close.
> Actions is down again, call Brent so he can fix it again...
We self-host the runners in our infrastructure and builds are over 10x faster than relying on their cloud runners. It's crazy how much performance you get from running on your own hardware instead of their shared CPUs: our Gradle + Docker build went from ~8 minutes to a mere 15 seconds on self-hosted machines.
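The workflow change itself is tiny, by the way. Roughly something like this, assuming you registered your runners with a "linux" label (the labels are whatever you picked at registration time):

    jobs:
      build:
        # was: runs-on: ubuntu-latest
        runs-on: [self-hosted, linux]   # pick up jobs on our own hardware
        steps:
          - uses: actions/checkout@v4
          - run: ./gradlew build        # warm Gradle daemon + local Docker layer cache do the rest

Most of the speedup comes from dedicated CPUs and caches that survive between jobs, not from anything clever in the YAML.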
This! We went from 20(!!) minutes and 1.2k in monthly spend on very, very brittle action runs to a full CI run in 4 minutes, always passing, just by going to Hetzner's server auction page and bidding on a 100 euro Ryzen machine.
After self-hosting, our builds ended up so fast that what we were actually waiting on was GitHub scheduling our agents, rather than the job itself running. It sucked a bit, because we'd optimized so much, yet at the 90th percentile it still took GitHub 20-30 seconds to schedule the jobs, measured from when the commit hit the branch to when the webhook was sent.
My company uses GitHub, GitLab, and Jenkins. We'll soon™ be migrating off of GitLab in favor of GitHub because it's a Microsoft shop and we get some kind of discount on GitHub for spending so much other money with Microsoft.
Scheduling jobs, actually getting them running, is virtually instant with GitLab but slow AF on GitHub for no discernible reason.
lmao I just realized on this forum writing this way might sound like I own something. To be clear, I don't own shit. I typically write "my employer", and should have here.
I don't know, maybe it's a compliment. Wood glue can form bonds stronger than the material it's bonding. So, the wood glue in this case is better than the service it's holding together :)
I tend to just rely on the platform installers, then write my own process scripts to handle the work beyond the runners. That lets me exercise most of the process without having to (re)run the CI/CD pipelines over and over, which can be cumbersome, and a pain when they do break.
The only self-hosted runners I've used have been for internalized deployments separate from the build or (pre)test processes.
Aside: I've come to rely on Deno heavily for a lot of my scripting needs since it lets me reference repository modules directly and not require a build/install step ahead of time... just write TypeScript and run.
We chose GitHub Actions because it was tied directly to GitHub, which provides the best pull-request experience, etc. We didn't really use the Actions templating since we had our own stuff for that, so the only thing Actions actually had to do was start, run a few light jobs (the CI technically ran elsewhere), and report the final status.
When you've got many hundreds of services, managing these in Actions YAML itself is no bueno. As you mentioned, having the option to actually run the CI/CD yourself is a must. Having to wait 5 minutes and burn many commits just to test an action drains you very fast.
Granted, we did end up making the CI so fast (~1 minute with a dependency cache, ~4 minutes without) that we saw devs running the setup on their personal workstations less and less. Except when GitHub Actions went down... ;) We used self-hosted Jenkins before and it was far more stable, but a pain to maintain and understand.
In 2023 I quoted a customer some 30 hours to deploy a Kubernetes cluster to Hetzner specifically to run self-hosted GitHub Actions Runners.
After 10-ish hours the cluster was operational. The remaining 18 (plus 30-something unbillable to satisfy my conscience) were spent trying and failing to diagnose an issue which is still unsolved to this day[1].
Am I right in assuming it’s not the amount of payment but the transition from $0 to paying a bill at all?
I’m definitely sure it’s saving me more than $140 a month to have CI/CD running and I’m also sure I’d never break even on the opportunity cost of having someone write or set one up internally if someone else’s works - and this is the key - just as well.
But investment in CI/CD is investment in future velocity: the hours invested are paid for by hours saved. So if the outcome is brittle and requires oversight, those savings shrink or disappear.
I use them minimally and haven't stared at enough failures yet to see the patterns. Generally speaking, my MO is to remove at least half of the moving parts of any CI/CD system I encounter, and I've gone well beyond that several times.
When CI and CD stop being flat and straightforward, they lose their power to make devs clean up their own messes. And that's one of the most important qualities of CI.
Most of your build should be under version control, and I don't mean checked-in YAML files that drive a CI tool.
The only company I’ve held a grudge against longer than MS is McDonalds and they are sort of cut from the same cloth.
I’m also someone who paid for JetBrains when everyone still thought it wasn’t worth money to pay for a code editor. Though I guess that’s the prevailing attitude again now. And everyone is using an MS product instead.
I run a local Valhalla build cluster to power the https://sidecar.clutch.engineering routing engine. The cluster runs daily and takes a significant amount of wall-clock time to build the entire planet. That's about 50% of my CI time; the other 50% is presubmits + App Store builds for Sidecar + CANStudio / ELMCheck.
Using GitHub Actions to coordinate the Valhalla builds was a nice-to-have, but this is a deal-breaker for my pull request workflows.
I found that implementing a local cache on the runners has been helpful. Ingress/egress even on the local network is hella slow, especially when each build has ~10-20GB of artifacts to manage.
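For context, the "local cache" is nothing fancy: just a persistent directory on the runner's disk that each job seeds from and writes back to, instead of shipping 10-20GB of artifacts around on every run. A rough sketch (the paths and the build script name are made up):

    jobs:
      build-tiles:
        runs-on: [self-hosted, linux]
        steps:
          - uses: actions/checkout@v4
          # seed from the previous run's output, if any
          - run: rsync -a /mnt/ci-cache/tiles/ ./tiles/ || true
          - run: ./scripts/build_tiles.sh   # placeholder for the actual build
          # write the result back for the next run
          - run: rsync -a ./tiles/ /mnt/ci-cache/tiles/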
ZeroFS looks really good. I know a bit about this design space but hadn't run across ZeroFS yet. Do you do testing of the error recovery behavior (connectivity etc)?
This has been mostly manual testing for now.
ZeroFS currently lacks automatic fault injection and proper crash tests, and it’s an area I plan to focus on.
SlateDB, the lower layer, already does DST as well as fault injection though.
GH Actions templates don't build all branches by default. I guess it's because they don't want the free tier using too many resources. But I consider it an anti-pattern to not build everything on each push.
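To be fair, it's a one-line change to widen the trigger that the starter templates pin to the default branch, if I remember the filter syntax right:

    on:
      push:
        branches:
          - '**'   # build every branch, not just main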
They already charge for this separately (at least for storage). Some compute cost may be justified, but you'd hope this change would come with some commitment to fixing bugs (many open for years) in their CI platform -- as opposed to investing all their resources in a (mostly inferior) LLM agent (Copilot).
- GitHub Copilot PR reviews are subpar compared to what I've seen from other services: at least for our PRs they tend to be mostly an (expensive) grammar/spell check
- given that it's GitHub-native you'd expect good integration with the platform, but when your org is behind a (GitHub) IP whitelist things seem to break often
- the network firewall for the agent doesn't seem to work properly
I've raised tickets for all of these, but given how well it works when it does, I might as well just migrate to another service.
I don't work for Microsoft (in fact, I work for a competitor), and I think it's totally reasonable to charge for workflow executions. It's not like they're free to build, operate, and maintain.
You don't trust devs to run things, to have git hooks installed, to have a clean environment, to not have uncommitted changes, to not have a diverging environment on their laptop.
Actions lets you test things in multiple environments, test them with credentials against resources devs don't have access to, and do additional things like deploys, managing version numbers, on and on.
With CI, especially on pull requests, you can leave longer-running tests for GitHub to take care of verifying. You can run periodic tests once a day, like an hour-long smoke test.
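Concretely, the "multiple environments" and "once a day" parts are just a matrix plus a cron trigger. A rough sketch (the smoke-test script is a placeholder):

    name: nightly-smoke
    on:
      schedule:
        - cron: '0 3 * * *'   # once a day at 03:00 UTC
    jobs:
      smoke:
        strategy:
          matrix:
            os: [ubuntu-latest, macos-latest, windows-latest]
        runs-on: ${{ matrix.os }}
        steps:
          - uses: actions/checkout@v4
          - run: ./scripts/smoke-test.sh   # placeholder for the hour-long suite
            shell: bash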
CI is guard rails against common failure modes: it turns requiring everyone to follow an evolving script into something automatic nobody needs to think about much.
> You don't trust devs to run things, to have git hooks installed, to have a clean environment, to not have uncommitted changes, to not have a diverging environment on their laptop.
... Is nobody in charge on the team?
Or is it not enough that devs adhere to a coding standard, work to APIs, etc., but you also expect them to follow a common process to get there (as opposed to whatever makes them individually most productive)?
> You can run periodic tests once a day like an hour long smoke test.
Which is great if you have multiple people expected to contribute on any given day. Quite a bit of development on GitHub, and in general, is not so... corporate.
I don't deny there are use cases for this sort of thing. But people on HN talking about "hosting things locally" seem to describe a culture utterly foreign to me. I don't, for example, use multiple computers throughout the house that I want to "sync". (I don't even use a smartphone.) I really feel strongly that most people in tech would be better served by questioning the existing complexity of their lives (and setups), than by questioning what they're missing out on.
It seems like you may not have much experience working in groups of people.
>... Is nobody in charge on the team?
This isn't how things work. You save your "you MUST do these things" for special rare instructions. A complex series of checks for code format/lint/various tests... well that can all be automated away.
And you just don't get large groups of people all following the same series of steps several times a day, particularly when the steps change over time. It doesn't matter how "in charge" anybody is; neither the workplace nor an open source project is army boot camp. You won't get compliance, and trying to enforce it will make everybody hate you and turn you into an asshole.
Automation makes our lives simpler and higher quality, particularly CI checks. They're such an easy win.
The runner software they provide is solid; I’ve never had an issue with it in 4 years of administering self-hosted GitHub Actions runners. Hundreds of thousands of runners have taken jobs, done the work, destroyed themselves, and been replaced with clean runners, all without a single issue with the runners themselves.
Workflows, on the other hand, have problems. The whole design is a bit silly.
it's not the runners, it's the orchestration service that's the problem
We've been working to move all our workflows to self-hosted, on-demand ephemeral runners. It was severely delayed by finding out how slipshod the Actions Runner Service is, and we had to redesign to handle out-of-order or plain missing webhook events. Jobs would start running before a workflow_job event was delivered.
We've got it to the point where we can detect a GitHub Actions outage, and let them know by opening a support ticket, before the status page updates.
That’s not hard: the status page is updated manually, and they wait for support tickets to confirm an issue before updating it. (Users are a far better monitoring service than any automated product.)
Webhook deliveries do suffer sometimes, which sucks, but that’s not the fault of the Actions orchestration.
I'm seeing wonky webhook deliveries for Actions service events, like them being dropped completely, while other webhooks work just fine. I struggle to see what else could be responsible for that behaviour. It has to be the case that the Actions service emits the events that trigger webhook deliveries, and sometimes it messes them up.