There are definitely some areas in software-land where graphing data, or directly eyeballing it, genuinely helps you spot patterns that statistical methods would be cumbersome/tricky/otherwise annoying to find, like log analysis[1]
This sharp uptick in LLMs' in-context "learning" capabilities means I'm more excited than ever to get to grips with "new" languages like Nim or Gleam (though I worry that using LLMs to help me reach a working end state will rob me of some of the experience of learning).
Every MCP vs CLI argument I've seen really glosses over _where_ the agent is running, and how that makes a difference. For individual users running agents locally, I totally agree that CLIs cover the vast majority of use cases, where available.
Something I've not seen anyone mention is that MCPs make much more sense for equipping agents on third-party platforms with the tools they need - installing specific CLIs often isn't possible there, and there's the question of whether you trust the platform with your CLI authentication keys.
We last submitted a SWE-bench Verified result in November 2024 - at the time, I believe we were in the top 5 entrants.
We expect Engine to be as good as the other code-writing agents out there at the moment - we understand almost everyone in the space to be using very similar base models and agent scaffolding.
I know of https://modal.com/, which I believe is used by Codegen and Cognition.
Anecdotally, I hear that many companies in the LLM agent space roll their own sandboxing solutions - I've heard of both Firecracker- and Kubernetes-based implementations.
I use this for work - but there are edge cases all over the place that I keep running into (e.g. Yarn being installed on GitHub-hosted runners, but not on self-hosted ones or act - https://github.com/actions/setup-node/issues/182)
Same experience here. Edge cases everywhere, though most can be worked around.
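For that particular Yarn gap, one workaround sketch is to install Yarn explicitly in the workflow rather than assume the runner image ships it (the step name is made up for illustration; `corepack enable` is an alternative on newer Node versions):

```yaml
# Hypothetical workflow step - makes the job portable across GitHub-hosted
# runners, self-hosted runners, and act, instead of relying on a preinstall.
- name: Ensure Yarn is available
  run: npm install --global yarn
```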
You can specify different runner images to use. The default images are a compromise to keep size down. There's also a very large image that tries to include everything you might want - I'd suggest trying that if you don't mind the size (15GB IIRC).
I definitely remember considering the larger images - I think we ended up not using them since my work's use case for act is running users' GitHub workflows on-demand on temporary VMs. The hope was that most usage would be covered by the smaller images - and in fairness, that has been true so far.
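For reference, a sketch of how that image override looks with act - the image name below is the "full" catthehacker image mentioned in act's docs at the time of writing, so verify the current tag before relying on it:

```shell
# One-off: map the ubuntu-latest platform to the large "full" image.
act -P ubuntu-latest=catthehacker/ubuntu:full-latest

# Or persist it in an .actrc file so every invocation picks it up:
#   -P ubuntu-latest=catthehacker/ubuntu:full-latest
```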
I had a problem recently trying to send LLM-generated text between two web servers under my control, from AWS to Render - I was getting 403s for suspected command injection from Render's Cloudflare protection, which is opaque and unconfigurable for users.
The hacky workaround, which has been working stably for a while now, was to encode the offending request body and decode it on the destination server.
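A minimal sketch of that encode/decode trick, assuming base64 over JSON - the function and field names are made up for illustration, and this is one of several encodings that would work:

```python
import base64
import json


def encode_body(text: str) -> str:
    """Wrap LLM-generated text so WAF rules never see shell-like patterns on the wire."""
    payload = json.dumps({"text": text})
    return base64.b64encode(payload.encode("utf-8")).decode("ascii")


def decode_body(encoded: str) -> str:
    """Reverse the encoding on the destination server before further processing."""
    payload = json.loads(base64.b64decode(encoded).decode("utf-8"))
    return payload["text"]


# A string of the sort that commonly trips command-injection WAF rules:
suspicious = "run `rm -rf /tmp/foo` && cat /etc/passwd"
assert decode_body(encode_body(suspicious)) == suspicious
```

The trade-off is that the destination endpoint must know to decode, and you lose any genuine WAF protection on that field - acceptable here since both servers are under the same owner's control.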
[1] https://jvns.ca/blog/2022/12/07/tips-for-analyzing-logs/