IMHO, jumping from Level 2 to Level 5 is a matter of:
- Better structured codebases - we need hierarchical codebases with minimal depth, maximal orthogonality, and reasonable width. Think microservices.
- Better documentation - most code documentation is not built to handle updates. We need a proper graph structure with a few sources of truth that get propagated downstream. Again, some optimal sort of hierarchy is crucial here.
At this point, I really don't think that we necessarily need better agents.
Set up your codebase optimally, spin up 5-10 instances of gpt-5-codex-high for each issue/feature/refactor (pick the best according to some criteria), and your life will go smoothly.
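A very rough sketch of what that loop could look like, assuming you isolate each attempt in its own git worktree; `run_agent` is a hypothetical wrapper for whatever agent CLI you use, and "pick the best" is reduced here to "first attempt whose tests pass":

```bash
#!/usr/bin/env bash
# Rough sketch: run N independent agent attempts in separate git worktrees,
# then keep the first one whose tests pass. `run_agent` is hypothetical.
set -euo pipefail

ISSUE="$1"       # e.g. path to the issue/feature description
N="${2:-5}"

for i in $(seq 1 "$N"); do
  git worktree add -q -b "attempt-$i" "../attempt-$i"
  ( cd "../attempt-$i" && run_agent "$ISSUE" ) || true   # keep going if one attempt errors out
done

for i in $(seq 1 "$N"); do
  if ( cd "../attempt-$i" && npm test ); then             # swap in your own "best" criteria
    echo "attempt-$i passes; review and merge that branch"
    break
  fi
done
```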
Microservices should already be a last resort, reserved for when you've either:
a) hit technical scale that necessitates them
b) hit organizational complexity that necessitates them
Opting to introduce them sooner will almost certainly increase the complexity of your codebase prematurely (already a hallmark of LLM development).
> Better documentation
If this means capturing the reasoning behind decisions, then yes. If this means explaining the code, then no - code is the best documentation. English is nowhere near as good at describing how to interface with computers.
Given how long gpt-5-codex has been out, there's no way you've followed these practices long enough to consider them definitive (2 years at the least, likely much longer).
Not yet, unfortunately, but I'm in the process of building one.
This was my journey: I vibe-coded an Electron app and ended up with a terrible monolithic architecture and mostly badly written code. Then I took the app's architecture docs and spent a lot of my time shouting "MAKE THIS ARCHITECTURE MORE ORTHOGONAL, SOLID, KISS, DRY" at gpt-5-pro, and ended up with a 1500+ line monster of a doc.
I'm now turning this into a Tauri app and following the new architecture to a T. I would say it has a pretty clean structure with multiple microservices.
Now, new features are gated based on the architecture doc, so I'm always maintaining a single source of truth that serves as the main context for any new discussions/features. Also, each microservice has its own README file(s) which are updated with each code change.
I vibe coded an invoice generator by first vibe coding a "template" command line tool: a bash script that substitutes {{words}} in a LibreOffice Writer document (those are just zipped XML files, so you can unpack one to a temp directory and substitute raw text without any XML awareness) and at the end calls LibreOffice's CLI to convert it to PDF. I also asked the AI to generate a documentation text file, so that the next AI conversation could use the command as a black box.
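A minimal sketch of what such a template script could look like; the KEY=VALUE argument style and the assumption that the placeholders live in content.xml are illustrative, not a description of the actual tool:

```bash
#!/usr/bin/env bash
# Illustrative sketch: substitute {{KEY}} placeholders in an .odt template,
# then convert the result to PDF with LibreOffice's CLI.
set -euo pipefail

TEMPLATE="$1"   # e.g. invoice-template.odt
OUTPUT="$2"     # e.g. invoice-0042.odt
shift 2         # remaining args: KEY=VALUE pairs

WORKDIR="$(mktemp -d)"
unzip -q "$TEMPLATE" -d "$WORKDIR"

# .odt files are zipped XML; the visible text lives in content.xml.
for pair in "$@"; do
  key="${pair%%=*}"
  value="${pair#*=}"
  sed -i "s|{{${key}}}|${value}|g" "$WORKDIR/content.xml"
done

# Re-zip; the mimetype entry should come first and be stored uncompressed.
( cd "$WORKDIR" \
  && zip -qX0 "$OLDPWD/$OUTPUT" mimetype \
  && zip -qrX "$OLDPWD/$OUTPUT" . -x mimetype )

libreoffice --headless --convert-to pdf "$OUTPUT"
rm -rf "$WORKDIR"
```

Usage would then be something like `./template.sh invoice-template.odt invoice-0042.odt CLIENT="Acme" AMOUNT="1200"`.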
The vibe coded main invoice generator script then does the calendar calculations to figure out the pay cycle and examines existing invoices in the invoice directory to determine the next invoice number (the invoice number is in the file name, so it doesn't need to open the files). When it is done with the calculations, it uses the template command to generate the final invoice.
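Deriving the next invoice number from filenames alone could look something like the sketch below; the invoice-NNNN.pdf naming scheme is an assumption for illustration, and the real script presumably does the pay-cycle math as well:

```bash
#!/usr/bin/env bash
# Illustrative sketch: derive the next invoice number from existing filenames
# such as invoice-0041.pdf, without opening the files themselves.
set -u

INVOICE_DIR="${1:-./invoices}"

next=1
for f in "$INVOICE_DIR"/invoice-*.pdf; do
  [ -e "$f" ] || continue            # glob didn't match: no invoices yet
  n="${f##*invoice-}"                # strip everything up to the number
  n="${n%.pdf}"                      # strip the extension
  n=$(( 10#$n ))                     # force base 10 so 0099 isn't read as octal
  (( n >= next )) && next=$(( n + 1 ))
done

printf 'invoice-%04d\n' "$next"
```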
This is a very small example, but I do think that clearly defined modules/microservices/libraries are a good way to only put the relevant work context into the limited context window.
It also happens to be more human-friendly, I think?
I "vibe coded" a Gateway/Proxy server that did a lot of request enrichment and proprietary authz stuff that was previously handled by AWS services. The goal was to save money by running a couple of high-performance servers instead of relying on cloud-native offerings.
I put "vibe coded" in quotes because the code was heavily reviewed after the process, I helped when the agent got stuck (I know pedants will complain, but still), and this was definitely not my first rodeo in this domain; I just wanted to see how far an agent could go.
In the end it had a few modifications and went into prod, but to be fair, it was actually fine!
One thing I vibe coded 100%, barely looking at the code until the end, was a macOS menubar app that shows some company stats. I wanted it in Swift but WITHOUT Xcode. It was super helpful in that regard.
I've been using Claude on two codebases: one with good layering and clean examples, the other not so much. I get better output from the LLM with good context, clean examples, and documentation. Not surprising that clarity in code benefits both humans and machines.
I think there will be a couple of benefits to using agents soon. They should result in a more consistent codebase, which will make patterns easier to see and work with, and also mean less reinventing the wheel. Migrations should be way faster both within and across teams, so a lot less struggling to maintain two ways of doing something for years, which again leads to simpler and more consistent code. Finally, the increased speed should make feature additions more serializable, so fewer problems from trying to coordinate changes happening in parallel: conflicts, redundancies, etc.
I imagine over time we'll restructure the way we work to take advantage of these opportunities and get a self-reinforcing productivity boost that makes things much simpler, though agents aren't quite capable enough for that breakthrough yet.