The "world model" is what we often refer to as the "context". But it is hard to anticipate bad assumptions that seem obvious because of our existing world model. One of the first bugs I scanned past from LLM generated code was something like:
if user.id == "id":
...
Not anticipating that it would arbitrarily put quotes around a variable name. Other times it will do all kinds of smart logic, generate data with ids, then fail to use those ids for lookups, or something equally obvious.
The problem is that LLMs guess correctly so often that it is near impossible to anticipate how or why they might go wrong. We can address this with heavy validation, iterative testing, etc., but the guardrails needed to actually make the results bulletproof go far beyond normal testing. LLMs can make such fundamental mistakes while easily completing complex tasks that we need to reset our expectations for what "idiot proofing" really looks like.
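To make that concrete, here is a minimal sketch of the kind of guardrail I mean, with made-up names, fields, and ids: shape checks alone won't catch "invented an id", so you also cross-check the output against the data the model was supposed to use.

    # minimal sketch; names, fields, and ids are all made up
    import json

    KNOWN_PRODUCT_IDS = {"p-1", "p-2", "p-3"}

    def validate_order(raw: str) -> dict:
        order = json.loads(raw)                      # shape: must at least be valid JSON
        assert isinstance(order.get("items"), list)  # structure: the expected fields exist
        for item in order["items"]:
            # referential check: every id must point at data we actually have,
            # which is exactly the kind of "obvious" invariant an LLM will quietly violate
            assert item["product_id"] in KNOWN_PRODUCT_IDS, item
        return order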
To whatever extent this is true, it has very little to do with sticking it to "coders"; it is about magically solving/automating processes of any kind. Replacing programmers is small potatoes, and theirs is ultimately not a good job to target for replacement. Programmers are ideal future AI operators!
What AI usage has underlined is that we are forever bound by our ability to communicate precisely what we want the AI to do for us. Even if LLMs were perfect, if we give them squishy instructions we get squishy results. If we give them a well-crafted objective, appropriate context, and all the rest, they can respond just about perfectly. Then again, that is a lot of what programming has always been about in the first place: translating human goals into actionable code. Only the interface and abstraction level have changed.
Don't forget scuttling all the projects the staff has been working overtime to complete so that they can focus on "make it better!" *waves hands frantically*
For whatever reason, I can't get into Claude's approach. I like how Cursor handles this, with a directory of files (even subdirectories allowed) where you can define when it should use specific documents.
We are all "context engineering" now but Claude expects one big file to handle everything? Seems luke a deadend approach.
CLAUDE.md should only be for persistent reminders that are useful in 100% of your sessions
Otherwise, you should use skills, especially if CLAUDE.md gets too long.
Also, just as a note: Claude already supports lazy-loaded, separate CLAUDE.md files that you place in subdirectories. It will read those if it dips into those dirs.
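For anyone who hasn't tried it, the layout looks roughly like this (directory names are just an example):

    CLAUDE.md            # always loaded at the start of a session
    backend/
      CLAUDE.md          # only pulled in when Claude works on files under backend/
    frontend/
      CLAUDE.md          # likewise for frontend/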
I think their skills have the ability to dynamically pull in more data, but so far I've not tested it too much since it seems more tailored towards specific actions. I.e., converting a PDF might translate nicely to the agent pulling in the skill doc, but I'm not sure it will translate as well to pulling in some rust_testing_patterns.md file when it writes Rust tests.
E.g., I toyed with the idea of thinning out various CLAUDE.md files in favor of my targeted skill.md files. In doing so, my hope was to have less irrelevant data in context.
However, the more I thought this through, the more I realized the agent is doing "everything" I wanted to document each time. E.g., I wasn't sure that creating skills/writing_documentation.md and skills/writing_tests.md would actually result in less context usage, since both of those would be in memory most of the time. My CLAUDE.md is already pretty hyper-focused.
So yeah, anyway, my point was that skills might have the potential to offload irrelevant context, which seems useful. Though in my case I'm not sure it would help.
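For what it's worth, a skill in Claude's scheme is basically a folder containing a SKILL.md; as I understand it, only the frontmatter name/description sits in context until the agent decides the skill applies, at which point it loads the body. Something like this, with made-up names, in skills/rust_testing_patterns/SKILL.md:

    ---
    name: rust-testing-patterns
    description: Conventions for writing and reviewing Rust tests in this repo
    ---
    ...the full testing guidance lives here and is only loaded on demand...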
This is good for the company; chances are you will eat more tokens. I liked Aider's approach: it wasn't trying to be too clever. It used the files added to the chat and asked when it figured out that something more was needed (like, say, settings in the case of a Django application).
What will become apparent is that when coding costs go to zero, support and robustness costs will be the new "engineering" discipline. Which is, in reality, how things already work. It is why you can have open source code and companies built on providing enterprise support for that code to other companies.
If you want to build a successful AI company, assume the product part is easy. Build the support network: guarantee uptime, fast responses, direct human support. These are the shovels desperately needed during the AI gold rush.
My take is that AI adoption is a gear shift to a higher level of abstraction, not a replacement. So we will have a lull, then a return to hiring for situations just like this. Maybe there is a lot more runway for AI to take jerbs, but I think it will hit an equilibrium of "creating different jobs" at some point.
Almost every startup is a wrapper of some sort, and has been for a while. The reason a startup can start up is that it has some baked-in competency from using new and underutilized tools. In the dot-com boom, that was the internet itself.
Now it's AI. Only after doing this for 20+ years do I really appreciate that the arduous process and product winnowing that happens over time is the bulk of the value (and the moat, when none other exists).
Can't help myself comparing this to frameworks, libraries, and OOP... isn't it because of them that we can build so fast?
I think of a wrapper more as a very thin layer around something. A thin layer is easy to reproduce. I do not question that a smart collection of wrappers can make a great product. It's all about the idea :)
However, if one's idea is based purely on wrappers, there's really no moat, nothing stopping somebody else from copying it within a moment.
This has seemed to me to be the natural next step to turn LLMs into more deterministic tools. Pushing the frontier is nice, but I think LLMs have a whole different gear when they are able to self-decompose in a reliable way. Most of my success creating reusable LLM products came from determining where requirements/outputs need to be "hard" vs. "soft".
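Concretely, "hard" for me means machine-checkable and reject-on-failure, while "soft" means free-form and judged downstream. A minimal sketch of that split, where call_llm and the field names are stand-ins rather than any real API:

    # minimal sketch; call_llm, the fields, and the vocabulary are all made up
    import json

    ALLOWED_SEVERITIES = {"low", "medium", "high"}

    def get_report(call_llm, prompt: str, retries: int = 3) -> dict:
        for _ in range(retries):
            try:
                report = json.loads(call_llm(prompt))
            except json.JSONDecodeError:
                continue  # hard failure: not even valid JSON, ask again
            # hard requirements: closed vocabulary and expected structure, no judgment involved
            if report.get("severity") in ALLOWED_SEVERITIES and isinstance(report.get("affected_files"), list):
                # the free-form "summary" field is the soft part, passed through as-is
                return report
        raise ValueError("model never satisfied the hard requirements")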
I don't think any of these companies are so reductive and short-sighted as to try to game the system. However, Goodhart's Law comes into play. I am sure they have their own metrics that are much more detailed than these benchmarks, but the fact remains that LLMs will be tuned according to elements that are deterministically measurable.
if user.id == "id": ...
Not anticipating that it would arbitrarily put quotes around a variable name. Other time it will do all kinds of smart logic, generate data with ids then fail to use those ids for lookups, or something equally obvious.
The problem is LLMs guess so much correctly that it is near impossible to understand how or why they might go wrong. We can solve this with heavy validation, iterative testing, etc. But the guardrails we need to actually make the results bulletproof need to go far beyond normal testing. LLMs can make such fundamental mistakes while easily completing complex tasks that we need to reset our expectations for what "idiot proofing" really looks like.
reply