Sorry, this still makes no sense. LLMs don't care about files. Most coding systems simply hand the LLM the whole file rather than a subset of it. That's just a choice in how you implemented your RAG search system and database. In this case the "record" is big: a whole file. No doubt that works for code, but it's nonsensical outside that.
E.g. for Wikipedia the logical unit would likely be an article. For a book, maybe it's a chapter, or maybe it's a paragraph. You need to design the system around your content and feed the LLM an appropriate, logically related set of data.
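To make the "logical unit" point concrete, here's a rough sketch (not anyone's actual RAG pipeline) comparing fixed-width chunking with splitting on paragraph boundaries. The function names and the sample text are made up for illustration, and "paragraph" is assumed to mean "separated by a blank line".

```python
import re


def chunk_fixed(text: str, size: int) -> list[str]:
    """Naive fixed-width chunking: cuts every `size` characters, ignoring content."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def chunk_by_paragraph(text: str) -> list[str]:
    """Split on blank lines so each chunk is one paragraph (assumed logical unit)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


# Hypothetical sample document with two paragraphs.
doc = "Cats are mammals. They purr.\n\nDogs are mammals too. They bark."

if __name__ == "__main__":
    # Fixed-width chunks can cut through the middle of a sentence.
    print(chunk_fixed(doc, 20))
    # Paragraph chunks keep each related group of sentences together.
    print(chunk_by_paragraph(doc))
```

For an article or a chapter you'd swap the paragraph splitter for whatever boundary your content actually has; the point is only that the boundary comes from the content, not from a character count.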
Oh, but they do. These CLI agents are trained and specifically tuned to work with the filesystem. It’s not about the content or how it’s actually stored; it’s about the familiar access patterns.
I can’t begin to tell you how many times I’ve seen a coding agent figure out it can get some data directly from the filesystem instead of a dedicated, optimized tool it was specifically instructed to use for this purpose.
You basically can’t stop these things from messing with files; it’s in their DNA. You block one shell command, they’ll find another. Either revoke shell access completely or play whack-a-mole. You would not believe how badly they want to work with files.
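A toy illustration of the whack-a-mole problem, assuming a hypothetical guard that only checks the first word of a command against a denylist (this is not how any particular agent harness works, and `BLOCKED` and `is_allowed` are made-up names):

```python
import shlex

# Hypothetical denylist: the only command we thought to block.
BLOCKED = {"cat"}


def is_allowed(command: str) -> bool:
    """Allow a command unless its first token is on the denylist."""
    argv = shlex.split(command)
    return bool(argv) and argv[0] not in BLOCKED


# Several ways to read the same (hypothetical) file.
attempts = [
    "cat notes.txt",
    "head -n 1000 notes.txt",
    "sed -n p notes.txt",
    "python3 -c \"print(open('notes.txt').read())\"",
]

if __name__ == "__main__":
    for cmd in attempts:
        print(f"{cmd!r}: allowed={is_allowed(cmd)}")
```

Only the first attempt gets rejected. Every command you add to the denylist still leaves the others, which is why it ends up being "no shell at all" or an endless list.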
Yeah, some of the uplift people are anecdotally seeing from “just using the filesystem” is, imo, down to how difficult it is to take a principled approach to pre-chunking with other approaches.
Empirically, agents (especially the coding CLIs) seem to be doing so much better with files, even if the tooling around them is less than ideal.
With other custom tools they instantly lose 50 IQ points, if they even bother using the tools in the first place.