> If anything, I think we'll see (another) splintering in the market. Companies with strong internal technical ability vs those that don't.
A tangent: I feel, again, that unfortunately AI is going to divide society into people who can use the most powerful AI tools vs those who will only be using ChatGPT at most (if at all).
I don't know why I keep worrying about these things. Is it pointless?
I do feel this divide, but from what I've read, and what I've observed, it's more a divide between people who understand the limited use-cases where machine learning is useful, and people who believe it should be used wherever possible.
For software engineering, it is useless unless you're writing snippets that already exist in the LLM's corpus.
> For software engineering, it is useless unless you're writing snippets that already exist in the LLM's corpus.
If I give something like Sonnet the docs for my JS framework, it can write code "in it" just fine. It makes the occasional mistake, but if I provide proper context and planning up front, it can knock out some fairly impressive stuff (e.g., helping me to wire up a shipping/logistics dashboard for a new ecom business).
That said, this requires me policing the chat (preferred) vs. letting an agent loose. I think the latter is just opening your wallet to model providers but shrug.
If you need a shipping dashboard, then yeah, that's a very common, very simple use-case. Just hook up an API to a UI. Even then I don't think you'll make a very maintainable app that way, especially if you have multiple views (because the LLMs are not consistent in how they use features, they're always generating from scratch and matching whatever's closest).
What I'm saying is that whenever you need to actually do some software design, i.e. tackle a novel problem, they are useless.
Since this can be a significant security issue for the state, why doesn't the government sponsor a security audit of the software? Does it upload the data, or is everything done on the device? (Also, it will have to keep up with the updates.)
Because regulation is bad, according to the current executive?
Politics aside, the FDA applies a very generous amount of regulation (mostly justifiable). I'm not sure we want to pay multiples for our consumer electronics, as it (mostly) shows acceptable behavior and rarely kills anybody.
It is bad. Regulations have been historically hijacked to benefit corporate interests. See Intuit and tax policy for example.
Voters on the right naively thought he'd work to fix it. (Wrong!) But it is very much bad for a very large number of issues. Maybe the next executive will fix it? (Wrong!)
The NSA has a bad historical reputation for this sort of thing - intentionally weakening crypto standards to make things easier for themselves to break, while keeping them "strong enough" that other agencies outside of NSA/GCHQ/GRU can't. The Crypto AG scandal [0] was pretty bad, with Clipper/Skipjack & Dual_EC_DRBG [1] being more recent ones. The NSA could do what you are asking to do, but they probably won't let us know what the really bad holes are because they want to keep using them.
If anyone wants to use skills with any other model or tool, like Gemini CLI etc., I created open-skills, which lets you use skills with any other LLM.
Caveat: needs a Mac to run.
Bonus: it runs everything locally in a container, not in the cloud nor directly on the Mac.
Surprising that they haven't made a podcast (NotebookLM-esque) based on the repo - one you can listen to on a bus ride. Something I created a while back: https://gitpodcast.com
Remember that the things a "CEO" of anything says are just what he hears from the people he has talked to. That doesn't make it obviously wrong, it just raises the question of who he has been talking to that week. I doubt Garry is doing any of the coding these days. For what it's worth, it's completely fine to ignore what he is saying - no offense.
Except he's right in this case, and it is contrary to the hypemongering we'd expect
It's 100% accurate to say that "MCP barely works", and it's meaningful to hear that even from the head of YC, which is pushing through a massive number of businesses based on MCP or using it in some way.
It's two words with no qualifiers from someone we don't think is technical. If it was, say, Karpathy, then sure, let's waste a whole thread discussing his farts, but I'm sure I'm not alone in having had Claude Code create an MCP, and I use it most times I use Claude Code. To move the conversation forward though, what limitations and issues have you run into with MCPs? I wouldn't say mine are 100% bug free, but I wouldn't say they "barely work" either. Mostly work?
> The key to changing everyday behaviour is to make the evaluation of costs (effort) and benefits (rewards) a habit that doesn’t seem too much like hard work. Even for the most apathetic among us, this holds out the hope of turning a kneejerk “no” into an ability to consider saying “yes”.
and this
> But left to his own devices he did nothing. Studies in people who develop apathy have shown that many of them just don’t find it sufficiently rewarding to take action. The cost of making the effort doesn’t seem worth the potential benefit.
It is a cycle that feeds into the next, constantly strengthening itself. Whether that is positive feedback or negative feedback is really important. It is worth a large disruption to your life to get it working for you. The deadlock is very real.
Sooner or later, I believe, there will be models which can be deployed locally on your Mac and are as good as, say, Sonnet 4.5. People should shift to completely local at that point, and use a sandbox for executing code generated by the LLM.
Edit: "completely local" meant not doing any network calls unless specifically approved. When llm calls are completely local you just need to monitor a few explicit network calls to be sure.
Unlike with Gemini, you then don't have to rely on a certain list of whitelisted domains.
>Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that".
I've been repeating something like 'keep thinking about how we would run this in the DC' at work. The cycles of pushing your compute outside the company and then bringing it back in once the next VP/Director/CTO starts because they need to be seen as doing something, and the thing that was supposed to make our lives easier is now very expensive...
I've worked on multiple large migrations between DCs and cloud providers for this company and the best thing we've ever done is abstract our compute and service use to the lowest common denominator across the cloud providers we use...
Can't find 4.5, but 3.5 Sonnet is apparently about 175 billion parameters. At 8-bit quantization that would fit on a box with 192 gigs of unified RAM.
The most RAM you can currently get in a MacBook is 128 gigs, I think, and that's a pricey machine, but it could run such a model at 4-bit or 5-bit quantization.
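Back of the envelope, assuming the 175B figure is roughly right (weights only, ignoring KV cache and runtime overhead):

    # rough weights-only memory for a ~175B-parameter model at different quantization levels
    PARAMS = 175e9

    for bits in (16, 8, 5, 4):
        gb = PARAMS * bits / 8 / 1e9
        print(f"{bits}-bit: ~{gb:.0f} GB")

    # 16-bit: ~350 GB, 8-bit: ~175 GB, 5-bit: ~109 GB, 4-bit: ~88 GB
    # so 8-bit squeezes into 192 GB of unified RAM, and 4- or 5-bit into 128 GB,
    # with whatever is left over going to the KV cache and the OS.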
As time goes on it only gets cheaper, so yes this is possible.
The question is whether bigger and bigger models will keep getting better. What I'm seeing suggests we will see a plateau, so probably not forever. Eventually affordable endpoint hardware will catch up.
That's not easy to accomplish. Even a "read the docs at URL" is going to download a ton of stuff. You can bury anything into those GETs and POSTs. I don't think that most developers are going to do what I do with my Firefox and uMatrix, that is whitelisting calls. And anyway, how can we trust the whitelisted endpoint of a POST?
> Edit: "completely local" meant not doing any network calls unless specifically approved. When llm calls are completely local you just need to monitor a few explicit network calls to be sure.
The problem is that people want the agent to be able to do "research" on the fly.
Because the article shows it isn't Gemini that is the issue, it is the tool calling. When Gemini can't get to a file (because it is blocked by .gitignore), it then uses cat to read the contents.
I've watched this with GPT-OSS as well. If the tool blocks something, it will try other ways until it gets it.
How can an LLM be at fault for something? It is a text prediction engine. WE are giving them access to tools.
Do we blame the saw for cutting off our finger?
Do we blame the gun for shooting ourselves in the foot?
Do we blame the tiger for attacking the magician?
The answer to all of those things is: no. We don't blame the thing doing what it is meant to be doing no matter what we put in front of it.
It was not meant to give access like this. That is the point.
If a gun randomly goes off and shoots someone without someone pulling the trigger, or a saw starts up when it’s not supposed to, or a car’s brakes fail because they were made wrong - companies do get sued all the time.
But the LLM can't execute code. It just predicts the next token.
The LLM is not doing anything. We are placing a program in front of it that interprets the output and executes it. It isn't the LLM, but the IDE/tool/etc.
So again, replace Gemini with any Tool-calling LLM, and they will all do the same.
When people say ‘agentic’ they mean piping that token to various degrees of directly into an execution engine. Which is what is going on here.
And people are selling that as a product.
If what you are describing was true, sure - but it isn't. The tokens the LLM is outputting are doing things - just like the ML models driving Waymos are moving servos and controls, and doing things.
It’s a distinction without a difference if it’s called through an IDE or not - especially when the IDE is from the same company.
That has effects, and those effects create liability if they cause damage.
Because it misses the point. The problem is not the model being in a cloud. The problem is that as soon as "untrusted inputs" (i.e. web content) touch your LLM context, you are vulnerable to data exfil. Running the model locally has nothing to do with avoiding this. Nor does "running code in a sandbox", as long as that sandbox can hit http / dns / whatever.
The main problem is that LLMs share both "control" and "data" channels, and you can't (so far) disambiguate between the two. There are mitigations, but nothing is 100% safe.
Sorry, I didn't elaborate. But "completely local" meant not doing any network calls unless specifically approved. When llm calls are completely local you just need to monitor a few explicit network calls to be sure.
The LLM cannot actually make the network call. It outputs text that another system interprets as a network call request, which then makes the request and sends that text back to the LLM, possibly with multiple iterations of feedback.
You would have to design the other system to require approval when it sees a request. But this of course still relies on the human to understand those requests. And will presumably become tedious and susceptible to consent fatigue.
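As a sketch of what that approval layer could look like (everything here is hypothetical, and host-level approval still can't tell a benign GET from an exfiltrating one - which is exactly where the consent fatigue comes from):

    # toy approval gate between the model's tool request and the actual network call
    import urllib.request
    from urllib.parse import urlparse

    APPROVED_HOSTS = set()  # hosts the user has explicitly allowed this session

    def gated_fetch(url):
        host = urlparse(url).netloc
        if host not in APPROVED_HOSTS:
            answer = input(f"Model requested {url}. Allow host {host}? [y/N] ")
            if answer.strip().lower() != "y":
                return "BLOCKED: user denied network access"
            APPROVED_HOSTS.add(host)
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode("utf-8", errors="replace")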
I would argue that a lot of the tools will be hosted on GitHub - in fact, most of the existing repos are potentially a tool (in future). And the discovery is just a GitHub search.
btw GitHub repos are already part of the LLM's training data
So you don't even need internet to search for tools, let alone TEO
The example given by Anthropic of tools filling valuable context space is a result of bad design.
If you pass the tools below to your agent, you don't need a "search tools" tool, you need good old-fashioned architecture: limit your tools based on the state of your agent, custom tool wrappers to limit MCP tools, routing to sub-agents, etc.
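Something like the sketch below, where the tool names and states are made up for illustration:

    # only expose the tools that make sense for the agent's current state,
    # instead of sending the whole catalogue with every request
    ALL_TOOLS = {
        "search_orders":  {"states": {"triage", "support"}},
        "refund_order":   {"states": {"support"}},
        "deploy_service": {"states": {"ops"}},
    }

    def tools_for_state(state):
        return [name for name, meta in ALL_TOOLS.items() if state in meta["states"]]

    print(tools_for_state("support"))  # ['search_orders', 'refund_order']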
I don't see what's wrong in letting the LLM decide which tool to call based on a search over a long list of tools (or a binary tree of lists in case the list becomes too long, which is essentially what you alluded to with sub-agents).
I was referring to letting LLMs search GitHub and run tools from there. That's like randomly searching the internet for code snippets and blindly running them on your production machine.
Sure, that protects your machine, but what about data security?
Do I want to allow unknown code to be run on my private/corporate data?
Sandbox all you want but sooner or later your data can be exfiltrated. My point is giving an LLM unrestricted access to random code that can be run is a bad idea.
Curate carefully is my approach.
> Skills are the actualization of the dream that was set out by ChatGPT Plugins .. But I have a hypothesis that it might actually work now because the models are actually smart enough for it to work.
and earlier Simon Willison argued[1] that Skills are an even bigger deal than MCP.
But I do not see as much hype for Skills as there was for MCP - it seems people are in the MCP "inertia" and have no time to shift to Skills.
Skills are less exciting because they're effectively documentation that's selectively loaded.
They are a bigger deal in a sense because they remove the need for all the scaffolding MCPs require.
E.g. I needed Claude to work on transcripts from my Fathom account, so I just had it write a CLI script to download them, and then I had it write a SKILL.md, and didn't have to care about wrapping it up into an MCP.
At a client, I needed a way to test their APIs, so I just told Claude Code to pull out the client code from one of their projects and turn it into a CLI, and then write a SKILL.md. And again, no need to care about wrapping it up into an MCP.
But this seems a lot less remarkable, and there's a lot less room to build big complicated projects and tooling around it, and so, sure, people will talk about it less.
Skills are good for context management as everything that happens while executing the skill remains “invisible” to the parent context, but they do inherit the parent context. So it’s pretty effective for a certain set of problems.
MCP is completely different, I don’t understand why people keep comparing the two. A skill cannot connect to your Slack server.
Skills are more similar to sub-agents, the main difference being context inheritance. Sub-agents enable you to set a different system prompt for those which is super useful.
Are you sure? I thought skills were loaded into the main context, unlike (sub)agents. According to Claude, they're loaded into the main context.
Do you have a link?
Only once Claude decides a skill is needed does it load the additional details into the main context to use. It's basically lazy loading into the main context.
I agree with you. I don't see people hyping them and I think a big part of this is that we have sort of hit an LLM fatigue point right now. Also Skills require that your agent can execute arbitrary code which is a bigger buy-in cost if your app doesn't have this already.
I still don't get what is special about the skills directory - since forever I've instructed Claude Code to "please read X and do Y" - how are skills different from that?
They're not. They are just a formalization of that pattern, with a very tiny extra feature where the model harness scans that folder on startup and loads some YAML metadata into the system prompt so it knows which ones to read later on.
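For illustration, that scan can be as small as this (a sketch, not Anthropic's actual code; it assumes each SKILL.md starts with YAML frontmatter carrying name and description fields, as in the published skills repo):

    # gather skill metadata at startup so it can be appended to the system prompt
    import pathlib
    import yaml  # pip install pyyaml

    def skill_summary(skills_dir="skills"):
        lines = []
        for skill_md in pathlib.Path(skills_dir).glob("*/SKILL.md"):
            text = skill_md.read_text()
            if text.startswith("---"):
                meta = yaml.safe_load(text.split("---")[1])
                lines.append(f"- {meta['name']}: {meta['description']} "
                             f"(read {skill_md} for full instructions)")
        return "Available skills:\n" + "\n".join(lines)

    # the SKILL.md bodies are only read later, if the model decides a skill is relevant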
It's more that they are embracing that the LLM is smart enough that you don't need to build-in this functionality beyond that very minimal part.
A fun thing: Claude Code will sometimes fail to find the skill the "proper" way, and will then in fact sometimes look for the SKILL.md file with tools, and read the file with tools, showing that it's perfectly capable of doing all the steps.
You could probably "fake" skills pretty well with instructions in CLAUDE.md to use a suitable command to extract the preamble of files in a given directory, and tell it to use that to decide when to read the rest.
It's the fact that it's such a thin layer that is exciting - it means we need increasingly less special logic other than relying on just basic instructions to the model itself.
No, skills are a set of manifested and tested 'skills' which reduce the 'mental load' of the LLM and reduce the context the LLM needs to do things reproducibly.
But we are still reliant on the LLM correctly interpreting the choice to pick the right skill. So "known to work" should be understood in the very limited context of "this sub-function will do what it was designed to do reliably" rather than "if the user asks to use this sub-function it will do was it was designed to do reliably".
Skills feel like a non-feature to me. It feels more valuable to connect a user to the actual tool and let them familiarize themselves with it (and not need the LLM to find it in the future) rather than having the tool embedded in the LLM platform. I will carve out a very big exception for accessibility here - I love my home device being an egg timer - it's a wonderful egg timer (when it doesn't randomly play music), and I could buy an egg timer, but having a hands-free egg timer is actually quite valuable to me while cooking. So I believe there is real value in making these features accessible through the LLM over media where the feature would normally be difficult to use.
This is no different to an MCP, where you rely on the model to use the metadata provided to pick the right tool, and understand how to use it.
Like with MCP, you can provide a deterministic, known-good piece of code to carry out the operation once the LLM decides to use it.
But a skill can evolve from pure Markdown via inlining some shell commands, up to a large application. And if you let it, with Skills the LLM can also inspect the tool, and modify it if it will help you.
All the Skills I use now have evolved bit by bit as I've run into new use-cases and told Claude Code to update the script the skill references or the SKILL.md itself. I can evolve the tooling while I'm using it.
Choice to pick the right tool -- there is a benchmark which tracks the accuracy of this.
"Known to work" -- if it has hardcoded code, it will work 100% of the time; that's the point of Skills. If it's just markdown then yes, some probability will be involved, and it will keep improving.
Not really special, just officially supported, and I'm guessing with how best to use it baked in via RL. Claude already knows how skills work vs learning your own home-rolled solution.
I definitely see the value and versatility of Claude Skills (over what MCP is today), but I find the sandboxed execution to be painfully inefficient.
Even if we expect the LLMs to fully resolve the task, it'll heavily rely on I/O and print statements sprinkled across the execution trace to get the job done.
> but I find the sandboxed execution to be painfully inefficient
The sandbox is not mandatory here. You can execute the skills on your host machine too (with some fidgeting), but it's good practice, and probably for the better, to get into the habit of executing code in an isolated environment for security purposes.
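A minimal version of that habit, assuming Docker is available (the image, paths, and script name are placeholders):

    # run a generated script in a throwaway container with no network access
    import subprocess

    def run_in_sandbox(workdir, script):
        return subprocess.run(
            ["docker", "run", "--rm", "--network=none",
             "-v", f"{workdir}:/work:ro", "-w", "/work",
             "python:3.12-slim", "python", script],
            capture_output=True, text=True,
        )

    result = run_in_sandbox("/path/to/skill", "process_data.py")
    print(result.stdout)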
The better practice is, if it isn't a one-off, being introduced to the tool (perhaps by an LLM) and then just running the tool yourself with structured inputs when it is appropriate. I think the 2015-era novice coding habit of copying a blob of twenty shell scripts off Stack Overflow and blindly running them in your terminal (while also not good, for obvious reasons) was better than essentially the same thing happening without you being able to watch and potentially learn what those commands were.
I do think that if the agents can successfully resolve these tasks in a code execution environment, it can likely come up with better parametrized solutions with structured I/O - assuming these are workflows we want to run over and over again.
Skills are like the "end-user" version of MCP at best, where MCP is for people building systems. Any other point of view raises a lot of questions.
Aren't skills really just a collection of tagged MCP prompts, config resources, and tools, except with more lock-in since only Claude can use them? About that "agent virtual environment" that runs the scripts: how is it customized, and can it just be a container? Aren't you going to need to ship/bundle dependencies for the tools/libraries those skills require/reference, and at that point why are we avoiding MCP-style docker/npx/uvx again?
Other things that jump out are that skills are supposed to be "composable", yet afaik it's still the case that skills may not explicitly reference other skills. Huge limiting factors IMHO compared to MCP servers that can just use boring inheritance and composition with, you know, programming languages, or composition/grouping with namespacing and such at the server layer. It's unclear how we're going to extend skills, require skills, use remote skills, "deploy" reusable skills etc etc, and answering all these questions gets us most of the way back to MCP!
That said, skills do seem like a potentially useful alternate "view" on the same data/code that MCP is covering. If it really catches on, maybe we'll see skill-to-MCP converters for serious users that want to be able do the normal stuff (like scaling out, testing in isolation, doing stuff without being completely attached to the claude engine forever). Until there's interoperability I personally can't see getting interested though
Tell your agent of choice to read the preamble of all the documents in the skills directory, and tell it that when it has a task that matches one of the preambles, it should read the rest of the relevant file for full instructions.
There are far fewer dependencies for skills than for MCP. Even a model that knows nothing about tool use beyond how to run a shell command, and has no support for anything else can figure out skills.
I don't know what you mean regarding explicitly referencing other skills - Claude at least is smart enough that if you reference a skill that isn't even properly registered, it will often start using grep and find to hunt for it to figure out what you meant. I've seen this happen regularly while developing a plugin and having errors in my setup.
> There are far fewer dependencies for skills than for MCP.
This is wrong and an example of magical thinking. AI obviously does not mean that you can ship/use software without addressing dependencies. See for example https://github.com/anthropics/skills/blob/main/slack-gif-cre... or, worse, the many other skills that just punt on this and assume CLI tools and libraries are already available.
It is categorically not wrong. With an MCP you have at a minimum all the same dependencies and on top of that a dependency on your agent supporting MCP. With skills, a lot of the time you don't need to ship code at all - just an explanation to the agent of how to use standard tools to access an API for example, but when you do need to ship code, you don't need to ship any more code than with an MCP.
The trivial evidence of this, is that if you have an MCP server available, the skill can simply explain to the agent how to use the MCP server, and so even the absolute worst case for skills is parity.
It's definitely not vendor locked. For instance, I have made it work with Gemini with Open-Skills[1].
It is, after all, a collection of instructions and code that any other LLM can read and understand and then do a code execution with (via tool call / MCP call).