It has now become fashionable to dress oneself in the garb of science to sell dev environments ... for agents.

It has now become fashionable to claim much, and furnish little.

It has now become fashionable to fail to understand or state the core of your proposal in as few words as possible: instead of "genetic algorithm applied to the space of harnesses, parallelized by our infrastructure" we get "Three swaps. Same orchestrator. Same dashboard. The wiring is the thing."

We're cooked, chat.


we need better RL

The problem is not vibe coding itself. The problem is that certain untrained people do not have, or perhaps do not care to learn, the skills needed to refine the result into something novel, clear, and precise: something which communicates the idea they are trying to convey to others (who are hoping to learn something new).

In a climate where it seems like VCs are woefully bereft of the same skills, there's an impetus to just slop garbage up for any vague idea, without taking the care or time to polish it into something which has that intangibly human sense of greatness and clarity.

I see, you've done something -- but why? If you continue to ask this question, you will arrive at good science ... but many submissions are not aimed at that level of communication, or stop well short of the point at which the question becomes interesting.

There's that phrase: "better to remain silent and be thought a fool than to speak and remove all doubt", which strikes me as poignant, except it seems like the audience today are also fools ... the inmates are running the asylum.


In a marketplace with infinite low-quality supply and limited attention, it doesn’t really matter how good the good offerings are.


Why is that a problem? Reality will filter out the projects that are poorly developed just like it always has.


Sure, but what greatness do we lose in the interim as it gets silenced by unending noise?

Like it filtered Windows for making the Start button an Electron app? Oh wait, it didn't, and nothing is displacing Windows.

Reality doesn't filter out shit, man. We objectively had better vacuums in the 80s (central vacuums were popular then). We had objectively better displays in the 80s too (vector displays/game consoles like the Vectrex were putting out variable refresh rates at the equivalent of 1000 Hz, with extreme brightness and an impossible-to-replicate phosphor glow).

Worse is better. Like, actually getting worse is progress. Sometimes it really is good to "return to monke" and take a page from the past.


There is not a single mention of probability in this post.

The post acts as if agents were a highly complex but well-specified deterministic function. Perhaps, under certain temperature limits, this is approximately true ... but that's a serious restriction, and it's glossed over.

For instance, perhaps the most striking constraint in FLP is that it is about deterministic consensus ... the post glosses over this:

> establishes a fundamental impossibility result dictating consensus in any asynchronous distributed system (yes! that includes us).

No, not any asynchronous distributed system; that might not include us. For instance, Ben-Or (1983, https://dl.acm.org/doi/10.1145/800221.806707) (as a counterexample to the adversary in FLP) essentially says "if you're stuck, flip a coin". There's significant work studying randomized consensus (yes, multi-agent systems are randomized consensus algorithms): https://www.sciencedirect.com/science/article/abs/pii/S01966...

Now, in Ben-Or, the coins have to be independent sources of randomness, and that's obviously not true in the multi-agent case.
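
For intuition, here is a toy, synchronous simulation of the coin-flip idea in Python (illustrative only: the real protocol is asynchronous, message-passing, and crash-tolerant, none of which is modeled here):

    # Toy Ben-Or-style loop: adopt a value if an overwhelming majority
    # holds it, otherwise flip an independent coin. Terminates with
    # probability 1, because eventually all the coins happen to agree.
    import random

    def round_step(values, threshold):
        counts = {v: values.count(v) for v in set(values)}
        majority = [v for v, c in counts.items() if c > threshold]
        return [majority[0] if majority else random.choice([0, 1])
                for _ in values]

    n = 9
    values = [random.choice([0, 1]) for _ in range(n)]
    rounds = 0
    while len(set(values)) > 1:
        values = round_step(values, threshold=2 * n // 3)
        rounds += 1
    print(f"agreed on {values[0]} after {rounds} rounds")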

But it's very clear that the language in this post argues that these results apply, without grappling with possibly the most fundamental fact about agents: they are probability distributions -- inherently, they are stochastic creatures.

Difficult to take seriously without a more rigorous justification.


It really depends on your model in my opinion.

At the lowest level of abstraction, LLMs are just matrix multiplication: deterministic functions of their inputs. Of course, we can argue about the details and specifics of how the peculiarities of inference in practice lead to non-deterministic behaviours, but now our model is being complicated by vague aspects of reality.

One convenient way of sidestepping these is to model them as random functions, sure. I wouldn't go so far as to say they are "inherently stochastic creatures". Maybe that's the case, but you haven't really given substantial evidence to justify that claim.

At a higher level of abstraction, one possible model of LLMs is again as deterministic functions of their inputs, but now as functions of token streams, or higher abstractions like sentences, rather than the underlying matrix multiplication. In this case we expect LLMs to produce roughly consistent outputs given the same prompt, and again we can apply deterministic theorems.

I guess my central claim is that no salient argument has been made as to why the randomness here is relevant for consensus. Maybe the models exhibit some variability in their output, but in practice, does this substantially change how they approach consensus? Can we model it as an artefact of how they are initialised, rather than some inherent stochasticity? Why not? It feels like randomness is being introduced here as a sort of magic "get out of jail free" card.

Just my two cents I suppose.


LLMs sample from categorical distributions defined by the logits computed by the matrix multiplies, and many sampling strategies are employed. This is one of the core mechanisms of token generation.

There's no peculiarity to discuss; that's how they work. That's how they are trained (the loss is the negative log-likelihood of the data under these distributions), that's how inference works, etc.
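
Concretely, the sampling step at each token looks something like this (a minimal sketch with made-up logits; real stacks layer top-k, top-p, etc. on top):

    # Temperature sampling: softmax over logits defines a categorical
    # distribution, and one token id is drawn from it.
    import math
    import random

    def sample_token(logits, temperature=0.8):
        scaled = [l / temperature for l in logits]
        m = max(scaled)                       # subtract max for stability
        exps = [math.exp(s - m) for s in scaled]
        probs = [e / sum(exps) for e in exps]
        return random.choices(range(len(probs)), weights=probs, k=1)[0]

    print(sample_token([2.0, 1.0, 0.1]))      # stochastic unless T -> 0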

> I guess my central claim is that no salient argument has been made as to why the randomness here is relevant for consensus. Maybe the models exhibit some variability in their output, but in practice, does this substantially change how they approach consensus? Can we model it as an artefact of how they are initialised, rather than some inherent stochasticity? Why not? It feels like randomness is being introduced here as a sort of magic "get out of jail free" card.

I'm really surprised to hear this, given the content of the post. The claims in the post are quite strong, yet here I need to give a counterargument for why it even matters that the consensus claim is being applied to pseudorandom processes?

I don't think it's necessary to furnish a counterexample when pointing out that a formal claim is overreaching. It's not clear what the results are in this case! So it feels premature to claim that the results cover a wider array of things than has been shown.

For instance, this is a strong claim:

> it means that in any multi-agentic system, irrespective of how smart the agents are, they will never be able to guarantee that they are able to do both at the same time:
>
> Be Safe - i.e. produce well-formed software satisfying the user's specification.
>
> Be Live - i.e. always reach consensus on the final software module.

I'm confused as to the stance: we're either hand-waving, or we're not -- so which is it?


Re — totally fine with hand-waving for intuition.

I just came away from the read thinking that this post was pointing to something very strong, and was a bit irked to find that the state of the results is more subtle than the post conveys.


If you're pushing me, let's say we're not hand-waving, then. LLMs, abstraction removed, are deterministic computations of matrix multiplication, f(x) -> y. If you want, we can make them pseudo-random, but that is still a deterministic process. FLP then holds. I'm not sure what your confusion is.
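
To make "pseudo-random, but still deterministic" concrete (a trivial sketch standing in for seeded sampling):

    # Fix the seed and the "random" trace is a pure function of its inputs.
    import random

    def sample_trace(seed, n=5):
        rng = random.Random(seed)
        return [rng.choice("ab") for _ in range(n)]

    assert sample_trace(42) == sample_trace(42)   # same seed, same outputs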


I'm skeptical that this is going to lead to optimal orchestration ... or rather, skeptical that open source won't produce a far better alternative in time.

The best performance I've gotten has come from mixing agents from different companies. Unless there is a "winner take all" agent (I seriously doubt it, based on the dynamics and cost of collecting high-quality RL data), I think the best orchestration systems are going to involve mixing agents.

Here, it's not about the planner, it's about the workers. Some agents are just better at certain things than others.

For instance, Opus 4.6 on max does not hold a candle to GPT 5.4 xhigh in terms of bug finding. It's just not even a comparison, iykyk.

Almost analogous to how diversity of thought can improve the robustness of the outcomes in real world teams. The same thing seems to be true in mixture-of-agent-distributions space.


I'm fairly certain these AI companies are lobsters in a bucket. Every time one of them produces a private model, they'll all use access to that model to generate improvements _and then publish those improvements_ as a way to hamstring the cornering of that market.

So that'll go on until they form a cartel and become the Wizard of Oz.


Another way to think about it:

For Anthropic to have the best version of this software, they'd have to simultaneously ... well, have the best version of the software, but also beat every other AI company at all subtasks (like: technical writing, diagramming, bug finding -- they'd need to have the unequivocal "best model" in all categories).

Surely their version is not going to allow you to e.g. invoke Codex or what have you as part of their stack.


They would also have to prevent access to the model from being used to beat the model.


I think Opus does, in fact, find the bugs the same way GPT xhigh (or even high) does. It just discards them before presenting them to the user.

Opus is designed to be a lazy, corner-cutting model. Reviews are just one place where this shows. In my orchestration loop, Opus discards many findings from GPT 5.4 xhigh, justifying this as pragmatism. Opus YAGNIs everything; GPT wants you to consider seismic events in your todo-list app. There is, sadly, nothing in between.


My fear is that this is going to lead to an optimal orchestration language: for example, that Claude switches to Sumerian for all communication between agents. It's one thing if they try to silo like that, but my real fear is that it may actually perform well.

(Not sure if it would be Sumerian, Esperanto or something more artificial. As long as it is esoteric enough for one company to hoard all the expertise in it.)


I've seen Antigravity outputting Chinese characters in its thinking traces from time to time.

I also remember Chinese being discussed as a potential orchestrating language, but I don't remember the sources, so this is 100% anecdotal.


Yeah, this has been my experience too: mixing agents/models from different companies.

Having Opus write a spec, then send to Gemini to revise, back to Opus to fix, then to me to read and approve..

Send to a local model like Qwen3.5 to build, then off to Opus to review ...

This was such an amazing flow, until Anthropic decided to change their minds.


This is still very much doable. This is exactly how I'm working. I'm using opencode with a mixture-of-agents I built (https://github.com/tessellate-digital/notion-agent-hive), where the model behind each agent is configurable.


You can still do all of this, with tmux. Nothing Anthropic can do about that.

Gemini CLI is horrible, though.


Something something medical researcher reinvents calculus.

In 2026: frontend web developer reinvents tmux.

Guys, please do us the service of pre-filtering your crack token dreams by investigating the tool stack that is already available in the terminal ... or at least give us the courtesy of explaining why your vibecoded Greenspun's Tenth Law special is a significant leg up on what already exists, and has perhaps existed for many years (and is therefore in the training set, and is therefore probably going to work perfectly out of the box).


Right, agents can just use tmux send-keys. Here's a skill I wrote to have Claude debug plugin code in the Helix editor's experimental plugin system. As usual, the skill is barely necessary; it just saves the model some time getting the commands right and tells it where some useful reference material is.

https://github.com/david-crespo/dotfiles/blob/main/claude/sk...


Maybe, just maybe, this is of obvious utility to the many people who have needs that are not yours?

I very regularly need to interact with my work through a Python interpreter. My work is scientific programming, so the variables might be arrays with millions of elements. To debug, optimize, verify, or improve my work in any way, I cannot rely on any method other than interacting with the code as it's being run, or while everything is still in memory. So if I want to really leverage LLMs, especially to let them work semi-autonomously, they must be able to do the same.

I'm not going to dump tens of GB of stuff to a log file or send it around via pipes or whatever. Why is there a NaN in an array that is the product of many earlier steps in a run that took an hour? Why are certain data in a 200k-variable system of equations much harder to fit than others, and which equations are in tension with each other, preventing better convergence?

Are interpreters and pdb not great, previously existing tools for this kind of work? Does a new tool that lets LLMs/agents use them really represent some sort of hack job just because better solutions have existed for years?
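
For concreteness, the kind of live, in-memory poking this workflow needs (a toy stand-in: `result` here is hypothetical and replaces the output of the hour-long run):

    # Hunt for NaNs in a large array while it is still in memory.
    import numpy as np

    rng = np.random.default_rng(0)
    with np.errstate(invalid="ignore"):
        result = np.log(rng.standard_normal(1_000_000))  # stand-in data

    bad = np.flatnonzero(np.isnan(result))   # indices of the NaNs
    print(len(bad), bad[:5])                 # how many, and where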


I agree that at first glance, it seems like tmux, or even long-running PTY shell calls in harnesses like Claude, solve this. They do keep processes alive across discrete interactions. But in practice, it’s kind of terrible, because the interaction model presented to the LLM is basically polling. Polling is slow and bloats context.

To avoid polling, you need to run the process with some knowledge of the internal interpreter state. Then a surprising number of edge cases start showing up once you start using it for real data science workflows. How do you support built-in debuggers? How do you handle in-band help? How do you handle long-running commands, interrupts, restarts, or segfaults in the interpreter? How do you deal with echo in multi-line inputs? How do you handle large outputs without filling the context window? Do you spill them to the filesystem somewhere instead of just truncating them, so the model can navigate them? What if the harness doesn’t have file tools? And so on.
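
For a flavor of the prompt-aware alternative, here is a minimal sketch using pexpect (not any particular tool's implementation, just the core idea: block on the prompt instead of polling):

    # Block until the interpreter's prompt reappears, instead of polling.
    import pexpect

    repl = pexpect.spawn("python3", ["-i", "-q"],
                         encoding="utf-8", timeout=60)
    repl.expect(">>> ")                  # wait for the first prompt

    def run(code):
        repl.sendline(code)
        repl.expect(">>> ")              # returns once the command finishes
        # repl.before holds the echoed input plus the output; stripping
        # the echo is itself one of the edge cases mentioned above
        return repl.before.partition("\n")[2]

    print(run("sum(range(10))"))         # -> 45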

Then there is sandboxing, which becomes another layer of complexity wrapped into the same tool.

I’ve been building a tool around this problem: `mcp-repl` https://github.com/posit-dev/mcp-repl

So tmux helps, but even with a skill and some shims, it does not really solve the core problem.


Are you aware that you can use tmux (or zellij, etc.), spin up the interpreter in a tmux session, and then the LLM can interact with it perfectly normally by using send-keys? And that this works quite well, because LLMs are trained on it? You just need to tell the LLM "I have ipython open in a tmux session named pythonrepl"

This is exactly how I do most of my data analysis work in Julia.


> I'm not going to dump tens of GB of stuff to a log file

In the same vein as the parent comment, the curiosity is why you would vibe code a solution instead of reaching for grep.


See related sibling: the use cases are compelling!

My complaint is that tmux handles them perfectly. Exactly what OP claims their software does is already served by robust, 18-year-old software.

In 2026, it costs nearly nothing to thoroughly and autonomously investigate related software — so yes, I am going to be purposefully abrasive about it.


And if you want to interact with tmux from within the Python interpreter, there is a very good library available, libtmux:

https://github.com/tmux-python/libtmux
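
A minimal sketch of driving a REPL pane with it (assuming a reasonably recent libtmux; exact method names have shifted across versions):

    # Spin up a named session, type into it, and scrape the pane.
    import time
    import libtmux

    server = libtmux.Server()
    session = server.new_session(session_name="pythonrepl", attach=False)
    pane = session.attached_pane

    pane.send_keys("python3 -q", enter=True)   # start an interpreter
    time.sleep(1)                              # crude wait (the polling caveat)
    pane.send_keys("print(6 * 7)", enter=True)
    time.sleep(1)
    print("\n".join(pane.capture_pane()))      # the pane's visible contents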


In the data science scenario you should just have proper tooling; for you, it sounds like that's a REPL the agent can interface with. I do this with nREPL/CIDER; in Python-land, maybe a Jupyter kernel over MCP. For stateful introspection where you don't control the tooling, tmux plus trivial glue gets you most of the way.

edit: There are much better solutions for Python-land below it seems :)


What I do is have a quick command that spins up a worktree on a repo, with my Ghostty splits arranged as I like them and a tmux session named after the worktree. I then tell Claude Code about the tmux session when it needs to look. It's pretty good at natively handling the tmux interactions.

Ideally Ghostty would offer primitives to launch splits, but c'est la vie. Apple automation it is.


You can start a tmux session and tell your agent about it and it will happily send commands and get the output from it.

I saw this post a while ago that turned me on to the idea: https://news.ycombinator.com/item?id=46570397


The problem is, they'll find there is typically already a good solution to their problem, and then they'll have nothing to write about.


At this point, it’s easier to (have the agent) build a simple tool like this than it is to find and set up an existing one.


I sincerely think the chatbot phenomenon is giving people the impression that whatever hallucinatory conversation they're having is profound, because it's the first time they personally have thought about it.

On one hand, it's normal in education and pedagogy to have the student or apprentice put the boring pieces together to find the wonder of the puzzle itself, but on the other, this is how we end up with https://xkcd.com/927/


I agree. We skipped CLIs and went all the way to TUIs because TUIs are "easy to make now"? Or maybe because of claude/codex?

But in practice you are padding the token counts of agents reading streams of TUI output, instead of leveraging the standard Unix pipes that have been around from day one.

TL;DR: your agent wants a CLI anyway.

Disclaimer: still a cool project and thank you to the author for sharing.


The TUI makes more sense to humans who don’t understand the difference between a human and a machine.



Why not use datacenter of geniuses to increase capacity? Grug confused.


It is confusing for a company to sell you a subscription service, say "Claude Code is covered", ship Claude Code with `claude -p`, and then say "oh right, actually, not _all of Claude Code_, don't try to use it as an executable ... sorry, right, the subscription only works as long as you're looking at that juicy little Claude Code logo in the TUI".

The disrespect Anthropic has for their user base is constant and palpable.


Subscriptions are going to leave you open to changes in the subscription terms at any time. This is especially true of all-you-can-eat (AYCE) subscriptions for something with a substantial marginal cost per additional use.

If you want unrestricted and unlimited usage, it's available through the API. Complaining about the subscription like this is basically saying, I want what you're offering, but I demand it for cheaper than what you charge for it. That doesn't make any more sense here than it does at the grocery store.


This strikes me the same way as the people in college who would print 497 empty pages at the end of the semester for the quota "they'd paid for", or that one guy who made lemonade at restaurants from the free lemon wedges and sugar packets. "Contempt for users" is silly. Adjusting terms to handle users who use things in unintended ways isn't contempt.


Contempt for users is not silly when the CEO of said company has repeatedly claimed they will replace SWEs "end-to-end" by next year.

I'm not sure what to say. Either you're listening to the actions of these companies, or you're not in a place where you feel the need to be concerned by them.

I'm in a place where I am concerned by their actions, and by the impact their claims and behavior have on the working environment around me.


At no point in the last 10,000 years of human civilization has there not been a developing technology that threatened to forever reshape and displace a class of labor.

Or are you also upset about the modern plight of the telephone operator, farrier, or coal miner?


I see -- and AI is just like all technologies that came before it ...

It is not a class of labor ... it is all digital labor. Do you or do you not understand this?

It is digital knowledge itself, and then all communication labor, and then all physical labor with robotics.

Is this clear to you?


Are SWEs the only digital labor job?


And? Hyperbolic fear of change always exists, and there has always been more work.

Marx's whole idea of communism was predicated on his assumption that industrialization would lead to a post-scarcity society requiring virtually no work and an overhaul of how everything was owned and produced. Boy, was he wrong.


Oh nooo, labor might be automated and we might see advancement that makes the Industrial Revolution look small! Oh, the humanity! Please someone, stop progressing humanity, I need to cling to my sticks!


Did he say they will replace SWEs, or something more nuanced, such as that code will be written by AI tools?

Honest question from my end, I try to not read every AI related news that keeps telling me “it’s over, good luck feeding your family in 9-12 months”.


You could think about it this way:

All AI prices will rise soon, probably shortly after the IPOs. The new prices will be eye-watering compared with today's. This billing change is lengthening the time until Anthropic has to raise subscription prices, so those of us who aren't doing 24hr claw stuff can continue to use the tools the way we've gotten used to.


The vast majority of your complaints are handled by libghostty-vt itself, not by this person's Emacs wrapper around libghostty.

Ghostty is a great piece of software, with a stellar maintainer who has a very pragmatic and measured take on using AI to develop software.


Looking at the sophistication of modern security exploits, I'd say that just a few minor gaps, strategically positioned, can lead to surprisingly drastic results. Of course, Emacs is a niche editor/IDE/OS/whatnot, so an unlikely target, but still.

It's a great proof of concept though. In the meantime, I'll stick with vterm.


No malicious person is using Emacs. The user base is full of painfully honest people.


I hope they all secure their MELPA accounts properly, too!


    (be-malicious)
    Debugger entered--Lisp error: (void-function be-malicious)

Yep, didn't work for me


Cool work!

Aside, but 12 MB is ... large ... for such a thing. For reference, an entire HTTP stack (including crypto and TLS) with LLM API calls in Zig nets you a binary of ~400 KB on ReleaseSmall (statically linked).

You can implement an entire language, compiler, and VM in another 500 KB (or less!).

I don't think 12 MB is an impressive badge here?


It's written in Go. 12 MB barely gets you "hello world", since everything is statically linked. With that in mind, the size is impressive.


Go doesn't statically link everything by default (anymore?); this is from FreeBSD:

    $ ls -l axe
    -rwxr-xr-x  1 root wheel 12830781 Mar 12 22:38 axe*
    
    $ ldd axe
    axe:
        libthr.so.3 => /lib/libthr.so.3 (0xe2e74a1d000)
        libc.so.7 => /lib/libc.so.7 (0xe2e74c27000)
        libsys.so.7 => /lib/libsys.so.7 (0xe2e75de6000)
        [vdso] (0xe2e7366b000)


Off topic, I know, but is that mostly coming from the Go runtime? Roughly how large is it?


The excessive size of Go binaries is a common complaint. I last recall seeing a related discussion on Lobsters [1]. Who knows, maybe the binary could be shrunk a bit? IMHO a 12 MB binary is not that big of a deal.

--

1: https://lobste.rs/s/tzyslr/reducing_size_go_binaries_by_up_7...


12 MB is not large; it's like 3 minutes of watching YouTube. Actual RAM consumption is only weakly correlated with binary size, and that's what matters.


It is large compared to a stripped Zig ReleaseSmall binary with no runtime. With agents, one could take this repo and produce an extremely small binary.

To your point: why even advertise the number? If that particular number is completely irrelevant in practical usage, why mention it? It seems like the point is to impress, hence my response.

