
The point is that ideally the models keep improving until they can solve problems people care about. Which is already partly true, but there are lots of problems that are still out of reach.

I think OpenAI is more of an aesthetic. Very... Apple-like, polished, with an eye towards making really cool stuff. And aesthetics are a type of philosophy.

This is less noble than how Anthropic presents themselves, but still much more attractive to many than xAI.


The feeling on the street is that Anthropic IS the Apple of the AIs.


Come now, surely Anthropic is a premium Linux distribution.


And Apple a premium Unix derivative?

To a researcher, the aesthetic is more like Bell Labs, with many research teams working with some autonomy, which is why the public naming of model releases appears chaotic. Very different to the top-down approach of Apple.


> aesthetics are a type of philosophy.

What philosophy is that?


It's literally called aesthetics; the philosophical discipline is the original meaning of the word - https://en.wikipedia.org/wiki/Aesthetics

Properly, focusing on aesthetics as an ethic would be practicing the philosophy of aestheticism - https://en.wikipedia.org/wiki/Aestheticism



How can Cursor be worth more than a few billion? Claude/Codex are already better autonomous SWE-lite replacements. Cognition surely has a better internal harness. Cursor does have a lot of users, I'll give it that.


I like Cursor a lot more than Claude Code. It works better for me overall. I like the way they integrate it into the IDE so the agent is my tool rather than a 'partner' or something like that. I'm pretty sad that they lost some engineers, I hope these folks weren't integral to Cursor in any way.


Distribution is also important. Cursor is a great normie tool (I’m one of them), with probably more enterprise deals than the competition.


Moats are weird right now… but Cursor doesn’t have one at all so I agree it can’t really be worth much.


I think closer to tens of thousands of dollars, by my napkin math!


A lot of the value of tests is confirming that the system hasn't regressed from the behavior of the original release. It's bad if the original release is wrong, but it's a separate issue if the system later accidentally stops behaving the way it did originally.


The issue I see is that the high test coverage created by having LLMs write tests means almost all non-trivial changes break tests, even when they don't change behavior in ways that are visible from the outside. In one project I work on, we require 100% test coverage, so people just have LLMs write tons of tests, and now every change I make to the codebase breaks tests.

So now people just ignore broken tests.

> Claude, please implement this feature.

> Claude, please fix the tests.

The only thing we've gained from this is that we can brag about test coverage.


My best unit tests are 3 lines, one of them whitespace, and they assert one single thing that's in the requirements.

These are the only tests I've witnessed people delete outright when the requirements change. Anything more complex than this, they'll worry that there's some secondary assertion being implied by a test so they can't just delete it.

Which, really is just experience telling them that the code smells they see in the tests are actually part of the test.

meanwhile:

    it("only has one shipping address", ...
is demonstrably a dead test when the story is "allow users to have multiple shipping addresses", as is a test that makes sure balances can't go negative when we decide to allow a 5-day grace period on account balances. But if it's just one of six asserts in the same massive test, then people get nervous and start losing time.
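For illustration, here's a hypothetical test in that minimal style (all names here are invented): three lines, one of them whitespace, asserting a single requirement. When the requirement dies, the whole test obviously dies with it.

```python
# `add_address` stands in for the system under test (an invented name).
def add_address(addresses, new_address):
    return addresses + [new_address]

# Three lines, one of them whitespace, asserting exactly one requirement.
def test_user_can_have_multiple_shipping_addresses():
    addresses = add_address(add_address([], "home"), "office")

    assert len(addresses) > 1

test_user_can_have_multiple_shipping_addresses()
```

If the story later changes to "one address per user", there's no secondary assertion hiding in there, so nobody hesitates to delete it outright.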


Unit tests vs acceptance tests. You shouldn't be afraid to throw away unit tests if the implementation changes, and acceptance tests should verify behavior at API boundaries, ignoring implementation details.


BDD helps with this as it can allow you to get the setup out of the tests making it even cheaper for someone to yeet a defunct test.


I feel it ends up being a massive drag on development velocity and makes refactoring to simpler designs incredibly painful.

But hey, we're just supposed to let the AIs run wild and rewrite everything on every change, so maybe that's a heretical view.


>simpler designs

Some complex designs might just be hacks on hacks, but some are Chesterton's fences


I think the first enlightenment is that software engineers should be able to abstract away these algorithms to reliable libraries.

The second enlightenment is that if you don't understand what the libraries are doing, you will probably ship things that assemble the libraries in unreasonably slow/expensive ways, lacking the intuition for how "hard" the overall operation should be.
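A toy illustration of that second point (an invented example, using only built-ins): two ways to assemble the same deduplicate-preserving-order operation, one quadratic and one linear. Without intuition for how "hard" the operation should be, the slow version looks just as reasonable.

```python
def dedupe_slow(items):
    out = []
    for x in items:
        if x not in out:  # O(n) scan of a list per element -> O(n^2) total
            out.append(x)
    return out

def dedupe_fast(items):
    seen = set()
    out = []
    for x in items:
        if x not in seen:  # O(1) average hash lookup -> O(n) total
            seen.add(x)
            out.append(x)
    return out

# Same answer, very different cost curves as the input grows.
data = list(range(1000)) * 2
assert dedupe_slow(data) == dedupe_fast(data) == list(range(1000))
```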


That is the claim of the post. I also don't see confirmation elsewhere.


There is very little information around, this is the most authoritative post I could find. There are some comments on X as well.

According to this blogpost, he sadly passed away last Thursday, March 5th.


There were a few recent edits about this on Tony Hoare's Wikipedia page which were reverted because there was no substantial evidence: https://en.wikipedia.org/w/index.php?title=Tony_Hoare&action...


It was edited again a few minutes ago and now displays Sunday, March 8th as his date of death.


And it's gone again!


I have no skin in the game here, but this seems a bit "sharp-edged", do you have something against the guy? He just seems deep into his influencer/retired hobbyist arc to me...


No, and me too. It had just been sitting in my chest a while, seeing people expect non-hobbyist work from him. And I had been worried to post it, because things you and I understand become sharp-edged when spoken out loud to other people who don't.


I think the big frustration I've had in learning modern ML is that the entire owl is just so complicated. A poor explainer reads like "black box is black boxing the other black box", completely undecipherable. A mediocre-to-above-average explanation will be like "(loosely introduced concept) is (doing something that sounds meaningful) to black box", which is a little better. However, when explanations start getting more accurate, you run into the sheer volume of concepts/data transforms taking place in a transformer, and there's too much information to be useful as a pedagogical device.


Today I got a feature request from another team on a call. I typed it into our Slack channel as a note. Someone typed @cursor and moments later the feature was implemented (correctly) and ready to merge.

The tools are good! The main bottleneck right now is better scaffolding so that they can be thoroughly adopted and so that the agents can QA their own work.

I see no particular reason not to think that software engineering as we know it will be massively disrupted in the next few years, and probably other industries close behind.


It really doesn't matter how "good" these tools feel, or whatever vague metric you want - they hemorrhage cash at a rate perhaps not seen in human history. In other words, that usage you like is costing them tons of money - the bet is that energy/compute will become vastly cheaper in a matter of a couple of years (extremely unlikely), or they find other ways to monetize that don't absolutely destroy the utility of their product (ads, an area we have seen google flop in spectacularly).

And even say the latter strategy works - ads are driven by consumption. If you believe 100% openAI's vision of these tools replacing huge swaths of the workforce reasonably quickly, who will be left to consume? It's all nonsense, and the numbers are nonsense if you spend any real time considering it. The fact SoftBank is a major investor should be a dead giveaway.


Indeed. Many of the posts I see on here are hilarious.

Have any of you tried reproducing an identical output, given an identical set of inputs? It simply doesn't happen. It's like a lottery.

This lack of reproducibility is a huge problem and limits how far the thing can go.


LLMs have randomness baked into every single token they generate. You can try running an LLM locally with the temperature set low, and it immediately feels boring to always get the same reply every time. It's the randomness that makes them feel "smart". Put another way, randomness is required for the illusion of intelligence.


I'm fully aware of that. However, this illusion is a dangerous mirage. It doesn't equate to reality. In some cases that's OK, but in most cases it's not, especially in the context of business operations.


Determinism in agents is a complex topic because there are several different layers of abstraction, each of which may introduce its own non-determinism. But yeah, it is going to be difficult to induce determinism in a commercial coding agent, for reasons discussed below.

However, we can start by claiming that non-determinism is not necessarily a bad thing - non-greedy token sampling helps prevent certain degenerate/repetitive states and tends to produce overall higher quality responses [0]. I would also observe that part of the yin-yang of working with the agents is letting go of the idea that one is working with a "compiler" and thinking of it more as a promising but fallible collaborator.

With that out of the way, what leads to non-determinism? The classic explanation is the sampling strategy used to select the next token from the LLM. As mentioned above, there are incentives to use a non-zero temperature for this, which means that most LLM APIs are intentionally non-deterministic by default. And, even at temperature zero LLMs are not 100% deterministic [1]. But it's usually pretty close; I am running a local LLM as we speak with greedy sampling and the result is predictably the same each time.
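A toy sketch of how temperature shapes this (pure illustration of the sampling math, not any particular vendor's implementation): at temperature 0 sampling collapses to a greedy argmax, so repeated runs agree; at higher temperatures the softmax spreads probability across candidates and different runs diverge.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick an index from raw logit scores; temperature 0 means greedy argmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [score / temperature for score in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [2.0, 1.5, 0.1]  # toy scores for three candidate tokens

# Greedy: every seed picks the same token.
greedy = {sample_token(logits, 0, random.Random(seed)) for seed in range(100)}
print(greedy)  # {0}

# Temperature 1: different seeds pick different tokens.
warm = {sample_token(logits, 1.0, random.Random(seed)) for seed in range(100)}
print(len(warm))  # almost certainly more than one distinct token
```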

Proprietary reasoning models are another layer of abstraction that may not even offer temperature as a knob anymore [2]. I think Claude still offers it, but it doesn't guarantee 100% determinism at temperature 0 either. [3]

Finally, an agentic tool loop may encounter different results from run to run via tool calls -- it's pretty hard to force a truly reproducible environment from run to run.

So, yeah, at best you could get something that is "mostly" deterministic if you coded up your own coding agent that focused on using models that support temperature and always forced it to zero, while carefully ensuring that your environment has not changed from run to run. And this would, unfortunately, probably produce worse output than a non-deterministic model.

[0] https://arxiv.org/abs/2007.14966 [1] https://thinkingmachines.ai/blog/defeating-nondeterminism-in... [2] https://learn.microsoft.com/en-us/azure/ai-foundry/openai/ho... [3] https://platform.claude.com/docs/en/about-claude/glossary


Appreciate the response. I agree that non-determinism isn't inherently a bad thing. However, LLMs are being pushed as the thing to replace much of the deterministic machinery that exists in the world - and anyone seen to be thinking otherwise gets punished, e.g. in the stock market.

This world of extremes is annoying for people who have the ability to think more broadly and see a world where deterministic systems and non-deterministic systems can work together, where it makes sense.


Yeah, I think you're right that LLMs are overused. In most cases where a deterministic system is feasible and desirable, it's also much faster and cheaper than using an LLM.


> In other words, that usage you like is costing them tons of money

Evidence? I’m sure someone will argue, but I think it’s generally accepted that inference can be done profitably at this point. The cost for equivalent capability is also plummeting.


I didn't think there would need to be more evidence than the fact they are saying they need to spend $600 billion in 4 years on $13bn revenue currently, but here we are.

Here you go: https://www.wsj.com/livecoverage/stock-market-today-dow-sp-5...


Right, but if OpenAI wanted to stop doing research and just monetize its current models, all indications are that it would be profitable. If not, various adjustments to pricing/ads/ etc could get it there. However, it has no reason to do this, and like all the other labs is going insanely into debt to develop more models. I'm not saying that it's necessarily going to work out, but they're far from the first company to prioritize growth over profitability


This meme needs to go in the bin. Loss making companies love inventing strange new accounting metrics, which is one reason public companies are forced to report in standardized ways.

There's no such thing as "profitable inference". A company is either profitable or it isn't.

Let's for a second assume all the labs somehow manage to form a secret OPEC-style cartel that agrees to slow training to a halt, and nobody notices or investigates. This is already hard to imagine with the amount of scrutiny they're under and given that China views this as a military priority. But let's pretend they manage it. These firms also have lots of other costs:

• Staffing and comp! That's huge!

• User subsidies to allow flat rate plans

• Support (including abuse control and handling the escalations from their support bots)

• Marketing

• Legal fees and data licensing

• Corporate/enterprise sales, which is expensive as hell even though it's often worth it

• Debt servicing (!!)

• Generating returns for investors

Inferencing margins have to cover all of those even if progress stops tomorrow, and the RoI to investors likewise has to be very large, so margins can't be trivial. Yet what these firms have said about their margins is very ambiguous. Since they arrive at this statement by excluding major cost components like training, it's not clear what they think the cost of inferencing actually is. Are they excluding other things too, like hardware depreciation and upgrades? Are they excluding the cost of the corporate sales/support infrastructure around the inferencing?


To be clear, it's absolutely impossible for OpenAI and the others to stop. The valuation and honestly the global markets depend on them staying leveraged to the hilt. So they're not going to stop. However, the point is that the models are genuinely useful and people pay for them, and if we reset the timeline with a company that has just the current proprietary models, they could turn a profit. That might involve charging more than they do now, etc. But this is much different than OpenAI, specifically, trying to turn a profit today, which wouldn't work for many reasons.

But also, "profitable inference" IS a thing! "Gross margin" is important and meaningful, even if a company has other obligations that mean it's overall not profitable.


"profitable on inference" means "marginal costs of inference are lower than revenue". It is very common to distinguish between upfront costs vs. marginal costs when judging the economic viability of a business.

You mention "debt servicing", but OpenAI has no debt. All the money they have raised is equity not debt.


Nope. The only "indication" is that they say so. They may be making a profit on API usage, but even that is very suspect - compare against how much it actually costs to rent a rack of B200s from Microsoft. And for the millions of people using Codex/Claude Code/Copilot, the $20/$30/$200 price points clearly don't cover the actual cost of inference.


What was the feature and what was the note?


It was a modest update to a UX ... certainly nothing world-changing. (It's also had success with some backend performance refactors, but this particular change was all frontend.) The note was basically just a transcription of what I was asked to do, and did not provide any technical hints as to how to go about the work. The agent figured out what codebase, application, and file to modify and made the correct edit.


That's pretty neat! Thanks for elaborating.


Yeah but was Cursor using Claude? What's the moat that any of these companies have that prevents me from using another LLM?




There's been a huge amount of improvement in coding agent effectiveness since they ran that experiment. In a more recent follow up experiment, METR found 20% speed up from AI assistance and says they believe that is likely an underestimate of the impact. https://metr.org/blog/2026-02-24-uplift-update/

They are working on making a new measurement approach that will be more accurate.


Respectfully, was this comment AI generated? It has all the signs.

And scaffolding does matter a lot, but mostly because the models just got a lot better and the corresponding scaffolding for long running tasks hasn't really caught up yet.

