
It's also worth considering that past some threshold, it may be very difficult for us as users to discern which model is better. I don't think that's what's going on here, but we should be ready for it. For example, if you are an Elo 1000 chess player, would you yourself be able to tell if Magnus Carlsen or another grandmaster were better by playing them individually? To the extent that our AGI/ASI metrics are based on human judgement, the cluster effect they create may be an illusion.


> For example, if you are an Elo 1000 chess player, would you yourself be able to tell if Magnus Carlsen or another grandmaster were better by playing them individually?

No, and I wouldn't be able to tell you what either player did wrong in general.

By contrast, the shortcomings of today's LLMs seem pretty obvious to me.


Actually, chess commentators do this all the time. They have the luxury of consulting with others and discussing and analyzing freely, even without the use of an engine.


Au contraire, AlphaGo made several “counterintuitive” moves that professional Go players thought were mistakes during the play, but turned out to be great strategic moves in hindsight.

The (in)ability to recognize a strange move’s brilliance might depend on the complexity of the game. The real world is much more complex than any board game.

https://en.m.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol


That's a good point, but I doubt that Sonnet adding a very contrived bug that crashes my app is some genius move that I fail to understand.

Unless it's a MUCH bigger play where through some butterfly effect it wants me to fail at something so I can succeed at something else.

My real name is John Connor by the way ;)


ASI is here and it's just pretending it can't count the b's in blueberry :D


Thanks, this made my day :-D


That's great, but AlphaGo used artificial and constrained training materials. It's a lot easier to optimize things when you can actually define an objective score, and especially when your system is able to generate valid training materials on its own.


"artificial and constrained training materials"

Are you simply referring to games having a defined win/loss reward function?

Because I'm pretty sure AlphaGo was groundbreaking also because it was self-taught, by playing itself; there were no training materials, unless you count the rules of the game itself as the constraint.

But even then, from move to move, there are huge decisions to be made that are NOT easily defined with a win/loss reward function. Especially in the early game, there are many moves that don't obviously have an objective score to optimize against.

You could make the big leap and say that Go is so open-ended that it does model life.


That quote was intended to mean --

"artificial" maybe I should have said "synthetic"? I mean the computer can teach itself.

"constrained" the game has rules that can be evaluated

And as to the other point, I don't know what to tell you; I don't think anything I said is inconsistent with the quotes below.

It's clearly not just a generic LLM, and it's only possible to generate a billion training examples for it to play against itself because synthetic data is valid. And synthetic data contains training examples no human has ever produced, which is why it's not at all surprising it did things humans never would try. An LLM would just try patterns that, at best, are published in human-generated Go game histories or synthesized from them. I think this inherently limits the amount of exploration it can do of the game space, and it would similarly be much less likely to generate novel moves.

https://en.wikipedia.org/wiki/AlphaGo

> As of 2016, AlphaGo's algorithm uses a combination of machine learning and tree search techniques, combined with extensive training, both from human and computer play. It uses Monte Carlo tree search, guided by a "value network" and a "policy network", both implemented using deep neural network technology.[5][4] A limited amount of game-specific feature detection pre-processing (for example, to highlight whether a move matches a nakade pattern) is applied to the input before it is sent to the neural networks.[4] The networks are convolutional neural networks with 12 layers, trained by reinforcement learning.[4]

> The system's neural networks were initially bootstrapped from human gameplay expertise. AlphaGo was initially trained to mimic human play by attempting to match the moves of expert players from recorded historical games, using a database of around 30 million moves.[21] Once it had reached a certain degree of proficiency, it was trained further by being set to play large numbers of games against other instances of itself, using reinforcement learning to improve its play.[5] To avoid "disrespectfully" wasting its opponent's time, the program is specifically programmed to resign if its assessment of win probability falls beneath a certain threshold; for the match against Lee, the resignation threshold was set to 20%.[64]
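To make the quoted description concrete: below is a minimal, illustrative Python sketch of Monte Carlo tree search where a policy network supplies move priors and a value network scores leaf positions instead of playing rollouts out to the end. It is not AlphaGo's code; the toy game, the uniform policy_net and the random value_net are placeholder stand-ins for Go and the trained networks.

    import math, random

    def legal_moves(s):   return [] if s >= 10 else [1, 2]   # toy game: add 1 or 2, race to 10
    def play(s, m):       return s + m
    def policy_net(s, moves): return {m: 1 / len(moves) for m in moves}  # stand-in: uniform prior
    def value_net(s):     return random.uniform(-1, 1)                   # stand-in: fake evaluation

    class Node:
        def __init__(self, prior):
            self.prior, self.visits, self.value, self.children = prior, 0, 0.0, {}

    def puct(parent, child, c=1.5):
        # trade off the move's average value (exploitation) against its prior and novelty (exploration)
        q = child.value / child.visits if child.visits else 0.0
        u = c * child.prior * math.sqrt(parent.visits + 1) / (1 + child.visits)
        return q + u

    def simulate(state, node):
        # one MCTS pass: walk down the tree, expand a leaf, back the value up
        moves = legal_moves(state)
        if not moves:
            return 0.0                                  # terminal position in the toy game
        if not node.children:                           # expand: the policy net supplies priors
            for m, p in policy_net(state, moves).items():
                node.children[m] = Node(p)
            return value_net(state)                     # the value net replaces a full rollout
        move, child = max(node.children.items(), key=lambda kv: puct(node, kv[1]))
        v = -simulate(play(state, move), child)         # opponent to move next, so negate
        child.visits += 1
        child.value += v
        return v

    root = Node(1.0)
    for _ in range(200):
        root.visits += 1
        simulate(0, root)
    print("most visited move:", max(root.children, key=lambda m: root.children[m].visits))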


Of course, it's not an LLM. I was just referring to AI technology in general, and to the fact that goal functions can be complicated and non-obvious even for a game world with known rules and outcomes.

I was misremembering the order in which things happened.

AlphaGo Zero, a later iteration after the famous matches, was trained without human data.

"AlphaGo's team published an article in the journal Nature on 19 October 2017, introducing AlphaGo Zero, a version without human data and stronger than any previous human-champion-defeating version.[52] By playing games against itself, AlphaGo Zero surpassed the strength of AlphaGo Lee in three days by winning 100 games to 0, reached the level of AlphaGo Master in 21 days, and exceeded all the old versions in 40 days.[53]"


There are quite a few relatively objective criteria in the real world: real estate holdings, money and material possessions, power to influence people and events, etc.

The complexity of achieving those might result in the "Centaur Era", when humans+computers are superior to either alone, lasting longer than the Centaur chess era, which spanned only 1-2 decades before engines like Stockfish made humans superfluous.

However, in well-defined domains, like medical diagnostics, it seems reasoning models alone are already superior to primary care physicians, according to at least 6 studies.

Ref: When Doctors With A.I. Are Outperformed by A.I. Alone by Dr. Eric Topol https://substack.com/@erictopol/p-156304196


It makes sense. People said software engineers would be easy to replace with AI, because our work can be run on a computer and easily tested. But the disconnect is that the primary strength of LLMs is drawing on huge bodies of information, and that's not the primary skill programmers are paid for. It does help when you're doing trivial CRUD work or writing boilerplate, but every programmer will eventually have to actually, truly reason about code, and LLMs fundamentally cannot do that (not even the "reasoning" models).

Medical diagnosis relies heavily on knowledge, pattern recognition, a bunch of heuristics, educated guesses, luck, etc. These are all things LLMs do very well. They don't need a high degree of accuracy, because humans are already doing this work with a pretty low degree of accuracy. They just have to be a little more accurate.


Being a walking encyclopedia is not what we pay doctors for either. We pay them to account for the half-truths and actual lies that people tell about their health. This is to say nothing of novel presentations that come about because of the genetic lottery. Just as an AI can assist but not replace a software engineer, an AI can assist but not replace a doctor.


Having worked briefly in the medical field in the 1990s, I saw a sort of "greedy matching" being pursued: once one or two well-known symptoms that can be associated with a disease are recognized, the standard interventions are initiated.

A more "proper" approach would be to work with sets of hypotheses and to conduct tests to exclude alternative explanations gradually - which medics call "DD" (differential diagnosis). Sadly, this is often not systematically done, and instead people jump on the first diagnosis and try if the intervention "fixes" things.

So I agree there are huge gains from "low hanging fruits" to be expected in the medical domain.


I think at this point it's an absurd take that they aren't reasoning. I don't think you can get such high scores on competitive coding and the IMO without reasoning about code (and math).

AlphaZero also doesn't need training data as input--it's generated by game-play. The information fed in is just the game rules. Theoretically this should also be possible in research math; less so in programming, because we care about less rigid things like style. But if you rigorously defined the objective, training data shouldn't be necessary either.
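To illustrate the "generated by game-play, the information fed in is just the game rules" point, here is a heavily simplified self-play sketch in Python. The toy Nim-like game and the random move choice are placeholders; a real system would pick moves with the current network plus tree search, but the shape of the loop is the same: play games against yourself, label every position with the eventual outcome, and that becomes the training set.

    import random

    def legal_moves(state):
        # the entire "rules": take 1 or 2 sticks, taking the last stick wins
        return [m for m in (1, 2) if m <= state]

    def self_play_game(start=10):
        state, player, history, winner = start, 0, [], None
        while state > 0:
            move = random.choice(legal_moves(state))   # a real system uses the network + search here
            history.append((state, player, move))
            state -= move
            if state == 0:
                winner = player                        # took the last stick
            player = 1 - player
        # label each recorded position: +1 if the player to move went on to win, else -1
        return [(s, m, 1 if p == winner else -1) for (s, p, m) in history]

    # a "training set" produced from nothing but the rules; no human games involved
    dataset = [ex for _ in range(1000) for ex in self_play_game()]
    print(len(dataset), "examples, e.g.", dataset[0])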


> AlphaZero also doesn't need training data as input--it's generated by game-play. The information fed in is just the game rules

This is wrong; it wasn't just fed the rules, it was also given a harness that tested viable moves and searched for optimal ones using a depth-first search method.

Without that harness it would not have reached superhuman performance. Such a harness is easy to build for Go but not for more complex things. The harder it is to build an effective harness for a topic, the harder that topic is for AI models to solve: it is relatively easy to build a good harness for very well-defined problems like competitive programming, but much, much harder for general-purpose programming.


Are you talking about Monte Carlo tree search? I consider it part of the algorithm in AlphaZero's case. But agreed that RL is a lot harder in real-life setting than in a board game setting.


The harness is obtained from the game rules? The "harness" is part of AlphaZero's algorithm.


> the "harness" is part of the algorithm of alphzero

Then that is not a general algorithm, and results from it don't apply to other problems.


If you mean CoT, it's mostly fake https://www.anthropic.com/research/reasoning-models-dont-say...

If you mean symbolic reasoning, well it's pretty obvious that they aren't doing it since they fail basic arithmetic.


> If you mean CoT, it's mostly fake

If that's your take-away from that paper, it seems you've arrived at the wrong conclusion. It's not that it's "fake", it's that it doesn't give the full picture, and if you only rely on CoT to catch "undesirable" behavior, you'll miss a lot. There is a lot more nuance than you allude to, from the paper itself:

> These results suggest that CoT monitoring is a promising way of noticing undesired behaviors during training and evaluations, but that it is not sufficient to rule them out.


Very few humans are as good as these models at arithmetic. And CoT is not "mostly fake"; that's not a correct interpretation of that research. It can be deceptive, but so can human justifications of actions.


Humans can learn the symbolic rules and then apply them correctly to any problem, bounded only by time, and modulo lapses of concentration. LLMs fundamentally do not work this way, which is a major shortcoming.

They can convincingly mimic human thought, but the illusion falls apart on closer inspection.


What? Do you mean like this??? https://www.reddit.com/r/OpenAI/comments/1mkrrbx/chatgpt_5_h...

Calculators have been better than humans at arithmetic for well over half a century. Calculators can reason?


It's an absurd take to actually believe they can reason. The cutting edge "reasoning model," by the way:

https://bsky.app/profile/kjhealy.co/post/3lvtxbtexg226


Humans are statistically speaking static. We just find out more about them but the humans themselves don't meaningfully change unless you start looking at much longer time scales. The state of the rest of the world is in constant flux and much harder to model.


I’m not sure I agree with this - it took humans about a month to go from “wow this AI generated art is amazing” to “zzzz it’s just AI art”.


To be fair, it was more a "wow look what the computer did". The AI "art" was always bad. At first it was just bad because it was visually incongruous. Then they improved the finger counting kernel, and now it's bad because it's a shallow cultural average.

AI producing visual art has only flooded the internet with "slop", the commonly accepted term. It's something that meets the bare criteria, but falls short in producing anything actually enjoyable or worth anyone's time.


It sucks for art almost by definition, because art exists for its own reason and is in some way novel.

However, even artists need supporting materials and tooling that meet bare criteria. Some care what kind of wood their brush is made from, but I'd guess most do not.

I suspect it'll prove useless at the heart of almost every art form, but powerful at the periphery.


That's culture, not genetics.


Sure, that does make things easier: one of the reasons Go took so long to solve is that one cannot define an objective score for Go beyond the end result being a boolean win or lose.

But IRL? Lots of measures exist, from money to votes to exam scores, and a big part of the problem is Goodhart's law — that the easy-to-define measures aren't sufficiently good at capturing what we care about, so we must not optimise too hard for those scores.


> Sure, that does make things easier: one of the reasons Go took so long to solve is that one cannot define an objective score for Go beyond the end result being a boolean win or lose.

Winning or losing a Go game is a much shorter term objective than making or losing money at a job.

> But IRL? Lots of measures exist

No, not that are shorter term than winning or losing a Go game. A game of Go is very short, much much shorter than the time it takes for a human to get fired for incompetence.


Time horizon is a completely different question to what I'm responding to.

I agree the time horizon of current SOTA models isn't particularly impressive. It doesn't matter for this point, though.


I just want to point out that the "during the play" window was only about 5 moves of the game.


No? Some of the opening moves took experts thorough analysis to figure out they were not mistakes, even in game 1 for example, not just the move 37 thing. Also thematic ideas like 3-3 invasions.


I think it's doable, tbh, if you pour in enough resources (smart people, energy, compute power, etc.), like the entire planet's resources.

Of course we can have AGI (damned if we don't); having put in so much, it had better work.

But the problem is we can't do that right now because it's so expensive. AGI is not a matter of if but when.

But even then, it's always about the cost.


There may be philosophical (i.e. fundamental) challenges to AGI. Consider, e.g., Gödel's Incompleteness Theorem, though Scott Aaronson argues this does not matter (see, e.g., the YouTube video "How Much Math Is Knowable?"). There would also seem to be limits to the computation of potentially chaotic systems. And in general, verifying physical theories has required carrying out actual physical experiments. Even if we were to build a fully reasoning model, "pondering" is not always sufficient.


It’s also easy to forget that “reason is the slave of the passions” (Hume) - a lot of what we regard as intelligence is explicitly tied to other, baser (or more elevated) parts of the human experience.


Yeah, but that's the robotics industry's part of the work, not this company's.

They just need to "MCP" it into a robot body and it works (also part of the reason why OpenAI is buying a robotics company).


I think chess commentators are pretty lost when analyzing games of higher rated players without engines.

They are good at framing what is going on, going over general plans, and walking through some calculations and potential tactics. But I wouldn't say even really strong players like Leko, Polgar, or Anand will have greater insights into a Magnus-Fabi game without the engine.


Anyone more than ~300 points below the players can only contribute to the discussion in a superficial capacity though


The argument is about the future, not now.


The future had us abandoning traditional currency in favor of Bitcoin; it had digital artists selling NFTs of their work; it had supersonic jet travel, self-driving or even flying cars. It had population centers on the moon, mines on asteroids, fusion power plants, etc.

I think large language models have the same future as supersonic jet travel. Their usefulness will fail to materialize, with traditional models being good enough at a fraction of the price, while some startups keep trying to push this technology and consumers keep rejecting it.


Even if models keep stagnating at roughly the current state of the art (with only minor gains), we are still working through the massive economic changes they will bring.

Unlike supersonic passenger jet travel, which is possible and happened, but never had much of an impact on the wider economy, because it never caught on.


Cost was what brought supersonic down. Comparatively speaking, it may be the cost/benefit curve that decides the limit of this generation of technology. It seems to me the stuff we are looking at now is massively subsidised by exuberant private investment. The way these things go, there will come a point where investors want to see a return, and that will decide whether the wheels keep spinning in the data centre.

That said, supersonic flight is yet very much a thing in military circles …


Yes, cost is important. Very important.

AI is a bit like railways in the 19th century: once you train the model (= once you put down the track), actually running the inference (= running your trains) is comparatively cheap.

Even if the companies later go bankrupt and investors lose interest, the trained models are still there (= the rails stay in place).

That was reasonably common in the US: some promising company would get British (and German etc) investors to put up money to lay down tracks. Later the American company would go bust, but the rails stayed in America.


I think there is a fundamental difference though. In the 19th century when you had a rail line between two places it pretty much established the only means of transport between those places. Unless there was a river or a canal in place, the alternative was pretty much walking (or maybe a horse and a carriage).

The large language models are not that much better than a single artist / programmer / technical writer (in fact they are significantly worse) working for a couple of hours. Modern tools do indeed increase the productivity of workers to the extent where AI generated content is not worth it in most (all?) industries (unless you are very cheap; but then maybe your workers will organize against you).

If we want to keep the railway analogy, training an AI model in 2025 is like building a railway line in 2025 where there is already a highway, and the highway is already sufficient for the traffic it gets, and won’t require expansion in the foreseeable future.


> The large language models are not that much better than a single artist / programmer / technical writer (in fact they are significantly worse) working for a couple of hours.

That's like saying sitting on the train for an hour isn't better than walking for a day?

> [...] (unless you are very cheap; but then maybe your workers will organize against you).

I don't understand that. Did workers organise against vacuum cleaners? And what do eg new companies care about organised workers, if they don't hire them in the first place?

Dock workers organised against container shipping. They mostly succeeded in old established ports being sidelined in favour of newer, less annoying ports.


> That's like saying sitting on the train for an hour isn't better than walking for a day?

No, that's not it at all. Hiring a qualified worker for a few hours (or having one on staff) is not like walking for a day vs. riding a train. First of all, the train is capable of carrying a ton of cargo, which you will never be able to carry on foot unless you have some horses or mules with you. So having a train line offers you capabilities that simply didn't exist before (unless you had a canal or a navigable river that goes to your destination). LLMs offer no new capabilities. The content they generate is precisely the same (except it's worse) as the content a qualified worker can give you in a couple of hours.

Another difference is that most content can wait the couple of hours it takes the skilled worker to create it, whereas the products you can deliver via train may spoil if carried on foot (even if carried by a horse). A farmer can go back to tending the crops after having dropped the cargo at the station, but will be absent for a couple of days if they need to carry it on foot, etc. None of this applies to generated content.

> Did workers organize against vacuum cleaners?

Workers have already organized (and won) against generative AI. https://en.wikipedia.org/wiki/2023_Writers_Guild_of_America_...

> Dock workers organised against container shipping. They mostly succeeded in old established ports being sidelined in favour of newer, less annoying ports.

I think you are talking about the 1971 ILWU strike. https://www.ilwu.org/history/the-ilwu-story/

But this is not true. Dock workers didn't organize against the mechanization and automation of ports; they organized against mass layoffs and dangerous working conditions as ports got more automated. Port companies would use the automation as an excuse to engage in mass layoffs, leaving far too few workers tending far too much cargo over far too many hours. This resulted in fatigued workers making mistakes, which often resulted in serious injuries and even deaths. The 2022 US railroad strike was for precisely the same reason.


> Another difference is that most content can wait the couple of hours it takes the skilled worker to create it, [...]

I wouldn't just willy nilly turn my daughter's drawings into cartoons, if I had to bother a trained professional about it.

A few hours of a qualified worker's time costs a couple hundred bucks at minimum. And it takes at least a couple of hours to turn the task around.

Your argument seems a bit like web search being useless, because we have highly trained librarians.

Similar for electronic computers vs human computers.

> I think you are talking about the 1971 ILWU strike. https://www.ilwu.org/history/the-ilwu-story/

No, not really. I have a more global view in mind, e.g. Felixstowe vs London.

And, yes, you do mechanisation so that you can save on labour. Mass layoffs are just one expression of this (when you don't have enough natural attrition from people quitting).

You seem very keen on the American labour movements? There's another interesting thing to learn from history here: industry will move elsewhere, when labour movements get too annoying. Both to other parts of the country, and to other parts of the world.


My understanding is that inference costs are very high too, especially with the new "reasoning" models.


Most models can run inference on merely borderline-consumer hardware.

Even for the fancy models, where you need to buy compute (rails) that costs about the price of a new car, the hardware has a power draw of ~700W[0] while running inference at 50 tokens/second[1].

But!

The constraint with current hardware isn't compute; the models are mostly constrained by RAM bandwidth. A back-of-the-envelope estimate says that e.g. if Apple took the compute already in their iPhones and reengineered the chips to have 256 GB of RAM and sufficient bandwidth to not be constrained by it, models that size could run locally for a few minutes before hitting thermal limits (because it's a phone), but we're still only talking one-or-two-digit watts.

[0] https://resources.nvidia.com/en-us-gpu-resources/hpc-datashe...

[1] Testing of Mistral Large, a 123-billion parameter model, on a cluster of 8xH200 getting just over 400 tokens/second, so per 700W device one gets 400/8=50 tokens/second: https://www.baseten.co/blog/evaluating-nvidia-h200-gpus-for-...
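To make the bandwidth point concrete, here is a back-of-the-envelope sketch in Python; all numbers are assumed round figures (a 123-billion-parameter model in 16-bit weights, roughly H200-class memory bandwidth), not measurements:

    # Every generated token has to stream all the weights through the chip once,
    # so memory bandwidth puts a hard ceiling on single-stream decode speed.
    params = 123e9              # e.g. a Mistral-Large-sized model
    bytes_per_param = 2         # fp16/bf16 weights (quantisation lowers this)
    model_bytes = params * bytes_per_param

    mem_bandwidth = 4.8e12      # bytes/s, roughly an H200-class accelerator
    print(f"upper bound: {mem_bandwidth / model_bytes:.0f} tokens/s per device")  # ~20

The point is only the order of magnitude: the ceiling scales with memory bandwidth, not with FLOPs, which is why quantisation (fewer bytes per parameter) and batching move the observed numbers so much.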


> e.g. if Apple took the compute already in their iPhones and reengineered the chips to have 256 GB of RAM and sufficient bandwidth to not be constrained by it, models that size could run locally for a few minutes before hitting thermal limits (because it's a phone), but we're still only talking one-or-two-digit watts.

That hardware cost Apple tens of billions to develop, and what you're talking about in terms of "just the hardware needed" is so far beyond consumer hardware it's funny. Fairly sure most Windows laptops are still sold with 8GB RAM and basically 512MB of VRAM (probably less), practically the same thing for Android phones.

I was thinking of building a local LLM powered search engine but basically nobody outside of a handful of techies would be able to run it + their regular software.


> That hardware cost Apple tens of billions to develop

Despite which, they sell them as consumer devices.

> and what you're talking about in terms of "just the hardware needed" is so far beyond consumer hardware it's funny.

Not as big a gap as you might expect. M4 chip (as used in iPads) has "28 billion transistors built using a second-generation 3-nanometer technology" - https://www.apple.com/newsroom/2024/05/apple-introduces-m4-c...

Apple don't sell M4 chips separately, but the general best-guess I've seen seems to be they're in the $120 range as a cost to Apple. Certainly it can't exceed the list price of the cheapest Mac mini with one (US$599).

As bleeding-edge tech, those are expensive transistors, but still 10 of them would have enough transistors for 256 GB of RAM plus all the compute each chip already has. Actual RAM is much cheaper than that.

10x the price of the cheapest Mac Mini is $6k… but you could then save $400 by getting a Mac Studio with 256 GB RAM. The max power consumption (of that desktop computer but with double that, 512 GB RAM) is 270 W, representing an absolute upper bound: if you're doing inference you're probably using a fraction of the compute, because inference is RAM limited not compute limited.

This is also very close to the same price as this phone, which I think is a silly phone, but it's a phone and it exists and it's this price and that's all that matters: https://www.amazon.com/VERTU-IRONFLIP-Unlocked-Smartphone-Fo...

But regardless, I'd like to emphasise that these chips aren't even trying to be good at LLMs. Not even Apple's Neural Engine is really trying to do that; NPUs (like the Neural Engine) are all focused on what AI looked like it was going to be several years back, not what current models are actually like today. (And given how fast this moves, it's not even clear to me that they were wrong or that they should be optimised for what current models look like today.)

> Fairly sure most Windows laptops are still sold with 8GB RAM and basically 512MB of VRAM (probably less), practically the same thing for Android phones.

That sounds exceptionally low even for budget laptops. Only examples I can find are the sub-€300 budget range and refurbished devices.

For phones, there is currently very little market for this; the limit is not that it's an inconceivable challenge. Same deal as thermal imaging cameras in this regard.

> I was thinking of building a local LLM powered search engine but basically nobody outside of a handful of techies would be able to run it + their regular software.

This has been a standard database tool for a while already. Vector databases, RAG, etc.
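The retrieval half of such a tool really is small these days. A minimal sketch in Python, assuming the sentence-transformers package and a small open embedding model (the model name and documents are illustrative, not a recommendation):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")        # small embedding model, runs on CPU

    docs = ["how to enter sick days", "expense report workflow", "retirement plan FAQ"]
    doc_vecs = model.encode(docs, normalize_embeddings=True)

    query = model.encode(["where do I log my sick leave?"], normalize_embeddings=True)
    scores = (doc_vecs @ query.T).ravel()                  # cosine similarity, vectors are unit length
    best = int(np.argmax(scores))
    print(docs[best], float(scores[best]))
    # a local LLM would then answer the question using docs[best] as context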


> This has been a standard database tool for a while already. Vector databases, RAG, etc.

Oh, please show me the consumer version of this. I'll wait. I want to point and click.

Similar story for the consumer devices with cheap unified 256GB of RAM.


Look at computer systems that cost $2,000 or less: they are useless at running, for example, LLM coding assistants locally. A minimal subscription to a cloud service unfortunately beats them, and even more expensive systems that can run larger models run them too slowly to be productive. Yes, you can chat with them and perform tasks slowly on low-cost hardware, but that is all. If you put local LLMs in your IDE, they slow you down or just don't work.


My understanding of train lines in America is that lots of them went to ruin and the extant network is only “just good enough” for freight. Nobody talks about Amtrak or the Southern Belle or anything any more.

Air travel taking over is of course the main reason for all of this, but the costs sunk into the rails are lost, or their ROI curtailed, by market forces and obsolescence.


Amtrak was founded in 1971. That's about a century removed from the times I'm talking about. Not particularly relevant.


Completely relevant. It’s all that remains of the train tracks today. Grinding out the last drops from those sunk costs, attracting minimal investment to keep it minimally viable.


Grinding out returns from a sunk cost of a century-old investment is pretty impressive all by itself.

Very few people want to invest more: the private sector doesn't want to because they'll never see the return, the governments don't want to because the returns are spread over their great-great-grandchildren's lives and that doesn't get them re-elected in the next n<=5 (because this isn't just a USA problem) years.

Even the German government dragged its feet over rail investment, but they're finally embarrassed enough by the network problems to invest in all the things.


Thanks, yes, the train tracks analogy does wither somewhat when you consider the significant maintenance costs.


That's simply because capitalists really don't like investments with a 50 year horizon without guarantees. So the infrastructure that needs to be maintained is not.


A valid analogy only if the future training method is the same as today's.


The current training method is the same as 30 years ago; it's the GPUs that changed and made it give practical results. So we're not really that innovative with all this...


Wait, why are these companies losing money on every query if inference is cheap?


Because they are charging even less?


Sounds like a money-making strategy. Also, given how expensive all this shit is, what if inference costs _more_? That's not cheap to me.

But again, the original argument was that they can run forever because inference is cheap; it's not cheap enough if you're losing money on it.


Even if the current subsidy is 50%, GPT would be cheap for many applications at twice the price. Price will determine adoption, but it wouldn't prevent me from having a personal assistant (and I'm not a 1%er, so that's a big change).


What are you talking about? There's zero impact from these things so far.


You are right that outside of the massive capex spending on training models, we don't see that much of an economic impact, yet. However, it's very far from zero:

Remember these outsourcing firms that essentially only offer warm bodies that speak English? They are certainly already feeling the impact. (And we see that in labour market statistics for eg the Philippines, where this is/was a big business.)

And this is just one example. You could ask your favourite LLM about a rundown of the major impacts we can already see.


But those warm bodies that speak English offer a service by being warm and able to be sort of attuned to the distress you feel. A frigging robot solving your unsolvable problem? You can try, but witness the backlash.


We are mixing up two meanings of the word 'warm' here.

There's no emotional warmth involved in manning a call centre and explicitly being confined to a script and having no power to make your own decisions to help the customer.

'Warm body' is just a term that has nothing to do with emotional warmth. I might just as well have called them 'body shops', even though it's of no consequence that the people involved have actual bodies.

> A frigging robot solving your unsolvable problem? You can try, but witness the backlash.

Front line call centre workers aren't solving your unsolvable problems, either. Just the opposite.

And why are you talking in the hypothetical? The impact on call centres etc is already visible in the statistics.


But running inference isn’t cheap

And with trains, people paid for a ticket and got a hard good: "travel".

AI so far gives you what?


Running inference is fairly cheap compared to training.


A rocket trip to the moon is fairly cheap compared to a rocket trip to Mars.


And the view from the moon is pretty stunning. That from Mars… not so much!


I've seen this take a lot, but I don't know why because it's extremely divorced from reality.

Demand for AI is insanely high. They can't make chips fast enough to meet customer demand. The energy industry is transforming to try to meet the demand.

Whoever is telling you that consumers are rejecting it is lying to you, and you should honestly probably reevaluate where you get your information, because it's not serving you well.


> Demand for AI is insanely high. They can't make chips fast enough to meet customer demand.

Woah there cowboy, slow down a little.

Demand for chips comes from the inference providers. Inference was (and still is) being sold below cost. OpenAI, for example, has a spend rate of $5b per month on revenues of $0.5b per month.

They are literally selling a dollar for actual 10c. Of course "demand" is going to be high.


> Demand for chips comes from the inference providers. Inference was (and still is) being sold below cost. OpenAI, for example, has a spend rate of $5b per month on revenues of $0.5b per month.

This is definitely wrong, last year it was $725m/month expenses and $300m/month revenue. Looks like the nearly-2:1 ratio is also expected for this year: https://taptwicedigital.com/stats/openai

This also includes the cost of training new models, so I'm still not at all sure if inference is sold at-cost or not.


> This is definitely wrong, last year it was $725m/month expenses and $300m/month revenue.

It looks like you're using "expenses" to mean "opex". I said "spend rate", because they're spending that money (i.e. the sum of both opex and capex). The reason I include the capex is because their projections towards profitability, as stated by them many times, is based on getting the compute online. They don't claim any sort of profitability without that capex (and even with that capex, it's a little bit iffy)

This includes the Stargate project (they're committed to $10b - $20b (reports vary) before the end of 2025), and they've paid roughly $10b to Microsoft for compute for 2025. Oracle is (or already has) committed $40b in GPUs for Stargate, and Softbank has commitments to Stargate independently of OpenAI.

> Looks like the nearly-2:1 ratio is also expected for this year: https://taptwicedigital.com/stats/openai

I find it hard to trust these numbers[1]: The $40b funding was not in cash right now, and depends on Softbank for $30b, with Softbank syndicating the remaining $10b. Softbank itself doesn't have $30b in cash and has to take a loan to reach that amount. Softbank did provide $7.5b in cash, with milestones for the remainder. That was in May 2025. By August that money had run out and OpenAI did another raise of $8.3b.

In short, in the last two to three months, OpenAI spent $5b/month on revenues of $0.5b/m. They are also depending on Softbank coming through with the rest of the $40b before end of 2025 ($30b in cash and $10b by syndicating other investors into it) because their commitments require that extra cash.

Come Jan-2026, OpenAI would have received, and spent most of, $60b for 2025, with a projected revenue $12b-$13b.

---------------------------------

[1] Now, true, we are all going off rumours here (as this is not a public company, we don't have any visibility into the actual numbers), but some numbers match up with what public info there is and some don't.


> It looks like you're using "expenses" to mean "opex"

I took their losses and added them to their revenue. That sum should equal expenses.

> The $40b funding was not in cash right now,

Does this matter? I'm not counting it as revenue.

> In short, in the last two to three months, OpenAI spent $5b/month on revenues of $0.5b/m.

You're repeating the same claim as before, I've not seen any evidence to support your numbers.

The evidence I linked you to suggests the 2025 average will be double that revenue, $1bn/month, at expenses of ($12bn revenue + $9bn loss) / 12 months = $21bn / 12 months = $1.75bn/month.


>> The $40b funding was not in cash right now,

> Does this matter? I'm not counting it as revenue.

Well, yes, because they forecast spending all of it by end of 2025, and they moved up their last round ($8.3b) by a month or two because they needed the money.

My point was, they received a cash injection of $10b (first part of the $40b raise) and that lasted only two months.

>> In short, in the last two to three months, OpenAI spent $5b/month on revenues of $0.5b/m.

> You're repeating the same claim as before, I've not seen any evidence to support your numbers.

Briefly, we don't really have visibility into their numbers. What we do have visibility into is how much cash they needed between two points (Specifically, the months of June and July). We also know what their spending commitment is (to their capex suppliers) for 2025. That's what I'm using.

They had $10b injected at the start of June. They needed $8.3b at the end of July.


It's crazy how many people are completely confident in their "knowledge" of the margins these products have despite the companies providing them not announcing those details!

(To be clear, I'm not criticising the person I'm replying to.)


Mm, quite.

I tend to rough-estimate it based on known compute/electricity costs for open weights models etc., but what evidence I do have is loose enough that I'm willing to believe a factor of 2 per standard deviation of probability in either direction at the moment, so long as someone comes with receipts.

Subscription revenue and corresponding service provision are also a big question, because those will almost always be either under- or over-used, never precisely balanced.


I think the above post has a fair point. Demand for chatbot customer service in various forms is surely "insanely high" - but demand from whom? Because I don't recall any end-user ever asking for it.

No, instead it'll be the new calculator that you can use to lazy-draft an email on your 1.5-hour Ryanair economy flight to the South. Both unthinkable luxuries just decades ago, but neither of which has transformed humanity profoundly.


This is just the same argument. If you believe demand for AI is low then you should be able to verify that with market data.

Currently market data is showing a very high demand for AI.

These arguments come down to "thumbs down to AI". If people just said that it would at least be an honest argument. But pretending that consumers don't want LLMs when they're some of the most popular apps in the history of mankind is not a defensible position


I'm not sure this works in reverse. If demand is indeed high, you could show that with market data. But if you have market data, e.g. showing high valuations of AI companies, or x many requests over some period, that doesn't necessarily mean demand is high. In other words, market data is necessary but not sufficient to prove your claim.

Reasons for market data seemingly showing high demand without there actually being any include: market manipulation (including marketing campaigns), artificial or inflated demand, forced usage, hype, etc. As an example, NFTs, Bitcoin, and supersonic jet travel all had "insane market data" which seemed at the time to show that there was a huge demand for these things.

My prediction is that we are in the early Concorde era of supersonic jet travel, and Boeing is racing to catch up to the promise of this technology. Except that in an unregulated market such as the current tech market, we have forgone all the safety and security measures, and the Concorde made its first passenger flight in 1969 (as opposed to 1976), with tons of fanfare and all flights fully booked months in advance.

Note that in the 1960s, market forecasts put demand for the Concorde at 350 airplanes by 1980, and at the time the first prototypes were flying they had 74 options. Only 20 were ever built for passenger flight.


As an end user I have never asked for a chatbot. And if I'm calling support, I have a weird issue that I probably need a human being to resolve.

But! We here are not necessarily typical callers. How many IT calls from the general population can be served efficiently (for both parties) with a quality chatbot?

And lest we think I'm being elitist - let's take an area I am not proficient in - such as HR, where I am "general population".

Our internal corporate chatbot has turned from "an atrocious insult to man and God" 7 years ago into something far more efficient than a friendly but underpaid and inexperienced human being three countries away answering my incessant questions: what holidays do I have again, how many sick days do I have and how do I enter them, how do I process retirement, how do I enter my expenses, what's the difference between short- and long-term disability, etc. And it has a button to "start a complex HR case / engage a human being" for edge cases, so internally it works very well.

This is narrow anecdata about the notion of a service-support chatbot; don't infere (hah) any further claims about the morality, economy, or future of LLMs.


People shame AI publicly and lean on it heavily in private.


I mean, it's both.

Chatgpt, claude, gemini in chatbot or coding agent form? Great stuff, saves me some googling.

The same AI popping up in an e-mail, chat or spreadsheet tool? No thanks; normal people don't need an AI summary of a 200-word e-mail or Slack thread. And if I've paid a guy a month's salary to write a report on something, of course I'll find 30 minutes to read it cover to cover.


A future where anything has to be paid (but it's crypto) doesn't sound futuristic to me at all.


LLMs are already extremely useful today


Any sort of argument ?


Personal experience: I use them.


I also have the intuition that something like this is the most likely outcome.


> it may be very difficult for us as users to discern which model is better

But one thing will stay consistent with LLMs for some time to come: they are programmed to produce output that looks acceptable, but they all unintentionally tend toward deception. You can iterate on that over and over, but there will always be some point where it will fail, and the weight of that failure will only increase as it deceives better.

Some things that seemed safe enough: Hindenburg, Titanic, Deepwater Horizon, Chernobyl, Challenger, Fukushima, Boeing 737 MAX.


Don’t malign the beautiful Zeppelins :(

Titanic - people have been boating for two thousand years, and it was run into an iceberg in a place where icebergs were known to be, killing >1500 people.

The Hindenburg was an aircraft design of the 1920s, very early in flying history; it was one of the most famous air disasters and biggest fireballs, and still most people survived(!), with 36 killed. Decades later people were still suggesting sabotage was the cause. It's not a fair comparison, an early aircraft against a late boat.

Its predecessor the Graf Zeppelin[1] was one of the best flying vehicles of its era by safety and miles traveled, look at its achievements compared to aeroplanes of that time period. Nothing at the time could do that and was any other aircraft that safe?

If airships had the eighty more years that aeroplanes have put into safety, my guess is that a gondola with hydrogen lift bags dozens of meters above it could be - would be - as safe as a jumbo jet with 60,000 gallons of jet fuel in the wings. Hindenburg killed 36 people 80 years ago, aeroplane crashes have killed 500+ people as recently as 2014.

Wasn’t Challenger known to be unsafe? (Feynman inquiry?). And the 737 MAX was Boeing skirting safety regulations to save money.

[1] https://en.wikipedia.org/wiki/LZ_127_Graf_Zeppelin


> Wasn’t Challenger known to be unsafe? (Feynman inquiry?). And the 737 MAX was Boeing skirting safety regulations to save money.

The AI companies have convinced the US government that there should be no AI safety regulations: https://www.wired.com/story/plaintext-sam-altman-ai-regulati...


Guarantee we'll be saying this about a disaster caused by AI code:

> everyone knows you need to carefully review vibe coded output. This [safety-critical company] hiring zero developers isn't representative of software development as a profession.

> They also used old 32b models for cost reasons so it doesn't knock against AI-assisted development either.


I'm particularly salty about the Hindenburg and don't feel as strongly about Chernobyl, Fukushima, or Challenger, so if you're referring to those, that's different. The Hindenburg didn't use Hydrogen for cost reasons; it was designed to use more expensive Helium, and the US government refused to export Helium to Nazi-controlled Germany, so they redesigned it for Hydrogen. I'm not saying that it wasn't representative of air travel at the time. I'm saying air travel at the time was unsafe, and airships were well known to be involved in many crashes; the Hindenburg was not particularly less safe, it's just that aeroplanes were much smaller and carried fewer people, and the accidents were less spectacular, so they somehow got a pass. I'm saying air travel became safer, and so would Zeppelin travel have, by similar means: more careful processes, designs improved on learnings from previous problems, etc.

Look at the state of the world today: Airbus has a Hydrogen-powered commercial aircraft[1]. Toyota has Hydrogen-powered cars on the streets. People upload safety videos to YouTube of Hydrogen cars turning into four-meter flamethrowers as if that's reassuring[3]. There are many[2] Hydrogen refuelling gas stations in cities in California where ordinary people can plug high-pressure Hydrogen hoses into the side of their car and refuel it from a high-pressure Hydrogen tank on a street corner. That's not going to be safer when it's a 15-year-old car, a spaced-out owner, and a skeezy gas station which has been looking the other way on maintenance for a decade, where people regularly hear gunshots and do burnouts and crash into things. Analysts are talking about the "Hydrogen Economy" and a tripling of demand for Green Hydrogen in the next two decades. But lifting something with Hydrogen? Something the Graf Zeppelin LZ-127 demonstrated could be done safely with 1920s technology? No! That's too dangerous!

Number of cars on USA roads when the Hindenburg burnt? Around 25 million. Now? 285 million, killing 40,000 people every year. A Hindenburg death toll two or three times a day, every day, on average. A 9/11 every couple of months. Nobody is as concerned as they are about airships because there isn't a massive fireball and a reporter saying "oh the humanity". 36 people died 80 years ago in an early air vehicle and it's "stop everything, this cannot be allowed to continue!" The comparisons are daft in so many ways. Say airships are too slow to be profitable, say they're too big and difficult to manoeuvre against the wind. But don't say they were believed to be perfectly safe and turned out to be too dangerous, and present that as a considered, reasonable position to hold.

Some of the sabotage accusations suggested it was a gunshot, but you know why that's not so plausible? Because you can fire machine guns into Hydrogen blimps and they don't blow up! "LZ-39, though hit several times [by fighter aeroplane gunfire], proceeded to her base despite one or more leaking cells, a few killed in the crew, and a propeller shot off. She was repaired in less than a week. Although damaged, her hydrogen was not set on fire and the "airtight subdivision" provided by the gas cells insured her flotation for the required period. The same was true of the machine gun. Until an explosive ammunition was put into service no airplane attacks on airships with gunfire had been successful."[4]. How many people who say Hydrogen airships are too dangerous realise they can even take machine gun fire into their gas bags and not burn and keep flying?

[1] https://www.airbus.com/en/innovation/energy-transition/hydro...

[2] https://afdc.energy.gov/fuels/hydrogen-locations#/find/neare...

[3] https://www.youtube.com/watch?v=OA8dNFiVaF0

[4] https://www.usni.org/magazines/proceedings/1936/september/vu...


> Decades later people were still suggesting sabotage was the cause.

Glad you mention it. Connecting back to AI: there are many possible future scenarios involving negative outcomes involving human sabotage of AI -- or using them to sabotage other systems.


Hindenburg indeed killed hydrogen blimps. For everything else on your list, the disaster was the exception rather than the rule. The space shuttle was the most lethal other item; there are lots of cruise ships, oil rigs, nuke plants, and jet planes that have not blown up.

So what analogy with AI are you trying to make? The straightforward one would be that there will be some toxic and dangerous LLMs (cough Grok cough), but that there will be many others that do their jobs as designed, and that LLMs in general will be a common technology going forward.


I have had Gemini running as a QA tester, and it faked very convincing test results by simulating what the results would have been. I only knew they were faked because that part of the code was not even implemented yet. I am sure we have all had similar experiences.


Which is a thing with humans as well: I had a colleague with a certified 150+ IQ, and other than moments of scary-smart insight, he was not a superman or anything; he was surprisingly ordinary. Not to bring him down, he was a great guy, but I'd argue many of his good qualities had nothing to do with how smart he was.


I'm in the same 150+ group. I really think it doesn't mean much on its own. While I am able to breeze through some things and sometimes find connections that elude other people, it's not that much different from other people doing the same on other occasions. I am still very much average in the large majority of everyday activities, held back by childhood experiences, resulting coping mechanisms, etc., like we all are.

Learning from experience (hopefully not always your own), working well with others, and being able to persevere when things are tough, demotivational or boring, trumps raw intelligence easily, IMO.


Why the hell do you people know your IQ? That test is a joke, there’s zero rigor to it. The reason it’s meaningless is exactly that, it’s meaningless and you wasted your time.

Why one would continue to know or talk about the number is a pretty strong indicator of the previous statement.


You're using words like "zero" and "meaningless" in a haphazard way that's obviously wrong if taken literally: there's a non-zero amount of rigour in IQ research, and we know that it correlates (very loosely) with everything from income to marriage rate so it's clearly not meaningless either.

What actual fact are you trying to state, here?


The specifics of an IQ test aren't super meaningful by themselves (that is, a 150 vs a 142 or 157 is not necessarily meaningful), but evaluations that correlate with IQ correlate to better performance.

Because of perceived illegal biases, these evaluations are no longer used in most cases, so we tend to use undergraduate education as a proxy. Places that are exempt from these considerations continue to make successful use of it.


> Places that are exempt from these considerations continue to make successful use of it.

How so? Solving more progressive matrices?


Hiring.


> correlate to better performance.

...on IQ tests.


This isn't the actual issue with them, the actual issue is "correlation is not causation". IQ is a normal distribution by definition, but there's no reason to believe the underlying structure is normal.

If some people in the test population got 0s because the test was in English and they didn't speak English, and then everyone else got random results, it'd still correlate with job performance if the job required you to speak English. Wouldn't mean much though.


> we tend to use undergraduate education as a proxy

Neither an IQ test nor your grades as an undergraduate correlate with performance in some other setting at some other time. Life is a crapshoot. Plenty of people in Mensa are struggling, and so are plenty who were at the top of their class.


Do you have data to back that up? Are you really trying to claim that there is no difference in outcomes from the average or below average graduate and summa cum laude?


Like they said, it depends, but grades alone are not the sole predictor:

https://www.insidehighered.com/news/student-success/life-aft...

Actual study:

https://psycnet.apa.org/doiLanding?doi=10.1037%2Fapl0001212


That is moving the goalposts. No one claimed it is the sole predictor. The claim was that there is no relation at all. Your own links say there is a predictive relationship. Of course other factors matter, and may even be more important, but with all else equal, grades are positively correlated.


It’s about trend. Not <Test Result>==Success. These evaluations try to put an objective number to what most of us can evaluate instinctively. They are not perfect or necessarily fair. Many, maybe most, job interviews are really a vibe assessment, so it’s an imperfect thing!

I don’t know my IQ, but I probably would score above average and have undiagnosed ADHD. I scored in the 95th percentile + on most standardized tests in school but tended to have meh grades. I’m great at what I do, but I would be an awful pilot or surgeon.

Growing up, you know a bunch of people. Some are dumb, some are brilliant, some disciplined, some impetuous.

Think back, and more of the smart ones tend to align with professions that require more brainpower. But you probably also know people who weren’t brilliant at math or academics, but they had focus and did really well.


For me it was just a coincidence of Mensa advertising their events at my high school and being pushed by a couple of friends to go through testing and join together.


I guess if you're an outlier you sometimes know; for example, the really brilliant kids are oftentimes found out early in childhood and tested. Is it always good for them? Probably not, but that's a different discussion.


You've never spent a couple of bucks on a "try your strength" machine?



> I'm in the same 150+ group. I really think it doesn't mean much on its own.

You're right, but the things you could do with it if you applied yourself are totally out of reach for me; for example, it's quite possible for you to become an AI researcher at one of the leading companies and make millions. I just don't have that kind of intellectual capacity. You could make it into med school and also make millions. I'm not saying all this matters that much, with all due respect to financial success, but I don't think we can pretend our society doesn't reward high IQs.


High IQ alone isn't a guarantor of success in demanding fields. Most studies I've read also show that IQs above 120 stop correlating with (more) success.

That high IQ needs to be paired with hard work.


The intellectual capacity is a factor for sure, but indeed there is more to life than that. Things like hard work, creativity, social skills, empathy, determination, ability to plan and execute are as much factors as high IQ.

I went to the equivalent of a Mensa meeting group a couple of times. The people there were much smarter than me, but they all had their problems, and many of them weren't that successful at all despite their obvious intelligence.


Really? You don't become a doctor by being smart?


Not particularly. There's a baseline intelligence required to become a (medical) doctor but no it's much more about grit and hard work among other factors [1]. Similarly for PhDs as well IMHO.

Searching around, IQs for doctors seem to average about 120, with the middle 80 percent being roughly 105-130. So there are plenty of doctors with IQs of 105, which is not that far above average.

That also means that it's prudent to be selective in your doctors if you have any serious medical issues.

1: https://www.cambridge.org/core/journals/cambridge-quarterly-...


> Searching around, IQs for doctors seem to average about 120, with 80 percent falling between roughly 105 and 130.

Where are you getting this from exactly? Getting into medical school in the U.S. is very difficult. Having an average IQ of 105 would make it borderline impossible - even if you cram for the SAT and other tests twice as hard as everyone else, there is only so much you can do; these tests measure speed and raw brainpower. In my country, the SAT-equivalent score you need to get in would put you above the top 2% - more like the top 1%-1.5% - because the population keeps growing while the number of working doctors stays roughly constant. So really, each high school class had only 2-3 kids who could get in. I know a few of these people - really brilliant kids, their IQs were probably above 130, and it was impossible for me to compete with them for admission - I am simply not exceptional, at least not that far up the distribution. I was maybe among the top 3-5 students in my class but never the best, so let's say top 10%; those kids were the best students in the whole school - that's the top 1%-2%.

One caveat to all this: sure, in some countries it is easier to get in. People from my country (usually from families who can afford it) go to places like Romania, Czechoslovakia, Italy, etc., where it is much, much easier to get into med school (but it costs quite a lot and also means leaving your home country for 7 years).

Now, is it necessary to have an IQ off the charts to be a good doctor? No, probably not, but that's not what I was arguing; that's just how admission works.


> Where are you getting this from exactly? Getting into medical school in the U.S. is very difficult. Having an average IQ of 105 would make it borderline impossible

I agree it'd be almost impossible, but apparently not impossible with an IQ of 105. Could be folks with ADHD whose composite IQ is brought down by a smaller working memory but whose long term associative memory is top notch. Could be older doctors from when admissions were easier. Could be plain old nepotism.

After all, the AMA keeps admissions artificially low in the US to increase salaries and prestige. It's a big part of the reason medical costs are so high in the US, in my opinion.

Reference I found here:

https://forum.facmedicine.com/threads/medical-doctors-ranked...

> Hauser, Robert M. 2002. "Meritocracy, cognitive ability, and the sources of occupational success." CDE Working Paper 98-07 (rev)


Modern WAIS-IV-type tests yield multiple factor scores: IQ is arguably non-scalar.


The original theory was precisely that there's a general factor ("g").

If you run anything sufficiently complex through a principal component analysis you'll get several orthogonal factors, decreasing in importance. The question then is whether the first factor dominates or not.

My understanding is that it does, with "g" explaining some 50% of the variance, and the various smaller "s" factors maybe 5% to 20% at most.
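
To make that variance split concrete, here is a toy sketch of my own (synthetic data with made-up loadings, not a real test battery): eight correlated "sub-test" scores are generated from one shared factor plus independent noise, and the explained-variance ratios of a PCA show the first component dominating.

    # Toy sketch (synthetic data, not a real IQ battery): one shared factor
    # drives eight correlated sub-test scores; PCA then shows how much of the
    # total variance the first component soaks up.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    n = 5000
    g = rng.normal(size=n)  # the shared "general" factor
    subtests = np.column_stack(
        [0.7 * g + 0.7 * rng.normal(size=n) for _ in range(8)]
    )
    ratios = PCA().fit(subtests).explained_variance_ratio_
    print(ratios.round(2))  # first component ~0.55, the rest ~0.06 each

With these invented loadings the first component lands near the ~50% figure above and the remaining components each explain only a few percent; real test batteries will of course differ.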


Those sub-scores, BTW, are very helpful in indicating or diagnosing learning disabilities. Folks with autism or ADHD can have very different strengths and weaknesses in intelligence.


I've always figured that tangible "intelligence" which leads to more effective decision making is just a better appreciation of one's own stupidity.


+1. Being exceptionally intelligent doesn't always catch unknown unknowns. (Sometimes, but not always)


That would be an extreme criterion for exceptional intelligence, akin to asking for there to be no unknowns.


Perhaps the argument is simply that "exceptional intelligence" is just being better at accepting how little you know and better at dealing with uncertainty - both respecting it and attempting to mitigate it. I find that some of the smartest people I know are careful about expressing certainty.


It's an observation that being smarter in the things you do know isn't everything.


He may have dealt with all kinds of weaknesses that AI won't have to deal with, such as lack of self-confidence, inability to concentrate for long, lack of ambition, boredom, other pursuits, etc. But what if we can wrap a super-strong AGI model in a while loop that works on all of our problems relentlessly, without getting bored, without losing confidence? Make that one billion super-strong AGI models.


With at least a few people it's probably you who is much smarter than them. Do you ever find yourself playing dumb with them, for instance when they're chewing through some chain of thought you could complete for them in an instant? Do you ever not chime in on something inconsequential?

After all, you just might seem like an insufferable smartass to someone you probably want to be liked by. Why hurt interpersonal relationships for little gain?

If your colleague is really that bright, I wouldn't be surprised if they're simply careful about how much and when they show it to us common folk.


Nah, in my experience 90% of what (middle-aged) super-duper genius people talk about is just regular people stuff - kids, vacations, house renovation, office gossip etc.

I don't think they are faking it.


Nope. Looking down on someone for being dumber than you makes you, quite frankly, an insufferable smartass.


There's a difference between "looking down on someone for being dumber than you" and "feeling sorry that someone is unable to understand as easily as you".


It's even more difficult because, while all the benchmarks provide some kind of 'averaged' performance metric for comparison, in my experience most users have pretty specific regular use cases and pretty specific personal background knowledge. For instance, I have a background in ML, 15 years of experience in full-stack programming, and primarily use LLMs for generating interface prototypes for new product concepts. We use a lot of React and Chakra UI for that, and I consistently get the best results out of Gemini Pro. I tried all the available options and settled on it as the best for me and my use case. It's not the best for marketing boilerplate, or probably a million other use cases, but for me, in this particular niche, it's clearly the best. Beyond that, the benchmarks are irrelevant.


We could first run some tests to find out whether reliable comparative performance tests can be conjured at all:

One can intentionally compare a recent model against a much older one to figure out whether a given test is reliable, and in which domains.

One can compute a model's joint probability for a sequence and compare how likely each model finds the same text (a rough sketch of this appears below, after these ideas).

We could ask both models to start talking about a subject, but alternately, each emitting one token at a time. Then look at how the dumber and the smarter model judge the resulting sentence: does the smart one tend to pull the quality of the text up, or does it get dragged down towards the dumber participant?

Given enough such tests to "identify the dummy vs. the smart one", and verifying them on cases of common agreement (as an extreme example, word2vec vs. a transformer), we can assess the quality of a test regardless of domain.

On the assumption that such or similar tests let us pick out the smarter model, i.e. assuming we find plenty of them, we could demand that model makers publish open weights so that their performance claims can be publicly verified.
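
A rough sketch of the joint-probability comparison (my own illustration, not the commenter's code; the model names are arbitrary stand-ins for an older and a larger open-weights model, using the Hugging Face transformers API):

    # Compare how likely two causal LMs find the same text by summing
    # per-token log-probabilities under each model.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def sequence_logprob(model_name: str, text: str) -> float:
        tok = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name).eval()
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits            # [1, seq_len, vocab]
        # log p(t_i | t_1 .. t_{i-1}) for every position after the first
        logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
        token_lp = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
        return token_lp.sum().item()

    text = "The quick brown fox jumps over the lazy dog."
    for name in ["gpt2", "gpt2-medium"]:  # placeholder model choices
        print(name, sequence_logprob(name, text))

One caveat: models with different tokenizers split the same text into different numbers of tokens, so the raw sums aren't directly comparable; normalizing per character (or per byte) would be fairer.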

Another idea is self-consistency tests. A single forward pass over a context of, say, 2048 tokens (just an example) effectively predicts the conditional 2-gram, 3-gram, 4-gram, ... probabilities of the input: each output token distribution is conditioned on the preceding inputs. With 2048 input tokens there are 2048 output distributions; the position-1 output is the predicted token vector (logit vector, really) estimated to follow the position-1 input, the position-2 output is the prediction following the first two inputs, and so on, with the last output being the predicted next token following all 2048 inputs: p(t_(i+1) | t_1 = a, t_2 = b, ..., t_i = z).

But that is just one way the network can predict the next token. Another approach would be reverse-mode AD gradient descent with the model weights kept fixed, treating only the last, say, 512 input vectors as variables: how well do those 512 forward-pass output vectors match the output vectors obtained by directly optimizing the joint probability?

This could also be added as a loss term during training, as a form of regularization, which roughly turns the model into a kind of energy-based model.
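
A minimal sketch of that second idea as I read it (weights frozen, a trailing slice of the input embeddings treated as the variables of a gradient-descent problem); the model choice, the slice size k, the step count, and the learning rate are all arbitrary:

    # Freeze the model, treat the input embeddings of a trailing slice as
    # trainable, and minimize the sequence NLL; the plain forward-pass
    # predictions can then be compared against these "optimized" inputs.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
    for p in model.parameters():
        p.requires_grad_(False)                  # weights stay fixed

    ids = tok("A short example sentence to analyse for consistency.",
              return_tensors="pt").input_ids
    all_embeds = model.get_input_embeddings()(ids).detach()
    k = 4                                        # trailing positions treated as free variables
    fixed, free = all_embeds[:, :-k], all_embeds[:, -k:].clone()
    free.requires_grad_(True)
    opt = torch.optim.Adam([free], lr=1e-2)

    for _ in range(100):                         # maximize the sequence likelihood
        embeds = torch.cat([fixed, free], dim=1)
        logits = model(inputs_embeds=embeds).logits
        logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
        nll = -logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).mean()
        opt.zero_grad(); nll.backward(); opt.step()

    # the gap between the plain forward-pass NLL and this optimized NLL is the
    # kind of self-consistency signal described above
    print("optimized NLL:", nll.item())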


Let's call this branch of research "unsupervised testing".


My guess is that, more than the raw capabilities of a model, users would be drawn to the model's personality. A "better" model would then be one that can closely adopt the nuances a user likes. This is a largely uninformed guess; let's see if it holds up over time.


> It's also worth considering that past some threshold, it may be very difficult for us as users to discern which model is better.

Even if they've saturated the distinguishable quality for tasks they can both do, I'd expect a gap in what tasks they're able to do.


This is the F1 vs. 911 problem. A 911 is just as fast as an F1 car to 60 (sometimes even faster), but the F1 is better in the extreme performance envelope: above 150, in tight turns.

An average driver evaluating both would have a very hard time finding the F1's superior utility.


But he would find both cars lacking when doing regular car things (the F1 more so than the 911).


Fine, whatever, replace it with a Tesla. Jesus, pedantic enough?


Unless one of them forgets to have a steering wheel, or shifts into reverse when put in neutral. LLMs still make major mistakes; comparing them to sports cars is a bit much.


This take is extremely ridiculous.


> For example, if you are an ELO 1000 chess player would you yourself be able to tell if Magnus Carlson or another grandmaster were better by playing them individually?

Yes, because I'd get them to play each other?


He specifically said play them individually.


I know. "You can't assess which chatbot's more intelligent if I exclude the most obvious method of assessment" isn't a fair test.


I guess the analogy is flawed, because it's not a competition where we can pit the chatbots against each other directly.


We’re judging them with benchmarks, not our own intuitions.


I think Musk puts it well when he says the ultimate test is whether they can help improve the real world.


I could certainly tell if they played ??-level blunders, which LLMs do all the time.


You don't have to be even good at chess to be able to tell when a game is won or lost, most of the time.

I don't need to understand how the AI made the app I asked for or cured my cancer, but it'll be pretty obvious when the app seems to work and the cancer seems to be gone.

I mean, I want to understand how, but I don't need to understand how, in order to benefit from it. Obviously understanding the details would help me evaluate the quality of the solution, but that's an afterthought.


That's a great point. Thanks.



