(1) is absolutely not true if you actually use these models on a regular basis and include Google in here too. The difference in reliability beyond basic tasks is night and day. Their reward function is just so much better, and there are many nuanced reasons for this.
(2) is probably true but with caveats. Top-tier models will never run on desktop machines, but companies should (and do) host their own models. The future is open-weight though, that much is for sure.
(3) This is so ignorant that others have already responded to it. Look outside of your own bubble, please.
If we get to the point where a local model can reliably do the coding for a good majority of cases, then the economic landscape changes significantly. And we are not that far from having big open-weight models that can do that, which is a first step.
Larger, yes, absolutely. Better? Right now it seems that bigger is better, but if we are thinking about long term future, it's not obvious that there isn't a point of diminishing returns with regards to size. I can also imagine a breakthrough, where models become much smaller, with the same or better capabilities as the current, very large ones.
You are always going to get the same scaling laws in model size regardless of what else you do, so the same degree of improvement seen now relative to the smaller models will be achievable in the future. Yes, small models may be on par with previous generation large models, but the same is true for processors and you don't see supercomputers going away. It's the same principle.
This feels like nitpicking. Idiocracy was never supposed to be accurate; the important part is how it seems less ridiculous as time goes on.
The primary point of Idiocracy was to imagine a world where people were acting in increasingly stupid ways over time. The source of this is irrelevant. In reality, it turned out that the source of the stupidity was an increasingly poor education system, increasing inequality, and the carefully designed injection of addictive technologies and medicines into the general populace.
Where Idiocracy really failed in its predictions was in the development of AI, which increasingly appears to substitute for a lack of common understanding.
Also all of this only really holds for the US and maybe the UK.
> The primary point of Idiocracy was to imagine a world where people were acting in increasingly stupid ways over time.
...and in doing so, it depicts a world that is not at all reminiscent of the one in which we live.
The White House is not occupied by idiots, but by thieves and murderers and sexual predators. The American landscape is not a Brawndo dust bowl, but a highly profitable, productive, and delicious-but-toxic bounty of subsidized factory farms, stemming not from a misunderstanding of botany, but a misapplication of that understanding.
The same is true of the medical industry, the justice system - literally every institution portrayed in the entire film, with the possible exception of waste disposal / the trash avalanche.
It isn't really the fact that the president is a movie actor that's damning there. It's the fact that the president uses wrestling catch phrases and behavior (Trump was even in WWE). Back when Idiocracy came out, the very idea of President Camacho seemed absurd. Nobody would bat an eye at that anymore.
Yes, the movie was a satire and took the current observations to their logical extreme. The point is that we're pretty darn close to the extreme right now.
As a professional mathematician, I would say that a good proof requires a very good representation of the problem, and then pulling out the tricks. The latter part is easy to get working with LLMs; they can do it already. It's the former part that still needs humans, and I'm perfectly fine with that.
But are you ok with the trendline of AI improvement? The speed of improvement suggests humans will only get further and further removed from the loop.
I see posts like yours all the time, comforting themselves that humans still matter, and every time, people like you are describing a human owning an ever-shrinking section of the problem space.
It used to be the case that the labs were prioritising replacing human creativity, e.g. generative art, video, writing. However, they are coming to realise that just isn't a profitable approach. The most profitable goal is actually the most human-oriented one: the AI becomes an extraordinarily powerful tool that may be able to one-shot particular tasks. But the design of the task itself is still very human, and there is no incentive to replace that part. Researchers talk a bit less about AGI now because it's a pointless goal. Alignment is more lucrative.
Basically, executives want to replace workers, not themselves.
On the contrary, the depth and breadth of what we can now handle agentically in software is growing very rapidly, to the point where in the last 3 months the industry has undergone a big transformation and our job functions are fundamentally starting to change. As a software engineer I feel increasingly like AGI will be a real thing within the next few years, and it's going to affect everyone.
If you look at those operating at the bleeding edge, it doesn't look anything like yesteryear. It's a real step change. Fully autonomous agentic software engineering is becoming a reality. While still in its infancy, some results are starting to be made public, and it's mind-boggling. We're transitioning to a full agent-only workflow in my team at work. The engineering task has shifted from writing code to harness engineering: essentially building a system that can safely build itself to a high quality given business requirements.
Up until recently I kinda felt like the scepticism was warranted, but after building my own harness that can autonomously produce decent-quality software (at least at toy-problem scale, granted), and getting hands-on with autoresearch via writing a set of skills for it (https://github.com/james-s-tayler/lazy-developer), I feel fundamentally different about software engineering than I did before.
If you look at the step change from Sonnet 4.5 to Opus 4.5 and what that unlocked, and consider that the rumoured Mythos model is apparently not just an incremental improvement but another step change, then pair it with infrastructure for operating agents at scale like https://github.com/paperclipai/paperclip and SOTA harnesses like the ones being written about on the blogs of the frontier labs... I mean... you tell me what you think is coming down the pipe?
Humans, asking new questions out of curiosity, push the boundaries further, find new directions, ways, or motivations to explore, and maybe invent new spaces to explore. LLMs are just tools that people use. When people are no longer needed, AI serves no purpose at all.
People can use other people as tools. An LLM being a tool does not preclude it from replacing people.
Ultimately it’s a volume problem. You need at least one person to initialize the LLM. But after that, in theory, a future LLM can replace all people with the exception of the person who initializes the LLM.
This argument, that LLMs can develop new crazy strategies using RLVR on math problems (like what happened with Chess), turns out to be false without a serious paradigm shift. Essentially, the search space is far too large, and the model will need help to explore better, probably with human feedback.
Yes but "the search space is too large" is something that has been said about innumerable AI-problems that were then solved. So it's not unreasonable that one doubts the merit of the statement when it's said for the umpteenth time.
I should have been more specific then. The problem isn't that the search space is too large to explore. The problem is that the search space is so large that the training procedure actively prefers to restrict the search space to maximise short term rewards, regardless of hyperparameter selection. There is a tradeoff here that could be ignored in the case of chess, but not for general math problems.
This is far from unsolvable. It just means that the "apply RL like AlphaGo" attitude is laughably naive. We need at least one more trick.
I agree that LLMs are a bad fit for mathematical reasoning, but it's very hard for me to buy that humans are a better fit than a computational approach. Search will always beat our intuition.
Yes and no. I think we have vastly underestimated the extent of the search space for math problems. I also think we underestimate the degree to which our worldview influences the directions with which we attempt proofs. Problems are derived from constructions that we can relate to, often physically. Consequently, the technique in the solution often involves a construction that is similarly physical in its form. I think measure theory is a prime example of this, and it effectively unlocked solutions to a lot of long-standing statistical problems.
That linked article says it's about RLVR but then goes on to conflate other RL with it, and doesn't address much of the core thinking in the paper it was partially responding to, which had been published a month earlier [0] and laid out findings and theory reasonably well, including work that runs counter to the main criticism in the article you cited, i.e., performance at or above base models only being observed with low K examples.
That said, reachability and novel strategies are somewhat overlapping areas of consideration, and I don't see many ways in which RL in general, as mainly practiced, improves upon models' reachability. And even when it isn't clipping weights it's just too much of a black box approach.
But none of this takes away from the question of raw model capability on novel strategies; it only bears on that capability with respect to RL.
I have no idea how you come to this conclusion, when the evidence on the ground for those training models suggests it is precisely the opposite.
We are much further along the path of writing code than writing new maths, since the latter often requires some degree of representational fluency of the world we live in to be relevant. For example, proving something about braid groups can require representation by grid diagrams, and we know from ARC-AGI that LLMs don't do great with this.
Programming does not have this issue to the same extent; arguably, it involves the subset of maths that is exclusively problem solving using standard representations. The issues with programming are primarily on the difficulty with handling large volumes of text reliably.
Grid diagrams can (hopefully) be specified through algebraic equations.
The way that most math is currently done is that someone provides an extremely specified problem and then one has to answer that extremely specified problem.
The way that programming is currently done is through constructing abstractions and trying to create a specification of the problem.
Of course, I'm not saying we're close to creating a silicon Grothendieck (I think Bourbaki actually reads like a codebase), but I'm saying that we're much closer to constructing algorithms that can solve specified problems than to specifying underspecified problems.
Think about the difference in specificity of "Prove Fermat's Last Theorem" vs. "Build a web browser".
Nah, LLMs are solving unique problems in maths, whereas they're basically just overfitting to the vast amounts of training data when writing code. Every single piece of code AI writes is essentially just a distillation of the vast amounts of code it's seen in its training - it's not producing anything unique, and its utility quickly decays as soon as you even move towards the edge of the distribution of its training data. Even doing stuff as simple as building native desktop UIs causes it massive issues.
It's finding constructions and counterexamples. That's different from finding new proof techniques, but still extremely useful, and still gives way to novel findings.
Funding a few PhDs for a year costs orders of magnitude more than it cost to solve this problem in inference. Also, this has been active research for some time. Or I guess the people working on it are just not as good as a random bunch of students? It's amazing the lengths people go to in order to maintain their worldview, even if it means belittling hardworking people.
I take it you're not a mathematician. This is an achievement, regardless of whether you like LLMs or not, so let's not belittle the people working on these kinds of problems please.
>It's amazing the lengths that people go to maintain their worldview, even if it means belittling hardworking people.
This is one of the most baffling and ironic aspects of these discussions. Human exceptionalism is what drives these arguments, but the machines are becoming so good that you can no longer do this without putting down even the top-percentile humans in the process. The same thing is happening all over this thread (https://news.ycombinator.com/item?id=47006594). And it's like they don't even realize it.
How many math PhD students do you have? If you set the problem right, something like this per year on average is a good pace.
How are they cheaper? Your average grant where I am can pay for a couple of PhD students. I could afford to pay for inference costs out of my own salary, no grant needed. Completely different economic scales here. I like students better of course, but funding is drying up these days.
I was saying generally. I don't work in maths. PhD students do lots of things other than research. If we ask a PhD student to just solve these kinds of problems and nothing else, the student would do it without much difficulty.
I guess it's different somewhere like Europe. But in Canada, most PhD students are paid through TAships, not primarily through grants. Average salary is 25k/year. Take 6-10k out for tuition, and that's 15-19k/year. You get a student doing so many things for less pay. I guess if your job only requires research, then you can do it.
Inference costs are heavily subsidised. My point was that we've spent trillions collectively on AI, and so far we have a few new proofs. It's been active research, but by some estimates only 5-10 people are even aware that it is a problem. I wrote "math PhDs", not "random students", but regardless, I don't know how you interpreted my statement that people could have discovered this without AI as "belittling the people working on this". You seem like a stupid person with an out of control chatbot that can't comprehend basic arguments.
And now you're belittling me. Yeah, good one, that'll convince people.
> out of control chatbot that can't comprehend basic arguments
I don't see how it is out of control. It is a tool. It is being used for a job. For low-level jobs it often succeeds. For tougher jobs, it is succeeding sufficiently often to be interesting. I don't care if it understands worldview semantics, that's for humans to do.
> we've spent trillions collectively on ai
The economics around AI do not suggest that continuing to perform large training runs is sustainable. That's also not relevant to the discussion. Once the training is done, further costs are purely on inference, and that is the comparison I was making.
> Inference costs are heavily subsidised
Even if you pay to run inference on your own hardware, economies of scale dictate that it is still cheaper than students.
> It's been active research but the problem estimates only 5-10 people are even aware that it is a problem.
That sounds about right for most pure math problems. Were you expecting more?
Let's not pretend that society would have invested that kind of money into pure mathematics research. It is extraordinarily difficult to get funding for that kind of work in most parts of the world. Mathematicians are relatively cheap, yes, but the money coming into AI was from blind VCs with a sense of grandeur. It wasn't to do maths research. If it's here anyway, and causing nightmares for actually teaching new students, may as well try to make some good of it. It has only recently crossed the edge of being useful. Most researchers I know are only now starting to consider it, mostly as a search engine, but some for proof assistance. Experiences a year ago were highly negative. They're a lot more positive now.
I'm trying to give a perspective from someone who actually does do math research at a senior level, who actually does have a half dozen math PhD students to supervise, to say that your blind attitude toward this is not sensible or helpful. Your comments about the problem being trivial do belittle the actual effort people have put into the problem without success. If they could easily have discovered this without AI, they would have already done so. Researchers do not have unlimited time and there are many more problems than students, especially good ones (hence my random comment).
From various online estimates, I would put global AI spend just since 2020 at $2T. Some projections estimate that we might spend that per year starting next year. To the extent that many of these projects will be cancelled or shelved, capital is beginning to take stock of the feasibility of clawing back even the original investments. OpenAI is apparently doubling its staff, but whether these are sales or (prompt?) engineering jobs, the biggest hypemongers are themselves unable to reduce headcount even with unlimited "at-cost" AI inference.
Comparing total AI spend to the value added by producing a few new maths/science proofs is unfair, since AI is doing more than maths proofs, but for comparison one can estimate the total spent to date on mathematicians and associated costs (buildings, experiments, etc.). I would very roughly estimate that the total cost of all mathematics since 1600 is less than what we've spent on AI to date, and the results from investment in mathematicians are incomparable to a few derivative extensions of well-established ideas. For less than a few trillion we have all of mathematics. For an additional $2T, we have trivial advancements that no one really cares about.
It's amazing that you find so many that are uncomfortable with this question. I literally teach a first-year data science course and I ask the students this very question. I spend half a lecture on it and put it in their assessment.
This is one of the most fundamental things to understand in statistics. If you don't have at least some degree of comfort with this, you have no business working with data in a professional capacity.
You can be comfortable about the concept, but not comfortable about the interview.
The way I understand it, OP asked this as a way to open the conversation, while candidates interpreted it as a math problem to solve, unintentionally getting their mind into "exam" mode.
I was thinking this too, but I don't believe this is the case, and I feel like it would not be a good idea either.
Most of these people are likely students; this should be a learning moment, but I don't think it is yet grounds for their entire academic career to be crippled by being unable to publish in a top-tier ML venue.
If this is tolerated, it sends exactly the wrong kind of message. The students, if they are, should be banned for life. Let them serve as an example for myriads of future students, this will be a better outcome in the long run.
This didn't trip up people who were merely bouncing ideas off an LLM; it caught people who copied and pasted straight from their LLM.
It's not a full consensus view, but a majority of sociologists agree that high-severity deterrence has limited effectiveness against crime. Instead, certainty of enforcement is the most salient factor.
Correct. We also have evidence, both from cheating in sports and in academia, that stiff punishments do not work. Many people hold the false belief that if it is easy to cheat, then the punishments must be extremely severe to scare would-be cheaters. It just does not work. Preventing cheating is way easier said than done.
> We also have evidence both from cheating in sports and in academia that stiff punishments do not work.
Maybe so, but there is evidence that lack of punishment also doesn't work.
Neither extreme "works". Just because terminal punishments do not prevent the worst cheating does not in any way imply that slaps on the wrist reduce incidents of cheating.
Are you claiming that one of the extremes "works"? That the "light punishment" route reduces cheating? Or maybe has no effect on cheating?
There are two extremes; I am not arguing that the one extreme (terminal punishment) reduces cheating, I am saying that the other extreme (light punishment) does not reduce cheating!
You say that stiff punishments have no effect on the cheating rate, right? Compared to what exactly? Compared to no punishments? Compared to light punishments? Compared to medium punishments? Compared to heavy but non-terminal punishments?
Now that I've reread your comment, I'm extremely skeptical that terminal punishments have no effect on the cheating rate compared to light punishments or compared to medium punishments.
It's an extraordinary claim, so I want to see this "lots of evidence"; the evidence should basically show no correlation between cheating and punishments.
That's not true. People still pick up USB sticks from the street, people still fall for scam phone calls, and people still click on links in email.
Just because a method was successful once does not mean it was 'burned'. None of these people will be checking each and every future PDF or passing it through a cleaner before doing the same thing all over again, and others will be 'virgins' who won't even be warned, because this is not going to be widely distributed in spite of us discussing it here.
If anything you can take this as proof that this method is more or less guaranteed to work.
Yup, precisely this. Doing something bad is rarely a rational weighing of costs and benefits. Likelihood and celerity of getting caught seem to be the driving factors.
It makes honest people feel rewarded, valued, and acknowledged. It teaches people who wish to follow the rules and conform to social norms what those norms are and where we actually draw the line in practice.
Looked at slightly differently, given a split between high trust and low trust preventing conversions from high to low is similarly important to inducing conversions from low to high.
Well, maybe they found themselves in the last hours of the deadline without the reviews done... in some cases due to procrastination, but in a few cases perhaps because life is hard and they just couldn't do it. So they used the LLM as a last resort to avoid going beyond the deadline (which I assume was penalized as well?)
To err is human. It makes sense that they are punished (and the harshest part of the punishment is not having a paper rejected; it's the loss of face with coauthors and others, BTW. Face is important in academia), but "for life" is way too much IMO.
This year, having their own submissions desk-rejected is a strong enough signal that the policy has some teeth behind it. Let's ban them for life next year.
I strongly feel that deterrence should be the goal here, not retribution.
It has been shown time and again that, for most people, teaching them to be better and giving second chances is more effective than using forever-punishment as a warning for others.
This line of reasoning interests me because it seems to arise in other contexts as well.
Do very harsh punishments significantly reduce future occurrences of the offense in question?
I've heard opponents of the death penalty argue that it's generally not the case, e.g., because the criminals often aren't reasoning in terms that factor in the death penalty.
On the other hand (and perhaps I'm misinformed), I've heard that some countries with death penalties for drug dealers have genuinely fewer problems with drug addiction. Lower, I assume, than the numbers you'd get from simply executing every user.
I'm not sure it was meant that way, but nice metaphor. For some students "academic death" might really be better than a life of being trapped in a system that they can only navigate by cheating.
My understanding is that something along those lines happened:
> All Policy A (no LLMs) reviews that were detected to be LLM generated were removed from the system. If more than half of the reviews submitted by a Policy A reviewer were detected to be LLM generated, then all of their reviews were deleted, and the reviewer themselves was removed from the reviewer pool.
Half is a bit lenient in my view, but I suppose they wanted to avoid even a single false positive.
Thank goodness we have you passing judgment on the internet; otherwise, who else would be around to do it? I'm glad you're willing to destroy someone for a mistake rather than letting them learn and change. We all know that arbitrary and harsh punishments solve everything.
"Oops, you told me not to do this, and I volunteered to agree to these stricter standards yet I flagrantly disregarded them, please forgive me" doesn't seem like something you just accidentally do, it's a conscious choice.
I've been an AC (the person who manages the reviewing process and translates reviews into accept/reject decisions) at ICML and similar conferences a few times. In my experience, grad students tend to be pretty good reviewers. They have more time, they are less jaded, and they are keener to do a good job. Senior people are more likely to have the deep and broad field knowledge to accurately place a paper's value, but they are also more likely to write a short shallow review and move on. I think the worst reviews I've seen have been from senior people.
It's usually not "noob" students. Big conferences require reviewers to have at least one (usually more) published paper in major venues. For students, this usually means they went through the process of being the first author on a few papers.
Ok but you need peer reviewed publications to graduate with a PhD.
And if you retort that the whole academic system is obsolete, well, it still carries a lot of prestige and legitimacy that makes politicians interested in maintaining it, so it's not going anywhere soon.