
What's the rough equivalent of a local model? Are we talking GPT-4?

Qwen 3.6, which was released this month, is a large model but still smaller than the frontier ones. Supposedly it's at about Sonnet level when configured correctly, and it can be run on commodity hardware without buying a data center. https://www.reddit.com/r/LocalLLaMA/comments/1so1533/qwen36_...

Then there are the middle-sized ones, which require multiple GPUs and are comparable to GPT's latest flagships.

Then there's Kimi 2.6, a monster that is beating Opus in some benchmarks. https://www.reddit.com/r/LocalLLaMA/comments/1sr8p49/kimi_k2...

It's basically whatever you can afford. Any trash-heap laptop can run code-autocomplete models locally, no problem. The rest require some level of investment, from an idle gaming PC up to serious money.


How smart it can be depends on your VRAM or "unified" memory; how quick it is depends on your CPU/GPU.

128GB of RAM? Sure, roughly the early-to-mid GPT-4-era releases, except maybe 4o. And on an M5 Max, at about the same speed.

I wouldn't really bother under 64GB (meaning 32GB or less) except for entertainment value (chats, summaries, tasky read-only agent things).
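
If it helps calibrate, here's a rough back-of-envelope sketch for whether a model fits in memory. All the numbers are illustrative assumptions (bytes-per-parameter figures are approximate, and the overhead factor for KV cache etc. is a guess), not vendor specs:

    # Rough sketch: memory needed to hold a model's weights plus overhead.
    # Bytes-per-parameter values are approximate; overhead covers KV cache etc.
    BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}

    def weight_gb(params_billions: float, quant: str, overhead: float = 1.2) -> float:
        # billions of params x bytes/param gives GB directly
        return params_billions * BYTES_PER_PARAM[quant] * overhead

    # A 70B model at 4-bit: ~42 GB, so 64GB of unified memory works, 32GB doesn't.
    print(weight_gb(70, "q4"))  # -> 42.0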


GLM 5.1 and DeepSeek 4 are acceptable, but between the hardware cost and the energy cost, depending on your use case you may as well just purchase tokens. They get useless and stupid rapidly if you quantize them enough to run on a single 16-24GB GPU.
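
To put "may as well purchase tokens" in numbers, a hedged break-even sketch; every figure below is a made-up placeholder, not a real price:

    # Back-of-envelope: local inference vs. buying API tokens.
    # All prices, wattage, and speeds are hypothetical placeholders.
    hardware_usd = 3000.0              # one-time cost of the box
    watts, usd_per_kwh = 400.0, 0.15   # draw under load, electricity price
    local_tok_per_s = 30.0
    api_usd_per_mtok = 10.0            # hypothetical API price per million tokens

    hours_per_mtok = 1e6 / local_tok_per_s / 3600
    energy_usd_per_mtok = hours_per_mtok * (watts / 1000) * usd_per_kwh
    breakeven_mtok = hardware_usd / (api_usd_per_mtok - energy_usd_per_mtok)

    print(f"energy: ${energy_usd_per_mtok:.2f}/Mtok, break-even: ~{breakeven_mtok:.0f} Mtok")

Under those assumptions the electricity is trivial (~$0.56 per million tokens) and the hardware only pays for itself after ~318 million tokens, which is the point of the comparison.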

No it's a big area of opportunity right now. All the existing solutions are pretty rough.

The latest ChatGPT image-generation model is producing really nice results for turning sprites into sprite animations, which is something that felt impossible to get right a year ago. But for 3D I haven't been able to get anything good.

Yeah I almost mentioned this! The recent GPT upgrade might actually be the most helpful tool in the space.

For the uninitiated, Paul Erdős was a pretty famous but very eccentric mathematician who lived for most of the 1900s.

He had a habit of seeking out and documenting mathematical problems people were working on.

The problems range in difficulty from "easy homework for a current undergrad in math" to "you're getting a Fields Medal if you can figure this out".

There's nothing that really connects the problems other than the fact that one of the smartest people of the last 100 years didn't immediately know the answer when someone posed it to him.

One of the things people have been doing with LLMs is to see if they can come up with proofs for these problems as a sort of benchmark.

Each time there's a new model release a few more get solved.


> Each time there's a new model release a few more get solved.

I'm no expert, but based on the commentary from mathematicians, this Erdős proof is a unique milestone because the problem received previous attention from multiple professional mathematicians, and the proof was surprising, elegant, and revealed some new connections.

The previous ChatGPT Erdős proofs have been qualitatively less impressive, more akin to literature search or solving easier problems that have been neglected.

Reading the prompt[1], one wonders if stoking the model to be unconventional is part of the success: "this ... may require non-trivial, creative and novel elements"

[1] https://chatgpt.com/share/69dd1c83-b164-8385-bf2e-8533e9baba...


>one wonders if stoking the model to be unconventional is part of the success

I've long suspected that a lot of these models' real capabilities are still locked behind certain prompts, despite the big labs spending tons of effort on making default responses to simple prompts better. Even really dumb shit like "Answer this: ..." vs "Question: ..." vs "... you'll be judged by <competitor>" that should have zero impact in an ideal world can significantly impact benchmark results. The problem is that you can waste a ton of time hunting for the right prompt with these "dumb" approaches, when the model actually just needed some very specific context that was obvious to you but not to it, as happens in many day-to-day situations. My go-to method is still to have the model ask me questions as the very first step for any of these problems. They kind of tried that with deep research since the early o-series, but it still needs improvement.
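
A minimal sketch of what testing those framings looks like; `ask` is a stand-in for whatever model call you use, and the templates are just the examples from above:

    # Sketch: measure how trivial prompt framing shifts benchmark accuracy.
    # `ask(prompt) -> str` is a placeholder for your actual model call.
    TEMPLATES = [
        "Answer this: {q}",
        "Question: {q}",
        "{q}\n(Your answer will be judged against a competitor's.)",
    ]

    def accuracy(ask, dataset, template):
        # dataset: list of (question, expected_answer) pairs
        hits = sum(ask(template.format(q=q)).strip() == a for q, a in dataset)
        return hits / len(dataset)

    # Same dataset, different framings -- in an ideal world these would tie:
    # for t in TEMPLATES: print(repr(t[:20]), accuracy(ask, dataset, t))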


Just the right "prompt" is exactly what happened here. Lean has been developed and incorporated into it's data set. Also, token responses only vaguely correlate to "human language" and it's been proven transformers develop their own internal representation that has created a whole field called machanistic interpretation. Being able to more correctly "parse", AKA using Lean and the right "Prompts, insights and suggestions", will take a whole new meaning in the future.

> mechanistic interpretability

Awesome term/info, and (completely orthogonal to whether they’ll take err jerbs): I’m really excited about the social/civic picture that might be enabled by a defined and verifiable ontological and taxonomical foundation shared across humanity, particularly coupled with potential ‘legislation as code’ or ‘legal system as code’ solutions.

I’m thinking on a time horizon a bit past my own lifespan, but: even the possibility of objectively mapping out some specific aspect of a regional approach to social rights in a given time period and comparing it with another social framework, alongside automated and verifiable execution of policy, irrespective of the language of origin, is incredible.

Instead of hundreds or thousands of incommensurate legislative silos we might create a bazaar of shared improvement and governance efficiency. Turnkey mature governance and anti-corruption measures for newborn nations and for countries trying to break out of vicious historical exploitation cycles. Fingers crossed.


Do you think the root cause of social/civic failures has been an inadequate policy repository and lack of a map between policy representations? If so, I have a bridge in Alaska for you to encode into your representation scheme.

Ah, yes, 2001 but on land.

I consider the scene with Dr. Chandra and SAL 9000 to be a fairly realistic predictive description of how experts interact with LLMs. SAL even has a somewhat obsequious personality.

Moldbug called, asked for his mold and bugs back.

Model output reflects your input, and the effect is self-reinforcing over the course of a whole conversation. The color you add around a problem influences the model's behavior.

A "dumber"/vague framing will get a less insightful solution, or possibly no solution at all.

I don't even necessarily think this is a critical flaw; in general it's just the model tuning its responses to your style of prompt. People use LLMs for all kinds of different tasks, and the "modes of thought" for responding to an Erdős problem versus software engineering versus a more human/soft-skills topic are all very different. I think the "prompt sensitivity" issue just comes bundled with this general behavior.


Keeping a pristine context is so important that I use two separate conversations whenever I'm doing something meaningful. One is the main task executor; the other is for me to bounce random problems, thoughts, and ideas off of, while doing everything to keep the executor instance's context pristine.

It's sort of an agentic loop where I am one of the agents
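
In code terms it's just two independent message histories with a human router between them. A minimal sketch using an OpenAI-style chat API (the model name and prompts are placeholders):

    # Sketch of the two-conversation pattern; model name is a placeholder.
    from openai import OpenAI

    client = OpenAI()
    executor = [{"role": "system", "content": "Execute the main task precisely."}]
    scratch = [{"role": "system", "content": "Brainstorm side questions freely."}]

    def chat(history, text, model="gpt-4o"):
        history.append({"role": "user", "content": text})
        reply = client.chat.completions.create(model=model, messages=history)
        msg = reply.choices[0].message.content
        history.append({"role": "assistant", "content": msg})
        return msg

    # The human routes between them, so side chatter never pollutes `executor`:
    # chat(scratch, "Is approach X a dead end?"); chat(executor, "Do step 3.")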


Yes, it's extremely awkward! Why is a model that can solve problems in scientific literature the same model that can generate random code, write poems in pirate speech, and do all sorts of other random tasks?

It feels like there is a lot of untapped power for specialized LLM tasks if they were created for specialists instead of the general populace prompting from a smartphone.


They're tuned to target a certain customer demographic solving certain problems. I've seen standard AI models do absolutely brilliant things sometimes. But the prompts needed to get them to perform like they did back in the GPT-3 days seem to get lengthier and lengthier over time. At some point we'll probably just spin out smaller, specialized models to do certain things.

> “The raw output of ChatGPT’s proof was actually quite poor. So it required an expert to kind of sift through and actually understand what it was trying to say,” Lichtman says. But now he and Tao have shortened the proof so that it better distills the LLM’s key insight.

Interestingly, it was an elegant technique, but the proof still required a lot of work.


The article is about solving a previously unsolved one. That's a harder set, of course.

No mention of how he was essentially homeless and collabed his way thru thousands of papers? Or the whole "You have set mathematics back a month" episode?

Absolute legend!


More context on what’s going on with LLMs solving Erdos problems:

https://www.dwarkesh.com/p/terence-tao

TLDR, most of what is getting solved so far is “easy” problems that were not seriously looked at by experts, and where there isn’t a new insight, just trying all the existing techniques from the toolbox. Essentially the low hanging fruit for automation. Raw count solved is a problematic eval due to its difficulty lumpiness.

Seems this problem might be different, having some new insight as part of the solution.


Worth mentioning, though, that people have already tried running all of them through LLMs at this point.

So this is proof of the models actually getting stronger (previous generations of LLMs were unable to solve this one).


Not definitively. LLMs are stochastic with respect to input, temperature and the exact prompt. It's possible that the model was already capable of it but never received the exact right conditions to produce this output.

Every model is able to solve each problem, given the right prompt. (Worst case, the prompt contains the solution.)

Interesting... Exhaustive brute force prompting might expose previously unknown capabilities in existing models. Seems like a whole can of worms.

Exhaustive brute force prompting is completely unfeasible. The number of potential prompts is impossibly large.
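
A quick illustration of the scale (the vocabulary size is a typical order of magnitude, not any specific model's):

    # Count the prompts of even modest length over a ~50k-token vocabulary.
    vocab, length = 50_000, 100
    print(len(str(vocab ** length)))  # -> 470, i.e. ~10^469 possible prompts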

It "exhaustive brute forcing" approach does not need an LLM in the loop. Just brute force the possible outputs instead. They will contain all the most beautiful novels you can imagine!

> So this is proof of the models actually getting stronger (previous generations of LLMs were unable to solve this one).

No, it's not.

While I don't dispute that new models may perform better at certain tasks, the fact that someone was able to use them to solve a novel problem is not proof of this.

LLM output is nondeterministic. Given the same prompt, the same LLM will generate different output, especially when a large number of output tokens is involved, as in this case. One of those attempts might produce a correct output, but this is not certain, and it's difficult if not impossible for a human who isn't an expert in the domain to determine, as shown in this thread.
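
The nondeterminism comes from sampling the next-token distribution at nonzero temperature. A toy sketch (the logits are made up):

    # Toy sketch of sampling nondeterminism (logits are made up).
    import numpy as np

    def sample_token(logits, temperature=0.8, rng=np.random.default_rng()):
        p = np.exp(np.array(logits) / temperature)  # softmax with temperature
        p /= p.sum()
        return rng.choice(len(p), p=p)  # different runs draw different tokens

    # Same "prompt" (same logits), three draws, likely differing results:
    print([sample_token([2.0, 1.9, 1.8]) for _ in range(3)])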


As others have pointed out, a key part of the prompt used here may have been "don't search the internet" as it would most likely have defaulted to starting off with existing approaches to that problem...

Minor aside: these models do not return the same answer every time you prompt them, which makes it harder to reason about their effectiveness.

You don't need to say "Minor aside" either. Thankfully language is a creative endeavour, not a scientific one.

Context: parent originally said "you should not say 'worth mentioning', if it's worth mentioning you can just say it". That sentence has now been edited out so my comment looks weird.

Your reply was so rude it convinced me to edit. Your second reply is a distortion of my original message too.

Well I'm glad it had the desired effect. Your comment was ruder.

I disagree, you have quoted me in a way that is not the tone or content of what I wrote.

That's literally what the Erdős problems are. This post is about one of them being solved.

Except that Erdős problems are solved all the time; many of them are already solved. I'm quite sure that the last time I saw an article about an LLM solving an Erdős problem, someone even tracked down a solution published by Erdős himself.

Sort of. Not saying that I think anyone should do this but just explaining for the sake of general knowledge.

I'm simplifying things quite a bit, but almost all military contracts are 8 years (typically split into a 4-year active and a 4-year reserve period). If you leave of your own volition during this period, you typically have to repay the government the cost of training you, and you have to pay back the signing bonus on any contract that included one.

The actual mechanism for doing this is a bit different between officers and enlisted, and there's some paperwork, but functionally you can leave if you're really motivated to, and for the most part people won't stop you (outside of a few conversations where people advise you against it).

The type of discharge you receive depends on the circumstances but generally there's a way to still get an honorable discharge (hardship, education, family, conscientious objector).

There's also the more practical distinction between quitting special forces and leaving the military entirely. Tier 1 units only want people who want to be there, and if you don't, you can get transferred to some other job in the military in like a day if you really want to.


> It still struggles to create shaders from scratch

Oh just like a real developer


Much respect for shader developers, it's a different way of thinking/programming

Demographic change is the obvious explanation.

Score changes look about the same across performance percentiles and ethnicities. That suggests it's a systemic issue unrelated to population makeup. While I'd be interested to see regional and economic breakdowns, it's certainly far from obvious it's a result of demographic change, especially after such a short period of time.

Mississippi is one of the recent education success stories though.

I wouldn’t be so sure. Anecdotally all the kids these days seem equally messed up. It could be that the Chinese and Indian kids are propping up the locals.

I've heard too many horror stories so I'm waiting.

We truly need a constitutional amendment granting a formal right to privacy.

