Hacker News | coder543's comments

Your “benchmark” is invalid. Penalizing the model because the hosting environment is being DDoSed by users a few hours after launch is utter nonsense.

I see that you tried to justify this lower in the thread, but no… it completely invalidates your benchmark. You are not testing the model. You are conflating one specific host's reliability with model performance, and then claiming you are benchmarking the model. All major models are hosted by multiple different services.

In the real world, clients will simply retry on a server error; that does not affect response quality at all, and the workflow the model is being used in will not fail. If a workflow is so poorly coded that it doesn't even have retry logic, then that workflow is doomed no matter which host you use. But again, reliability of the host is separate from the model.
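To be concrete, here's a minimal retry-with-backoff sketch; the function names and the simulated 503 failure are made up for illustration, not taken from any particular client library:

```python
import time

def with_retries(call, max_attempts=4, base_delay=0.5):
    """Retry a flaky zero-argument call with exponential backoff.

    Transient server errors (modeled here as raised exceptions)
    trigger a retry; the final failure is re-raised.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * 2 ** attempt)

# Simulate a host that returns a 5xx twice, then succeeds:
attempts = {"n": 0}
def flaky_completion():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("503 Service Unavailable")
    return "response text"

print(with_retries(flaky_completion, base_delay=0.05))  # prints "response text" on the 3rd try
```

The two transient failures are invisible to the caller: the workflow gets the same response it would have gotten from a perfectly reliable host, just slightly later.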

You can make your benchmark valid by having separate leaderboards for model quality and host reliability. I’m not saying to throw the whole thing away. But the current claim is not valid.

And you’re also making an unsourced claim that everyone else has already determined this model sucks? Nah. The first result from Artificial Analysis shows good things: https://x.com/ArtificialAnlys/status/2047547434809880611

But I am still waiting to see the results from the full suite of AA benchmarks.


Their benchmark is full of nonsense like this, and I'm amazed the account hasn't been banned for spam, given that most of its interactions on the site are promoting the benchmark.

They have Gemini 2.5 Flash ahead of Opus 4.6: https://aibenchy.com/compare/anthropic-claude-opus-4-6-mediu...

Absolutely worthless benchmark but every release has a comment linking to this nonsense.


The description specifically says:

"Kimi-K2.6 adopts the same native int4 quantization method as Kimi-K2-Thinking."


From the page:

> Import from anywhere. Start from a text prompt, upload images and documents (DOCX, PPTX, XLSX), or point Claude at your codebase. You can also use the web capture tool to grab elements directly from your website so prototypes look like the real product.


Thank you, I should RTFA next time.

Artificial Analysis hasn't posted their independent analysis of Qwen3.6 35B A3B yet, but Alibaba's benchmarks paint it as being on par with Qwen3.5 27B (or better in some cases).

Even Qwen3.5 35B A3B benchmarks roughly on par with Haiku 4.5, so Qwen3.6 should be a noticeable step up.

https://artificialanalysis.ai/models?models=gpt-oss-120b%2Cg...

No, these benchmarks are not perfect, but short of trying it yourself, this is the best we've got.

Compared to the frontier coding models like Opus 4.7 and GPT 5.4, Qwen3.6 35B A3B is not going to feel smart at all, but for something that can run quickly at home... it is impressive how far this stuff has come.


Qwen models commonly get accused of benchmaxxing though. Just something to keep in mind when weighing the standard benchmarks.

Every model release gets accused of that, including the flagship models.

Less so for Gemma-4 because it falls behind Qwen on benchmarks. Tests for benchmaxxing are also strongly suggestive: https://x.com/bnjmn_marie/status/2041540879165403527

No… seriously. Every model release is accused. Including Opus, GPT-5.4, whatever. And yes, including smaller models that are not the top in every benchmark.

My own experiences with Gemma 4 have been quite mediocre: https://www.reddit.com/r/LocalLLaMA/comments/1sn3izh/comment...

I would almost be tempted to call it benchmaxed if that term weren’t such a joke at this point. It is a deeply unserious term these days.

Gemma 4 is worse than its benchmarks show in terms of agentic workflows. The Qwen3.x models are much better; not benchmaxed. I have tested this extensively for my own workflows. Google really needs to release Gemma 4.1 ASAP. I really hope they’re not planning to just wait another calendar year like they did for Gemma 3 -> 4 with no intermediate updates.

And the lead author on the paper replied to that tweet to say that the scores would need to be greater than 80 to show actual contamination: https://x.com/MiZawalski/status/2043990236317851944?s=20


Not true. With a MoE, you can offload quite a bit of the model to CPU without losing a ton of performance. 16GB should be fine to run the 4-bit (or larger) model at decent speeds. The --n-cpu-moe parameter is the key one on llama-server, if you're not just using -fit on.
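Something like this, roughly (flag names as of recent llama.cpp builds; the model filename and the layer count are placeholders, so check `llama-server --help` for your build):

```shell
# Keep the MoE expert weights of the first N layers on CPU while the
# attention/shared weights go to the GPU. Lower --n-cpu-moe until you
# run out of VRAM, then back off. All values here are placeholders.
llama-server -m qwen3.6-35b-a3b-q4_k_m.gguf -ngl 99 --n-cpu-moe 24 -c 32768
```

Because only a few experts are active per token, the CPU-resident expert weights hurt throughput far less than offloading dense layers would.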

I've been way out of the local game for a while now, what's the best way to run models for a fairly technical user? I was using llama.cpp in the command line before and using bash files for prompts.

Running llama-server (it ships with llama.cpp) starts an HTTP server on a specified port.

You can connect to that port with any browser, for chat.

Or you can connect to that port with any application that supports the OpenAI API, e.g. a coding assistant harness.
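A minimal sketch of that last option, using only the Python standard library (the port assumes llama-server's default of 8080, and the model name is arbitrary since the server serves whatever model it was started with):

```python
import json
import urllib.request

def build_chat_request(prompt, host="http://localhost:8080", model="local"):
    """Build an OpenAI-style chat completion request for llama-server."""
    payload = {
        "model": model,  # llama-server largely ignores this field
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("Hello!")
# To actually send it (requires a running llama-server):
#   with urllib.request.urlopen(req) as resp:
#       print(json.loads(resp.read())["choices"][0]["message"]["content"])
print(req.full_url)  # http://localhost:8080/v1/chat/completions
```

Any tool that speaks the OpenAI API can be pointed at the same base URL instead.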


That is an extremely strange article, in my opinion. They test Gemma 4 31B, but they use Qwen3 32B, DeepSeek R1, and Kimi K2, which are all outdated models whose replacements were released long before Gemma 4? Qwen3.5 27B would have done far better on these tests than Qwen3 32B, and the same for DeepSeek V3.2 and Kimi K2.5. Not to mention the obvious absence of GLM-5.1, which is the leading open weight model right now.

The article also seems to gloss over the discovery phase, which seems very important. If it were as easy as they say, then the models should have been let loose, and we would see whether they actually found these bugs and how many false positives they flagged as critical. Instead, they pointed the models at the flawed code directly.


Now that it has pricing, Gemma 4 31B has wiped out several of those models from the Pareto frontier. Gemma 4 26B A4B has an Elo but no pricing, so it still isn't on that chart. The Gemma 4 E2B/E4B models still aren't on the arena at all, but based on how well they've performed in general, I expect them to move the Pareto frontier as well if they're ever added.


If you search the model card[0], there is a section titled "Code for processing Audio", which you can probably use to test things out. But, the model card makes the audio support seem disappointing:

> Audio supports a maximum length of 30 seconds.

[0]: https://huggingface.co/google/gemma-4-26B-A4B-it#getting-sta...


The E2B and E4B models support 128k context, not 256k, and even with the 128k... it could take a long time to process that much context on most phones, even with the processor running full tilt. It's hard to say without benchmarks, but 128k supported isn't the same as 128k practical. It will be interesting to see.


That Pareto plot doesn't seem to include the Gemma 4 models anywhere (not just missing from the frontier), likely because pricing wasn't available when the chart was generated. At least, I can't find the Gemma 4 models there. So it's not particularly relevant until it is updated for the models released today.

