fzysingularity's comments

> It's like going to the grocery store and buying tabloids, pretending they're scientific journals.

This is pure gold. I've always found this approach of running evals on a moving target via consensus to be broken.


I'd love to see Claude Code remove more lines than it adds, TBH.

There's a ton of cruft in code that humans are less inclined to remove because it just works, but imagine having an LLM do the cleanup work instead of the generation work.


Here's a short cookbook exploring an agentic approach to vision–language tasks: detection, segmentation, OCR, generation, and combining classical CV tools with VLM reasoning.

Happy to run examples if you leave a comment.

[1] IPython notebook: https://github.com/vlm-run/vlmrun-cookbook/blob/main/noteboo...

[2] Colab: https://colab.research.google.com/github/vlm-run/vlmrun-cook...
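For anyone skimming, here's roughly the pattern the notebook walks through: classical CV proposes a region, then a VLM reasons over it. This is just an illustrative sketch (the model name and prompt are placeholders, not the notebook's exact code):

    import base64
    import cv2
    from openai import OpenAI

    client = OpenAI()

    def describe_largest_region(path: str) -> str:
        # Classical CV: find the biggest contour and crop it.
        img = cv2.imread(path)
        edges = cv2.Canny(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY), 100, 200)
        contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        if not contours:
            return "no region found"
        x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
        _, buf = cv2.imencode(".jpg", img[y:y + h, x:x + w])

        # VLM: reason over just that crop.
        resp = client.chat.completions.create(
            model="gpt-4o",  # placeholder model
            messages=[{"role": "user", "content": [
                {"type": "text", "text": "What object is in this crop?"},
                {"type": "image_url", "image_url": {
                    "url": "data:image/jpeg;base64," + base64.b64encode(buf).decode()}},
            ]}],
        )
        return resp.choices[0].message.content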


What is Photopea built on?


The author does yearly AMAs on Reddit; you should look them up.


This is why arenas are generally a bad idea for assessing correctness in visual tasks.


FYI, one of the models in the battle was pretty slow to load. Are these also being rated on latency, or just quality?


Ultimately, there’s some intersection of accuracy x cost x speed that’s ideal, which can be different per use case. We’ll surface all of those metrics shortly so that you can pick the best model for the job along those axes.
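To make that concrete, picking along those axes is basically a weighted trade-off. A toy sketch (the weights and metric values are made up for illustration, not our actual scoring):

    def pick_model(metrics, w_acc=0.6, w_cost=0.2, w_speed=0.2):
        # Higher accuracy is better; cost and latency count as penalties.
        def score(m):
            return (w_acc * m["accuracy"]
                    - w_cost * m["usd_per_1k_requests"]
                    - w_speed * m["p50_latency_s"])
        return max(metrics, key=lambda name: score(metrics[name]))

    models = {
        "model-a": {"accuracy": 0.82, "usd_per_1k_requests": 0.40, "p50_latency_s": 1.2},
        "model-b": {"accuracy": 0.78, "usd_per_1k_requests": 0.05, "p50_latency_s": 0.4},
    }
    print(pick_model(models))  # different weights per use case pick different winners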


Ideally we want people to rate based on quality, but I imagine some of the results are biased right now by loading time.


That's an easy fix if you wait for the slowest one and pop them both in at the same time, no?
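Something like this, assuming each entrant exposes an async generate() call (hypothetical interface): kick both off concurrently, but only reveal once the slower one has finished, so loading time can't bias the vote.

    import asyncio

    async def battle(prompt, model_a, model_b):
        # Start both generations at once, but wait for the slower one
        # before rendering anything, so latency can't influence the rating.
        out_a, out_b = await asyncio.gather(
            model_a.generate(prompt),
            model_b.generate(prompt),
        )
        return out_a, out_b  # show side by side at the same instant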


I definitely see the value and versatility of Claude Skills (over what MCP is today), but I find the sandboxed execution to be painfully inefficient.

Even if we expect the LLMs to fully resolve the task, they'll heavily rely on I/O and print statements sprinkled across the execution trace to get the job done.


> but I find the sandboxed execution to be painfully inefficient

The sandbox is not mandatory here. You can execute the skills on your host machine too (with some fiddling), but it's good practice, and probably for the better, to get into the habit of executing code in an isolated environment for security purposes.
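If you don't want the full Skills sandbox, even a throwaway container gets you most of the isolation habit. Rough sketch (the path and image are illustrative, not anything Claude-specific):

    import subprocess

    skill_dir = "/path/to/skill"  # hypothetical: directory containing the skill's run.py
    subprocess.run(
        ["docker", "run", "--rm", "--network", "none",   # no network access
         "-v", f"{skill_dir}:/skill:ro",                 # mount the skill read-only
         "python:3.12-slim", "python", "/skill/run.py"],
        check=True,
    )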


The better practice, if it isn't a one-off, is being introduced to the tool (perhaps by an LLM) and then just running it yourself with structured inputs when appropriate. I think the 2015-era novice coding habit of copying a blob of twenty shell scripts off Stack Overflow and blindly running them in your terminal (while also not good, for obvious reasons) was still better than the same thing happening without you being able to watch and potentially learn what those commands were.


I do think that if the agents can successfully resolve these tasks in a code-execution environment, they can likely come up with better parametrized solutions with structured I/O, assuming these are workflows we want to run over and over again.
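Roughly what I mean: once the agent has solved the task once, freeze it into a typed function with structured inputs and outputs instead of re-deriving it from print statements every run. The names below are illustrative, not anything from Claude Skills:

    from dataclasses import dataclass, field

    @dataclass
    class InvoiceRequest:            # structured input
        pdf_path: str
        currency: str = "USD"

    @dataclass
    class InvoiceResult:             # structured output
        total: float
        line_items: list[dict] = field(default_factory=list)

    def extract_invoice(req: InvoiceRequest) -> InvoiceResult:
        # Body distilled from the agent's one-off execution trace; stubbed here.
        return InvoiceResult(total=0.0)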


Claude does image generation in surprising ways - we did a small evaluation [1] of different frontier models for image generation and understanding, and Claude's results are by far the most surprising.

[1] https://chat.vlm.run/showdown

[2] https://news.ycombinator.com/item?id=45996392


We ran a small visual benchmark [1] of GPT, Gemini, Claude, and our new visual agent Orion [2] on a handful of visual tasks: object detection, segmentation, OCR, image/video generation, and multi-step visual reasoning.

The surprising part: models that ace benchmarks often fail on seemingly trivial visual tasks, while others succeed in unexpected places. We show concrete examples, side-by-side outputs, and how each model breaks when chaining multiple visual steps.

We go into more detail in our technical whitepaper [3]. You can play around with Orion for free here [4].

[1] Showdown: https://chat.vlm.run/showdown

[2] Learn about Orion: https://vlm.run/orion

[3] Technical whitepaper: https://vlm.run/orion/whitepaper

[4] Chat with Orion: https://chat.vlm.run/

Happy to answer questions or dig into specific cases in the comments.
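For a sense of what "chaining multiple visual steps" means in practice, a typical chain looks like the sketch below: each step's output feeds the next, so an early miss compounds. The detect/read helpers are placeholders, not our API.

    def read_license_plate(image, detect, read_text):
        # Step 1: detection (placeholder callable).
        boxes = detect(image, label="license plate")
        if not boxes:
            return None                # many models break the chain right here
        x0, y0, x1, y1 = boxes[0]
        # Step 2: crop to the detection.
        crop = image[y0:y1, x0:x1]
        # Step 3: OCR on the crop (placeholder callable).
        return read_text(crop)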


SAM3 is cool - you can already do this more interactively on chat.vlm.run [1], and do much more. It's built on our new Orion [2] model; we've been able to integrate with SAM and several other computer-vision models in a truly composable manner. Video segmentation and tracking are also coming soon!

[1] https://chat.vlm.run

[2] https://vlm.run/orion


Wow, this is actually pretty cool. I was able to segment out the people and the dog in the same chat. https://chat.vlm.run/chat/cba92d77-36cf-4f7e-b5ea-b703e612ea...



Nice, that's pretty neat.

