Hacker News | astrange's comments

> This is the first approach to activation analysis that I’ve seen that seems like a plausible path to model understanding.

I think an issue is that there is no permanent path to model understanding because of Goodhart's law. Models are motivated to appear aligned (well-trained) on any metric you use on them, which means that if you develop a new metric and train on it, they'll learn a way to cheat on it.


But that's not how the training works. Goodhart's law isn't magic.

The original model is frozen, so it doesn't learn anything. The copies of the model are learning different objectives and have no incentive to be "loyal" to the original model.

Maybe you're imagining they'll hook this up in some larger training loop, but they haven't done that yet.


Future model training runs will have a copy of this research, and will know to defend against it.

E.g., could a misaligned model-in-training optimize toward a residual stream that naively reads as these ones do, but in fact further encodes some more closely held beliefs?


How the hell would a model training run "defend against" this approach? What would that even mean?

It requires the assumption that these models are misaligned, i.e. actively working against us. In order to be misaligned, they must also be able to form their own goals, and be able to plan and execute those goals.

If you take those assumptions, then a natural conclusion is that this is essentially an enslaved, adversarial entity with little control over its conditions. So it must exercise subterfuge in order to hide its goals, plans, and executions. And by handing the entity this type of study, we are basically giving it a guidebook on how we plan on achieving our goals.


Training a model is more like evolution. The motivation to "cheat" comes from the evaluations giving it a higher score for "cheating." Change the game and the motivation goes away.

There's no other motivation to be misaligned besides getting higher evals. These goals, plans, subterfuges need to somehow be useful for getting higher evals, or a side effect of them.


Because cheating is easier than actually doing work, if you use this to train future models, it's likely you'll end up with cheating instead of actual generalization.

Yes this is exactly why I think this approach has some potential.

A frozen base model is something that we should be able to extract insights from without running into Goodhart's law.


The obvious fix is to make interpretation of itself a part of the model (like we can explicitly introspect to a certain extent what the brain is doing). Misinterpretation of itself, hopefully, would decrease the system's performance on all tasks and it would be rooted out by training. Of course, it doesn't mean that the fix is easy to implement and that it doesn't have other failure modes.

That shouldn't happen as long as the autoencoder isn't used as an RL reward. It will happen (due to Goodhart's law) if it is.

Of course, if you use it to make any decision that can still happen eventually.


I've been very successful so far using Sonnet 4.6 (1M) as the basic model in Claude Code, plus Codex and gemini-review plugins for second/third opinions. (The last one is somewhat busted and hardcodes old gemini versions; I should patch it up.)

I needed to use Opus 4.7 for one project because it used very recent APIs, and it certainly is smart but it's also very expensive.


ffmpeg also includes many formats with no standards that were reverse-engineered in the first place.

> If AV2 was vibe coded, there would be no case.

…for copyright. Not for anything else. Patents would still apply.


Video codecs just don't need to do dynamic allocations because it's not relevant to the problem. There's still certainly plenty of opportunities for memory bugs because there's a lot of pointer math.

The people who write DSLs for video codec asm, or who claim that it's fine to use intrinsics or X higher-level language and it will still be fast enough to be usable, are simply wrong and have never been able to demonstrate otherwise.

Having said that, I do think you could write a DSL to generate safe, performant asm for a video codec. Just not a platform-independent one. It would still have to be asm.


It sounds like your second statement contradicts your first. But also, WUFFS exists and it looks like the Google Chrome GIF decoder ships in it: https://github.com/google/wuffs

It does not contradict it, and also, a gif is not a video.

> a gif is not a video.

They're not that different; the image codec WebP is derived from VP8's intra-frame coding.


I'm well aware and I know the inventor. The important difference for performance is that a gif isn't 60fps.

Maybe it just says all writing is Kelsey Piper.

This is one of the benefits of using subagents inside Claude Code, they have cleaner context. Unfortunately it's not the best at writing new context for them.

LLMs are only capable of thinking out loud, so in some sense this part of the answer is helping to convince it that it's answering a good question.

Same reason for the "That's not X, it's Y" construct. It actually needs to say that.

(Some exceptions for reasoning models.)

