More

entee · 2025-09-28T14:16:55 1759069015

A lot of this post relies on the recent open ai result they call GDPval (link below). They note some limitations (lack of iteration in the tasks and others) which are key complaints and possibly fundamental limitations of current models.

But more interesting is the 50% win rate stat that represents expert human performance in the paper.

That seems absurdly low, most employees don’t have a 50% success rate on self contained tasks that take ~1 day of work. That means at least one of a few things could be true:

1. The tasks aren’t defined in a way that makes real world sense

2. The tasks require iteration, which wasn’t tested, for real world success (as many tasks do)

I think while interesting and a very worthy research avenue, this paper is only the first in a still early area of understanding how AI will affect with the real world, and it’s hard to project well from this one paper.

https://cdn.openai.com/pdf/d5eb7428-c4e9-4a33-bd86-86dd4bcf1...

drc500free · 2025-09-28T14:36:00 1759070160

That's not 50% success rate at completing the task, that's the win rate of a head-to-head comparison of an algorithm and an expert. 50% means the expert and the algorithm each "win" half the time.

viernullvier · 2025-09-28T19:23:43 1759087423

For the METR rating (first half of the article), it is indeed 50% success rate at completing the task. The win rate only applies to the GDPval rating (second half of the article).

entee · on Nov 5, 2023

I think what’s missing is what the software allows. It could be BMW/Merc etc are way more conservative on what the allow the system to do and when they force the driver to take over. In certain contexts Merc is actually willing to assert and stand by a higher level of autonomy than any other manufacturer: (https://www.motortrend.com/news/mercedes-benz-drive-pilot-le...). Taking that at face value it’s possible they can do it and choose not to because they don’t want the liability. Whatever systems are in regular cars are then either borked or deliberately have less hardware.

Tesla is uniquely risk tolerant for better or worse. You also don’t hear about people getting into accidents in a BMW on self driving because they don’t make the same claims and have tons of safeguards.

wilg · on Nov 5, 2023

The Mercedes thing is bullshit.

> Mercedes says that Drive Pilot will only operate during daylight hours at speeds up to 40 mph on “suitable freeway sections and where there is high traffic density.”

> While the system is active, drivers must keep their faces visible to the vehicle’s in-car cameras at all times, but can turn their head to talk to a passenger or play a game on the vehicle’s infotainment screen. Drivers can’t crawl into the back seat to take a nap, for instance. The system will disengage if the driver’s face is obscured or an attempt is made to block access to the in-car cameras. Presumably the system will deactivate itself if it detects the driver is sleeping or operating the car while impaired.

<40 mph, specific freeways only, does not make any kind of lane change or exit autonomously. I think any carmaker with a decent off-the-shelf lane keeping feature could make a liability claim in this scenario. It's not a measure of the technology.

Mawr · on Nov 5, 2023

Maybe any automaker could take liability too, maybe not. It's all just words in the wind until they actually do it. Mercedes put their money where their mouth is and I respect them for it. It's the opposite of bullshit.

wilg · on Nov 5, 2023

As long as you clearly understand what they are actually taking liability for, and what the capabilities of their system are, feel however you like.

IMO it's a misleading marketing tactic to position themselves competitively as having any kind of self-driving technology by recognizing that you can play games with the SAE levels to make the system sound impressive.

Mawr · on Nov 5, 2023

"SAE Level 3 DRIVE PILOT

It initiates a radical paradigm shift that permits the vehicle to take over the dynamic driving task under certain conditions in heavy traffic or congestion situations on suitable sections of freeway currently up to a speed of 60 km/h. This ultimate luxury experience enables customers to win back precious time when in the car through relaxation or productivity. For instance, they can communicate with work colleagues via in-car office tools, write messages and emails via the head unit, browse the internet or just sit back, relax and watch a movie." [1]

I'm confused where you see the opportunity for any ambiguity or misunderstanding. Even the name "SAE Level 3 DRIVE PILOT" tells you the limitations. If you want misleading, look at what Tesla's pulling with their "Full Self Driving".

In the end, users only care about what a feature enables them to do, not how impressive the tech behind it is. Being able to relax and watch a movie while sitting in busy traffic is a great value proposition.

P.S. Found a good article on it: https://arstechnica.com/cars/2023/09/mercedes-benzs-level-3-...

[1] https://group.mercedes-benz.com/innovation/case/autonomous/d...

wilg · on Nov 5, 2023

I don't think it's worth discussing this any further together.

seanmcdirmid · on Nov 5, 2023

It’s basically just useful for traffic jams…which isn’t bad idea. Most cars with smart cruise control could easily do something like this. I guess Mercedes is just adding a layer of security (driver’s face must show) and are then enabling it?

wilg · on Nov 5, 2023

It's not a bad idea. I just think Mercedes has been very clever at ginning up a "Level 3" "self-driving" feature out of commodity lane keeping systems, restricted use cases, and a cheap legal liability waiver that will almost never come into play.

entee · on Aug 9, 2022

Transcription factors often are partially disordered, just to name one. A bunch of others here:

https://www.nature.com/articles/nrm3920

entee · on July 29, 2022

> It’s only marginally less useful to actual biology than full on X-ray structures anyway.

I'm not sure what you're implying here. Are you saying both types of structures are useful, but not as useful as the hype suggests, or that an X-Ray Crystal (XRC) and low confidence structures are both very useful with the XRC being marginally more so?

An XRC structure is great, but it's a very (very) long way from getting me to a drug. Observe the long history of fully crystalized proteins still lacking a good drug. Or this piece on the general failure of purely structure guided efforts in drug discovery for COVID (https://www.science.org/content/blog-post/virtual-screening-...). I think this tech will certainly be helpful, but for most problems I don't see it being better than a slightly-more-than-marginal gain in our ability to find medicines.

Edit: To clarify, if the current state of the field is "given a well understood structure, I often still can't find a good medicine without doing a ton of screening experiments" then it's hard to see how much this helps us. I can also see several ways in which a less than accurate structure could be very misleading.

FWIW I can see a few ways in which it could be very useful for hypothesis generation too, but we're still talking pretty early stage basic science work with lots of caveats.

Source: PhD Biochemist and CEO of a biotech.

entee · on May 31, 2022

Such a database would be hugely helpful across chemistry. Right now it’s extremely expensive to access databases like Reaxys or Scifinder, and they’re not usually programmatically searchable at scale. Some databases do exist based on the patent literature (https://depth-first.com/articles/2019/01/28/the-nextmove-pat...) but they’re not as well curated or complete. A pubchem like database for reactions would be really awesome.

entee · on Feb 16, 2022

As a fellow perfectionist who has started a company, one thing that has helped me is realizing that most decisions are a lot more reversible than they appear. Even in the legal and financial domain, most things that you might obsess over are fixable if you make a mistake, and decent lawyers will tell you which ones you really have to avoid. Sometimes it'll cost you money and time, but the biggest cost is avoiding making decisions.

Always remember, no decision is a decision. Usually that's the worst choice because almost any decision, even a wrong one, at least moves the ball in some direction, allowing you to gather more information. The only guarantee in this game is that stasis will kill you, so bias towards action. When in doubt, try to evaluate "most probable bad outcome" which is different from "worst possible outcome".

Good luck!

valdiorn · on Feb 16, 2022

I feel like this is actually really good advice, thanks for taking the time to respond!

entee · on Feb 15, 2022

There's a lot that can be learned with building-block based experiments. If you do a building block based experiment then train a model, then predict new compounds, the models do generalize meaningfully outside the original set of building blocks into other sets of building blocks (including variations on different ways of linking the building blocks). Granted that's not the "fully novel scaffold" test, however it suggests that there should be some positive predictive value on novel scaffolds.

We've done work in this area and will be publishing some results later in the year.

entee · on Feb 15, 2022

This is true. Getting datasets with the necessary quality and scale for molecular ML is hard and uncommon. Experimental design is also a huge value add, especially given the enormous search space (estimates suggest there are more possible drug-like structures than there are stars in the universe). The challenge is figuring out how to do computational work in a tight marriage with the lab work to support and rapidly explore the hypotheses generated by the computational predictions. Getting compute and lab to mesh productively is hard. Teams and projects have to be designed to do so from the start to derive maximum benefit.

Also shameless plug: I started a company to do just that, anchored to generating custom million-to-billion point datasets and using ML to interpret and design new experiments at scale.

entee · on Oct 26, 2021

Not a chiphead, but saw this in the article that might be a reason ARM is better for this kind of thing:

"The theory goes that arm64’s fixed instruction length and relatively simple instructions make implementing extremely wide decoding and execution far more practical for Apple, compared with what Intel and AMD have to do in order to decode x86-64’s variable length, often complex compound instructions."

Not sure it's true, not an expert. But it doesn't sound wrong!

throwawaylinux · on Oct 26, 2021

The fixed width decoders have always been a commonly cited advantage of fixed width, and to some degree it must be true. But this is not a recent thing, the "common wisdom" about instruction format not mattering too much still very much applies here.

Pre-decode lengths or stop-bits and more recently micro-op caches have been techniques that x86 has used to mitigate this and improve front end widths, for example.

People like Jim Keller (who has actually worked and lead teams implementing these very processors at Apple, Intel, and AMD!) basically say as much (while acknowledging decode is a little harder, in the large scheme of things on modern large cores it's not such a big deal):

https://chipsandcheese.com/2021/07/13/arm-or-x86-isa-doesnt-...

Andy Glew, one of the architects for Intel's first out of order x86 core (P6) among other things, is another who has said similar.

https://groups.google.com/g/comp.arch/c/elke1FHfYr0/m/SwW9NT...

mhh__ · on Oct 26, 2021

> X86 tax might be 5%

A consistent 5% win is pretty huge for certain industries.

throwawaylinux · on Oct 26, 2021

> A consistent 5% win is pretty huge for certain industries.

Are you referring to Andy Glew's thread? He said perhaps 5%, but he also went on to say probably less than 5% for basically the lowest-end out of order processor that was fielded (A-9), not what you would call a high performance core (even then 10 years ago). On today's high performance cores? Not sure, extrapolating naively from what he said would suggest even less. Which is backed up by what Jim Keller says later.

So << 5%, which is significantly less than process node generational increases.

I'm not saying ARM won't leapfrog x86, I'm just asking what the basis is for that belief, and what those who believe it think they know that the likes of Jim Keller does not.

If it's an argument about something other than details of instruction set implementation (e.g., economics or process technology) then that would be something. That is exactly how Intel beat the old minicomputer companies' RISCs despite having "x86 tax", is that they got revenues to fund better process technology and bigger logic design teams. Although that's harder to apply to Apple vs AMD/Intel because x86 PC and server units and revenues are also huge, and TSMC gives at least AMD a common manufacturing base even if Apple is able to pay for some advantage there.

TomVDB · on Oct 26, 2021

Yes, but people aren't impressed by the M1 chips because of 5% KPI differences.

entee · on Sept 26, 2021

Most of the methane comes from cattle belching, so it's basically impossible to harvest at scale. The lagoons of manure also do, and that can be harvested though it's hard.

https://climate.nasa.gov/faq/33/which-is-a-bigger-methane-so...

entrep · on Sept 26, 2021

I came up with this solution: https://i.imgur.com/8BNktVY.png

xchaotic · on Sept 26, 2021

I nominated you for IgNobel 2022