Stubsack: weekly thread for sneers not worth an entire post, week ending 1st March 2026

BlueMonday1984@awful.systems · 29 days ago (edited)


lurker@awful.systems · 1 month ago

the metr graph has gotten weird (https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/): the 50% success rate graph went from 6 hours to 14 hours, but the 80% success rate graph only went from 55 minutes to 1 hour and 3 minutes. I have an itch that it’s a fluke or outlier, but it’s also very possible that LLM coding’s just weird like that

scruiser@awful.systems · 1 month ago (edited)

You’re giving them too much credit. The entire methodology of “determine how long it takes humans to do a task and use that as a proxy for difficulty” was somewhat abstract and questionable in the first place, but with a good, rigorous implementation, it might have still been worthwhile.

However, their actual methodology is awful. Most of their tasks only have 3 or so human attempts to create a baseline (from a relatively small pool of baseliners), and for longer tasks, they went entirely with a guesstimate of task completion time. The error bars they show are just for the model trying to do the task (and they are already absurdly big, especially for this most recent jump). If you added in error bars accounting for variability in the task baseline itself, the error bars would get even bigger.
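A rough sketch of why the bars would grow, assuming the two error sources are independent so that they add in quadrature. The numbers are made up for illustration and are not METR's, and `model_se` and `baseline_se` are hypothetical names:

```python
import math

# Hypothetical standard errors in log2(minutes); NOT taken from METR's data.
model_se = 0.8     # uncertainty from the model's successes/failures on tasks
baseline_se = 0.6  # uncertainty from having only ~3 human attempts per task

# Assumption: the two sources are independent, so variances add.
combined_se = math.sqrt(model_se**2 + baseline_se**2)

print(f"model only: +/-{model_se:.2f}, with baseline noise: +/-{combined_se:.2f}")
```

Under that assumption, the combined error can never be smaller than the model-only error, whatever the actual baseline noise turns out to be.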

This blog goes into more details explaining the nuances of the problems with their methodology: https://arachnemag.substack.com/p/the-metr-graph-is-hot-garbage

To give a simple example of how a systematic bias in task-time estimation could make linear improvement look exponential: suppose they had 5 tasks with true baselines (putting aside questions of methodology validity such that “true” is even meaningful) of 15 minutes, 30 minutes, 45 minutes, 1 hour, and 1 hour and 15 minutes. If flaws in the baselining (for example, baseliners lacking specialized skills for longer tasks or phoning it in because they are paid by the hour, or metr guesstimating the task time) instead gave them numbers for those 5 tasks of 15 minutes, 1 hour, 2 hours, 4 hours, and 8 hours, then successive improvements to get to 50% success on each task would look exponential even though they are actually linear improvements.
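Spelling out the arithmetic for those five tasks (a quick sketch; the variable names are just for illustration, the numbers are the ones from the example):

```python
# Task lengths in minutes, from the example above.
true_minutes = [15, 30, 45, 60, 75]          # hypothetical "true" baselines
measured_minutes = [15, 60, 120, 240, 480]   # hypothetical biased baselines

# Step between consecutive tasks on the true scale (additive growth).
true_steps = [b - a for a, b in zip(true_minutes, true_minutes[1:])]

# Ratio between consecutive tasks on the measured scale (multiplicative growth).
measured_ratios = [b / a for a, b in zip(measured_minutes, measured_minutes[1:])]

print(true_steps)       # [15, 15, 15, 15]
print(measured_ratios)  # [4.0, 2.0, 2.0, 2.0]
```

Each newly solved task is a constant 15 minutes harder on the true scale, but on the measured scale it looks like the time horizon is at least doubling every step.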

METR maybe deserves a tiny bit of credit for trying something even vaguely related to practically meaningful tasks (compared to all the completely irrelevant bs benchmarks that would be worthless even if they were accurate). But I wouldn’t give them any more credit than that; it’s just that the bar is so low.

JFranek@awful.systems · 1 month ago

Broke: The METR studies are the best research on impacts of AI productivity available today.

Woke: The METR studies are hot garbage.

Bespoke: Both. It’s both.

scruiser@awful.systems · 1 month ago

That’s a great summary and an accurate indictment of the “study” of LLMs.

scruiser@awful.systems · 1 month ago

Doing what METR tried to do, but doing it right, would in fact be really expensive and hard, but for something that the fate of the world allegedly depends on (according to both boosters and doomers), you’d think they would manage to find the money for it. But the LLM companies don’t actually want accurate numbers, they want hype.

lurker@awful.systems · 1 month ago (edited)

oh yeah I 100% agree that their methodology is flawed, and that blog does a pretty good job of outlining the issues. I just thought the absolutely huge gap was both interesting and funny. Their absolutely huge error bars are not a good sign. Between that and the gap, it really feels like someone screwed up