Want to wade into the snowy surf of the abyss? Have a sneer percolating in your system but not enough time/energy to make a whole post about it? Go forth and be mid.
Welcome to the Stubsack, your first port of call for learning fresh Awful you'll near-instantly regret.
Any awful.systems sub may be subsneered in this subthread, techtakes or no.
If your sneer seems higher quality than you thought, feel free to cut'n'paste it into its own post - there's no quota for posting and the bar really isn't that high.
The post-Xitter web has spawned so many "esoteric" right wing freaks, but there's no appropriate sneer-space for them. I'm talking redscare-ish, reality-challenged "culture critics" who write about everything but understand nothing. I'm talking about reply-guys who make the same 6 tweets about the same 3 subjects. They're inescapable at this point, yet I don't see them mocked (as much as they should be).
Like, there was one dude a while back who insisted that women couldn't be surgeons because they didn't believe in the moon or in stars? I think each and every one of these guys is uniquely fucked up and if I can't escape them, I would love to sneer at them.
(Credit and/or blame to David Gerard for starting this. If you're wondering why this went up late, I was doing other shit.)
(EDIT: Changed "29th February" to "1st March" - it's not a leap year)


the metr graph has gotten weird https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/ the 50% success rate graph went from 6 hours to 14 hours, but the 80% success rate graph only went from 55 minutes to 1 hour and 3 minutes. I have an itch that it's a fluke or outlier but it's also very possible that LLM coding's just weird like that
You're giving them too much credit. The entire methodology of "determine how long it takes humans to do a task and use that as a proxy for difficulty" was somewhat abstract and questionable in the first place, but with good rigorous implementation, it might have still been worthwhile.
However, their actual methodology is awful. Most of their tasks only have 3 or so human attempts to create a baseline (from a relatively small pool of baseliners), and for longer tasks, they went entirely with a guesstimate of task completion time. The error bars they show are just for the model trying to do the task (and they are already absurdly big, especially for this most recent jump). If you added in error bars accounting for variability in the task baseline itself, they would get even bigger.
This blog post goes into more detail on the problems with their methodology: https://arachnemag.substack.com/p/the-metr-graph-is-hot-garbage
If the numerous problems resulted in a systematic bias in task time estimation, linear improvement could easily look exponential. To give a simple example of how that is possible: suppose they had 5 tasks with true baselines (putting aside questions of methodology validity such that "true" is even meaningful) of 15 minutes, 30 minutes, 45 minutes, 1 hour, and 1 hour 15 minutes. If flaws with the human baseliners (for example, lacking specialized skills for longer tasks, phoning it in because they are paid by the hour, METR guesstimating the task time) instead gave them numbers of 15 minutes, 1 hour, 2 hours, 4 hours, and 8 hours for those 5 tasks, then successive improvements to get to 50% success on each task would look exponential even though they are actually linear improvements.
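The arithmetic in that example can be checked with a quick sketch. The numbers are the made-up ones from the example, not METR data, and the variable and function names are just for illustration:

```python
# Hypothetical numbers from the example above, in minutes. Not real METR data.
true_minutes = [15, 30, 45, 60, 75]          # assumed "true" task lengths
measured_minutes = [15, 60, 120, 240, 480]   # assumed biased baseline estimates


def step_differences(values):
    """Additive gap between each task and the next one."""
    return [b - a for a, b in zip(values, values[1:])]


def step_ratios(values):
    """Multiplicative gap between each task and the next one."""
    return [b / a for a, b in zip(values, values[1:])]


# Each newly solved task is a constant +15 minutes harder in "true" terms...
true_steps = step_differences(true_minutes)

# ...but on the biased scale each step multiplies the time horizon.
measured_ratios = step_ratios(measured_minutes)

print("true steps (minutes):", true_steps)
print("measured ratios:", measured_ratios)
```

A model that clears one more task per release is improving by a constant 15 minutes each time on the true scale, but on the biased scale its time horizon at least doubles with every release.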
METR maybe deserves a tiny bit of credit for trying something even vaguely related to practically meaningful tasks (compared to all the completely irrelevant bs benchmarks that would be worthless even if they were accurate). But I wouldn't give them any more credit than that, it's just that the bar is so low.
Broke: The METR studies are the best research on impacts of AI productivity available today.
Woke: The METR studies are hot garbage.
Bespoke: Both. It's both.
That's a great summary and an accurate indictment of the "study" of LLMs.
Doing what METR tried to do right would in fact be really expensive and hard, but for something that the fate of the world allegedly depends on (according to both boosters and doomers) you'd think they would manage to find the money for it. But the LLM companies don't actually want accurate numbers, they want hype.
oh yeah I 100% agree that their methodology is flawed, and that blog does a pretty good job of outlining the issues. I just thought the absolutely huge gap was both interesting and funny. Their absolutely huge error bars are not a good sign; between that and the gap, it really feels like someone screwed up