[Paper] Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in SOTA Large Language Models

“Alice has N brothers and she also has M sisters. How many sisters does Alice’s brother have?”

The problem has a light quiz style and is arguably no challenge for most adult humans, and probably not even for some children.

The scientists posed varying versions of this simple problem to various state-of-the-art LLMs that claim strong reasoning capabilities (GPT-3.5/4/4o, Claude 3 Opus, Gemini, Llama 2/3, Mistral and Mixtral, including the very recent Dbrx and Command R+).

They observed a strong collapse of reasoning and an inability to answer the simple question as formulated above across most of the tested models, despite their claimed strong reasoning capabilities. Notable exceptions are Claude 3 Opus and GPT-4, which occasionally manage to provide correct responses.

This breakdown can be considered dramatic not only because it happens on such a seemingly simple problem, but also because the models tend to express strong overconfidence in reporting their wrong solutions as correct. They often provide confabulations to further justify the final answer, mimicking a reasoning-like tone while offering nonsensical arguments as backup for the equally nonsensical, wrong final answers.
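
For concreteness, here is a minimal sketch of what such an evaluation loop could look like. The prompt template and the expected answer M + 1 (Alice’s M sisters plus Alice herself) follow the problem as stated above; query_model is a hypothetical stand-in for whatever API a given model exposes, and the answer check is deliberately naive. This is only meant to illustrate the shape of such a test, not the paper’s actual evaluation pipeline.

```python
# Minimal sketch of an AIW-style evaluation loop.
# query_model() is a hypothetical stand-in for an actual LLM API call.
import random
import re

PROMPT = ("Alice has {n} brothers and she also has {m} sisters. "
          "How many sisters does Alice's brother have?")

def expected_answer(n: int, m: int) -> int:
    # Each brother has Alice's M sisters plus Alice herself.
    return m + 1

def is_correct(reply: str, n: int, m: int) -> bool:
    # Naive check: does the reply mention the correct number at all?
    numbers = [int(x) for x in re.findall(r"\d+", reply)]
    return expected_answer(n, m) in numbers

def run_trials(query_model, trials: int = 20) -> float:
    correct = 0
    for _ in range(trials):
        n, m = random.randint(1, 6), random.randint(1, 6)
        correct += is_correct(query_model(PROMPT.format(n=n, m=m)), n, m)
    return correct / trials
```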

  • rufus@discuss.tchncs.deOP

    I’m currently reading the paper. I occasionally debate here on Lemmy whether LLMs are just stochastic parrots, or if they actually grasp the concepts they’re talking about. There’s also evidence for the latter.

    Ultimately I wonder if and when we’ll get LLMs that address ‘hallucinations’ and expose a setting to adjust the factuality of an answer. I suppose that information is somewhere in the model, or is at least possible for the model to learn, but it’s certainly not controlled or factored in for the current generation of LLMs.

    • Audalin@lemmy.world

      Don’t know much of the stochastic parrot debate. Is my position a common one?

      In my understanding, current language models don’t have any understanding or reflection, but the probabilistic distributions of the languages that they learn do - at least to some extent. In this sense, there’s some intelligence inherently associated with language itself, and language models are just tools that help us see more aspects of nature than we could before, like X-rays or sonar, except that this part of nature is a bit closer to the world of ideas.

      • barsquid@lemmy.world

        I don’t know about common, but you and I agree on a lot. LLMs are not a breakthrough in artificial cognition; they’re more like a breakthrough in linguistics: it turns out coherent English can be produced by unexpectedly small mathematical structures. It’s hubris on our part to imagine that human language is more complex than it is, or that our ideas are more unique than they are.

      • rufus@discuss.tchncs.deOP

        Well, I’d say there is information in language. That’s kinda the point of it and why we use it. And language is powerful. We can describe and talk about a lot of things. (And it’s an interesting question what cannot be described with language.)

        I don’t think the stochastic parrot thing is a proper debate. It’s just that lots of people don’t know what AI is and what it can and cannot do. And it’s neither easy to understand, nor are the consequences always that obvious.

        Training LLMs involves some clever trickery, like limiting their size, so they can’t just memorize everything but are instead forced to learn the concepts behind those texts.
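
        As a rough illustration of the ‘can’t just memorize everything’ point, here is a back-of-envelope comparison of weight size versus training-corpus size. The Llama 3 8B figures (8B parameters, roughly 15T training tokens) are the publicly reported ones; the bytes-per-parameter and bytes-per-token values are assumptions for the estimate.

        ```python
        # Back-of-envelope: can an 8B-parameter model store its training text verbatim?
        params = 8e9            # Llama 3 8B parameter count
        bytes_per_param = 2     # assuming bf16 weights
        tokens = 15e12          # reported pretraining tokens (~15T)
        bytes_per_token = 4     # assuming roughly 4 characters of text per token

        model_bytes = params * bytes_per_param    # ~16 GB of weights
        corpus_bytes = tokens * bytes_per_token   # ~60 TB of raw text

        print(f"weights: {model_bytes / 1e9:.0f} GB")
        print(f"corpus:  {corpus_bytes / 1e12:.0f} TB")
        print(f"corpus is ~{corpus_bytes / model_bytes:.0f}x larger than the weights")
        ```

        Whatever the exact numbers, the gap is large enough that the model has to compress, i.e. generalize, rather than store the texts.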

        I think they form models of the world inside of them, at least of the things they’ve learned from the dataset. That’s why they can, for example, translate text: they have some concept of a cat stored inside of them and can apply it to a different language that uses entirely different characters to name that animal.
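
        One way to make that ‘same concept across languages’ idea concrete is to look at multilingual embeddings, where translations of a word tend to land close together. Here is a small sketch using the sentence-transformers library; the specific model name is an assumption, and any multilingual embedding model would do.

        ```python
        # Sketch: translations of "cat" cluster together in a shared embedding space,
        # while an unrelated word does not. The model choice is illustrative.
        from sentence_transformers import SentenceTransformer, util

        model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

        words = ["cat", "Katze", "chat", "gato", "猫", "bicycle"]
        embeddings = model.encode(words, convert_to_tensor=True)

        # Cosine similarity of every word against English "cat".
        scores = util.cos_sim(embeddings[0], embeddings)[0]
        for word, score in zip(words, scores.tolist()):
            print(f"{word:10s} {score:.2f}")
        ```

        This uses a dedicated embedding model rather than an LLM, but it illustrates the same kind of shared internal representation of a concept across different surface forms.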

        I wouldn’t say they are “tools to learn more aspects about nature”. They aren’t a sensor or something. And they can infer things, but not ‘measure’ things like an X-ray.