AI chatbots unable to accurately summarise news, BBC finds - eviltoast
  • MoonlightFox@lemmy.world
    link
    fedilink
    English
    arrow-up
    1
    ·
    9 hours ago

    So there is not any trustworthy benchmarks I can currently use to evaluate? That in combination with my personal anecdotes is how I have been evaluating them.

    I was pretty impressed with Deepseek R1. I used their app, but not for anything sensitive.

    I don’t like that OpenAI defaults to a model I can’t pick. I have to select it each time, even when I use a special URL it will change after the first request

    I am having a hard time deciding which models to use besides a random mix between o3-mini-high, o1, Sonnet 3.5 and Gemini 2 Flash

    • brucethemoose@lemmy.world
      link
      fedilink
      English
      arrow-up
      1
      ·
      7 hours ago

      Heh, only obscure ones that they can’t game, and only if they fit your use case. One example is the ones in EQ bench: https://eqbench.com/

      …And again, the best mix of models depends on your use case.

      I can suggest using something like Open Web UI with APIs instead of native apps. It gives you a lot more control, more powerful tooling to work with, and the ability to easily select and switch between models.