The AI bill Newsom didn’t veto — AI devs must list models’ training data - eviltoast
  • OhNoMoreLemmy@lemmy.ml
    20 upvotes · 1 month ago

    The other reason they don’t do it is that many models are trained on a large corpus of pirated texts, and documenting this would be a confession.

    Not just in an ‘I scraped the New York Times without permission’ kind of way, but in an ‘I illegally downloaded a torrent containing bestsellers from the last 30 years’ kind of way.

    • Tar_Alcaran@sh.itjust.works
      11 upvotes · 1 month ago

      Exactly. It’s not that they can’t, or that it’s too expensive; it’s that doing so would reveal their crimes.

      • imadabouzu@awful.systems
        9 upvotes · 1 month ago

        In a sense, to me, it is the same thing. If your business is built upon indiscriminately repurposing everyone else’s inputs to your benefit and their detriment, it is too expensive to reveal that simple truth.

    • Soyweiser@awful.systems
      3 upvotes · 1 month ago

      Bestsellers? There used to be torrents of basically all releases. My provider blocks torrent sites and I don’t use a VPN, so I’m not sure if people still do this, but it was possible to download basically all books (in English) released in a certain period at once.