Google Researchers’ Attack Prompts ChatGPT to Reveal Its Training Data

ChatGPT is full of sensitive private information and spits out verbatim text from CNN, Goodreads, WordPress blogs, fandom wikis, Terms of Service agreements, Stack Overflow source code, Wikipedia pages, news blogs, random internet comments, and much more.

  • TWeaK@lemm.ee
    11 months ago

    information that it can index, tag and sort for keywords.

    The dataset ChatGPT uses to train on contains data copied unlawfully. They’re not just reading the data at its source, they’re copying the data into a training database without sufficient license.

    Whether ChatGPT itself contains all the works is debatable (is it just word relationships when the system can reproduce significant chunks of copyrighted data from those relationships?), but the process of training inherently requires unlicensed copying.

    In terms of fair use, they could argue a research exemption, but this isn’t really research, it’s product development. The database isn’t available as part of scientific research, it’s protected as a trade secret. Even if it was considered research, it absolutely is commercial in nature.

    In my opinion, there is a stronger argument that OpenAI have broken copyright for commercial gain than that they are legitimately performing fair use copying for the benefit of society.

    • Womble@lemmy.world
      11 months ago

      Every time you load a webpage you are making a local copy of it for your own use. If it is on the open web, you are implicitly given permission to make that copy. You are not given permission to then distribute those copies, which is where LLMs may get into trouble, but making a copy for the purpose of training is not a breach of copyright as far as I understand or have heard.

      • TWeaK@lemm.ee
        11 months ago

        Yes, you do make a copy of a web page. And every time you load a video game you make a copy into RAM. However, this copying is all permitted under user license - you’re allowed to make minor copies as part of the process of running the software and playing the media.

        Case in point, the UK courts ruled that playing pirated games was illegal, because when you load the game from a disc you copy it into RAM, and this copying is not licensed by the player.

        OpenAI does not have any license for copying into its database. The terms and conditions of web pages say you’re allowed to view them, not that you’re allowed to take the data and use it for other things. They don’t explicitly prohibit this (yet), but the lack of a prohibition does not mean a license is implied. OpenAI can only hope for a fair use exemption, and I don’t think they qualify because a) it isn’t really “research” but product development, and b) even if it is research, it is purely for commercial gain.

        • Womble@lemmy.world
          11 months ago

          Could you point to the judgement that playing copied games was illegal in the UK? I can only find articles specifically about DS copy cartridges, which are very obviously intended to make and use unlicensed copies of games to distribute.

          Even so, that again hinges on the right to distribute, not the right to make a copy for personal use. If a game is made freely available on the web for you to play, it is not illegal to download that game to play offline or to study it.

          • TWeaK@lemm.ee
            11 months ago

            I’d have to go digging, sorry I don’t have the time right now. It was to do with piracy on the OG X-Box. It wasn’t the main part of the case, just a tangential point inside the judge’s ruling.

            Downloading a game to play it would be copyright infringement, because downloading involves making a copy on your device. However, one copy really isn’t worth the hassle of claiming against, so it never happens. This is why all the Napster cases inflated the counts of infringement by including everyone you connected to, as if you had uploaded a complete copy to each of them: that’s the only way to make the claim worthwhile. Also, in the US uploading to someone carried punitive damages; similar cases didn’t work so well in the UK, where the claims were for actual damages.

            Downloading it to study is fair use under the research exemption, particularly if it’s a non-commercial activity.

            Copyright infringement happens all the time, but the vast majority of cases aren’t worth prosecuting, and there’s no penalty for a rights holder who chooses not to prosecute. Meanwhile, with trademarks, the rights holder absolutely can lose their rights if they don’t prosecute every infringement they become aware of.