cross-posted from: https://lemmy.ml/post/5400607

This is a classic case of tragedy of the commons, where a common resource is harmed by the profit interests of individuals. The traditional example of this is a public field that cattle can graze upon. Without any limits, individual cattle owners have an incentive to overgraze the land, destroying its value to everybody.

We have commons on the internet, too. Despite all of its toxic corners, it is still full of vibrant portions that serve the public good — places like Wikipedia and Reddit forums, where volunteers often share knowledge in good faith and work hard to keep bad actors at bay.

But these commons are now being overgrazed by rapacious tech companies that seek to feed all of the human wisdom, expertise, humor, anecdotes and advice they find in these places into their for-profit A.I. systems.

  • Pantoffel@feddit.de
    English · 71 up, 1 down · 9 months ago

    I don’t think the issue is corps feeding the internet into AI systems. The real issue is the gatekeeping of information: access is granted only while milking the individual for data through trackers, money through subscriptions, and more money through ads (that we pay for with subscriptions).

    Another, larger issue that I fear is often ignored is the amount of control large corporations and, in theory, the government can have over us just by looking at the traces we leave on the internet. Just have a look at Russia and China for real-world examples of this.

    • kibiz0r@midwest.social
      English · 25 up, 1 down · 9 months ago

      As an open source contributor, I believe information (facts and techniques) should be free.

      As an open source contributor, I also know that two-way collaboration only happens when users understand where the software came from and how they can communicate back to the original author(s).

      The layer of obfuscation that LLMs add, where the code is really from XYZ open-source project, but appears to be manifesting from thin air… worries me, because it’s going to alienate would-be collaborators from the original authors.

      “AI” companies are not freeing information. They are colonizing it.

      • FaceDeer@kbin.social
        6 up, 5 down · 9 months ago

        The code that AI produces isn’t “copied” from those original authors, though. The AI learned how to code from them; it isn’t literally copying and pasting from them.

        If you think a bit of code is “really from” XYZ open-source project, that’s a copyright violation and you can pursue that legally. But you’ll need to actually show that the code is a copy.

        • kibiz0r@midwest.social
          English · 2 up · 9 months ago

          Your justification seems to rest on whether LLM training technically passes the legal standard of violating IP.

          That’s not a super compelling argument to me, because:

          1. Nobody designed current IP law with LLMs in mind
          2. I would wager that a vast majority of creators whose works were consumed by LLMs did not consider whether their license would permit such an act, and thus didn’t meaningfully consent to have their work used this way (whether or not the law would agree)
          3. I would argue that IP law is heavily stacked in favor of platforms (who own IP, but do not create it) and against creators (who create, but do not own IP) and consumers

          I don’t think that there is fundamentally anything wrong with LLMs as a technology. My problem is that the economic incentives are misaligned with long-term stability of the creative pools that fuel these things in the first place.

          • FaceDeer@kbin.social
            2 up, 1 down · 9 months ago

            > Your justification seems to rest on whether LLM training technically passes the legal standard of violating IP.

            That’s basically all that I’m talking about here, yeah. I’m saying that the current laws don’t appear to say anything against training AIs on public data. The AI model is not a copy of that data, nor is its output.

            > Nobody designed current IP law with LLMs in mind

            Indeed. Things are not illegal by default; there needs to be a law or some sort of precedent that makes them illegal. In the realm of LLMs that’s very sparse right now, for exactly the reason you say. Nobody anticipated it, so nobody wrote any laws forbidding it.

            > I would wager that a vast majority of creators whose works were consumed by LLMs did not consider whether their license would permit such an act, and thus didn’t meaningfully consent to have their work used this way (whether or not the law would agree)

            There are things that you can use intellectual property for that do not require consent in the first place. Fair use describes various categories of that. If it’s not illegal to use copyrighted material without permission when training AIs, why would it matter whether the license permitted it or the author consented to it?

            > I would argue that IP law is heavily stacked in favor of platforms (who own IP, but do not create it) and against creators (who create, but do not own IP) and consumers

            Wouldn’t requiring licensing of data for the training of LLMs stack things even more in the favour of big IP-owning platforms?

            Again, as I said before, if you think some specific bit of LLM output is violating the copyright of some code you wrote, there are already laws in place specifically covering that situation. You can go to court and show that the two pieces of code are substantially identical and sue for damages or whatever. The AI model itself is another matter, though, and I doubt any current laws would count it as a “copy” of the data that went into training it.

        • NeoNachtwaechter@lemmy.world
          English · 2 up, 5 down · 9 months ago

          The copyright violation happened when the code got fed into that AI’s greedy gullet, not when it came out of its rear end.

          • FaceDeer@kbin.social
            7 up, 1 down · 9 months ago

            That remains to be tested, legally speaking, and I don’t think it’s likely to pass muster. If it was trained correctly (i.e., no overfitting), the resulting AI model does not contain a copy of the training inputs in any identifiable sense.

            • NeoNachtwaechter@lemmy.world
              English · 1 up, 1 down · 9 months ago

              Yes, the laws are probably muddy in the USA as usual, but rather clear here in the EU. But legal proceedings are slow, and Big Tech is making haste with their feeding.

              • FaceDeer@kbin.social
                1 up, 1 down · 9 months ago

                There are many jurisdictions beyond the US and EU; Japan in particular has been very vocal about going all-in on allowing AI training. And I wouldn’t say the EU’s laws are “clear” until they are actually tested.

      • Meowoem@sh.itjust.works
        English · 5 up, 5 down · 9 months ago

        My open source project benefits hugely from the free-to-access LLM coding tools available. That’s a far bigger positive than the abstract fear that someone might feel alienated because the guy copy-pasting their code doesn’t know who he’s copying from.

        And yes, obviously the LLM isn’t copying code; it’s learning from a huge range of sources and combining them to make exactly what you ask for (well, not exactly, but with some needling it gets there eventually). But even if it were, that’s still not disrupting collaboration, because that’s not how collaboration works. No one says, ‘Instead of coding all the boring elif statements required for my function determining if something is a prime, I’ll search code snippets and collaborate with them.’ Every worthwhile collaborator to my project has been an active user of the software who wanted to help improve it or add functions. AI won’t change that, and if it does, it’ll only be because it makes coding so easy I don’t need collaborators.

    • OutlierBlue@lemmy.ca
      English · 11 up · 9 months ago

      Yep, the truly free and open internet is coming to an end. Corporations and governments have spent decades trying to claim control over it, and they’re nearly there.

      • FaceDeer@kbin.social
        10 up · 9 months ago

        Which, ironically, will be greatly expedited by the drive to prohibit AI from learning from “unlicensed” materials. That will guarantee that the only AIs with a broad training set will be those owned by corporations that already control an enormous amount of training materials (Disney, Getty Images, etc.)

        • Ryantific_theory@lemmy.world
          English · 3 up · 9 months ago

          Yeah, right now the fight is between corporations and creators, but I feel like the future battle is going to be between corporate AIs and “pirated” ones, because Disney is going to keep a firm chokehold over what its generative AI can make, while the community ones will completely ignore copyright restrictions and just let people do whatever they want.

          Not gonna need to worry about paywalls when you can get a pirated generative AI to create the superhero mashup you always wanted to watch as a child. That said, I could definitely see Disney and others piggybacking off of AI panic to extend copyright protection into spaces that were previously fair use.

        • Pantoffel@feddit.de
          English · 1 up · 9 months ago

          A factor I didn’t consider. Thanks. And there I thought that, given the hardware requirements, it would be relatively easy to build such LLMs, or something similar, FOSS-style.