American nonprofit OCLC is known globally for its leading database of bibliographic records, WorldCat. A few months ago, many of these records were posted publicly by the shadow library search engine, Anna’s Archive. OCLC believes that this is the result of a year-long hack and, with a lawsuit filed at an Ohio federal court, it demands damages.

WorldCat Sues Anna’s Archive

It is no secret that publishers fiercely oppose the search engine’s stated goals. The same also applies to OCLC, which has now elevated its concerns into a full-blown lawsuit, filed this month at a federal court in Ohio.

The complaint accuses Washington citizen Maria Dolores Anasztasia Matienzo and several “John Does” of operating the search engine and scraping WorldCat data. OCLC equates the scraping to a cyberattack and says it started around the time Anna’s Archive launched.

“Beginning in the fall of 2022, OCLC began experiencing cyberattacks on WorldCat.org and OCLC’s servers that significantly affected the speed and operations of WorldCat.org, other OCLC products and services, and OCLC’s servers and network infrastructure,” OCLC’s complaint notes.

“These attacks continued throughout the following year, forcing OCLC to devote significant time and resources toward non-routine network infrastructure enhancements, maintenance, and troubleshooting.”

The nonprofit says that it spent roughly $68 million over the past two years developing and enhancing WorldCat records, which are an essential part of its operation. It argues that having a copy of the data publicly available through Anna’s Archive is a direct threat to its business.

OCLC claims that Anna’s Archive unmasked itself as the “perpetrator of the attacks on WorldCat.org” when it publicly announced its scraping effort. This includes a detailed blog post the operators published on the matter, encouraging the public to use the scraped data.

In addition to harvesting data from WorldCat.org, the defendants are also accused of obtaining and using credentials of a member library to access WorldCat Discovery Services. This opened the door to yet more detailed records that are not available on WorldCat.org.

OCLC says that it spent significant time and resources to address the ‘attacks’ on its systems.

“These hacking attacks materially affected OCLC’s production systems and servers, requiring around-the-clock efforts from November 2022 to March 2023 to attempt to limit service outages and maintain the production systems’ performance for customers.

“To respond to these ongoing attacks, OCLC spent over 1.4 million dollars on its systems’ infrastructure and devoted nearly 10,000 employee hours to the same,” the complaint adds.

    • body_by_make@lemmy.dbzer0.com
      8 months ago

      Yes, let only the rich control your thoughts.

      I won’t be surprised if this gets downvoted here. I’m as much of a pirate as anyone, but news needs to be paid for, or only people who can afford to control the news without income will control the news.

      • ShepherdPie@midwest.social
        8 months ago

        Not saying you’re right or wrong, but paid news has been the model for quite a while now, and that has resulted in 24-hour talking heads on TV, paid stories, clickbait, and people resorting to word of mouth on places like Facebook for all their news. It’s not as if the current trajectory is any better than your hypothetical one.

  • dangblingus@lemmy.dbzer0.com
    8 months ago

    “Having a copy of the data publicly available through Anna’s Archive is a direct threat to its business.”

    How would it hurt WorldCat’s business given that the service they offer is free? If the information, that being the location of books and articles in specific libraries around the USA, was freely available on another site, what value has been lost?

    • RogueBanana@lemmy.zip
      8 months ago

      Just going by Anna’s blog post, their business model seems to be trading information, i.e. sharing the full database of hundreds of millions of records in exchange for their members’ own records, so the list keeps growing as more members join. That said, I don’t see why they need a monopoly on that information, given that any other library would still continue working with them for their free streamlined process. There could be more to it, but it feels like they are wasting resources on this instead of putting them into things that actually matter.

      Edit: Also, I don’t think they scraped or have information about the members, like the location of each book, just the metadata, so it really seems harmless to me.

  • ancuuiqter@lemmy.world (OP)
    8 months ago

    The official Anna’s Archive Reddit account, AnnaArchivist, has responded to an r/Annas_Archive post linking the same TorrentFreak article:

    “Thanks! We’re not making any public statements about this lawsuit but rest assured we’re fine.”

    • Darkassassin07@lemmy.ca
      8 months ago

      Gotta wonder what their plan is. The lawsuit was an obvious outcome, and they haven’t exactly made much effort to make their actions appear legal.

      I don’t see AA winning this one. Data’s out there though; no taking that back. Maybe they’ve just accepted the consequences… A martyr as it were.

  • MotoAsh@lemmy.world
    8 months ago

    I mean… it’ll all come down to how they accessed the data. If OCLC had a public portal and no EULA, OCLC can push rocks. If the data wasn’t public, or the ‘thieves’ had to use non-standard channels or otherwise violated a EULA, they’re likely screwed. Especially if they had to go through abnormal channels.

    I know their data can be accessed publicly, but I’m pretty sure it’s under license. You cannot just use any old thing found in public… That’s the biggest reason the AI models are technically theft: they weren’t licensed to commercially profit off of 99.99% of the things their LLMs are trained on, but the law and politicians are WAY behind the times. Commercial data they’d normally have to pay for is suddenly magically OK when laundered through an LLM…

    • Dkarma@lemmy.world
      8 months ago

      “AI models are technically theft: they weren’t licensed to commercially profit off of 99.99%”

      This is simply a lie. There is no license like what you describe. You never need a license to view or learn from something given away completely free on the internet. You guys keep pretending there’s a law that says otherwise. There is not, or you’d post it.

      Copyright does not cover viewing or experiencing a piece.

      • MotoAsh@lemmy.world
        8 months ago

        Notice how I said “commercially profit” too. Read all the words next time.

        Also LLMs do not “learn” anything, you idiot. That’s the entire point. They mathematically blender things. They DO NOT learn and create.

    • ancuuiqter@lemmy.world (OP)
      8 months ago

      As to how Anna’s Archive accomplished its data scraping, this is what OCLC is claiming (see pages 62-63):

      1. These attacks were accomplished with bots (automated software applications) that “scraped” and harvested data from WorldCat.org and other WorldCat®-based research sites and that called or pinged the server directly. These bots were initially masked to appear as legitimate search engine bots from Bing or Google.

      2. To scrape or harvest the data on WorldCat.org, the bots searched WorldCat.org results, running a script based on OCN for individual JavaScript Object Notation, or “JSON,” records. As a result, WorldCat® data including freely accessible and enriched data, such as OCNs, were scraped from individual results on WorldCat.org.

      3. The bots also harvested data from WorldCat.org by pretending to be an internet browser, directly calling or “pinging” OCLC’s servers, and bypassing the search, or user interface, of WorldCat.org. More robust WorldCat® data was harvested directly from OCLC’s servers, including enriched data not available through the WorldCat.org user interface.

      4. Finally, WorldCat® data was harvested from a member’s website incorporating WorldCat® Discovery Services, a subscription-based variation of WorldCat.org that is available only to a member’s patrons. Again, the hacker pinged OCLC’s servers to harvest WorldCat® records directly from the servers. To do this through WorldCat® Discovery Services/FirstSearch, the hacker obtained and used the member’s credentials to authenticate the requests to the server as a member library.

      5. From WorldCat® Discovery Services, hackers harvested 2 million richer WorldCat® records that included data not available in WorldCat.org. This hacking method resulted in the harvesting of some of OCLC’s most proprietary fields of WorldCat® data.

      6. These hacking attacks materially affected OCLC’s production systems and servers, requiring around-the-clock efforts from November 2022 to March 2023 to attempt to limit service outages and maintain the production systems’ performance for customers. To respond to these ongoing attacks, OCLC spent over 1.4 million dollars on its systems’ infrastructure and devoted nearly 10,000 employee hours to the same.

      7. Despite OCLC’s best efforts, OCLC’s customers experienced many significant disruptions in paid services during the aforementioned period as a result of the attacks on WorldCat.org, requiring OCLC to create system workarounds to ensure services functioned.

      8. During this time, customers threatened and likely did cancel their products and services with OCLC due to these disruptions.

      9. Because OCLC had to combat these persistent hacking attacks, OCLC was forced to divert existing personnel and resources from OCLC’s other products and services. As a result, OCLC’s development and improvements to other products and services were delayed and limited.

      10. OCLC has devoted, at various times, ten or more employees to respond to and mitigate the harm from these attacks from October 2022 to present.