OpenAI’s data scraping wins big as Raw Story’s copyright lawsuit dismissed by NY court

The Southern District of New York has dismissed a copyright violation lawsuit brought by Raw Story Media, Inc. and AlterNet Media, Inc., alternative left-leaning online news outlets, against OpenAI, effectively shutting down claims that the generative AI firm violated copyrights by using scraped news content in its training data.

This dismissal could be seen as an important moment in the ongoing battle over copyright and AI tools—particularly under Section 1202(b) of the Digital Millennium Copyright Act (DMCA)—but it is worth noting that other cases have also failed to establish successful claims under this provision.

Let’s dive into what happened, why the judge dismissed the case, and what this means for the future of AI, copyright and the legality of tech companies scraping content off the web without the creators’ express permission or compensation.

Understanding the DMCA’s Section 1202(b)

The lawsuit revolved around Section 1202(b) of the DMCA, a provision that aims to protect “copyright management information” (CMI).

This includes any author names, titles, and other metadata that identify copyrighted works. Section 1202(b) prohibits the removal or alteration of such information without authorization, especially if doing so facilitates copyright infringement.

In this case, Raw Story and AlterNet alleged that OpenAI used articles from their websites for training ChatGPT and other models without preserving CMI, violating Section 1202(b).

OpenAI is not the only AI company likely to have scraped such material from the web. While AI model providers tend to closely guard their training datasets, the industry at large has undoubtedly scraped large swaths of the web to train its various models (a practice similar to what Google did to crawl and index search results in its main search engine product). In this way, some creators view data scraping as akin to AI’s “original sin.”

Raw Story and AlterNet also claimed that OpenAI’s AI outputs (responses generated by the models) were sometimes based on their articles, and that the company knowingly violated copyright after the CMI was removed.

Why the court dismissed Raw Story’s claims

Judge Colleen McMahon granted OpenAI’s motion to dismiss the case on grounds of lack of standing. Specifically, the judge found that the plaintiffs couldn’t demonstrate that they suffered a concrete, actual injury from OpenAI’s actions—an essential requirement under Article III of the U.S. Constitution for any lawsuit to proceed.

Judge McMahon also considered the evolving landscape of large language model (LLM) interfaces, noting that updates to these systems further complicate attribution and traceability. She emphasized that generative AI’s iterative improvements make it less likely that content will be reproduced verbatim, making the plaintiffs’ claims even more speculative.

The judge noted that “the likelihood that ChatGPT would output plagiarized content from one of Plaintiffs’ articles seems remote.” This reflects a key difficulty in these types of cases: generative AI is designed to synthesize information rather than replicate it verbatim. The plaintiffs failed to present convincing evidence that their specific works were directly infringed in a way that led to identifiable harm.

The ruling aligns with similar cases where courts have struggled to apply traditional copyright law to generative AI. For example, the Doe 1 v. GitHub case involving Microsoft’s Copilot also dealt with claims under Section 1202(b). There, the court found that the code generated by Copilot wasn’t an “identical copy” of the original, but rather snippets that were reconfigured, making it difficult to prove the violation of CMI requirements.

A growing divide on Section 1202(b)

The Raw Story decision highlights the broader uncertainties courts are facing regarding Section 1202(b), especially with generative AI.

There is currently no firm consensus on how Section 1202(b) applies to a wide swath of online content. In one corner, some courts have imposed what’s called an “identicality” requirement—meaning plaintiffs must prove that the infringing works are an exact copy of the original content, minus CMI. Others, however, have allowed for more flexible interpretations.

For instance, a court in the Southern District of Texas recently rejected the identicality requirement, stating that even partial reproductions could qualify as violations if CMI is deliberately removed.

Meanwhile, in the lawsuit brought by Sarah Silverman and a collection of authors, the court held that the plaintiffs failed to show sufficient evidence that OpenAI had actively removed CMI from their content. That ruling, much like the one in Raw Story’s case, underscores the evidentiary burden plaintiffs face.

As explained by Maria Crusey in a piece for the Authors Alliance, “The uptick in §1202(b) claims raises challenging questions, namely: How does §1202(b) apply to the use of a copyrighted work as part of a dataset that must be cleaned, restructured, and processed in ways that separate copyright management information from the content itself?”

Why this ruling matters for AI and content creators

The dismissal of Raw Story’s lawsuit is more than a win for OpenAI. It is also an indicator of how courts may handle similar copyright claims in the rapidly evolving landscape of generative AI. With OpenAI and its investor Microsoft currently defending against a similar lawsuit filed by The New York Times, the ruling could help the companies argue for dismissal of that and future claims.

Indeed, the ruling suggests that without clear, demonstrable harm or exact reproduction, plaintiffs may struggle to get their day in court.

Judge McMahon’s ruling also touches on a broader point about how AI synthesizes data versus directly replicating it. OpenAI’s ChatGPT doesn’t directly recall articles from Raw Story—it instead uses training data to produce novel outputs that resemble human writing. This makes proving violations under current copyright laws inherently difficult.

For content creators, this raises a significant challenge: how to ensure proper credit and prevent unauthorized use of their work in training datasets. Licensing agreements like the ones OpenAI has struck with large news publishers such as Vogue and Wired owner Condé Nast could become a new standard, giving companies a way to legally use copyrighted content while compensating its creators.

Between a bot and a hard place

Courts are still figuring out how to handle generative AI, and recent rulings suggest they’re reluctant to extend Section 1202(b) protections unless plaintiffs show real, specific harm. AI-generated content synthesizes rather than replicates, making it tough to prove copyright violations.

For plaintiffs, this means proving harm is an uphill battle. Courts are signaling that vague claims aren’t enough—plaintiffs need hard evidence of damage. For developers and tech companies, even if the odds seem favorable, no one wants a lawsuit. Transparency, data records, and compliance are essential to avoid legal trouble.

Judge McMahon noted the case could be refiled (“together with an explanation of why the proposed amendment would not be futile,” she wrote), but significant obstacles remain.
