The New York Times was one of the first media companies to sue OpenAI, alleging that OpenAI broke copyright law by using the Times’ content without permission. OpenAI’s crawler allegedly bypassed the Times’ paywalls to access articles, and OpenAI offered no compensation for the use of the Times’ work. The lawsuit does not propose a specific monetary settlement, but rather seeks an alternative way for OpenAI to compensate the Times for the damage done.
Several other news outlets have also sued OpenAI for accessing their content without permission or compensation.
How Can The New York Times Sue OpenAI?
Until recently, there was no way for websites to block OpenAI from crawling them. OpenAI began crawling websites for its AI models before ChatGPT was released, drawing much of its data from Common Crawl. OpenAI’s crawler now supports an opt-out mechanism: site operators must add a few lines to their site’s robots.txt file to prevent OpenAI’s bot, GPTBot, from indexing their pages in its database. However, OpenAI announced this mechanism only recently. Before that, websites and organizations could not prevent OpenAI from crawling their sites, or had too little information about OpenAI’s bot to block it from using their content.
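As an illustration, the sketch below uses Python’s standard urllib.robotparser module to check whether a site’s robots.txt permits GPTBot to crawl a given page. The example.com URLs are placeholders, not any site’s real policy.

```python
from urllib import robotparser

# Placeholder site used purely for illustration.
ROBOTS_URL = "https://example.com/robots.txt"

# A robots.txt that opts out of OpenAI's crawler contains:
#   User-agent: GPTBot
#   Disallow: /

parser = robotparser.RobotFileParser()
parser.set_url(ROBOTS_URL)
parser.read()  # fetch and parse the live robots.txt

# can_fetch() reports whether the named user agent may crawl the path.
page = "https://example.com/2023/12/some-article.html"
if parser.can_fetch("GPTBot", page):
    print("GPTBot is allowed to crawl this page.")
else:
    print("GPTBot is blocked by robots.txt.")
```

Crucially, robots.txt is honored voluntarily by crawlers; it offers no technical enforcement, which is part of why the opt-out arrived too late for content that had already been collected.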
Additionally, multiple chatbots have been found to incorrectly cite information from major news outlets, in effect “fabricating” claims and attributing them to real publications. This reportedly happened to The Denver Post, when an AI model falsely attributed to the paper the claim that smoking was beneficial for asthma. The spread of such misinformation can discredit news outlets in the eyes of the public, greatly damaging their reputations.
Is Common Crawl Responsible for Breaking Copyright Law?
As mentioned earlier, most of OpenAI’s datasets stem from a non-profit organization called Common Crawl, whose petabytes of web crawl data have arguably enabled the AI boom of recent years. Because an estimated 60–80% of the data behind OpenAI’s responses may come from Common Crawl, OpenAI may not be seen as solely responsible for breaking copyright law by using The New York Times’ articles.
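For context, Common Crawl publishes a public CDX index API that anyone can query to see whether a given URL was captured in a crawl. The sketch below, using only Python’s standard library, assumes the CC-MAIN-2023-50 snapshot name and a sample nytimes.com URL; both are illustrative.

```python
import json
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed snapshot name; Common Crawl publishes one index per crawl run.
INDEX = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"

def lookup(url: str) -> list[dict]:
    """Return index records for captures of `url` (empty if never crawled)."""
    query = urlencode({"url": url, "output": "json"})
    try:
        with urlopen(f"{INDEX}?{query}") as resp:
            # The API streams one JSON object per line, one per capture.
            return [json.loads(line) for line in resp.read().splitlines()]
    except HTTPError:
        return []  # the index answers 404 when there are no captures

for record in lookup("nytimes.com/section/technology"):
    print(record.get("timestamp"), record.get("status"), record.get("url"))
```

Records like these only show that a page was captured; whether a capture includes paywalled text, and whether training on it is lawful, is exactly what the lawsuit disputes.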
A Brief History of Web Crawling
Web crawling is the automated traversal of the internet to index its content, most commonly webpages; it is what builds the indexes behind search engines. Major technology companies including Google, Yahoo, and Baidu all use web crawling to power their technologies.
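To make the mechanics concrete, here is a minimal single-page crawler sketch in Python’s standard library: it fetches one page, extracts the links, and resolves them the way an indexer’s crawl frontier would. The seed URL is a placeholder.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags on a fetched page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

seed = "https://example.com/"  # placeholder seed URL
with urlopen(seed) as resp:
    html = resp.read().decode("utf-8", errors="replace")

extractor = LinkExtractor(seed)
extractor.feed(html)

# A real crawler would now deduplicate these links, check robots.txt,
# enqueue them for later visits, and store the page text for indexing.
print(f"Found {len(extractor.links)} links on {seed}")
```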
The Legality and Ethics of Web Crawling
Web crawling is generally legal, so long as how the crawled data is gathered and used does not cross legal or ethical lines. Even though OpenAI can be seen as using The New York Times’ data without permission, it can be argued that search engines index the very same data. At the same time, there are differences between how search engines and AI language models use web crawling:
- Search engines are built to help users navigate the internet. They do not use information from the internet without crediting the web pages it was found on.
- AI language models, by contrast, may reproduce content from the internet without correctly citing its source. Although many models now indicate where their information came from, some still list no sources at all, and even when sources are listed, the citations may be incorrect.
Analysis
Arguably, both search engines and AI language models profit from the content they index. While OpenAI charges ChatGPT users for access to its higher-end models, Google uses AdSense and other advertising platforms to profit from the content it indexes. OpenAI can be held responsible for crawling The New York Times’ articles without permission, at a time when its crawler was widely unknown to the public. And since ChatGPT turns those inputs into a product for paying users, OpenAI is putting The New York Times’ work to commercial use without legal permission. The New York Times allows commercial use only in specific circumstances; otherwise, its content is for personal, non-commercial use.
A Different View
If OpenAI is using publicly accessible information from The New York Times in its index, and OpenAI can claim that it did not pass through The New York Times’ paywalls, then it can be argued that OpenAI is merely a tool that draws on freely accessible information on the internet to formulate responses. Perhaps The New York Times’ paywall was ineffective at keeping crawlers away from its content, and the Times had not clarified the circumstances under which its content may be used commercially.
At the same time, OpenAI may be relying on data from Common Crawl, which would make Common Crawl responsible for indexing copyrighted content in its database and permitting its commercial use. Then again, it can also be argued that OpenAI should not have put the copyrighted content in Common Crawl’s corpus to commercial use.
Conclusion
If OpenAI can claim that it is not using The New York Times’ content for commercial purposes, it can be argued that OpenAI is not illegally using the Times’ work. However, since OpenAI does put content to commercial use in the paid versions of its AI models, it can be held responsible for infringing copyright by using content from The New York Times and other news outlets without permission. OpenAI could ultimately compensate these outlets by building a system that lets it directly cite information from The New York Times and other news sources.