The New York Times (NYT) recently updated its terms of service (TOS) to forbid the scraping of its articles and images for artificial intelligence (AI) training. The move comes as tech companies capitalize on AI language applications such as ChatGPT and Google Bard, whose underlying models were trained in part on data scraped from the web without permission.
The updated terms explicitly state that the use of Times content—including articles, videos, images, and metadata—for AI model training requires express written permission. Any attempt to develop software programs, including machine learning or AI systems, using NYT’s content without prior consent is strictly prohibited.
For non-compliance, the NYT outlines potential consequences, including civil, criminal, and administrative penalties imposed on both users and those assisting them.
While these restrictions may seem intimidating, past practice suggests that widespread scraping of internet data for machine learning purposes often goes unchecked. Many existing large language models, such as GPT-4, Claude 2, Llama 2, and PaLM 2, were trained on vast datasets obtained through web scraping. In this pretraining process, web text is fed into neural networks, which learn statistical relationships between words and thereby build a working model of language.
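The idea of learning word relationships from raw text can be illustrated with a toy sketch (the corpus and window size here are hypothetical; real models learn far richer representations than raw pair counts):

```python
from collections import Counter

# Toy corpus standing in for scraped web text (hypothetical sample).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat and a dog played",
]

def cooccurrence_counts(sentences, window=2):
    """Count how often word pairs appear within `window` tokens of each other."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, left in enumerate(tokens):
            for right in tokens[i + 1 : i + 1 + window]:
                counts[tuple(sorted((left, right)))] += 1
    return counts

counts = cooccurrence_counts(corpus)
# Related words ("sat" near "on") score higher than unrelated pairs.
print(counts[("on", "sat")], counts[("cat", "rug")])  # → 2 0
```

Large language models replace these raw counts with learned neural representations, but the underlying signal is the same: statistical regularities in enormous amounts of text.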
The controversy surrounding the use of scraped data to train AI models has raised legal concerns and even led to a lawsuit accusing OpenAI of plagiarism. In response to these issues, the Associated Press and other news organizations have called for the development of a legal framework to protect content used in AI applications.
OpenAI, expecting further legal challenges, has taken proactive steps such as providing guidance on blocking its AI-training web crawler through robots.txt. Nevertheless, the content that has already been scraped, including New York Times content, has become an integral part of GPT-4. Only time will tell if future iterations like GPT-5 will respect the wishes of content owners or if more lawsuits or regulations regarding AI and content usage are on the horizon.
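OpenAI's published guidance identifies its crawler by the GPTBot user agent; a site that wants to opt out of future crawling can add a rule like the following to its robots.txt file:

```
User-agent: GPTBot
Disallow: /
```

Note that robots.txt is an honor-system convention: it only restrains crawlers that choose to respect it, and it does nothing about data already collected.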
Frequently Asked Questions
Why is The New York Times prohibiting content scraping for AI training?
The New York Times aims to protect its content from unauthorized use in AI model training. By prohibiting scraping, the NYT intends to maintain control over the usage of its articles, images, videos, and metadata.
Can AI models be trained without scraping content from the internet?
AI models can be trained using a variety of methods, including supervised learning with curated datasets or pre-training on publicly available data. Scraping content from the internet is one way to acquire large amounts of data, but it is not the only approach.
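As a minimal illustration of supervised learning on a small curated dataset (the data points and labels below are invented for the example), a simple perceptron can learn a classification rule without any scraped content:

```python
# Curated, hand-labeled dataset: label is 1 when x + y > 1, else 0.
data = [
    ((0.0, 0.0), 0), ((0.2, 0.1), 0), ((0.9, 0.9), 1),
    ((1.0, 0.5), 1), ((0.1, 0.4), 0), ((0.8, 0.7), 1),
]

w = [0.0, 0.0]  # weights
b = 0.0         # bias
lr = 0.1        # learning rate

# Classic perceptron update rule: nudge weights toward misclassified points.
for _ in range(100):
    for (x1, x2), label in data:
        pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
        err = label - pred
        w[0] += lr * err * x1
        w[1] += lr * err * x2
        b += lr * err

def predict(x1, x2):
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0

print(predict(0.95, 0.9), predict(0.05, 0.1))  # → 1 0
```

The point is not the algorithm itself but the data source: every example here was deliberately curated and labeled, which is how many production systems are fine-tuned even when their pretraining relied on web-scale data.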
What are the potential consequences for ignoring The New York Times’ scraping restrictions?
Engaging in prohibited use of The New York Times’ content can result in civil, criminal, and/or administrative penalties, fines, or sanctions against the user and those assisting the user.
Have AI vendors faced legal challenges for utilizing scraped data?
Yes, the use of scraped data in AI training has resulted in legal challenges, including accusations of plagiarism. Concerns surrounding content protection when used for AI applications have prompted calls for the development of a legal framework to address these issues.
How has OpenAI responded to the scraping controversy?
OpenAI has taken measures to address criticism and legal challenges. It recently published guidance on blocking its AI-training web crawler through robots.txt. However, content that has already been scraped remains integrated into existing models, such as GPT-4.