SAN FRANCISCO — For years, the people building powerful artificial intelligence systems have used enormous troves of text, images and videos pulled from the internet to train their models.
Now that data is drying up.
Over the past year, many of the most important web sources used for training AI models have restricted the use of their data, according to a study published last week by the Data Provenance Initiative, a Massachusetts Institute of Technology-led research group.
The study, which looked at 14,000 web domains that are included in three commonly used AI training data sets, discovered an “emerging crisis in consent,” as publishers and online platforms have taken steps to prevent their data from being harvested.
The researchers estimate that across the three data sets (C4, RefinedWeb and Dolma), 5% of all data, and 25% of data from the highest-quality sources, has been restricted. Those restrictions are set up through the Robots Exclusion Protocol, a decades-old method that lets website owners use a file called robots.txt to tell automated bots not to crawl their pages.
The study also found that as much as 45% of the data in one set, C4, had been restricted by websites’ terms of service.
“We’re seeing a rapid decline in consent to use data across the web that will have ramifications not just for AI companies, but for researchers, academics and noncommercial entities,” Shayne Longpre, the study’s lead author, said in an interview.
Data is the main ingredient in today’s generative AI systems, which are fed billions of examples of text, images and videos. Much of that data is scraped from public websites by researchers and compiled in large data sets, which can be downloaded and freely used, or supplemented with data from other sources.
Learning from that data is what allows generative AI tools like OpenAI’s ChatGPT, Google’s Gemini and Anthropic’s Claude to write, code and generate images and videos. The more high-quality data is fed into these models, the better their outputs generally are.
For years, AI developers were able to gather data fairly easily. But the generative AI boom of the past few years has led to tensions with the owners of that data — many of whom have misgivings about being used as AI training fodder or at least want to be paid for it.
As the backlash has grown, some publishers have set up paywalls or changed their terms of service to limit the use of their data for AI training. Others have blocked the automated web crawlers used by companies like OpenAI, Anthropic and Google.
Sites like Reddit and Stack Overflow have begun charging AI companies for access to data, and a few publishers have taken legal action, including The New York Times, which sued OpenAI and Microsoft for copyright infringement last year, alleging that the companies used news articles to train their models without permission.
Companies like OpenAI, Google and Meta have gone to extreme lengths in recent years to gather more data to improve their systems, including transcribing YouTube videos and bending their own data policies.
More recently, some AI companies have struck deals with publishers including The Associated Press and News Corp., the owner of The Wall Street Journal, giving them ongoing access to their content.
But widespread data restrictions may pose a threat to AI companies, which need a steady supply of high-quality data to keep their models fresh and up to date.
They could also spell trouble for smaller AI outfits and academic researchers who rely on public data sets and can’t afford to license data directly from publishers. Common Crawl, one such data set that comprises billions of pages of web content and is maintained by a nonprofit, has been cited in more than 10,000 academic studies, Longpre said.
It’s not clear which popular AI products have been trained on these sources, since few developers disclose the full list of data they use. But data sets derived from Common Crawl, including C4 (which stands for Colossal Clean Crawled Corpus), have been used by companies including Google and OpenAI to train previous versions of their models. Spokespeople for Google and OpenAI declined to comment.
Yacine Jernite, a machine-learning researcher at Hugging Face, a company that provides tools and data to AI developers, characterized the consent crisis as a natural response to the AI industry’s aggressive data-gathering practices.
“Unsurprisingly, we’re seeing blowback from data creators after the text, images and videos they’ve shared online are used to develop commercial systems that sometimes directly threaten their livelihoods,” he said.
But he cautioned that if all AI training data needed to be obtained through licensing deals, it would exclude “researchers and civil society from participating in the governance of the technology.”
Stella Biderman, the executive director of EleutherAI, a nonprofit AI research organization, echoed those fears.
“Major tech companies already have all of the data,” she said. “Changing the license on the data doesn’t retroactively revoke that permission, and the primary impact is on later-arriving actors, who are typically either smaller startups or researchers.”
AI companies have claimed that their use of public web data is legally protected under fair use. But gathering new data has gotten trickier. Some AI executives I’ve spoken to worry about hitting the “data wall” — their term for the point at which all of the training data on the public internet has been exhausted, and the rest has been hidden behind paywalls, blocked by robots.txt or locked up in exclusive deals.
Some companies believe they can scale the data wall by using synthetic data — that is, data that is itself generated by AI systems — to train their models. But many researchers doubt that today’s AI systems are capable of generating enough high-quality synthetic data to replace the human-created data they’re losing.
Another challenge is that while publishers can try to stop AI companies from scraping their data by placing restrictions in their robots.txt files, those requests aren’t legally binding, and compliance is voluntary. (Think of it like a “no trespassing” sign for data, but one without the force of law.)
Major search engines honor these opt-out requests, and several leading AI companies, including OpenAI and Anthropic, have said publicly that they do, too. But other companies, including the AI-powered search engine Perplexity, have been accused of ignoring them. Perplexity’s CEO, Aravind Srinivas, said that the company respects publishers’ data restrictions. He added that while the company once worked with third-party web crawlers that did not always follow the Robots Exclusion Protocol, it had “made adjustments with our providers to ensure that they follow robots.txt when crawling on Perplexity’s behalf.”
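For readers curious what honoring robots.txt looks like in practice, here is a minimal sketch in Python, using the standard library's robots.txt parser. The file contents, the site address and the crawler name ExampleResearchBot are made up for illustration; GPTBot is the name OpenAI publishes for its crawler. The sketch assumes a well-behaved crawler that asks before it fetches.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary publisher (an assumption for this
# sketch, not taken from any real site): block OpenAI's crawler everywhere,
# and keep every other bot out of /private/ only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this crawler to fetch this URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
article = "https://news.example.com/articles/1"  # hypothetical URL

# "ExampleResearchBot" is a made-up crawler name used for contrast.
for agent in ("GPTBot", "ExampleResearchBot"):
    verdict = "allowed" if may_fetch(parser, agent, article) else "blocked"
    print(f"{agent}: {verdict}")
```

Two things stand out. The check runs entirely on the crawler's side, so a bot that skips it is not stopped by anything in the file. And the rules are keyed to a crawler's name, not to what the data will be used for.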
Longpre said that one of the big takeaways from the study is that we need new tools to give website owners more precise ways to control the use of their data. Some sites might object to AI giants using their data to train chatbots for a profit but might be willing to let a nonprofit or educational institution use the same data, he said. Right now, there’s no good way for them to distinguish between those uses, or block one while allowing the other.
But there’s also a lesson here for big AI companies, who have treated the internet as an all-you-can-eat data buffet for years, without giving the owners of that data much of value in return. Eventually, if you take advantage of the web, the web will start shutting its doors.

