When AI’s output is a threat to AI itself

    Experts estimate that AI models may run out of public data to sustain their current pace of growth within a decade. PHOTO: REUTERS
    Published Sun, Nov 24, 2024 · 12:00 PM

    THE internet is becoming awash in words and images generated by artificial intelligence.

    Sam Altman, OpenAI’s chief executive officer, wrote in February that the company generated about 100 billion words per day – one million novels’ worth of text, every day, an unknown share of which finds its way onto the internet.

    AI-generated text may show up as a restaurant review, a dating profile or a social media post. And it may show up as a news article, too: NewsGuard, a website that tracks online misinformation, recently identified over a thousand websites that churn out error-prone AI-generated news articles.

    In reality, with no foolproof methods to detect this kind of content, much will simply remain undetected.

    All this AI-generated information can make it harder for us to know what is real. And it also poses a problem for AI companies as they develop more and more powerful models, which involve digesting volumes of data so vast that they are approaching the size of the internet itself.

    As these companies trawl the web for new data to train their next models on – an increasingly challenging task – they are likely to ingest some of their own AI-generated content, creating an unintentional feedback loop in which what was once the output from one AI becomes the input for another.

    In the long run, this cycle may pose a threat to AI itself. Research has shown that when generative AI is trained on a lot of its own output, it can get a lot worse.

    Imagine a chatbot offering medical advice that is trained on medical conditions that were “hallucinated” by a previous AI – or one that offers legal advice trained on fictitious rulings that it encountered online. While this is a simplified example, it illustrates a problem on the horizon.

    Just as a copy of a copy can drift away from the original, when generative AI is trained on its own content, its output can also drift away from reality, growing further apart from the original data that it was intended to imitate.

    In a paper published in July in the journal Nature, a group of researchers in Britain showed how this process results in a narrower range of AI output over time – an early stage of what they called “model collapse”.

    If only some of the training data were AI-generated, the decline would be slower or more subtle. But it would still occur, researchers say, unless the synthetic data was complemented with a lot of new, real data.
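    To see why the output narrows, here is a deliberately simplified sketch (a toy illustration under assumed conditions, not the Nature paper’s actual experiment): treat a “model” as nothing more than a table of word frequencies, and train each new generation only on text sampled from the previous generation’s model. Rare words that happen not to be sampled vanish for good, so the model’s range of output can only shrink.

        import random
        from collections import Counter

        # Toy sketch of self-training: each generation's "model" is just the
        # empirical word-frequency table of its training corpus, and the next
        # corpus is sampled entirely from that model's own output.
        random.seed(42)

        vocab = [f"word_{i}" for i in range(50)]
        weights = [1.0 / (i + 1) for i in range(50)]  # a few common words, a long tail of rare ones
        corpus = random.choices(vocab, weights=weights, k=500)  # generation 0: "real" data

        for generation in range(10):
            model = Counter(corpus)  # "train": count word frequencies
            print(f"generation {generation}: distinct words = {len(model)}")
            # "generate": the next training corpus comes only from the model itself
            corpus = random.choices(list(model), weights=list(model.values()), k=500)

    Run generation after generation, the count of distinct words can only fall, because a word missed once has zero probability ever after – a crude analogue of the narrowing, tail-losing behaviour the researchers describe as early model collapse.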

    This problem is not confined to text. Another team of researchers at Rice University studied what would happen when the kinds of AI that generate images are repeatedly trained on their own output – a problem that could already be occurring as AI-generated images flood the web.

    They found that glitches and image artifacts started to build up in the AI’s output, eventually producing distorted images with wrinkled patterns and mangled fingers.

    “You’re kind of drifting into parts of the space that are like a no-fly zone,” said Richard Baraniuk, a professor at Rice who led the research on AI image models.

    The researchers found that the only way to stave off this problem was to ensure that the AI was also trained on a sufficient supply of new, real data.
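    Under the same toy setup (again an illustration under assumed conditions, not the Rice team’s image experiments), that fix can be sketched by blending a fresh slice of real data into every generation’s training corpus; the vocabulary then stays roughly stable instead of steadily shrinking, because a word lost in one round can be re-learned from the real data in the next.

        import random
        from collections import Counter

        # Same toy word-frequency "model", but each generation now trains on a
        # mix of synthetic output and freshly drawn real data.
        random.seed(42)

        vocab = [f"word_{i}" for i in range(50)]
        weights = [1.0 / (i + 1) for i in range(50)]

        def real_samples(k):
            # Fresh draws from the original, human-made distribution.
            return random.choices(vocab, weights=weights, k=k)

        corpus = real_samples(500)
        for generation in range(10):
            model = Counter(corpus)
            print(f"generation {generation}: distinct words = {len(model)}")
            synthetic = random.choices(list(model), weights=list(model.values()), k=400)
            corpus = synthetic + real_samples(100)  # keep 20% new, real data each round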

    Why it matters

    This does not mean generative AI will grind to a halt anytime soon. The companies that make these tools are aware of these problems, and they will notice if their AI systems start to deteriorate in quality.

    But it may slow things down. As existing sources of data dry up or become contaminated with AI “slop”, researchers say, it becomes harder for newcomers to compete.

    AI-generated words and images are already beginning to flood social media and the wider web. They are even hiding in some of the data sets used to train AI, the Rice researchers found.

    “The web is becoming increasingly a dangerous place to look for your data,” said Sina Alemohammad, a graduate student at Rice who studied how AI contamination affects image models.

    Ways out

    Perhaps the biggest takeaway of this research is that high-quality, diverse data is valuable and hard for computers to emulate.

    One solution, then, is for AI companies to pay for the data instead of scooping it up from the internet, ensuring both human origin and high quality.

    AI slop is not the only reason that companies may need to be wary of synthetic data. Another problem is that there are only so many words on the internet.

    Some experts estimate that the largest AI models have been trained on just a few percent of the available pool of text on the internet. They project that these models may run out of public data to sustain their current pace of growth within a decade.

    “These models are so enormous that the entire internet of images or conversations is somehow close to being not enough,” Baraniuk said.

    To meet their growing data needs, some companies are considering using today’s AI models to generate data to train tomorrow’s models. But researchers say this can lead to unintended consequences.

    And new research suggests that when humans curate synthetic data (for example, by ranking AI answers and choosing the best one), it can alleviate some of the problems of collapse.
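    One way to picture that curation step (an assumed, generic version of the idea, not the specific method in the research above): have the model produce several candidate answers, score each with a stand-in for human preference, and keep only the best one. In the toy sketch below, the generator has drifted away from the target, yet the curated samples land noticeably closer to it than the raw output does.

        import random
        from statistics import fmean

        # Hedged sketch of best-of-n curation. The "preference score" stands in
        # for a human ranking the answers; everything here is illustrative.
        random.seed(0)
        TARGET = 0.0          # what humans actually want
        MODEL_BIAS = 2.0      # the generator has already drifted off target

        def generate() -> float:
            """A toy generator whose output is scattered around its bias."""
            return MODEL_BIAS + random.gauss(0, 1)

        def preference_score(sample: float) -> float:
            """Stand-in for human judgment: closer to the target is better."""
            return -abs(sample - TARGET)

        raw = [generate() for _ in range(1000)]
        curated = [max((generate() for _ in range(4)), key=preference_score)
                   for _ in range(1000)]

        print(f"raw synthetic data:   mean error = {fmean(abs(x - TARGET) for x in raw):.2f}")
        print(f"curated (best of 4):  mean error = {fmean(abs(x - TARGET) for x in curated):.2f}")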

    But for now, there is no replacement for the real thing. NYT
