Synthetic data is more useful than you think
Fears of ‘model collapse’ are overstated — at least for now
A study published in Nature earlier this year painted a dire picture of AI’s future. Training models on synthetic data — data generated by AI systems, rather than humans — could trigger ‘model collapse’, the authors warned: a recursive failure mode where models become increasingly worse as they’re trained on data produced by their previous iterations.
The paper found that when generating synthetic data, rare data points are less likely to be included, producing a narrower distribution that misses important edge cases and hurts performance. Over multiple iterations, the synthetic data becomes so riddled with hallucinations and errors that models trained on it stop being useful, Zak Shumaylov, one of the study’s authors, told Transformer.
The paper sparked widespread concern about model collapse, pushing many to write off training on synthetic data altogether. But talk to the people at AI companies, and you’ll hear a very different story: one where training on synthetic data is already yielding results.
The disconnect stems from the specifics of the Nature paper’s experiments. To get model collapse, the authors repeatedly fine-tuned their models only on synthetic data, discarding the original human-generated data and not verifying whether the synthetic data was high-quality.
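For intuition, here is a toy, self-contained sketch (not the paper’s code) of the dynamic just described: a “model” that simply fits a Gaussian is retrained, generation after generation, only on samples drawn from its previous fit, with the original data thrown away. The fitted spread collapses towards zero, so rare tail values are the first thing to vanish.

```python
# Toy illustration of recursive training on synthetic data with the
# original data discarded. Each "model" is just a fitted Gaussian; the
# next generation is fitted only to samples drawn from the previous one.
import numpy as np

rng = np.random.default_rng(0)
human_data = rng.normal(loc=0.0, scale=1.0, size=50)  # the original "real" data

mu, sigma = human_data.mean(), human_data.std()
for generation in range(1, 201):
    synthetic = rng.normal(mu, sigma, size=50)      # generate from the current model
    mu, sigma = synthetic.mean(), synthetic.std()   # refit only on the synthetic data
    if generation % 20 == 0:
        print(f"generation {generation:3d}: fitted sigma = {sigma:.4f}")
# The fitted sigma drifts towards zero over generations: the distribution
# narrows and the tails (the rare cases) disappear from the data.
```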
But this scenario is nothing like how AI companies actually use synthetic data. Frontier LLMs like Llama 3.1 use synthetic data strategically, filling in gaps where human-generated data would be difficult or expensive to obtain (such as long-context reasoning, multilingual examples, and tool use like web searches). Rather than replacing human data, synthetic examples supplement it, ensuring outliers and rare cases are still represented.
AI companies also take care to curate their synthetic data. Before training a model on synthetic data, they verify it with automated or AI-based checks (especially for checkable tasks like maths or coding), or have humans filter it to ensure quality. Such an approach may be more labour-intensive than training on uncurated synthetic data, but it’s far more efficient than producing new human-generated data. While creating new high-quality examples from scratch is challenging for humans, verifying AI-generated examples is comparatively straightforward. And, unlike training on uncurated data, this approach avoids model collapse.
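A minimal sketch of what that curation step can look like for a task with programmatically checkable answers. The `generate_candidates` function is a hypothetical stand-in for a model call, not any company’s actual pipeline; the point is simply that verifying an answer is far cheaper than writing a good example from scratch.

```python
# Sketch: keep only synthetic examples whose answers pass an automated check.
import random

def generate_candidates(a: int, b: int, n: int = 4) -> list[str]:
    """Hypothetical stand-in for an LLM: propose n worked answers, some wrong."""
    return [f"{a} + {b} = {a + b + random.choice([0, 0, 0, 1])}" for _ in range(n)]

def is_correct(example: str) -> bool:
    """Automated verifier: re-derive the answer and compare."""
    lhs, rhs = example.split("=")
    a, b = (int(x) for x in lhs.split("+"))
    return int(rhs) == a + b

curated = []
for _ in range(1_000):
    a, b = random.randint(0, 99), random.randint(0, 99)
    curated.extend(ex for ex in generate_candidates(a, b) if is_correct(ex))

print(f"kept {len(curated)} verified examples for training")
```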
It seems to be working. LLMs trained on carefully curated synthetic data can rival larger models trained on web-scraped data, often at much lower cost. For example, researchers at Stanford University fine-tuned Meta’s LLaMA 7B model on 52,000 synthetic examples generated by OpenAI’s GPT-3.5. The resulting model achieved performance comparable to GPT-3.5, for under $600.
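For a sense of what such synthetic training data looks like, here is an illustrative sketch. The Stanford release used simple instruction/input/output records (the widely copied “Alpaca” format); the field names below follow that convention from memory and the example text is invented, so treat this as an assumption rather than an exact reproduction.

```python
# Illustrative only: one synthetic instruction-tuning record, plus a helper
# that folds it into the single text string a fine-tuning run would consume.
import json

record = {
    "instruction": "Explain why the sky appears blue.",
    "input": "",
    "output": "Shorter blue wavelengths of sunlight scatter off air molecules "
              "more strongly than longer ones, so scattered blue light reaches "
              "our eyes from every direction.",
}

def to_prompt(ex: dict) -> str:
    """Assemble one training string from a structured synthetic example."""
    prompt = f"### Instruction:\n{ex['instruction']}\n\n"
    if ex["input"]:
        prompt += f"### Input:\n{ex['input']}\n\n"
    return prompt + f"### Response:\n{ex['output']}"

with open("synthetic_sft.jsonl", "w") as f:
    f.write(json.dumps({"text": to_prompt(record)}) + "\n")
```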
The approach is increasingly popular in industry, too. OpenAI is reportedly using its o1 model to generate synthetic data for Orion (the suspected code name for GPT-5), while Google and Meta have incorporated synthetic data into at least some of their models, with Llama 3.1 making particularly extensive use of it. Hugging Face trained their small language model Cosmo 1B primarily on synthetic data.
None of this means the concerns raised by the Nature paper are entirely misplaced. Concerns remain about the increasing presence of AI-generated content in internet data scrapes, though AI researcher Pablo Villalobos says existing content filtering techniques — already used to screen out spam and low-quality content — should filter out low-quality AI data as well.
And while we know that synthetic data can complement human data, we don’t know if it can replace it entirely. We could hit the “data cliff” — the point where AI runs out of new data to train on — as soon as 2028 (though Villalobos thinks it’s more likely to be in the 2030s). If and when we do hit the cliff, we don’t know if synthetic data will allow us to continue scaling models beyond it. The successes to date suggest it will, but there are still big unknowns.
Yunzhen Feng, an NYU PhD candidate studying model collapse, told Transformer that there might be something important in human-generated data that causes the seemingly magical performance increases we’ve seen so far from scaling training data. He said that training on synthetic data will likely still produce better models, but possibly with diminished returns compared to human-generated data. Ultimately, though, we just don’t know: no one has tried training on synthetic data at that kind of scale.
Despite these uncertainties, it’s clear that synthetic data won’t necessarily destroy a model. By combining synthetic data with rigorous verification, AI companies are turning what was seen as a potential threat into a powerful tool. The question now isn’t whether to use synthetic data, but how much we can rely on it at scale.
Disclosure: The author’s partner works at Google DeepMind.