AI: Debate continues over AI 'Synthetic Data'. RTZ #433
...the cons include ‘model collapse’; the pros point to a net Data nirvana when done right
I’ve long talked about how one of the most distinctive differentiators of this AI Tech Wave vs earlier tech waves is the need for Data to constantly feed both the Large and Small Language models (LLM and SLM AIs). Box number 4 below:
My optimism comes from wide swathes of new sources and techniques to extract Data to train the next generation of LLM AIs as they Scale.
They range from tapping the big pools of ‘Data Exhaust’ from applications and services, to using ‘Synthetic Data’, where LLM AI models are used to create new data with other LLMs plus fine-tuning and reinforcement learning techniques.
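To make the idea concrete, here is a minimal, hypothetical sketch of what an LLM-driven synthetic data loop can look like. It is not drawn from any specific lab’s pipeline; the `call_llm` function and the seed topics are stand-ins for whatever model API and domain prompts a team would actually use.

```python
# Minimal sketch of an LLM-driven synthetic data loop (illustrative only).
# `call_llm` is a hypothetical stand-in for a real model API call.
import random

def call_llm(prompt: str) -> str:
    # Replace with a real model call; returns canned text so the sketch runs.
    return f"(model output for: {prompt})"

SEED_TOPICS = ["geometry proofs", "SQL debugging", "radiology report summaries"]

def make_synthetic_examples(n: int) -> list[dict]:
    examples = []
    for _ in range(n):
        topic = random.choice(SEED_TOPICS)
        question = call_llm(f"Write one challenging question about {topic}.")
        answer = call_llm(f"Answer carefully: {question}")
        examples.append({"prompt": question, "response": answer})
    return examples  # pairs like these then feed fine-tuning / RL-style training

print(len(make_synthetic_examples(3)))
```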
For a while now, AI researchers have been balancing the different ways to create useful Synthetic Data without resulting in what’s called ‘Model Collapse’. Axios reports on some recent work on both sides of the debate:
“Data to train AI models increasingly comes from other AI models in the form of synthetic data, which can fill in chatbots' knowledge gaps but also destabilize them.”
“The big picture: As AI models expand in size, their need for data becomes insatiable — but high quality human-made data is costly, and growing restrictions on the text, images and other kinds of data freely available on the web are driving the technology's developers toward machine-produced alternatives.”
“State of play: AI-generated data has been used for years to supplement data in some fields, including medical imaging and computer vision, that use proprietary or private data.”
“But chatbots are trained on public data collected from across the internet that is increasingly being restricted — while at the same time, the web is expected to be flooded with AI-generated content.”
These restrictions tie into the ongoing content copyright lawsuits and negotiations I’ve discussed at length. In the meantime,
“Those constraints and the decreasing cost of generating synthetic data are spurring companies to use AI-generated data to help train their models.”
“Meta, Google, Anthropic and others are using synthetic data — alongside human-generated data — to help train the AI models that power their chatbots.”
“Google DeepMind's new AlphaGeometry 2 system that can solve math Olympiad problems is trained from scratch on synthetic data.”
“New research illustrates the potential effects of AI-generated data on the answers AI can give us.”
“In one scenario that's extreme yet valid, given the state of the web, researchers trained a generative AI model largely on AI-generated data. The model eventually became incoherent, in what they called a case of "model collapse" in a paper published Wednesday in Nature.”
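A well-known toy analogy for this failure mode (my illustration, not the Nature paper’s actual experiment): fit a simple Gaussian to samples drawn only from the previous generation’s model. With no fresh human data in the loop, estimation error compounds and the distribution drifts away from the real one.

```python
# Toy analogy for model collapse: each generation fits a Gaussian only to
# samples drawn from the previous generation's fitted model.
import random
import statistics

random.seed(0)
real_data = [random.gauss(0.0, 1.0) for _ in range(1_000)]
mu, sigma = statistics.fmean(real_data), statistics.stdev(real_data)

for generation in range(1, 21):
    synthetic = [random.gauss(mu, sigma) for _ in range(100)]  # no real data reused
    mu, sigma = statistics.fmean(synthetic), statistics.stdev(synthetic)
    print(f"gen {generation:2d}: mu={mu:+.3f} sigma={sigma:.3f}")

# Over enough generations the estimated parameters tend to wander and the
# spread to decay, since each model sees only its predecessor's finite outputs.
```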
The piece goes on to give several recent examples of this ‘model collapse’ phenomenon when synthetic data is generated without processes in place to improve the results. There are some ways to do that:
“Yes, but: AI-generated data can also be a powerful tool to address limitations in data.”
“New research shows how it can be tailored to specific needs or questions and then used to steer models' responses to produce less harmful speech, represent more languages or provide other desired output.”
“A team from Cohere for AI, Cohere's nonprofit AI research lab, recently reported being able to use targeted sampling of AI-generated data to reduce toxic responses from a model by up to 40%.”
“Shumailov and his colleagues performed "algorithmic reparation" by curating training data to improve fairness in models.”
“By molding and sculpting data in different ways, researchers might be able to achieve their goals with a smaller model because it is trained on a dataset with a specific objective in mind, says Sara Hooker, who leads Cohere for AI.”
“Instead of learning from synthetic data produced by one "teacher" model, AI can be trained on data strategically sampled from a community of specialized teachers, she says. That can help avoid "collapse" because the synthetic data comes from multiple sources.”
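A rough sketch of what that kind of targeted, multi-teacher sampling could look like; the teacher and filter functions below are hypothetical stand-ins, not Cohere’s actual method or API.

```python
# Sketch: pool candidate generations from several specialized "teacher" models
# and keep only those that pass a quality/toxicity filter before training.
from typing import Callable

def sample_from_teachers(
    teachers: list[Callable[[str], str]],   # each maps a prompt to generated text
    prompts: list[str],
    accept: Callable[[str], bool],          # e.g. a toxicity or relevance check
) -> list[dict]:
    kept = []
    for prompt in prompts:
        for teacher in teachers:            # multiple sources, not a single teacher
            candidate = teacher(prompt)
            if accept(candidate):
                kept.append({"prompt": prompt, "response": candidate})
    return kept

# Trivial stand-ins, just to show the shape of the data flow:
teachers = [lambda p: f"[teacher-A] {p}", lambda p: f"[teacher-B] {p}"]
data = sample_from_teachers(teachers, ["explain overfitting"],
                            accept=lambda text: bool(text.strip()))
print(len(data))  # 2
```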
“When 10% of the original human-generated data was retained, the model's performance didn't suffer, the team reports in the Nature paper.”
“Such data could be given more weight in training a model to protect it from collapsing, but it is currently difficult to tell real data from synthetic data, Shumailov says.”
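As a hedged illustration of that last point, here is a sketch of mixing a retained slice of human data back into a synthetic corpus, with the real examples up-weighted. The 10% fraction echoes the Nature result quoted above, but the weighting scheme is purely an assumption for illustration.

```python
# Sketch: blend retained human data (up-weighted) with a synthetic corpus.
import random

def build_training_mix(real: list[str], synthetic: list[str],
                       real_fraction: float = 0.10, real_weight: float = 2.0):
    target_real = max(1, int(real_fraction * (len(real) + len(synthetic))))
    sampled_real = random.sample(real, min(target_real, len(real)))
    mix = [(example, real_weight) for example in sampled_real]   # keep human data
    mix += [(example, 1.0) for example in synthetic]             # synthetic bulk
    random.shuffle(mix)
    return mix  # (example, sample_weight) pairs for the training loop

mix = build_training_mix([f"real-{i}" for i in range(100)],
                         [f"synth-{i}" for i in range(900)])
print(len(mix), sum(weight for _, weight in mix))
```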
“The bottom line: AI-generated data is "an amazingly useful technology, but if you use it indiscriminately, it's going to run into problems," Vyas Sekar, a professor of electrical and computer engineering at Carnegie Mellon University, told Axios.”
"If used well, it can lead to really good outcomes," says Sekar, who is also co-founder and chief technology officer of Rockfish, a company that helps customers combine human- and AI-generated data for their specific needs.”
"There's value for both real data and generative data in any use case."
So in the final analysis, there is light at the end of the ‘synthetic data’ tunnel. We’re at the ‘tip of the iceberg’ where AI data is concerned.
And once fine-tuned and optimized, these new processes could go a long way to expanding ‘Digital Twin’ AI Data generation at Scale for a number of Scaling cycles. Stay tuned.
(NOTE: The discussions here are for information purposes only, and not meant as investment advice at any time. Thanks for joining us here)