We are barely at the beginning of a deluge of AI-driven data and content, far more than either humans or computers can absorb and digest. Let’s start with the human-targeted deluge first. The NYTimes delves into the issue in the Travel category, introducing us to:
“...a new form of travel scam: shoddy guidebooks that appear to be compiled with the help of generative artificial intelligence, self-published and bolstered by sham reviews, that have proliferated in recent months on Amazon.”
“The books are the result of a swirling mix of modern tools: A.I. apps that can produce text and fake portraits; websites with a seemingly endless array of stock photos and graphics; self-publishing platforms — like Amazon’s Kindle Direct Publishing — with few guardrails against the use of A.I.; and the ability to solicit, purchase and post phony online reviews, which runs counter to Amazon’s policies and may soon face increased regulation from the Federal Trade Commission.”
“The use of these tools in tandem has allowed the books to rise near the top of Amazon search results and sometimes garner Amazon endorsements such as “#1 Travel Guide on Alaska.””
The entire piece is a tour de force of investigative journalism, going into the breadth and depth of this already impressive deluge of abundantly shoddy AI-generated ‘books’, complete with all the accouterments that would convince mainstream Amazon users that they were buying and perusing real travel books. The whole piece is worth reading if only to get a taste of what’s to come in almost every field of human endeavor: AI-generated content and data that’s going to be increasingly hard to distinguish from the real thing. For humans and computers alike.
It will exponentially increase the burden on us all to sort through an epic storm of chaff to find the actual kernels of wheat. In a post titled “AI: Data is Key” back in June, I outlined the emerging industry of what’s called “Extracted Data”, drawn from every imaginable bit of digital information that can be scraped across the internet.
And how the data we have so far is barely the tip of the iceberg of what’s to come:
“Data of course is going to be a lot more ubiquitous, varied and voluminous, owing to our increasingly digital lifestyles and the proliferation of billions of connected devices.”
It’s important to remember that foundation LLM AI models work their magic primarily through constant ‘reinforcement learning’ feedback loops between users and models: users pose a stream of queries, and their responses and feedback continually reshape the underlying data via the LLM models, producing increasingly relevant and reliable results.
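To make that loop concrete, here is a deliberately toy sketch of the cycle: generate an answer, collect a rating, and fold the rated pair back into the data that shapes future answers. Every name in it (`generate`, `collect_feedback`, the dict-based “model”) is a hypothetical stand-in, not any real LLM training API.

```python
import random

# Toy feedback loop: answer queries, gather user ratings, and fold
# the rated examples back into the "model's" data. A hypothetical
# sketch of the cycle described above, not a real training pipeline.

def generate(model, query):
    """Return the best-rated stored answer for a query, or a default."""
    candidates = model.get(query, [("(no answer yet)", 0.0)])
    return max(candidates, key=lambda pair: pair[1])[0]

def collect_feedback(query, answer):
    """Stand-in for real user signals (thumbs up/down, rephrasings)."""
    return random.uniform(0.0, 1.0)

model = {}  # query -> list of (answer, rating) pairs
queries = ["best hikes near Denali?", "is this guidebook legitimate?"]

for round_number in range(3):  # each round reshapes the data
    for query in queries:
        answer = generate(model, query)
        rating = collect_feedback(query, answer)
        model.setdefault(query, []).append((answer, rating))
```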
As I’ve outlined earlier as well, AI software is transforming traditional software, and all of this differentiates the current AI Tech Wave from earlier ones that gave us the PC and Internet industries.
A big way to improve current LLM AI models is with tons of new data, and one of the best ways currently to get new data is to create ‘Synthetic Data’: using AI itself to generate content to feed the models. Synthetic data is, for the most part, a good thing. The issue, as with a lot of things in life, has to do with moderation. As this VentureBeat piece explains:
“What happens as AI-generated content proliferates around the internet, and AI models begin to train on it, instead of on primarily human-generated content?”
“A group of researchers from the UK and Canada have looked into this very problem and recently published a paper on their work in the open access journal arXiv. What they found is worrisome for current generative AI technology and its future: “We find that use of model-generated content in training causes irreversible defects in the resulting models.””
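A quick back-of-the-envelope illustration of why that proliferation matters: if some fixed fraction of each successive training crawl is AI-generated, the share of human-written text in the pool decays geometrically. The 20%-per-crawl figure below is purely an illustrative assumption, not a measurement from any of the cited papers.

```python
# Illustrative only: assume 20% of each successive training crawl
# is AI-generated content displacing human-written text.
synthetic_per_crawl = 0.20
human_share = 1.0
for crawl in range(1, 6):
    human_share *= (1 - synthetic_per_crawl)
    print(f"crawl {crawl}: ~{human_share:.0%} human-written")
# After 5 crawls, only ~33% of the pool is human-written.
```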
This area of intense AI research focuses on the ‘Model Collapse’ failure modes that arise from increasing amounts of synthetic data ‘polluting’ LLM AI models. Another example is a new paper from researchers at Stanford and elsewhere titled “Self-Consuming Generative Models Go MAD”. This is an increasingly vital issue for LLM AIs, and there is a global and urgent search for ways to mitigate it.
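The flavor of the failure can be reproduced with a tiny numerical experiment: fit a trivial “model” (just a mean and standard deviation) to data, sample fresh synthetic data from the fit, refit on those samples, and repeat. Because each generation only sees a finite sample, the estimated spread drifts and tends to shrink, losing the tails of the original distribution. This mirrors the cited papers in spirit only; real LLM collapse is far more complex.

```python
import random
import statistics

# Toy "model collapse": each generation trains only on samples
# drawn from the previous generation's fitted model. The estimated
# spread (stdev) tends to drift downward as the tails are lost.
random.seed(0)

def fit(data):
    """Our entire 'model' is just a fitted mean and stdev."""
    return statistics.mean(data), statistics.stdev(data)

def sample(mu, sigma, n):
    return [random.gauss(mu, sigma) for _ in range(n)]

data = sample(0.0, 1.0, 100)  # generation 0: "human" data
for gen in range(1, 21):
    mu, sigma = fit(data)            # train on the current pool
    data = sample(mu, sigma, 100)    # next pool is purely synthetic
    if gen % 5 == 0:
        print(f"generation {gen}: mean={mu:+.3f} stdev={sigma:.3f}")
```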
Again, this recent Atlantic piece outlines and summarizes several recent papers on the issue of ‘Model Collapse’, quoting one researcher as simply saying, “You are what you eat”. It seems that applies to computers as much as it does to humans. Garbage in, garbage out (GIGO) is an old problem in computer science, and it applies anew with AI as well.
And we need to figure out how to put the AI computers on a better, healthier diet, while managing to wade through the deluge of AI-generated data and content ourselves. Stay tuned.
(NOTE: The discussions here are for information purposes only, and not meant as investment advice at any time. Thanks for joining us here).