AI: Copyright compensation systems to the Rescue. RTZ #519
...thorny issue of content for AI data collection still front and center as AI Scales
A core issue all through this AI Tech Wave has been what content providers and copyright holders should be compensated. Especially as LLM AI big and small tech companies hoover up their data to feed, train, and do inference runs on their ever scaling AI models. Large and Small. Doing it under their ‘Fair Use’ interpretations of the internet reality. It’s a topic covered here for months, with so many publishers yelling ‘Show me the Money’.
The NY Times has the latest on this ongoing intellectual and legal tug of war in “Former OpenAI Researcher Says the Company Broke Copyright Law”:
“Over the past two years, a number of individuals and businesses have sued various A.I. companies, including OpenAI, arguing that they illegally used copyrighted material to train their technologies. Those who have filed suits include computer programmers, artists, record labels, book authors and news organizations.”
“In December, The New York Times sued OpenAI and its primary partner, Microsoft, claiming they used millions of articles published by The Times to build chatbots that now compete with the news outlet as a source of reliable information. Both companies have denied the claims.”
The fundamental argument offered by the 25 year old ex-OpenAI engineer is a bit one-side and legally simplistic from my perspective. And the debate turns on ‘Fair use’ doctrine, as the article explains:
“OpenAI, Microsoft and other companies have said that using internet data to train their A.I. systems meets the requirements of the “fair use” doctrine. The doctrine has four factors. The companies argue that those factors — including that they substantially transformed the copyrighted works and were not competing in the same market with a direct substitute for those works — play in their favor.”
The data being discussed is of course the all important Box 4 from the AI Tech Wave chart further down this post.
“Mr. Balaji does not believe these criteria have been met. When a system like GPT-4 learns from data, he said, it makes a complete copy of that data. From there, a company like OpenAI can then teach the system to generate an exact copy of the data. Or it can teach the system to generate text that is in no way a copy. The reality, he said, is that companies teach the systems to do something in between.”
“The outputs aren’t exact copies of the inputs, but they are also not fundamentally novel,” he said. This week, he posted an essay on his personal website that included what he describes as a mathematical analysis that aims to show that this claim is true.”
I’m more on the side of Mark Lemley below:
“Mark Lemley, a Stanford University law professor, argued the opposite. Most of what chatbots put out, he said, is sufficiently different from its training data.
“There are occasionally circumstances where an output looks like an input,” he said. “A vast majority of things generated by a ChatGPT or an image generation system do not draw heavily from a particular piece of content.”
As I framed all this in my piece “AI: On the Shoulders of Giants, Go-Getter 'Creators' & Grunts (OTSOG)” late last year“:
“The New York Times sued OpenAI and Microsoft this week to stake their claim on the Foundation LLM AI empires being built in these early days of the AI Tech Wave. It’s a continuation of a series of these legal claims by many other publishers, and something I’ve written about at length this year. It will continue as we all become daily users of ChatGPT, Copilot, Google Bard/Gemini AI driven Search, and other AI products and services being rolled out into 2024.”
“All of these and many, many more that have the potential to improve billions of lives”.
“It’s an opportunity for us all to have beautifully automated AI machinery that provides us a distillation of our digital data, and for us all to stand more ‘On the Shoulders of Giants’. And Go-Getter ‘Creators’ and of course Grunts. Not to mention Generated AI. Note that online ‘Creators’ are already a multi-hundred billion economy, according to Goldman Sachs, about to be transformed again by AI.”
“(Side-note: My all-time favorite historical book on OTSOG by Robert K. Merton highly recommended. TL;DR, but very worthwhile life-long journey).”
“And yes, we will have to figure out how to compensate us all for our creations large and small, that are the ‘data fuel’ for AI to do its things. It’s a very important detail that will be negotiated and litigated over well into 2024 and beyond. Likely with multiple trips to the Supreme Court. And similar actions worldwide. Lawyers rejoice.”
“The good news is that we will also have a plethora of new technologies from companies large and small to help build the ‘compensation systems’ for the ‘data fuel’ across LLM AI platforms of all types. It’s not all going to be dependent on legal agreements arranged by a gaggle of lawyers.”
An example of these new technologies and startups that I was hoping for last year is a company called Tollbit, described by Axios:
“TollBit, a two-sided marketplace for publishers and AI companies, has raised a $24 million series A round led by Lightspeed Venture Partners, executives told Axios.”
“Why it matters: TollBit hopes its marketplace can reduce the legal and business friction that's made data-sharing between the media industry and AI firms tense and complicated.”
“The big picture: OpenAI, Microsoft, Perplexity and other AI firms are brokering licensing deals with publishers where they pay upfront for the right to use a publisher's content.”
Another example in this space is Pricing Culture, founded by my friend Bhargav Shivarthy, focused on using AI tech to convert data feeds into narratives for computers and humans.
Companies like Tollbit, Pricing Culture and many more will fill this critical, exploding need. Interspersing themselves between Boxes 3 and 6 in the AI Tech Wave chart above. They’ll be all the more important as we go from billions of users to tens of billions of ‘machine to machine’ (m2m) agents. All scouring everywhere online for the data needed for our continuously expanding, insatiable human queries and quests.
This critical issue of compensating content and copyright holders for the use of their work in AI models to come will get worked out, in many different ways. Stay tuned.
(NOTE: The discussions here are for information purposes only, and not meant as investment advice at any time. Thanks for joining us here)