Two days ago we focused on how Amazon’s Jeff Bezos turned a suggested business strategy into a trillion-dollar-plus business by using the now famous Amazon Flywheel. Then yesterday, we saw how OpenAI, with its partner Microsoft, could take the current lead with their Foundation Large Language Model AIs (LLM AIs like GPT-3, GPT-4, etc.) and their phenomenally successful ChatGPT app and clients, and start to actualize their own Flywheel at scale, while competitors at Google, Meta, and elsewhere do the same.
But there is an entirely different category of Flywheel loops that comes into play in the AI cycle, which I want to review here. They’re called Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) loops. A mouthful, I know, but they make all the difference between AI that’s cool but still mortal software, and AI that’s potentially “magical” and beyond. Let me explain.
As the good folk at ‘AI Native’ company Weights & Biases explain:
“As AI models grow, issues of bias — as well as fairness and safety — emerge. Reinforcement Learning from Human Feedback (RLHF) is a novel approach to reducing bias in large language models (LLMs).
In this article, we explore how to use RLHF to reduce the bias — and increase performance, fairness, and representation — in LLMs.”
“The essential goal here is to make a conventional large language model (GPT-3 in our case) aligned with human principles or preferences. This makes our LLMs less toxic, more truthful, and less biased.”
The chart below explains the process in more detail. It involves several key steps:
Extracting data from the Internet, from sites like Wikipedia, Reddit, Google Books, academic paper databases, and media sites (with permission), using web-crawling software.
Then using human and machine ‘labelers’ to curate, process, and fine-tune the extracted data.
Then running human- and machine-generated prompts against the LLM AI models to test their outputs.
Then ranking the outputs from these prompts from best to worst.
Using those rankings to retrain the underlying models, improving their accuracy and reducing their biases. Rinse, repeat, and run the Reinforcement Learning loops as many times as necessary.
Continuing to optimize these results at scale, using a variety of additional techniques that draw on reinforcement learning feedback from both humans and AI mechanisms, as mentioned above. (A minimal code sketch of this loop follows right after this list.)
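To make those steps concrete, here is a minimal, hypothetical Python sketch of the ranking-and-reward-model portion of the loop. Every name in it (generate_candidates, human_rank, TinyRewardModel, and so on) is an illustrative placeholder rather than any vendor’s actual API, and the “models” are toy stand-ins so the sketch runs on its own.

```python
# Illustrative RLHF sketch: generate candidates, rank them, turn rankings into
# preference pairs, and update a toy reward model. All names are hypothetical.
import random
from itertools import combinations

def generate_candidates(prompt, n=4):
    """Stand-in for the base LLM: produce n candidate responses for a prompt."""
    return [f"{prompt} -> response variant {i}" for i in range(n)]

def human_rank(candidates):
    """Stand-in for the human labelers: return candidates ordered best to worst.
    In a real pipeline this is where human preference judgments come in."""
    return sorted(candidates, key=lambda c: random.random())

def to_pairwise_preferences(ranked):
    """Convert a best-to-worst ranking into (preferred, rejected) training pairs."""
    return [(better, worse) for better, worse in combinations(ranked, 2)]

class TinyRewardModel:
    """Toy reward model: scores a response by summing per-token weights."""
    def __init__(self):
        self.weights = {}

    def score(self, text):
        return sum(self.weights.get(tok, 0.0) for tok in text.split())

    def update(self, preferred, rejected, lr=0.1):
        # Nudge token weights so the preferred response scores higher.
        for tok in preferred.split():
            self.weights[tok] = self.weights.get(tok, 0.0) + lr
        for tok in rejected.split():
            self.weights[tok] = self.weights.get(tok, 0.0) - lr

# The "rinse and repeat" loop: generate, rank, build preference pairs, update.
reward_model = TinyRewardModel()
for prompt in ["Explain the Amazon Flywheel", "Summarize RLHF"]:
    ranked = human_rank(generate_candidates(prompt))
    for preferred, rejected in to_pairwise_preferences(ranked):
        reward_model.update(preferred, rejected)

# In full RLHF, the reward model's scores would then drive a policy-optimization
# step (commonly PPO) that fine-tunes the base LLM itself.
print(reward_model.score("Summarize RLHF -> response variant 0"))
```

In a real pipeline, that final reward-model scoring is what drives the “retrain the underlying models” step above, at far larger scale.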
Especially since the launch of ChatGPT late last November, AI researchers have been stunned by how these Reinforcement Learning (RL) loops fundamentally improve the reliability and usefulness of LLM AI models. As one senior researcher on Stanford’s open source LLM AI team told me, “This process really ramps up the efficacy of these models by taking the worst 20% of the initial model results and improving them by 80% or more”.
And researchers are finding that the optimization innovations are just starting. AI companies both old and new are laser-focused on improving and reinforcing these models, in both the training and learning cycles of the LLM AI process.
The key is that these RLHF and RLAIF loops create flywheels that are increasingly responsible for the emergent “Sparks of Artificial General Intelligence” (AGI, or Superintelligence) that LLM AIs are starting to exhibit, as discussed in a recent AI paper by Microsoft researchers.
Other benefits of RLHF include:
“RLHF offers several advantages in the development of AI systems like ChatGPT and GPT-4:
Improved performance: By incorporating human feedback into the learning process, RLHF helps AI systems better understand complex human preferences and produce more accurate, coherent, and contextually relevant responses.
Adaptability: RLHF enables AI models to adapt to different tasks and scenarios by learning from human trainers' diverse experiences and expertise. This flexibility allows the models to perform well in various applications, from conversational AI to content generation and beyond.
Reduced biases: The iterative process of collecting feedback and refining the model helps address and mitigate biases present in the initial training data. As human trainers evaluate and rank the model-generated outputs, they can identify and address undesirable behavior, ensuring that the AI system is more aligned with human values.
Continuous improvement: The RLHF process allows for continuous improvement in model performance. As human trainers provide more feedback and the model undergoes reinforcement learning, it becomes increasingly adept at generating high-quality outputs.
Enhanced safety: RLHF contributes to the development of safer AI systems by allowing human trainers to steer the model away from generating harmful or unwanted content. This feedback loop helps ensure that AI systems are more reliable and trustworthy in their interactions with users.”
As the folks at AI Analytics Vidhya highlight, underscoring the increasing importance of RLAIF over RLHF:
“We can compare the performance of the LLMs at different stages, using different LLM AI model sizes. We see there is a significant increase in the results after each training phase. We can replace the Human in RLHF in this segment with Artificial Intelligence RLAIF. This significantly reduces the cost of labeling and has the potential to perform better than RLHF”.
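As a rough illustration of that substitution, here is a small hypothetical Python sketch in which an AI preference model does the ranking that a human labeler would do under RLHF. The scoring heuristic is a crude stand-in so the example runs on its own; a real RLAIF setup would prompt a separate LLM to act as the judge.

```python
# A hedged sketch of the RLAIF substitution described above: the human ranking
# step is replaced by an AI "preference model" that scores candidates
# automatically. All names here are hypothetical placeholders.

def ai_preference_score(response):
    """Stand-in for an AI labeler / preference model. A real system would call
    a separate LLM prompted to judge helpfulness, honesty, and harmlessness;
    here a crude heuristic keeps the sketch self-contained."""
    penalty = 10.0 if "as an AI" in response else 0.0
    return len(set(response.split())) - penalty  # favor informative variety

def ai_rank(candidates):
    """RLAIF ranking: order candidates best to worst using the AI scorer,
    where RLHF would have asked a human labeler for the same ordering."""
    return sorted(candidates, key=ai_preference_score, reverse=True)

candidates = [
    "Flywheels compound small gains into accelerating momentum.",
    "As an AI, I cannot answer that.",
    "Flywheels are a thing.",
]
for rank, response in enumerate(ai_rank(candidates), start=1):
    print(rank, response)
```

The preference pairs produced this way can feed the same reward-model update shown in the earlier sketch, which is why RLAIF can cut labeling costs so sharply.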
Other ‘AI Native’ LLM AI model companies like Anthropic are using variations of these Reinforcement Learning (RL) techniques that replace human feedback with AI feedback guided by a written set of principles, an approach they refer to as “Constitutional AI”. Here is a review of that approach for those who want a deeper look.
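For a flavor of how that works, here is a rough, hypothetical sketch of the critique-and-revise idea at the heart of Constitutional AI. The two-item “constitution” and the string-based checks are illustrative placeholders, not Anthropic’s actual prompts or code.

```python
# A rough, hypothetical sketch of Constitutional AI's critique-and-revise idea:
# the model's own draft is checked against a short list of written principles
# ("the constitution") and revised whenever a principle is violated. In the
# real approach the critique and the revision are themselves generated by an LLM.

CONSTITUTION = [
    "Do not provide instructions for causing harm.",
    "Be honest: do not state things the model cannot know as fact.",
]

def draft_response(prompt):
    """Placeholder for the base model's first-pass answer."""
    return f"Draft answer to: {prompt}"

def critique(response, principle):
    """Placeholder AI critic: True if the response appears to violate the principle."""
    return "harm" in response.lower() and "harm" in principle.lower()

def revise(response, principle):
    """Placeholder revision step, LLM-generated in practice."""
    return response + f" [revised to comply with: {principle}]"

def constitutional_pass(prompt):
    response = draft_response(prompt)
    for principle in CONSTITUTION:
        if critique(response, principle):
            response = revise(response, principle)
    # Revised outputs like this become the training data for RLAIF-style fine-tuning.
    return response

print(constitutional_pass("How do flywheel effects apply to AI products?"))
```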
The floodgates are open on fine-tuning, optimizing, and improving LLM AI models with Reinforcement Learning, driven by humans and machines. They’re creating the new Flywheel loops that are fundamentally changing the field of AI technologies and their applications. They’re an evolution from the Amazon Flywheel, but potentially FAR more potent. I’ll have future posts on that subject at length.
In the meantime, we need to add Reinforcement Learning in AI to our vocabulary. Try it out at your next cocktail party where AI is being discussed. Stay tuned.