I’ve never seen so many people speculate so imaginatively about a single algorithm, based on nothing but a name: no paper, no data, no product. So let’s demystify the Q* fantasies; it may be a long exploration.
First, to understand the powerful combination of search and learning, we need to go back to 2016 and revisit AlphaGo, a brilliant achievement in AI history. It has four key components:
- Policy Neural Network (Policy NN, learning part): proposes promising moves by assigning a probability to each legal action.
- Value Neural Network (Value NN, learning part): evaluates the board and predicts the winner from any legal position in Go.
- Monte Carlo Tree Search (MCTS, search part): uses the policy network to simulate many possible sequences of moves from the current position, then aggregates the results of these simulations to decide the most promising move. This is the “slow thinking” part, in contrast to the fast token sampling of large language models (LLMs).
- Ground-truth signal: the source of power that drives the entire system. In Go this signal is very simple, a binary “who wins” label determined by the fixed rules of the game. Think of it as the energy source that sustains the learning process.
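To make the search part concrete, here is a minimal MCTS sketch. Everything in it is illustrative, not AlphaGo’s actual code: the toy game (remove 1 or 2 stones from a pile; taking the last stone wins) stands in for Go, and uniform-random rollouts stand in for the policy and value networks.

```python
import math
import random

# Toy game: a pile of stones; each player removes 1 or 2 per turn;
# whoever takes the last stone wins. (Optimal play: leave a multiple of 3.)
def legal_moves(pile):
    return [m for m in (1, 2) if m <= pile]

class Node:
    def __init__(self, pile, parent=None):
        self.pile, self.parent = pile, parent
        self.children = {}             # move -> child Node
        self.visits, self.wins = 0, 0.0

def ucb(child, parent_visits, c=1.4):
    # Upper Confidence Bound: balances exploitation and exploration.
    if child.visits == 0:
        return float("inf")
    return child.wins / child.visits + c * math.sqrt(
        math.log(parent_visits) / child.visits)

def rollout(pile):
    # Random playout; returns +1 if the player to move at `pile` wins.
    to_move = 0
    while pile > 0:
        pile -= random.choice(legal_moves(pile))
        to_move ^= 1
    return 1 if to_move == 1 else -1  # the previous mover took the last stone

def mcts(root_pile, iters=2000):
    root = Node(root_pile)
    for _ in range(iters):
        node = root
        # 1. Selection: descend through fully expanded nodes by UCB.
        while node.pile > 0 and len(node.children) == len(legal_moves(node.pile)):
            node = max(node.children.values(), key=lambda ch: ucb(ch, node.visits))
        # 2. Expansion: add one untried move.
        if node.pile > 0:
            move = random.choice(
                [m for m in legal_moves(node.pile) if m not in node.children])
            node.children[move] = Node(node.pile - move, node)
            node = node.children[move]
        # 3. Simulation: random playout from the new position.
        result = rollout(node.pile)
        # 4. Backpropagation: the value flips sign at each level (two players).
        value = -result  # perspective of the player who moved INTO `node`
        while node is not None:
            node.visits += 1
            node.wins += (value + 1) / 2
            value = -value
            node = node.parent
    # Final decision: the most-visited root move, the “slow thinking” output.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```

With a pile of 4, the search converges on taking 1 stone (leaving a losing pile of 3 for the opponent), which matches the known optimal strategy.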
So, how do these components work together?
AlphaGo evolves through self-play, playing against previous versions of itself. In this process, the policy network and the value network improve through iteration: as the policy gets better at selecting moves, the value network receives better data to learn from, and in turn provides more accurate feedback to the policy. A stronger policy also helps MCTS discover better moves.
This forms an ingenious “perpetual motion machine.” In this way, AlphaGo bootstrapped its own abilities and defeated the human world champion Lee Sedol 4-1 in 2016. An AI can never reach superhuman level by merely imitating human data.
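The flywheel above can be sketched in a few lines. This is a deliberately tiny stand-in: a tabular value function on a toy game (remove 1 or 2 stones from a pile; taking the last stone wins) instead of neural networks and Go. The greedy policy derived from the current values generates games; the game outcomes (the ground-truth signal) then improve the values; the improved values yield a stronger policy, and so on.

```python
import random

def self_play_train(max_pile=20, games=5000, eps=0.2, lr=0.1, seed=0):
    # V[pile] estimates the win probability for the player to move.
    rng = random.Random(seed)
    V = {p: 0.5 for p in range(1, max_pile + 1)}  # neutral initial guesses
    V[0] = 0.0  # no stones left: the player to move has already lost
    for _ in range(games):
        pile = rng.randint(1, max_pile)
        trajectory = []  # (position, player-to-move) pairs seen this game
        player = 0
        while pile > 0:
            moves = [m for m in (1, 2) if m <= pile]
            if rng.random() < eps:
                move = rng.choice(moves)  # exploration
            else:
                # Greedy policy: move to the position worst for the opponent.
                move = min(moves, key=lambda m: V[pile - m])
            trajectory.append((pile, player))
            pile -= move
            player ^= 1
        winner = player ^ 1  # the player who just took the last stone
        # The game outcome is the ground-truth signal that updates V.
        for pos, who in trajectory:
            target = 1.0 if who == winner else 0.0
            V[pos] += lr * (target - V[pos])
    return V
```

After training, the learned values recover the game’s structure (piles that are multiples of 3 are losing positions for the player to move), even though no human knowledge was given, only the win/loss signal.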
Now, let’s explore what Q* might be made of. What are its four corresponding components?
- Policy Neural Network: this would be OAI’s most powerful in-house LLM (a GPT), responsible for actually carrying out the reasoning steps that solve a math problem.
- Value Neural Network: Another GPT that evaluates the probability of correctness for each intermediate reasoning step.
OAI released a paper titled “Let’s Verify Step by Step” in May 2023, co-authored by big names such as Ilya Sutskever (@ilyasut), John Schulman (@johnschulman2), and Jan Leike (@janleike): https://arxiv.org/abs/2305.20050
Although the paper is not as well-known as DALL-E or Whisper, it gives us quite a few clues.
This paper proposes a Process-supervised Reward Model (PRM), which gives feedback on each step of the chain of thought. In contrast, an Outcome-supervised Reward Model (ORM) only judges the complete output at the end.
The ORM is the original reward-model formulation used in Reinforcement Learning from Human Feedback (RLHF), but it is too coarse-grained to properly evaluate the individual parts of a long response. In other words, ORMs are poor at credit assignment. In the RL literature, we would call the ORM a “sparse reward” (given only once at the end), while the PRM provides a “dense reward” that guides the LLM more smoothly toward the behavior we want.
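The contrast can be shown on a worked example. Note the toy verifier below (a regex check on arithmetic claims) is my own stand-in for illustration; in the paper, both reward models are learned LLM scorers, not rule-based checkers.

```python
import re

# A short reasoning chain with a mistake in the middle.
steps = [
    "12 * 4 = 48",          # correct step
    "48 + 10 = 58",         # correct step
    "58 / 2 = 30",          # first error: 58 / 2 is 29
    "so the answer is 30",  # wrong conclusion inherited from the error
]

def check_step(step):
    # Hypothetical per-step verifier: evaluate "a OP b = c" claims.
    m = re.match(r"(\d+) ([*+/-]) (\d+) = (\d+)", step)
    if m is None:
        return None  # not a checkable arithmetic claim
    a, op, b, c = int(m.group(1)), m.group(2), int(m.group(3)), int(m.group(4))
    value = {"*": a * b, "+": a + b, "/": a / b, "-": a - b}[op]
    return 1.0 if value == c else 0.0

# PRM-style dense reward: one score per step, pinpointing the first mistake.
prm_rewards = [check_step(s) for s in steps]  # [1.0, 1.0, 0.0, None]

# ORM-style sparse reward: a single scalar for the entire chain.
orm_reward = 0.0  # final answer is wrong; nothing says WHERE it went wrong
```

The dense signal tells the learner exactly which step to fix; the sparse signal punishes the two correct steps along with the broken one.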
- Search: unlike AlphaGo’s discrete states and actions, LLMs operate over the far more complex space of “all reasonable strings,” so we need new search procedures.
Building on Chain of Thought (CoT), the research community has developed several nonlinear variants of CoT:
- Tree of Thought, from @ShunyuYao12 and colleagues: it literally combines CoT with tree search: https://arxiv.org/abs/2305.10601
- Graph of Thought: as you might guess, turning the tree into a graph gives an even more complex search operator: https://arxiv.org/abs/2308.09687
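A Tree-of-Thought-style search can be sketched as a beam search over partial “thoughts,” each scored by a value function. In a real system, `propose` and `evaluate` would be LLM calls; here they are toy stand-ins of my own on a small puzzle (pick distinct numbers summing to a target), purely to show the control flow.

```python
TARGET = 10
NUMBERS = [1, 2, 3, 4, 5, 8]

def propose(state):
    # A "thought" extends the partial solution with one unused number.
    used, total = state
    return [(used | frozenset([n]), total + n) for n in NUMBERS if n not in used]

def evaluate(state):
    # Value heuristic: closeness to the target, heavily penalizing overshoot.
    _, total = state
    return -abs(TARGET - total) if total <= TARGET else -100

def tot_search(beam_width=3, max_depth=4):
    beam = [(frozenset(), 0)]  # root thought: nothing chosen yet
    for _ in range(max_depth):
        # Expand every thought in the beam into its children.
        candidates = [s for st in beam for s in propose(st)]
        for used, total in candidates:
            if total == TARGET:
                return sorted(used)  # a complete, verified solution
        # Prune: keep only the most promising thoughts, as a value model would.
        beam = sorted(candidates, key=evaluate, reverse=True)[:beam_width]
    return None
```

Unlike linear CoT, a wrong branch (e.g. starting from 1) is simply out-scored and dropped at the next depth instead of dooming the whole chain.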
- Ground-truth signal: there are several possibilities:
(a) Each math problem comes with a known answer. OAI may have collected a large corpus from existing math exams or competitions.
(b) The ORM itself could serve as the ground-truth signal, but then it can be exploited (reward-hacked), and the “energy” that sustains learning leaks away.
(c) A formal verification system, such as the Lean theorem prover, can turn a math problem into a coding problem and provide compiler feedback: https://lean-lang.org
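To illustrate option (c), here is a tiny Lean 4 sketch (the theorem is deliberately trivial) of why a proof assistant makes a good ground-truth signal:

```lean
-- Lean's kernel accepts this proof: both sides reduce to the same value.
theorem two_add_two : 2 + 2 = 4 := rfl

-- If a model instead proposed the step `2 + 2 = 5`, `rfl` would fail to
-- typecheck, and that compiler error is exactly the kind of automatic,
-- hard-to-hack feedback a search-and-learn loop needs.
```

Unlike a learned reward model, the type checker cannot be reward-hacked: a proof either compiles or it does not.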
Just like AlphaGo, the policy LLM and the value LLM can improve each other through iteration, and can also learn from human expert annotations where available. A better policy LLM helps the Tree of Thought search find better solution strategies, which in turn collect better data for the next round of iteration.
Demis Hassabis (@demishassabis) has mentioned that DeepMind’s Gemini will use “AlphaGo-style algorithms” to boost its reasoning capabilities. Even if Q* isn’t what we think it is, Google will certainly pursue the idea in its own way. If I can think of this, they certainly can too.
Note that what I am describing here is only about reasoning. I am not claiming that Q* would be more creative at writing poetry, telling jokes (@grok), or role-playing. Improving creativity is a fundamentally human thing, so I believe natural data will continue to beat synthetic data there.