
Sakana AI's RLT uses reinforcement learning to train teacher models

The project trains teacher models with reinforcement learning to teach large language models how to reason, improving scalability and performance at test time.

The GitHub project SakanaAI/RLT (Reinforcement-Learned Teachers) is an open-source framework released by Sakana AI. Its core goal is to let small teacher models focus on "teaching" rather than directly solving problems, thereby training large student models with reasoning capabilities more efficiently and at lower cost.

1. Project background and core concept 🎓

In traditional reinforcement-learning pipelines, the teacher model is trained to solve problems itself, and its answers are then used as training data for the student model. This process is expensive and slow, and the training objective (solving) is misaligned with the end use (teaching). The RLT method instead proposes:

  • Give the teacher the question together with the standard answer, and have it output a clear, structured explanation, like a good teacher lecturing.
  • The teacher's reinforcement-learning reward is based not on whether it solves the problem, but on whether a student model, given the explanation, can reach the correct answer (i.e., the teaching effectiveness of the explanation).
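The reward idea above can be sketched in a few lines. This is a hypothetical illustration, not the repository's actual implementation: `student_logprob` is a placeholder for any function that returns a causal language model's total log-probability of a target string given a context.

```python
# Hypothetical sketch of the RLT reward idea: the teacher is scored by how
# much its explanation helps a student model predict the known solution.

def rlt_reward(question: str, solution: str, explanation: str,
               student_logprob) -> float:
    """Reward for one teacher explanation (illustrative only)."""
    # Log-likelihood of the correct solution when the student also sees
    # the teacher's explanation ...
    with_teaching = student_logprob(
        context=f"{question}\n{explanation}", target=solution)
    # ... versus seeing the question alone, as a simple baseline.
    without_teaching = student_logprob(context=question, target=solution)
    # Positive reward means the explanation made the solution more likely.
    return with_teaching - without_teaching
```

The key point is that the teacher never needs to derive the solution on its own; it is optimized purely for how much its explanation raises the student's likelihood of the given answer.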

2. Why does this work?

  • Aligned training objective: the teacher is trained specifically to teach students rather than to solve problems itself, so the training goal matches the end use.
  • Small models make effective teachers: experiments show that a teacher with only 7B parameters can teach better than models with hundreds of billions of parameters (e.g., the 671B-parameter DeepSeek R1).
  • Significantly lower resource cost: training a 32B-parameter student with a 7B teacher can be completed on a single node within a day, at a cost far below traditional RL methods (thousands of dollars versus hundreds of thousands).

3. Performance experiments and data comparison

  • A 32B student model trained by the 7B teacher scored 37.6% across several benchmark tests, higher than the 34.4% achieved by students trained with DeepSeek R1 (a 671B teacher).
  • The RLT experiments cover mathematical and logical reasoning tasks such as AIME 2024, MATH500, and GPQA Diamond, where the 7B teacher showed excellent distillation results.

4. Project code structure and usage guidelines

The GitHub repository SakanaAI/RLT provides complete code, configurations, and model documentation. The main contents include:

  • Training scripts: cover a supervised fine-tuning (SFT) stage plus a reinforcement-learning stage. By default, Qwen/Qwen2.5-7B-Instruct is used as the base teacher model.
  • Configuration system: experiments are managed with Hydra (cfgs/run_cfg/*.yaml); launch.sh and launch_with_server.sh support different GPU resource environments.
  • Data format: inputs require question and solution columns, with an optional reasoning_trace column; custom data can also be used for training.
  • Pre-trained models and usage suggestions: RLT-7B student checkpoints, hosted on Hugging Face, can be used for inference or continued fine-tuning.
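The data format described above can be illustrated with a small example. The field names (`question`, `solution`, `reasoning_trace`) follow the description in this article; the exact schema the repository expects may differ, so treat this as a sketch only.

```python
import json

# Hypothetical rows matching the described layout: `question` and `solution`
# are required; `reasoning_trace` is optional and may be omitted per row.
rows = [
    {
        "question": "What is 12 * 13?",
        "solution": "156",
        "reasoning_trace": "12 * 13 = 12 * 10 + 12 * 3 = 120 + 36 = 156",
    },
    {
        # No reasoning_trace for this row.
        "question": "Is 97 prime?",
        "solution": "Yes",
    },
]

# Write as JSON Lines, a common on-disk format for such training data.
with open("custom_data.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```

Keeping `reasoning_trace` optional lets you mix fully annotated examples with plain question/answer pairs in one dataset.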

5. Potential and application scenarios of RLT

| Advantage | Description |
| --- | --- |
| 💡 Low cost, high efficiency | A small teacher model saves hardware and time resources |
| 🚀 Fast iteration | Strong reasoning students can be trained in days, or even a single day |
| 📚 Clear explanations | Output explanations are structured, making them easy to transfer and debug |
| 🌐 Open source and reproducible | Apache-2.0 licensed; researchers or product teams can freely try it out |

RLT is ideal for teams that want reasoning capability on limited resources, or for research on how to make one model "teach" another model rather than merely "do" tasks.

Summary

RLT is a new reinforcement-learning paradigm of "small models as teachers, large models as students": teachers are trained to generate high-quality explanations that build students' reasoning ability, greatly reducing training costs and improving learning efficiency while still performing well on complex reasoning tasks. The GitHub project provides a complete open-source implementation of the core algorithm with usage instructions, and is a good entry point for practicing or researching the method.

If you want to dig into a specific file structure, script command, or example workflow, or do a deeper analysis of a benchmark result, feel free to continue asking and I can help interpret it further!

GitHub: https://github.com/SakanaAI/RLT

YouTube:
