RL economics, morally charged terms, and “distillation”

Author

Thomas Dullien

Published

June 15, 2026

Modified

June 15, 2026

After a number of Twitter discussions in which I repeated myself a lot, it is time to write a short note on the economics of advancing LLM capabilities through RL, on principles of propaganda and the coining of new words, and on my stubborn refusal to use the term “distillation” except in a specific narrow sense.

How do models advance when human-curated data has run out?

It’s been a while since we ran out of human data to train LLMs on. We are training on copies of the internet, large piles of (originally pirated, then purchased-and-scanned-and-wholesale ingested) books, and whatever other data sources we can obtain. This leads to a certain performance plateau, as we haven’t quite figured out how to make the models more data-efficient in training.

The advancements we have seen in coding and mathematics in the last year are mostly due to reinforcement learning. At the highest level, you pose a problem to an LLM that the LLM has a small but nontrivial chance of solving. You then run N copies of the LLM to generate candidate solutions and check each one, which works because code and math answers can often be verified automatically. You get a small number of successful solutions and many failures. You can then use the successful solutions as new data to improve your model - moving the weights in a way that helps the model succeed with greater probability.
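To make the loop concrete, here is a toy sketch of sample-then-filter. Everything in it is an assumption for illustration: `sample_solution` stands in for an LLM rollout, `verify` for whatever checker you have (unit tests, a proof checker), and `update_success_prob` replaces the actual gradient update with a crude bump of the success rate. It is not how any particular lab does it.

```python
import random

def sample_solution(success_prob, rng):
    """Hypothetical stand-in for one LLM rollout.

    The 'correct' answer is 42; the toy model produces it with
    probability success_prob and a wrong answer otherwise.
    """
    return 42 if rng.random() < success_prob else rng.randint(0, 41)

def verify(candidate):
    """Hypothetical checker (think: unit tests for code, a verifier for math)."""
    return candidate == 42

def collect_successes(n_rollouts, success_prob, seed=0):
    """Run n_rollouts samples and keep only those the checker accepts."""
    rng = random.Random(seed)
    candidates = [sample_solution(success_prob, rng) for _ in range(n_rollouts)]
    return [c for c in candidates if verify(c)]

def update_success_prob(success_prob, n_successes, step=0.01):
    """Toy 'weight update': each kept solution nudges the success rate up, capped at 1."""
    return min(1.0, success_prob + step * n_successes)

if __name__ == "__main__":
    p = 0.05  # assumed starting success rate
    for round_idx in range(5):
        kept = collect_successes(n_rollouts=200, success_prob=p, seed=round_idx)
        p = update_success_prob(p, len(kept))
        print(f"round {round_idx}: kept {len(kept)} of 200, success rate now {p:.2f}")
```

The point of the sketch is only the shape of the loop: most rollouts are thrown away, and the few that pass the checker are what the model learns from.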

This is very elegant in a way, because you are kinda pulling yourself up by your own bootstraps. The cost is computational - if you have a 1% chance of finding a solution with your current LLM and current training data, you need hundreds or thousands of rollouts to get a reasonable variety of useful solutions.
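The arithmetic behind that is simple if you assume rollouts are independent and each succeeds with the same probability (a simplification; real rollouts on real problems are neither). A minimal sketch, with function names of my own choosing:

```python
import math

def expected_rollouts(target_successes, success_prob):
    """Expected number of rollouts needed to collect target_successes solutions."""
    return target_successes / success_prob

def prob_at_least_one(n_rollouts, success_prob):
    """Probability that n_rollouts independent rollouts contain at least one success."""
    return 1.0 - (1.0 - success_prob) ** n_rollouts

def rollouts_for_confidence(success_prob, confidence):
    """Smallest N such that at least one success occurs with probability >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - success_prob))

if __name__ == "__main__":
    p = 0.01
    print(expected_rollouts(10, p))          # rollouts for ten solutions, in expectation
    print(prob_at_least_one(100, p))         # chance that 100 rollouts yield anything at all
    print(rollouts_for_confidence(p, 0.95))  # rollouts for a 95% chance of one success
```

At a 1% success rate, ten usable solutions cost about 1000 rollouts in expectation, 100 rollouts give you only about a 63% chance of seeing even one success, and you need 299 rollouts to push that chance to 95%.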

Once you have a model that can generate a good solution for this problem with high probability, and you make that model available to others, you also provide a much cheaper way of producing the better training data: Third parties can now just ask your model to generate good solutions for them.

So for the second mover who gets to use your model, improving their own model from your model’s outputs is cheaper: they can skip the more-or-less-random search through a high-dimensional solution space and be guided better.
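A back-of-the-envelope version of that asymmetry, with entirely made-up numbers: the success rates and per-rollout costs below are hypothetical placeholders, not measurements of any real model or API.

```python
def cost_per_usable_solution(success_prob, cost_per_rollout):
    """Expected spend to obtain one accepted solution."""
    return cost_per_rollout / success_prob

def second_mover_advantage(p_own, cost_own, p_api, cost_api):
    """How many times cheaper it is to query the stronger model than to search yourself."""
    own = cost_per_usable_solution(p_own, cost_own)
    api = cost_per_usable_solution(p_api, cost_api)
    return own / api

if __name__ == "__main__":
    # Hypothetical: own model solves 1% of the time at 1 unit per rollout;
    # the leader's model solves 90% of the time at 3 units per query.
    print(second_mover_advantage(p_own=0.01, cost_own=1.0, p_api=0.9, cost_api=3.0))
```

With these assumed inputs the second mover pays roughly 30 times less per usable solution, even though each query to the stronger model is priced at three times the cost of a local rollout.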

This is a fundamental part of the “closed LLM as a service” business, and it is painful for the leader of the pack because they need to spend money to advance, and others can catch up more cheaply.

Reframing an inconvenient issue with your business model in moral terms

Imagine you’ve raised billions of dollars and you realize that your business model has a rather inconvenient flaw - you have a good business, but for it to become a fantastic business, you’d need to fix this flaw. And the flaw, as you perceive it, is the current legal system for intellectual property with its old and well-tested precedents and mechanisms.

It will be easy to convince yourself that the flaw in your business model that gives your competitors a way to catch up with lesser investment is a moral outrage - it is so unjust! - and then complain about the fact that others have the right to do what they are doing.

Once you’ve convinced yourself of the immorality of what your competition is doing (how dare they compress your margins?), you will need to somehow re-frame what they are doing in moral terms. So “training on solved problems to improve” doesn’t quite have the right ring to it. We need something malicious, like “distillation attacks”.

“Distillation” is great, because it evokes bootlegging and 1920s Prohibition-era intrigue. And “attack” is great because only bad people attack. So you leverage the fact that “distillation” was already the name of a technique for teaching a smaller model from a larger one, provided you have access to the larger model’s internals. You tack on the word “attack” to make it sound more nefarious, and you start screaming from the rooftops that evil distillation attackers are killing your morally superior business (which started with actual copyright violations, justified only ex post by your success).

This is what happened here, and I urge every reader to not go along with it. Distillation means having access to a large model, including all the last-layer token probabilities, and training a smaller model by taking those internal last-layer probabilities into account.
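The difference shows up directly in the loss function. Below is a minimal sketch in plain Python with names of my own choosing. `distillation_loss` needs the teacher's full next-token probability vector, while `output_only_loss` sees only the one token the other model actually emitted. I am leaving out the temperature scaling that distillation setups commonly use, so treat this as a simplification.

```python
import math

def softmax(logits):
    """Turn raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_probs, student_logits):
    """KL(teacher || student) over the whole vocabulary.

    Requires the teacher's full last-layer distribution for this position.
    """
    student_probs = softmax(student_logits)
    return sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )

def output_only_loss(sampled_token, student_logits):
    """Cross-entropy against the single token that was sampled from the other model."""
    return -math.log(softmax(student_logits)[sampled_token])

if __name__ == "__main__":
    teacher_probs = softmax([2.0, 1.0, 0.1])  # toy 3-token vocabulary
    student_logits = [0.5, 0.5, 0.5]
    print(distillation_loss(teacher_probs, student_logits))
    print(output_only_loss(0, student_logits))
```

The first function cannot even be called if all you have is text from an API, because `teacher_probs` is exactly the internal information you do not get.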

Just training on model output isn’t it. And you cannot have a world where people use LLMs to write code or text, and are allowed to publish that on the internet, and simultaneously prevent up-leveling other models as they train on that data. You have no legal or moral legs to stand on if you want to prevent that.

If the Chinese models are distilled, so is the Cursor fine-tune of Kimi, or any model that is trained on the output of other models - and most of human output is now model-assisted.

You are free to argue that this is inconvenient for your business model, and that a legal framework which allowed you to prevent it would help attract more investment to advance your model, but that’s about it.

This is why I don’t call training on other models’ output “distillation”

Let’s call it “training on model output”, or anything else that is not morally charged. And let’s be honest that the existence of LLMs in their current form is the result of highly dubious approaches to copyright that are ex-post legitimized by the actual value these models bring to society. Let’s please avoid allowing parties with a particular financial interest to build a moral framing around their interests, though.