Tuning chain of thought on questions with verifiable answers

A recent paradigm in training reasoning models is to tune them on questions where the answers are verifiable. We train them to generate chains of thought that help them reach correct answers by rewarding them for doing so. If the answer they produce is correct, we reward them regardless of what is actually in the chain of thought. Here I will discuss how this paradigm relates to reasoning in its full generality and how it could be extended to produce more intelligible chains of thought.

The canonical example of training a model to generate chain of thought that increases the correctness of its answers is DeepSeek R1-Zero. It was given a series of mainly maths and coding questions where the answers were known or verifiable in some way and rewarded for getting the answer correct (it was also rewarded for formatting its answer correctly, but I don’t think this is particularly important). This was surprisingly effective given the simplicity of the method and produced impressive results.
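
To make the "reward the answer, ignore the thoughts" idea concrete, here is a minimal sketch of an outcome-only reward. The tag names, the exact-match check, and the `format_bonus` value are my own assumptions for illustration and are not taken from the DeepSeek setup.

```python
import re

# Hypothetical tag names and patterns, chosen for illustration only.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
FORMAT_RE = re.compile(r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*", re.DOTALL)


def outcome_reward(completion: str, reference: str, format_bonus: float = 0.1) -> float:
    """Score a completion from its final answer and overall format only.

    The text inside the thinking tags is never inspected.
    """
    match = ANSWER_RE.search(completion)
    answer = match.group(1).strip() if match else None
    correct = 1.0 if answer == reference.strip() else 0.0
    well_formed = format_bonus if FORMAT_RE.fullmatch(completion) else 0.0
    return correct + well_formed


clear = "<think>2 + 2 is 4</think><answer>4</answer>"
opaque = "<think>zwei plus deux = quatre ???</think><answer>4</answer>"
# Both completions receive the same reward: only the answer and format count.
same = outcome_reward(clear, "4") == outcome_reward(opaque, "4")
```

Nothing in this reward distinguishes a readable chain of thought from an unreadable one, which is the point the rest of this post turns on.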

Along similar lines, Foster et al. did some work where they made this more effective and general. Taking the philosophy of "next word prediction is all you need" to heart, they proposed a method to generate correctness-inducing chains of thought on free-form datasets. Given a prefix of a sequence of tokens, e.g., "The cat sat", the model would have to generate a chain of thought that increased the prediction accuracy on the next tokens, e.g., "on the mat". The main advantage of this is that it generalises the correct question-answering paradigm to datasets where we don’t have question-answer pairs expressed in a structured format.
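
One plausible way to write such a reward is as the gain in log-likelihood of the true continuation when the model conditions on its chain of thought. This is a sketch under my own assumptions and not necessarily the objective used in the cited work: `log_prob` is a hypothetical scorer returning log p(continuation | context), and subtracting the no-thought baseline is my choice.

```python
import math
from typing import Callable

# Hypothetical scorer: log p(continuation | context) under some language model.
LogProbFn = Callable[[str, str], float]


def cot_gain(prefix: str, thought: str, continuation: str, log_prob: LogProbFn) -> float:
    """Improvement in log-likelihood of the continuation from adding the thought."""
    with_thought = log_prob(prefix + thought, continuation)
    without_thought = log_prob(prefix, continuation)
    return with_thought - without_thought


def toy_log_prob(context: str, continuation: str) -> float:
    # Toy stand-in for a model: the continuation is likelier if the context mentions "mat".
    return math.log(0.9) if "mat" in context else math.log(0.1)


helpful = cot_gain("The cat sat", " (cats like mats)", " on the mat", toy_log_prob)
unhelpful = cot_gain("The cat sat", " (hmm)", " on the mat", toy_log_prob)
```

With the toy scorer, the thought that mentions the mat earns a positive reward and the uninformative one earns zero. As with the question-answering case, the reward only measures whether the thought helped the model itself.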

I like this paradigm. I think self-reasoning is one of the parts of reasoning that is most useful for us, especially when we want to come up with correct answers to questions. The RL here clearly encourages models to generate more correct answers, and the application of this is incredibly broad. There is, however, a key disadvantage that we should note: the chain of thought in this setting has no incentive to be intelligible to others. This was clearly evident in DeepSeek R1-Zero, which inside its <think> tags started to mix multiple languages and generally produced chains of thought whose usefulness for reaching the correct answer was not obvious. It did this because there are many ways to generate chains of thought that induce the correct answer, and most of these are not the intelligible reasoning we desire. I don’t see a reason why we would expect the model to learn to use this intelligible subset when we express no preference that it should do so.

In the discussion so far, I have been hesitant to describe what we are incentivising here as reasoning. This is because I would argue that here we are only incentivising good self-reasoning, which is a special case of the fully general good reasoning that we want to achieve. In the phrase "good self-reasoning" I am using "good" to mean that the reasoning encourages the answers to be correct and "self-reasoning" to mean that the model provides reasons whose function is to persuade itself of the answer. To do this, incentivising coming to the correct answer is sufficient. However, incentivising coming to the correct answer does not encourage generally good reasoning, which would persuade a diversity of agents of the correct answer. This is because other agents are not involved in the reward at all. Here we have focused on a very specific case where the only agent that our model needs to convince of the answer is itself. Therefore, the model will overfit its reasoning to what it understands and miss out on the general arguments that will convince agents different from itself.

To fix this, we need to add something to the reward that encourages models to give intelligible and general reasoning. It is not surprising that we have not done so already because this whole paradigm is based on verification, and it is not clear how we would verify quality of reasoning. This would probably involve assigning some reward to the chain of thought that measures how well it convinces other agents.

A proxy for this would be to just ask a group of LLMs (possibly just one) to evaluate how coherent the reasoning is. I think this would be limited, as the evaluation would not improve along with the quality of the reasoning unless it was done by the model itself.

Instead, I think an open-ended adversarial scenario where agents are incentivised to persuade others of their answer would work better. I suspect that it would not be necessary to incentivise being correct here, as I think it is possible to make more persuasive arguments for correct answers than for incorrect ones. If so, incentivising persuasion alone is enough, and correct answers will naturally dominate. A key advantage of this scenario is that we do not even need to know the correct answer to the question to incentivise persuasion. However, since we have it, there is nothing stopping us from including correctness in the reward to help the agents along the way. This scenario would encourage intelligible reasoning as all the other agents would need to be able to understand and be persuaded by the reasons. Therefore, with sufficient diversity of agents to avoid undesirable equilibria, we would get agents trained to produce correct answers along with intelligible chains of thought justifying those answers.
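
As a rough sketch of what the reward in such a scenario might look like, each agent could be scored by the share of other agents it persuades, with an optional correctness bonus when a reference answer is available. Everything here is hypothetical: the `Proposal` type, the `is_persuaded` judge, and the `correctness_weight` parameter are names I am introducing for illustration, and a real judge would be another model rather than a string check.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence


@dataclass
class Proposal:
    agent_id: str
    answer: str
    reasoning: str


# Hypothetical judge: True if the agent `judge_id` is persuaded by the proposal.
Persuaded = Callable[[str, Proposal], bool]


def persuasion_rewards(
    proposals: Sequence[Proposal],
    is_persuaded: Persuaded,
    reference: Optional[str] = None,
    correctness_weight: float = 0.0,
) -> Dict[str, float]:
    """Reward each agent by the fraction of *other* agents it persuades."""
    rewards: Dict[str, float] = {}
    for p in proposals:
        others = [q.agent_id for q in proposals if q.agent_id != p.agent_id]
        share = sum(is_persuaded(j, p) for j in others) / len(others) if others else 0.0
        # Correctness is optional: with no reference, persuasion is the only signal.
        bonus = correctness_weight if reference is not None and p.answer == reference else 0.0
        rewards[p.agent_id] = share + bonus
    return rewards


def toy_judge(judge_id: str, proposal: Proposal) -> bool:
    # Toy stand-in for a model judge: persuaded only by reasoning that gives a reason.
    return "because" in proposal.reasoning


proposals = [
    Proposal("a", "4", "because 2 + 2 = 4"),
    Proposal("b", "5", "trust me"),
]
persuasion_only = persuasion_rewards(proposals, toy_judge)
with_correctness = persuasion_rewards(proposals, toy_judge, reference="4", correctness_weight=0.5)
```

Note that an agent's own opinion never enters its reward, which is what separates this from the self-reasoning setting above. The diversity of judges matters: with a single fixed judge like the toy one here, agents could simply learn its quirks.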