Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, Yuqing Yang
This blog post distills the key insights of ***“Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?”*** in connection with our recent paper, ***“Understanding Reasoning in LLMs through Strategic Information Allocation under Uncertainty”*** [1].
26 Mar, 2026
Full paper: https://arxiv.org/abs/2603.24472
Code: https://github.com/beanie00/self-distillation-analysis
Hugging Face: https://huggingface.co/collections/beanie00/self-distillation-analysis

Recently, self-distillation has attracted increasing interest. In this setting, two versions of the same model are used: one conditioned on rich context (i.e., correct solutions) acts as the teacher and provides token-level reward signals for responses generated by the other model, which does not have access to the answers. Several studies have shown that self-distillation can serve as an effective approach for LLM post-training, going beyond RLVR methods that rely on binary (0 or 1) rewards. These approaches have shown particularly strong improvements in domains such as agentic settings and scientific reasoning, especially under in-domain evaluation. A notable trend is that performance improves as response length decreases, suggesting that self-distillation encourages more concise and effective reasoning.
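To make the setup above concrete, here is a minimal sketch (not the authors' implementation) of the core idea: the teacher is the same model conditioned on the correct solution, and the per-token reward for a student-generated response can be taken as the gap between teacher and student log-probabilities on each token. The function name and the toy log-prob values below are illustrative assumptions.

```python
def token_level_rewards(student_logprobs, teacher_logprobs):
    """Per-token reward: how much more likely the solution-conditioned
    teacher finds each token of the student's response. Positive values
    mean the teacher endorses the token; negative values mean the
    teacher, which can see the correct solution, would steer elsewhere."""
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]

# Toy log-probs for a 4-token student response. The teacher is more
# confident on tokens 1-2 and less confident on token 3 (e.g., a detour
# the solution-conditioned teacher would not take).
student = [-1.2, -0.9, -1.5, -0.7]
teacher = [-0.4, -0.3, -2.1, -0.5]

rewards = token_level_rewards(student, teacher)
print([round(r, 2) for r in rewards])  # [0.8, 0.6, -0.6, 0.2]
```

Unlike a single binary RLVR reward per response, this signal is dense: every token gets graded, which is what makes the shift in the student's token distribution (and, as discussed below, its loss of uncertainty markers) possible.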
However, when applying the same self-distillation method to mathematical reasoning tasks, we sometimes observe a markedly different phenomenon: response length still decreases as training progresses (consistent with previous results), but unlike prior findings, the model loses its original reasoning ability, leading to a drop in performance.


[Fig. 1] Changes in training score and response length when training with GRPO and with Reinforcement Learning via Self-Distillation (SDPO) [3] in the chemistry domain. We borrow these results from the SDPO Weights & Biases logs (https://wandb.ai/jonhue/SDPO?nw=mgotcx6kk7).


[Fig. 2] Changes in training score and response length during training on the DAPO-Math-17k dataset with GRPO and SDPO.
<aside> 🤔
This raises a question: “Why does performance sometimes degrade despite the model being trained to move toward the correct answer?”
</aside>
We trace this phenomenon to the suppression of epistemic verbalization [1], which refers to models explicitly verbalizing and incorporating their uncertainty during the reasoning process. Strong reasoning models such as DeepSeek-R1 frequently express uncertainty using tokens like “Wait” or other self-corrective phrases. Although these expressions may not directly contribute to the reasoning process, removing them discards important information that the reasoning may be incorrect, leading to a significant drop in performance [1].
Self-distillation may unintentionally suppress this behavior. Because the teacher model has access to the correct solution, it produces more confident and linear reasoning traces, reducing the expression of uncertainty. As a result, the student may also learn to reason with greater apparent certainty, losing epistemic verbalization. In this blog, we present various experimental results and analyze why self-distillation can degrade the base model’s reasoning ability from the perspectives of its inherent abilities, epistemic verbalization, and their impact on generalization. We believe our findings provide useful insights into LLM reasoning and post-training, and contribute to improving on-policy distillation.