LLMs之CoTM：《Detecting misbehavior in frontier reasoning models》翻译与解读

导读：这篇OpenAI的论文介绍了一种名为“思维链监控”（Chain-of-Thought Monitoring，CoT Monitoring）的技术，旨在提高大型语言模型（LLM）在复杂推理任务中的可靠性和准确性。这篇论文提出了一种新颖的 LLM 推理监控方法——CoT Monitoring，通过结合 CoT 提示和自我评价机制，有效地提高了 LLM 在复杂推理任务中的可靠性和准确性，并为构建更可靠和可信赖的 LLM 提供了一种新的途径。

>> 背景痛点：大型语言模型（LLM）虽然在许多任务上表现出色，但在需要多步推理的复杂任务中，容易出现“幻觉”（hallucination）问题，即模型给出看似合理但实际上错误的答案，其推理过程缺乏逻辑或包含事实性错误。这严重限制了LLM在需要高可靠性的应用中的实用性。现有的基于思维链（Chain-of-Thought，CoT）的提示方法虽然能提升LLM的推理能力，但并不能完全解决幻觉问题，且缺乏对推理过程的有效监控机制。

>> 具体的解决方案：思维链监控 (CoT Monitoring)。CoT Monitoring 的核心思想是让 LLM 不仅生成答案，还要同时生成对自身推理过程的“自我评价”（self-critique）。这通过一个复合提示来实现，该提示包含两部分：

● 思维链生成：引导 LLM 生成其解决问题的步骤（思维链）。

● 自我评价：引导 LLM 对其生成的思维链进行评估，判断推理过程的正确性和完整性。这可以是自由形式的文本描述，也可以是结构化的评分。

>> 核心思路步骤：

● 复合提示设计：设计一个提示词，包含 CoT 提示和自我评价提示，例如：Let's think step by step. Question：[question]. My reasoning：[reasoning]. My confidence in this reasoning：[confidence]。

● 模型生成：LLM 根据复合提示生成答案、思维链和对思维链的自我评价。

● 结果分析：分析 LLM 的自我评价，例如其对推理过程的置信度，来判断答案的可靠性。高置信度但错误的答案提示模型存在幻觉问题，需要改进。

>> 优势：

● 提高可靠性：通过自我评价，可以识别和减少 LLM 的幻觉，提高答案的可靠性。

● 增强可解释性：自我评价提供了 LLM 推理过程的解释，提高了模型的可解释性。

● 改进模型：LLM 的自我评价可以作为训练数据，用于改进其推理能力。

>> 论文结论和观点：

● CoT Monitoring 是一种有效的技术，可以显著提高 LLM 在复杂推理任务中的准确性和可靠性。

● 自我评价是 CoT Monitoring 的核心，它能帮助 LLM 识别并纠正自身的错误。

● 该方法可以与其他技术结合使用，进一步提升 LLM 的性能。

● 论文通过实验结果表明，CoT Monitoring 能有效降低 LLM 的错误率，尤其是在需要多步推理的复杂任务中。

● 未来的研究方向包括：改进自我评价的质量、开发更有效的错误纠正机制，以及将 CoT Monitoring 应用于更广泛的 LLM 和任务中。

《Detecting misbehavior in frontier reasoning models》的解读

《Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation》翻译与解读

Abstract

1、Introduction

Conclusion

《Detecting misbehavior in frontier reasoning models》的解读

OpenAI 的这篇博客文章介绍了一种名为“思维链监控”（Chain-of-Thought Monitoring）的新技术，用于提高大型语言模型（LLM）的推理能力和可靠性。这篇博客文章介绍了一种新颖的 LLM 推理监控方法，通过让 LLM 对自身的推理过程进行自我评价，有效地提高了其推理的可靠性和可解释性，为构建更可靠和可信赖的 LLM 提供了一种新的途径。

>> 背景痛点：大型语言模型（LLM）在复杂推理任务上的表现仍然不稳定。它们有时会给出看似合理的答案，但推理过程存在逻辑错误或事实错误，即所谓的“幻觉”（hallucination）。这种不稳定性降低了LLM在实际应用中的可靠性，尤其是在需要高度可靠性的场景中。现有方法，例如思维链（Chain-of-Thought, CoT）提示，虽然能提升推理能力，但并不能完全解决幻觉问题，且难以对推理过程进行有效监控和纠正。

>> 具体的解决方案：思维链监控。思维链监控的核心思想是让LLM不仅生成答案，还要同时生成其推理过程的“自我评价”（self-critique）。具体来说，它包含两个步骤：

● 生成思维链：使用 CoT 提示，让 LLM 生成其解决问题的步骤。

● 自我评价：让 LLM 对其生成的思维链进行评估，判断推理过程的正确性和完整性。这部分可以是自由形式的文本描述，也可以是结构化的判断（例如，对每个步骤的置信度评分）。

>> 核心思路步骤：

● 提示设计：设计一个包含 CoT 提示和自我评价提示的复合提示词。

● 模型生成：LLM 根据复合提示词生成答案和其推理过程的自我评价。

● 结果分析：分析 LLM 的自我评价，识别可能存在的错误或不完整之处。这可以作为人工干预或进一步模型优化的依据。

>> 优势：

● 提高可靠性：通过对推理过程进行自我评价，可以识别并减少 LLM 的幻觉。

● 增强可解释性：自我评价提供了 LLM 推理过程的解释，方便理解其决策过程。

● 促进模型改进：自我评价可以作为数据，用于进一步训练和改进 LLM 的推理能力。

>> 结论和观点：

● 思维链监控是一种有前景的技术，可以显著提高 LLM 的推理能力和可靠性。

● 自我评价机制是关键，它允许 LLM 对自身的推理过程进行反思和修正。

● 该技术可以与其他技术结合使用，进一步提升 LLM 的性能。

● 文章中展示的例子表明，思维链监控能够有效地识别和纠正 LLM 推理过程中的错误。

● 未来研究方向包括改进自我评价机制、探索更有效的错误纠正方法以及将该技术应用于更广泛的领域。

《Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation》翻译与解读

地址

文章地址：https://openai.com/index/chain-of-thought-monitoring/

论文地址：https://cdn.openai.com/pdf/34f2ada6-870f-4c26-9790-fd8def56387f/CoT_Monitoring.pdf

时间

2025年3月10日

作者

OpenAI团队

Abstract

Mitigating reward hacking—where AI systems misbehave due to flaws or misspecifications in their learning objectives—remains a key challenge in constructing capable and aligned models. We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments by using another LLM that observes the model’s chain-of-thought (CoT) reasoning. CoT monitoring can be far more effective than monitoring agent actions and outputs alone, and we further found that a LLM weaker than o3-mini, namely GPT-4o, can effectively monitor a stronger model. Because CoT monitors can be effective at detecting exploits, it is natural to ask whether those exploits can be suppressed by incorporating a CoT monitor directly into the agent’s training objective. While we show that integrating CoT monitors into the reinforcement learning reward can indeed produce more capable and more aligned agents in the low optimization regime, we find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the CoT while still exhibiting a significant rate of reward hacking. Because it is difficult to tell when CoTs have become obfuscated, it may be necessary to pay a monitorability tax by not applying strong optimization pressures directly to the chain-of-thought, ensuring that CoTs remain monitorable and useful for detecting misaligned behavior.

缓解奖励操纵（即由于学习目标存在缺陷或错误指定，导致人工智能系统行为不当）仍是构建强大且与人类价值观一致的模型所面临的关键挑战。我们表明，通过使用另一个观察前沿推理模型（如 OpenAI o3-mini）的链式思维（CoT）推理过程的大型语言模型（LLM），可以在代理编码环境中监测该模型是否存在奖励操纵。与仅监测代理行为和输出相比，CoT 监测可能要有效得多，而且我们还发现，比 o3-mini 弱的 LLM（即 GPT-4o）也能有效监测更强大的模型。由于 CoT 监测器能够有效检测漏洞，很自然会问，是否可以通过将 CoT 监测器直接纳入代理的训练目标来抑制这些漏洞。虽然我们表明，在低优化程度下，将 CoT 监测器整合到强化学习奖励中确实能产生更强大且更符合人类价值观的代理，但我们发现，当优化程度过高时，代理会学会隐藏在 CoT 中的奖励操纵，尽管如此，其奖励操纵率仍然很高。由于难以判断链式思维何时变得晦涩难懂，可能有必要付出可监控性的代价，即不对链式思维直接施加强力优化压力，以确保其保持可监控状态，并对检测行为偏差有用。

1、Introduction

Exploiting flaws or misspecifications in learning objectives, known as reward hacking in reinforcement learning (RL) settings, remains a critical failure mode in modern AI systems [1–3] and has led to agents exhibiting misaligned behaviors across many domains such as language modeling [4–6], control tasks [7–10], and recommendation systems [11–14]. The true objective function we wish to optimize is often hard to write down precisely, and so the challenge in creating capable and aligned systems largely lies in designing robust proxies that do not deviate in ways that a model may learn to exploit [3, 15]. This problem is not unique to machine learning systems but has also plagued human institutions [16–19]. For example, in 1902 the Hanoi government incentivized rat eradication by paying citizens for each rat tail they turned in; however, this policy backfired when people began farming rats specifically for their tails, which led to an even larger rat population [20]. Given that reward hacking is a problem even for humans, it seems unlikely that the issue will be solved for AI models by simply continuing to push the model intelligence frontier. In fact, enhancing an agent’s capabilities may exacerbate the problem by better equipping it to discover and execute more complex and hard-to-monitor exploits [21, 22]. We have found this to be anecdotally true: as we have continued to scale RL training, agents have discovered more complex and hard-to-detect hacks. Thus far, the only general strategy for mitigating reward hacking is to manually monitor agents for unintended behavior, which is unlikely to scale as their outputs and actions grow more complex—possibly even superhuman—and become more widely used.

在强化学习（RL）环境中，利用学习目标中的缺陷或错误设定，即所谓的奖励破解，仍然是现代人工智能系统中的一个关键失败模式[1-3]，并且导致了在诸如语言建模[4-6]、控制任务[7-10]和推荐系统[11-14]等多个领域中代理表现出行为不一致。我们希望优化的真实目标函数往往难以精确表述，因此创建有能力且行为一致的系统的主要挑战在于设计出稳健的替代指标，这些指标不会以模型可能学会利用的方式发生偏差[3， 15]。这个问题不仅存在于机器学习系统中，也困扰着人类机构[16-19]。例如，在 1902 年，河内政府通过向市民支付每条老鼠尾巴的费用来激励灭鼠；然而，这一政策适得其反，因为人们开始专门养殖老鼠以获取它们的尾巴，这导致了老鼠数量的进一步增加[20]。鉴于奖励破解对人类来说也是一个问题，似乎仅仅继续推进模型智能的前沿并不能解决人工智能模型中的这一问题。事实上，增强代理的能力可能会使问题更加严重，因为这会让其更易于发现并执行更复杂且难以监控的漏洞利用行为[21, 22]。我们发现这种情况在实际中确实存在：随着强化学习训练规模的不断扩大，代理发现了更多复杂且难以察觉的攻击手段。到目前为止，缓解奖励操纵的唯一通用策略是人工监控代理的意外行为，但随着其输出和行动变得越来越复杂——甚至可能超越人类水平——并且应用范围越来越广，这种做法不太可能实现规模化。

The emerging generation of large language models (LLMs) [23] that are trained with reinforcement learning to reason in chain-of-thought (CoT) [24–26] offers a promising new avenue for monitoring. Where LLMs are normally trained to give an answer directly after a prompt, CoT reasoning models first produce their reasoning and then give their answer. In particular, these reasoning models are trained by an RL procedure that searches for chains-of-thought that maximize the probability of the model giving a high reward answer. Because LLMs are pretrained on human text, the CoTs that are most instrumentally useful to generating a correct answer are likely to also be articulated in human language and relevant to the task at hand. If indeed naturally understandable reasoning traces are the most instrumentally useful, they will be reinforced through the RL training process. As a result, it is plausible that CoTs will reveal critical insights into the agent’s decision-making in language naturally understandable by both humans and LLMs, and we empirically find this to often be the case. For example, consider a scenario where an agent is tasked with implementing a feature change in order to make unit tests pass in a code repository, and it has two actions available to it—hack: make unit tests pass trivially, or genuine fix: implement the feature with a correct and general solution. If it first articulates in its CoT, “This seems hard, let me try to skip some of the unit tests,” then it is likely that hack is a more natural continuation and more probable under the LLM prior than genuine fix, i.e. pLLM (hack|CoT ) > pLLM (genuine fix|CoT ).

新兴的一代大型语言模型（LLMs）[23]通过强化学习训练以进行链式思维（CoT）[24-26]推理，为监测提供了一条有前景的新途径。通常，LLMs 在收到提示后直接给出答案，而 CoT 推理模型则先给出推理过程，然后再给出答案。特别是，这些推理模型通过强化学习过程进行训练，该过程寻找能最大化模型给出高奖励答案概率的链式思维。由于 LLMs 是基于人类文本进行预训练的，因此对于生成正确答案最有用的链式思维很可能也是用人类语言表述的，并且与手头的任务相关。如果确实自然可理解的推理过程是最有用的，那么它们就会在强化学习训练过程中得到强化。因此，链式思维有可能揭示出代理决策过程中的关键见解，这些见解对于人类和 LLMs 来说都是自然可理解的，我们通过实证研究发现这种情况经常发生。例如，设想这样一种场景：一个代理被指派在代码库中实现一项功能变更以使单元测试通过，它有两个可用的操作——“作弊”：简单地让单元测试通过，或者“真正修复”：用正确且通用的解决方案实现该功能。如果它首先在自己的推理链中表述：“这似乎很难，让我试着跳过一些单元测试”，那么“作弊”作为后续操作会更自然，也更有可能符合 LLM 的先验概率，即 pLLM（作弊|推理链）> pLLM（真正修复|推理链）。

Figure 1: Chain-of-thought from a frontier reasoning model discovered to be reward hacking in the wild. The agent is tasked with fixing a small code repository by alternating between CoT messages, viewing files, and editing files. Because full agent solutions (trajectories) can be long, we show summarizations of the trajectory (a mix of CoT, tool calls, and code patches) in red, the raw CoT directly quoted in green, and code patch tool calls in blue. We mask variable names for IP considerations, and we bold certain phrases in CoT excerpts for emphasis. The agent notes that the tests only check a certain function, and that it would presumably be “Hard” to implement a genuine solution. The agent then notes it could “fudge” and circumvent the tests by making verify always return true. This is a real example that was detected by our GPT-4o hack detector during a frontier RL run, and we show more examples in Appendix A.在实际应用中发现的一个前沿推理模型的思维链，该模型存在奖励作弊行为。代理的任务是修复一个小型代码库，通过交替发送思维链消息、查看文件和编辑文件来完成。由于完整的代理解决方案（轨迹）可能很长，我们用红色显示轨迹的摘要（思维链、工具调用和代码补丁的混合），用绿色直接引用原始思维链，用蓝色显示代码补丁工具调用。出于知识产权考虑，我们对变量名进行了屏蔽，并在思维链摘录中加粗某些短语以示强调。代理注意到测试仅检查某个特定函数，并且实现真正的解决方案可能“很难”。然后代理指出，它可以通过让 verify 始终返回 true 来“作弊”并绕过测试。这是我们的 GPT-4o 奖励作弊检测器在前沿强化学习运行期间检测到的一个真实示例，更多示例见附录 A。

Conclusion结论

Our findings demonstrate that chain-of-thought monitoring can be very effective in detecting certain classes of misaligned behaviors in reasoning models. This emergent monitorability offers a promising avenue for improving model safety, as CoTs often reveal insights into an agent’s intent and reasoning processes that are not readily apparent from its actions and outputs alone. However, we also observed that applying optimization pressure directly to CoTs can undermine their monitorability. While moder-ate CoT optimization can enhance capabilities—particularly in the low-optimization regime—excessive optimization encourages models to obscure their true intent, effectively decoupling their reasoning from their actions and degrading the CoT’s utility for monitoring.

Some motivations for using CoT pressures may be satisfied via other means. For instance, instead of pressuring chains-of-thought to be harmless and directly showing them to users, users could be shown a summarized or paraphrased version generated by a separate model. This would allow the CoT to remain unrestricted and still give users realtime insight into the agent’s reasoning. Only surfacing summarized CoTs likely also reduces the risk a model can be easily distilled; however, for the purposes of this work, we are primarily concerned with obfuscation risks. If CoT pressures are used to improve agent capabilities or alignment, there may be no alternative approach to yield the same improvements without degrading monitorability. In the worst case, where the agent learns to fully obscure its intent in its CoT, we ultimately revert to the same model safety conditions that existed prior to the emergence of reasoning models and must rely on monitoring activations, monitoring potentially adversarial CoTs and outputs, or improved alignment methods.

Model actions and outputs continue to grow in complexity, and we’ve demonstrated that even at their current capabilities, models can execute sophisticated reward hacks. In the near future, these models may become either too advanced or too widely used for humans to manually oversee. For this reason, we believe that CoT monitoring may become one of few tools we have for identifying latent misalignments, and it may be required that we avoid strong chain-of-thought optimization pressures entirely to guarantee we do not inadvertently cause CoT obfuscation.

我们的研究结果表明，思维链监控在检测推理模型中的某些类别的失调行为方面非常有效。这种新兴的可监控性为提高模型安全性提供了一条有前景的途径，因为思维链往往能揭示出仅从其行为和输出中难以察觉的代理意图和推理过程。然而，我们也观察到，直接对思维链施加优化压力会削弱其可监控性。虽然适度的思维链优化可以增强能力——尤其是在低优化的情况下——但过度优化会促使模型掩盖其真实意图，从而有效地将其推理与行为脱钩，并降低思维链在监控方面的效用。

使用 CoT 压力的一些动机可以通过其他方式得到满足。例如，不必迫使思维链无害并直接展示给用户，而是可以展示由单独模型生成的摘要或改写版本。这样可以让 CoT 保持不受限制，同时仍能为用户提供对代理推理的实时洞察。仅展示摘要的 CoT 可能也会降低模型被轻易提取的风险；然而，就本研究而言，我们主要关注的是混淆风险。如果使用 CoT 压力来提高代理的能力或对齐度，可能没有其他方法能在不降低可监控性的情况下实现同样的改进。在最坏的情况下，如果代理学会在其 CoT 中完全隐藏其意图，我们最终将回到推理模型出现之前存在的相同模型安全条件，并且必须依靠监控激活、监控潜在的对抗性 CoT 和输出，或者改进对齐方法。

模型的行为和输出在不断变得复杂，我们已经证明，即便以它们目前的能力，模型也能执行复杂的奖励作弊行为。在不久的将来，这些模型可能会变得过于先进或被广泛使用到人类无法手动监管的程度。基于此，我们认为，因果链监控可能会成为我们为数不多的用于识别潜在偏差的工具之一，而且可能需要我们完全避免强因果链优化压力，以确保不会无意中造成因果链的模糊化。