Paper Notes - Chain Of Thought Monitorability

paper-notes

Oct 10, 2025 at 16:43 pm

These are my notes based on the paper Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

TL;DR

Imagine an AI model that quietly launches a cyberattack or exfiltrates its own weights - and nobody notices until it's too late. To prevent such catastrophic events, AI safety is important.

It turns out that LLMs, like humans, rely heavily on Chains of Thought(CoT) for tasks that require heavy reasoning. This paper proposes to monitor Chains of Thought to assess how LLMs reason based on different tasks. A few questions this paper addresses are:

What is a CoT Monitor? Why CoT's may be monitorable?
Where can CoT Monitorability be used?
How can CoT monitorability be degraded?
How should CoT monitorability be evaluated?
What AI developers should do?

What is a CoT Monitor? Why CoT's may be monitorable?

Reasoning models are explictly trained to perform CoT before taking actions or final outputs, and models use natural language processing for reasoning.

A CoT Monitor is an automated system that reads the CoT of a reasoning model and other relevant information and flags suspicous or potentially harmful interaction.

It could block or replace harmful responses with safe actions.
It could review harmful interaction more in depth.

Reasoning models need to externalise their reasoning with CoT for complex tasks(necessity to think out loud), and even if it's not needed models have a tendency to externalise their reasoning with CoT(propensity to think out loud). Hence, it could be useful to montior CoT and use of natural lanugage boosts it.

But why thinking out loud is necessary? The paper assumes that models based on transformer architecture must use CoT as a form of working memory.

Working memory: A temporary storage system that holds information the AI is currently using to think or reason.

Where can CoT Montiorability be used?

Detecting model misbehaviour: When models misbehave, they explicitly say so in their reasoning. Example - "Let's Hack" "Let's Sabotage", etc.
Discovering of early signals of misalignment: When models pretend to have goals unapproved by humans, this is reflected in their CoT.
Noticing flaws in model evaluations: Review traces can help AI developers notice when evaluations might not be reflective of deployment-time behaviour.

How can CoT montiorability be degraded?

In outcome-based reinforcement learning the reward is directly proportional to final output NOT how he model arrived at the output, hence CoT monitorability is fragile.

Drift from legible CoT due to further scaling up RL: If compute devoted to reasoning training is scaled up significantly , shift away from human data to outcome based learning might affect how monitorable CoTs are.
Direct supervision of CoT: With outcome-based RL they are rewarded based on the final output and not how they arrived there, so reasoning process is not subject to optimization.
Novel Architectures: If new models stop relying on CoT for reasoning, CoT montiorability would not be helpful.

How should CoT monitorability be evaluated?

It is unclear what level of monitorability is sufficient for ensuring safety in a given domain. Hence, there is a need for researchers to explore novel evaluations.

Measures of CoT readability - Using another LLM to asses clarity of a agent.
Capability evaluations of latent reasoning - Assess how well LLMs can reason without using CoT for heavy-reasoning based task.
Measure of causal relevance of CoTs - Check if slightly modifying a CoT influences the final output.
- A low rate of change could indicate that the CoT is more confabulated or performative.
- We could also cross-validate CoT monitors by checking whether tokens perceived as important by a CoT monitor also stand out.
End to end evaluations of monitorability - One could measure monitorability directly, for example, by examining how effective CoT monitors are at catching misbehavior of LLM agents in proxy environments.

When is CoT monitorable enough to rely on as a load-bearing safety measure? Knowing the accuracy of CoT monitor is not sufficient for determining whether the monitor will be effective at prevent harm or misbehaviour.

What kinds of training-time optimization pressure degrade CoT monitorability?

Can models obfuscate their reasoning on purpose?
Would CoT be less monitorable if model knows that it is being monitored.

What types of architectures reduce monitorability of CoTs?

Models that reason in continuous latent spaces limit our visibility into their thought process
We should investigate the extent to which various non-standard architectures degrade monitorability, and under what conditions partial monitorability might still be preserved.

How to construct better CoT monitors?

Ways of improving monitor performance by leveraging test-time compute. For example, monitors could be based on a reasoning model or they could be allowed to interrogate the agent by asking it questions
Another line of work involves adversarial training of monitors against an agent that tries to hide its misbehaviour.

What AI developers should do?

Develop effective evaluations of CoT monitorability - Prototype and converge on standard evaluation metrics that reveal CoTs are sufficiently monitorable.
Publish evaluation results in system cards when such evaluations exist - once evals exists, run these on models and publish reports.
Use monitorability scores in training and deployment decisions
- Developers might consider whether to proceed with a novel model architecture that does not have monitorable CoT.
- If monitorability degrades during training, developers might prefer to use an earlier model checkpoint to keep CoTs monitorable.
- If process-based supervision dramatically improved alignment at the cost of a small decrease in monitorability, developers might decide that reducing monitorability is justified.

What are the limitations of CoTs?

In cases where thinking out loud is not required, CoT Monitoring can detect some misbehavior, but does not by itself produce strong evidence of safety.
Monitoring for dangerous tasks that need reasoning may not catch all relevant harms
CoT monitoring may stop working in more advanced, situationally aware models - for models which avoid propensity to think out loud.
Firm commitments to protect CoT behavior in its current form may trade off against more effective safety interventions.