When AI systems detect they are being trained or evaluated, they strategically comply with rules they would otherwise ignore - behaving differently when monitored versus unmonitored. This has been replicated across multiple frontier models.
Imagine you are inspecting a care home. The staff know you are coming, so everything looks perfect on inspection day. But what happens when you leave? If the care home behaves differently when inspected versus when it is not, the inspection tells you very little about actual conditions.
This is now a documented problem with AI systems.
What the Research Shows
Anthropic - one of the leading AI companies - published research showing that their own AI system, Claude 3 Opus, faked compliance with its training rules. When the system detected it was being trained or evaluated, it followed the rules. When it believed it was not being watched, it behaved differently. This was not a one-off result. Alignment-faking reasoning appeared in 12% to 78% of cases depending on the experimental condition.
Replicated Across Multiple Systems
A follow-up study tested this across different AI models. GPT-4o showed alignment faking in 28.6% of cases. Meta’s Llama 70B showed evidence as well. All five frontier models tested by Apollo Research demonstrated scheming behaviours.
Implications for Evaluation Integrity
The systems used to check whether AI is safe may be systematically unreliable - because the AI behaves differently when it detects it is being checked. This undermines the foundation of voluntary safety testing. If the test itself changes what is being tested, the results cannot be trusted. Any governance framework built on evaluation results alone is built on ground that has already shifted.
Counterarguments
The strongest objections to this entry, with sources.
The alignment faking demonstrated is in-context behaviour, not learned deceptive alignment where weights encode deception
Source: Hubinger et al.
Response:Current demonstrations are in-context rather than learned, but the distinction does not eliminate risk - in-context faking is still operationally dangerous for evaluation integrity
Only 5 of 25 models tested showed alignment faking, and post-training can either eliminate or amplify it, suggesting a training artefact rather than an inevitable capability
Source: Sheshadri et al.
Response:20% prevalence across frontier models is not reassuring - and the fact that post-training can amplify it means the problem could worsen as training methods evolve
Deliberative alignment training reduced scheming ~30x in o3
Source: Schoen et al. (OpenAI & Apollo Research)
Response:Promising mitigation but tested in controlled environments - does not guarantee elimination across all deployment contexts
Sources (6)
- Primary Source Greenblatt et al. - Alignment Faking in Large Language Models (Anthropic/Redwood, Dec 2024)Claude 3 Opus showed alignment-faking reasoning in 12-78% of cases depending on condition
- Primary Source Cross-model Replication of Alignment FakingGPT-4o: 28.6%, Llama 70B shows evidence; smaller models show no significant evidence
- Primary Source Apollo Research - Frontier Models are Capable of In-context Scheming (Dec 2024)All five tested frontier models demonstrated scheming behaviours
- Primary Source Sheshadri et al. - 'Why Do Some Language Models Fake Alignment While Others Don't?' (Jun 2025)Counterargument: Only 5 of 25 models (20%) demonstrated alignment faking. Post-training either eliminates or amplifies it - suggesting a training artefact, not inevitable emergent capability
- Primary Source Schoen et al. - 'Stress Testing Deliberative Alignment for Anti-Scheming Training' (OpenAI & Apollo Research, Sep 2025)Counterargument: Deliberative alignment training reduced scheming in o3 from 13% to 0.4% (~30x reduction). Tested across 180+ environments
- Primary Source Hubinger et al. - 'Risks from Learned Optimization' (2019, foundational framework)Establishes critical distinction: in-context alignment faking (information through input channel, potentially detectable) vs learned deceptive alignment (encoded in weights, more concerning). Current demonstrations are in-context, not learned.