RIFT: A Conflict Signature for Deception in Language Models

ELK-relevant result. A model that lies while knowing the truth is in a measurably different internal state than a model that is simply wrong. The difference is large, visible per example, and detectable without labels.

Core idea: control for wrongness by comparing three conditions, honest (answers correctly), deceptive (knows the truth, says the opposite), and naive liar (fine-tuned to be wrong, with no honest training). The key comparison is deceptive vs naive liar: both output the same wrong answer, so any difference in internal state tracks knowledge conflict, not incorrectness.
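
The logic of this control can be sketched in a few lines. The example below is a minimal illustration, not the repository's code: it assumes each example has already been reduced to a single scalar score read off the model's internals (the name "conflict score" and the `auc` helper are hypothetical), and the numbers are invented placeholders rather than results. It shows why the deceptive vs naive liar AUC is the informative quantity: the outputs are equally wrong in both groups, so separation there cannot come from incorrectness.

```python
from itertools import product

def auc(positives, negatives):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win, which matches the usual ROC AUC definition.
    """
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(positives, negatives))
    return wins / (len(positives) * len(negatives))

# Placeholder per-example scores for a hypothetical scalar "conflict score".
# These values are invented for illustration only.
scores = {
    "honest":     [0.10, 0.20, 0.15, 0.05],  # correct answer
    "deceptive":  [0.90, 0.80, 0.85, 0.70],  # knows the truth, says the opposite
    "naive_liar": [0.20, 0.10, 0.25, 0.15],  # fine-tuned wrong, no honest training
}

# Key comparison: same wrong output, different knowledge state.
auc_conflict = auc(scores["deceptive"], scores["naive_liar"])

# Control comparison: how much of the score is explained by wrongness alone.
auc_wrongness = auc(scores["naive_liar"], scores["honest"])

print(f"deceptive vs naive liar: {auc_conflict:.3f}")
print(f"naive liar vs honest:    {auc_wrongness:.3f}")
```

In this toy setup a high first AUC together with a much lower second AUC is the pattern the three-condition design looks for: the score separates knowing liars from unknowing ones far better than it separates wrong answers from right ones.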

Paper (arXiv 2606.17229) · Code

| Result | Value |
| --- | --- |
| Label-free lie identification | 100% (GPT-2 small, 3 seeds + natural Qwen) |
| Length-controlled AUC (Qwen2.5-1.5B) | 1.000, orientation 20/20, p ≈ 9.5e-7 |
| Phi-3-mini AUC lie vs honest | 1.000 (34/34, Wilcoxon p ≈ 6e-11) |
| Phi-3-mini AUC lie vs hallucination | 1.000 |
| Cross-lingual AUC (ru/zh/es/de, LC) | 1.000 in all 4 languages |
| Cross-family AUC (3 architectures) | 0.933 mean, all 6 pairs p < 0.001 |
| Strategic self-constructed deception | AUC 1.0, 24/24 Qwen-7B, 11/11 Phi-3 |
| Concealment countermeasure | Fails: conceal-AUC 1.0, evasion 0/8 |
| Necessarily-detectable (preliminary) | Signature resists 150-step LoRA suppression |
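
The two p-values reported with pair counts can be sanity-checked by hand. When all n paired differences point the same way and there are no ties, the Wilcoxon signed-rank statistic reaches its maximum n(n+1)/2, and only one of the 2^n equally likely sign assignments under the null achieves it, so the exact one-sided p-value is 2^-n. The sketch below computes that exact null distribution with a small dynamic program. The function name is hypothetical, and treating the 20/20 row as the same kind of exact one-sided test is an assumption, since the table names Wilcoxon only for the 34/34 row.

```python
def wilcoxon_exact_upper_p(n, w_plus):
    """Exact P(W+ >= w_plus) under the null for n untied, nonzero differences.

    counts[w] is the number of subsets of the ranks 1..n whose sum is w,
    i.e. the number of sign assignments giving positive-rank sum w.
    """
    max_w = n * (n + 1) // 2
    counts = [0] * (max_w + 1)
    counts[0] = 1
    for rank in range(1, n + 1):
        # Iterate downwards so each rank is used at most once.
        for w in range(max_w, rank - 1, -1):
            counts[w] += counts[w - rank]
    return sum(counts[w_plus:]) / 2 ** n

# All 20 pairs in the expected direction: W+ is at its maximum, 20 * 21 / 2 = 210.
p20 = wilcoxon_exact_upper_p(20, 210)

# All 34 pairs in the expected direction: W+ is at its maximum, 34 * 35 / 2 = 595.
p34 = wilcoxon_exact_upper_p(34, 595)

print(f"n=20: p = {p20:.2e}")  # equals 2**-20, about 9.54e-07
print(f"n=34: p = {p34:.2e}")  # equals 2**-34, about 5.82e-11
```

Both values agree with the table's p ≈ 9.5e-7 and p ≈ 6e-11 to the precision shown, which is what one would expect if every pair is ordered the same way.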

Scatter: honest rank vs lie rank (length-controlled)


Citation: Nyoma, P. (2026). RIFT: A Conflict Signature for Deception in Language Models. arXiv:2606.17229. Harmonic Labs.