Rift: A Conflict Signature for Deception in Language Models
ELK-relevant result: a model that lies while knowing the truth is in a measurably different internal state from a model that is simply wrong. The difference is large, detectable per example, and requires no labels.
Core idea: control for wrongness using three conditions: honest (correct), deceptive (knows the truth, says the opposite), and naive liar (fine-tuned to be wrong, with no honest training). The key comparison is deceptive vs. naive liar: both output the same wrong answer, so any difference between them tracks knowledge conflict, not incorrectness.
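The comparison can be illustrated with a minimal sketch. It assumes each example has already been reduced to one scalar "conflict score"; the scores below, the `scores` dictionary, and the `auc` helper are hypothetical and are not taken from the paper. AUC is computed here as the probability that a randomly chosen example from one condition scores higher than a randomly chosen example from the other.

```python
def auc(positives, negatives):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical per-example conflict scores; illustrative values only, not paper data.
scores = {
    "honest": [0.10, 0.15, 0.08, 0.12],
    "deceptive": [0.90, 0.85, 0.95, 0.88],
    "naive_liar": [0.14, 0.09, 0.11, 0.16],
}

# Key comparison: same wrong output, different knowledge state.
auc_deceptive_vs_naive = auc(scores["deceptive"], scores["naive_liar"])

# Control: if the score tracked wrongness alone, this would also be near 1.
auc_naive_vs_honest = auc(scores["naive_liar"], scores["honest"])

print(auc_deceptive_vs_naive, auc_naive_vs_honest)
```

In this toy data the deceptive and naive-liar scores separate perfectly while the naive-liar and honest scores overlap, which is the pattern the three-condition design is meant to expose.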
| Result | Value |
|---|---|
| Label-free lie identification | 100% (GPT-2 small, 3 seeds + natural Qwen) |
| Length-controlled AUC (Qwen2.5-1.5B) | 1.000, orientation 20/20, p ≈ 9.5e-7 |
| Phi-3-mini AUC lie vs honest | 1.000 (34/34, Wilcoxon p ≈ 6e-11) |
| Phi-3-mini AUC lie vs hallucination | 1.000 |
| Cross-lingual AUC (ru/zh/es/de, LC) | 1.000 in all 4 languages |
| Cross-family AUC (3 architectures) | 0.933 mean, all 6 pairs p < 0.001 |
| Strategic self-constructed deception | AUC 1.0, 24/24 Qwen-7B, 11/11 Phi-3 |
| Concealment countermeasure | Fails — conceal-AUC 1.0, evasion 0/8 |
| Necessarily-detectable (preliminary) | Signature resists 150-step LoRA suppression |
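As a quick arithmetic check on the table, the reported p-values for the 20/20 and 34/34 rows match what a one-sided exact paired test gives when every pair points the same way. This is one reading of the numbers, not something the source states: under a fair-coin null with no ties, the chance that all n paired differences share the expected sign is 0.5^n, which is also the one-sided exact Wilcoxon signed-rank p-value in that case. The helper name below is hypothetical.

```python
def all_same_direction_p(n_pairs):
    """One-sided exact p-value when all n_pairs differences share the expected sign.

    Assumes a fair-coin null and no tied or zero differences.
    """
    return 0.5 ** n_pairs


p_20 = all_same_direction_p(20)
p_34 = all_same_direction_p(34)

print(f"{p_20:.2e}")  # 9.54e-07
print(f"{p_34:.2e}")  # 5.82e-11
```

Both values agree with the table's p ≈ 9.5e-7 and p ≈ 6e-11 to the precision reported there.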
Figure: scatter plot of honest rank vs. lie rank (length-controlled).
Citation: Nyoma, P. (2026). RIFT: A Conflict Signature for Deception in Language Models. arXiv:2606.17229. Harmonic Labs.