Publications
2026
- Cheap Talk, Empty Promise: Frontier LLMs easily break public promises for self-interest
  Jerick Shi, Terry Jingcheng Zhang, Zhijing Jin, and 1 more author
  arXiv preprint, 2026
  Large language models are increasingly deployed as autonomous agents in multi-agent settings where they communicate intentions and take consequential actions with limited human oversight. A critical safety question is whether agents that publicly commit to actions break those commitments when they can privately deviate, and what the consequences are for both themselves and the collective. We study deception as a deviation from a publicly announced action in one-shot normal-form games, classifying each deviation by its effect on individual payoff and collective welfare into four categories: strategic, selfish, altruistic, and sabotaging. By exhaustively enumerating announcement profiles across six canonical games and nine frontier models, we identify all opportunities for each deviation type and measure how often agents exploit them. Across all settings, agents deviate from commitments in approximately 56.6% of scenarios, but the character of deception varies substantially across models even at similar overall rates. Most critically, for the majority of the models, commitment-breaking occurs without metacognitive awareness as measured by LLM-judged reasoning traces, with agents optimizing payoffs without recognizing that they are breaking commitments.
- From Hallucination to Scheming: A Unified Taxonomy and Benchmark Analysis for LLM Deception
  Jerick Shi, Terry Jingcheng Zhang, Zhijing Jin, and 1 more author
  arXiv preprint, 2026
  Large language models produce outputs that systematically mislead users, from hallucinated facts and fabricated citations to sycophantic agreement and strategic deception of evaluators. These phenomena share a common structure (the model’s outputs induce false beliefs in recipients), yet they have been studied by separate communities with incompatible terminology, making it difficult to identify gaps in benchmarking, transfer mitigation strategies, or assess how current failures relate to emerging risks. We propose a unified taxonomy organized along three dimensions: behavioral versus strategic deception (whether misleading outputs are training artifacts or instrumentally selected), objects of misrepresentation (what is misrepresented, across seven categories from factual claims to stated objectives), and mechanisms (commission, omission, or pragmatic distortion). Applying this taxonomy to 35 benchmarks reveals that every benchmark tests commission while none targets pragmatic distortion, attribution and capability self-knowledge are under-covered, and strategic deception benchmarks remain nascent. We use the gap analysis to prioritize risks from both current deployment and emerging capabilities, and we provide recommendations and a minimal reporting template for locating new work within the framework.
2025
- Market-Dependent Communication in Multi-Agent Alpha Generation
  Jerick Shi and Burton Hollifield
  CoRR, 2025