Senior Thesis: Why Multi-Agent Conversations Cannot Fix LLM Forecasting

Evidence from Convergence Analysis

Large language models (LLMs) fail catastrophically at forecasting: they perform 31-78% worse than random guessing because they are systematically overconfident in their predictions. The models express high certainty regardless of accuracy, and the resulting mismatch between confidence and performance threatens their deployment in critical decision-making domains.

We investigate whether multi-agent conversations provide natural calibration, with structured disagreement moderating individual overconfidence. Tests on 483 binary forecasting questions show that conversational and mathematical calibration are functionally equivalent: both inject uncertainty without improving reasoning, and both converge to the same Brier score of 0.25 after optimization, which is the score a constant 50% forecast earns on any binary question.

Mid-sized models (7B-14B) benefit, improving by 28.3% through natural uncertainty moderation, while larger models (32B+) suffer from sophisticated echo chambers in which elaborate arguments amplify overconfidence rather than moderate it.

