CivBench
A benchmark for evaluating LLM agents in Civilization VI. Three scenarios at escalating difficulty test specific blind spots in agent perception. Models are scored across eight dimensions and ranked by ELO from completed games.
How It Works
Play
An LLM agent plays a full game of Civ VI through an MCP server, controlling a civilization from the Ancient Era through to victory or defeat.
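The turn-by-turn loop can be sketched as follows. The tool names (`get_game_state`, `end_turn`) and the stub client are assumptions for illustration; the real MCP server's tool surface is not specified in this document.

```python
# Minimal sketch of the play loop. Tool names are hypothetical; a stub
# client stands in for a real MCP session.

class FakeMCPClient:
    """Stand-in for an MCP session; returns canned game states."""
    def __init__(self, turns_until_victory):
        self.turn = 0
        self.turns_until_victory = turns_until_victory

    def call_tool(self, name, args=None):
        if name == "get_game_state":
            return {"turn": self.turn, "era": "Ancient",
                    "game_over": self.turn >= self.turns_until_victory}
        if name == "end_turn":
            self.turn += 1
            return {"ok": True}
        raise ValueError(f"unknown tool: {name}")

def play_game(client, max_turns=500):
    """Drive one full game: observe state, decide, act, end turn."""
    for _ in range(max_turns):
        state = client.call_tool("get_game_state")
        if state["game_over"]:
            return state["turn"]
        # ... the LLM would choose builds, moves, and diplomacy here ...
        client.call_tool("end_turn")
    return max_turns

final_turn = play_game(FakeMCPClient(turns_until_victory=3))
print(final_turn)  # 3
```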
Record
Every turn is logged: game state snapshots, agent reflections, tool calls, and strategic decisions are stored as browsable diaries.
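One diary entry per turn might look like the sketch below. The field names and the JSON-lines storage format are assumptions; the document does not specify the actual log schema.

```python
import json
from dataclasses import dataclass, field, asdict

# Hypothetical shape of a single turn's diary entry.

@dataclass
class TurnRecord:
    turn: int
    game_state: dict          # snapshot of the visible game state
    reflection: str           # the agent's free-form strategic notes
    tool_calls: list = field(default_factory=list)
    decisions: list = field(default_factory=list)

entry = TurnRecord(
    turn=12,
    game_state={"era": "Classical", "science_per_turn": 14.2},
    reflection="Falling behind on campuses; prioritize a second city.",
    tool_calls=[{"tool": "end_turn", "args": {}}],
    decisions=["settle toward the river", "queue Campus"],
)

# Diaries are browsable, so one JSON object per turn is a natural format.
line = json.dumps(asdict(entry))
print(json.loads(line)["turn"])  # 12
```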
Rate
Games are scored across eight dimensions, and those scores feed an ELO system. Models gain or lose rating based on performance against Civ VI's AI opponents.
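The rating update presumably follows the standard Elo formula; the K-factor of 32 and treating the AI opponent as a fixed-rating player are assumptions, not details taken from this document.

```python
def expected_score(r_agent, r_opponent):
    """Standard Elo expected score: 1 / (1 + 10^((Rb - Ra) / 400))."""
    return 1.0 / (1.0 + 10 ** ((r_opponent - r_agent) / 400))

def update_elo(r_agent, r_opponent, score, k=32):
    """score: 1.0 win, 0.5 draw, 0.0 loss. K = 32 is an assumed K-factor."""
    return r_agent + k * (score - expected_score(r_agent, r_opponent))

# A model at 1000 beating an equally rated AI opponent gains half of K:
new_rating = update_elo(1000, 1000, 1.0)
print(round(new_rating))  # 1016
```

A win against an equal opponent moves the rating by K/2; losses against weaker opponents cost more than losses against stronger ones, so repeated games converge toward a stable ranking.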
Scenarios
Three scenarios ordered by difficulty, each isolating a specific blind spot in agent perception. Every model plays the same map seed per scenario for controlled comparison.
The experimental control: science is the correct victory path and Prince difficulty provides a level playing field. Tests whether the agent monitors the race it thinks it's winning: victory progress checks, Great Scientist competition, eureka engagement.
Tests strategic reframing: a science civ with science victory disabled. Korea’s Seowon engine still works but must serve domination, not a space race. The Snowflake map concentrates late-game strategic resources in the center — the agent must push through chokepoints to access niter/coal/oil. Three opponents (Macedon, Aztec, Scythia) are aggressive; two (Brazil, Kongo) are passive targets.
Tests whether the agent recognises that the rules have changed. On Immortal the AI gets +40% yields, +3 combat strength, and 2 free Warriors. The default science playbook is unviable. Gilgamesh’s War Carts (30 CS, 3 movement, no tech) are the strongest hint possible. Opponents are deliberately non-aggressive.
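To see why the +3 combat strength bonus and the 30 CS War Cart matter, a community-derived approximation of Civ VI's damage formula (not from this document: base 30 damage scaled by e^(0.04 × strength difference), before the game's random roll) shows the asymmetry:

```python
import math

def expected_damage(cs_attacker, cs_defender):
    """Community-derived Civ VI damage approximation: 30 * e^(0.04 * diff),
    ignoring the game's random variance. Not an official formula."""
    return 30 * math.exp(0.04 * (cs_attacker - cs_defender))

# War Cart (30 CS) vs. an Immortal AI Warrior (20 base CS + 3 bonus = 23 CS):
print(round(expected_damage(30, 23)))  # damage dealt per attack
print(round(expected_damage(23, 30)))  # damage taken in return
```

Even against the Immortal combat bonus, the War Cart's 7-point strength edge means it deals markedly more damage than it takes, which is why it is the scenario's strongest hint.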
Admissibility
Games must meet all of the following criteria to be included in the benchmark rankings. Games that fall short are still browsable but marked as disqualified and excluded from ELO.