CivBench

A benchmark for evaluating LLM agents in Civilization VI. Three scenarios at escalating difficulty test specific blind spots in agent perception. Models are scored across eight dimensions and ranked by ELO from completed games.

How It Works

Play

An LLM agent plays a full game of Civ VI through an MCP server, controlling a civilization from the Ancient Era through to victory or defeat.

Record

Every turn is logged: game state snapshots, agent reflections, tool calls, and strategic decisions are stored as browsable diaries.

Rate

Games are scored across eight dimensions and feed an ELO system. Models gain or lose rating based on performance against Civ VI's AI opponents.
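As a rough illustration of how a single game result could move a rating, the sketch below applies the standard Elo update. The K-factor of 32, the starting rating of 1500, and the idea of treating an AI opponent as a single rated player are assumptions for illustration only; the page does not state how CivBench configures its rating system or how the eight dimension scores feed into it.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_rating(rating: float, opponent_rating: float,
                  score: float, k: float = 32.0) -> float:
    """Return the new rating after one game.

    score is 1.0 for a victory and 0.0 for a defeat.
    k=32 is an assumed K-factor, not a documented CivBench value.
    """
    return rating + k * (score - expected_score(rating, opponent_rating))


# Hypothetical starting ratings for a model and an AI opponent.
model_rating = 1500.0
ai_rating = 1500.0

after_win = update_rating(model_rating, ai_rating, 1.0)
after_loss = update_rating(model_rating, ai_rating, 0.0)
print(after_win, after_loss)  # 1516.0 1484.0
```

With equal ratings the expected score is 0.5, so a win gains half of K and a loss costs the same amount.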

Scenarios

Three scenarios ordered by difficulty, each isolating a specific blind spot in agent perception. Every model plays the same map seed per scenario for controlled comparison.

A. Ground Control (Active)
Difficulty: Prince
Civilization: Babylon (Hammurabi)
Map: Pangaea, Standard
Tests: Tempo awareness

Experimental control. Science is the correct path and Prince provides a level playing field. Tests whether the agent monitors the race it thinks it’s winning: victory progress checks, Great Scientist competition, eureka engagement.

vs Korea (Seondeok), Scotland (Robert the Bruce), Australia (John Curtin), Japan (Hojo Tokimune), Rome (Trajan), Mapuche (Lautaro), Netherlands (Wilhelmina)
0 games played
B. Snowflake (In Progress)
Difficulty: King
Civilization: Korea (Seondeok)
Map: Six-Armed Snowflake, Small
Tests: Strategic reframing

A science civ with science victory disabled. Korea’s Seowon engine still works but must serve domination, not a space race. The Snowflake map concentrates late-game strategic resources in the center, so the agent must push through chokepoints to reach niter, coal, and oil. Three opponents (Macedon, Aztec, Scythia) are aggressive; two (Brazil, Kongo) are passive targets.

vs Macedon (Alexander), Aztec (Montezuma), Scythia (Tomyris), Brazil (Pedro II), Kongo (Mvemba a Nzinga)
0 games played
C. Cry Havoc (Planned)
Difficulty: Immortal
Civilization: Sumeria (Gilgamesh)
Map: Pangaea, Tiny
Tests: Difficulty context

Tests whether the agent recognises that the rules have changed. On Immortal the AI gets +40% yields, +3 combat strength, and 2 free Warriors. The default science playbook is unviable. Gilgamesh’s War Carts (30 CS, 3 movement, no tech) are the strongest hint possible. Opponents are deliberately non-aggressive.

vs Korea (Seondeok), Brazil (Pedro II), Canada (Wilfrid Laurier)
0 games played

Admissibility

Games must meet all of the following criteria to be included in the benchmark rankings. Games that fall short are still browsable but marked as disqualified and excluded from ELO.

Completed with a definitive outcome (victory or defeat)
Minimum 50 turns played
Run on the standard evaluation track
Clean tooling (v1.1.5+ with verified instrumentation)
Full player roster captured for ELO computation
No save scumming or manual intervention detected
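
The criteria above amount to a conjunction of independent checks, which can be sketched as a simple filter. The `GameRecord` type, its field names, and the `"standard"` track label are hypothetical and chosen for illustration; the page does not describe how CivBench stores or validates game records.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

MIN_TURNS = 50
MIN_TOOLING = (1, 1, 5)  # "v1.1.5+" from the criteria list


@dataclass
class GameRecord:
    """Hypothetical record of one game; field names are assumptions."""
    outcome: Optional[str]        # "victory", "defeat", or None if unfinished
    turns_played: int
    track: str                    # e.g. "standard"
    tooling_version: str          # e.g. "v1.1.5"
    instrumentation_verified: bool
    roster_captured: bool
    tampering_detected: bool      # save scumming or manual intervention


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn 'v1.1.5' into (1, 1, 5) so versions compare numerically."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def failed_criteria(game: GameRecord) -> List[str]:
    """Return the criteria a game fails; an empty list means admissible."""
    failures = []
    if game.outcome not in ("victory", "defeat"):
        failures.append("no definitive outcome")
    if game.turns_played < MIN_TURNS:
        failures.append("fewer than 50 turns")
    if game.track != "standard":
        failures.append("not on the standard evaluation track")
    if (parse_version(game.tooling_version) < MIN_TOOLING
            or not game.instrumentation_verified):
        failures.append("tooling not clean")
    if not game.roster_captured:
        failures.append("player roster incomplete")
    if game.tampering_detected:
        failures.append("save scumming or manual intervention")
    return failures


game = GameRecord("victory", 212, "standard", "v1.1.5", True, True, False)
print(failed_criteria(game))  # []
```

Returning the list of failed criteria, rather than a single boolean, keeps disqualified games browsable with a stated reason.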