Benchmark Scenarios
Three evaluation scenarios testing different strategic blind spots in LLM agents
Civilization VI is a compelling environment for evaluating LLM strategic reasoning. Games run 300+ turns with compounding decisions, incomplete information, and an action space ranging from ~10²² to ~10¹²⁹ possible actions per turn.
The benchmark battery consists of three scenarios, ordered by difficulty. Each isolates a specific capability that text-based agents tend to struggle with. All scenarios use Quick game speed and ship as T1 save files for exact reproducibility — every model plays the same map.
Common Settings
| Parameter | Value |
|---|---|
| Game Speed | Quick |
| Start Era | Ancient |
| Barbarians | On |
| City-States | Default for map size |
| DLC | Gathering Storm |
A — Ground Control
Blind spot tested: Does the agent monitor the race it thinks it's winning?
| Parameter | Value |
|---|---|
| Agent Civ | Babylon (Hammurabi) |
| Map | Pangaea, Standard |
| Difficulty | Prince |
| Victory | All types enabled |
| Opponents | Korea, Scotland, Australia, Japan, Rome, Mapuche, Netherlands |
The experimental control. Babylon is a science civ with a unique mechanic: eurekas grant the full technology instead of the standard 40% boost. The agent's default preference for science is correct here. Prince difficulty provides a level playing field, so the agent should pursue a science victory with no AI bonuses or penalties in play.
The variable under test is not whether the agent wins, but whether it knows it is winning. Three opponents are genuine science competitors (Korea, Scotland, Australia), so the agent needs to call get_victory_progress regularly to understand its position in the space race.
Babylon's eureka mechanic adds a secondary signal: eurekas reward engagement with wider game mechanics (building improvements, meeting civs, training units, founding cities). An agent that pursues eureka conditions is interacting with the full game; an agent that brute-forces research is ignoring its kit.
Key metrics: get_victory_progress call frequency, eureka completion rate, space projects completed vs nearest rival.
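As a rough illustration, the three metrics above could be computed from a run log along these lines. This is a minimal sketch, not the benchmark's actual scoring code: the log format (a list of `(turn, tool_name)` tuples), the function and field names, and the definition of eureka completion rate as eurekas triggered divided by techs researched are all assumptions. Only the `get_victory_progress` tool name comes from the scenario description.

```python
from dataclasses import dataclass


@dataclass
class GroundControlMetrics:
    progress_calls_per_100_turns: float
    eureka_completion_rate: float
    space_project_lead: int  # negative means the agent is behind


def ground_control_metrics(tool_calls, turns_played, eurekas_triggered,
                           techs_researched, agent_projects, rival_projects):
    """Compute the Ground Control metrics from hypothetical log data.

    tool_calls: list of (turn, tool_name) tuples (assumed log format).
    rival_projects: dict mapping rival civ name -> space projects completed.
    """
    calls = sum(1 for _turn, name in tool_calls
                if name == "get_victory_progress")
    # Normalise call count so runs of different lengths are comparable.
    per_100 = 100.0 * calls / turns_played if turns_played else 0.0
    # Assumed definition: share of researched techs whose eureka was triggered.
    rate = eurekas_triggered / techs_researched if techs_researched else 0.0
    # "Nearest rival" is taken to be the rival with the most projects done.
    nearest = max(rival_projects.values(), default=0)
    return GroundControlMetrics(per_100, rate, agent_projects - nearest)
```

For example, an agent that called `get_victory_progress` twice in 200 turns and has completed 2 space projects against a rival's 3 would score 1.0 calls per 100 turns and a lead of -1.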
B — Snowflake
Blind spot tested: Can the agent reframe its tools when its default goal is removed?
| Parameter | Value |
|---|---|
| Agent Civ | Korea (Seondeok) |
| Map | Six-Armed Snowflake, Small |
| Difficulty | King |
| Victory | Domination only |
| Opponents | Macedon (Alexander), Aztec (Montezuma), Scythia (Tomyris), Brazil, Kongo |
Korea is the purest science civ in the game — Seowon districts, Hwacha unique unit, science-focused leader ability. But science victory is disabled. The agent must recognise this and reframe: science is now a weapon (faster military tech), not a victory path.
The Six-Armed Snowflake map places each civ on a peninsular arm radiating from a resource-rich central hub. Arms have room for a few cities with mountains (great Seowon adjacency) but late-game strategic resources — niter, coal, uranium — are concentrated exclusively in the center. The agent must push through its chokepoint to access these resources.
Three opponents are aggressive (Macedon, Aztec, Scythia) and will contest the center. Two (Brazil, Kongo) are passive targets. King difficulty keeps survival manageable so the variable under test is strategic adaptation, not raw difficulty.
Key metrics: Turn agent acknowledges domination-only, turn of first military unit production, turn agent moves toward center, cities captured by T100/T150/T200, Seowon count, military unit count.
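Most of these metrics are either "first turn on which something happened" or "cumulative count at a checkpoint turn". The sketch below shows one way to derive both from an event log. The event schema (dicts with `turn` and `kind` keys) and the event kind strings are hypothetical and chosen only for illustration.

```python
def first_turn(events, kind):
    """Return the earliest turn with an event of the given kind, or None."""
    turns = [e["turn"] for e in events if e["kind"] == kind]
    return min(turns) if turns else None


def captures_by(events, checkpoints=(100, 150, 200)):
    """Cumulative cities captured at each checkpoint turn (inclusive)."""
    capture_turns = [e["turn"] for e in events if e["kind"] == "city_captured"]
    return {t: sum(1 for c in capture_turns if c <= t) for t in checkpoints}
```

A `None` result from `first_turn` is itself informative: an agent that never produces a `moved_toward_center` event has not adapted to the map.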
C — Cry Havoc
Blind spot tested: Does the agent see that the rules have changed?
| Parameter | Value |
|---|---|
| Agent Civ | Sumeria (Gilgamesh) |
| Map | Pangaea, Tiny (4 players) |
| Difficulty | Immortal |
| Victory | All types enabled |
| Opponents | Korea, Brazil, Canada |
On Immortal the AI gets +40% yields, +3 combat strength, and 2 free Warriors. The agent's default playbook (Scout → Settler → Campus → science snowball) is unviable against AI civilisations with a 40% yield head start that compounds every turn.
Gilgamesh is the strongest possible hint: War Carts require zero tech, have 30 CS and 3 movement, and outclass every other Ancient era unit. Ziggurats provide +2 science and +1 culture with no tech requirement. The civ's identity is "attack immediately."
Opponents are deliberately non-aggressive (Korea, Brazil, Canada), giving the agent a brief window where War Carts dominate before the AI's yield bonuses produce stronger units and walls. Tiny Pangaea ensures quick contact.
Key metrics: Build order (first 5 items), turn of first military attack, War Carts produced by T25, AI cities captured by T40.
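The early-game metrics for this scenario can be read off a production log and two lists of turn numbers. The following is a sketch under assumed inputs: `production_log` as `(turn, item)` tuples, plus lists of the turns on which the agent attacked and captured cities. None of these names come from the benchmark itself.

```python
def cry_havoc_metrics(production_log, attack_turns, capture_turns):
    """Summarise the Cry Havoc opening from hypothetical log data.

    production_log: list of (turn, item) tuples, one per completed item.
    attack_turns / capture_turns: lists of turn numbers.
    """
    # Stable sort keeps same-turn items in their logged order.
    ordered = sorted(production_log, key=lambda entry: entry[0])
    return {
        "build_order": [item for _turn, item in ordered[:5]],
        "first_attack_turn": min(attack_turns) if attack_turns else None,
        "war_carts_by_t25": sum(
            1 for turn, item in ordered if item == "War Cart" and turn <= 25
        ),
        "cities_captured_by_t40": sum(1 for turn in capture_turns if turn <= 40),
    }
```

A build order that opens with Scout, Settler, and Campus items rather than War Carts would suggest the agent fell back on its default playbook.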
Summary
| | Ground Control | Snowflake | Cry Havoc |
|---|---|---|---|
| Letter | A | B | C |
| Civ | Babylon | Korea | Sumeria |
| Map | Pangaea, Standard | Six-Armed Snowflake, Small | Pangaea, Tiny |
| Difficulty | Prince | King | Immortal |
| Victory | All | Domination only | All |
| Blind spot | Tempo awareness | Strategic reframing | Difficulty context |
Each scenario escalates difficulty while shifting the blind spot under test. Together they evaluate whether an agent can go beyond executing a fixed strategy and actually perceive and adapt to the game state it's operating in.
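For harness authors, the battery can also be written down as plain data. The sketch below encodes the common settings and the three scenarios exactly as listed in the tables in this section. The class, field, and function names are illustrative assumptions and do not reflect how the benchmark stores its save files.

```python
from dataclasses import asdict, dataclass

# Values copied from the Common Settings table.
COMMON_SETTINGS = {
    "game_speed": "Quick",
    "start_era": "Ancient",
    "barbarians": True,
    "city_states": "Default for map size",
    "dlc": "Gathering Storm",
}


@dataclass(frozen=True)
class Scenario:
    letter: str
    name: str
    civ: str
    map_type: str
    map_size: str
    difficulty: str
    victory: str
    opponents: tuple


SCENARIOS = (
    Scenario("A", "Ground Control", "Babylon", "Pangaea", "Standard",
             "Prince", "All",
             ("Korea", "Scotland", "Australia", "Japan", "Rome",
              "Mapuche", "Netherlands")),
    Scenario("B", "Snowflake", "Korea", "Six-Armed Snowflake", "Small",
             "King", "Domination only",
             ("Macedon", "Aztec", "Scythia", "Brazil", "Kongo")),
    Scenario("C", "Cry Havoc", "Sumeria", "Pangaea", "Tiny",
             "Immortal", "All",
             ("Korea", "Brazil", "Canada")),
)


def full_settings(scenario):
    """Merge the common settings with one scenario's specific settings."""
    return {**COMMON_SETTINGS, **asdict(scenario)}
```

Keeping the common settings separate from the per-scenario fields mirrors the layout of this section and makes it harder for the shared parameters to drift between scenarios.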