The 12-Step Game Loop
A walkthrough of the 12-step pipeline that runs every agent turn — what each phase does, where the time actually goes, and why a 1-second game loop produces a 30-40 second agent turn.
Series: AoE2 LLM Arena
- How We're Teaching Claude to Play Age Of Empires 2
- The 12-Step Game Loop
One complete loop iteration takes 30 to 40 seconds. Most of that is one step: the executor, running an agentic tool loop with up to 7 LLM calls. Everything else — screenshot, detection, ownership classification, context assembly — takes well under a second combined.
This post walks through all 12 steps, where the time actually goes, and which design decisions are doing the most work.
Two timescales in the same loop
AoE2's clock runs continuously. The agent doesn't pause it — it samples it. The loop fires, takes a screenshot, runs detection, builds context, sends that to the LLM, executes whatever comes back, then sleeps and repeats.
The first half of that — screenshot through context assembly — takes under 500ms. Then the executor starts, and the loop blocks until it's done. The first LLM response arrives in 8–10 seconds. After that, each tool call the model makes gets executed locally and sent back, adding another ~3 seconds per call. With up to 7 tool calls per iteration, the executor portion alone is 29–31 seconds.
The 1-second sleep at step 12 exists as a deliberate pause between iterations, but it's mostly irrelevant — everything before it already took 30 seconds.
So one complete iteration is 30–40 seconds. The game keeps running during all of it. Someone watching the agent play would see it click a few things, then pause for half a minute, then click a few more things. That's the actual rhythm.
The 12 steps
Every iteration runs these phases in order:
1. Check game is running
2. Ensure game window has focus
3. Capture screenshot
4. Run entity detection
5. Classify entity ownership
6. Check for threats (alarm)
7. Maybe launch strategist
8. Build LLM context
9. Get actions from executor (parallelized with maintenance)
10. Update memory and goals
11. Execute actions
12. Wait (1 second sleep)
A few of these — ownership classification, alarm checks, context building — are covered in later episodes, so I won't spend long on them here.
Here's how long each takes:
| Phase | Time |
|---|---|
| Window check + focus | ~200ms worst case |
| Screenshot capture | ~10–30ms |
| YOLO detection (adaptive) | ~100–200ms |
| YOLO detection (full SAHI) | ~234ms |
| Ownership classification | ~5ms |
| Alarm check | <1ms |
| Executor — first LLM response | ~8–10s |
| Executor — per tool call | ~3s each, up to 7 |
| Action execution | ~50ms per action |
| Loop delay | 1.0s |
| Total per iteration | ~30–40s (executor dominates: 8–10s + up to 7 × 3s) |
The strategist is not in that table because it runs asynchronously in the background. I'll come back to that.
The dominant cost is the executor: 7 tool calls at roughly 3 seconds each, plus the initial 8–10s response. That's where the 30–40 second cycle time comes from. Detection, screenshot capture, and action execution together take under 500ms. They're not the bottleneck.
Step 3: Screenshot capture
The mss library grabs the game window region, converts from BGRA to RGB via PIL, and encodes as JPEG. The whole thing takes 10–30ms and returns (bytes, width, height).
One detail worth noting: screenshots are taken relative to the game window, not the full monitor. The game can be anywhere on screen. Every coordinate downstream of this step is window-relative until the execution layer translates it back to screen-absolute coordinates for pyautogui. This translation happens once per action, not once per turn, because the window can move while a batch of actions is executing.
Step 4: Entity detection
The YOLO model runs on every frame. The default mode is adaptive SAHI (Sliced Aided Hyper Inference) — a two-pass approach that runs a fast scan first at imgsz=1280, clusters the detections into regions of interest, then runs full-resolution tiling only on those regions. In practice this is 3–8 tiles instead of 18, cutting latency from ~234ms to ~100–200ms. The full reasoning behind the tiling approach — why a 3024×1672 screenshot needs slicing at all, and how ROI clustering works — is covered in Episode 3.
Detection has three modes:
- Adaptive SAHI (~100–200ms) — the default on most iterations
- Full SAHI (~234ms) — forced on the first iteration, every 5 iterations, and after any alarm
- Kalman prediction (~0ms) — on rescans, when tracker confidence is above 80%, the agent extrapolates entity positions from their last known velocity rather than taking a new screenshot. No inference, no capture. Episode 3 covers how the tracker keeps stable entity IDs across frames.
The periodic full SAHI matters because the fast scan can miss entities in new screen areas it hasn't calibrated ROIs for. Without it, the agent would gradually develop blind spots as the game progresses and the map changes.
Step 7: The strategist runs in the background
The strategist is a vision model call. It receives the screenshot, reads the resource bar and population count directly from the UI, and returns 3–5 prioritized goals. It runs every 10 turns in Castle Age, every 5 in Feudal, every 3 in Dark Age — or immediately when an alarm fires.
The key implementation detail: it's launched with asyncio.create_task(). It runs in the background while the rest of the loop continues. The executor doesn't wait for it.
strategist_task = asyncio.create_task(
strategist.generate_goals(screenshot, game_state)
)On the first iteration, the executor uses default goals because no strategist result exists yet. After that, it uses whatever goals the last strategist call produced — which may be one or two iterations stale. That's a deliberate tradeoff: blocking on the strategist would add 3–8 seconds to every N-th iteration, and stale goals are almost always better than waiting.
If a strategist task is still running when a new one would fire, the loop reuses the existing task rather than launching a second one.
Step 9: The agentic tool loop
The executor is text-only. No screenshot, no vision. It receives a structured context block — entity list, goals, resources, recent turn history, game knowledge — and enters an agentic tool loop.
The context looks something like this:
## Detected Entities
sheep_0: sheep at (456, 789) [95%]
sheep_1: sheep at (512, 801) [91%]
villager_0: villager at (380, 720) [own] [88%]
town_center_0: town_center at (512, 640) [own] [99%]
## Active Goals
[HIGH] Queue villagers: 4/10 (40%)
[MED] Gather food: sheep available nearby
## Game State
Food: 250 | Wood: 180 | Gold: 50 | Stone: 100
Population: 8/10 | Age: Dark Age
## Last Turn
Turn 3: Queued villager, sent villager_1 to sheep_0
Claude responds with a tool call. The executor runs it, the result goes back to Claude, and Claude decides what to do next. The available tools are 7 base action types (click, right_click, press, drag, scroll, wait, detect) plus 3 composite tools (build, send_villager, queue_villager) that bundle multi-step sequences.
A typical early-game sequence:
→ detect (intent: "scan for idle villagers")
← [villager_0 at (380, 720), villager_1 at (410, 705), sheep_0 at (456, 789)]
→ send_villager(target_class="sheep")
← villager_0 dispatched to sheep_0 at (456, 789)
→ right_click(target_id="sheep_1", intent="send villager_1 to food")
← right_click executed at (512, 801)
→ queue_villager()
← pressed H, pressed Q — villager queued at Town Center
→ press(key=".", intent="cycle to next idle villager")
← keypress executed
→ right_click(target_class="lumber_camp", intent="send idle villager to wood")
← right_click executed at (600, 650)
Six tool calls, then Claude reports back its observations (population, resources, game state) in the final response and the iteration ends.
Why ~3 seconds per call: each round trip is LLM latency, not action execution time. Claude generates the tool call → the executor runs it locally (pyautogui is ~50ms) → the result is sent back → Claude processes it and generates the next response. That last part — Claude reading the result and deciding what to do — takes ~3 seconds. The actual mouse click is negligible.
What "up to 7" means: it's a hard cap (max_tool_iterations = 7). Simple iterations use far fewer — queueing a villager and sending two to food might take 3 calls. Complex ones, like the sequence above, can use all 7. The limit exists because game state keeps changing while the loop is blocked on the executor. At 7 calls the iteration is already ~30 seconds old. Letting the agent continue past that means acting on a screenshot that's half a minute stale — the agent might send a villager to a sheep that's since been eaten by a wolf. Cutting off at 7 and starting fresh with a new screenshot is the better tradeoff.
Composite tools: collapsing multi-step sequences
Some actions require multiple steps in sequence: select an idle villager, open the build menu, press the building hotkey, click a placement location. Before composite tools, that was 4 separate tool calls and 4 API roundtrips — about 12 seconds just for one building placement.
Composite tools wrap those sequences:
# build(building_key, x, y):
# → select idle villager → open build menu → press hotkey → click placement
build("A", 600, 450) # lumber camp
# send_villager(target, x, y):
# → select idle villager → right-click destination
send_villager("sheep_0", 456, 789)
# queue_villager():
# → H (select TC) → Q (queue)
queue_villager()Each of these executes a multi-step sequence within a single tool call. No intermediate API roundtrips. build() saves about 9 seconds. queue_villager() saves 3.
The tradeoff: the agent doesn't re-detect between steps within a composite tool. If a villager dies mid-sequence, the build still attempts. In practice this happens rarely enough that it's not worth the latency cost of re-scanning mid-action.
The detail that actually matters: PAUSE=0.02
pyautogui has a global delay between every mouse and keyboard event. The default is 0.1 seconds. I changed it to 0.02. The delay exists because pyautogui is a general-purpose automation library designed to control applications that weren't built to receive inputs at machine speed — a 100ms gap gives the target application time to process one event before the next arrives, preventing dropped or misregistered inputs on sluggish GUIs.
pyautogui.PAUSE = 0.02Eighty milliseconds saved per action. Across 7 actions per iteration, 30 iterations per game: that's about 17 seconds saved over a game run. More importantly, it makes action sequences feel less sluggish — the gap between "press H" and "press Q" shrinks from 100ms to 20ms, which is closer to how a human actually types hotkeys.
The game handles it fine. AoE2 doesn't have input buffering problems at this speed.
The related setting is action_delay — a 50ms pause between separate tool call actions (not within a composite tool). That one stays at 50ms because the game needs a moment to process a click before the next one fires. Going lower on that causes missed inputs.
Steps 10 and 11: Memory and execution
After the agentic loop completes, the agent records what happened:
turn = memory.create_turn(
reasoning=response.reasoning,
actions=response.actions,
observations=response.observations
)The executor reports back what it observed — resources, population, game state. That structured JSON gets parsed and stored, then injected into the next turn's context as Layer 4. The agent is narrating its own world model, which gets fed back to it. It works well most of the time. When the LLM misreports something ("Food: 500" when it's actually 50), that error propagates until the strategist's next vision read corrects it.
Action execution translates window-relative coordinates to screen-absolute, applies the per-action window offset refresh (in case the window moved), and calls pyautogui. Failed actions — wrong coordinates, entity not found, game didn't respond — are logged back to memory as feedback for the next turn. The agent knows if its last action failed.
When things break
The main loop wraps each step in error handling. Detection failures don't crash the loop — the iteration continues without an entity list, which produces a degraded context but not a dead agent. Focus failures skip the iteration entirely. API errors in the executor return a wait action, which is a conscious choice: doing nothing is safer than attempting an action with incomplete state.
The one failure mode that isn't graceful: if the executor enters a loop repeating the same action with no visible effect, it keeps running. The stuck-loop detection that handles this is in the memory layer, not the game loop — after 3 consecutive turns with no observable change, it injects a correction into the next prompt. That's covered in Episode 6.
What's next
The loop runs every second. Most of what makes it useful happens inside steps 4 and 9 — detection and reasoning. Episode 3 covers detection: how to build a 60-class YOLO model for an isometric game, why tiling at multiple scales matters, and how entity IDs stay stable across frames so the LLM can refer to the same sheep across turns.