Teaching the Game to Remember

The recording system started as an end-of-day testing tool. It became a feature. Then it became the foundation for an AI that plays the game. Fifteen feature commits, seven determinism bug fixes, and the first serious use of git worktrees.

Eyal Harush · 13 min read

This post is about the day I taught Geo Climber to record itself and play back those recordings byte-for-byte deterministically. Fifteen commits of new feature code, seven fix commits chasing down determinism bugs, all on a feature branch called feature/move-recorder, all shipped in a single day.

It's also the post where I finally started using git worktrees for real, where I upgraded my Claude plan from 90 EUR to 180 EUR because the new spec-first workflow was worth it, and where the idea of training an ML model to play the game first crystallized.

The origin: an end-of-day testing problem

The recording system didn't start as a feature. It started as a testing problem.

Every day I was making changes to the game logic. Every day I needed some way to verify that my changes hadn't broken the feel of the game. "Feel" is notoriously hard to test. You can unit test a physics function in isolation (the jump equation, the gravity integration, the platform collision detection). But you can't unit test "does this still feel like a 60fps arcade climber where the combo chains are satisfying." That's a full-system integration property.

The crude way to test feel is to play the game. I'd play for a minute or two at end of day, check that nothing was obviously broken, and ship. The problem: my playing skill is inconsistent. Some days I'd do great and think the game felt great. Other days I'd bail on combo chains early and think the tuning was off. The variance in my performance was bigger than the variance in the game I was actually testing.

What I wanted was the ability to capture a reference run (me playing well, combo chains working, good floor count) and then replay that run after every change to see if anything had shifted. Same inputs, same seed, same starting conditions. If the output diverged, I'd know my changes had affected something.

That's the problem that needed solving. A recording and replay system for the game, primarily for regression testing.

Then it became a feature

The moment I started designing the recording format, I realized that if the recordings were deterministic (same inputs + same seed produce exactly the same game state every time) then the recordings were valuable to players, not just to me. "Watch your best run" is a feature players actually want in arcade games. Competitive replays. Ghost races. Share-a-cool-moment screenshots with a permalink that plays back the full run.

The testing tool and the player-facing feature are the same thing. Build it once, get both benefits.

And it got better. With deterministic replays + a well-defined seed + typed inputs + typed outputs (combos, floors, scores per tick), the recording system was the foundation for something bigger: an ML model that plays the game. If I could record gameplay and replay it deterministically, I could train a model on recorded runs, or let a model generate runs, or let a model explore the game's state space looking for high-scoring strategies. The recording system was step one of an AI player.

This progression (testing tool, then feature, then ML training data) happened in the space of maybe 20 minutes of design thinking. I'd been building the recording system to solve a testing problem, and by the end of the design session it was clear the same system opened up three orders of magnitude more value than I'd originally scoped.

Same itch as day 1. Much bigger challenge. The itch is "could I really just do it using Claude?" On day 1 the question was "could I port a C++ game to iOS with no Swift experience?" On day 18 the question was "could I build a deterministic recording + replay + ML training substrate with no prior experience in any of it?" First time it turned out great. Why not continue?

Worktrees: a very old pain, finally solved

Day 18 was the first day I used git worktrees seriously on this project.

I'd known git worktrees existed for years. I'd used them a couple of times on side projects. I'd never adopted them in my main workflow because the alternative (git stash, branch switching, stash pop) was familiar and worked well enough.

Here's the thing. git stash + branch switching has a very specific pain point: if you're in the middle of a multi-file change and need to check something on another branch, you lose your mental context every time. Stash, switch branch, do the thing, switch back, pop stash, reload your mental model of where you were. Every round trip costs you thirty seconds of real time and two minutes of "what was I doing again" recovery.

Worktrees eliminate this. Each worktree is a separate directory with its own working tree, checked out to its own branch. You can have three branches simultaneously present on your filesystem, each in its own directory, each with its own unstashed work in progress. No stash needed. No branch switching. No recovery time. Just cd to a different directory.

On day 18, the recording system work was big enough that I wanted an isolated branch (feature/move-recorder), but I also wanted to keep the main branch working so I could cherry-pick fixes if something came up. Worktrees were the obvious answer. I created .worktrees/move-recorder/, worked there for the day, and kept the main branch workspace clean.
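The commands involved are minimal. Here's a sketch of that setup using the branch and directory names from this post (the demo-repo scaffolding at the top exists only so the snippet runs anywhere; the real project repo already exists):

```shell
# Demo repo setup so this snippet is self-contained; skip in a real project.
cd "$(mktemp -d)" && git init -q . && \
  git -c user.email=dev@example.com -c user.name=dev commit -q --allow-empty -m init

# Create a worktree at .worktrees/move-recorder, checked out to a new branch:
git worktree add -q -b feature/move-recorder .worktrees/move-recorder

# Each worktree is a full, independent checkout; list them all:
git worktree list

# Work happens inside the worktree directory; the main checkout stays untouched:
ls .worktrees/move-recorder

# Once the feature branch is merged, clean up:
git worktree remove .worktrees/move-recorder
```

No stash, no branch switching: moving between branches is just `cd`.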

This solved a very old pain of mine. I now use worktrees for every non-trivial feature. When the blog series infrastructure was being built, I built it in a worktree. Worktrees pair incredibly well with subagent-driven development. Each worktree can host a parallel Claude session with its own context, and the main session stays out of the way.

If you're using git stash for anything that lasts more than a few minutes, try worktrees. It's a 15-minute learning curve for a lifetime of no-stash-no-switch workflow.

The spec-first workflow, paid for in Claude Max upgrades

Day 18 was also the day the spec-first workflow proved its value clearly enough that I immediately upgraded my Claude plan from 90 EUR to 180 EUR Claude Max.

Here's the workflow. For any non-trivial feature:

  1. Brainstorm (superpowers brainstorming skill): conversation to clarify intent, requirements, constraints, approaches. Produces a design spec saved to docs/superpowers/specs/.
  2. Plan (superpowers writing-plans skill): turn the spec into a task-by-task implementation plan. Saved to docs/superpowers/plans/.
  3. Implement (subagent-driven development or executing-plans): execute the plan task-by-task. Each task is a fresh subagent with its own context.
  4. Ship (the /ship skill from earlier in the series).

The recording system was built using this workflow. The spec was written first (docs/superpowers/specs/2026-04-03-move-recorder-design.md), then the implementation plan, then the 15 feature commits + 7 fix commits + final polish.

This workflow is much more expensive in tokens than the "just code" workflow. You pay for the brainstorming conversation, the plan writing, the task decomposition, the per-task subagent dispatches, the two-stage reviews (spec compliance + code quality), and the shipping pipeline. A feature that would have cost maybe 50k tokens in the old workflow costs 300k tokens in the new workflow.

And it's worth every token. Here's why.

The spec-first workflow produces stable results. Features built this way don't need to be rewritten. The spec catches design gaps before they become code. The plan catches implementation gaps before you start coding. The per-task subagents each do one focused thing well. The two-stage reviews catch bugs before they ship. The result is code that stays shipped.

Compared to "just code," where you have to debug, refactor, and rewrite things you shipped three days ago because you discovered a design gap, the token cost of spec-first is actually cheaper by the time the feature is stable. You pay more upfront and less in maintenance.

I noticed this pattern on day 18 specifically because the recording system demanded it. Determinism bugs are brutal. They surface at specific tick counts, they depend on initialization order, they only show up under certain replay conditions. You can't casually debug a determinism bug. You need a spec that calls out every state-carrying subsystem, a plan that implements them one at a time with testability built in, and reviews that verify the determinism properties hold.

A "just code" approach to the recording system would have produced seven more fix commits than the spec-first approach did, and probably would have required a full rewrite mid-stream. The 180 EUR plan upgrade paid for itself in the first week.

Pay for more tokens if the workflow that uses them is more stable. It's an investment in output quality. The per-token cost goes up. The total cost goes down.

The binary format

The implementation details of the recording system are interesting but technical. I'll keep this section short and refer the deeply curious to the spec in the repo.

The recording format is binary, not JSON. Every input event is a single 16-bit integer: 9 boolean inputs packed as bits (left, right, jump, pause, menu, etc.) + 7 reserved bits for future inputs. A recording is a header (seed, difficulty, character, viewport height, version, timestamp, device model) followed by a sequence of delta events, only recording changes in input state rather than the full state at every tick.
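In rough Python pseudocode terms (the real code is presumably Swift, and the exact bit order plus the input names beyond the five listed above are my guesses), the packing and delta encoding look something like this:

```python
# Hypothetical input names and bit order. The post names only five of the
# nine boolean inputs; the last four here are invented for illustration.
INPUTS = ["left", "right", "jump", "pause", "menu",
          "dash", "up", "down", "restart"]  # 9 bits used, 7 reserved

def pack(state: dict) -> int:
    """Pack boolean inputs into the low bits of a 16-bit word."""
    word = 0
    for bit, name in enumerate(INPUTS):
        if state.get(name):
            word |= 1 << bit
    return word

def unpack(word: int) -> dict:
    return {name: bool((word >> bit) & 1) for bit, name in enumerate(INPUTS)}

def delta_encode(frames: list[int]) -> list[tuple[int, int]]:
    """Emit a (tick, word) event only when the packed input state changes."""
    events, prev = [], None
    for tick, word in enumerate(frames):
        if word != prev:
            events.append((tick, word))
            prev = word
    return events

w = pack({"left": True, "jump": True})
assert w == 0b101 and unpack(w)["jump"]
assert delta_encode([0, 0, w, w, 0]) == [(0, 0), (2, w), (4, 0)]
```

Delta encoding is what keeps recordings tiny: on most ticks nothing changes, so most ticks cost zero bytes.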

Typical 10-minute run recording: ~30KB. That's small enough to ship with the app, small enough to sync to the server, small enough to embed in a leaderboard entry.

The playback system reads the recording, feeds the inputs back to the game engine tick-by-tick, and produces the same game state the original run produced. Determinism is enforced by the pure GameRuntime module (zero side effects) plus a seedable RNG that's checkpointed in the recording header.
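A toy model of that replay property (this is not the real GameRuntime; the step logic and the bit-2-means-jump convention are invented): the step function is pure, and the only randomness comes from a generator seeded with the value stored in the recording header.

```python
import random

def run(seed: int, inputs: list[int]) -> list[float]:
    """Replay packed input words through a pure, seeded step function."""
    rng = random.Random(seed)        # seed comes from the recording header
    y, trace = 0.0, []
    for word in inputs:
        jump = bool(word & 0b100)    # hypothetical bit assignment for jump
        # All state changes derive from (state, input, rng); no wall clock,
        # no device defaults, no hidden globals.
        y += (2.0 if jump else 0.5) + rng.random() * 0.1
        trace.append(y)
    return trace

inputs = [0, 0b100, 0b100, 0]
assert run(7, inputs) == run(7, inputs)  # same seed + inputs -> identical run
```

That assertion at the bottom is the whole contract: record the seed and the inputs, and the run is reproducible forever.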

Plus a small extra: a .gcscript DSL for hand-authoring test recordings. Claude proposed this addition and I shipped it slightly skeptically. I don't know if it'll be useful long-term. Right now it's a quirky side feature that lets me write tiny text files that compile to valid .gcrec recordings for regression testing specific scenarios. Time will tell if I keep it or rip it out.

Seven fix commits: why determinism is hard

The implementation landed as fifteen feature commits (feat(recording): ...) followed by seven fix commits (fix(recording): ...) that chased determinism bugs. The fix commits are the interesting part because they illustrate exactly how subtle determinism is.

Fix 1: codec decode uses header fields. My initial binary decoder hardcoded the version and format assumptions instead of reading them from the header. If the format ever changed, old recordings would silently decode wrong. Fixed by reading the version from the header and branching on it.

Fix 2: store viewportHeight in recordings. Games that run on different screen sizes behave differently. The physics constants are the same, but the camera catch-up behavior depends on viewport height. If I recorded on a 6.1" iPhone and replayed on an iPad, the replay diverged. Fixed by storing the viewport height in the recording header and using the recorded viewport during replay regardless of the device.

Fix 3: verifier replicates exact init sequence. The headless verifier (which replays recordings in a test environment) had a subtly different initialization order than the real game. Specifically, the game loads at a default viewport height of 480, then immediately calls setViewportHeight with the actual value. The verifier was skipping the default-then-set sequence and loading directly with the actual value. Different init order → different RNG state after init → different platform generation. Fixed by replicating the exact init sequence in the verifier.
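This bug class is easy to reproduce in miniature. A contrived sketch (nothing here is the actual init code): if the default-height boot consumes RNG draws before the real viewport is applied, a verifier that skips that step ends up at a different position in the same seeded stream, and generates different platforms.

```python
import random

def init_game(seed: int, viewport: int, replicate_default_boot: bool) -> list[int]:
    rng = random.Random(seed)
    if replicate_default_boot:
        # The real game boots at a default height and generates once,
        # consuming RNG draws, before setViewportHeight arrives.
        _ = [rng.random() for _ in range(3)]
    # Generation with the actual viewport continues from wherever the RNG is.
    return [round(rng.random() * viewport) for _ in range(3)]

# Same seed, same viewport -- but a different init order diverges:
a = init_game(42, 812, replicate_default_boot=True)
b = init_game(42, 812, replicate_default_boot=False)
assert a != b
assert a == init_game(42, 812, replicate_default_boot=True)  # still deterministic
```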

Fix 4: snapshot diagnostics for divergence hunting. After fixes 1-3, I still had recordings that diverged on replay. I couldn't see why. I added a snapshot diagnostic that captures the full game state at every tick and compares the original run's snapshots against the replay's snapshots. The first tick where they diverge tells you which subsystem drifted. This is not itself a fix (it's a debugging tool) but it turned "the recording doesn't replay right" from an unsolvable vagueness into a tractable debugging problem.
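The core of such a diagnostic is tiny. A sketch of the comparison, with the snapshot contents invented for illustration:

```python
def first_divergence(original: list, replay: list):
    """Return (tick, original_snapshot, replay_snapshot) at the first
    tick where the two snapshot streams differ, or None if identical."""
    for tick, (a, b) in enumerate(zip(original, replay)):
        if a != b:
            return tick, a, b
    return None

orig = [{"y": 1.0, "combo": 0}, {"y": 2.0, "combo": 1}, {"y": 3.0, "combo": 2}]
rep  = [{"y": 1.0, "combo": 0}, {"y": 2.0, "combo": 1}, {"y": 3.1, "combo": 2}]
tick, a, b = first_divergence(orig, rep)
assert tick == 2  # points straight at the subsystem that drifted
```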

Fixes 5-7: edge cases in the replay UX. Suppressing the coin toast during replay, skipping run result processing during replay, preserving the recorder across a revive, and fixing a Play Again button interaction that was broken by the replay mode state.

Every one of these fix commits represents a moment of "oh right, that also has to be accounted for." Every state-carrying subsystem in the game has to be audited for determinism. Every non-determinism source (wall clock, device-specific defaults, initialization order) has to be either eliminated or checkpointed in the recording. Determinism is not something you add at the end. It's something you design in from the start. Fortunately, the decision I made in the Metal post on day 3 to keep GameRuntime pure and side-effect-free paid off enormously here. A game logic module that's a pure function from inputs to output structs is already most of the way to deterministic.

Watch Replay ships

At 22:22, after all the determinism bugs were fixed, the Watch Replay button shipped:

feat(recording): add Watch Replay button, deviceModel, storage trim, remove debug code

Players can now tap "Watch Replay" on the game-over screen and see their exact last run play back. Ghost-style, from the same camera angle, with the same inputs being fed back into the same engine. The ghost is indistinguishable from the original run because it is the original run, reconstructed from 30KB of delta events and a seed.

It's a small feature in terms of UI (one button, one view that plays back the recording), but the infrastructure underneath is the foundation for everything downstream. Leaderboard replays (watch the top scorer's actual run). Ghost races (compete against a recorded ghost in real-time). Training data for ML models. All of it sits on top of the recording format and the determinism guarantees.

What this makes possible

And here's the meta-thesis of this post. The recording system exists because I designed GameRuntime as a pure function of inputs on day 3, even though I didn't know the recording system would exist. The decisions you make under pressure early in a project compound in ways you can't predict. Day 3's Metal rewrite wasn't "set up for the recording system." It was "the bridge is killing me, I need a cleaner architecture." But the cleaner architecture made the recording system possible.

And the recording system, in turn, opens up a class of ideas I wasn't ready to pursue until the determinism was solid. Here's the one that's been eating at me since the day the replay worked for the first time.

Can I teach an AI to play this game?

Think about what I now have. A deterministic game runtime. A seedable RNG. Typed inputs, typed outputs. A binary format that records every input event at 100Hz and replays it byte-for-byte identically. Well-defined reward signals baked into every tick: combo length, floor count, time-to-death, score delta. A way to run the game headlessly at whatever speed I want, with no rendering, no audio, no UI.

That's a reinforcement learning environment. That's exactly a reinforcement learning environment. And it fell out for free from building a testing tool.
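To make that shape concrete, here's a hypothetical gym-style wrapper. Everything in it (the class name, the observation, the toy transition and reward) is invented; only the structure comes from the post: a seeded deterministic environment, packed-input-word actions, and per-tick reward signals like combo and floor count.

```python
import random

class ClimberEnv:
    """Hypothetical RL wrapper around a deterministic, headless runtime."""

    def __init__(self, seed: int):
        self.seed = seed
        self.reset()

    def reset(self):
        self.rng = random.Random(self.seed)  # same seed -> same episode
        self.floor, self.combo, self.tick = 0, 0, 0
        return (self.floor, self.combo)

    def step(self, action: int):
        """action is a packed 16-bit input word, as in the recording format."""
        self.tick += 1
        jump = bool(action & 0b100)          # hypothetical jump bit
        if jump and self.rng.random() < 0.8:
            self.floor += 1
            self.combo += 1
        else:
            self.combo = 0
        reward = self.combo + self.floor     # toy reward from tick signals
        done = self.tick >= 1000
        return (self.floor, self.combo), reward, done, {}

env = ClimberEnv(seed=7)
obs = env.reset()
obs, reward, done, info = env.step(0b100)
```

Because the environment is seeded and pure, rollouts are reproducible, which is exactly what makes recorded runs usable as training data.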

Could I train a model to play this game? Not a good one necessarily, but a daring one. A combo-hungry one. A model that plays with a style I like and that I could ship as a ghost opponent for players who don't have real friends to race against. Could I do that without any prior machine learning experience? Using Claude as a pair the same way I used it on day 1 for C++, and day 2 for Swift, and day 3 for Metal?

Same itch as day 1. Much bigger challenge.

I don't know the answer yet. I'm working on it.

This is post 12 of 18 in a series about building Geo Climber with Claude Code. The recording system shipped. Determinism works. The AI player idea is taking up most of my evenings right now. Join the Discord and download Geo Climber on the App Store.

Tags: recording, replay, determinism, binary-format, gcrec, worktrees, spec-first, claude-code