Teaching an AI to Play My Game
A web dev with zero ML experience builds a reinforcement learning agent for his iOS game. From porting physics to Python, to cross-validating against real gameplay recordings, to watching the agent discover mechanics nobody taught it.
The previous post ended with a question: can I teach an AI to play this game? I had a deterministic runtime, a seedable RNG, a binary recording format, and zero machine learning experience.
This post is the answer. Over two days I ported the game physics to Python, built a Gymnasium environment, trained a PPO agent through 9 curriculum phases and 8 reward function revisions, watched it discover mechanics I never explicitly rewarded, and hit a feedback loop so addictive I stayed up until 2am tweaking reward coefficients.
"i have zero knowledge about this. if i could somehow train a model that can beat my game that will both: teach me how to do it which is an amazing gain, but also, potentially open up in-game features."
That's me talking to Claude at the start of the session. Same energy as day 1. A TypeScript engineer with no experience in the domain, jumping in anyway.
Why build an AI player at all?
The practical reason: I want PVP in Geo Climber. Ghost races, "Beat the AI" mode. But PVP needs players, and I have maybe ten TestFlight testers. An AI opponent gives me the competitive experience from day 1.
The real reason: I wanted to know if I could do it. Same itch as the C++ to Metal rewrite, same itch as the recording system. Can I really just sit down with Claude and build something I have no prior experience with?
But I didn't want a boring AI. The agent had to play daring: combos, charged jumps, risky multi-floor skips. Not a cautious bot that camps on platforms.
"i want the ML to play daring, not only to reach top floors... to go for combos, try to get the best combos, giving the best scores, in the least amount of time."
The design in 60 seconds
PPO (Proximal Policy Optimization) via Stable-Baselines3. A 37-float observation space covering player position, velocity, camera state, the 10 nearest platforms, timer, combo state, and score. Six discrete actions: idle, left, right, jump, jump+left, jump+right. A curriculum that starts with no timer and progressively adds pressure until safe play is physically impossible.
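The observation and action layout can be sketched roughly like this. The field order and platform encoding are my reconstruction, not the actual code; only the 37-float total, the listed contents, and the six discrete actions come from the design above.

```python
import numpy as np

# One plausible layout for the 37-float observation: 7 scalars plus
# (x, y, width) for each of the 10 nearest platforms (7 + 30 = 37).
# The exact fields and their order are assumptions.
N_PLATFORMS = 10
ACTIONS = ["idle", "left", "right", "jump", "jump_left", "jump_right"]

def build_observation(player_x, vel_x, vel_y, camera_margin,
                      timer_frac, combo, score, platforms):
    obs = [player_x, vel_x, vel_y,    # player position and velocity
           camera_margin,             # distance to the camera kill line
           timer_frac, combo, score]  # timer, combo state, score
    for i in range(N_PLATFORMS):      # 10 nearest platforms, zero-padded
        x, y, w = platforms[i] if i < len(platforms) else (0.0, 0.0, 0.0)
        obs.extend([x, y, w])
    return np.asarray(obs, dtype=np.float32)
```

A fixed-size, zero-padded vector like this is what lets a plain MLP policy consume a variable number of visible platforms.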
One model, three difficulty levels via temperature scaling on the action distribution: T=0.1 for a precise hard AI, T=0.5 for medium, T=1.5 for a sloppy easy AI. Train once, ship three personalities.
And one decision that proved critical: port the game physics to Python rather than bridging Swift and Python at runtime. A pure Python environment means thousands of games per second headlessly, one language to debug, and (this part was unexpected) learning things about my own physics code that I'd missed for weeks.
Porting the physics
The Python port covers every subsystem that matters for gameplay: xorshift64 RNG (bit-for-bit match with Swift), platform generation, player physics with three-tier charged jumps, camera chase, combo scoring, and timer escalation. Everything except rendering and audio.
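For flavor, a xorshift64 step in pure Python looks like this. I'm using Marsaglia's classic 13/7/17 shift triple here; the game's actual constants may differ. The explicit 64-bit masking is what mirrors Swift's fixed-width `UInt64` overflow behavior and makes a bit-for-bit match possible:

```python
MASK64 = (1 << 64) - 1

class Xorshift64:
    """Pure-Python xorshift64 sketch (classic 13/7/17 shifts assumed).
    Masking after each left shift emulates Swift's UInt64 wraparound."""
    def __init__(self, seed):
        # xorshift must never sit at state 0; fallback constant is arbitrary
        self.state = (seed & MASK64) or 0x9E3779B97F4A7C15

    def next(self):
        x = self.state
        x ^= (x << 13) & MASK64
        x ^= x >> 7
        x ^= (x << 17) & MASK64
        self.state = x
        return x
```

Seedable and deterministic: the same seed always yields the same platform sequence in both languages, which is what the whole replay pipeline depends on.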
This part was mechanical. What wasn't mechanical was the question my future self would thank me for asking:
"did you run the ported game and tested it's compatibility with it's swift sister?"
I asked this maybe four hours into the build. The honest answer was no. We'd only tested that the Python env was internally consistent (same seed produces the same game twice). We hadn't compared it against the actual iOS game. This turned out to be the most important question of the entire project.
Real data finds real bugs
The first cross-validation used synthetic inputs: scripted left-right-jump sequences fed into both implementations. That passed. Then I played an actual game on the iOS Simulator and raised the bar: replay REAL recordings, not synthetic inputs. I exported the .gcrec binary file, built a Python decoder, and replayed my actual gameplay through the Python physics.
Short runs passed. A long run failed catastrophically. Python died at floor 5 while Swift had reached floor 421.
The first bug was a contamination issue. When you hit "Play Again" in the iOS game, TimerState.reset() wasn't zeroing currentTick. The game looked reset, but the timer's internal state was polluted from the previous run. This was a real Swift production bug. It had been shipping to players. The recording captured it faithfully because recordings capture everything, including bugs you don't know about.
After fixing the timer bug and getting a clean recording, I played a 111-floor run and replayed it through Python. Floors matched. Score was off by 255. Best combo was 12 in Python, 18 in Swift.
Time for a tick-by-tick diff.
I built a Swift CLI tool (SimulationCLI) that replays recordings headlessly and dumps state at every tick. Ran the same recording through both implementations and compared output line by line. The divergence started at tick 258.
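The comparison itself is trivial once both sides dump one line of state per tick in an identical textual format (something like `tick=258 x=312.40 y=881.25 vx=6.10 ...`; the format here is assumed). A sketch of finding the first divergence:

```python
def first_divergence(swift_lines, python_lines):
    """Return the index of the first differing tick, or None if the
    two state dumps are identical."""
    for i, (a, b) in enumerate(zip(swift_lines, python_lines)):
        if a != b:
            return i
    # If one run ended early, the divergence is at the shorter length
    if len(swift_lines) != len(python_lines):
        return min(len(swift_lines), len(python_lines))
    return None
```

Everything after the first divergent tick is noise, so the only number that matters is where the runs split.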
Bug 1: sprite half-width in collision detection. During a charged jump, the player uses the rotate sprite, which is 42 pixels wide (halfW=21). The Python port hardcoded halfW=14 (the idle sprite width). At tick 258, the player was moving fast horizontally and barely clipping a platform edge. Swift's wider collision box caught the platform. Python's narrower box missed by 2.7 pixels. The player fell, the run diverged, and everything after that was nonsense.
This bug would never show up in a unit test. It only manifests when the player is moving fast horizontally during a charged jump near a platform edge, a scenario that happens constantly in real gameplay but would never appear in synthetic test inputs.
Bug 2: combo bar refill. In Swift, startCombo() is called unconditionally on every combo-advancing jump, refilling the combo bar to 100 ticks each time. In Python, startCombo() was only called on the first jump of a combo sequence. This meant Python combos had a fixed window of ~200 ticks total, while Swift combos got 200 ticks per jump, much longer chains, much higher scores.
Bug 3: the RNG split. This one deserves its own section.
The RNG split
The game loop runs every subsystem every tick at 100Hz: player physics, platform scroll, eye candy particles, scoring, camera. All sequentially, all within the same tick.
The problem was RNG state drift. The Swift game used one random number generator for everything. Eye candy particle initialization calls rng.next() roughly 300 times per frame (100 particles times 3 random values each). This advances the RNG's internal counter by 300. When platform generation asks for its next random number, it gets a completely different result than it would have without the eye candy.
The Python port has no eye candy. Its RNG counter is at a different position. Different random rolls. Different platform widths and positions. Different gameplay.
The fix was splitting the Swift RNG into two independent streams: rng for game logic (platforms, spawning, anything that affects gameplay) and visualRng for eye candy and visual effects (particles, sparkles, anything that's purely cosmetic). Zero performance impact. Same number of operations per tick, just drawn from two independent generators.
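The shape of the split, sketched in Python with `random.Random` standing in for the game's xorshift64 (stream names from the Swift fix; the seed derivation for the visual stream is my invention): cosmetic draws no longer advance the gameplay stream, so particle count can never change platform layout.

```python
import random

class GameRng:
    """Two independent streams seeded from one game seed."""
    def __init__(self, seed):
        self.rng = random.Random(seed)                       # gameplay: platforms, spawning
        self.visual_rng = random.Random(seed ^ 0xDECAFBAD)   # cosmetic: particles, sparkles

def platform_width(g):
    # Any gameplay roll reads only from the gameplay stream
    return g.rng.randint(60, 140)
```

However many particles the renderer spawns, the gameplay stream's counter is untouched, so a headless replay and the full game stay in lockstep.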
This is a real codebase improvement, not just a workaround for the Python port. Any future subsystem that wants randomness needs its own stream. And the bug itself is the kind of thing you can only find by building two implementations and comparing them. If you only have one implementation, there's nothing to drift from.
13,440 ticks, zero divergence
After fixing all three bugs, I played a fresh 2-minute-16-second hard-mode run reaching floor 348 with a 32-combo peak. Python replayed it: floor 348, score 9702, best combo 32. Then I ran the full tick-by-tick diff against the Swift CLI: zero divergence across all 13,440 shared physics ticks.
The Python physics port is bit-for-bit identical to the Swift game engine on real gameplay data.
This was the green light. The environment was ready for training.
The reward shaping marathon
What followed was the most addictive debugging session of the entire project.
I started PPO training on the verified Python environment and immediately built a pygame viewer that renders the agent playing in real time. Auto-reloads model checkpoints every 5 seconds, shows the current phase, seed, difficulty, and horizontal speed. This viewer became the single most important tool of the training session. Every wrong diagnosis came from looking at the reward graph. Every right diagnosis came from watching the agent play.
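The auto-reload part of the viewer is just mtime polling. A sketch, with `load_model` as a placeholder for whatever actually deserializes a checkpoint (e.g. SB3's `PPO.load`), and the injectable clock/`get_mtime` added here for testability:

```python
import os

class CheckpointWatcher:
    """Reload a model file whenever its mtime advances, checked at most
    once per `interval` seconds."""
    def __init__(self, path, load_model, interval=5.0,
                 get_mtime=os.path.getmtime):
        self.path, self.load_model = path, load_model
        self.interval, self.get_mtime = interval, get_mtime
        self.mtime = 0.0
        self.last_check = 0.0
        self.model = None

    def poll(self, now):
        if now - self.last_check < self.interval:
            return self.model            # too soon, reuse current model
        self.last_check = now
        try:
            mtime = self.get_mtime(self.path)
        except OSError:
            return self.model            # trainer hasn't written yet
        if mtime > self.mtime:           # new checkpoint on disk
            self.mtime = mtime
            self.model = self.load_model(self.path)
        return self.model
```

Called once per render frame, this keeps the viewer showing the freshest policy within about five seconds of the trainer saving it.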
Round 1: Agent stalled at -2700 reward, 5,690-tick episodes. It never jumped. The stall penalty was too gentle and the episodes ran forever. The agent had no incentive to do anything because doing nothing wasn't punished hard enough.
Round 2: Added stall termination at 500 ticks, boosted floor reward to +5. Reward climbed to +17 but flatlined. In the viewer, I could see the problem: the agent found floor 2 on seed 5 and camped there. It had discovered that jumping straight up was enough to reach +17 reward, and it stopped exploring.
Round 3: This is where seed selection became critical. I watched seed 5 in the viewer and saw why the agent was lazy: all platforms on that seed cover x=320 (the center of the viewport) for 24 consecutive floors. The agent learned to jump straight up without ever moving horizontally because the seed never required it.
I found seed 136, where floor 2 requires horizontal movement. The next platform is offset to the left. Trained on that seed. The agent immediately started moving side-to-side and aiming for platforms.
"it actually moves side to side now!!!"
Round 4: And then something happened that I didn't design and didn't explicitly reward. I was watching the agent play on the viewer and it started building horizontal speed, then using a charged jump to skip a floor entirely. The two-floor momentum jump. It discovered this mechanic on its own, from the interaction between the speed gradient reward and the physics engine. Nobody told the agent that charged jumps exist. It found them because moving fast felt good (speed reward) and jumping higher reached more floors (floor reward), and the physics engine naturally converts horizontal speed above 5.9 into a charged jump.
"looks like it learned the momentum two floor jump (not combo yet, but major). it prefers that over aiming, but does aim when it's not a possibility"
This is the moment I understood what people mean when they talk about emergent behavior in RL. The two-floor momentum jump is an advanced mechanic that human players usually discover after dozens of games. The agent found it in round 4 of training because the reward structure made it the obvious strategy, even though nobody mentioned it.
Rounds 5-6: Under hard timer pressure, the agent stopped doing momentum jumps. Too risky when the camera is chasing you. So I added exponential floor rewards (floors^2 * 10) to make multi-floor jumps feel dramatically more valuable, plus a speed gradient that gives continuous positive signal for running faster. The agent started doing charged jumps again. Then combos appeared (the agent chained platform landings fast enough to trigger the combo system), but it hesitated to extend the chains. Added a per-jump combo chain bonus that escalates with combo score. The agent started chaining.
"it works! it hesitates to chain them though."
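The reward components from rounds 5-6, sketched as functions. The quadratic floor reward comes straight from the post (floors^2 * 10); the speed-gradient and combo-bonus coefficients are illustrative, not the actual values:

```python
def floor_reward(floors_gained):
    """Exponential floor reward: 1 floor -> 10, a 2-floor skip -> 40,
    making multi-floor momentum jumps dramatically more valuable."""
    return (floors_gained ** 2) * 10

def speed_gradient(horizontal_speed, coeff=0.01):
    """Continuous positive signal for running faster (coeff is a guess)."""
    return coeff * abs(horizontal_speed)

def combo_chain_bonus(combo_score, base=1.0, scale=0.1):
    """Per-jump bonus that escalates with the running combo score,
    nudging the agent to extend chains instead of banking them."""
    return base + scale * combo_score
```

The quadratic shape is the point: two one-floor hops pay 20, one two-floor jump pays 40, so risk is priced in.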
Along the way, I fixed a training-only bug: the stall penalty was killing the agent mid-air during momentum jumps. The agent was building horizontal speed for a charged jump, which looks like "not making upward progress" to the stall detector. Fix: only count stall time when on the ground. Airborne time is progress, not stalling.
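The fix is a one-branch change to the stall counter; names here are illustrative:

```python
STALL_LIMIT = 500  # ticks before the episode is terminated

def update_stall(stall_ticks, on_ground, made_upward_progress):
    if made_upward_progress:
        return 0                 # progress resets the counter
    if not on_ground:
        return stall_ticks       # airborne time is progress, not stalling
    return stall_ticks + 1       # grounded with no progress: count it
```

Freezing (rather than incrementing) the counter while airborne is what lets the agent spend ticks building horizontal speed for a charged jump without being executed mid-flight.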
Watch the agent, not the numbers
I want to emphasize this because it's the single most useful thing I learned about RL training.
The reward graph (ep_rew_mean) told me the agent was improving: reward going up, episode length going down, explained variance climbing. All the metrics said "things are working." But the metrics never told me why the agent was stuck, or what specific behavior was wrong, or which reward component was causing a side effect.
Every breakthrough came from opening the pygame viewer and watching the agent play. Round 2's camping behavior? Invisible in the numbers, obvious in five seconds of watching. The seed 5 problem? The reward graph showed a plateau; the viewer showed an agent jumping straight up because every platform was centered. The airborne stall kills? The graph showed "episodes ending early"; the viewer showed the agent dying mid-charged-jump.
Numbers tell you that something is wrong. Watching tells you what.
The philosophy shift
After 8 reward function versions and 8 training phases, I hit a wall. The agent had skills. It could jump, aim, do charged jumps, chain combos. But it didn't play daring. It played safe. It used its skills when the reward signal said to, and coasted when it didn't. I'd been trying to reward daring play: bonuses for speed, charged jumps, combos, wall bounces. Each bonus worked partially but created side effects. The agent learned to wall-bounce for the bounty instead of climbing. Or it played safe because death was too cheap relative to the combo bonus.
Then something clicked:
"maybe death penalty as a equation of how high you are? eventually camera speed needs to be so fast, that simply jumping like a noob isn't fast enough, and you die. this is what should actually encourage daring play, it's a necessity."
Stop engineering the desire. Create the necessity.
I'd been thinking about this backwards. I didn't need to reward the agent for playing daring. I needed to make the environment so demanding that daring play was the only way to survive.
Three changes:
Death penalty scales with height. Dying at floor 5 costs -12.5. Dying at floor 100 costs -60. Early deaths are cheap experimentation. Late deaths are expensive. The agent has something to lose.
Death margin reward. The agent gets a continuous signal based on its distance from the camera's kill boundary. Camera approaching means reward shrinking. No arbitrary stall penalty, just the game's own pressure curve, expressed as a number.
Extreme mode. A training-only difficulty (difficulty=3) where the camera starts at 2x hard-mode speed and accelerates aggressively. Within 20 seconds, one-floor hops physically cannot outrun the camera. The agent must discover charged jumps and combos to survive, not because I reward them, but because physics demands it.
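The post quotes two data points for the scaled death penalty: -12.5 at floor 5 and -60 at floor 100. One linear formula consistent with both (my reconstruction, not necessarily what the code does), alongside an illustrative margin reward:

```python
def death_penalty(floor):
    """Height-scaled death cost: fits -(12.5) at floor 5 and -60 at
    floor 100. The linear form is an assumption from those two points."""
    return -(10 + 0.5 * floor)

def death_margin_reward(margin, scale=0.01):
    """Continuous signal from distance to the camera's kill boundary.
    Camera closing in -> margin shrinking -> reward shrinking.
    The scale coefficient is illustrative."""
    return scale * margin
```

Both terms express the same philosophy: no hand-tuned "play daring" bonuses, just the game's own pressure curve rendered as numbers.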
I stripped out the flat stall penalty, the wall proximity bonus, the flat alive bonus. Let the game teach.
"teach it how to do -> teach it to do it often -> teach it it can't survive without."
The three-act structure
That quote crystallized the meta-pattern across all 8 training phases. It's a pattern for teaching game-playing agents, and I think it generalizes beyond my specific game:
Act 1: teach the mechanics. The agent can't be rewarded for something it hasn't tried. High entropy, simple rewards, fixed seeds where the target mechanic is required from floor 2. Each mechanic is a prerequisite for the next; the agent can't learn charged jumps if it hasn't learned to move sideways. Build the skill stack sequentially.
Act 2: make it habitual. Once the agent knows how, make it do it often. Reward shaping, combo chain bonuses, speed gradients. Strip back the training wheels and keep the good stuff. The agent should be doing the right things by habit, not by accident.
Act 3: make it necessary. Extreme camera speed means one-floor hops physically can't keep up. The agent fights for its life using combos and charged jumps, not because the reward function says to, but because it literally cannot survive without them.
"on phase 8, extreme, it still dies a lot, and at small floors, but it does fight for it's life, meaning more combos, more urgency, less stall, and at this point i can work on accuracy, and top floor top score motivation, without losing the 'do cool tricks' behaviour."
You can't skip to Act 3 without Act 1. An agent thrown into extreme difficulty with no skills dies immediately and learns nothing. An agent that went through all three acts has skills, has habits, and now has survival pressure reinforcing those habits. The game itself becomes the teacher.
The dopamine loop
I need to be honest about something. The training loop is addictive.
"that is just sooo cool. this is the point where it's actually dopamine inducing. every new idea that works, and being able to see it on screen real time is euphoria. it's like drugs."
Here's the cycle: you tweak a reward coefficient. You launch training. You open the pygame viewer and watch the agent play with the new reward function. Within a few minutes, you see new behavior emerging. The agent tries something it's never tried before, because the incentive structure changed. Sometimes the new behavior is exactly what you wanted. Sometimes it's a hilarious side effect you didn't anticipate (wall-bouncing for the bounty, jumping in place to farm the alive bonus). Either way, you learn something, tweak another coefficient, and go again.
This is the exact same feedback loop that makes the game itself addictive: the combo system, the "one more run" pull, the escalating difficulty that keeps you in the flow state. I accidentally built a meta-game on top of my game. The game's dopamine loop is combo chains and floor counts. The trainer's dopamine loop is reward shaping and emergent behavior.
Phase 9 breakthrough and going bigger
Phase 9 (extreme difficulty, 10 million steps) showed the approach working. The reward had plateaued at 1500 for most of the phase, then broke through to 2260. Max combo jumped to 14. The evaluation callback saved a new best model, meaning the improvement was real and not just training noise. Entropy was still healthy (the agent was still exploring), and explained variance was climbing (the value network was catching up to the policy).
The question became: can a bigger network do better?
The 64x64 architecture had roughly 10,000 parameters. A 128x128 network would have roughly 40,000, enough to represent more nuanced strategies for a 37-float observation space without being so large that it overfits or trains slowly.
But you can't load a 64x64 checkpoint into a 128x128 network. The dimensions don't match. You'd have to start from scratch, re-running 9 phases of curriculum. That's hours of training time, and you'd lose all the skills the 64x64 model learned.
Unless you steal from yourself.
Behavior cloning: 88.2% accuracy
Instead of starting fresh, I decided to teach the new network what the old one already knew.
I played 11 games on the iOS Simulator, real human games, across varying difficulties, producing about 150,000 ticks. Human recordings over AI recordings, because humans play with intention and game sense. The AI plays with approximation.
The behavior cloning pipeline (training/pretrain.py) replays each .gcrec recording through the Python environment, collects every (observation, action) pair (98,670 of them from 11 games) and trains a fresh 128x128 policy via supervised cross-entropy loss. 20 epochs, batch size 256.
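The supervised objective behind behavior cloning is ordinary cross-entropy: treat each replayed (observation, action) pair as a labeled example and push the policy's distribution toward the human's action. A minimal numpy sketch of the loss and the accuracy metric, shown for a single linear layer standing in for the full 128x128 MLP:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def bc_loss_and_accuracy(W, obs, actions):
    """Cross-entropy loss and action-prediction accuracy for a linear
    policy with logits = obs @ W over the 6 discrete actions."""
    logits = obs @ W                            # (batch, 6)
    probs = softmax(logits)
    n = len(actions)
    loss = -np.log(probs[np.arange(n), actions] + 1e-12).mean()
    accuracy = float((logits.argmax(axis=1) == actions).mean())
    return loss, accuracy
```

The real pipeline minimizes this same loss over the 98,670 pairs with minibatch gradient descent (20 epochs, batch size 256); the 88.2% figure below is exactly this accuracy metric on the collected data.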
Result: 88.2% action prediction accuracy. The remaining 12% is mostly ambiguous frames where multiple actions are reasonable. The network learned to play like me, well enough to skip the bootstrap phases and launch PPO directly from phase 5.
The 128x128 model ran through phases 5-9. On extreme mode, it oscillated between 1800 and 2100 reward, comparable to where the 64x64 ended. The larger network hasn't shown a decisive improvement yet, which might mean the bottleneck isn't network capacity but the reward function or the environment design. That's where I left it. The setup is solid. The remaining question is whether more training time or a different curriculum pacing will push the agent from "competent" to "daring."
What I actually learned
I started this with zero ML experience and ended it with a trained agent that chains combos, discovers mechanics on its own, and fights for survival under camera pressure I can barely handle myself.
The technical learning was valuable. But the deeper insight is about observation. Every wrong diagnosis came from reading numbers. Every right diagnosis came from watching the agent play. The pygame viewer taught me more about my own game's mechanics than months of playing it. I watched the agent discover the momentum two-floor jump and thought: I didn't teach it that. The physics did.
The cross-validation taught me something similar. Three Python port bugs (sprite width, combo refill, RNG drift) were invisible in synthetic tests and obvious in real gameplay recordings. The recording system I built for replay testing turned out to be the single most valuable debugging tool for the ML pipeline.
And the deeper lesson: the architecture decision I made on day 3, keeping GameRuntime as a pure function of inputs with zero UIKit/Metal imports, is the reason all of this was possible. A game engine that's a typed function from InputState to RenderFrame is trivially portable to Python. A game engine tangled with rendering and platform APIs would have required months of extraction work. I didn't know the AI player existed when I made that architecture decision. But the clean architecture made it a two-day project instead of a two-month one.
Same itch as day 1. Same approach: jump in with Claude, figure it out as you go, ship something that works. The gap doesn't disappear. It shifts. On day 1 the gap was "I don't know Swift." On day 20 the gap was "I don't know reinforcement learning."
This is post 15 of 18 in a series about building Geo Climber with Claude Code. The AI player works. It chains combos, discovers mechanics on its own, and fights for survival under camera pressure I can barely handle myself. Join the Discord and download Geo Climber on the App Store.