One Terminal to Orchestra

How the Claude Code workflow evolved from a single chat window on day 1 to a multi-agent orchestration system on day 18: plugins, skills, agents, hooks, worktrees, and the first domain-specific skill that understood difficulty tuning.

Eyal Harush · 11 min read

On day 1 of Geo Climber, I had one terminal window and a chat. I'd type a question, Claude would answer, I'd paste code into files, run the build, fix errors, repeat. The loop was tight and it worked. A web dev with no iOS experience ported a C++ vertical climber to Swift in a single day using that loop.

On day 18, I had plugins providing integrations with Linear, Discord, Sentry, and the game's own backend API. I had skills (/ship, /difficulty-adjuster, /linear-review) that encoded multi-step workflows I'd previously typed from scratch every time. I had agents (checkpoint, changelog, level-generator) that could be dispatched with their own context and tools. I had hooks that triggered on specific events. I had worktrees isolating feature branches into separate directories so I could work on the recording system without touching master. I had a spec-first workflow (brainstorm, plan, implement, ship) running on a 180 EUR/month Claude Max plan because the workflow ate tokens and the tokens were worth it.

Thirty-five commits that day. Five PRs merged. The extensibility overhaul, the first domain-specific skill, the recording system's entire implementation from spec to ship. This post isn't about any one of those things. I covered the day's events in The Boring Stuff (at 1am) and Teaching the Game to Remember. This post is about the gap between day 1 and day 18. What happened in that gap. Why it happened when it did and not sooner. And what I noticed about the pattern only after it was done.

The slow drift

The extensibility overhaul didn't come from planning. It came from friction.

Over the first 17 days, my Claude Code usage had settled into routines. I had a commit-push-PR-review-merge-changelog-Discord-Linear sequence I ran every time I shipped something. I had a difficulty-tuning conversation I had every time testers said the game was too hard (or not hard enough). I had a level-art generation pipeline I walked Claude through every time I needed new zone assets. Each of these was a five-to-fifteen minute preamble of "let me explain how this works" before the actual work started.

I didn't notice the drift because each individual session felt fine. A five-minute preamble on a thirty-minute task doesn't feel expensive. But I was running four or five of these sessions a day. Twenty minutes of preamble daily. An hour and a half per week of re-explaining things Claude had heard before but couldn't remember across sessions.

The drift also showed up in a subtler way. Claude was getting slower. Not computationally, but in judgment. Responses were longer than they needed to be. The model was making more mistakes on tasks it had done correctly a week earlier. Prompts that used to produce clean results on the first try were producing results that needed two or three corrections.

In retrospect, the cause was obvious. I was using a general-purpose assistant for domain-specific work without giving it domain-specific structure. Every session started from zero context (plus CLAUDE.md, which helped but couldn't encode everything). Every task was "generic code assistant tries to understand this particular game's tuning rules." Of course the results degraded as the tasks got more specialized. The tool hadn't changed. My work had.

The trigger

I woke up on April 3rd and knew something had to change.

"I got up and realized I am not pushing Claude to its full potential. It's becoming pretty slow again. Makes a lot of mistakes. Some of my workflows stabilized and got realized, to the point where it made sense to automate. Woke in a studying mood. All pointed to that direction, and the ship skill was a success too, so it was time to really see what's possible."

The /ship skill had been running for two days by that point. I'd built it on shipping day to automate the commit-push-PR-review-merge-changelog-Discord-Linear sequence. It worked. One command instead of eight steps. No re-explanation needed. The skill knew the sequence, the conventions, the tools.

That single skill proved the concept. If I could encode the shipping workflow into a reusable skill, I could encode the difficulty-tuning workflow. The level-art pipeline. The Linear board review. The code review process. Every recurring workflow that I was re-explaining from scratch.

I didn't build the skill on day 3 when I started shipping things. I built it on day 16 when the shipping workflow had stabilized. By day 16, I'd run the same eight-step sequence by hand enough times that I knew exactly what each step was, what order they went in, what could go wrong, and what the correct recovery was. If I'd tried to automate it on day 3, I would have automated the wrong workflow, the one I thought I'd use, not the one I actually settled into.

Premature automation is the tooling equivalent of premature optimization. It encodes patterns you haven't validated yet. Wait for the workflow to stabilize. Wait for yourself to get annoyed by the repetition. Then automate.

Four primitives

Claude Code's extensibility model has four building blocks, and understanding why they're separate matters more than knowing what each one does.

Plugins are integrations, connections to external tools and services. The Linear plugin knows how to read and write issues. The Discord plugin knows how to post to channels. The Sentry plugin knows how to investigate crashes. Plugins provide capabilities. They don't decide when or how to use them.

Skills are named workflows with their own instructions. /ship knows the eight-step shipping sequence. /difficulty-adjuster knows the game's tuning rules. /linear-review knows how to walk through stale issues and recommend actions. Skills encode judgment: they decide what to do and in what order, using plugin capabilities as building blocks.
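A skill, in Claude Code, is essentially a markdown instruction file. The post doesn't show the actual /ship file, so the sketch below is illustrative, not the author's real skill: invented frontmatter wording, with the eight steps of the commit-push-PR-review-merge-changelog-Discord-Linear sequence spelled out as numbered instructions.

```markdown
---
name: ship
description: Run the full shipping sequence for the current branch
---

When invoked, perform these steps in order:

1. Commit the current changes with a conventional commit message.
2. Push the branch to origin.
3. Open a pull request with a summary of the change.
4. Run the code review checklist and resolve anything blocking.
5. Merge the pull request once checks pass.
6. Append an entry to the changelog.
7. Post a release note to the Discord channel.
8. Move the matching Linear issue to Done.
```

The point of the shape: the sequence, the order, and the recovery knowledge live in the file, so none of it has to be re-typed at the start of a session.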

Agents are specialized sub-processes with their own context. The level-generator agent has its own system prompt, its own tools, its own understanding of the zone art pipeline. Agents handle delegation: tasks that are big enough or specialized enough that they need their own context rather than polluting the main conversation.
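Agents follow a similar file-based shape: frontmatter naming the agent and scoping its tools, followed by the agent's own system prompt. A hypothetical sketch (the name, tool list, and prompt below are invented for illustration, not the project's actual level-generator):

```markdown
---
name: level-generator
description: Generates zone art assets through the level-art pipeline
tools: Read, Write, Bash
---

You are the level-art agent for Geo Climber. Given a zone name,
walk the art pipeline end to end and write the generated assets
into that zone's asset directory. Do not touch gameplay code.
```

Because the agent carries its own prompt and tools, the asset-generation back-and-forth happens in its context, not the main conversation's.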

Hooks are event-triggered shell commands. When Claude uses a tool, when a session ends, when context is compacted. Hooks handle enforcement: making sure things happen automatically without relying on anyone remembering to trigger them.
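Hooks live in configuration rather than markdown. A sketch of the shape a hook might take in a Claude Code settings file, running a linter after every file edit; the matcher and the lint command here are assumptions for illustration, not the author's actual setup:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "swiftlint lint --quiet" }
        ]
      }
    ]
  }
}
```

The event name decides when it fires, the matcher decides which tool uses trigger it, and the command runs without anyone having to remember it exists.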

The distinction matters because each primitive solves a different kind of friction. If I'm re-explaining how to connect to Discord, that's a missing plugin. If I'm re-explaining the shipping sequence, that's a missing skill. If I'm losing context because a sub-task pollutes my main conversation, that's a missing agent. If I'm forgetting to run a check before committing, that's a missing hook.

On day 18, I installed or built all four. Thirty-one files changed in the extensibility overhaul commit alone. By the end of the day, the Claude Code setup was unrecognizable from day 1.

When tools understand your domain

The difficulty-adjuster skill is the one I keep coming back to when I think about this transition.

PR #36 landed the same afternoon as the extensibility overhaul. It did two things in one commit: rebalanced the game's difficulty curve and added the /difficulty-adjuster skill.

Before the skill, every difficulty tuning session started with a long explanation. "Open game-content.json. Find difficultyProfiles. Here's how the timer system works: there's a base speed, an acceleration curve, and a window before the chase camera catches you. Here's how platform weights work: easy mode has more small platforms, hard mode has more big jumps. Here's what I want: make levels 1 through 3 slightly harder but don't touch the late-game. And don't break the separation between easy, normal, and hard." I was typing some version of this every time testers gave feedback.

After the skill, it's "make the early game slightly harder." The skill knows the config format. It knows the weight system. It knows the invariants I care about. It knows that floor 50 in normal mode should take about two minutes. It knows not to collapse the difficulty tier separation. All of that domain judgment is encoded in about 150 lines of markdown.
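The post never shows game-content.json, but the kind of structure the skill encodes knowledge of might look something like this; every field name and value below is hypothetical:

```json
{
  "difficultyProfiles": {
    "easy":   { "timerBaseSpeed": 0.8, "timerAcceleration": 0.9, "smallPlatformWeight": 0.7 },
    "normal": { "timerBaseSpeed": 1.0, "timerAcceleration": 1.0, "smallPlatformWeight": 0.5 },
    "hard":   { "timerBaseSpeed": 1.2, "timerAcceleration": 1.2, "smallPlatformWeight": 0.3 }
  }
}
```

The skill's markdown then layers judgment on top of a structure like this: which fields implement "early game," and which invariants (tier separation, the roughly two-minute floor-50 target) must still hold after any change.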

This is the moment that felt qualitatively different from everything before. The /ship skill was a generic workflow (commit, push, PR, merge). Any project could have one. The difficulty-adjuster skill was domain-specific. It understood Geo Climber's tuning model. It could only exist for this game, with this config structure, these tuning rules.

When your tooling starts understanding your domain (not just your language, not just your framework, but your actual product's rules and constraints) the relationship between you and the tool changes. You stop being an operator giving instructions. You start being a director giving intent. "Make the early game slightly harder" is intent, not instruction. The skill translates intent into the specific config changes that implement it.

I noticed I was doing less typing and more thinking. The raw ratio of keystrokes to shipped changes dropped. Not because I was being lazy, but because the keystrokes I used to spend on context-setting were now encoded in skills, and the keystrokes I still typed were the ones that required judgment.

The shift I almost missed

Something else happened on day 18 that I didn't register until days later. The SEO overhaul (PR #35) and the analytics instrumentation (PR #37) both shipped that day, and they felt different from everything I'd shipped before.

They felt like maintenance.

Not in a negative way. In a factual way. SEO and analytics are important work. They make the project more visible and more measurable. But they didn't feel like building. They felt like operating. Like work you do on a system that exists, not work you do on a system you're bringing into existence.

That subtle shift, from building to operating, is a signal. It means the core of the product exists and works. The building isn't done (it's never done), but the foundation is solid enough that you can do maintenance work on top of it without worrying that the foundation will shift.

I mention this because it's easy to romanticize the building phase and feel deflated when maintenance work starts appearing. "I used to ship Metal renderers and now I'm fixing SEO meta tags." The reframe that helped me: maintenance work appearing doesn't mean the exciting part is over. It means the exciting part produced something real enough to maintain. The SEO work exists because the website exists because the game exists because the last 17 days happened.

And anyway, on the same day I was doing the "boring" SEO work, I was also building the recording system on a feature branch, designing a binary codec, hunting determinism bugs, and starting to think about training an ML model to play the game. The maintenance work and the ambitious work coexist. They have to. There's only one of me.

Thirty-five commits

I keep coming back to that number. Thirty-five commits in one day. Five PRs. The extensibility overhaul, the difficulty-adjuster skill, the SEO overhaul, the analytics instrumentation, the altitude progression system, and the entire recording system (from design spec to working Watch Replay button) on a separate feature branch using worktrees for the first time.
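The worktree mechanics behind that separate feature branch are plain git. A self-contained sketch of the flow, with a throwaway repo standing in for the real one (repository and branch names are illustrative):

```shell
# Create a throwaway repo so the sketch runs anywhere.
cd "$(mktemp -d)"
git init -q geo-climber && cd geo-climber
git -c user.email=dev@example.com -c user.name=dev \
    commit -q --allow-empty -m "initial commit"

# Check the feature branch out into its own sibling directory.
# Builds and commits there are independent of the main checkout.
git worktree add ../geo-climber-recording -b recording-system

git worktree list   # two checkouts, one repository

# Once the branch is merged, clean up the directory and the branch.
git worktree remove ../geo-climber-recording
git branch -d recording-system
```

Both checkouts share one object store, so a branch-in-progress can sit next to master all day without either directory's build state contaminating the other.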

That day was the busiest of the project, and it was also the day the project's workflow matured from "direct chat" to "orchestrated system." Those two facts are related. The reason I could ship thirty-five commits in one day is the same reason I overhauled the extensibility model that day: the workflows had stabilized enough that they could be automated, and once automated, they could be run in parallel, with less context-switching overhead, at higher throughput.

The orchestra metaphor isn't mine, but it fits. Day 1 was a solo performance. Day 18 was a conductor with an ensemble, each instrument (plugin, skill, agent, hook) playing its part, the conductor providing intent and judgment. The conductor is doing less mechanical work than the solo performer. The output is larger, richer, and more reliable.

I don't think this transition is unique to my project or my tools. I think it's the natural arc of any sufficiently complex solo project with AI assistance. You start with direct interaction. You settle into patterns. The patterns calcify into friction. You automate the patterns. The automation frees you to focus on judgment. The judgment gets better because you're spending more of your attention on it and less on mechanics.

The gap between day 1 and day 18 isn't about the tools getting better. The tools were the same. It's about me learning what I actually needed and encoding it into a structure the tools could use. The slow drift, the friction, the moment of "I should really automate this," that's the learning. The automation is just the artifact.

This is post 13 of 18 in a series about building Geo Climber with Claude Code. Day 1 was one terminal. Day 18 was an orchestra. The gap between them is the real arc of the series. Join the Discord and download Geo Climber on the App Store.

extensibility · plugins · skills · agents · workflow-evolution · difficulty-adjuster · worktrees · claude-code