Using coding agents well

A mental model for LLM coding agents, and the controls that come with them.

Writing code is suddenly cheap. Writing good code — code that works, that you have proven works, that someone else can maintain — is not. Writing bad code is still very expensive.

For professional use, try to use coding agents like scalpels, not bazookas. Spending tokens to produce code you can vouch for can be highly effective and cost efficient. Spending tokens to produce code you cannot — or spending many times what the task actually needs — is not. The goal here is to optimise, not minimise. Cheap sessions that generate a day of rework are not cheap.

There is definitely a place for vibe coding a throwaway demo to see if an idea is worth pursuing at all, if you are clear that's what it is. These notes are for the work you intend to keep or give to someone else to use.


A note on motivation

These are observations more than prescriptions. Coding agents are now genuinely remarkable — in code understanding and quality they now match or surpass many of us. What they tend to lack is the wider context of what the change is really for, who maintains it, the constraints that only come up in conversation with a client or a colleague. Because the code-writing feels so capable, it is very tempting to cognitively surrender and let the agent take bigger and bigger bites. I try to resist that for two reasons.

The first is the codebase. The agent is missing the wider picture, and missing the feeling of pain from past mistakes — like the over-complex code you now have to maintain. Those gaps gradually add debt that you can avoid (or learn the skills to avoid in future) by staying engaged with the code as it changes.

The second is more personal. If I treat the model as better than me and start passing other people's instructions into it on autopilot, I am not building any sort of craft. The friction of doing the work is part of how skills develop. If I skip it entirely for long enough, the scale of problems I can be trusted with will stop growing and I would expect my career to stagnate.

None of this is a hard rule — sometimes speed matters more than learning. But the default I try to set for myself is to really stay in the work, even when the agent could do it without me.


How a coding agent actually works

If you have never built one, here is the simple mechanism. A coding agent is just an LLM running in a loop with access to a computer with tools. You give it a goal. It can call tools — run a shell command, read a file, edit a file, search the web. The tool returns a result. The result goes back into the LLM as part of an ever-growing conversation. The LLM decides the next step. This repeats until the LLM decides the goal is met, or gives up.

Everything sent and received accumulates in the context window — the LLM's working memory for this session. On every turn, the entire conversation so far is sent to the model again. That is why long sessions get expensive, and why — as we will see — they also get measurably worse at thinking and coding.

The loop: a prompt enters the LLM; the LLM iterates with its tools until the goal is met. Everything sent and received accumulates inside the context window. Each new turn also pays for all the turns before it, but with cheaper input tokens if they remain the same and are cached.

Almost every strategy on this page boils down to one of three things: what you put into the context, how you let the loop use tools, and when you start fresh. Once that mental model is in place, the rest becomes clear.



01 Understand and align before you delegate

The biggest source of waste is delegating to an agent before you understand the problem. The agent produces a large, plausible diff but it doesn't solve the problem, or solves the wrong problem. You only spot this after reading it carefully or discover what it's doing during use — by which point the cheap part is done and the expensive part, rework, is starting.

Spend the first few minutes letting the model teach you the unfamiliar part: a new library, a new API, the bits of the existing codebase you don't know. Understanding acquired before the agent starts is the cheapest you will get. Discovered mid-task, it is the most expensive.

For anything ambiguous, get the agent to interview you first. As someone who is not a software engineer by training, I do this every time to solidify my own ideas, not just the agent's. Have it surface its assumptions and resolve open questions one at a time before it writes a spec for you. Misalignment caught after implementation is, by some distance, the most expensive kind of bug to fix. This does not require an elaborate prompt or skill — simply asking it to interview you and surface and resolve all assumptions systematically.

Where the change is consequential, or you are working in an area you don't know well, get a more experienced engineer to sanity-check the approach before the agent starts. The model is, uncomfortably, a mirror — it reflects back what you tell it, without the grounded perspective that an engineer who has lived in this code has. Three or four questions from someone senior can save a day of confidently-wrong agent work.

Five minutes of orientation up front pays back many hours of rework or starting again.

02 Bound the agent in the codebase, not just in the prompt

Every line of code the agent writes is a line that needs to be maintained, and maintenance is also expensive. The agent has no inbuilt incentive to stop. Left alone it will keep refactoring the same area, generalising for no real reason, and spreading a small change across more files than it needed to. The barriers that stop it have to be in the codebase, not just in the prompt.

Decide upfront what modules the change affects, what their interfaces are, and what is explicitly off-limits. Get the agent to interview you on the design before it touches anything — not just "what should this function do" but "what should this module own, and what should it not". You are trying to keep the changeset small by drawing boundaries the agent will stay inside, rather than by hoping it self-limits. A good test: if you cannot describe the change as "edits in module X, with the existing interface to Y", the scope is still too vague.

The agent won't bound the work for you. Barriers belong in the codebase, not just in the prompt.

03 Stay in the smart zone

Context windows are advertised in million-token round numbers, which is misleading for this kind of use. Models get measurably worse at cognitive tasks as the window fills, and the decline starts long before the context is "full." As of mid-2026, treat the first 100,000 tokens or so as the smart zone. Past that you pay twice: a larger input token bill on every turn, and worse reasoning and code.

This might seem counter-intuitive, but a large context is not a sign that the model is being thorough. It is usually the model having to sift through lots of unhelpful sediment to find the bits you really want it to use and think about. The first practical step whenever a session starts to grow or feel sluggish is to ask: how full is the context, and how much of what is in there is still important? Should I clear it?

Smart zone — drag to fill the context
quality / cost context filled → smart zone reasoning quality cost per turn
Quality: Cost per turn: in zone
Quality is roughly flat inside the smart zone, then falls noticeably. Input token cost climbs linearly throughout and output tokens are used less efficiently as there is more noise to reason through. Past the zone you are paying more for worse answers.

The Codex CLI shows the context indicator in the footer. Watch it.

codex — context indicator
coding-agents-screenshots/01-context-indicator.png Crop the CLI footer showing the context-fill bar mid-session.

04 Keep tasks small, threads short

Auto-compaction is still a bit of a sticking-plaster for real engineering work and a sign you are tackling too many things at once. A compacted thread carries sediment forward — introduces vague summaries and references to files that no longer matter to the work the agent is doing. A fresh thread starts from a clean state every time and lets you intentionally curate it with the context needed for the work you are doing.

My rule is one thread per task, not one per project. When the task changes, clear the context yourself and re-seed it deliberately (details on AGENTS.md and skills to come in a moment). This sits in slight tension with OpenAI's guidance to stay in long threads to keep the reasoning trail. Restarting can be disorienting for users who haven't understood how the agent works but, once we've understood how context accumulates and how to add to it deliberately, clearing it ourselves frequently beats letting it auto-compact.

If you cannot hold the task in your head, the model will run ahead of you and introduce things (rightly or wrongly) that you didn't intend. Break it down. Smaller tasks fit inside the smart zone naturally.

Compaction shortens every bit equally. Clearing keeps only the bits you choose to re-seed, at full length.

05 Curate context: reference, don't paste; filter, don't dump

The default failure mode is to give the agent too much for convenience. You paste a whole file when you only need a function from it. You paste a whole log when you only need the failing line. You enable a search tool and let it ingest twenty matches at full length when one would have been sufficient. Every unnecessary token has two costs: the API tokens on the meter, and the capacity it takes from the smart zone. Luckily agents are quite optimised for efficient searching, as long as you've given them enough context about what they are searching.

Point at file paths instead of pasting their contents. Put long output through head, tail or grep before the model sees it. Tell the agent where to look so it doesn't search the whole tree on a guess. This is also why agents that can use grep and glob on text files are so much cheaper than the alternative — they read by reference and return only the relevant lines.

Curating context — same question, two different strategies

Paste the whole file

"Here is the entire billing.py — find the bug in compute_tax."

0 tokens

Loads the entire 1,800-line file. ~24k tokens. The model now has the rest of billing.py sitting in the smart zone, doing nothing for you.

Reference and filter

"Run grep -n compute_tax billing.py then read just that function."

0 tokens

Agent runs grep, gets back the line numbers, reads ~40 lines. ~600 tokens. Same answer, ~40× cheaper, and the rest of the smart zone stays available.

Tokens shown are illustrative but close to what real files cost.
The agent's search tools are how you give it leverage. Using them well is the difference between dragging the whole library to your desk and asking the librarian for one passage.

06 Write durable context once (AGENTS.md)

Some context is durable. The repository layout. The build and test commands. The conventions you follow. What "done" means here. The two or three things you've learned the hard way in prior sessions that the agent must never do. This is the context that's required every time you clear the session, to avoid a scene from Memento, but you shouldn't re-type it.

Put it in AGENTS.md at the repository root and the agent will load it automatically. Keep it short and accurate. Add a rule only after you've watched the agent make the same mistake twice — rules added on a hunch are noise and cost tokens in the smart zone. Don't waste tokens on things the model already knows how to do well.

Durable context is paid for once. Per-prompt context is paid for every single turn.
codex — session start, AGENTS.md loaded
coding-agents-screenshots/02-agents-md-loaded.png The session-start banner that lists which AGENTS.md / config files were loaded.

07 Let the agent write code, where it can

MCP servers (and tool registrations in general) have two hidden costs that catch people out.

Every connected tool's definition sits in the context window on every single call. Wire up the big GitHub MCP and you've permanently added several tens of thousand tokens of schema to every prompt, whether you use those tools or not. Having lots of tools also makes the model worse at choosing between them.

When an agent chains tool calls — tool A's output becomes tool B's input becomes tool C's input — the LLM's whole network runs at every step just to copy the previous result into the next call. At those hops the model isn't really reasoning; it is acting as an expensive courier between tools. LLMs are also far better at writing code than at issuing tool calls. Their training data is mostly code, with far less tool-call syntax.

The rule of thumb: reserve MCP for what it is actually good at — user authorisation, structured or stable APIs, and integrations the agent could not easily write itself. For everything else, let it write a bash command, a Python snippet, a one-shot script. Or expose your tools as a code API and let the model compose them in one pass.

Tool-call chain — vs — code mode

Tool-call chain (three steps)

LLM
tool 1
LLM
Tokens used: 0

Code mode (one script)

LLM
script runs all 3 tools
final answer
Tokens used: 0

The chain pays the LLM's per-turn cost at every hop, plus the cost of the tool definitions themselves on each of those turns. Code mode pays the LLM once and lets the cheap, local code do the shuttling.

The cheaper mechanism is usually the one that runs the LLM fewer times.
codex — loaded MCP servers
coding-agents-screenshots/03-mcp-servers.png Optional. The startup line or settings panel listing connected MCP servers and their tool counts.

08 Dial reasoning effort to difficulty

Modern models let you turn up the amount of "thinking" they do before answering. Extra-high reasoning is useful on hard problems — tracing a bug through three services, designing a tricky migration, comparing two architectures. On a well-scoped, well-understood task it is slower and more expensive for the same answer.

Default to low or medium. Reach for high only when the task really needs it. Change the setting often and build a feel for the trade-off; the skill is noticing when a hard reasoning run pays for itself and when it doesn't.

Reasoning dial — for two different tasks
lowmediumhighextra-high
0

On the simple task, dialling up burns tokens for no gain. On the hard one, dialling down may simply fail to find the bug. The right level is task-dependent.
codex — /model picker
coding-agents-screenshots/04-model-picker.png The /model menu showing the reasoning-effort options.

09 Use subagents to protect the main thread

The cheapest way to protect a clean main context is to send the noisy work somewhere else. A subagent is a fresh copy of the agent with its own context window, given a narrow brief, that returns a short summary instead of a full transcript. Exploring an unfamiliar codebase, running a verbose test suite, scanning a large directory for a pattern — all of these are subagent work.

Where the subagent is doing simple, well-scoped work, consider giving it a smaller and faster model — the cost saving is significant, and on bounded tasks the quality difference is often invisible. But don't go overboard. A swarm of specialist subagents adds coordination overhead that often outweighs the context you save. The point is to practice main agent context hygiene, not agentic org design.

Subagents are a token-budget trick disguised as an organisational one.

10 Prove it works

An agent without a feedback loop is not writing code like an engineer — it is predicting if the code will work. So require evidence that the change works. But be careful with what counts as evidence: an agent that writes its own tests can produce passing tests that prove very little. Tests written by an agent need a second pair of eyes — yours — to confirm they exercise the behaviour you really want and cover edge cases, and not some weaker version. Always read the tests in detail — especially if you are leaning on them to review the code in less detail. The less time you spend reading the implementation, the more weight the tests are carrying, and a test you haven't read carefully is not really proof of anything.

Prefer test-first: write the failing test (or have the agent write it, then read it), confirm it fails for the right reason, then implement until it passes. That is what "red, green" means. Skipping the red step is the most common failure — without it, you risk a test that asserts nothing and would pass against almost any implementation, including the wrong one.

Tests are necessary but not sufficient. Watch the change in action yourself wherever you can — click the button, run the workflow, check the output. This is also a reason to keep changes small. A massive diff cannot be verified end-to-end, and an agent rarely notices when a sweeping change has quietly broken or degraded something elsewhere. Smaller changes are easier to prove out, and easier to back out if they go wrong.

Don't lean only on tests, either. Deterministic tooling — linters, formatters, type checkers, static analysis — verifies deterministic properties more reliably than any agent will. Tell the agent not to worry about style and let the tooling enforce it; you free up its reasoning for the parts that actually need it.

Deliver the diff with the evidence. The commands and their output. A short screen capture. A test that goes red, then green — and that you have read. Without that, your reviewer is doing the verification work for you, and you haven't really delegated to the agent. You've delegated to the reviewer.

Red to green — with a regression check
RED 1 failing

      

"Proven works" means: you watched it work, you read the tests, and there is a test that will scream if it stops working. Sharing the evidence of this helps your colleagues and clients avoid overload reviewing all of your PRs from first principles themselves.

11 Compound your successes

The first time you solve something non-trivial with an agent, the figuring out costs real tokens and real time. The second time, that cost should be significantly smaller if you codify it. The way to do that is to capture the working pattern — as a script, a small wrapper, a SKILL.md, a section of AGENTS.md, a saved prompt. Re-inventing the same solution over many small loops is expensive because the cost is duplicated across many sessions, including where the steps sometimes go wrong.

Also remove friction between the agent and the artefact it has to edit. If touching a PowerPoint deck requires the agent to install nine packages and write custom OOXML, every task involving that deck pays a significant overhead in commands and tokens. A thin wrapper or small script that lets the agent work on the artefact directly is a one-time investment that pays back across every future run.

The compounding habit is probably the single highest-leverage one on this page.

12 Watch the meter

You cannot manage a number you never look at. Most coding agents bill by tokens rather than messages, and reasoning tokens also cost money, so only counting your turns tells you very little. Get into the habit of glancing at usage during sessions until you have a feel for what a task should cost. Then notice when a session is much more expensive than that, and ask why.

Three panels are worth knowing in Codex specifically. The live ones tell you about this session; the dashboard tells you about last week.

QuestionWhere to look
How full is my context right now? The context indicator in the CLI footer.
What's left in this 5-hour / weekly window? /status inside an active session.
What did I spend this week, by model? platform.openai.com/usage.
codex — /status output
coding-agents-screenshots/05-status.png The /status panel showing rolling 5-hour and weekly windows, model and plan.
platform.openai.com/usage
coding-agents-screenshots/06-usage-dashboard.png The usage dashboard with a few days of tokens, filtered by model.

Cost review is a habit, not a one-off. When the agent makes the same mistake twice, ask it for a retrospective and put the lesson into AGENTS.md. This will help you achieve cheaper and more reliable future runs.


The short version

Treat the agent like a scalpel. Understand the problem before you delegate it. Scope each task to fit inside the smart zone. Curate the context that goes in. Let the agent code rather than chain through tool calls when it can. Verify the change works yourself; don't take the agent's own tests as proof, and review with fresh eyes. Capture what worked so the next run is cheaper than this one.

Low token use and fast turnaround are by-products of precision, not goals in themselves. Aim at them directly and you tend to deliver neither.