
The two skills most Claude Code power users skip: keeping token spend and rate limits predictable, and regression-testing the prompts, skills, and commands you depend on so they can't quietly get worse. What the cost command actually measures, the model and caching levers that really move the bill, reading your usage with ccusage and OpenTelemetry, and a small eval suite built on headless print mode and promptfoo.
The last rung of doing everything by hand: spend less, and keep what you've built from quietly getting worse.
Cost and rate-limit engineering. Why the /cost command is meaningful only on a pay-as-you-go API key and misleading on a subscription (use /status and /usage instead), and the gotcha where a stray ANTHROPIC_API_KEY bills you through the API while your Max plan sits unused. The two stacked limit windows (the five-hour rolling window and the seven-day weekly caps, including the separate cap on the top model), drawn from Anthropic's usage and limits docs and the Pro/Max plan guide. Current per-token pricing and the clean five-times pattern (output is 5x input; each model tier is ~5x cheaper than the one above), prompt caching at a 90% read discount and why a stable CLAUDE.md keeps the cache hot, and the batch path at 50% off. Reading your real usage with ccusage and exporting OpenTelemetry metrics to a dashboard. The levers that move the bill most: /model, /compact and /clear, subagents that return summaries, --max-turns, and the thinking-budget setting (thinking tokens bill as output). More in Manage costs effectively.
Evaluating your own prompts, skills, and agents. Why your setup drifts (model updates, CLAUDE.md edits, accumulating instructions) and how regressions stay silent. Building a tiny eval suite with headless print mode: a fixtures folder, a pinned model, and code-based checks (does it compile, do tests pass, does it contain the required clause) before reaching for an LLM-as-judge rubric. promptfoo for assertions and judging, Anthropic's evals guidance that code-based grading wins when feasible, and four ways evals lie to you: tiny overfit sets, judging style over correctness, eval cost, and non-determinism.
News up top: Opus 4.8 as the new Claude Code default with extra-high effort and Dynamic Workflows (docs), and today's 2.1.160 write-guard prompts (changelog).
Earlier episodes referenced: permissions and plan mode, custom slash commands, skills, subagents, MCP servers, and context windows.
This week in Claude Code
Two releases worth your attention, and the bigger one quietly reset your defaults.
Start with the headline, from last Thursday, May 28th. Anthropic shipped Claude Opus 4.8, and inside Claude Code it's now the default model, running at high effort straight out of the box. Effort is a dial you can turn: low, high, extra, and max. The new piece is that the API now exposes an even higher setting in Claude Code, an extra-high level. So on a genuinely hard problem, push the effort up to extra-high and let it think harder. Anthropic says the default high setting spends about the same tokens as the old Opus default but gets more out of them.
Pricing on the standard rate held steady, five dollars per million input tokens and twenty-five per million output, same as the last few Opus releases. The real change is fast mode, which is reportedly about three times cheaper than it was on the previous Opus. If you run latency-sensitive loops and used to think fast mode was too pricey, it's worth another look. I'd mark the exact fast-mode numbers as reportedly for now, since they come from secondary write-ups, not Anthropic's own price page. Simon Willison's writeup is a good primary read.
The same release brought a research preview called Dynamic Workflows. The idea: instead of you wiring up subagents by hand, a dynamic workflow is a script that orchestrates them at scale. You describe the task, Claude writes the script, and a runtime runs it in the background while your session stays responsive. It needs version 2.1.154 or newer, and it reportedly scales to as many as a thousand agents in a single run, sixteen running at once, with independent agents trying to poke holes in each other's work. You can see your runs with the workflows command. There's a companion setting that pairs the highest reasoning level with automatic orchestration, and as of this week it's been renamed to ultracode. The official dynamic workflows docs cover the setup.
Then there's today's release, version 2.1.160, and it's mostly a security tightening you'll feel if you run in accept-edits mode. Claude Code now stops and asks before writing to shell startup files, the ones that can silently run commands the next time you open a terminal. And in accept-edits mode it now also prompts before writing build-tool config files that can execute code, things like your npm and yarn config, devcontainer files, and pre-commit config. Those used to get written without a pause. There's also a small quality-of-life fix: a single grep on a file now satisfies the read-before-edit check, so the agent stops demanding a separate read after it already searched the file. Fewer redundant round-trips. The full list is in the Claude Code changelog.
Two smaller items if you author plugins or run on a cloud provider. Recent releases this week made local plugin work easier: plugins dropped into a skills directory now load without going through a marketplace at all, and there's a new scaffolding command that stands up a fresh plugin for you in one step. So if you've been meaning to package that workflow you keep repeating, the friction just dropped. Separately, auto mode, the hands-off run mode, is now available on the big cloud back ends, Bedrock, Vertex, and Foundry, behind an opt-in environment variable, which matters if your team runs Claude Code through one of those rather than first-party.
One quick callback to close. The rate-limit relief from earlier this spring is still in effect: Anthropic roughly doubled the five-hour limits for Pro and Max and dropped the old peak-hours throttle, tied to a new compute deal. Not fresh news, but if you throttled yourself out of habit back in the winter, you've got more headroom now than you think. Which is a good segue, because today's main topic is exactly that: spending, limits, and not wasting either.
Cost and rate-limit engineering, plus evals so your prompts don't rot
Being fast with Claude Code and being cheap with Claude Code are two different skills, and most of us only practice the first one. So this episode is the cost half of mastering the keyboard, and then a second half on a quieter problem: your prompts and skills getting worse without you noticing. Both are about the same thing, keeping the tool honest over time.
What the cost command actually measures
Start with the command everyone reaches for, the cost command. Type it mid-session and you get a running total of tokens and a dollar estimate for the current session. Here's the catch almost everyone misses. That number only means something if you're on a pay-as-you-go API key. Anthropic spells this out in the Claude Code usage and limits article.
If you're on a Pro or Max subscription, you pay a flat monthly fee, so the dollar figure the cost command shows you is not your bill. It's a hypothetical, what this session would have cost at API rates. Useful for comparing one workflow against another, useless as a statement of what you owe.
For subscribers, the commands you actually want are the status command and the newer usage command. Those show how much of your plan's allowance you've used and when it resets, which is the thing that really constrains your day. The cost command answers a question you don't have; the status command answers the one you do.
There's a tangent here that bites people, so let me plant it early. If you have an Anthropic API key set in your shell environment, Claude Code will quietly bill through that API even when you've got a Max subscription sitting there unused. So if you pay for Max and your spend looks strange, check whether that environment variable is set, and unset it if you meant to ride the subscription. The status command will also tell you which billing mode you're actually on, so when in doubt, check there first.
Plans, the five-hour window, and the weekly caps
Three subscription tiers right now. Pro is twenty dollars a month. Max comes in two sizes: roughly five times Pro's usage for a hundred a month, and twenty times Pro for two hundred a month. And here's the part people forget. Claude Code draws from the same budget as the Claude app in your browser or on your phone. Every message in both counts against one shared limit, as Anthropic spells out for Pro and Max users. So a heavy chat afternoon can quietly eat into your coding budget for the evening.
There are two limit windows stacked on top of each other, and you need to hold both in your head. The first is a five-hour rolling window. It's the original limit and it's still there. Rolling means your messages age out gradually rather than snapping back to zero at a fixed time. Community estimates, which vary by model, put Pro somewhere around forty-five Opus messages or a hundred Sonnet messages in a window, with the bigger Max tier running several times that. Treat those as rough, not gospel.
The second window is the weekly cap, which Anthropic added back in the summer of 2025 and which resets every seven days. They introduced two of them at once: an overall usage cap, and a separate, model-specific cap on the top model. It was aimed at the heaviest few percent of subscribers, the ones running Claude Code around the clock, and TechCrunch covered the rollout when it landed.
The exact numbers keep moving, so don't memorize them. A later update split Opus and Sonnet into separate weekly budgets, so hammering Opus no longer drains your Sonnet allowance. And this spring, the one we mentioned in the news, roughly doubled the five-hour limits and dropped the peak-hours throttle. The lesson isn't the figures. It's the shape: you have two clocks running, a five-hour one and a weekly one, and the weekly cap on the top model is the one that ambushes Max users a couple of days before reset.
What the tokens actually cost
If you're on the API, or you just want to reason about cost in real units, here's the current price sheet, and there's a clean pattern hiding in it. Opus is five dollars per million input tokens and twenty-five per million output. Sonnet is three and fifteen. Haiku is one and five. Anthropic's pricing page has the full grid.
Notice two things and you can do the rest in your head. Output is always five times the price of input, on every model, so long-winded answers cost you on the expensive side of the ledger. And each tier is about five times cheaper than the one above it, which means Haiku is roughly twenty-five times cheaper than Opus for the same number of tokens. Hold that ratio. It's the backbone of every cost decision in the rest of this episode.
One quiet gotcha to file away. The newest Opus uses a different tokenizer that can spend up to about thirty-five percent more tokens on the very same piece of text. So when you compare two models on one task, the price per token isn't the whole story, because the token count itself shifts underneath you.
Let me make that concrete, because the abstraction hides the stakes. Say a single coding turn pulls in about fifty thousand tokens of context, your CLAUDE.md, a few open files, the conversation so far, and produces two thousand tokens of answer. On Opus, that input runs about a quarter of a dollar and the output about a nickel, call it thirty cents a turn. Do that a couple hundred times in a heavy day and you're at sixty-odd dollars. Run the same turns on Sonnet and the input drops to around fifteen cents and the whole day lands near thirty-five. On Haiku, the input is a nickel a turn and the day is closer to twelve. Same work, and the bill swings by five times depending on nothing but the model picker. That is the entire argument for being deliberate about which model is loaded.
Caching is the cheapest trick you're underusing
Now the single cheapest lever most people ignore: prompt caching. When part of your input repeats, Claude can cache it and re-read it at a tenth of the normal input price. That's a ninety percent discount on the repeated part. For Opus, fifty cents per million instead of five dollars.
There's a small charge to write something into the cache the first time, about one and a quarter times normal input for the short-lived version and twice normal for the hour-long one. But the math works out fast. At the short duration, caching pays for itself after a single re-read. At the hour duration, after two. Almost any real session clears that bar without trying.
Claude Code sets this up for you automatically, but you can help it or hurt it. The cache keys off a stable prefix, and your CLAUDE.md sits right at the front of that prefix. So a CLAUDE.md you leave alone during a session keeps hitting the cache at the discounted rate. The moment you edit it mid-session, you invalidate that cached prefix and pay to write it all again. This is the cost side of something we hit in the context-window episode: a stable, lean CLAUDE.md isn't just clearer for the model, it's literally cheaper per turn.
Walk the numbers on that, because they're better than they sound. Say your CLAUDE.md and the always-loaded files come to twenty thousand tokens, and you send forty messages in a session. Without caching, you pay full input price on those twenty thousand tokens forty times over. With caching, you pay the small write premium once, then re-read the same block at a tenth of the price for the other thirty-nine turns. On Opus that's the difference between roughly four dollars and well under a dollar just for the fixed part of your context, before you've typed anything new. The practical takeaway is a rhythm: set your context up at the start of a session, then resist fiddling with the parts that sit at the front, because every edit there throws away a discount you'd already paid to earn.
There's also a batch path on the API: fifty percent off both input and output for work you can run asynchronously, without watching it. You won't use it in a live session, but keep it in your back pocket for the eval runs we'll get to in the second half, where you're firing a hundred prompts and don't care how long they take. Batch and caching stack, which can take a big offline job down by something like ninety-five percent.
Reading your real usage with ccusage
The cost command is per-session and rough. When you want the real picture across days and projects, there's a community tool called ccusage, written by a developer who goes by ryoppippi. It reads Claude Code's own session logs straight off your disk, the structured log files under your Claude projects folder. No login, nothing leaves your machine, and you don't even install it. You run it through npx and it just goes. The project lives on GitHub.
It breaks your usage down however you want to look at it: by day, by week, by month, by individual session. The view I reach for most is the blocks view, because it groups everything into those same five-hour billing windows we just talked about, so you can actually see how close you ran to the limit. Add the live flag and it turns into a real-time dashboard of your current window ticking up as you work.
It splits cache-creation tokens from cache-read tokens, so you can see your caching paying off, and it prints a dollar figure per day or per session. There's a status-line mode too, so you can pipe your current spend straight into the Claude Code status line and keep half an eye on it. And it ships a small MCP server of its own, which means you can wire it in as a tool, the way we connected servers a couple of episodes back, and literally ask Claude how much you've spent this week.
Telemetry when you want a real dashboard
If a command-line tool isn't enough and you want graphs over time, Claude Code can emit OpenTelemetry, the standard metrics-and-logs format that feeds dashboards like Grafana and Prometheus. The monitoring docs walk through all of it.
You turn it on with an environment variable, the enable-telemetry flag set to one, and then point it at a collector with the usual endpoint and protocol variables. You can set the whole thing centrally in your settings file, under an env block, so every session in that project exports the same way without you thinking about it.
The two metrics that matter for this conversation are a running dollar cost per session and a raw token count, alongside extras like lines of code added and removed, commits, and active time. By default it does not log your prompt text, which is the right default. There are separate, explicit flags if you ever want tool details or prompt content in your logs, and you should stop and think before flipping those on. For a solo dev this is overkill on a normal Tuesday, but if you're trying to understand your own habits over a whole month, a small dashboard beats squinting at a session counter. There's even a ready-made community stack, called claude-code-otel, that wires the whole pipeline into Grafana for you.
Model choice is your biggest dial
Everything so far is measurement. Here's the biggest lever you actually pull, and it's almost embarrassingly simple: which model you're on. The model command, run any time, shows what your account can reach and lets you switch on the spot.
Sonnet is the default, and Anthropic's own guidance is that it should stay the default for the large majority of coding work. It's fast, it's capable, and it's a fifth the price of Opus. The advice is to reach for Opus when you actually need it, a hard piece of reasoning, a debugging session that's gone sideways, an architecture call, rather than leaving it switched on all day. And Haiku, the fastest and cheapest, is perfect for the cheap, high-volume stuff: quick lookups, simple edits, and especially the kind of subagent we built in the subagents episode that just searches the codebase and hands back a summary. No reason to pay Opus rates for a grep.
Tie it back to that five-times ratio. Moving one routine task from Opus down to Sonnet is an eighty percent saving on that task. Sonnet down to Haiku, another eighty percent. The most expensive habit a Max user can fall into is leaving everything parked on Opus, because Opus carries that separate, smaller weekly cap, and you'll hit it days early doing work Sonnet would have done just as well.
A pattern worth building into your day: default the session to Sonnet, and only switch up to Opus for the specific stretch where you need it. You're stuck on a race condition, or you're designing the shape of a new module, fine, flip to Opus for that, then drop back down. The friction is one command, and the payoff is that your scarce Opus budget gets spent on the handful of problems that genuinely need the bigger model, not on reformatting a file or writing a migration that Sonnet would nail on the first pass. Think of Opus the way you'd think of a senior colleague's time: you pull them in for the hard call, you don't have them sit with you while you rename variables.
The context window is a cost lever too
A few more knobs, and they all loop back to the context window. Remember that every message re-sends the whole conversation so far, and you pay for all of it again, every single turn. So a bloated session isn't only slower and dumber, the way we covered last time. It's a recurring charge.
Two commands manage that, both from the context episode. The clear command wipes the conversation when you switch to unrelated work, so you stop dragging stale context along and re-paying for it. The compact command summarizes the conversation down to reclaim room, and it takes instructions, so you can tell it what to keep. Something like, focus on the code samples and the API calls, and it'll preserve those while shedding the rest. Claude Code will also auto-compact on its own as you approach the edge of the window.
Subagents help here in a way that's really a cost story underneath. A subagent burns its own context doing the noisy work and hands back a short summary, so the long, expensive transcript never lands in your main window, where you'd be re-sending and re-paying for it on every turn after. That's the cost reason delegation is worth it, sitting right alongside the focus reason we gave it the first time.
A handful of direct knobs
To round out the cost half, a few dials you set directly. There's a max-turns flag that caps how many times the agent loops before it stops, which puts a hard ceiling on the worst case when you kick off something and walk away. There's the effort, or thinking-budget, setting, running from low up through max, and here's the part to really internalize: thinking tokens bill as output tokens, the expensive five-times kind. At the top setting, a single prompt can spend ten times the tokens of the same prompt at the bottom. Max effort is the right tool for a genuinely hard problem and a genuine waste on a rename.
And plan mode, from the permissions episode, is a cost tool too, even though we first sold it as a safety tool. Having Claude lay out a plan before it touches anything lets you catch a wrong approach before it spends a few thousand tokens implementing the wrong thing. The cheapest tokens are the ones you never generate, because you killed the bad plan before it ran. The manage-costs docs collect most of these in one place.
Why your prompts rot without you noticing
That's spending under control. The second half of this episode is about quality slipping out from under you, which costs you in a different currency: trust.
Your setup drifts. The prompts, skills, slash commands, and CLAUDE.md rules you've built up over the last several episodes are not frozen in place. Three things move underneath them. The model updates, and the same prompt behaves a little differently on the new version. You edit CLAUDE.md, and that quietly changes behavior for every task, not just the one you had in mind. And instructions pile up, skills and hooks and MCP tools accumulating and interacting in ways you never tested as a set.
The dangerous part is that none of this throws an error. Nothing turns red. The output just gets a little worse: a forgotten constraint here, some style creep there, a convention it used to follow and now doesn't. You notice three weeks later, if you notice at all. There's even research, a paper on when better prompts hurt, showing that a prompt that looks like a clear improvement can quietly regress on real tasks. The fix is the same one you already trust for code: write tests, and run them continuously. An eval is just a test for your prompt.
Building a tiny eval suite with print mode
Here's the smallest version that's still worth doing. Pick the things you actually depend on, the custom slash command we built a few episodes back, a skill, a subagent, and freeze maybe five representative inputs for each, along with a note on what a good answer looks like.
The mechanism is Claude Code's print mode, the dash-p flag. It takes a single prompt, skips the interactive interface entirely, runs the task, and prints the result to standard output, which means you can call it from a shell script or a scheduled job. We're using it here purely for local testing. It's also the front door into the automation we get to in Act two, but today it's just a test runner, nothing more. The headless docs cover the flags.
Ask print mode for JSON output and you get the answer plus the token counts and the cost for that run, so your test can assert on price and speed, not only on correctness. Pin the model explicitly with the model flag, so a version bump doesn't silently shift your baseline. And bound each run with that max-turns flag, so a misbehaving task can't run away with your tokens while you're not watching.
So the whole loop, concretely. You keep a fixtures folder with your input prompts and expected results. You write a short shell script that runs each one through print mode at a pinned model. Then you check the output. The check can be as dumb as searching for a string that has to be there, or as real as running the type-checker and the test suite against whatever code it just generated. Then you diff against the saved good answer. That diff is your regression signal, and it's the whole point.
Picture one real case to make it stick. Suppose you've got a skill that writes Postgres migrations, and the behavior you care about is that every migration it produces is reversible and wrapped in a transaction. Your fixture is a handful of prompts, add a column to the users table, create an index, rename a field, and your expected result for each is a couple of cheap assertions: the output contains a create-table or alter-table statement, it contains a matching down-migration, and when you actually run the generated SQL against a throwaway database, it applies and rolls back clean. None of that needs a model to judge it. It's all string checks and exit codes, and it runs in seconds. The day you tweak that skill's instructions to handle a new edge case, you run the same fixtures first and confirm you didn't break the reversible-by-default behavior you already relied on.
Grade with code first, a judge second
There are two ways to grade, and the order of preference really matters. Whenever you possibly can, grade with code. Does the output contain the string it has to contain. Does the generated TypeScript compile under the type-checker. Do the tests pass. These are exit-code and string checks, and Anthropic's own evals guidance is blunt about it: code-based grading with string matching and regular expressions is the best method when it's feasible, because it's fast, reliable, and has no opinion of its own.
When the thing you care about can't be checked that way, tone, completeness, whether it followed the house convention, then you bring in a model to grade, the pattern people call LLM-as-judge. You hand the output to a model along with a rubric, and it returns a pass or a score and a short reason. The discipline here is one rule: use a single strong, pinned judge model for all of your grading, kept separate from whatever model is under test, so the grader itself stays steady from run to run.
A practical middle ground is worth naming, because most real checks live there. You don't have to choose purely between a regex and a full essay-grading rubric. Often the smart move is to do the cheap deterministic checks first as a gate, did it compile, does it contain the required clause, and only spend a judge call on the cases that pass the gate, where the remaining question is genuinely about quality. That ordering keeps your eval fast and cheap on the obvious failures and reserves the expensive, fuzzy grading for the handful of outputs where a human would actually have to read it. It also makes your results easier to trust, because a failure tells you whether it tripped on a hard rule or on a judgment call.
The standard tool for stitching all of this together in 2026 is promptfoo. You write a small config file listing your prompts, the providers to run them against, and a set of checks per test: contains this string, matches this pattern, passes this snippet of JavaScript, stays under this cost, stays under this latency, or satisfies this rubric as judged by a model. You run it from the command line and get a web view of which cases passed and which slipped. It's open source, used by both OpenAI and Anthropic on their own work, and even though OpenAI acquired the project this spring, it stayed open. promptfoo is on GitHub.
Put it together as a habit and it's small. Pin the model. Keep the fixtures folder in the repo. Run the suite before and after you change a skill, a slash command, or CLAUDE.md, and run it again every time the model version bumps. Diff the results and look hard at whatever moved. Take that custom slash command, freeze five inputs and the behavior you expect, run them, edit the command's instructions, then run the exact same five again. If the fourth one used to honor your error-handling convention and now quietly doesn't, you just caught a regression that would otherwise have shipped without a sound.
Four ways evals lie to you
Knowing the failure modes is half the skill, so here are four. A tiny test set overfits: three cases will stay green while real behavior rots, and worse, you start tuning the prompt to pass those three instead of to do the actual job. Judging style instead of correctness: model graders drift toward rewarding longer, more confident answers, and there's documented bias in LLM judges, things like position bias, verbosity bias, and even self-preference, where a model favors its own output. The defense is to lean on code-based grading wherever you can, and to spot-check your judge against your own labels on a handful of cases before you trust it.
Then there's the cost of the evals themselves. Every run burns tokens and eats into the same rate-limit budget from the first half of this episode, so a big judge suite can end up costing more than the feature it's guarding. Use cost and latency checks to keep it honest, grade with cheap Haiku wherever Haiku is good enough, and push large offline runs through that fifty-percent batch path. And finally non-determinism: the same prompt gives different answers on different runs, so a single pass can pass or fail by luck. Run each case a few times and look at the pass rate, or assert on invariants that don't waver, does it compile, do the tests pass, instead of demanding the identical words every time.
That's the loop closed on both halves. Watch what you spend with the cost and usage commands and ccusage, pull the model, caching, and context levers to spend less of it, and wrap the prompts and skills you lean on in a small eval suite so they can't quietly get worse. It's the last rung of doing all of this by hand. From here, we start handing the keyboard over.