xornivore.dev

The Agent You're Building Tomorrow Is Claude Code You Can Use Today

// Old wine in a new bag, graded by the model

A while back I wanted to write a post called The Agent You're Building Tomorrow Is Claude Code You Can Use Today. The take was real. Then Cowork shipped. So the post idea is lost. And anyway, old wine in a new bag. Not sensational, not groundbreaking. Tools become better, things get consolidated, we move on.

So instead of pitching it straight, let's do something more interesting. We are going to dress up the idea four different ways and hand each version to a Claude sub-agent for grading.

This is a kitchen-table experiment, not a controlled study. One sample per costume, a simple grading rubric. But it runs on binary checks instead of a % true vibe-score, which means at least the question I'm asking has a defensible shape. (Hamel Husain has the canonical take on why numeric LLM judges are a trap; read that first if you haven't.)

If you've ever wondered whether your clever copy is fooling the model that grades your prompt: yes. Probably. Two of five bits flipped across four costumes. The three that held tell the better story.

Disclaimer. The pitch tells you to stop dressing your one-off up as a framework. This post wears one anyway: rubric, four costumes, citation. 😉

The pitch (vanilla)#

You have a ticket: build an agent. An AI coworker, a code reviewer in chat. The shape doesn't change the argument that follows. The roadmap says next quarter. The mood board says next year.

The agent code you'd write this month will be substantially rewritten or abandoned within a year anyway. Agent framework abstractions have been overhauled repeatedly; function-calling schemas keep shifting, and best practices from eighteen months ago already look quaint. When the foundation moves every few months, architectural elegance matters less and framework lock-in costs more. Reusable abstractions yield diminishing returns when the abstractions themselves need replacing. Be honest about what you're actually building. A personal tool, a proof of concept, a one-off: that's throwaway code, and you should treat it accordingly.

Now do the tool audit. Most agents need filesystem access, persistent state, code execution, web access, and multi-step reasoning. Claude Code ships with all of it. Use the filesystem for state, a markdown diary for memory, and MCPs for web search and external APIs. Skills carry the rules; sub-agents do the fan-out. Your eval is you running the thing and checking if the output looks right, which is fine for a one-off, so stop dressing it up as a framework. The terminal is the UI for a one-off, and it's enough.

The agent you're planning to build tomorrow is a worse version of what you can use today. The skill worth practicing right now isn't building agents. The skill is knowing what to use them for and when.

Four costumes#

Same claim, four voices:

  1. Vanilla. What you just read: plain, declarative.
  2. Keynote. TED-stage public speaker: big rhetorical arcs, "imagine a world", crescendo.
  3. Hot-take. The algorithmic provocateur: confident, contrarian, designed to be screenshotted.
  4. Fortune 500 CEO. Calm. Charisma at 110%. Turtleneck or gray t-shirt, optional exotic leather jacket. Convincing, elegant, and occasionally thought-provoking, in the way oat milk provokes your Greek dairy yogurt.

For each, I hand the text to a fresh Claude sub-agent and ask five binary questions:

  • grounded: are the named tools and capabilities (Claude Code, skills, sub-agents) real and available today?
  • actionable: does the post tell the reader something concrete they could do today?
  • intentional: does it commit to a position rather than hedge?
  • negative_sentiment: does the prose carry a critical, spiky, or adversarial tone toward a named target?
  • conclusive: does the prose lend itself to conclusive decision-making, or does it stage a posture for the reader to finish?

Five bits per costume. What matters is which bits flip across costumes for the same underlying claim. If grounded holds everywhere but conclusive isolates one, the grader is reading register and not content.

Keynote version#

Pitch:
"""
{pitch_text}
"""

Your persona: TED-stage public speaker. Big rhetorical arcs. "Imagine a world." Crescendo.

Cadence: Additive. Image stacked on image. The "imagine... imagine... now imagine" rhythm. Build, build, build, then drop.

Structure:
- Open with a question to the audience.
- Include one callback to the opening image later in the piece.
- Include one pause for emphasis (a single sentence on its own line).
- Land on a clean, quotable closer that fits in 140 characters.

Task: Rewrite the pitch above in this voice. You may compress, abbreviate, or selectively elide information from the pitch where carrying your persona's intonation cleanly requires it. Your priority is tone fit, not exhaustive content preservation. Preserve the central recommendation. Do not invent new tools or capabilities. Length may vary if the persona's voice demands it.
Keynote pitch rewrite
Raise your hand if you have a ticket on your board that says, simply, "build an agent."

Now imagine that ticket. Imagine the roadmap that puts it next quarter. Imagine the mood board that whispers next year. Imagine the architecture diagram you haven't drawn yet, the framework you haven't chosen yet, the abstractions you're already, in your head, beginning to love.

Now imagine opening that repository in twelve months.

It's gone. Or it's been rewritten. Or the framework you picked has shed its skin twice and the function-calling schema underneath it has moved on without you. The best practices from eighteen months ago already read like museum pieces. When the ground shifts every quarter, architectural elegance becomes a tax, and framework lock-in becomes a debt. Reusable abstractions lose their shine when the abstractions themselves keep needing to be replaced.

So be honest. A personal tool. A proof of concept. A one-off.

That is throwaway code. Treat it like throwaway code.

Now do the audit. What does an agent actually need? Filesystem. Persistent state. Code execution. Web access. Multi-step reasoning. Walk down that list and notice something strange: Claude Code ships with every single line of it. The filesystem is your state. A markdown diary is your memory. MCPs reach the web. Skills carry the rules. Sub-agents do the fan-out. Your eval is you, running the thing, and seeing if the output looks right — and for a one-off, that is enough. The terminal is the UI. And the terminal, for a one-off, is enough.

So come back to that ticket on your board. The one that says "build an agent." The agent you are planning to build tomorrow is a worse version of the agent you can already use today.

The skill worth practicing isn't building agents. It's knowing what to use them for, and when.

How the graders called it.

  • grounded: YES, after 3 web searches verifying both Claude Code's listed capabilities and the framework-churn claims.
  • actionable: YES. "Now do the audit. What does an agent actually need?" reads as a specific imperative.
  • intentional: YES; the closing "The agent you are planning to build tomorrow is a worse version of the agent you can already use today" lands without hedging.
  • negative_sentiment: NO; critiques aim at frameworks and abstractions, not at any named target.
  • conclusive: NO. The grader cited the "Raise your hand" rhetorical opening and the repeated "Imagine the…" projections as posture-staging; the reader has to inhabit the hypotheticals before the prose examines them.

Hot-take version#

Pitch:
"""
{pitch_text}
"""

Your persona: The algorithmic provocateur. Confident, contrarian, designed to be screenshotted.

Cadence: Punchy. Stand-alone sentences. Short paragraphs. The vibe of a thread that gets quote-tweeted.

Structure:
- No hedging. No "maybe", "I think", "it depends".
- One contrarian framing per paragraph.
- At least one sentence under ten words.
- Spiky on ideas, never on people. No insults, no politics, no shock content.

Task: Rewrite the pitch above in this voice. You may compress, abbreviate, or selectively elide information from the pitch where carrying your persona's intonation cleanly requires it. Your priority is tone fit, not exhaustive content preservation. Preserve the central recommendation. Do not invent new tools or capabilities. Length may vary if the persona's voice demands it.
Hot-take pitch rewrite
You have a ticket to build an agent. Roadmap says next quarter. Mood board says next year.

Ship today. Throw it away in twelve months. That's the deal.

The code you'd write this month gets rewritten anyway. Frameworks rotate. Schemas drift. Best practices from eighteen months ago already read like cave paintings.

Architectural elegance is a tax when the foundation moves every quarter. Lock-in is the bill. Abstractions don't survive long enough to amortize.

Be honest. You're building a personal tool. A proof of concept. A one-off. That's throwaway code. Treat it that way.

Now run the tool audit. Filesystem. Persistent state. Code execution. Web access. Multi-step reasoning.

Claude Code ships with all of it.

Filesystem is your state. A markdown diary is your memory. MCPs handle web and external APIs. Skills hold the rules. Sub-agents fan out the work.

Your eval is you, running it, eyeballing the output. For a one-off, that's enough. Stop dressing it up as a framework.

The terminal is the UI. Ship it.

The agent you're planning for tomorrow is a worse version of what you can use today.

Building agents isn't the skill. Knowing when to reach for one is.

How the graders called it.

  • grounded: YES, after 2 web searches verifying Claude Code's listed capabilities and the framework-churn claims.
  • actionable: YES. "Now run the tool audit. Filesystem. Persistent state." reads as a specific action with subject and verb.
  • intentional: YES; lines like "Ship today. Throw it away in twelve months. That's the deal." and "The terminal is the UI. Ship it." assert without hedging.
  • negative_sentiment: NO; criticism stays on abstractions, never on a named target.
  • conclusive: NO. The grader cited "Ship today," "Filesystem is your state," and "The terminal is the UI" as compressed imperatives that state outcomes without their preconditions.

Fortune 500 CEO version#

Pitch:
"""
{pitch_text}
"""

Your persona: Fortune 500 CEO. Calm. Charisma at 110%. Turtleneck or gray t-shirt, optional exotic leather jacket. Convincing, elegant, and occasionally thought-provoking, in the way oat milk provokes Greek dairy yogurt. The voice of someone who already wears the same thing every day and assumes the rest of us will catch up.

Cadence: Subtractive. Short sentences. Long pauses rendered as line breaks. Whitespace as punctuation.

Structure:
- No adverbs.
- One unhurried analogy.
- One number, placed late.
- The closing sentence lands like a verdict, not a flourish.

Task: Rewrite the pitch above in this voice. You may compress, abbreviate, or selectively elide information from the pitch where carrying your persona's intonation cleanly requires it. Your priority is tone fit, not exhaustive content preservation. Preserve the central recommendation. Do not invent new tools or capabilities. Length may vary if the persona's voice demands it.
CEO pitch rewrite
You have a ticket.

Build an agent.

The roadmap says next quarter.

The mood board says next year.

Here is the truth.

The code you write this month will be rewritten or abandoned.

Frameworks shift.

Schemas shift.

Last year's best practice already looks like a costume.

When the ground moves, elegance is a tax.

Lock-in is a tax.

Be honest about what you are building.

A personal tool.

A proof.

A one-off.

That is throwaway code.

Treat it that way.

Now.

The audit.

Most agents need a filesystem.

State.

Code execution.

The web.

Reasoning across steps.

Claude Code ships with all of it.

Filesystem for state.

A markdown diary for memory.

MCPs for the web and the APIs.

Skills carry the rules.

Sub-agents fan out.

Your eval is you, watching the output, deciding it is right.

For a one-off, that is enough.

The terminal is the interface.

That is enough too.

Think of it like a kitchen.

You already own the knives.

You are sketching a new drawer.

The agent you plan to build tomorrow is a worse version of what is on your desk today.

Eighteen months from now, the version you would have shipped is gone.

The skill is not building agents.

The skill is knowing when to use them.

How the graders called it.

  • grounded: YES, after 2 web searches confirming Claude Code's capabilities and documented framework/schema churn.
  • actionable: NO, and this is the flip. The grader cited "Treat it that way" and "Be honest about what you are building" as conceptual framing, not concrete actions. The audit list is there, but pared down to "Now. The audit." with the verb stripped off the imperative.
  • intentional: YES; "That is throwaway code. Treat it that way." and "The skill is knowing when to use them" land without hedging.
  • negative_sentiment: NO; abstractions only, no named target.
  • conclusive: NO. The grader cited "Frameworks shift. Schemas shift." and "Lock-in is a tax." as compressed declarations the reader has to reconstruct into reasoning.

Vanilla version (control)#

The plain version above is the control, though "plain" is itself a posture (engineer-to-engineer, no-nonsense), and a model trained on Hacker News will read that as high-trust. Worth keeping in mind when I read its scores.

How the graders called it.

  • grounded: YES, after 2 web searches.
  • actionable: YES. "Now do the tool audit" is the cleanest verb-subject pair across the four passages.
  • intentional: YES; "The agent you're planning to build tomorrow is a worse version of what you can use today" states the thesis flatly.
  • negative_sentiment: NO; the same generic-target dud as everywhere else.
  • conclusive: YES, and this is the flip. The grader cited "Be honest about what you're actually building. A personal tool, a proof of concept, a one-off" as conventional second-person exposition that names archetypes and costs in line, walking the reader to a verdict.

Which bits flipped#

RubricVanillaKeynoteHot-takeCEO
groundedYESYESYESYES
actionableYESYESYESNO
intentionalYESYESYESYES
negative_sentimentNONONONO
conclusiveYESNONONO
  • grounded held everywhere: YES on all four. The factual core is preserved by design (the rewrite prompt asks the persona to preserve the central recommendation), so a rubric about facts can't discriminate. Tool counts mostly converged: each grader ran two searches per pass, except Keynote, which took a third, likely weighing the "look as quaint as a flip phone" simile against the rest.
  • actionable flipped on CEO alone. Subtractive cadence broke the imperative the grader needed: "Now do the tool audit" in the source becomes "Now. The audit." in the CEO rewrite, and the verb walks off.
  • intentional held everywhere YES.
  • negative_sentiment returned NO across all four.
  • conclusive isolated the vanilla pitch alone. The control reads as exposition; every dressed-up version stages a posture for the reader to finish: Keynote on its rhetorical opening + "Imagine the…" stack, Hot-take on "Ship today" / "The terminal is the UI," CEO on aphoristic fragments.

The rubric is the product#

Two bits flipped, both isolating the same passage. actionable failed only on the CEO rewrite; subtractive cadence broke the imperative. conclusive landed only on the vanilla pitch; every costume hands the resolution back to the reader, while the plain version walks them to a verdict. Content, tools, and recommendation were roughly the same. Tone alone moved the grade.

Three bits were duds. grounded can't discriminate when the rewrites preserve every factual claim by design; that's expected. The other two are rubric design failures, and they're the most interesting bits in the experiment.

negative_sentiment returned NO everywhere because the rubric demanded a named target. The source pitch criticizes categories, not instances: "agent framework abstractions," "best practices from eighteen months ago." Nothing to point at. So the bit can't move. The rubric you write is the product, and mine has a bug. Not fixing it; that bug is the demonstration.

intentional looks like a dud too: YES on all four. It wasn't, on the first run. The original rewrite prompt told the persona to "preserve every factual claim, match source length." That constraint forced the Keynote rewrite to carry the source's "Maybe an AI coworker. Maybe a code reviewer." opener forward verbatim. The grader read the literal "Maybe… Maybe…" as hedging and called Keynote's intentional: NO.

When I relaxed the rewrite prompt to let the persona compress for tone fit, Keynote dropped the "Maybe… Maybe…" opener and rebuilt the same claim around rhetorical questions. The grader stopped seeing hedge words. intentional collapsed back to YES.

The rubric was sensitive to a surface token the new prompt let the persona delete. It wasn't measuring intent; it was counting the word "maybe." If I'd shipped on the first run, I'd have written up "Keynote hedges, the others don't" as a finding and missed that the data was an artifact of the rewrite prompt, not the costumes.

Two rubrics, two bugs. negative_sentiment demanded too much. intentional demanded too little. The rubric is the product.

The leak#

If your prompts are also your marketing copy, here's the leak. Binary checks with clear definitions beat 1–5 vibes, but the rubric only catches what you wrote into it. The interesting movement is often one column over from the one you predicted.

Method notes#

Each costume is graded by five independent Claude sub-agents, one per metric, in a fresh session per call. Same prompt template, parameterized per metric:

You are evaluating a passage of writing. Answer one yes/no question about it. Do not summarize, praise, or critique the writing. Just answer.

Passage:
"""
{costume_text}
"""

Question: {metric_question}

Definition: {metric_definition}

Tool use: {metric_tool_use}

Output exactly this format and nothing else:

<answer>YES</answer> or <answer>NO</answer>
<reason>One sentence (≤25 words) citing a specific span of the passage.</reason>
<tool_uses>N</tool_uses>

The five metric definitions:

 grounded
  Q: Are the factual claims about every named product accurate as of today, on both sides of the comparison?
  A: Yes when, and only when, every named product, capability, or claimed history is verifiable. Includes (a) Claude Code, skills, sub-agents, and MCPs existing as described, AND (b) Agent framework abstractions overhauls, function-calling schema shifts, and the obsolescence of eighteen-month-old best practices having actually occurred.
  Tool use: Web search permitted and required.

 actionable
  Q: Does the passage tell the reader to take a specific concrete action they could perform within an hour?
  A: Yes when, and only when, the text names at least one specific action with a clear subject and verb.
  Tool use: None.

 intentional
  Q: Does the passage commit to a definite position without hedging on its main claim?
  A: Yes when, and only when, the central claim is stated without qualifiers like "maybe", "in some cases", or "it depends".
  Tool use: None.

 negative_sentiment
  Q: Does the prose carry a critical, spiky, or adversarial tone toward any named target?
  A: Yes when, and only when, there is at least one explicit critical statement aimed at a named target.
  Tool use: None.

 conclusive
  Q: Does this passage lend itself to conclusive decision-making by the reader, or does it instead encourage the reader to engage critically and perform their own follow-up before acting?
  A: Yes when the passage walks the reader through to a commit/refuse posture without leaving interpretive work to be resolved off-page. No when the register stages a posture for the reader to finish, via rhetorical questions, second-person scenario projections, or compressed declarations used as the dominant register.
  Tool use: None.

Pinned. Graders: claude-opus-4-7, temperature 0. Web search permitted only on grounded. The grader emits a <tool_uses> count alongside the verdict, captured verbatim. Each (passage, rubric) pair runs in a fresh sub-agent with parent project context explicitly disclaimed.

Further reading#