Skills, Teach Them Well

Watching an agent write code is impressive. Watching one document or audit it is a coin flip. So I wrote a couple of skills to see what teaching them looks like in practice.

For the uninitiated: a skill is a markdown file with a description and a checklist. The agent loads it when the description matches what you're asking for. That's it. The content is up to you.
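To make that shape concrete, here is a minimal sketch of pulling the two parts out of a skill file. The layout is an assumption for illustration only: a `description:` line in a frontmatter block and `- [ ]` checkbox items for the checklist. `Skill`, `parse_skill`, and the example text are hypothetical names, not anything the agent actually runs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Skill:
    description: str
    checklist: List[str] = field(default_factory=list)


def parse_skill(text: str) -> Skill:
    """Split a skill file into its description and its checklist items.

    Assumes a 'description:' line and '- [ ]' checkbox items (hypothetical layout).
    """
    description = ""
    checklist: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("description:"):
            description = line[len("description:"):].strip()
        elif line.startswith("- [ ]"):
            checklist.append(line[len("- [ ]"):].strip())
    return Skill(description, checklist)


# Hypothetical skill file, made up for this example.
EXAMPLE = """\
---
name: example-skill
description: Audit a Go package for missing telemetry.
---
- [ ] Read the target files without modifying them.
- [ ] Report at most 20 findings.
"""

skill = parse_skill(EXAMPLE)
print(skill.description)
print(len(skill.checklist), "checklist items")
```

The description is the only part that decides whether the skill gets loaded, so it deserves as much care as the checklist.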

I've published two so far in xornivore/skills. Both are read-mostly. They give the agent rules, not code to write.

`doxcavate`: docs for humans and agents

doxcavate produces structured documentation in repos where written docs are sparse. It has two modes: survey proposes a doc plan, and draft writes one specific doc end to end, gated by a factcheck pass followed by a persona review.

Why bother? Most repos document themselves through team knowledge. One engineer remembers, the docs never get updated, and then the engineer leaves. Agents inherit that and confidently invent the rest.

doxcavate has one rule: when sources disagree, the code wins. Every produced doc carries a ## Verification block built from a factcheck pass. There's no flag to skip it.

The persona review is the part I'm happiest with. After the doc is drafted and factchecked, a persona matched to the kind of doc reviews it. The runbook persona is "an operator paged at an inconvenient hour". The how-it-works persona is "a skeptical, seasoned engineer" with a terminal open.

The reviewer emits at most 7 ranked changes. If it would have to drop substantive items to fit, the skill halts with E_PERSONA_OVERFLOW instead of silently truncating.
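The halt-instead-of-truncate rule is easy to state as logic. The skill itself is markdown, not code, so what follows is only a sketch of the rule under assumptions: `Change`, its `substantive` flag, `bounded_review`, and `PersonaOverflow` are hypothetical names. Only the limit of 7 and the `E_PERSONA_OVERFLOW` code come from the skill.

```python
from dataclasses import dataclass
from typing import List

MAX_CHANGES = 7  # the reviewer's cap


class PersonaOverflow(Exception):
    """Raised when fitting the cap would drop a substantive change."""
    code = "E_PERSONA_OVERFLOW"


@dataclass
class Change:
    rank: int          # 1 is the most important
    summary: str
    substantive: bool  # hypothetical flag: does dropping this lose real content?


def bounded_review(changes: List[Change], limit: int = MAX_CHANGES) -> List[Change]:
    """Return the top `limit` changes, or halt if anything substantive would be lost."""
    ranked = sorted(changes, key=lambda c: c.rank)
    kept, dropped = ranked[:limit], ranked[limit:]
    if any(c.substantive for c in dropped):
        raise PersonaOverflow(
            f"{PersonaOverflow.code}: {len(dropped)} change(s) over the limit of {limit}"
        )
    return kept
```

The design choice is that the caller finds out about the overflow. A silently shortened list looks identical to a complete one.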

npx skills add xornivore/skills@doxcavate --agent claude-code -y

`observablip`: experimental telemetry audits, read-only

observablip audits source code for observability gaps: missing telemetry, poor practices, and code that's hard to instrument. It returns a ranked, bounded list of findings.

It's read-only. Nothing executes or changes inside the target. If you ask it to fix things, it won't. The point is to surface the gaps and let you decide.

Some rules baked in:

No author attribution. Findings name behaviors, not authors. No git blame, no shame culture.
No verbatim secrets or PII. A finding about a log call passing a raw email field describes the shape, never the line.
No library prescription. If the codebase imports OTel, suggestions use OTel. If it doesn't, suggestions stay neutral. No "you should use X" pushed onto someone else's stack.
Bounded output. Default cap is 20 findings. Overflow surfaces with a count and a "narrow the target" prompt. Never silent truncation.
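The bounded-output rule can be sketched in a few lines. This is an illustration under assumptions, not the skill's implementation: `bound_findings` and the wording of the notice are hypothetical. The default cap of 20 and the "narrow the target" prompt are the parts taken from the rule.

```python
from typing import List, Optional, Tuple

DEFAULT_CAP = 20  # default cap on reported findings


def bound_findings(
    findings: List[str], cap: int = DEFAULT_CAP
) -> Tuple[List[str], Optional[str]]:
    """Return at most `cap` findings plus an overflow notice when some are hidden."""
    if len(findings) <= cap:
        return findings, None
    hidden = len(findings) - cap
    # Surface the overflow with a count so the truncation is never silent.
    notice = f"... and {hidden} more; narrow the target to see them"
    return findings[:cap], notice


shown, notice = bound_findings([f"finding {i}" for i in range(1, 26)])
print(len(shown), notice)
```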

A fictional example

Run against an imaginary intake service, the report looks like this (truncated):

Surviving candidates after FP review: 11 (none dropped:
none of the four FP checks fire; the package has no
covering middleware/decorator and no parent-frame span
in the same files). Final ranking applies dimension
order, then severity, then file:line.

[1] missing-telemetry · high · internal/intake/batchingester.go:185
    BatchIngester.Process is the workqueue consumer
    boundary for the batch-ingest path. It performs seven
    distinct external operations with no enclosing span
    and no boundary start/done log. Failures in any
    sub-step are indistinguishable in production traces.
    Suggest: open an OTel span intake.batch.process at
    function entry; record errors via span.RecordError;
    emit one structured clog event intake.batch.process.start
    and one …done with outcome and duration_seconds.

[2] missing-telemetry · med · internal/intake/batchingester.go:335
    The three pipeline commands (Parse, Validate, Index)
    and the infrastructure-vs-business error split are
    entirely unmetered. A regression in Index latency or
    a schema-registry outage cannot be detected without a
    per-phase counter, and the existing isInfrastructureError
    classification, which decides retry vs. deadletter,
    never surfaces as a metric label.
    Suggest: emit intake.pipeline_phase_total{phase, outcome}
    and intake.pipeline_phase_duration_seconds{phase} around
    each of the three calls.

... and 9 more
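The ranking rule quoted in the report (dimension order, then severity, then file:line) amounts to a compound sort key. A minimal sketch, with assumptions labeled: only `missing-telemetry`, `high`, and `med` appear in the report, so the other dimension names, the `low` severity, and the `Finding` type are hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Assumed orderings; only "missing-telemetry", "high" and "med" appear in the report.
DIMENSION_ORDER = ["missing-telemetry", "poor-practice", "hard-to-instrument"]
SEVERITY_ORDER = ["high", "med", "low"]


@dataclass
class Finding:
    dimension: str
    severity: str
    path: str
    line: int


def rank(findings: List[Finding]) -> List[Finding]:
    """Sort by dimension order, then severity, then file path and numeric line."""
    return sorted(
        findings,
        key=lambda f: (
            DIMENSION_ORDER.index(f.dimension),
            SEVERITY_ORDER.index(f.severity),
            f.path,
            f.line,  # compared as an int, so line 185 sorts before line 1000
        ),
    )
```

Keeping the line number as an integer matters: sorted as text, `1000` would land ahead of `185`.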

npx skills add xornivore/skills@observablip --agent claude-code -y

A few notes from writing them

Things I picked up along the way (and you can too, from reading agentskills.io and experimenting on your own):

Rules with an audit attached (a grep pattern, a count check) survived edits better than soft suggestions. If I couldn't write the audit, the rule was usually a wish.
Splitting SKILL.md into a short entry plus references with their own action checklists kept the loaded context small. One big file would have wasted tokens.
When input exceeded a bound, halting with a message ("narrow the target") was more useful than truncating silently.
Read-only is the cheapest reliability mechanism in the box. You enforce it by not having a tool.
Overfitting to existing patterns is real. An agent reading inconsistent docs will keep producing them, unless the skill hands it a clearer standard.
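To show what "a rule with an audit attached" can look like, here is a small sketch of a count check for the rule that every produced doc carries a `## Verification` block. The function name and message wording are hypothetical, and requiring exactly one heading is an assumption. The point is only that the rule is checkable by a pattern and a count.

```python
import re
from typing import List

# Matches a line that is exactly the level-2 heading "## Verification".
VERIFICATION_HEADING = re.compile(r"^## Verification[ \t]*$", flags=re.MULTILINE)


def audit_verification_block(doc: str) -> List[str]:
    """Return a list of problems; an empty list means the rule holds."""
    problems: List[str] = []
    count = len(VERIFICATION_HEADING.findall(doc))
    if count != 1:
        problems.append(
            f"expected exactly one '## Verification' heading, found {count}"
        )
    return problems


good = "# Intake runbook\n\nSteps...\n\n## Verification\n\n- checked against code\n"
bad = "# Intake runbook\n\nSteps...\n"
print(audit_verification_block(good))
print(audit_verification_block(bad))
```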

A few markdown files. That's the share for today. If you found it useful, fork and twist it some more.
