Most documentation has been treated as a writing problem. Generative models inherited that frame and spent a few years authoring prose about systems they could not run. A different architecture has been settling in: tools compile the doc, and the model orchestrates the compilation. Reference text comes from introspecting a schema. Examples come from running code and capturing what happened. The result reads like documentation because deterministic outputs were laid down in a deterministic order, not because a paragraph was generated about them.
- Documentation generation reframes the model as scheduler of tool runs rather than narrator of imagined system behaviour.
- Tool outputs land deterministically; LLM contributions sit at the orchestration seam, not inside the reference text itself.
- Three patterns dominate the field: schema introspection, example execution capture, and live-contract validation.
- Failure shifts from prose hallucination to version skew, stale schema caches, and unattested tool outputs.
- Review collapses into pipeline review — what gets reviewed is the orchestration graph and the validators, not paragraphs.
From narration to compilation
For most of the past decade, automated documentation was a generation task in the literary sense: given an API, produce paragraphs that describe what the API does. The model held the interface in its weights or in a context window and emitted prose. Accuracy was a function of prose discipline: how well the model resisted fabricating a method name, and how recently the training data had been refreshed against the live schema.
A different framing has been quietly winning in serious documentation pipelines. The model still appears in the loop, but it has stopped being the author of the reference text. Its role is to decide which tool to call, in what order, with what inputs, and what to do with the result. The reference text itself comes from the tool: a schema introspection call returns the parameter list, an example execution returns the actual response, a contract probe returns the live status of an endpoint. Prose, where it still exists, is a thin orchestration layer threaded between deterministic outputs.
The shift is closer in spirit to compilation than to writing. A compiler does not author the binary. It schedules deterministic transformations over an input, validates types along the way, and emits an artefact whose properties are traceable back to the source. Documentation compiled this way reads coherently because the orchestrator picked sensible inputs and assembled the outputs in a sensible order — not because a model invented a paragraph about a method whose signature it might or might not remember correctly.
Three load-bearing patterns
Three patterns repeatedly anchor tool-compiled documentation in production pipelines. Each replaces a class of prose with deterministic capture.
Schema introspection drives reference content. The pipeline points a tool at the canonical source — a schema specification, an interface descriptor, a database definition, or a typed module — and renders structured reference directly from that source. The model chooses which sections to surface and at what depth, but every parameter name, type, and constraint comes from introspection. The room for a hallucinated field name closes by construction.
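As a minimal sketch of the introspection pattern, the snippet below treats a typed Python function as the canonical source and renders a parameter table from its signature. The function `create_invoice` and the helper `render_reference` are hypothetical names invented for illustration; a real pipeline would more likely introspect a schema specification or interface descriptor.

```python
import inspect


def create_invoice(customer_id: str, amount_cents: int, currency: str = "USD") -> dict:
    """Hypothetical API function standing in for the canonical source."""
    return {"customer_id": customer_id, "amount_cents": amount_cents, "currency": currency}


def render_reference(func) -> str:
    """Render a parameter table directly from the function signature."""
    signature = inspect.signature(func)
    rows = ["| Parameter | Type | Default |", "|---|---|---|"]
    for name, param in signature.parameters.items():
        # Use the annotation's name when it is a plain type, else its string form.
        type_name = getattr(param.annotation, "__name__", str(param.annotation))
        if param.default is inspect.Parameter.empty:
            default = "required"
        else:
            default = repr(param.default)
        rows.append(f"| {name} | {type_name} | {default} |")
    return "\n".join(rows)


print(render_reference(create_invoice))
```

Every name, type, and default in the rendered table is read from the signature, so renaming a parameter in the source changes the table on the next build without anyone editing prose.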
Example execution capture replaces invented snippets with run-and-record. The orchestrator drafts an example, asks an isolated runner to execute it, and inlines the literal output — including errors when they surface. Readers see what the system actually returned for the inputs shown, not what a model would have predicted. The example becomes a small reproducible experiment rather than an illustration of one.
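Run-and-record can be sketched with the standard library alone. The helper `run_example` below is a hypothetical name, and a child interpreter process is only a stand-in for the isolated runner described above: it separates the example from the orchestrator's process but is not a security sandbox.

```python
import subprocess
import sys


def run_example(source: str, timeout: float = 10.0) -> dict:
    """Execute a snippet in a separate interpreter and record what happened."""
    proc = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,  # keep stdout and stderr instead of printing them
        text=True,
        timeout=timeout,
    )
    return {
        "source": source,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "exit_code": proc.returncode,
    }


ok = run_example("print(sum([1, 2, 3]))")
failed = run_example("import json; json.loads('{not json}')")

print(ok["stdout"], ok["exit_code"])
print(failed["exit_code"], failed["stderr"].splitlines()[-1])
```

The failing example is recorded rather than discarded: its traceback and non-zero exit code are part of the capture, which is what lets the published document show errors when they surface.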
Contract validation against the live system closes the loop. After draft assembly, a final tool probes endpoints with the documented inputs and compares responses against the documented schema. Drift between the documented and the live behaviour surfaces as a validation failure rather than a quiet inconsistency a reader has to discover. The model's role at this stage is small but consequential: triage the diffs and decide whether to regenerate, escalate, or annotate.
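The comparison step can be illustrated with a small diff function. This is a sketch under simplifying assumptions: the documented schema is reduced to a mapping of field names to Python types, the live response is a hard-coded dictionary rather than the result of a real probe, and `diff_against_schema` is a hypothetical helper.

```python
def diff_against_schema(documented: dict, response: dict) -> list:
    """Compare a response to the documented field types and list any drift."""
    findings = []
    for field, expected_type in documented.items():
        if field not in response:
            findings.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            actual = type(response[field]).__name__
            findings.append(
                f"type drift: {field} expected {expected_type.__name__}, got {actual}"
            )
    for field in response:
        if field not in documented:
            findings.append(f"undocumented field: {field}")
    return findings


documented = {"id": str, "amount_cents": int, "currency": str}
live_response = {"id": "inv_1", "amount_cents": "1200", "status": "open"}

for finding in diff_against_schema(documented, live_response):
    print(finding)
```

An empty list means the documented and live shapes agree. Anything else becomes input to the triage decision: regenerate, escalate, or annotate.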
The same three patterns expressed as a build cycle make the orchestration visible:
- Introspect the canonical source and emit structured reference fragments — parameter tables, type definitions, constraints — as machine-attested artefacts.
- Draft worked examples in an isolated runner, capture stdout, stderr, and response payloads verbatim, and bind each capture to a content-addressed identifier.
- Compose the prose orchestration layer — section ordering, connective tissue, conceptual framing — referencing the captured artefacts by identifier rather than restating them.
- Probe the live system with the documented inputs, diff responses against the documented schema, and route any drift into a triage queue before publication.
- Emit the compiled document with provenance for every reference fragment and every example so a later reader can trace any line back to the tool run that produced it.
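Two of the steps above, binding captures to content-addressed identifiers and emitting provenance, can be sketched together. The in-memory `store`, the `bind` and `compose` helpers, and the tool name `example-runner` are all assumptions made for illustration, not a description of any particular pipeline.

```python
import hashlib
import json

store = {}  # artefact id -> record; an in-memory stand-in for a real artefact store


def content_address(capture: dict) -> str:
    """Derive an identifier from the capture's content, so equal captures share an id."""
    canonical = json.dumps(capture, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def bind(capture: dict, tool: str) -> str:
    """Store a capture under its content address and record which tool produced it."""
    artifact_id = content_address(capture)
    store[artifact_id] = {"capture": capture, "tool": tool}
    return artifact_id


def compose(sections: list) -> tuple:
    """Assemble prose and artefact references into text plus per-line provenance."""
    lines, provenance = [], []
    for kind, value in sections:
        if kind == "prose":
            lines.append(value)
        else:  # kind == "artifact": value is a content address
            record = store[value]
            lines.append(record["capture"]["stdout"].rstrip("\n"))
            provenance.append(
                {"line": len(lines), "artifact": value, "tool": record["tool"]}
            )
    return "\n".join(lines), provenance


example_id = bind({"stdout": "6\n", "stderr": "", "exit_code": 0}, tool="example-runner")
document, provenance = compose([
    ("prose", "Summing the list returns:"),
    ("artifact", example_id),
])
print(document)
print(provenance)
```

The prose layer never restates the captured output. It only points at an identifier, so the published line and the tool run that produced it stay linked.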
Prose-authored vs tool-compiled docs
The shift can be seen most clearly as a contrast in disciplines rather than in tools. The same documentation goal — accurate reference, runnable examples, validated contracts — is approached by two different schools.
In a prose-authored discipline, documentation is fundamentally a writing artefact. Editorial passes are how truth is enforced. Reviewers chase factual errors paragraph by paragraph; freshness is a calendar problem; drift between the written reference and the live system tends to be discovered by readers in production. The model's job, when it appears, is to write better than the previous draft.
In a tool-compiled discipline, documentation is a build artefact. Reviewers attend to the orchestration graph and the validators; truth is enforced by binding sections to tool runs and rejecting outputs that fail the contract checks. Freshness is a build-trigger problem; drift surfaces as a failed validation step before the artefact ships. The model's job is to schedule the right tool calls and reason about the diffs.
| Axis | Prose-authored | Tool-compiled |
|---|---|---|
| Source of reference text | Model generation conditioned on training corpus or context | Tool introspection of the canonical schema |
| Example correctness | Plausibility argument | Captured runner output |
| Drift detection | Editorial review or reader complaint | Contract validation failure |
| Review surface | Paragraphs and sentences | Orchestration graph and validators |
| Freshness mechanism | Scheduled rewrite | Trigger on schema change |
| Failure mode | Hallucinated reference | Unattested tool output |
Vocabulary of the shift
Tool-compiled documentation comes with its own vocabulary, mostly inherited from compilation pipelines and generation infrastructure. A small set of terms recurs in design discussions and tends to determine how a team scopes its pipeline.
- Tool fidelity: the degree to which a tool's reported output faithfully represents what the underlying system actually did. Low fidelity (a runner that swallows errors, a schema introspector that omits constraints) undermines every downstream guarantee regardless of orchestration quality.
- Introspection contract: the stable interface a source system exposes for tools to read its shape. Tool-compiled documentation lives or dies on whether the contract is complete enough to render reference text without paraphrase.
- Output durability: a generated artefact's property of persisting unchanged across retries and reruns of the orchestration that produced it. In doc pipelines, durability is the difference between a captured example and a regenerated one that quietly shifts between visits.
- Validation seam: the point in the pipeline where the assembled draft is checked against the live system before publication. Its design predicts the long-run accuracy of the artefact more than any other decision.
- Provenance binding: an auditable link from a published line back to the tool run that produced it. Mature pipelines carry provenance to the publication artefact rather than discarding it at build time.
Where the stack fails
Failure in tool-compiled documentation does not look like prose hallucination. The model is not authoring the reference, so it has no opportunity to invent a parameter that does not exist. The interesting failure modes have migrated up the stack and become subtler — harder to spot in review, easier to ignore for longer.
Unattested tool output is the closest analogue to hallucination. A tool returns a value, but the value reflects a stale cache, a stub, or a partial-success state the wrapper did not surface. The pipeline trusts the tool, the model trusts the pipeline, and the documentation publishes a line that was never true at the depth a reader will assume. Mitigation lives mostly upstream: content-addressed runner outputs, attested tool execution, and capability scoping that refuses ambiguous successes.
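One way to picture refusing ambiguous successes is a gate that accepts a tool result only when nothing about it is in doubt. The `ToolResult` fields below (`from_cache`, `is_stub`, `partial`) are hypothetical: they assume the tool wrapper reports these states at all, which is itself a question of tool fidelity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolResult:
    """Hypothetical result envelope; the field names are illustrative assumptions."""
    value: str
    exit_code: int
    from_cache: bool = False
    is_stub: bool = False
    partial: bool = False


def reasons_to_reject(result: ToolResult) -> list:
    """Return every reason the result is ambiguous; an empty list means accepted."""
    reasons = []
    if result.exit_code != 0:
        reasons.append("non-zero exit code")
    if result.from_cache:
        reasons.append("served from cache")
    if result.is_stub:
        reasons.append("produced by a stub")
    if result.partial:
        reasons.append("partial success")
    return reasons


fresh = ToolResult(value="200 OK", exit_code=0)
stale = ToolResult(value="200 OK", exit_code=0, from_cache=True, partial=True)

print(reasons_to_reject(fresh))
print(reasons_to_reject(stale))
```

The two results carry the same value, which is the point: the value alone cannot tell the pipeline whether the line it is about to publish was ever true.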
Version skew is the cousin pathology. The introspection ran against schema version N, the example ran against N plus one, the contract probe ran against N minus one because that is what production happens to be on. Each tool was internally honest. The composed document is silently inconsistent. Strict version pinning at the orchestration layer — every tool call carries the same schema reference — closes most of this surface, and any tool that cannot honour the pin gets routed into the triage queue rather than into the artefact.
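Pin enforcement at the orchestration layer can be sketched in a few lines. The run records, tool names, and version labels below are invented for illustration, and the sketch assumes each tool reports the schema version it actually ran against.

```python
def enforce_pin(tool_runs: list, pinned_version: str) -> tuple:
    """Split tool runs into those that honoured the pin and those routed to triage."""
    accepted, triage_queue = [], []
    for run in tool_runs:
        # A run with no reported version cannot prove it honoured the pin.
        if run.get("schema_version") == pinned_version:
            accepted.append(run)
        else:
            triage_queue.append(run)
    return accepted, triage_queue


runs = [
    {"tool": "introspect", "schema_version": "v7"},
    {"tool": "example-runner", "schema_version": "v8"},
    {"tool": "contract-probe", "schema_version": "v6"},
]
accepted, triage_queue = enforce_pin(runs, pinned_version="v7")

print([run["tool"] for run in accepted])
print([run["tool"] for run in triage_queue])
```

Only the introspection run reaches the artefact here. The other two were internally honest but ran against different versions, so they go to triage instead of being composed into an inconsistent document.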
The third pathology is the absence of a review loop. Tool-compiled documentation is so plausible that teams stop reviewing it at all. The pipeline owns truth, the team owns delivery, and reviews dry up. When a tool quietly starts misreporting, no one catches it for a while. The review loop never disappears in a healthy pipeline; it relocates from prose review to validator review and tool-fidelity audits.
Closing the validation loop
The validation seam is the part of the pipeline most teams underestimate at design time and overinvest in once they have a near-miss in production. The seam is where the otherwise quiet promises of tool-compiled documentation get tested, and where the orchestrator's smallest decisions carry the largest downstream weight.
Validation falls into three layers, each catching a different class of error. The first is schema-shape validation: every captured artefact is asserted against the introspection schema before it is composed into the draft. The second is example replay: every example in the draft is re-executed against a frozen test environment immediately before publication, and the captured output is diffed against the original. The third is live-contract validation: the published draft's documented behaviour is probed against the production system on a fixed schedule after publication, and any drift opens a freshness incident.
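The second layer, example replay, reduces to a diff between the originally captured output and the output of the re-run. A minimal sketch using the standard library's `difflib` follows; both outputs are hard-coded here, where a real pipeline would take the replayed one from the frozen test environment.

```python
import difflib


def replay_diff(captured: str, replayed: str) -> list:
    """Return unified-diff lines between captured and replayed output; empty if equal."""
    return list(
        difflib.unified_diff(
            captured.splitlines(),
            replayed.splitlines(),
            fromfile="captured",
            tofile="replayed",
            lineterm="",
        )
    )


captured_output = "status: ok\ntotal: 6\n"
replayed_output = "status: ok\ntotal: 7\n"

drift = replay_diff(captured_output, replayed_output)
print("\n".join(drift) if drift else "no drift")
```

An empty diff lets the example ship unchanged. A non-empty one blocks publication until the difference has been triaged.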
Each layer trades latency for confidence. Cheap pipelines do only the first and treat the result as if it were the third. Mature pipelines run all three, accept the latency, and own the discipline of triaging the failures the seams surface. The orchestration model's contribution at this stage is small in volume and large in consequence: it interprets diffs, classifies them into regenerate, annotate, or escalate, and feeds the result back into the build graph.
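The three-way classification can be made concrete with a rule-based stand-in for the model's judgment. The finding fields (`kind`, `schema_changed`) and the policy itself are assumptions chosen for illustration. In practice the interpretation is the part the orchestration model contributes.

```python
from enum import Enum


class Action(Enum):
    REGENERATE = "regenerate"
    ANNOTATE = "annotate"
    ESCALATE = "escalate"


def triage(finding: dict) -> Action:
    """Map a validation finding to an action under a hypothetical policy."""
    if finding["kind"] == "output_changed" and finding.get("schema_changed", False):
        # The source moved on purpose: rebuild the affected section.
        return Action.REGENERATE
    if finding["kind"] == "undocumented_field":
        # Additive drift: note it in the artefact without blocking publication.
        return Action.ANNOTATE
    # Anything unexplained goes to a human.
    return Action.ESCALATE


findings = [
    {"kind": "output_changed", "schema_changed": True},
    {"kind": "undocumented_field"},
    {"kind": "output_changed", "schema_changed": False},
]
for finding in findings:
    print(finding["kind"], "->", triage(finding).value)
```

Defaulting to escalation keeps the policy conservative: a diff the rules cannot explain is never silently regenerated away.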
The most common mistake is treating validation as a publish-time formality. Documentation that compiles cleanly today does not stay compiled. Live systems drift, schemas evolve, runners change behaviour. The seam has to keep running long after the artefact has shipped — or the artefact reverts to being a calendar problem, which is precisely the regime tool-compilation was meant to escape.
What compiles next
Tool-compiled documentation is one instance of a broader pattern: any artefact whose accuracy can be expressed as a contract against a live system becomes a candidate for compilation rather than authorship. Documentation was the first to feel the pull because its accuracy contract was already legible (schemas, examples, responses) and because the cost of drift was visible to users. The same logic has started to appear in adjacent areas.
Test suites, especially the parts that document expected behaviour rather than enforce logic, fall into the same regime. Onboarding material that demonstrates real flows can be assembled from captured tool runs rather than narrated from memory. Reference SDKs shift from being a writing artefact toward being a generated artefact attached to the live schema. Release notes for systems whose changes can be observed via tool introspection can be drafted from diffs rather than authored from recollection.
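The release-notes case is the easiest of these to sketch: given two introspected snapshots of a schema, the draft is a rendering of their difference. The snapshots below are invented, and each schema is reduced to a mapping of field names to type names, which is a simplifying assumption.

```python
def draft_release_notes(old: dict, new: dict) -> list:
    """Draft release-note lines from the difference between two schema snapshots."""
    notes = []
    for name in sorted(new.keys() - old.keys()):
        notes.append(f"Added field `{name}` ({new[name]}).")
    for name in sorted(old.keys() - new.keys()):
        notes.append(f"Removed field `{name}`.")
    for name in sorted(old.keys() & new.keys()):
        if old[name] != new[name]:
            notes.append(f"Changed `{name}` from {old[name]} to {new[name]}.")
    return notes


old_schema = {"id": "string", "amount": "integer", "memo": "string"}
new_schema = {"id": "string", "amount": "number", "status": "string"}

for line in draft_release_notes(old_schema, new_schema):
    print(line)
```

Every line in the draft corresponds to an observed difference, so nothing depends on someone recalling what changed.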
The capability that makes all of this newly tractable is not a more eloquent model. It is the maturation of tool-calling interfaces — schema-stable, capability-scoped, audit-friendly — into something orchestrators can compose into pipelines without bespoke glue per tool. The model's most valuable contribution increasingly looks like scheduling and triage rather than prose.
The trajectory worth watching is whether validation infrastructure keeps pace. Pipelines that compile artefacts without commensurate validation tend to look impressive briefly and then degrade quietly. The teams that get tool-compiled documentation to work well over months, not weeks, share an unusual habit: they invest in the seam more than in the orchestrator. That ordering of priorities tends to become the norm as the field matures.
