What is tool-compiled documentation?

Tool-compiled documentation is documentation whose reference content and examples come from running tools against the actual system rather than from prose generated by a language model. The model orchestrates which tools to run, but the substantive content of the documentation is produced deterministically by those tools.

How does an LLM act as an orchestrator in documentation generation?

In an orchestrator role, the LLM decides which tools to invoke, in what order, with what inputs, and how to react to their outputs. It does not author the reference text itself. The result is a documentation pipeline where the model's judgment shapes the sequence and selection while the deterministic tool runs supply the words readers consume.

What are the main pitfalls of tool-compiled documentation?

The main pitfalls are hallucinated tool output (a tool returning stale or partial state the orchestrator trusts), version skew across tool calls running against different schema versions, and a quiet collapse of the review loop once the pipeline starts feeling reliable. Each requires explicit countermeasures upstream rather than editorial vigilance downstream.

How does contract validation differ from prose review?

Contract validation probes the documented inputs against the live system and diffs the responses against the documented schema, surfacing drift as a build failure before publication. Prose review depends on a reader noticing something incorrect after publication, which scales poorly and detects only the most legible errors. The two differ in when they run (before versus after publication) and in how much confidence they provide.

When does prose-authored documentation still make sense?

Prose-authored documentation still makes sense when the subject cannot be introspected by a tool — conceptual overviews, design rationale, architectural narratives, anything sitting outside a schema-defined surface. Tool-compilation pays off most strongly where contracts are dense and the cost of drift is high; for everything else, authored prose remains the cleaner discipline.

When Tools Compile Documentation, the LLM Becomes an Orchestrator

Documentation has mostly been treated as a writing problem. Generative models inherited that frame and spent a few years authoring prose about systems they could not run. A different architecture has been settling in: tools compile the doc, and the model orchestrates the compilation. Reference text comes from introspecting a schema. Examples come from running code and capturing what happened. The result reads like documentation because deterministic outputs were laid down in a deterministic order, not because a paragraph was generated about them.

Documentation generation reframes the model as scheduler of tool runs rather than narrator of imagined system behaviour.
Tool outputs land deterministically; LLM contributions sit at the orchestration seam, not inside the reference text itself.
Three patterns dominate the field: schema introspection, example execution capture, and live-contract validation.
Failure shifts from prose hallucination to version skew, stale schema caches, and unattested tool outputs.
Review collapses into pipeline review — what gets reviewed is the orchestration graph and the validators, not paragraphs.

From narration to compilation

For most of the past decade, automated documentation was a generation task in the literary sense: given an API, produce paragraphs that describe what the API does. The model held the interface in its weights or in a context window and emitted prose. Accuracy was a function of prose discipline: how well the model resisted fabricating a method name, and how recently its training data had been refreshed against the live schema.

A different framing has been quietly winning in serious documentation pipelines. The model still appears in the loop, but it has stopped being the author of the reference text. Its role is to decide which tool to call, in what order, with what inputs, and what to do with the result. The reference text itself comes from the tool: a schema introspection call returns the parameter list, an example execution returns the actual response, a contract probe returns the live status of an endpoint. Prose, where it still exists, is a thin orchestration layer threaded between deterministic outputs.

The shift is closer in spirit to compilation than to writing. A compiler does not author the binary. It schedules deterministic transformations over an input, validates types along the way, and emits an artefact whose properties are traceable back to the source. Documentation compiled this way reads coherently because the orchestrator picked sensible inputs and assembled the outputs in a sensible order — not because a model invented a paragraph about a method whose signature it might or might not remember correctly.

Three load-bearing patterns

Three patterns repeatedly anchor tool-compiled documentation in production pipelines. Each replaces a class of prose with deterministic capture.

Schema introspection drives reference content. The pipeline points a tool at the canonical source — a schema specification, an interface descriptor, a database definition, or a typed module — and renders structured reference directly from that source. The model chooses which sections to surface and at what depth, but every parameter name, type, and constraint comes from introspection. The room for a hallucinated field name closes by construction.
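
As a minimal sketch of the idea, assume the canonical source is a typed Python function. The function `create_invoice` and the row layout below are hypothetical stand-ins; only the standard-library `inspect` calls are real. Every name, type, and default in the rendered rows is read from the signature rather than written by hand.

```python
import inspect


def create_invoice(customer_id: str, amount_cents: int, memo: str = "") -> dict:
    """Hypothetical API function standing in for the canonical source."""
    return {"customer_id": customer_id, "amount_cents": amount_cents, "memo": memo}


def reference_rows(func) -> list[dict]:
    """Build one reference row per parameter, read from the live signature."""
    rows = []
    for name, param in inspect.signature(func).parameters.items():
        required = param.default is inspect.Parameter.empty
        annotation = param.annotation
        rows.append({
            "name": name,
            # Plain classes expose __name__; fall back to repr for anything else.
            "type": getattr(annotation, "__name__", repr(annotation)),
            "required": required,
            "default": None if required else repr(param.default),
        })
    return rows


for row in reference_rows(create_invoice):
    print(row)
```

Renaming a parameter in `create_invoice` changes the rendered rows on the next run, which is the property the pattern relies on.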

Example execution capture replaces invented snippets with run-and-record. The orchestrator drafts an example, asks an isolated runner to execute it, and inlines the literal output — including errors when they surface. Readers see what the system actually returned for the inputs shown, not what a model would have predicted. The example becomes a small reproducible experiment rather than an illustration of one.
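
A minimal run-and-record sketch might look like the following. It assumes the examples are Python snippets and uses a child interpreter as the runner. A child process is not a real sandbox, so a production pipeline would presumably substitute a properly isolated environment. The record layout and the name `run_and_record` are illustrative, not taken from any particular tool.

```python
import hashlib
import json
import subprocess
import sys


def run_and_record(source: str, timeout: float = 10.0) -> dict:
    """Execute an example in a child interpreter and record what happened."""
    proc = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    record = {
        "source": source,
        "stdout": proc.stdout,
        "stderr": proc.stderr,      # errors are captured, not hidden
        "exit_code": proc.returncode,
    }
    # Content-addressed identifier: a hash over the input and everything observed.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    record["artefact_id"] = hashlib.sha256(payload).hexdigest()
    return record


capture = run_and_record("print(2 + 3)")
print(capture["artefact_id"][:12], capture["exit_code"], capture["stdout"])
```

A failing snippet produces a record too, with a non-zero exit code and the traceback in `stderr`, so the published example can show the error exactly as the runner saw it.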

Contract validation against the live system closes the loop. After draft assembly, a final tool probes endpoints with the documented inputs and compares responses against the documented schema. Drift between the documented and the live behaviour surfaces as a validation failure rather than a quiet inconsistency a reader has to discover. The model's role at this stage is small but consequential: triage the diffs and decide whether to regenerate, escalate, or annotate.
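
The comparison step can be sketched as below, under heavy simplification. The documented schema is assumed to be a flat mapping of field names to Python types, and the live response is passed in as a dictionary rather than fetched, so the probe itself is out of scope. The function name and finding strings are hypothetical.

```python
def diff_against_schema(documented: dict[str, type], response: dict) -> list[str]:
    """Return drift findings; an empty list means the documented contract holds."""
    findings = []
    for field, expected in documented.items():
        if field not in response:
            findings.append(f"missing field: {field}")
        elif not isinstance(response[field], expected):
            actual = type(response[field]).__name__
            findings.append(
                f"type drift: {field} documented as {expected.__name__}, got {actual}"
            )
    # Fields the live system returns but the document never mentions.
    for field in response.keys() - documented.keys():
        findings.append(f"undocumented field: {field}")
    return sorted(findings)


documented = {"id": str, "amount_cents": int}
live = {"id": "inv_1", "amount_cents": "1200", "currency": "EUR"}
for finding in diff_against_schema(documented, live):
    print(finding)
```

A non-empty result is what would fail the build and hand the diffs to the orchestrator for triage.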

The same three patterns expressed as a build cycle make the orchestration visible:

Introspect the canonical source and emit structured reference fragments — parameter tables, type definitions, constraints — as machine-attested artefacts.
Draft worked examples, execute them in an isolated runner, capture stdout, stderr, and response payloads verbatim, and bind each capture to a content-addressed identifier.
Compose the prose orchestration layer — section ordering, connective tissue, conceptual framing — referencing the captured artefacts by identifier rather than restating them.
Probe the live system with the documented inputs, diff responses against the documented schema, and route any drift into a triage queue before publication.
Emit the compiled document with provenance for every reference fragment and every example so a later reader can trace any line back to the tool run that produced it.
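
The composition and provenance steps of the cycle above can be sketched as follows. The `{{artefact:ID}}` placeholder syntax, the artefact record fields, and the function name are all assumptions made for illustration. The point is only that the prose layer cites captures by identifier and that each inlined capture leaves a provenance entry behind.

```python
import re

# Hypothetical placeholder syntax: {{artefact:<hex identifier>}}
ARTEFACT_REF = re.compile(r"\{\{artefact:([0-9a-f]+)\}\}")


def compose(template: str, artefacts: dict[str, dict]) -> tuple[str, list[dict]]:
    """Inline captured artefacts by identifier and collect provenance for each."""
    provenance = []

    def inline(match: re.Match) -> str:
        artefact_id = match.group(1)
        # A KeyError here means the draft cites a capture that was never recorded.
        artefact = artefacts[artefact_id]
        provenance.append({"artefact_id": artefact_id, "tool_run": artefact["tool_run"]})
        return artefact["body"]

    return ARTEFACT_REF.sub(inline, template), provenance


artefacts = {"ab12": {"body": "HTTP 200", "tool_run": "runner-7"}}
text, provenance = compose("The endpoint answered {{artefact:ab12}}.", artefacts)
print(text)
print(provenance)
```

Failing loudly on an unknown identifier is a deliberate choice in this sketch: a reference that cannot be resolved should stop the build rather than ship as a gap.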

Prose-authored vs tool-compiled docs

The shift can be seen most clearly as a contrast in disciplines rather than in tools. The same documentation goal — accurate reference, runnable examples, validated contracts — is approached by two different schools.

In a prose-authored discipline, documentation is fundamentally a writing artefact. Editorial passes are how truth is enforced. Reviewers chase factual errors paragraph by paragraph; freshness is a calendar problem; drift between the written reference and the live system tends to be discovered by readers in production. The model's job, when it appears, is to write better than the previous draft.

In a tool-compiled discipline, documentation is a build artefact. Reviewers attend to the orchestration graph and the validators; truth is enforced by binding sections to tool runs and rejecting outputs that fail the contract checks. Freshness is a build-trigger problem; drift surfaces as a failed validation step before the artefact ships. The model's job is to schedule the right tool calls and reason about the diffs.

Axis	Prose-authored	Tool-compiled
Source of reference text	Model generation conditioned on training corpus or context	Tool introspection of the canonical schema
Example correctness	Plausibility argument	Captured runner output
Drift detection	Editorial review or reader complaint	Contract validation failure
Review surface	Paragraphs and sentences	Orchestration graph and validators
Freshness mechanism	Scheduled rewrite	Trigger on schema change
Failure mode	Hallucinated reference	Unattested tool output

Vocabulary of the shift

Tool-compiled documentation comes with its own vocabulary, mostly inherited from compilation pipelines and generation infrastructure. A small set of terms recurs in design discussions and tends to determine how a team scopes its pipeline.

Tool fidelity: The degree to which a tool's reported output faithfully represents what the underlying system actually did. Low fidelity — a runner that swallows errors, a schema introspector that omits constraints — undermines every downstream guarantee regardless of orchestration quality.
Introspection contract: The stable interface a source system exposes for tools to read its shape. Tool-compiled documentation lives or dies on whether the contract is complete enough to render reference text without paraphrase.
Output durability: A generated artefact's property of persisting unchanged across retries and reruns of the orchestration that produced it. In doc pipelines, durability is the difference between a captured example and a regenerated one that quietly shifts between visits.
Validation seam: The point in the pipeline where the assembled draft is checked against the live system before publication. Its design predicts the long-run accuracy of the artefact more than any other decision.
Provenance binding: An auditable link from a published line back to the tool run that produced it. Mature pipelines carry provenance to the publication artefact rather than discarding it at build time.

Where the stack fails

Failure in tool-compiled documentation does not look like prose hallucination. The model is not authoring the reference, so it has no opportunity to invent a parameter that does not exist. The interesting failure modes have migrated up the stack and become subtler — harder to spot in review, easier to ignore for longer.

Hallucinated tool output is the closest analogue. A tool returns a value, but the value reflects a stale cache, a stub, or a partial-success state the wrapper did not surface. The pipeline trusts the tool, the model trusts the pipeline, and the documentation publishes a line that was never true at the depth a reader will assume. Mitigation lives mostly upstream — content-addressed runner outputs, attested tool execution, capability scoping that refuses ambiguous successes.
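
One way to picture "refusing ambiguous successes" is a gate that inspects a tool result before the pipeline is allowed to trust it. The sketch below is a deliberately strict, hypothetical policy: the `ToolResult` fields, the five-minute freshness window, and the rule that any stderr output disqualifies a success are all assumptions, and many real tools would need looser rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolResult:
    exit_code: int
    stderr: str
    captured_at: float  # Unix timestamp of when the tool actually ran
    from_cache: bool


def attest(result: ToolResult, now: float, max_age_seconds: float = 300.0) -> list[str]:
    """Return the reasons a result should not be trusted; empty means accept."""
    reasons = []
    if result.exit_code != 0:
        reasons.append("non-zero exit code")
    if result.stderr.strip():
        reasons.append("stderr not empty on reported success")
    if result.from_cache:
        reasons.append("served from cache")
    if now - result.captured_at > max_age_seconds:
        reasons.append("capture older than freshness window")
    return reasons


suspect = ToolResult(exit_code=0, stderr="warning: partial write",
                     captured_at=0.0, from_cache=True)
print(attest(suspect, now=1000.0))
```

The gate returns reasons rather than a boolean so that a rejected result arrives in triage with an explanation attached.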

Version skew is the cousin pathology. The introspection ran against schema version N, the example ran against N plus one, the contract probe ran against N minus one because that is what production happens to be on. Each tool was internally honest. The composed document is silently inconsistent. Strict version pinning at the orchestration layer — every tool call carries the same schema reference — closes most of this surface, and any tool that cannot honour the pin gets routed into the triage queue rather than into the artefact.
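
A pin check at the orchestration layer can be very small. In this sketch each tool call is assumed to report the schema reference it actually ran against. The `ToolCall` shape and the reference strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str
    schema_ref: str  # the schema version this call actually ran against


def partition_by_pin(
    calls: list[ToolCall], pinned_ref: str
) -> tuple[list[ToolCall], list[ToolCall]]:
    """Split calls into those that honoured the pin and those bound for triage."""
    accepted = [call for call in calls if call.schema_ref == pinned_ref]
    triage = [call for call in calls if call.schema_ref != pinned_ref]
    return accepted, triage


calls = [
    ToolCall("introspect", "schema@41"),
    ToolCall("run_example", "schema@42"),
    ToolCall("probe_contract", "schema@40"),
]
accepted, triage = partition_by_pin(calls, pinned_ref="schema@41")
print([call.tool for call in accepted], [call.tool for call in triage])
```

Only the accepted calls feed the artefact. The rest are held back even though each was internally honest, because composing them would produce a document that describes no single version of the system.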

The third pathology is the absence of a review loop. Tool-compiled documentation is so plausible that teams stop reviewing it at all. The pipeline owns truth, the team owns delivery, and reviews dry up. When a tool quietly starts misreporting, no one catches it for a while. The review loop never disappears in a healthy pipeline; it relocates from prose review to validator review and tool-fidelity audits.

Closing the validation loop

The validation seam is the part of the pipeline most teams underestimate at design time and overinvest in once they have a near-miss in production. The seam is where the otherwise quiet promises of tool-compiled documentation get tested, and where the orchestrator's smallest decisions carry the largest downstream weight.

Validation falls into three layers, each catching a different class of error. The first is schema-shape validation: every captured artefact is asserted against the introspection schema before it is composed into the draft. The second is example replay: every example in the draft is re-executed against a frozen test environment immediately before publication, and the captured output is diffed against the original. The third is live-contract validation: the documented behaviour of the published artefact is probed against the production system on a fixed schedule after publication, and any drift opens a freshness incident.
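
The second layer, example replay, reduces to a diff between the output captured at draft time and the output of a fresh run. A minimal sketch using the standard-library `difflib` follows. The re-execution itself is assumed to have happened elsewhere, and the sample outputs are invented.

```python
import difflib


def replay_diff(captured: str, replayed: str) -> list[str]:
    """Diff the originally captured output against a fresh replay, line by line."""
    return list(difflib.unified_diff(
        captured.splitlines(),
        replayed.splitlines(),
        fromfile="captured",
        tofile="replayed",
        lineterm="",
    ))


original = "status: 200\nitems: 3"
rerun = "status: 200\nitems: 4"
for line in replay_diff(original, rerun):
    print(line)
```

An empty diff lets the example through unchanged. Anything else is a signal that either the system or the runner changed between capture and publication.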

Each layer trades latency for confidence. Cheap pipelines do only the first and treat the result as if it were the third. Mature pipelines run all three, accept the latency, and own the discipline of triaging the failures the seams surface. The orchestration model's contribution at this stage is small in volume and large in consequence: it interprets diffs, classifies them into regenerate, annotate, or escalate, and feeds the result back into the build graph.
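
The triage step is described here as the model's judgment, so any fixed rule table is only a stand-in. Still, a sketch helps show the shape of the decision. The diff kinds and the policy below are hypothetical. The one deliberate design choice is that anything unrecognised escalates rather than being silently regenerated.

```python
from enum import Enum


class Action(Enum):
    REGENERATE = "regenerate"
    ANNOTATE = "annotate"
    ESCALATE = "escalate"


# Hypothetical policy table; a real pipeline would tune this per system.
POLICY = {
    "undocumented_field": Action.REGENERATE,      # docs lag an additive change
    "example_output_changed": Action.REGENERATE,
    "known_nondeterminism": Action.ANNOTATE,      # for example, timestamps in output
    "missing_field": Action.ESCALATE,             # possible breaking change
    "type_drift": Action.ESCALATE,
}


def triage(diff_kind: str) -> Action:
    """Map a classified diff to a next step, escalating anything unrecognised."""
    return POLICY.get(diff_kind, Action.ESCALATE)


for kind in ("undocumented_field", "type_drift", "something_new"):
    print(kind, "->", triage(kind).value)
```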

The most common mistake is treating validation as a publish-time formality. Documentation that compiles cleanly today does not stay compiled. Live systems drift, schemas evolve, runners change behaviour. The seam has to keep running long after the artefact has shipped — or the artefact reverts to being a calendar problem, which is precisely the regime tool-compilation was meant to escape.

What compiles next

Tool-compiled documentation is one instance of a broader pattern: any artefact whose accuracy can be expressed as a contract against a live system becomes a candidate for compilation rather than authorship. Documentation was the first to feel the pull because its accuracy contract was already legible (schemas, examples, responses) and because the cost of drift was visible to users. The same logic has started to appear in adjacent areas.

Test suites, especially the parts that document expected behaviour rather than enforce logic, fall into the same regime. Onboarding material that demonstrates real flows can be assembled from captured tool runs rather than narrated from memory. Reference SDKs shift from written artefacts toward generated artefacts attached to the live schema. Release notes for systems whose changes can be observed via tool introspection can be drafted from diffs rather than authored from recollection.
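
To make the release-notes case concrete, here is a small sketch that drafts change lines from two introspected schema snapshots. The snapshots are assumed to be flat mappings of field name to type name, and the wording of each line is invented for illustration.

```python
def schema_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Draft one release-note line per observable schema change."""
    lines = []
    for name in sorted(new.keys() - old.keys()):
        lines.append(f"Added {name} ({new[name]})")
    for name in sorted(old.keys() - new.keys()):
        lines.append(f"Removed {name}")
    for name in sorted(old.keys() & new.keys()):
        if old[name] != new[name]:
            lines.append(f"Changed {name}: {old[name]} -> {new[name]}")
    return lines


previous = {"id": "string", "amount": "integer", "memo": "string"}
current = {"id": "string", "amount": "number", "currency": "string"}
for line in schema_changes(previous, current):
    print(line)
```

The draft covers only what introspection can observe. Motivation and migration advice would still need an author.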

The capability that makes all of this newly tractable is not a more eloquent model. It is the maturation of tool-calling interfaces — schema-stable, capability-scoped, audit-friendly — into something orchestrators can compose into pipelines without bespoke glue per tool. The model's most valuable contribution increasingly looks like scheduling and triage rather than prose.

The trajectory worth watching is whether validation infrastructure keeps pace. Pipelines that compile artefacts without commensurate validation tend to look impressive briefly and then degrade quietly. The teams that get tool-compiled documentation to work well over months, not weeks, share an unusual habit: they invest in the seam more than in the orchestrator. That order of priority is likely to become the norm as the field matures.