Image generation architecture has been quietly reorganising around a single load-bearing question: how should text and image tokens share a network? The answer that dominated for years was to encode the prompt with a separate text tower and then inject it into the image decoder through cross-attention. That answer is now being displaced by designs that fuse text and image tokens inside the same attention surface from the first layer onward. The shift is architectural, not stylistic, and it changes what prompt adherence, compositional control, and conditioning bandwidth even mean.
- Cross-attention as a conditioning sidecar carried image generation through the convolutional era; transformer-native designs no longer treat it as a given.
- Joint-attention multi-stream models keep separate text and image streams but merge them inside the attention block, eliminating the unidirectional conditioning shape.
- Single-stream diffusion transformers tokenise text and image into one sequence, with one attention surface from layer one.
- Native autoregressive multimodal generators take the same logic further, predicting interleaved text and image tokens through one decoder.
- Each step along that arc collapses an architectural seam — and along with it, a class of failure modes the seam used to introduce.
From Cross-Attention to Joint Attention
The cross-attention pattern carried a strong inductive prior: text was treated as a context source the image decoder queried at fixed intervals. The text encoder ran once; its output became a conditioning matrix; the image decoder pulled from it inside specific attention blocks while its own self-attention handled spatial structure. The shape was readable and modular. It was also asymmetric in a way that mattered: information flowed from text into image, never back, and the image side had no mechanism to re-shape the text representation as the generation progressed.
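The asymmetry described above can be seen in a minimal NumPy sketch of one cross-attention step. This is an illustration under assumptions, not any particular model's implementation: the token counts, the width `d`, and the random weights are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def cross_attention(image_tokens, text_context, w_q, w_k, w_v):
    """Image tokens query a fixed text context; the text is never updated."""
    q = image_tokens @ w_q                 # queries come from the image side only
    k = text_context @ w_k                 # keys come from the text side only
    v = text_context @ w_v                 # values come from the text side only
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return image_tokens + weights @ v      # residual update of the image tokens

rng = np.random.default_rng(0)
d = 8                                      # hypothetical model width
image_tokens = rng.normal(size=(16, d))    # 16 image patches (assumed)
text_context = rng.normal(size=(5, d))     # 5 prompt tokens from a frozen encoder (assumed)
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

text_before = text_context.copy()
updated_image = cross_attention(image_tokens, text_context, w_q, w_k, w_v)
```

Only `image_tokens` receives an update; `text_context` leaves the block exactly as it entered, which is the one-way flow the paragraph describes.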
Joint-attention multi-stream designs preserve the convenience of two distinct streams — one for text, one for image — but rewrite the attention step itself. Inside each attention block, the queries, keys, and values from both streams are concatenated, and a single attention operation runs across the union. Both streams update through the same operation, so each layer adjusts both representations jointly. The text representation evolves with the image; the image representation evolves with the text. Conditioning stops being a one-way pull and becomes a two-way negotiation.
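A rough sketch of the joint-attention step follows, assuming each stream keeps its own query, key, and value projections while a single softmax runs over the concatenated sequence. The helper `make_proj`, the shapes, and the random weights are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def make_proj(rng, d):
    """Hypothetical helper: one random Q/K/V projection set for a stream."""
    return {name: rng.normal(size=(d, d)) * 0.1 for name in ("q", "k", "v")}

def joint_attention(text, image, text_proj, image_proj):
    """Per-stream projections, one attention operation over the union."""
    n_text = text.shape[0]
    q = np.concatenate([text @ text_proj["q"], image @ image_proj["q"]])
    k = np.concatenate([text @ text_proj["k"], image @ image_proj["k"]])
    v = np.concatenate([text @ text_proj["v"], image @ image_proj["v"]])
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    mixed = weights @ v
    # Split the result back into two streams; both receive a residual update.
    return text + mixed[:n_text], image + mixed[n_text:], weights

rng = np.random.default_rng(0)
d = 8                                   # hypothetical model width
text = rng.normal(size=(5, d))          # 5 prompt tokens (assumed)
image = rng.normal(size=(16, d))        # 16 image patches (assumed)
new_text, new_image, weights = joint_attention(
    text, image, make_proj(rng, d), make_proj(rng, d)
)
```

The off-diagonal quadrants of `weights` hold text-to-image and image-to-text attention, so both representations change in the same step.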
The practical effect is subtle but compounding. Compositional prompts (multiple subjects with distinct attributes, spatial relations, count constraints) survive farther into the network without representational drift. Conditioning bandwidth, in the older sidecar shape, was capped by the dimension and number of the cross-attention blocks. In a joint-attention block, that cap loosens: every text token can attend to every image token in every block, and the reverse holds. The seam between conditioning and generation thins from a discrete interface to a continuous gradient.
Single-Stream Diffusion Transformers
The next compression collapses the two streams into one. Text and image tokens enter the same sequence at layer one; the network does not distinguish them at the architecture level beyond positional encoding and a small modality embedding. There is no text branch and no image branch — there is one transformer over a heterogeneous token stream. The text encoder ceases to be a separate component; it becomes a slice of the input.
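As a minimal sketch of that input shape, the snippet below builds one sequence from text and image embeddings, tags each token with a modality embedding, and passes it through a block with a single set of weights. Positional encoding is omitted for brevity, and all names, sizes, and weights are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def build_sequence(text_emb, image_emb, modality_emb):
    """Concatenate both modalities and add a per-modality embedding."""
    tokens = np.concatenate([text_emb, image_emb])
    modality_ids = np.array([0] * len(text_emb) + [1] * len(image_emb))
    return tokens + modality_emb[modality_ids], modality_ids

def shared_block(x, w_q, w_k, w_v):
    """One set of weights for every token, regardless of modality."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    return x + softmax(q @ k.T / np.sqrt(x.shape[-1])) @ v

rng = np.random.default_rng(0)
d = 8                                     # hypothetical model width
text_emb = rng.normal(size=(5, d))        # 5 embedded prompt tokens (assumed)
image_emb = rng.normal(size=(16, d))      # 16 embedded image patches (assumed)
modality_emb = rng.normal(size=(2, d))    # row 0 = text, row 1 = image
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

sequence, modality_ids = build_sequence(text_emb, image_emb, modality_emb)
output = shared_block(sequence, w_q, w_k, w_v)
```

Nothing in `shared_block` knows which rows are text and which are image; the only modality signal is the embedding added in `build_sequence`.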
The training implication is sharper than it first appears. A two-stream architecture forces a division of labour on representation learning: the text stream learns text-shaped features, the image stream learns image-shaped features, and the join happens at the attention surface. A single-stream architecture has no such partition. The network discovers, on its own, where modalities should diverge and where they should align, layer by layer. Modality boundaries become a learned property rather than an architectural one.
Inference economics shift along with this. Single-stream designs amortise the text encoding into the same forward pass that produces the image; they avoid duplicate key-value caches across separate streams; and they expose a uniform sequence length to the runtime, which simplifies batching and quantisation. The same uniform shape that simplifies inference is what makes scaling toward higher resolutions and longer multimodal context tractable. Unification at the architecture level pays a recurring dividend at the operations level.
Native Autoregressive Multimodal Generation
A parallel path takes the unified-token-stream idea further by replacing the diffusion decoder altogether. Image tokens are predicted autoregressively in the same decoder that predicts text tokens, over an interleaved sequence. There is no denoising loop and no separate generative head — image generation becomes a particular case of next-token prediction over a vocabulary that includes image tokens.
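A toy sketch of that framing follows, assuming a shared vocabulary in which low ids are text tokens and high ids are image codebook entries. The vocabulary sizes are hypothetical, `placeholder_logits` stands in for a trained decoder, and masking text ids during the image span is one possible decoding constraint rather than something the text prescribes.

```python
import numpy as np

TEXT_VOCAB = 100                  # ids 0..99 are text tokens (hypothetical size)
IMAGE_VOCAB = 64                  # ids 100..163 are image codes (hypothetical size)
VOCAB = TEXT_VOCAB + IMAGE_VOCAB

def placeholder_logits(sequence, rng):
    """Stand-in for a trained decoder: one logit per id in the shared vocabulary."""
    return rng.normal(size=VOCAB)

def generate_image_tokens(prompt_ids, n_image_tokens, rng):
    """Greedy next-token loop that extends a text prompt with image tokens."""
    sequence = list(prompt_ids)
    for _ in range(n_image_tokens):
        logits = placeholder_logits(sequence, rng)
        logits[:TEXT_VOCAB] = -np.inf        # assumed constraint: image ids only
        sequence.append(int(np.argmax(logits)))
    return sequence

rng = np.random.default_rng(0)
prompt_ids = [3, 17, 42]                     # hypothetical text token ids
sequence = generate_image_tokens(prompt_ids, n_image_tokens=8, rng=rng)
image_part = sequence[len(prompt_ids):]
```

The loop is the same one a language model would run for text; only the slice of the vocabulary being sampled differs.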
The appeal is consolidation. World-knowledge reasoning, prompt comprehension, and image generation share one parameter set and one objective. Compositional prompts that require external knowledge — recognising an object class, locating a known landmark, applying a stylistic convention — are answered by the same parameters that would have produced the textual answer. The architectural seam between language and vision disappears, and the model behaves like one system rather than two stitched ones.
The cost surface differs. Autoregressive image generation is sequential by construction, while diffusion sampling is naturally parallel across spatial positions inside a step. Inference latency under tight budgets often favours diffusion; sample quality under broad-knowledge demands increasingly favours autoregressive multimodal designs. The two paradigms are not converging on a single answer; they are specialising along different axes of the cost-quality plane.
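The sequential-versus-parallel contrast can be made concrete with a back-of-envelope count of dependent forward passes. The image sizes, the downsampling factor of 16, and the 30-step sampler below are illustrative assumptions, not measurements of any system.

```python
def autoregressive_passes(side, downsample=16):
    """One decoder pass per image token, each depending on the previous one."""
    tokens_per_side = side // downsample
    return tokens_per_side * tokens_per_side

def diffusion_passes(num_steps):
    """One network pass per denoising step; all positions update together."""
    return num_steps

ar_512 = autoregressive_passes(512)      # 32 * 32 tokens
ar_1024 = autoregressive_passes(1024)    # 64 * 64 tokens
diff_any = diffusion_passes(30)          # independent of resolution
```

Under these assumptions the autoregressive path needs 1024 dependent passes at 512 pixels per side and 4096 at 1024, while the diffusion path stays at 30. The passes are not equal in cost (a cached autoregressive step is far cheaper than a full denoising pass), so the count describes the depth of the dependency chain rather than wall-clock time.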
Conditioning Bandwidth and Compositional Control
The four architectural shapes — cross-attention conditioning, joint-attention multi-stream, single-stream diffusion, autoregressive multimodal — differ in what they make easy and what they make expensive. The most useful contrasts surface along three axes: how much of the network sees both modalities, how the modalities update each other, and what failure modes appear at the seam.
| Axis | Cross-attention sidecar | Joint attention | Single-stream | Autoregressive |
|---|---|---|---|---|
| Where modalities meet | Specific attention blocks | Every attention block | Every layer | Every token step |
| Direction of update | Text into image | Bidirectional inside block | Symmetric across sequence | Symmetric across sequence |
| Conditioning bandwidth | Capped by sidecar dimension | Capped by block count | Full sequence-length | Full sequence-length |
| Dominant failure mode | Attribute leakage across subjects | Attention dilution at long prompts | Modality interference | Slow high-resolution sampling |
| Generation primitive | Iterative denoising | Iterative denoising | Iterative denoising | Token prediction |
The table is not a ranking. Each column is best at something and worst at something else. The instructive observation is that the move from left to right across the table is a steady relaxation of architectural assumptions about where text and image should meet, and a steady shift of those decisions into the parameters themselves. Architectural priors get cheaper to discard as parameter counts grow, and the field is paying that cost willingly.
Terminology Worth Pinning Down
Vocabulary in this region of the field has accumulated faster than its definitions, and the same word frequently means different things in adjacent papers. A handful of distinctions repay the cost of pinning down before reading further.
- Token fusion: An architectural property in which text tokens and image tokens occupy the same attention pathway and update each other through the same operation. Distinct from token concatenation, which describes the input shape without committing to how the model treats the result downstream.
- Joint attention: An attention block whose query, key, and value matrices span tokens from more than one modality. Joint attention can occur in either a multi-stream or a single-stream architecture; it describes the block, not the surrounding network.
- Single-stream architecture: A design in which all tokens, across modalities, share one transformer pathway from the first layer onward. Modality identity is carried by embeddings, not by branch separation.
- Native multimodal generation: A model trained from initialisation on interleaved multimodal sequences, as opposed to a unimodal model later adapted with conditioning components bolted on.
- Conditioning bandwidth: The information-theoretic ceiling on how much textual signal can shape a generation, set jointly by where modalities meet, how often they meet, and how much representational capacity is allotted to the meeting.
Reasoning About Adoption Order
Builders integrating image generation into a pipeline rarely choose an architecture in isolation. The downstream constraints — compositional prompts, identity preservation across a series, typography rendering, deterministic regeneration — push against architectural choices in different directions. A loose order helps separate decisions that must come first from those that can be deferred.
- Characterise the prompts the pipeline will actually carry. Single-subject scenes tolerate older conditioning shapes; multi-subject compositions, count constraints, or in-frame typography exhaust them quickly.
- Identify the failure modes that are unacceptable. Attribute leakage, identity drift across a series, typography corruption, and mode collapse each map to different architectural seams; the question is which seam costs the most when it fails.
- Estimate the latency budget per generation. Autoregressive multimodal designs trade latency for breadth of capability; diffusion-shaped designs trade breadth for parallel sampling at fixed step counts.
- Decide the size of the deployment surface. Single-stream and autoregressive architectures concentrate parameters into one model — convenient when one model serves many roles, expensive when only one role is exercised.
- Hold one regeneration loop fixed long enough to measure the failure distribution. Without that measurement, architectural choice becomes aesthetic rather than informed.
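The last step, holding a regeneration loop fixed and measuring the failure distribution, can be sketched as a small tally over prompts and seeds. The `generate` and `classify` callables are hypothetical placeholders for a real generator and a real failure labeller; the stubs below exist only so the example runs.

```python
from collections import Counter

def failure_distribution(prompts, seeds, generate, classify):
    """Run every prompt with every seed and return the share of each label."""
    counts = Counter()
    for prompt in prompts:
        for seed in seeds:
            image = generate(prompt, seed)
            counts[classify(prompt, image)] += 1
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def stub_generate(prompt, seed):
    """Placeholder generator: returns a record instead of pixels."""
    return {"prompt": prompt, "seed": seed}

def stub_classify(prompt, image):
    """Placeholder labeller: flags every fourth seed as a failure."""
    return "attribute_leakage" if image["seed"] % 4 == 0 else "ok"

prompts = ["two red cubes left of a blue sphere", "a sign reading OPEN"]
distribution = failure_distribution(prompts, range(8), stub_generate, stub_classify)
```

Keeping the prompt set and seed list fixed across candidate architectures is what makes the resulting distributions comparable.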
Where Token Fusion Is Still Heading
The collapsing of the cross-attention seam is not the end-state of image generation architecture; it is a step along an arc whose direction is now legible. Each generation of designs has pushed the modality boundary deeper into the parameters and earlier into the network. The next pressure points are already showing.
The first is sequence length. Single-stream and autoregressive designs are bottlenecked by the same quadratic attention cost that constrains long-context language models. Image-token counts grow with pixel area, so doubling the side length roughly quadruples the tokens, and high-resolution generation with long multimodal context starts to dominate the inference budget. Linear-attention variants, mixture-of-experts routing inside the unified stream, and aggressive token compression for image regions are converging on the same problem from three directions.
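A small calculation shows how quickly full attention grows with resolution. It assumes one image token per 16 by 16 pixel region and a 64-token prompt; both figures are hypothetical and serve only to show the scaling.

```python
def sequence_length(side, downsample=16, text_tokens=64):
    """Tokens in a unified sequence: image grid plus prompt tokens."""
    tokens_per_side = side // downsample
    return tokens_per_side * tokens_per_side + text_tokens

def attention_score_count(side, downsample=16, text_tokens=64):
    """Pairwise scores computed by one full-attention layer, per head."""
    n = sequence_length(side, downsample, text_tokens)
    return n * n

length_512 = sequence_length(512)        # 1024 image tokens + 64 text tokens
length_1024 = sequence_length(1024)      # 4096 image tokens + 64 text tokens
growth = attention_score_count(1024) / attention_score_count(512)
```

Doubling the side length takes the sequence from 1088 to 4160 tokens under these assumptions, and the number of attention scores per layer grows by a factor of more than 14.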
The second is heterogeneous reasoning. As image generation shares parameters with language reasoning, the model is asked to reason about what it is generating — counting subjects, enforcing spatial relations, rendering legible typography, preserving identity across a sequence. The architecture that wins here is the one that lets reasoning and generation interleave at low cost rather than treating them as separate phases. Token fusion is the precondition; the next move is what the fused representation learns to do with itself.
The third pressure is the practical one. Builders carrying generated images through real pipelines do not get to choose only the architecture; they inherit the failure surface that comes with it. Architectures that retain fewer seams produce a smaller failure surface, but a higher cost when a seam they retain misbehaves. The field is, slowly, learning to design for the failure shape rather than around it, and image generation is one of the surfaces where that lesson is landing earliest.
