A transformer is a neural network architecture introduced in 2017 that uses self-attention mechanisms to process input sequences in parallel, enabling it to capture long-range dependencies in text. It is the foundational architecture behind virtually all modern large language models.
Before transformers, sequence models like RNNs and LSTMs processed text one word at a time, making them slow to train and limited in their ability to capture relationships between distant words. The transformer architecture, introduced in the landmark paper "Attention Is All You Need," revolutionized natural language processing by replacing recurrence with self-attention.
The key innovation of transformers is the self-attention mechanism, which allows every token in a sequence to attend to every other token simultaneously. This means the model can directly capture relationships between words regardless of their distance in the text. For example, in "The cat that sat on the mat was happy," self-attention helps the model connect "cat" with "was happy" even though many words separate them.
Transformers consist of stacked layers, each containing multi-head self-attention and feed-forward neural networks. Multi-head attention runs several attention computations in parallel, allowing the model to capture different types of relationships simultaneously, such as syntactic structure, semantic meaning, and coreference. Layer normalization and residual connections help with training stability.
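To make the layer structure concrete, here is a rough sketch of a single transformer layer in plain NumPy. This is an illustrative toy, not a real implementation: the projection matrices are random stand-ins for learned weights, and details like learnable layer-norm parameters, dropout, and output projections are omitted.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector to zero mean, unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def transformer_layer(x, n_heads, rng):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    # Multi-head self-attention: each head attends with its own Q/K/V projections,
    # letting different heads capture different kinds of relationships.
    heads = []
    for _ in range(n_heads):
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) * 0.1 for _ in range(3))
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        weights = softmax(Q @ K.T / np.sqrt(d_head))
        heads.append(weights @ V)
    attn_out = np.concatenate(heads, axis=-1)  # concatenate heads back to d_model

    x = layer_norm(x + attn_out)  # residual connection + layer norm

    # Position-wise feed-forward network: wider hidden layer with ReLU.
    W1 = rng.standard_normal((d_model, 4 * d_model)) * 0.1
    W2 = rng.standard_normal((4 * d_model, d_model)) * 0.1
    ffn_out = np.maximum(0, x @ W1) @ W2

    return layer_norm(x + ffn_out)  # second residual + norm

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))          # 8 tokens, 16-dimensional features
y = transformer_layer(x, n_heads=4, rng=rng)
```

In a trained model, dozens of these layers are stacked, and the weight matrices are learned rather than random.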
Modern LLMs like GPT, Claude, LLaMA, and Gemini are all based on the transformer architecture, typically using the decoder-only variant for text generation. The architecture's ability to be parallelized during training has enabled scaling to billions of parameters, which has been the driving force behind the remarkable capabilities of today's language models.
Input tokens are converted to dense vector representations (embeddings) and combined with positional encodings that give the model information about each token's position in the sequence.
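The sinusoidal positional encoding from the original paper can be sketched in a few lines of NumPy (modern models often use learned or rotary position embeddings instead; the random embeddings below are placeholders for a learned embedding table):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Even dimensions use sine, odd dimensions use cosine,
    # at geometrically spaced frequencies per the original paper.
    pos = np.arange(seq_len)[:, None]   # (seq_len, 1)
    dim = np.arange(d_model)[None, :]   # (1, d_model)
    angle = pos / np.power(10000, (2 * (dim // 2)) / d_model)
    return np.where(dim % 2 == 0, np.sin(angle), np.cos(angle))

seq_len, d_model = 8, 16
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((seq_len, d_model))   # stand-in for learned token embeddings
inputs = embeddings + positional_encoding(seq_len, d_model)
```

Because the encodings are simply added to the embeddings, every later layer sees position information without any extra machinery.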
For each token, the model computes attention scores by comparing that token's query projection against the key projections of every other token, then uses those scores to take a weighted combination of the value projections, producing a representation that captures contextual relationships across the entire sequence.
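This query/key/value computation is scaled dot-product attention, which can be sketched directly from its formula, softmax(QKᵀ/√d_k)·V (a minimal single-head version; real models batch this across heads and sequences):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (seq, seq): each query scored against every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights       # weighted combination of values, plus the weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 8))
K = rng.standard_normal((5, 8))
V = rng.standard_normal((5, 8))
out, weights = scaled_dot_product_attention(Q, K, V)
```

The √d_k scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with vanishing gradients.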
The attention output passes through feed-forward neural networks that apply non-linear transformations, enabling the model to learn complex patterns and features from the attention-enriched representations.
Multiple transformer layers are stacked, with each layer refining the representations. The final layer's output is projected to vocabulary-sized logits for next-token prediction or other downstream tasks.
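The final projection step can be illustrated with a small sketch (random weights stand in for the learned output projection, and the vocabulary size is arbitrary):

```python
import numpy as np

vocab_size, d_model, seq_len = 1000, 16, 8
rng = np.random.default_rng(0)

hidden = rng.standard_normal((seq_len, d_model))        # final layer's output
W_out = rng.standard_normal((d_model, vocab_size)) * 0.1

logits = hidden @ W_out                                 # (seq_len, vocab_size)
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs = probs / probs.sum(axis=-1, keepdims=True)       # softmax over the vocabulary

next_token = probs[-1].argmax()   # greedy decoding: pick the likeliest next token
```

In practice the model samples from (or searches over) this distribution rather than always taking the argmax.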
GPT-4 and similar models use a decoder-only transformer architecture with hundreds of billions of parameters. Each generated token attends to all previous tokens through causal self-attention, enabling coherent long-form text generation.
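Causal self-attention is ordinary attention with a mask that hides future positions, so each token can only attend to itself and earlier tokens. A minimal NumPy sketch:

```python
import numpy as np

def causal_attention_weights(scores):
    # Mask out future positions: token i may only attend to tokens 0..i.
    seq_len = scores.shape[-1]
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True above the diagonal
    masked = np.where(mask, -np.inf, scores)  # -inf becomes 0 after softmax
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.standard_normal((6, 6))
weights = causal_attention_weights(scores)
```

This masking is what lets decoder-only models train on next-token prediction: every position predicts its successor without ever seeing it.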
The original transformer was designed for translation, using an encoder-decoder architecture where the encoder processes the source language sentence and the decoder generates the translation, with cross-attention connecting the two.
Code-focused models like Codex use transformers to understand programming language syntax and semantics, leveraging self-attention to track variable references, function definitions, and control flow across entire files.
The transformer architecture is the foundation of the current AI revolution. Its ability to efficiently process sequences in parallel, scale to enormous sizes, and capture complex patterns in data has enabled the creation of LLMs that can understand and generate human-quality text across virtually every domain.
Respan provides deep observability into transformer-based LLM performance. Track inference latency across different model sizes, monitor attention-related performance metrics, compare different transformer architectures, and ensure your chosen model delivers consistent quality for your specific use cases.
Try Respan free