A language model doesn't read text the way you do. You see words. It sees numbers. Before GPT can think about what you wrote, it must first transform words into a format it understands. This happens in stages.

The transformation happens in four stages. First: tokenization (chopping text into pieces). Second: embeddings (giving each piece a numerical identity). Third: attention (letting pieces share context). Fourth: learning (the model figures out what the numbers mean through training). Let's start with the first step.

Tokenization - Breaking Text Into Pieces

An LLM can't read text directly. It needs a fixed vocabulary. Think of it like a dictionary: the model only knows words that are in its book.

The problem: There are infinite possible sentences, but a model needs a finite list.

The solution: Break text into small pieces called tokens. Each token gets a number.

Three Ways to Tokenize

You could break text at different levels. Each choice has tradeoffs.

Option 1: Letters

Text: "Hello"
Tokens: ["H", "e", "l", "l", "o"]
Vocabulary size: ~26 letters (tiny!)

Good: Very small vocabulary. Any word can be spelled.

Bad: Very inefficient. "Hello" needs 5 tokens. Long sequences are slow to process.

Option 2: Subwords

Text: "Hello running"
Tokens: ["Hello", " run", "ning"]
Vocabulary size: ~50,000 to ~200,000 (balanced)

Good: Mostly full words. Common words stay whole. Rare words break into pieces. Balanced efficiency and coverage.

Bad: Tokenization can be unpredictable. Similar words might split differently.

Option 3: Full words

Text: "Hello"
Tokens: ["Hello"]
Vocabulary size: ~500,000 words (huge!)

Good: Very efficient. One token per word.

Bad: Huge vocabulary. Every verb form needs its own entry: "run", "runs", "ran", "running". Can't handle typos or rare words.

Why 50,000 tokens? It's a sweet spot. Small enough to train efficiently. Large enough to keep common words whole. English text averages about 1.3 tokens per word with this approach.

From Text to Numbers

Once text is tokenized, each token gets a unique number from the vocabulary.

Text: "Hello world"
Tokens: ["Hello", " world"]
Token IDs: [15339, 2684]

The model sees only the numbers: [15339, 2684]. It has no idea what "Hello" or "world" mean. It just knows these are positions 15339 and 2684 in its vocabulary.
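If you want to try this yourself, here's a minimal sketch using the tiktoken library (assuming it is installed). The exact IDs depend on which tokenizer and vocabulary you use, so the numbers above are illustrative.

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # a BPE vocabulary used by several OpenAI models
ids = enc.encode("Hello world")
print(ids)                                   # a short list of integers; exact values depend on the vocabulary
print([enc.decode([i]) for i in ids])        # the piece of text each ID stands for
print(enc.decode(ids))                       # "Hello world": decoding reverses the mapping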

Why This Matters

Different tokens are completely unrelated numbers to the model:

Token 15339 = "Hello"
Token 47234 = "hello"
The model sees: 15339 vs 47234 (totally different)

The numbers are arbitrary. Token 100 and 101 have no special relationship just because they're close numerically. The distance between numbers is meaningless.

The problem: How can a model learn that "Hello" and "hello" are similar if they're just unrelated integers?

The answer: Embeddings. That's the next tab.

Summary: Text must become tokens because LLMs need a fixed vocabulary. We use subword tokenization (vocabularies on the order of 50,000 tokens) to balance efficiency and coverage. Each token becomes a number. But these numbers don't carry meaning on their own. The model must learn what they mean.

Embeddings - Giving Tokens Meaning

Remember the problem from tokenization: Token 15339 ("Hello") and Token 47234 ("hello") are just unrelated numbers. The model has no way to know they're similar.

The solution: Turn each token into a vector of numbers. Not just one number, but many.

From One Number to Many

An embedding is a list of numbers that represents a token. Instead of one dimension, you get hundreds.

Token ID: 15339
Embedding (simplified to 8 dimensions):
[0.24, -0.51, 0.83, -0.12, 0.67, -0.33, 0.91, -0.45]

In reality, models use embeddings with 768, 1024, or even 4096 dimensions. But the idea is the same: each token becomes a point in a high-dimensional space.

Why vectors? Vectors let us measure similarity. Two vectors that point in similar directions are similar. Two vectors pointing in opposite directions are different. This gives the model a way to understand that "Hello" and "hello" are related.

Visualizing Embeddings

Imagine a 2D space (even though real embeddings have hundreds of dimensions). Each token is a point:

Token "cat" → [0.8, 0.6]
Token "dog" → [0.7, 0.5]
Token "kitten" → [0.75, 0.65]
Token "run" → [-0.3, 0.9]

Here's what this looks like in 2D space:

[Scatter plot: "cat", "dog", "kitten", and "run" plotted as points along Dimension 1 and Dimension 2]
Notice how "cat", "dog", and "kitten" cluster together in the top-right. "run" is far away in the top-left.

Measuring similarity with cosine similarity:

The standard way to compare embeddings is cosine similarity. It measures how much two vectors point in the same direction, regardless of their length.

Cosine similarity = (vector A · vector B) / (|A| × |B|)
Range: -1 (opposite) to +1 (same direction)
"cat" ↔ "kitten"
Very similar (cosine: 0.98)
"cat" ↔ "dog"
Similar (cosine: 0.82)
"cat" ↔ "run"
Not similar (cosine: 0.18)

Similar tokens have cosine similarity close to 1. Unrelated tokens have similarity close to 0. This is the standard metric used in practice.

Why cosine similarity instead of distance?
Cosine similarity measures the angle between vectors, not their length. This matters because embeddings can have different magnitudes but similar meanings. Two vectors pointing in the same direction are similar, even if one is longer. In practice, people use cosine similarity or dot product after normalizing vectors (which is equivalent). Euclidean distance can work if vectors are normalized, but cosine is the default in embedding comparisons.
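Here's a minimal sketch of that calculation in Python, using the toy 2D vectors from above. Real embeddings have hundreds of dimensions, and the cosine values listed earlier are illustrative rather than computed from these toy points, so the printed numbers will differ slightly.

import numpy as np

def cosine_similarity(a, b):
    # 1 = same direction, 0 = unrelated, -1 = opposite direction
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cat, kitten, run = [0.8, 0.6], [0.75, 0.65], [-0.3, 0.9]
print(cosine_similarity(cat, kitten))   # close to 1: these vectors point the same way
print(cosine_similarity(cat, run))      # much lower: these vectors point in different directions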

The Embedding Table

How does this work? The model has an embedding table: a big lookup table that maps each token ID to its vector.

Vocabulary size: 50,000 tokens
Embedding dimension: 768
Embedding table size: 50,000 × 768 = 38.4 million numbers

When the model sees token 15339, it looks up row 15339 in this table and gets the corresponding 768-dimensional vector.

Input: Token ID 15339
Lookup: Row 15339 in embedding table
Output: [0.24, -0.51, 0.83, ..., -0.45] (768 numbers)
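A minimal sketch of that lookup, assuming PyTorch; the vocabulary size, dimension, and token IDs are the illustrative numbers used above.

import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 768
embedding = nn.Embedding(vocab_size, d_model)   # the lookup table: 50,000 rows x 768 columns

token_ids = torch.tensor([15339, 2684])         # "Hello world" as token IDs (illustrative)
vectors = embedding(token_ids)                  # row lookup: one 768-dimensional vector per token
print(vectors.shape)                            # torch.Size([2, 768])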

Embeddings Are Learned

Here's the crucial part: the model doesn't know what these embeddings should be at the start.

At initialization, the embedding table is filled with random numbers. Token "cat" might start as [0.01, -0.23, 0.44, ...]. Token "dog" might be [-0.88, 0.12, -0.03, ...].

During training, the model learns better embeddings. It adjusts the numbers so that:

  • Similar words get similar vectors
  • Related concepts cluster together
  • The embeddings help predict what comes next

How does it learn? The model tries to predict the next token. When it gets it wrong, it adjusts the embeddings (and other parameters) to do better next time. Over billions of training examples, the embeddings become meaningful representations.
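A minimal sketch, assuming PyTorch, of why the table changes: the embedding rows are ordinary learnable parameters, so the prediction error flows back into them. The tiny linear layer here is only a stand-in for the rest of the model, and the token IDs are illustrative.

import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 768
embedding = nn.Embedding(vocab_size, d_model)
to_logits = nn.Linear(d_model, vocab_size)      # tiny stand-in for the rest of the model

context = torch.tensor([15339])                 # current token (illustrative ID)
target = torch.tensor([2684])                   # the token that actually came next

logits = to_logits(embedding(context))          # a score for every entry in the vocabulary
loss = nn.functional.cross_entropy(logits, target)
loss.backward()                                 # the error flows back into the embedding table

print(embedding.weight.grad.abs().sum() > 0)    # tensor(True): the embedding rows get adjusted too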

Why This Matters

Embeddings solve the problem from tokenization. Remember:

Before: Token 15339 and Token 47234 are unrelated numbers
After: Their embeddings are similar vectors

Now the model can understand that "Hello" (token 15339) and "hello" (token 47234) mean similar things, even though their token IDs are different.

The same goes for any relationship: "king" and "queen", "run" and "running", "Paris" and "France". The model learns these relationships through the embeddings.

The Magic of High Dimensions

Why use 768 dimensions instead of just 2 or 3? Because language is complex.

The word "bank" can mean:

  • A financial institution
  • The side of a river
  • To tilt an airplane

With hundreds of dimensions, the model can encode multiple aspects of meaning at once. One dimension might capture "finance", another "geography", another "motion".

The model doesn't explicitly assign dimensions to concepts. It just learns what works. But the result is that embeddings capture rich, multi-faceted meaning.

Summary: Embeddings turn tokens (single numbers) into vectors (lists of numbers). These vectors live in a high-dimensional space where similar tokens are close together. The embeddings are learned during training, not hand-coded. They give the model a way to understand that different tokens can have similar meanings.

But embeddings are just the start. The real power comes from what happens next: attention.

Attention - Letting Tokens Share Context

Embeddings give each token a rich vector, but the model still needs a way to decide which tokens matter to one another. Attention is the coordination step. It lets every token look around the entire sequence and pull in the information it needs.

The Challenge: Ambiguity in Language

Consider the sentence: "The bank by the river"

The word "bank" is ambiguous. It could mean:

  • A financial institution (money bank)
  • The side of a river (riverbank)
  • To tilt or lean

A human reader looks at "river" and immediately knows this is the riverside meaning. The model needs a similar ability: to look at context and adjust understanding. Attention provides exactly that.

Queries, Keys, and Values

Every token in the sequence creates three new vectors through learned linear projections. For each attention head:

Q = X·W_Q (embedding multiplied by Query weight matrix)
K = X·W_K (embedding multiplied by Key weight matrix)
V = X·W_V (embedding multiplied by Value weight matrix)
Query (Q): "What am I looking for?" The question this token asks, created by Q = X·W_Q. For "bank", its query signals what kind of context would help. It is compared against keys via dot product.

Key (K): "What do I advertise?" The topic tags this token offers, created by K = X·W_K. Its dot product with the queries determines the attention scores (scaled by √d_k).

Value (V): "The information I share." Created by V = X·W_V. This is what gets blended by the attention weights: when a token receives attention, its value is what contributes to the output.

Positional Information

Attention alone doesn't know the order of tokens. Without position information, "the cat sat" would be identical to "sat cat the". The model injects order through positional encodings added to embeddings before attention:

Common approaches:
• Sinusoidal: Mathematical formula based on position and frequency
• Learned absolute: Each position gets a learned vector
• RoPE (Rotary): Rotation-based, improves extrapolation
• ALiBi: Attention bias based on relative positions

Without these, the transformer would be permutation-invariant: the model would treat "the cat sat" the same as any reordering.
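As a concrete example, here's a minimal sketch of the sinusoidal variant (the formulation from the original Transformer paper); the other schemes listed above work differently.

import numpy as np

def sinusoidal_positions(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]             # (1, d_model / 2)
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Added to the token embeddings before the first attention layer:
# x = token_embeddings + sinusoidal_positions(seq_len, d_model)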

The Attention Mechanism Step-by-Step

Let's trace what happens when the model processes "The bank by the river" (including positional info):

Tokens: ["The", " bank", " by", " the", " river"]

Step 1: Compute Similarity Scores

For the token "bank", compute how similar its Query is to each token's Key using dot product:

attention_scores = Query(bank) · Keys(all tokens)
Scores: [the: 0.2, bank: 0.5, by: 0.1, the: 0.2, river: 0.8]

The dot product measures similarity: vectors pointing in similar directions get higher scores. Notice that "river" gets the highest score (0.8). This is right: "river" is the most relevant context for "bank".

Why dot product? Isn't that cosine similarity?
They're closely related. Cosine similarity is just the dot product divided by the vectors' lengths: cosine = (A · B) / (|A| × |B|). Attention uses the raw dot product, scaled by √d_k, rather than cosine: it is cheaper to compute, and the scaling keeps the scores in a reasonable range without normalizing the vectors. Both measure the same underlying thing: how much two vectors point in the same direction. Outside of attention, when people compare embeddings directly, cosine similarity (or a dot product on normalized vectors, which is equivalent) is the usual choice.

Step 2: Stabilize and Normalize

In high dimensions, raw dot products can grow large, which pushes softmax toward putting nearly all of the weight on a single token. We scale the scores by dividing by √d_k, where d_k is the per-head key/query dimension (not the full embedding dimension). Then we apply softmax to turn the scaled scores into weights that sum to 1:

Example: d_model=768, num_heads=12 → d_k = 768/12 = 64
scaled_scores = scores / √(64) = scores / 8
attention_weights = softmax(scaled_scores)
Weights: [the: 0.10, bank: 0.15, by: 0.08, the: 0.10, river: 0.57] (illustrative)

Notice: "river" now has attention weight 0.57. It gets more than half the focus. The others get much less. This is the attention in action.


Step 3: Blend Values

Multiply each token's Value vector by its attention weight, then sum:

output = (0.10 × Value(the) + 0.15 × Value(bank) + 0.08 × Value(by)
+ 0.10 × Value(the) + 0.57 × Value(river))

Since "river" dominates with 0.57 weight, its value (which encodes geographical meaning) flows heavily into the output. The "bank" token's representation is updated to incorporate this river-related context.

Multi-Head Attention

One attention head might miss important patterns. Transformers use multiple heads in parallel, each with its own Query, Key, and Value matrices:

Head 1 (grammar): tracks subjects, verbs, and objects. Attends to: "the" → "bank" (grammatical structure).
Head 2 (semantics): tracks meaning and context. Attends to: "river" (geographical context).
Head 3 and beyond: other patterns, multiple specializations.
All heads → concatenate & project.

Each head produces an output vector. These are concatenated and passed through an output weight matrix to produce the attention result. This way, the model gathers multiple types of information at once: grammar, semantics, long-range dependencies, and more.

Putting It Together: The Attention Mechanism

Scaled Dot-Product Attention (single head):
Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V
where Q = queries, K = keys, V = values, and d_k = dimension of the keys
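A minimal NumPy sketch of that formula for a single head; the shapes and random inputs are made up for illustration, and causal masking is omitted.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # every query dotted with every key
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ V                           # blend the values by those weights

seq_len, d_k = 5, 64                             # e.g. "The bank by the river", one 64-dimensional head
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
print(attention(Q, K, V).shape)                  # (5, 64): one context-mixed vector per token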

Summary: Attention is how tokens learn to focus on relevant parts of the sequence. Queries and Keys determine which tokens attend to which others through similarity scoring. Values carry the actual information. Softmax turns scores into probabilities. Multiple heads in parallel capture different types of relationships. This mechanism allows the model to understand long-range context, resolve ambiguity (like "bank" meaning), and build rich contextual representations.

Learning - How Matrix Multiplications and Attention Create Understanding

We have embeddings now. Each token is a vector of 768 numbers. But these vectors don't mean much at first. They're random. The model must learn what they should represent.

This is where the transformer architecture comes in. Two main mechanisms work together: attention (which you learned about in the previous tab) and feedforward networks (made of matrix multiplications). They're the tools the model uses to transform embeddings and discover patterns across billions of training examples.

What Is a Matrix Multiplication?

A matrix multiplication is a mathematical operation that transforms vectors. Think of it as a function: you put a vector in, you get a different vector out.

Input vector: [0.8, 0.6, 0.3]
Matrix: [[0.5, 0.2], [0.1, 0.9], [0.7, 0.3]]
Output vector: [0.67, 0.79]

The matrix contains weights (numbers). When you multiply the input vector by the matrix, you get a new vector. The weights control how the transformation happens.

How does matrix multiplication work mathematically?
Each element in the output is a weighted sum of the input. For example: output[0] = (input[0] × 0.5) + (input[1] × 0.1) + (input[2] × 0.7). The matrix tells you what weights to use. In an LLM, these weights are learned during training. The model adjusts them to transform embeddings in useful ways.
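The same arithmetic in NumPy, reproducing the example above (nothing here is specific to any particular model):

import numpy as np

x = np.array([0.8, 0.6, 0.3])          # input vector
W = np.array([[0.5, 0.2],
              [0.1, 0.9],
              [0.7, 0.3]])             # weight matrix: 3 inputs to 2 outputs

print(x @ W)                           # approximately [0.67 0.79]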

How Transformers Work: Attention + Feedforward

Each layer in a transformer has two parts that work together. Here's what happens:

1. Start with embeddings: token vectors like [0.24, -0.51, 0.83, ...]
2. Multi-head attention: tokens look at each other using Q, K, V (from the Attention tab) and mix information based on relevance
3. Feedforward network: matrix multiplications transform and refine the context-aware vectors
4. Output to next layer: richer representations carrying multiple types of learned patterns

This cycle repeats in every transformer layer. But each layer also includes critical components that make depth work:

Each transformer layer:
1. Add positional encoding (if first layer)
2. Attention block:
• Multi-head attention (mixes information ACROSS tokens)
• Add residual connection: output = attention_out + input
• Layer normalization
3. Feedforward block (MLP):
• Matrix multiply, apply GELU nonlinearity, matrix multiply again
• Mixes features WITHIN each token (not across tokens)
• Add residual connection: output = mlp_out + attention_out
• Layer normalization

Key distinction: Attention mixes information across all tokens. The MLP mixes features within each token separately. This separation allows the model to attend first, then reason.
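A minimal PyTorch sketch of one such block, following the post-norm arrangement described in the list above; the dimensions match the examples in this section, and causal masking is omitted for brevity.

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=768, num_heads=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = nn.Sequential(                      # the feedforward (MLP) block
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # attention mixes information ACROSS tokens, then residual + layer norm
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # the MLP mixes features WITHIN each token, then residual + layer norm
        x = self.norm2(x + self.ff(x))
        return x

x = torch.randn(1, 5, 768)                            # (batch, seq_len, d_model)
print(TransformerBlock()(x).shape)                    # torch.Size([1, 5, 768])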

Early layers focus on local patterns; later layers use that context for semantics and reasoning.

Example: Understanding Context (Attention + Feedforward)

Consider the sentence: "The bank by the river"

Token "bank" starts with embedding: [0.2, 0.5, ...]
ATTENTION LAYER:
• "bank" attends to all tokens via queries/keys/values
• Highest attention to "river" (0.57 weight, as you learned)
• Context vector: [0.1, 0.9, ...] (mixed with "river")
FEEDFORWARD LAYER:
• Matrix multiplications refine: [0.1, 0.9, ...] → [0.05, 0.95, ...]
• Result encodes: bank = riverside, not financial

The attention layer decides which tokens matter. The feedforward layer learns what those relationships mean. Together, they let the model combine information from "river" into the representation of "bank". This is how context affects meaning.

Learning Through Training

At the start of training, the weight matrices are random. The model makes terrible predictions. But here's how it improves:

1. Try to predict: the model sees "The cat sat on the" and tries to predict the next token
2. Get it wrong: it predicts "car", but the correct answer is "mat"
3. Adjust weights: change the matrix values to make "mat" more likely next time
4. Repeat billions of times: gradually, predictions improve across all patterns

Each time the model is wrong, it adjusts the weight matrices slightly using backpropagation. The training process:

1. Forward pass: Text → tokens → embeddings → 30+ layers → logits
2. Loss: Softmax cross-entropy between predicted and true next token
3. Backward pass: Compute gradients for all layers
4. Update: Use optimizer (e.g., AdamW) to adjust:
• Transformer weight matrices (attention, MLP)
• Embedding table entries
• Positional encodings (if learned)
5. Repeat billions of times across massive datasets
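Here's a minimal sketch of one such step, assuming PyTorch, with a toy embedding-plus-linear model standing in for the full transformer stack; the token IDs and learning rate are illustrative.

import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 768
model = nn.Sequential(                                # toy stand-in for embeddings + layers + output head
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, vocab_size),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

inputs = torch.tensor([[15339]])                      # context tokens (illustrative IDs)
targets = torch.tensor([[2684]])                      # the true next token at each position

logits = model(inputs)                                # 1. forward pass
loss = nn.functional.cross_entropy(                   # 2. loss against the true next token
    logits.view(-1, vocab_size), targets.view(-1))
loss.backward()                                       # 3. backward pass: gradients for every parameter
optimizer.step()                                      # 4. AdamW update
optimizer.zero_grad()                                 # ready for the next batch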

Over billions of examples, the weights converge to values that predict well.

Key insight: The model doesn't learn rules like "adjectives come before nouns." It learns patterns in the weight matrices. These matrices encode everything: grammar, facts, reasoning patterns. All through repeated adjustments from training examples.

Why Multiple Layers?

LLMs don't do just one transformation. They do this hundreds of times, organized into layers. Each layer contains both attention and feedforward networks.

Layer 1: attention captures local patterns; feedforward learns basic features (punctuation, common sequences)
Layer 10: attention covers broader context; feedforward learns syntax (sentence structure, grammar)
Layer 20: attention tracks long-range dependencies; feedforward learns semantics (meaning, relationships)
Layer 30+: attention handles abstract concepts; feedforward learns reasoning (logic, inference, facts)

In early layers, attention looks at nearby tokens and feedforward learns simple patterns. In later layers, attention looks across the entire sequence and feedforward learns complex reasoning. Each layer builds on the previous one, gradually extracting more sophisticated understanding from the raw tokens.

How many parameters does an LLM have?
Each number in a weight matrix is a parameter. A model with 7 billion parameters has 7 billion numbers spread across all its weight matrices. Larger models (70B, 405B parameters) have more capacity to learn complex patterns. Training adjusts all these parameters simultaneously to minimize prediction errors.
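A rough back-of-the-envelope count for a GPT-2-small-sized configuration (12 layers, d_model = 768, feedforward width 3072, vocabulary of about 50,000), ignoring biases, layer norms, and positional embeddings:

d_model, d_ff, n_layers, vocab = 768, 3072, 12, 50_000

embeddings = vocab * d_model                  # token embedding table
attention  = 4 * d_model * d_model            # W_Q, W_K, W_V and the output projection
mlp        = 2 * d_model * d_ff               # the two feedforward matrices
per_layer  = attention + mlp

total = embeddings + n_layers * per_layer
print(f"{total / 1e6:.0f}M parameters")       # about 123M: roughly GPT-2-small sized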

Putting It All Together: The Complete Transformer Pipeline

Here's the complete pipeline:

1. Text → Tokens: "Hello world" → [15339, 2684]
2. Tokens → Embeddings: [15339, 2684] → [[0.24, -0.51, ...], [0.12, 0.83, ...]]
3. Embeddings → Layers: Process through 30+ transformer layers
Each layer contains:
• Attention: Which tokens to focus on (Q, K, V dot products)
• Feedforward: What patterns to learn (matrix multiplications)
4. Layers → Prediction: Output probabilities for next token

Attention and feedforward networks power step 3. They work together, repeatedly, to transform the embeddings. Attention finds patterns (relationships between tokens), feedforward learns what those patterns mean. This repeats 30+ times, building deeper understanding with each layer.

The weights in both attention and feedforward are learned. That's what "training" means: adjusting billions of numbers until the model predicts well on billions of examples.

Summary: Learning happens through two mechanisms working together in every layer: attention finds what's relevant (which tokens to focus on), and feedforward networks learn what those patterns mean (matrix multiplications). The model adjusts billions of weights across these matrices to make better predictions. Over billions of training examples, those weights come to encode everything the model knows about language.

The Big Picture: Everything Connected

You now understand the full transformer architecture:

  • Tokenization: Text becomes numbers (token IDs)
  • Embeddings: Token IDs become vectors (learned representations)
  • Attention: Tokens look at each other to find relevant context (Q, K, V dot products)
  • Feedforward: Matrix multiplications learn what those relationships mean
  • Layers: These two mechanisms repeat 30+ times, each layer building deeper understanding

Every word you feed to an LLM goes through this journey. Attention figures out what matters. Feedforward networks learn what it means. Repeat 30+ times. The magic is in the learned weight matrices and attention patterns, built from billions of training examples.