Understanding LLM Input Space: Tokens, Embeddings, and Learning
A language model doesn't read text the way you do. You see words. It sees numbers. Before GPT can think about what you wrote, it must first transform words into a format it understands. This happens in stages.
The transformation has three steps. First: tokenization (chopping text into pieces). Second: embeddings (giving each piece a numerical identity). Third: learning (the model figures out what the numbers mean through training). Let's start with the first step.
Tokenization - Breaking Text Into Pieces
An LLM can't read text directly. It needs a fixed vocabulary. Think of it like a dictionary: the model only knows words that are in its book.
The problem: There are infinite possible sentences, but a model needs a finite list.
The solution: Break text into small pieces called tokens. Each token gets a number.
Three Ways to Tokenize
You could break text at different levels. Each choice has tradeoffs.
Option 1: Letters
Good: Very small vocabulary. Any word can be spelled.
Bad: Very inefficient. "Hello" needs 5 tokens. Long sequences are slow to process.
Option 2: Subwords
Good: Mostly full words. Common words stay whole. Rare words break into pieces. Balanced efficiency and coverage.
Bad: Tokenization can be unpredictable. Similar words might split differently.
Option 3: Full words
Good: Very efficient. One token per word.
Bad: Huge vocabulary. Every verb form needs its own entry: "run", "runs", "ran", "running". Can't handle typos or rare words.
Why a vocabulary of roughly 50,000 tokens? It's a sweet spot. Small enough to train efficiently. Large enough to keep common words whole. English text averages about 1.3 tokens per word with this approach.
From Text to Numbers
Once text is tokenized, each token gets a unique number from the vocabulary.
The model sees only the numbers: [15339, 2684]. It has no idea what "Hello" or "world" mean. It just knows these are positions 15339 and 2684 in its vocabulary.
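To make this concrete, here's a minimal sketch using the tiktoken library (a choice made for illustration; the text doesn't name a specific tokenizer). The exact IDs you get back depend entirely on which vocabulary you load, so they may not match the numbers above.

```python
# A minimal sketch of subword tokenization with tiktoken (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # one widely used subword vocabulary

ids = enc.encode("Hello world")
print(ids)                               # a short list of integers, one per token
print([enc.decode([i]) for i in ids])    # the text piece behind each ID

# To the model these integers are just row indices in its vocabulary,
# not quantities with any numeric meaning.
```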
Why This Matters
Different tokens are completely unrelated numbers to the model. The numbers are arbitrary: token 100 and token 101 have no special relationship just because they're close numerically. The distance between numbers is meaningless.
The problem: How can a model learn that "Hello" and "hello" are similar if they're just unrelated integers?
The answer: embeddings. That's the next section.
Summary: Text must become tokens because LLMs need a fixed vocabulary. We use subword tokenization (50,000 tokens) to balance efficiency and coverage. Each token becomes a number. But these numbers don't carry meaning on their own. The model must learn what they mean.
Embeddings - Giving Tokens Meaning
Remember the problem from tokenization: Token 15339 ("Hello") and Token 47234 ("hello") are just unrelated numbers. The model has no way to know they're similar.
The solution: Turn each token into a vector of numbers. Not just one number, but many.
From One Number to Many
An embedding is a list of numbers that represents a token. Instead of one dimension, you get hundreds.
In reality, models use embeddings with 768, 1024, or even 4096 dimensions. But the idea is the same: each token becomes a point in a high-dimensional space.
Why vectors? Vectors let us measure similarity. Two vectors that point in similar directions are similar. Two vectors pointing in opposite directions are different. This gives the model a way to understand that "Hello" and "hello" are related.
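If you want to see that measurement in action, here's a small sketch with numpy and made-up 3-dimensional vectors. Real embeddings are learned and far longer; the numbers below are purely illustrative.

```python
# Cosine similarity between (hypothetical) embedding vectors.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product divided by the product of the vectors' lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

hello_upper = np.array([0.9, 0.1, 0.3])    # made-up embedding for "Hello"
hello_lower = np.array([0.8, 0.2, 0.25])   # made-up embedding for "hello"
run_token   = np.array([-0.5, 0.9, -0.7])  # made-up embedding for "run"

print(cosine_similarity(hello_upper, hello_lower))  # close to 1: similar direction
print(cosine_similarity(hello_upper, run_token))    # much lower: different direction
```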
Visualizing Embeddings
Imagine a 2D space (even though real embeddings have hundreds of dimensions), with each token as a point. Tokens with related meanings, like "cat", "dog", and "kitten", cluster close together, while an unrelated token like "run" sits far away from that cluster.
Measuring similarity with cosine similarity:
The standard way to compare embeddings is cosine similarity. It measures how much two vectors point in the same direction, regardless of their length.
Similar tokens have cosine similarity close to 1. Unrelated tokens have similarity close to 0. This is the standard metric used in practice.
The Embedding Table
How does this work? The model has an embedding table: a big lookup table that maps each token ID to its vector.
When the model sees token 15339, it looks up row 15339 in this table and gets the corresponding 768-dimensional vector.
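Here's a sketch of that lookup in numpy, using the sizes from the text (50,000 tokens, 768 dimensions). The table contents are random stand-ins for what training would eventually learn.

```python
# An embedding table as a plain lookup: one row per token ID.
import numpy as np

vocab_size, embed_dim = 50_000, 768
embedding_table = np.random.randn(vocab_size, embed_dim) * 0.02  # random at initialization

token_ids = [15339, 2684]             # the IDs from the "Hello world" example
vectors = embedding_table[token_ids]  # look up one row per token
print(vectors.shape)                  # (2, 768): one 768-dimensional vector per token
```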
Embeddings Are Learned
Here's the crucial part: the model doesn't know what these embeddings should be at the start.
At initialization, the embedding table is filled with random numbers. Token "cat" might start as [0.01, -0.23, 0.44, ...]. Token "dog" might be [-0.88, 0.12, -0.03, ...].
During training, the model learns better embeddings. It adjusts the numbers so that:
- Similar words get similar vectors
- Related concepts cluster together
- The embeddings help predict what comes next
How does it learn? The model tries to predict the next token. When it gets it wrong, it adjusts the embeddings (and other parameters) to do better next time. Over billions of training examples, the embeddings become meaningful representations.
Why This Matters
Embeddings solve the problem we hit in tokenization: on their own, token IDs are arbitrary, unrelated integers.
Now the model can understand that "Hello" (token 15339) and "hello" (token 47234) mean similar things, even though their token IDs are different.
The same goes for any relationship: "king" and "queen", "run" and "running", "Paris" and "France". The model learns these relationships through the embeddings.
The Magic of High Dimensions
Why use 768 dimensions instead of just 2 or 3? Because language is complex.
The word "bank" can mean:
- A financial institution
- The side of a river
- To tilt an airplane
With hundreds of dimensions, the model can encode multiple aspects of meaning at once. One dimension might capture "finance", another "geography", another "motion".
The model doesn't explicitly assign dimensions to concepts. It just learns what works. But the result is that embeddings capture rich, multi-faceted meaning.
Summary: Embeddings turn tokens (single numbers) into vectors (lists of numbers). These vectors live in a high-dimensional space where similar tokens are close together. The embeddings are learned during training, not hand-coded. They give the model a way to understand that different tokens can have similar meanings.
But embeddings are just the start. The real power comes from what happens next: attention, and the learning that shapes it.
Attention - Letting Tokens Share Context
Embeddings give each token a rich vector, but the model still needs a way to decide which tokens matter to one another. Attention is the coordination step. It lets every token look around the entire sequence and pull in the information it needs.
The Challenge: Ambiguity in Language
Consider the sentence: "The bank by the river"
The word "bank" is ambiguous. It could mean:
- A financial institution (money bank)
- The side of a river (riverbank)
- To tilt or lean
A human reader looks at "river" and immediately knows this is the riverside meaning. The model needs a similar ability: to look at context and adjust understanding. Attention provides exactly that.
Queries, Keys, and Values
Every token in the sequence creates three new vectors through learned linear projections. For each attention head:
- Query: what this token is looking for in the rest of the sequence
- Key: what this token offers, matched against other tokens' Queries
- Value: the information this token passes along once it is attended to
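Here's a rough sketch of those projections in numpy. The projection matrices are random stand-ins for learned weights, and the head size of 64 is just a typical choice, not something the text specifies.

```python
# Query/Key/Value projections for one attention head.
import numpy as np

seq_len, embed_dim, head_dim = 5, 768, 64    # "The bank by the river" -> 5 tokens

x = np.random.randn(seq_len, embed_dim)      # token embeddings (with positions added)

W_q = np.random.randn(embed_dim, head_dim) * 0.02   # learned in a real model
W_k = np.random.randn(embed_dim, head_dim) * 0.02
W_v = np.random.randn(embed_dim, head_dim) * 0.02

Q = x @ W_q   # what each token is looking for      -> shape (5, 64)
K = x @ W_k   # what each token offers to others    -> shape (5, 64)
V = x @ W_v   # the information each token carries  -> shape (5, 64)
print(Q.shape, K.shape, V.shape)
```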
Positional Information
Attention alone doesn't know the order of tokens. Without position information, "the cat sat" would be identical to "sat cat the". The model injects order through positional encodings added to the embeddings before attention, either fixed sinusoidal patterns or learned position vectors.
Without these, the transformer would be permutation-invariant: the model would treat "the cat sat" the same as any reordering.
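One common scheme, sketched below, is the sinusoidal encoding from the original transformer paper. Many GPT-style models instead learn their position vectors during training, so treat this as one illustrative option rather than the only way.

```python
# Sinusoidal positional encodings: a fixed pattern per position, added to embeddings.
import numpy as np

def sinusoidal_positions(seq_len: int, dim: int) -> np.ndarray:
    """Return a (seq_len, dim) matrix of position encodings."""
    positions = np.arange(seq_len)[:, None]                        # 0, 1, 2, ...
    freqs = np.exp(-np.log(10_000.0) * np.arange(0, dim, 2) / dim) # one frequency per pair of dims
    enc = np.zeros((seq_len, dim))
    enc[:, 0::2] = np.sin(positions * freqs)                       # even dimensions
    enc[:, 1::2] = np.cos(positions * freqs)                       # odd dimensions
    return enc

embeddings = np.random.randn(5, 768)                        # 5 tokens, 768 dims
with_positions = embeddings + sinusoidal_positions(5, 768)  # added before attention
```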
The Attention Mechanism Step-by-Step
Let's trace what happens when the model processes "The bank by the river" (including positional info):
Step 1: Compute Similarity Scores
For the token "bank", compute how similar its Query is to each token's Key using dot product:
The dot product measures similarity: vectors pointing in similar directions get higher scores. Notice that "river" gets the highest score (0.8). This is right: "river" is the most relevant context for "bank".
Step 2: Stabilize and Normalize
Raw scores can get too large. We scale them by dividing by √(d_k), where d_k is the per-head key/query dimension (not the full embedding dimension). Then we apply softmax to turn the scaled scores into probabilities that sum to 1.
After softmax, "river" ends up with an attention weight of, say, 0.57. It gets more than half the focus; the other tokens get much less. This is attention in action.
Step 3: Blend Values
Multiply each token's Value vector by its attention weight, then sum the results.
Since "river" dominates with 0.57 weight, its value (which encodes geographical meaning) flows heavily into the output. The "bank" token's representation is updated to incorporate this river-related context.
Multi-Head Attention
One attention head might miss important patterns. Transformers use multiple heads in parallel, each with its own Query, Key, and Value matrices.
Each head produces an output vector. They're concatenated and passed through a final weight matrix to produce the final attention output. This way, the model gathers multiple types of information at once: grammar, semantics, long-range dependencies, and more.
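Here's a sketch of that multi-head arrangement, again with random stand-in matrices. Twelve heads of 64 dimensions each is a typical layout for a 768-dimensional model, not something the text prescribes.

```python
# Multi-head attention: several heads in parallel, concatenated, then projected.
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

seq_len, embed_dim, n_heads = 5, 768, 12
head_dim = embed_dim // n_heads                                  # 64 dimensions per head

x = np.random.randn(seq_len, embed_dim)
W_qkv = np.random.randn(n_heads, 3, embed_dim, head_dim) * 0.02  # per-head Q/K/V projections
W_out = np.random.randn(embed_dim, embed_dim) * 0.02             # final mixing matrix

head_outputs = []
for h in range(n_heads):
    Q, K, V = (x @ W_qkv[h, i] for i in range(3))
    weights = softmax(Q @ K.T / np.sqrt(head_dim))
    head_outputs.append(weights @ V)                             # (5, 64) from each head

attended = np.concatenate(head_outputs, axis=-1) @ W_out         # back to (5, 768)
print(attended.shape)
```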
Putting It Together: The Attention Mechanism
Summary: Attention is how tokens learn to focus on relevant parts of the sequence. Queries and Keys determine which tokens attend to which others through similarity scoring. Values carry the actual information. Softmax turns scores into probabilities. Multiple heads in parallel capture different types of relationships. This mechanism allows the model to understand long-range context, resolve ambiguity (like "bank" meaning), and build rich contextual representations.
Learning - How Matrix Multiplications and Attention Create Understanding
We have embeddings now. Each token is a vector of 768 numbers. But these vectors don't mean much at first. They're random. The model must learn what they should represent.
This is where the transformer architecture comes in. Two main mechanisms work together: attention (which you learned about in the previous section) and feedforward networks (made of matrix multiplications). They're the tools the model uses to transform embeddings and discover patterns across billions of training examples.
What Is a Matrix Multiplication?
A matrix multiplication is a mathematical operation that transforms vectors. Think of it as a function: you put a vector in, you get a different vector out.
The matrix contains weights (numbers). When you multiply the input vector by the matrix, you get a new vector. The weights control how the transformation happens.
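Here's the operation in numpy, with a tiny 2×3 matrix so the arithmetic is easy to follow by hand. A real transformer matrix is far larger and its weights are set by training.

```python
# A matrix multiplication: the weight matrix turns one vector into another.
import numpy as np

W = np.array([[2.0, 0.0, 1.0],     # 3 weights feeding the first output number
              [0.0, 1.0, -1.0]])   # 3 weights feeding the second output number

v_in = np.array([1.0, 2.0, 3.0])   # input vector (3 numbers)
v_out = W @ v_in                   # output vector (2 numbers)
print(v_out)                       # [ 5. -1.]
```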
How Transformers Work: Attention + Feedforward
Each layer in a transformer has two parts that work together. First, attention lets every token gather information from the other tokens in the sequence. Then a feedforward network (an MLP built from matrix multiplications) transforms each token's vector on its own.
This cycle repeats in every transformer layer. Each layer also includes the critical components that make depth work: residual connections and layer normalization, which keep the signal stable as layers stack up.
Key distinction: Attention mixes information across all tokens. The MLP mixes features within each token separately. This separation allows the model to attend first, then reason.
Early layers focus on local patterns; later layers use that context for semantics and reasoning.
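Here's a compact sketch of one such layer in PyTorch, assuming the common pre-norm arrangement (normalize, attend, add back; normalize, feed forward, add back). It's a simplified stand-in, not any particular model's exact layer, and choices like GELU and the 4× expansion are typical defaults rather than requirements.

```python
# One transformer layer: attention + feedforward, with residuals and layer norm.
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    def __init__(self, embed_dim: int = 768, n_heads: int = 12):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(                 # the feedforward network
            nn.Linear(embed_dim, 4 * embed_dim),  # expand
            nn.GELU(),                            # nonlinearity
            nn.Linear(4 * embed_dim, embed_dim),  # project back
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)          # tokens mix information across positions
        x = x + attn_out                          # residual connection
        x = x + self.mlp(self.norm2(x))           # each token transformed independently
        return x

layer = TransformerLayer()
tokens = torch.randn(1, 5, 768)                   # batch of 1, 5 tokens, 768 dims
print(layer(tokens).shape)                        # torch.Size([1, 5, 768])
```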
Example: Understanding Context (Attention + Feedforward)
Consider the sentence: "The bank by the river"
The attention layer decides which tokens matter. The feedforward layer learns what those relationships mean. Together, they let the model combine information from "river" into the representation of "bank". This is how context affects meaning.
Learning Through Training
At the start of training, the weight matrices are random. The model makes terrible predictions. But here's how it improves:
Each time the model is wrong, it adjusts the weight matrices slightly using backpropagation. The training process loops over four moves: predict the next token, measure how wrong the prediction was (the loss), trace that error back through every weight (backpropagation), and nudge each weight in the direction that reduces the error.
Over billions of examples, the weights converge to values that predict well.
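A minimal sketch of that loop in PyTorch is below. The "model" here is just an embedding plus a linear head standing in for a full transformer, and the optimizer and learning rate are arbitrary choices, so read it as the shape of training rather than a real recipe.

```python
# Next-token prediction training loop, heavily simplified.
import torch
import torch.nn as nn

vocab_size, embed_dim = 50_000, 768
model = nn.Sequential(nn.Embedding(vocab_size, embed_dim),    # stand-in for a real transformer
                      nn.Linear(embed_dim, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (1, 6))                 # a toy training sequence
inputs, targets = tokens[:, :-1], tokens[:, 1:]               # predict each next token

for step in range(10):                                        # real training: billions of tokens
    logits = model(inputs)                                    # (1, 5, vocab_size) predictions
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                                           # backpropagation: trace the error
    optimizer.step()                                          # nudge every weight, embeddings included
```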
Key insight: The model doesn't learn rules like "adjectives come before nouns." It learns patterns in the weight matrices. These matrices encode everything: grammar, facts, reasoning patterns. All through repeated adjustments from training examples.
Why Multiple Layers?
LLMs don't do just one transformation. They do this hundreds of times, organized into layers. Each layer contains both attention and feedforward networks.
In early layers, attention looks at nearby tokens and feedforward learns simple patterns. In later layers, attention looks across the entire sequence and feedforward learns complex reasoning. Each layer builds on the previous one, gradually extracting more sophisticated understanding from the raw tokens.
Putting It All Together: The Complete Transformer Pipeline
Here's the complete pipeline:
1. Tokenization: text is split into tokens and mapped to IDs
2. Embeddings: each ID is looked up in the embedding table, with positional information added
3. Transformer layers: attention and feedforward networks transform the vectors, layer after layer
4. Prediction: the final vectors become scores over the vocabulary for the next token
Attention and feedforward networks power step 3. They work together, repeatedly, to transform the embeddings. Attention finds patterns (relationships between tokens), feedforward learns what those patterns mean. This repeats 30+ times, building deeper understanding with each layer.
The weights in both attention and feedforward are learned. That's what "training" means: adjusting billions of numbers until the model predicts well on billions of examples.
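As a capstone, here's the whole pipeline compressed into a short PyTorch sketch. It uses stock components, picks an arbitrary depth of 4 layers, and skips the causal mask a real GPT applies, so treat it as a map of the steps rather than a working model.

```python
# Token IDs -> embeddings (+ positions) -> N transformer layers -> next-token scores.
import torch
import torch.nn as nn

vocab_size, embed_dim, n_layers, n_heads, max_len = 50_000, 768, 4, 12, 128

token_embed = nn.Embedding(vocab_size, embed_dim)             # step 2: IDs -> vectors
pos_embed = nn.Embedding(max_len, embed_dim)                  # learned position vectors
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(embed_dim, n_heads, 4 * embed_dim,
                               batch_first=True, norm_first=True)
    for _ in range(n_layers))                                 # step 3: attention + feedforward, repeated
unembed = nn.Linear(embed_dim, vocab_size)                    # step 4: vectors -> scores over the vocab

token_ids = torch.tensor([[15339, 2684]])                     # step 1: already-tokenized "Hello world"
x = token_embed(token_ids) + pos_embed(torch.arange(token_ids.shape[1]))
for layer in layers:
    x = layer(x)
logits = unembed(x)                                           # (1, 2, 50_000)
print(logits.shape)                                           # next-token scores at each position
```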
Summary: Learning happens through two mechanisms working together in every layer: Attention finds what's relevant (which tokens to focus on), and feedforward networks learn what those patterns mean (matrix multiplications). The model adjusts billions of weights across these matrices to make better predictions. Over billions of training examples, these weights encode everything the model knows about language.
The Big Picture: Everything Connected
You now understand the full transformer architecture:
- Tokenization: Text becomes numbers (token IDs)
- Embeddings: Token IDs become vectors (learned representations)
- Attention: Tokens look at each other to find relevant context (Q, K, V dot products)
- Feedforward: Matrix multiplications learn what those relationships mean
- Layers: These two mechanisms repeat 30+ times, each layer building deeper understanding
Every word you feed to an LLM goes through this journey. Attention figures out what matters. Feedforward networks learn what it means. Repeat 30+ times. The magic is in the learned weight matrices and attention patterns, built from billions of training examples.