How LLMs Understand Text — From Tokens to Meaning (Beginner-Friendly)
🤖 The Computer's Language Problem
Computers don’t understand language like we do.
If you say:
“The cat sat on the mat.”
You immediately picture a cat sitting on a mat.
But a computer? It sees this as just a series of symbols. It doesn’t “understand” anything — unless we first convert that sentence into numbers it can work with.
➡️ Text needs to be converted into numbers (tokens) for LLMs to understand.
🧩 Step 1: Tokenization – Breaking Text into Pieces
Tokenization means splitting a sentence into smaller parts (called tokens) and converting each into a unique number (a token ID).
Example:
Sentence: “The cat sat on the mat”
Tokens: ["The", "cat", "sat", "on", "the", "mat"]
Token IDs: [201, 503, 621, 104, 201, 891]
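The splitting step can be sketched as a toy word-level tokenizer. Note that the IDs below are simply assigned in order of first appearance; the IDs in the example above (201, 503, …) are illustrative, not from any real model:

```python
# Toy word-level tokenizer: build a vocabulary, then map words to IDs.
def build_vocab(text):
    vocab = {}
    for word in text.split():
        if word not in vocab:
            vocab[word] = len(vocab)  # assign the next free ID
    return vocab

def tokenize(text, vocab):
    return [vocab[word] for word in text.split()]

vocab = build_vocab("The cat sat on the mat")
ids = tokenize("The cat sat on the mat", vocab)
print(ids)  # [0, 1, 2, 3, 4, 5] — "The" and "the" get different IDs
```

Real tokenizers are case-sensitive in exactly this way, which is one reason two near-identical prompts can cost a different number of tokens.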
⚠️ Why Not Just Use Letters?
Why not break it down into characters like "T", "h", "e"...?
Because of token limits. LLMs can only handle a fixed number of tokens per input (the context window, e.g., 1024 for early GPT models).
Character-based: 22 tokens
Word-based: 6 tokens
➡️ Using full words or subwords is more efficient.
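You can verify the counts above yourself; counting characters (spaces included) versus whitespace-split words:

```python
sentence = "The cat sat on the mat"
print(len(sentence))          # 22 character-level tokens (spaces included)
print(len(sentence.split()))  # 6 word-level tokens
```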
🧠 Types of Tokenization Techniques
1. Word-Level Tokenization
Each word is a token
✅ Simple
❌ Struggles with rare/new words
2. Character-Level Tokenization
Each letter is a token
✅ Handles any input
❌ Long sequences, weak meaning
3. Subword Tokenization (e.g., BPE) 🔥
Byte Pair Encoding (BPE) merges frequently seen character pairs.
“playing” → ["play", "ing"]
“played” → ["play", "ed"]
➡️ Helps handle rare words while keeping token count low.
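Here is a minimal sketch of the BPE merge loop on a three-word toy corpus. Real BPE trains on byte-level corpora with thousands of merges; this version only shows the core idea of repeatedly fusing the most frequent adjacent pair:

```python
from collections import Counter

def most_frequent_pair(words):
    # Count every adjacent symbol pair across all words.
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    # Replace every occurrence of `pair` with a single fused symbol.
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

words = [list("playing"), list("played"), list("plays")]
for _ in range(3):  # three merge rounds: p+l, pl+a, pla+y
    words = merge_pair(words, most_frequent_pair(words))
print(words)  # the shared stem "play" emerges as one token
```

After three merges the shared stem "play" becomes a single symbol in all three words, which is exactly why BPE keeps token counts low for related word forms.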
🔢 Step 2: Vector Embeddings – Giving Meaning to Tokens
Now we have token IDs — but those are just numbers. How do we make the model understand meaning? We use embeddings — each token gets converted into a vector of numbers that captures what it represents.
💡 What Are Embeddings?
Each token becomes a vector — a list of numbers with direction and distance in a high-dimensional space.
"cat" → [0.21, -0.14, 0.88, ...]
"dog" → [0.23, -0.13, 0.85, ...]
"car" → [-0.83, 0.92, -0.45, ...]
➡️ Similar meanings = closer vectors.
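"Closer" is usually measured with cosine similarity. Using the made-up three-dimensional vectors from the example (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1 = same direction, -1 = opposite.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

cat = [0.21, -0.14, 0.88]
dog = [0.23, -0.13, 0.85]
car = [-0.83, 0.92, -0.45]

print(cosine_similarity(cat, dog))  # close to 1.0 -> similar meaning
print(cosine_similarity(cat, car))  # negative -> dissimilar
```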
🧠 Semantic Clustering
Models learn that:
- “cat”, “dog”, “animal” often appear together → animal cluster
- “car”, “truck”, “bus” → vehicle cluster
- “ran”, “sat”, “jumped” → action cluster
🔧 How Are Embeddings Made?
An embedding layer (part of the LLM itself, or a separate model such as word2vec) is trained on huge datasets to:
- Learn co-occurrence patterns
- Assign similar vectors to related tokens
📌 Embeddings = meaning captured as numbers
⏳ Step 3: Positional Encoding – Understanding Word Order
Just knowing the words isn’t enough — the order matters.
“The cat chased the dog”
vs
“The dog chased the cat”
➡️ Same words, different meanings!
🧭 What is Positional Encoding?
We add a position vector to each word's embedding to tell the model where the word appears.
Example:
- “The” = base embedding
- position(1) = first word
- Final vector = word vector + position vector
🧮 Types of Positional Encoding
- Learned → Model learns position embeddings during training
- Sinusoidal → Uses fixed sine/cosine waves to generalize position across long inputs
➡️ Both help the model understand sentence structure.
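The sinusoidal variant (from the original Transformer paper) can be sketched in a few lines. Even dimensions use sine and odd dimensions use cosine, each pair at a different frequency, so every position gets a unique fingerprint:

```python
import math

def positional_encoding(position, d_model):
    # Dimension pair 2k uses sin, 2k+1 uses cos, at frequency 1/10000^(2k/d_model).
    pe = []
    for i in range(d_model):
        angle = position / (10000 ** ((i - i % 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

d_model = 8
first = positional_encoding(0, d_model)
second = positional_encoding(1, d_model)
print(first)  # position 0: all sin terms are 0, all cos terms are 1
# The model's input is then word_embedding + position_vector, element-wise.
```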
🔁 Recap: What Happens to Your Text?
Let’s go through this with:
“The cat sat on the mat”
- Tokenization → ["The", "cat", "sat", ...] → [201, 503, 621, ...]
- Embeddings → Each token turns into a vector of meaning
- Positional Encoding → Order is added to each embedding
➡️ The result? A sequence of meaningful, position-aware vectors ready for deeper processing.
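The whole pipeline fits in a few lines. Everything here is a toy: the vocabulary, the fake embedding function, and the tiny 4-dimensional vectors are invented for illustration, but the three steps are the real ones:

```python
import math

vocab = {"The": 0, "cat": 1, "sat": 2, "on": 3, "the": 4, "mat": 5}
d_model = 4

def embed(token_id):
    # Deterministic fake embedding; a real model looks this up in a trained table.
    return [math.sin(token_id + i) for i in range(d_model)]

def position_vector(pos):
    # Simplified sinusoidal position signal.
    return [math.sin(pos / 10000 ** (i / d_model)) for i in range(d_model)]

sentence = "The cat sat on the mat"
ids = [vocab[w] for w in sentence.split()]            # 1. tokenization
inputs = [
    [e + p for e, p in zip(embed(t), position_vector(pos))]
    for pos, t in enumerate(ids)                      # 2. embed + 3. add position
]
print(len(inputs), len(inputs[0]))  # 6 position-aware vectors of size 4
```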
💡 Why This Matters
- 🧠 Prompt Better: Know how to write efficient prompts
- ⏳ Save Space: Keep inputs short and clear
- 🔍 Understand Context: See how meaning and order are captured
- 🚀 Get Better Outputs: Know what the model "sees" under the hood
✅ Key Takeaways
✔️ Tokenization splits text
✔️ Embeddings give meaning
✔️ Positional encoding adds order
✔️ All combine to help LLMs understand your input
🔮 What’s Next?
Now that the input is processed...
🚨 How does the model decide what’s important in the sentence?
🚨 How does it focus on “cat” when answering your question?
That’s where Self-Attention and Transformers come in — the real magic behind LLMs.
✨ We’ll explore these core ideas in the next blog, where you’ll learn how LLMs actually “think” and generate responses.
Thanks for reading! 🙏