Module 1: What is an LLM?
Before you build with AI, you need to understand what it actually is. Not the hype version. The real one.
This is Module 1 of a 12-part curriculum: Build Software Products with AI — From First Principles to Production Pipeline.
Everyone is using LLMs. Most people have no idea how they work. That’s fine for casual use. It’s a problem when you’re building with them.
If you don’t understand what an LLM actually is, you’ll hit walls you can’t explain, make architectural decisions that don’t make sense, and write prompts that work by accident rather than by design.
This module fixes that. No maths. No academic abstractions. Just a clean mental model you can build on.
The One-Sentence Version
A Large Language Model is a system trained to predict the most probable next token given everything it’s seen so far.
That’s it. Everything else — the apparent intelligence, the reasoning, the creativity — emerges from doing that prediction at enormous scale, across enormous amounts of human-generated text.
Let’s unpack what that actually means.
Input: "The cat sat"

  1. tokenise          → [The] [cat] [sat]
  2. read context      → everything in the context window
  3. score all tokens  → "on"   = 0.42  ←── sampled ✓
                         "down" = 0.18
                         "away" = 0.09
                         ...
  4. append + repeat   → "The cat sat on"

This repeats for every token until a stop condition is reached. There is no understanding — only prediction, done at enormous scale.

Temperature controls step 4: low = always top token, high = sample broadly. The context window controls step 2: tokens outside the window are invisible.
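The loop above can be sketched in a few lines of code. This is a toy with a hand-written two-entry "model", not a real one: a real LLM scores its entire vocabulary at every step, using learned weights rather than a lookup.

```python
import random

# Toy predict-sample-append loop. The probabilities are made up;
# a real model computes a score for every token in its vocabulary.
def next_token_scores(context):
    # Stand-in for a real model's forward pass over the full context.
    if context[-1] == "sat":
        return {"on": 0.42, "down": 0.18, "away": 0.09, "there": 0.31}
    return {".": 1.0}

def generate(context, max_tokens=5, temperature=0.0):
    context = list(context)
    for _ in range(max_tokens):
        scores = next_token_scores(context)
        if temperature == 0:
            token = max(scores, key=scores.get)                 # always the top token
        else:
            tokens, weights = zip(*scores.items())
            token = random.choices(tokens, weights=weights)[0]  # sample broadly
        context.append(token)                                   # append and repeat
        if token == ".":                                        # stop condition
            break
    return context

print(generate(["The", "cat", "sat"]))  # ['The', 'cat', 'sat', 'on', '.']
```

Note that the only place randomness enters is the sampling branch; everything else is a deterministic loop.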
What is a Token?
Before anything else, you need to understand tokens. They are the atomic unit of everything an LLM processes.
A token is roughly ¾ of a word. The word “building” is one token. “Unbelievable” might be two. Punctuation, spaces, and special characters each consume tokens too.
Why does this matter practically?
- Cost. Every API call to a model like Claude or GPT-4 is priced per token — both input (what you send) and output (what it returns). A prompt with 10,000 tokens costs roughly 10× more than one with 1,000.
- Limits. Every model has a context window — a maximum number of tokens it can hold in memory at once. Claude Sonnet’s context window is 200,000 tokens. GPT-4 Turbo is 128,000. Once you hit the limit, the model can’t “see” earlier parts of the conversation.
- Truncation. If your prompt + history + expected response exceeds the context window, something gets cut. Usually the oldest content. This is why long-running conversations degrade — the model literally forgets the beginning.
Quick rule of thumb: 1,000 tokens ≈ 750 words ≈ 1.5 pages of text. Internalise this and you’ll write better prompts, design better systems, and predict costs accurately.
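That rule of thumb turns directly into a back-of-envelope estimator. The per-token prices below are placeholders, not real rates: check your provider's pricing page for current numbers, and use the provider's tokenizer for exact counts.

```python
# Back-of-envelope token and cost estimation using the ~3/4-word rule.
def estimate_tokens(text: str) -> int:
    words = len(text.split())
    return round(words / 0.75)  # ~4/3 tokens per word

def estimate_cost(input_tokens: int, output_tokens: int,
                  price_in_per_mtok: float = 3.0,      # $ per million input tokens (placeholder)
                  price_out_per_mtok: float = 15.0):   # $ per million output tokens (placeholder)
    return (input_tokens * price_in_per_mtok +
            output_tokens * price_out_per_mtok) / 1_000_000

print(estimate_tokens("word " * 750))   # 750 words -> ~1000 tokens
print(estimate_cost(10_000, 1_000))     # input dominates long prompts
```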
How Does a Transformer Actually Work?
You don’t need to understand the mathematics. But you do need the intuition.
A transformer is the neural network architecture that powers every major LLM. The key innovation is the attention mechanism — the model’s ability to consider relationships between all tokens in the input simultaneously, not just sequentially.
When the model reads “the cat sat on the mat because it was tired,” the attention mechanism lets it figure out that “it” refers to “the cat,” not “the mat” — by weighing the relationship between every word and every other word.
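Here is a minimal sketch of that weighing, using scaled dot-product attention over tiny hand-made vectors. Real models learn high-dimensional vectors from data; these 2-D ones are chosen by hand purely to make "it" land on "cat".

```python
import math

# Toy scaled dot-product attention: one query attends over two keys.
def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    d = len(query)
    # Dot product of the query with each key, scaled by sqrt(dimension).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    return softmax(scores)

words = ["cat", "mat"]
keys  = [[1.0, 0.2],   # "cat" (hand-picked to sit close to the query)
         [0.1, 0.9]]   # "mat"
query = [1.0, 0.1]     # "it"

weights = attention_weights(query, keys)
for word, w in zip(words, weights):
    print(f"{word}: {w:.2f}")   # "it" puts most of its weight on "cat"
```

In a real transformer this happens for every token against every other token, across many attention heads in parallel.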
Training works like this: take hundreds of billions of words of human text. Repeatedly hide a token and ask the model to predict what it should be. Adjust the model’s weights based on whether it was right. Do this billions of times.
The result is a model that has, through sheer repetition, built an extraordinarily dense internal representation of how human language, ideas, and concepts relate to each other. It doesn’t “know” things the way a database does. It has compressed statistical patterns about how ideas tend to co-occur in human writing.
This distinction matters. An LLM doesn’t look things up. It recalls patterns. That’s why it can be confident and wrong simultaneously — a pattern can be strong without being accurate.
What is Temperature?
Temperature controls how much randomness is injected into the model’s predictions.
At temperature 0, the model always picks the most probable next token. Output is effectively deterministic and consistent. Good for tasks where you want reliable, repeatable results — code generation, data extraction, structured output.
At temperature 1, the model samples from the probability distribution more freely. Output varies between runs. Good for creative writing, brainstorming, generating diverse options.
At temperature 2+, output becomes increasingly chaotic and incoherent.
In practice, most production applications run between 0 and 0.7. The right setting depends on the task. Analytical and factual tasks: low. Creative and generative tasks: higher.
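The effect is easy to see directly. A minimal sketch, with made-up logits standing in for a real model's scores:

```python
import math

# How temperature reshapes a next-token distribution before sampling.
# The logits are illustrative; a real model produces one per vocabulary token.
def apply_temperature(logits, temperature):
    if temperature == 0:  # greedy: all probability mass on the top token
        top = max(range(len(logits)), key=logits.__getitem__)
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
for t in (0, 0.2, 1.0, 2.0):
    print(t, [round(p, 2) for p in apply_temperature(logits, t)])
# Low temperature concentrates mass on the top token;
# high temperature flattens the distribution toward uniform.
```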
What an LLM is NOT
This is as important as what it is.
It is not a search engine. A search engine indexes documents and retrieves them by keyword or semantic match. An LLM generates text. It doesn’t fetch sources. When it cites something, it’s pattern-matching on how citations tend to look in training data — not actually looking the source up. This is the root cause of hallucinations.
It is not a database. It doesn’t store facts you can reliably query. It stores compressed statistical patterns. Ask it “what is 2+2” and it will likely say 4 — not because it computed it, but because “2+2=4” is an overwhelmingly dominant pattern in its training data. Ask it for a specific figure from an obscure 2019 report and you’ll often get a confident fabrication.
It is not fully deterministic. Even at temperature 0, identical outputs across runs are not guaranteed, and minor differences in how you phrase a prompt can produce materially different results. This is a feature (flexibility) and a bug (unreliability). Plan for it.
It does not have memory by default. Each API call starts fresh. The model has no recollection of previous conversations unless you explicitly inject that history into the context window. This has profound implications for building agents — which we’ll get to later (see Module 4).
Context is Everything
If there’s one thing to take from this module, it’s this: the model only knows what’s in its context window.
The context window is everything the model can “see” at the time of inference: your system prompt, the conversation history, any documents you’ve attached, the tools you’ve described. Nothing else.
There is no background knowledge being fetched. There is no persistent memory being accessed. There is no “model brain” somewhere maintaining state between calls.
When you build with LLMs, your primary job is context management — deciding what information goes into the window, in what form, and in what order. Get this right and the model performs remarkably well. Get it wrong and you’ll blame the model for failures that are actually your architecture’s fault.
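A sketch of the simplest context-management policy: keep the system prompt, then fill the remaining token budget with the newest turns first, dropping the oldest. Token counts here use the rough ¾-word rule rather than a real tokenizer, which a production system would use instead.

```python
# Fit system prompt + history into a fixed token budget, newest-first.
def rough_tokens(text: str) -> int:
    return round(len(text.split()) / 0.75)  # ~4/3 tokens per word

def fit_to_budget(system_prompt, history, budget):
    kept = []
    remaining = budget - rough_tokens(system_prompt)  # system prompt always stays
    for turn in reversed(history):                    # walk newest to oldest
        cost = rough_tokens(turn)
        if cost > remaining:
            break                                     # older turns get dropped
        kept.append(turn)
        remaining -= cost
    return [system_prompt] + list(reversed(kept))

history = ["turn one " * 20, "turn two " * 20, "turn three " * 20]
window = fit_to_budget("Be concise.", history, budget=150)
print(len(window) - 1)   # only the newest turns fit; the oldest fell out
```

Dropping oldest-first is the crudest policy; summarising old turns or retrieving only relevant ones are common refinements, but the budget constraint is the same.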
Why This Matters for Building
A few direct implications for everything that follows in this curriculum:
- Prompts are code. They’re the primary interface between your intent and the model’s output. Writing a vague prompt is like writing a function with no specification. You’ll get output — just not reliably the output you want.
- Context windows are your RAM. They’re finite and expensive. Everything you stuff into them has a cost and a displacement effect. Be intentional.
- Hallucination is a structural property, not a bug to be patched. The model is trained to produce plausible completions, not verified facts. Design your systems accordingly — treat model output as a first draft that needs validation, not ground truth.
- Model choice is an architectural decision. Different models have different context lengths, cost profiles, capability levels, and latency characteristics. You’ll pick the right tool for the right job once you understand these tradeoffs.
A Note on Scale
The word “large” in Large Language Model is doing real work. Modern frontier models like Claude 3.5 Sonnet or GPT-4o are estimated to have hundreds of billions of parameters — numerical weights that encode the patterns learned during training.
Scale is why capabilities emerged that nobody fully predicted. Reasoning, code generation, instruction following, multi-step problem solving — these weren’t individually engineered. They emerged from training large enough models on enough data.
This also means you can’t fully predict what a model can or can’t do by reading a spec sheet. You have to probe it empirically. Capability boundaries are blurry and shift with every new model version.
What’s Next
You now have the mental model. An LLM is a token prediction engine trained on human text, with a finite context window, no persistent memory, and a tendency to confabulate when it doesn’t know something.
Next, in Module 2, we go one level up: how to actually talk to these things. System prompts, context injection, few-shot patterns, and why prompt engineering is really just software engineering with a different syntax.
Further Reading
- [YouTube] Intro to Large Language Models — Andrej Karpathy — The definitive 1-hour primer. Karpathy explains the full stack from tokens to RLHF with unusual clarity. Watch this before anything else.
- [Paper] Attention Is All You Need — Vaswani et al., 2017 — The original transformer paper. You don’t need to read all of it, but the abstract and architecture diagram are worth your time.
- [Tool] OpenAI Tokenizer — Paste any text and see exactly how it tokenises. The fastest way to build intuition about token costs.
- [Docs] Anthropic Model Documentation — Current context windows, pricing, and capability comparisons across Claude models.
- [Blog] Simon Willison on LLMs — The most consistently useful practitioner writing on LLMs. Sceptical, rigorous, and genuinely experienced.
- [Course] MIT 6.S191 - Intro to Deep Learning — MIT’s annual deep learning course. Covers transformers, LLMs, and generative AI. Free YouTube lectures, updated each year. One of the best structured entry points to the full stack.
- [Course] Stanford CS229 - Machine Learning — Andrew Ng’s original ML course. Full lecture notes free. Solid grounding in the statistical foundations that underpin everything LLMs are built on.
- [Course] Harvard CS50’s Introduction to AI with Python — The best entry-level AI course for people new to CS. Covers search, language, and neural nets. Approachable and rigorous in equal measure.
- [Course] fast.ai - Practical Deep Learning for Coders — Bottom-up, code-first. Jeremy Howard’s philosophy: build something that works first, then understand why. Excellent if you learn by doing rather than reading.
- [Course] Stanford CS25 - Transformers United — Entirely focused on transformers and modern foundation models. Guest lectures from the people who actually built them. Worth watching for the practitioner perspective.