Module 1: What is an LLM?
Before you build with AI, you need to understand what it actually is. Not the hype version. The real one.
This is Module 1 of a 12-part curriculum: Build Software Products with AI — From First Principles to Production Pipeline.
Everyone is using LLMs. Most people have no idea how they work. That’s fine for casual use. It’s a problem when you’re building with them.
If you don’t understand what an LLM actually is, you’ll hit walls you can’t explain, make architectural decisions that don’t make sense, and write prompts that work by accident rather than by design.
This module fixes that. No maths. No academic abstractions. Just a clean mental model you can build on.
The One-Sentence Version
A Large Language Model is a system trained to predict the most probable next token given everything it’s seen so far.
That’s it. Everything else — the apparent intelligence, the reasoning, the creativity — emerges from doing that prediction at enormous scale, across enormous amounts of human-generated text.
Let’s unpack what that actually means.
Input: "The cat sat"

  1. tokenise          → [The] [cat] [sat]
  2. read context      → everything in the context window
  3. score all tokens  → "on"   = 0.42  ←── sampled ✓
                         "down" = 0.18
                         "away" = 0.09
                         ...
  4. append + repeat   → "The cat sat on"

This repeats for every token until a stop condition is reached. There is no understanding — only prediction, done at enormous scale.

Temperature controls step 4: low = always top token, high = sample broadly. The context window controls step 2: tokens outside the window are invisible.
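The loop above can be sketched in a few lines of code. This is a toy with a hand-written two-entry "model", not a real one: a real LLM scores its entire vocabulary at every step, using learned weights rather than a lookup.

```python
import random

# Toy predict-sample-append loop. The probabilities are made up;
# a real model computes a score for every token in its vocabulary.
def next_token_scores(context):
    # Stand-in for a real model's forward pass over the full context.
    if context[-1] == "sat":
        return {"on": 0.42, "down": 0.18, "away": 0.09, "there": 0.31}
    return {".": 1.0}

def generate(context, max_tokens=5, temperature=0.0):
    context = list(context)
    for _ in range(max_tokens):
        scores = next_token_scores(context)
        if temperature == 0:
            token = max(scores, key=scores.get)                 # always the top token
        else:
            tokens, weights = zip(*scores.items())
            token = random.choices(tokens, weights=weights)[0]  # sample broadly
        context.append(token)                                   # append and repeat
        if token == ".":                                        # stop condition
            break
    return context

print(generate(["The", "cat", "sat"]))  # ['The', 'cat', 'sat', 'on', '.']
```

Note that the only place randomness enters is the sampling branch; everything else is a deterministic loop.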
What is a Token?
Before anything else, you need to understand tokens. They are the atomic unit of everything an LLM processes.
A token is roughly ¾ of a word. The word “building” is one token. “Unbelievable” might be two. Punctuation, spaces, and special characters each consume tokens too.
Why does this matter practically?
- Cost. Every API call to a model like Claude or GPT-4 is priced per token — both input (what you send) and output (what it returns). A prompt with 10,000 tokens costs roughly 10× more than one with 1,000.
- Limits. Every model has a context window — a maximum number of tokens it can hold in memory at once. Claude Sonnet’s context window is 200,000 tokens. GPT-4 Turbo is 128,000. Once you hit the limit, the model can’t “see” earlier parts of the conversation.
- Truncation. If your prompt + history + expected response exceeds the context window, something gets cut. Usually the oldest content. This is why long-running conversations degrade — the model literally forgets the beginning.
Quick rule of thumb: 1,000 tokens ≈ 750 words ≈ 1.5 pages of text. Internalise this and you’ll write better prompts, design better systems, and predict costs accurately.
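That rule of thumb turns directly into a back-of-envelope estimator. The per-token prices below are placeholders, not real rates: check your provider's pricing page for current numbers, and use the provider's tokenizer for exact counts.

```python
# Back-of-envelope token and cost estimation using the ~3/4-word rule.
def estimate_tokens(text: str) -> int:
    words = len(text.split())
    return round(words / 0.75)  # ~4/3 tokens per word

def estimate_cost(input_tokens: int, output_tokens: int,
                  price_in_per_mtok: float = 3.0,      # $ per million input tokens (placeholder)
                  price_out_per_mtok: float = 15.0):   # $ per million output tokens (placeholder)
    return (input_tokens * price_in_per_mtok +
            output_tokens * price_out_per_mtok) / 1_000_000

print(estimate_tokens("word " * 750))   # 750 words -> ~1000 tokens
print(estimate_cost(10_000, 1_000))     # input dominates long prompts
```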
How Does a Transformer Actually Work?
You don’t need to understand the mathematics. But you do need the intuition.
A transformer is the neural network architecture that powers every major LLM. The key innovation is the attention mechanism — the model’s ability to consider relationships between all tokens in the input simultaneously, not just sequentially.
When the model reads “the cat sat on the mat because it was tired,” the attention mechanism lets it figure out that “it” refers to “the cat,” not “the mat” — by weighing the relationship between every word and every other word.
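Here is a minimal sketch of that weighing, using scaled dot-product attention over tiny hand-made vectors. Real models learn high-dimensional vectors from data; these 2-D ones are chosen by hand purely to make "it" land on "cat".

```python
import math

# Toy scaled dot-product attention: one query attends over two keys.
def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    d = len(query)
    # Dot product of the query with each key, scaled by sqrt(dimension).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    return softmax(scores)

words = ["cat", "mat"]
keys  = [[1.0, 0.2],   # "cat" (hand-picked to sit close to the query)
         [0.1, 0.9]]   # "mat"
query = [1.0, 0.1]     # "it"

weights = attention_weights(query, keys)
for word, w in zip(words, weights):
    print(f"{word}: {w:.2f}")   # "it" puts most of its weight on "cat"
```

In a real transformer this happens for every token against every other token, across many attention heads in parallel.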
Training works like this: take hundreds of billions of words of human text. Repeatedly hide a token and ask the model to predict what it should be. Adjust the model’s weights based on whether it was right. Do this billions of times.
The result is a model that has, through sheer repetition, built an extraordinarily dense internal representation of how human language, ideas, and concepts relate to each other. It doesn’t “know” things the way a database does. It has compressed statistical patterns about how ideas tend to co-occur in human writing.
This distinction matters. An LLM doesn’t look things up. It recalls patterns. That’s why it can be confident and wrong simultaneously — a pattern can be strong without being accurate.
What is Temperature?
Temperature controls how much randomness is injected into the model’s predictions.
At temperature 0, the model always picks the most probable next token. Output is effectively deterministic and consistent. Good for tasks where you want reliable, repeatable results — code generation, data extraction, structured output.
At temperature 1, the model samples from the probability distribution more freely. Output varies between runs. Good for creative writing, brainstorming, generating diverse options.
At temperature 2+, output becomes increasingly chaotic and incoherent.
In practice, most production applications run between 0 and 0.7. The right setting depends on the task. Analytical and factual tasks: low. Creative and generative tasks: higher.
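The effect is easy to see directly. A minimal sketch, with made-up logits standing in for a real model's scores:

```python
import math

# How temperature reshapes a next-token distribution before sampling.
# The logits are illustrative; a real model produces one per vocabulary token.
def apply_temperature(logits, temperature):
    if temperature == 0:  # greedy: all probability mass on the top token
        top = max(range(len(logits)), key=logits.__getitem__)
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
for t in (0, 0.2, 1.0, 2.0):
    print(t, [round(p, 2) for p in apply_temperature(logits, t)])
# Low temperature concentrates mass on the top token;
# high temperature flattens the distribution toward uniform.
```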
What an LLM is NOT
This is as important as what it is.
It is not a search engine. A search engine indexes documents and retrieves them by keyword or semantic match. An LLM generates text. It doesn’t fetch sources. When it cites something, it’s pattern-matching on how citations tend to look in training data — not actually looking the source up. This is the root cause of hallucinations.
It is not a database. It doesn’t store facts you can reliably query. It stores compressed statistical patterns. Ask it “what is 2+2” and it will likely say 4 — not because it computed it, but because “2+2=4” is an overwhelmingly dominant pattern in its training data. Ask it for a specific figure from an obscure 2019 report and you’ll often get a confident fabrication.
It is not fully deterministic. Even at temperature 0, identical outputs across runs are not guaranteed, and minor differences in how you phrase a prompt can produce materially different results. This is a feature (flexibility) and a bug (unreliability). Plan for it.
It does not have memory by default. Each API call starts fresh. The model has no recollection of previous conversations unless you explicitly inject that history into the context window. This has profound implications for building agents — which we’ll get to later (see Module 4).
Context is Everything
If there’s one thing to take from this module, it’s this: the model only knows what’s in its context window.
The context window is everything the model can “see” at the time of inference: your system prompt, the conversation history, any documents you’ve attached, the tools you’ve described. Nothing else.
There is no background knowledge being fetched. There is no persistent memory being accessed. There is no “model brain” somewhere maintaining state between calls.
When you build with LLMs, your primary job is context management — deciding what information goes into the window, in what form, and in what order. Get this right and the model performs remarkably well. Get it wrong and you’ll blame the model for failures that are actually your architecture’s fault.
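A sketch of the simplest context-management policy: keep the system prompt, then fill the remaining token budget with the newest turns first, dropping the oldest. Token counts here use the rough ¾-word rule rather than a real tokenizer, which a production system would use instead.

```python
# Fit system prompt + history into a fixed token budget, newest-first.
def rough_tokens(text: str) -> int:
    return round(len(text.split()) / 0.75)  # ~4/3 tokens per word

def fit_to_budget(system_prompt, history, budget):
    kept = []
    remaining = budget - rough_tokens(system_prompt)  # system prompt always stays
    for turn in reversed(history):                    # walk newest to oldest
        cost = rough_tokens(turn)
        if cost > remaining:
            break                                     # older turns get dropped
        kept.append(turn)
        remaining -= cost
    return [system_prompt] + list(reversed(kept))

history = ["turn one " * 20, "turn two " * 20, "turn three " * 20]
window = fit_to_budget("Be concise.", history, budget=150)
print(len(window) - 1)   # only the newest turns fit; the oldest fell out
```

Dropping oldest-first is the crudest policy; summarising old turns or retrieving only relevant ones are common refinements, but the budget constraint is the same.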
Why This Matters for Building
A few direct implications for everything that follows in this curriculum:
- Prompts are code. They’re the primary interface between your intent and the model’s output. Writing a vague prompt is like writing a function with no specification. You’ll get output — just not reliably the output you want.
- Context windows are your RAM. They’re finite and expensive. Everything you stuff into them has a cost and a displacement effect. Be intentional.
- Hallucination is a structural property, not a bug to be patched. The model is trained to produce plausible completions, not verified facts. Design your systems accordingly — treat model output as a first draft that needs validation, not ground truth.
- Model choice is an architectural decision. Different models have different context lengths, cost profiles, capability levels, and latency characteristics. You’ll pick the right tool for the right job once you understand these tradeoffs.
A Note on Scale
The word “large” in Large Language Model is doing real work. Modern frontier models like Claude 3.5 Sonnet or GPT-4o are estimated to have hundreds of billions of parameters — numerical weights that encode the patterns learned during training.
Scale is why capabilities emerged that nobody fully predicted. Reasoning, code generation, instruction following, multi-step problem solving — these weren’t individually engineered. They emerged from training large enough models on enough data.
This also means you can’t fully predict what a model can or can’t do by reading a spec sheet. You have to probe it empirically. Capability boundaries are blurry and shift with every new model version.
What’s Next
You now have the mental model. An LLM is a token prediction engine trained on human text, with a finite context window, no persistent memory, and a tendency to confabulate when it doesn’t know something.
Next, in Module 2, we go one level up: how to actually talk to these things. System prompts, context injection, few-shot patterns, and why prompt engineering is really just software engineering with a different syntax.
Further Reading
- [YouTube] Intro to Large Language Models — Andrej Karpathy — The definitive 1-hour primer. Karpathy explains the full stack from tokens to RLHF with unusual clarity. Watch this before anything else.
- [Paper] Attention Is All You Need — Vaswani et al., 2017 — The original transformer paper. You don’t need to read all of it, but the abstract and architecture diagram are worth your time.
- [Tool] OpenAI Tokenizer — Paste any text and see exactly how it tokenises. The fastest way to build intuition about token costs.
- [Docs] Anthropic Model Documentation — Current context windows, pricing, and capability comparisons across Claude models.
- [Blog] Simon Willison on LLMs — The most consistently useful practitioner writing on LLMs. Sceptical, rigorous, and genuinely experienced.
- [Course] MIT 6.S191 - Intro to Deep Learning — MIT’s annual deep learning course. Covers transformers, LLMs, and generative AI. Free YouTube lectures, updated each year. One of the best structured entry points to the full stack.
- [Course] Stanford CS229 - Machine Learning — Andrew Ng’s original ML course. Full lecture notes free. Solid grounding in the statistical foundations that underpin everything LLMs are built on.
- [Course] Harvard CS50’s Introduction to AI with Python — The best entry-level AI course for people new to CS. Covers search, language, and neural nets. Approachable and rigorous in equal measure.
- [Course] fast.ai - Practical Deep Learning for Coders — Bottom-up, code-first. Jeremy Howard’s philosophy: build something that works first, then understand why. Excellent if you learn by doing rather than reading.
- [Course] Stanford CS25 - Transformers United — Entirely focused on transformers and modern foundation models. Guest lectures from the people who actually built them. Worth watching for the practitioner perspective.