50 Minutes | Theory + Concepts + Discussion
What is an LLM? — The big picture (8 min)
Tokens & Training — How LLMs actually work (12 min)
Context Windows — AI's short-term memory (8 min)
Why LLMs Fail — Hallucinations & other limits (15 min)
Discussion & Exit Ticket (7 min)
Who has used ChatGPT, Claude, or Gemini? What did you use it for?
And — has anyone ever seen it give a wrong or weird answer?
By the end of today you'll understand exactly why those wrong answers happen — and be able to explain it to someone else.
An AI trained on massive amounts of text to predict what word (or token) comes next
💡 Think: autocomplete on your phone — but trained on billions of pages of text and vastly more sophisticated
ChatGPT — by OpenAI
Claude — by Anthropic
Gemini — by Google
Llama — by Meta (open source)
All of these work on the same core principle.
1 token ≈ ¾ of a word, or about 4 characters
= 2 tokens
= 3 tokens
un · believ · able = 4 tokens
≈ 5–6 tokens
Guess the token count: "The quick brown fox jumps over the lazy dog"
Answer: ~10 tokens. Why does this matter? Every model has a token limit.
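The ¾-word (≈4 characters) rule of thumb can be turned into a quick sketch. This is only an approximation, not a real tokenizer — actual models use learned byte-pair encodings, so real counts will differ slightly:

```python
# Rough token estimate using the ~4-characters-per-token rule of thumb.
# This approximates what a real BPE tokenizer would produce; it is NOT exact.
def estimate_tokens(text: str) -> int:
    return max(1, round(len(text) / 4))

sentence = "The quick brown fox jumps over the lazy dog"
print(estimate_tokens(sentence))  # 43 characters -> ~11 tokens (real tokenizers give ~9-10)
```

Students can try their own sentences and compare the estimate against an online tokenizer demo.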
If a model costs $0.01 per 1,000 tokens, and your app sends 500 messages a day averaging 200 tokens each — how much does that cost per month?
Answer: 500 × 200 = 100,000 tokens/day × 30 = 3M tokens = $30/month. Real product design involves this kind of thinking.
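The same arithmetic can be written out as a tiny cost calculator. The price and usage figures are the activity's assumptions, not real pricing:

```python
# Monthly token-cost arithmetic from the activity (assumed numbers, not real pricing).
PRICE_PER_1K_TOKENS = 0.01  # dollars per 1,000 tokens
MESSAGES_PER_DAY = 500
TOKENS_PER_MESSAGE = 200
DAYS_PER_MONTH = 30

tokens_per_month = MESSAGES_PER_DAY * TOKENS_PER_MESSAGE * DAYS_PER_MONTH
monthly_cost = tokens_per_month / 1000 * PRICE_PER_1K_TOKENS
print(f"{tokens_per_month:,} tokens -> ${monthly_cost:.2f}/month")  # 3,000,000 tokens -> $30.00/month
```

Changing any one constant shows students how quickly costs scale — double the message length and the bill doubles too.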
"If we train an LLM on internet data — what problems could that cause?"
Expected: bias, misinformation, offensive content, outdated information, underrepresentation of some languages
"The cat sat on the ___"
The model guesses: mat, floor, chair, table
It compares its guess with the word that actually came next and adjusts its internal weights toward the statistically likely choices.
It repeats this billions of times across billions of sentences.
The model is learning statistical patterns, not facts about the world. This distinction explains almost every limitation we'll cover today.
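The "statistical patterns, not facts" point can be made concrete with a toy demo. A real LLM uses a neural network over tokens; this bigram counter only mimics the core idea — predicting the next word from frequency alone, with no understanding:

```python
from collections import Counter, defaultdict

# Toy "next-word predictor": count which word follows which in a tiny corpus.
# This is frequency counting, not a neural network - but the principle
# (plausible continuation, not truth) is the same one LLMs scale up.
corpus = (
    "the cat sat on the mat . "
    "the dog sat on the floor . "
    "the cat slept on the mat ."
).split()

following = defaultdict(Counter)
for word, nxt in zip(corpus, corpus[1:]):
    following[word][nxt] += 1

# The statistically likely continuations of "the cat sat on the ___":
print(following["the"].most_common(2))
```

Note the model never "knows" what a cat or a mat is — it only knows which words tend to follow which, which is exactly why scaled-up versions can produce fluent text that is confidently wrong.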
After basic training, companies specialise the model for specific tasks or behaviours:
Fine-tuned to be helpful, harmless, and honest in conversation
Fine-tuned specifically for writing and explaining code
Fine-tuned on clinical notes, research papers, patient data
You'd fine-tune (or prompt-engineer) for your specific use case
🧠 Imagine a friend who can only remember the last 10 sentences you said — everything before that is gone.
Example: You have a 10,000-word tutoring session with a model that has an 8,000-word limit.
→ The model silently forgets the first 2,000 words.
→ If you refer back to something from the start, the model has no idea what you mean.
You're building an AI study buddy. A student uses it for 2 hours. How do you stop the model losing important context?
Ideas to draw out: summarise key facts at intervals, save student name/topic/goals separately, prompt the model to recap periodically
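One of those ideas — keeping the conversation inside a fixed window by dropping the oldest turns — can be sketched in a few lines. The token limit and the 4-characters-per-token estimate are assumptions for illustration; a real app would use the model's own tokenizer and documented limits:

```python
# Minimal sketch: keep a chat history inside a fixed context window by
# dropping the oldest messages. Token counts are estimated (~4 chars/token);
# a real product would use the model's actual tokenizer and limit.
CONTEXT_LIMIT = 8000  # hypothetical token limit

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def trim_history(messages: list[str], limit: int = CONTEXT_LIMIT) -> list[str]:
    kept, used = [], 0
    for msg in reversed(messages):  # walk newest-first, keeping what fits
        cost = estimate_tokens(msg)
        if used + cost > limit:
            break                   # everything older is silently dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```

A fuller study-buddy design would also prepend a running summary of the dropped turns (plus saved facts like the student's name and goals), so the "forgotten" context survives in compressed form.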
The AI is not lying. It has no concept of truth. It is always doing the same thing — predicting the most plausible next token. Sometimes that token is wrong.
Lawyers submitted a brief with completely made-up court cases that ChatGPT generated
Students receive realistic-looking but entirely fabricated academic citations
Pattern matching without understanding
LLM sees "[Country]'s capital is [City]" enough times. Ask about a made-up country — it invents a plausible city.
Gaps in training data
If the event, place, or person isn't well represented in training data, the model guesses from similar things it does know.
Ambiguous prompts
"Tell me about the Paris incident" — which one? The model assumes and presents assumptions as facts.
I'll show you 3 statements. For each: Real fact or Hallucination? Discuss with a partner for 30 seconds, then we vote.
Statement 1: "The Eiffel Tower was built between 1887 and 1889." → ✓ Real
Statement 2: "The Great Wall of China is visible from space with the naked eye." → ✗ Myth — widely repeated, so LLMs repeat it too
Statement 3: "Einstein won the Nobel Prize for his theory of relativity." → ⚠️ Partially wrong — he won it for the photoelectric effect. Most dangerous type of hallucination.
📅 No real-time info
Training has a cutoff date. No live news, weather, prices, or recent events without external tools.
🤖 No true understanding
Pattern recognition ≠ comprehension. No lived experience, emotions, or common sense reasoning.
⚖️ Biased outputs
Reflects biases in training data — gender, culture, and language. English hugely overrepresented vs other languages.
🧠 Context window limits
Forgets long conversations. Critical to design around if building real products.
"Why might an LLM perform significantly better in English than in Arabic?"
Answer: The internet has vastly more English-language content than Arabic. The model learned from what existed — so its English patterns are far richer, more nuanced, and more accurate than its Arabic ones.
This means AI tools built on these models may serve Arabic speakers worse — a real equity issue in AI development.
How LLMs Work
Tokens: unit of text (~¾ word)
Training: data → patterns → fine-tune
Context window: short-term memory with a hard token limit
Why They Fail
Hallucinations: 3 causes — patterns, data gaps, bad prompts
No real-time info: knowledge cutoff
Bias + no true understanding
Answer all 3 in 5 minutes:
In one sentence: what is a context window and what happens when it's exceeded?
Name one cause of hallucinations and give a real-world example of why it's dangerous.
Why might an AI product work better for some groups of people than others?
You'll test a real AI tool, try to make it hallucinate, and explore what jobs exist in AI. Come ready to experiment.