Large language models (LLMs) have become an integral part of our society. And, as with everything else I use, I've always wondered how it all comes together at the most fundamental level of abstraction.
So I turned to the paper that started it all: Attention Is All You Need. And using Andrej Karpathy's lecture based on the same paper, I built up the simplest form of a chatbot: a text generator. To give it a personal twist, I made it speak my mother tongue, Shona.
At a much higher level of abstraction: LLMs like GPT-4 are advanced neural networks trained on massive text datasets to predict and generate human-like text. Built on the Transformer architecture, they use self-attention mechanisms to understand word relationships and context, enabling capabilities like translation, writing, and question-answering. While powerful, they have limitations—they don't truly "understand" content, can produce biased or fabricated outputs, and rely more on training data quality than just parameter count (despite marketing hype). Their strength lies in pattern recognition from vast data, not genuine reasoning, making them versatile tools that still require careful oversight.
This blog post isn't a demo; you'll find the demo here. It's an educational walk through the mechanics of transformers. I say "educational" because what you'll see here runs on a couple of hundred kilobytes of text and a model with less than a million parameters. In contrast, ChatGPT and its cousins operate on hundreds of billions of tokens and models with tens to hundreds of billions of parameters. That scale enables conversation, reasoning, and memory, but the fundamental building blocks are the same. This post dives into those building blocks, from scratch.
The Tech Stack
- Python
- PyTorch
- Shona dataset (discussed below)
The Dataset
The challenge of finding a dataset for this project nearly matched the challenge of building the model itself. And somewhere in that struggle, a more uncomfortable realization set in: how much of African languages and culture will be left out of this new AI-powered era?
To be honest, I hadn't been paying attention to that part — not until I was forced to. But this isn't just a technical gap. It's a cultural one. The language models of the future will shape how knowledge is accessed, how ideas are expressed, and whose stories are told. If we don't make sure our languages — our metaphors, idioms, and ways of thinking — are part of that, we risk being rendered invisible by default.
Finding Shona datasets is like looking for poetry on a calculator. Most corpora are either academic relics or religious texts. Here's what I scraped together:
- Leipzig Shona Corpus – A clean set of 10,000–30,000 web sentences. Simple and structured.
- Tatoeba – Community-submitted Shona sentences (some translated from English).
- Manual Copy-Paste – Yes. I did that. Anything vaguely coherent and Shona got fed into the model.
In total: about 200 KB of usable data, pitiful by modern standards, but enough for a toy model to start hallucinating.
Low-Level Abstraction
To truly grasp the capabilities of Large Language Models (LLMs), it's essential to understand four fundamental concepts that form the backbone of these remarkable systems: token prediction, attention mechanisms, self-supervised learning, and positional context.
What Are Tokens and Token Prediction?
The Building Blocks of Language Understanding
Tokens are the fundamental units that LLMs use to process text. Think of them as the "atoms" of language processing—small pieces of text that can represent anything from individual words to parts of words, punctuation marks, or even spaces. For instance, the sentence "Mhoro Nyika!" might be broken down into tokens like ["Mhoro", " Nyika", "!"].
Modern LLMs don't work with raw text directly. Instead, they convert human-readable text into numerical representations called tokens through a process known as tokenization. Each token is assigned a unique numerical identifier that the model can process mathematically, so our text would now be [1,2,3].
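As a concrete (if tiny) illustration, here's a sketch of character-level tokenization, the scheme this toy model ends up using. The variable names and the toy string are illustrative, not the project's exact code:

```python
# Build a character-level tokenizer: the vocabulary is every distinct
# character that appears in the training text.
text = "Mhoro Nyika!"                          # stand-in for the real Shona corpus
chars = sorted(set(text))                      # unique characters, in a fixed order
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> character

def encode(s: str) -> list[int]:
    """Turn a string into a list of token ids."""
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    """Turn a list of token ids back into a string."""
    return "".join(itos[i] for i in ids)

print(encode("Mhoro"))           # [2, 5, 8, 9, 8] for this tiny vocabulary
print(decode(encode("Mhoro")))   # "Mhoro"
```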
How Token Prediction Works
Token prediction is the core mechanism by which LLMs generate text. At its heart, this process involves predicting the most likely next token in a sequence based on all the previous tokens. This approach is called autoregressive generation—the model generates text one token at a time, using previous outputs as input for the next prediction.
Here's how it works step-by-step:
- Input Processing: The user provides a prompt, which gets tokenized into numerical representations
- Contextual Analysis: The model analyzes the input using neural networks to understand relationships between tokens
- Probability Calculation: For each possible next token, the model calculates a probability based on patterns learned during training
- Token Selection: The model selects the next token using various strategies (like picking the most likely one or sampling from the top candidates)
- Iteration: The chosen token gets added to the sequence, and the process repeats until a stopping condition is met
What makes this particularly powerful is that LLMs predict probability distributions over their entire vocabulary. This means they assign probabilities to every possible next token, allowing for both deterministic and creative text generation depending on the sampling strategy used.
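Here's a schematic sketch of that autoregressive loop in PyTorch. It assumes a trained `model` that maps a batch of token ids to next-token logits; the function and parameter names are placeholders rather than this project's exact code:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size=64, temperature=1.0):
    """Autoregressive generation: sample and append one token per step.

    model: any module mapping (B, T) token ids to (B, T, vocab_size) logits
    idx:   (B, T) tensor of token ids holding the prompt
    """
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]           # crop to the model's context window
        logits = model(idx_cond)                  # (B, T, vocab_size)
        logits = logits[:, -1, :] / temperature   # only the last position predicts the next token
        probs = F.softmax(logits, dim=-1)         # probability distribution over the vocabulary
        next_id = torch.multinomial(probs, num_samples=1)  # token selection by sampling
        idx = torch.cat([idx, next_id], dim=1)    # append and repeat
    return idx
```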
Attention Mechanisms: The Key to Understanding Context
The Revolutionary Breakthrough
The attention mechanism represents one of the most significant breakthroughs in modern AI. It allows models to focus selectively on different parts of the input sequence when making predictions, much like how humans pay attention to relevant information while ignoring less important details.
Traditional models like Recurrent Neural Networks (RNNs) processed text sequentially, one word at a time, which made it difficult to capture relationships between distant words (The RNN movie). The attention mechanism revolutionized this by allowing models to examine the entire sequence simultaneously and decide which parts deserve focus.
How Attention Works Mathematically
The core attention mechanism can be expressed through a single formula:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
Where:
- Q (Query): Represents what we're looking for
- K (Key): Represents what we're looking at
- V (Value): Represents the actual information we want to extract
- d_k: The dimension of the key vectors; dividing the dot products by √d_k keeps them from growing so large that the softmax becomes numerically unstable
This formula essentially computes how much attention each part of the input should receive when generating each output token.
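In PyTorch, the whole formula is a few lines. This is a hedged sketch rather than the project's exact module; the tensor shapes are arbitrary choices for illustration:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Scaled dot-product attention for a batch of sequences.

    q, k, v: tensors of shape (B, T, d_k) -- queries, keys, values
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (B, T, T): how well each query matches each key
    weights = F.softmax(scores, dim=-1)                # each row sums to 1: the attention paid to each value
    return weights @ v                                 # weighted sum of values, shape (B, T, d_k)

# Toy usage: batch of 1, sequence of 4 tokens, d_k = 8
q, k, v = (torch.randn(1, 4, 8) for _ in range(3))
print(scaled_dot_product_attention(q, k, v).shape)     # torch.Size([1, 4, 8])
```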
Multi-Head Attention: Multiple Perspectives
Modern transformers use multi-head attention, which runs several attention mechanisms in parallel. Each attention "head" can focus on different types of relationships. For example, one head might focus on grammatical relationships while another focuses on semantic meaning, allowing the model to capture multiple types of dependencies simultaneously.
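PyTorch ships a ready-made `nn.MultiheadAttention` module, which is enough to see the idea; the dimensions below are arbitrary toy values:

```python
import torch
import torch.nn as nn

embed_dim, num_heads, seq_len = 32, 4, 6          # arbitrary toy sizes
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(1, seq_len, embed_dim)            # (batch, tokens, embedding)

# The same tensor supplies queries, keys, and values, and each of the
# 4 heads attends to the sequence through its own learned projections.
out, weights = mha(x, x, x)
print(out.shape)      # torch.Size([1, 6, 32])  -- same shape as the input
print(weights.shape)  # torch.Size([1, 6, 6])   -- attention weights, averaged over heads
```

Passing the same tensor `x` as query, key, and value is exactly the self-attention setup discussed next.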
Self-Attention: Looking Inward
Self-attention is a special case where the model computes attention over the same sequence for queries, keys, and values. This enables each position in the sequence to attend to all other positions, allowing the model to understand how different parts of the input relate to each other.
Self-Supervised Learning: Learning Without Labels
The Power of Learning from Structure
Self-supervised learning is a training paradigm that allows models to learn from data without requiring human-labeled examples. Instead of relying on external labels, these models generate supervisory signals from the structure of the data itself.
In the context of language models, self-supervised learning typically involves pretext tasks—artificial learning objectives created from the data structure. The most common pretext task for LLMs is predicting the next word (or token) in a sequence, using the preceding context as input.
How It Works in Practice
Consider a simple example: Given the sentence "Kunze kwakanaka uye kune zuva," a self-supervised model might mask out the word "uye" and try to predict it from the context "Kunze kwakanaka {****} kune zuva." GPT-style models use a next-token variant of the same idea: the label for each position is simply the token that follows it, so the running text supplies its own supervision. The model learns by repeatedly performing this task across millions of text examples, gradually building an understanding of language patterns, grammar, and meaning.
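To make the next-token variant concrete, here's a minimal sketch of how input/target pairs fall straight out of the raw text, reusing the character-level encoding idea from earlier; the variable names are illustrative:

```python
import torch

text = "Kunze kwakanaka uye kune zuva"
stoi = {ch: i for i, ch in enumerate(sorted(set(text)))}
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

block_size = 8                       # context length
x = data[:block_size]                # input: the first 8 characters
y = data[1:block_size + 1]           # target: the same window shifted one character forward

# Every prefix of x is paired with the character that follows it --
# the "labels" come for free from the structure of the text itself.
for t in range(block_size):
    context, target = x[:t + 1], y[t]
    print(f"{context.tolist()} -> {target.item()}")
```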
This approach is powerful because:
- No Manual Labeling Required: The model can learn from any text data without human annotation.
- Scalable: It can leverage vast amounts of unlabeled text available on the internet, or in this case, unavailable.
- Rich Representations: It learns deep, contextual understanding rather than simple pattern matching, which would not work for an entire language.
Pretext vs. Downstream Tasks
Self-supervised learning typically involves two phases:
- Pretext Task: The model learns general language representations through self-supervised objectives like next-word prediction.
- Downstream Task: The pre-trained model is fine-tuned for specific applications like translation, summarization, or question-answering.
This two-stage approach allows a single pre-trained model to be adapted for numerous specific tasks, making it incredibly versatile and efficient.
Positional Context: Understanding Order and Sequence
The Challenge of Position in Transformers
Unlike RNNs that process text sequentially and naturally encode position information, transformer models process all tokens in parallel. This parallel processing is what makes transformers fast and efficient, but it creates a problem: how does the model know the order of words in a sentence?
This is where positional encoding becomes crucial. The position of words fundamentally matters in language: "Ndino rara pamba" and "Ndino pamba rara" use exactly the same words, but the change in order changes what, if anything, they mean.
How Positional Encoding Works
Positional encoding adds information about each token's position in the sequence to help the model understand word order. The most common approach uses sinusoidal functions to create unique positional representations:
For position k and dimension i:

$$PE(k,\,2i) = \sin\!\left(\frac{k}{10000^{2i/d}}\right), \qquad PE(k,\,2i+1) = \cos\!\left(\frac{k}{10000^{2i/d}}\right)$$
Where d is the model dimension and k is the position in the sequence.
These positional encodings are added element-wise to the token embeddings, so each token's representation contains both its semantic meaning and its position in the sequence.
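A small sketch of that table in PyTorch, following the sinusoidal formula above; the function name is mine and the toy dimensions are arbitrary:

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) table of sinusoidal position encodings."""
    positions = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # k, as a column
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))                     # 1 / 10000^(2i/d)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(positions * div)   # even dimensions use sine
    pe[:, 1::2] = torch.cos(positions * div)   # odd dimensions use cosine
    return pe

# Added element-wise to the token embeddings before the first transformer block.
token_embeddings = torch.randn(10, 16)                        # 10 tokens, model dimension 16
x = token_embeddings + sinusoidal_positional_encoding(10, 16)
print(x.shape)  # torch.Size([10, 16])
```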
Why This Mathematical Approach Works
The sinusoidal approach has several advantages:
- Uniqueness: Each position gets a unique encoding pattern
- Scalability: The model can handle sequences longer than those seen during training
- Relative Relationships: The mathematical properties allow the model to learn relative distances between positions
- Smooth Transitions: Adjacent positions have similar encodings, helping the model understand local relationships
Positional Bias and Context Windows
Modern LLMs often exhibit positional bias—they perform differently depending on where relevant information appears in their input. This has led to phenomena like "lost in the middle," where models struggle with information placed in the center of long contexts.
Researchers continue developing improved positional encoding methods to address these challenges, including techniques like Multi-scale Positional Encoding (Ms-PoE) and Position Interpolation.
How These Concepts Work Together
The Complete Picture
These four concepts work synergistically in modern LLMs:
- Tokenization converts raw text into processable units
- Positional encoding ensures the model understands word order
- Self-attention allows the model to understand relationships between all tokens
- Token prediction generates new text one piece at a time
- Self-supervised learning enables the model to learn these patterns from vast amounts of text data
The Generation Process in Action
When you prompt an LLM with "Nhasi kuri ", here's what happens:
- The text is tokenized: ["Nhasi", " kuri"]
- Each token receives positional encoding indicating its place in the sequence
- The self-attention mechanism analyzes relationships between all tokens
- The model predicts probability distributions for the next token (perhaps "kutonhora," "kupisa," "kurara")
- A token is selected based on the sampling strategy
- The process repeats with the new token added to the context
This cycle continues until the model generates a complete response, with each new token informed by the growing context and the patterns learned during self-supervised training.
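The "sampling strategy" in steps 4 and 5 is worth pinning down. Here's a hedged sketch of two common knobs, temperature and top-k, applied to a made-up logit vector; nothing here is specific to this project's code:

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=None):
    """Pick one token id from a vector of next-token logits.

    temperature < 1 sharpens the distribution (more deterministic),
    temperature > 1 flattens it (more adventurous); top_k keeps only
    the k most likely candidates before sampling.
    """
    logits = logits / temperature
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[-1]               # k-th largest logit
        logits = logits.masked_fill(logits < kth, float("-inf")) # drop everything below it
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

# Made-up logits over a 5-token vocabulary
logits = torch.tensor([2.0, 1.0, 0.5, -1.0, -3.0])
print(sample_next_token(logits, temperature=0.8, top_k=3))  # usually 0, sometimes 1 or 2
```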
Understanding these four concepts provides a solid foundation for comprehending how modern LLMs work. Token prediction gives us the generation mechanism, attention provides contextual understanding, self-supervised learning enables scalable training, and positional encoding ensures proper sequence comprehension. Together, they form the backbone of the language understanding systems that are transforming how we interact with AI.
The implementation here uses character-level tokenization, which means the model predicts one character at a time. So when it manages to generate a word like "kuyambuka", even if it stumbles elsewhere, you've got to give it some credit — it's assembling words letter by letter, without ever having seen the word as a whole.
To improve it meaningfully, we'd need more — and better — data. I couldn't find a digitized, open-source Shona book corpus. Most of what's available is dry and formulaic — which is why the model currently sounds like a newsreader stuck in 2011.
Long-term, a language-specific tokenizer would help. Shona has its own structure: roots, prefixes, infixes — a rhythm to how meaning is built. Reconnecting with that, even computationally, brought me back to my O-Level Shona lessons, and gave me a new respect for my teacher, Ms. Chihewure, who made that structure make sense long before transformers did.