How LLMs Come Into Being: A Beginner-Friendly Guide to Data, Tokens, Training, and Inference
If large language models (LLMs) feel mysterious, you are not alone. Most people see a chat interface and a stream of text, but that final output is the end of a long engineering process.
This guide explains that process in plain language first, then adds the technical detail. By the end, you should understand how models are made, what tokens really are, and what happens after you press Enter.
Quick Mental Model
Think of an LLM as a system with three major phases:
- Data pipeline: collect and clean huge amounts of text and code (Stages 1–2)
- Training pipeline: teach a neural network to predict the next token, then shape its behaviour (Stages 3–7)
- Serving pipeline: run the trained model quickly and safely for real users (Stages 8–9)
If any one of these is weak, the final product is weak. The sections below walk through each stage in detail.
The End-to-End Lifecycle
At a high level, this is the lifecycle (the numbered stages in this section match the detailed stages below):
- Collect and clean data
- Build a tokeniser and vocabulary
- Define the model architecture
- Pretrain a base model
- Continue training for domain adaptation
- Post-train for instruction following and alignment
- Run evaluation and safety gates
- Optimise for speed and cost
- Serve the model in production
Teams repeat this cycle many times. There is rarely one “final” training run.
Stage 1: Data Collection and Cleaning
Plain English
Before a model can learn, it needs examples. Lots of them. The training data is usually a mixture of web text, documentation, books, code, and curated instruction data.
Technical details
A production-grade data pipeline normally includes:
- De-duplication: remove duplicate and near-duplicate content
- Normalisation: clean encoding, strip boilerplate, standardise formatting
- Language/domain filtering: keep the right mix of sources
- Quality scoring: rank and keep higher-quality documents
- Legal and policy filtering: remove data that should not be used
The exact data mixture matters a lot. Too much low-quality text can damage reasoning quality. Too much of one domain can hurt performance in others.
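As a toy illustration of the de-duplication step, here is a minimal sketch that collapses exact duplicates after light normalisation. Real pipelines typically use fuzzier near-duplicate methods (such as MinHash) at much larger scale; this only shows the idea.

```python
import hashlib

def normalise(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different copies match.
    return " ".join(text.lower().split())

def deduplicate(docs: list[str]) -> list[str]:
    # Keep the first occurrence of each normalised document.
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalise(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

docs = ["Hello  world", "hello world", "Different text"]
print(deduplicate(docs))  # the two "hello world" variants collapse to one entry
```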
Stage 2: Tokenisation (How Text Becomes Numbers)
Tokenisation is one of the most misunderstood parts of LLMs.
Plain English
The model cannot read raw text directly. Text must be split into smaller pieces called tokens, then mapped to numeric IDs.
What is a token?
A token can be:
- A full word
- Part of a word
- Punctuation
- Whitespace + text chunk
- A byte-level fallback for unusual strings
This is why token count is not the same as word count.
Why tokenisation matters so much
Tokenisation affects:
- Cost: pricing is usually token-based
- Context usage: fewer tokens means more room in the context window
- Speed: fewer tokens generally means faster processing
- Quality: poor token boundaries can hurt multilingual and rare-word behaviour
Common tokeniser approaches
Most LLMs use Byte Pair Encoding (BPE)-style or unigram-style subword tokenisers. BPE works by repeatedly merging the most frequent pairs of characters or subwords into single tokens until a target vocabulary size is reached. This produces a compact vocabulary that can represent any text while giving common words and substrings short, efficient representations. These methods balance vocabulary size with efficient text coverage across many languages and domains.
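The merge loop at the heart of BPE can be sketched in a few lines. This toy version works on a tiny word-frequency table and performs a handful of merges; production tokenisers train on billions of characters and also handle byte-level fallback.

```python
from collections import Counter

def most_frequent_pair(words):
    # Each word is a tuple of current symbols; count adjacent pairs weighted by frequency.
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    # Replace every occurrence of the pair with a single merged symbol.
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, with words pre-split into characters.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("slow"): 3}
for _ in range(3):  # three merge steps toward a target vocabulary size
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print("merged", pair)
```

Frequent substrings like "low" quickly become single tokens, which is exactly why common words end up cheap to represent and rare strings get split into more pieces.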
Stage 3: Model Architecture (The Neural Network Itself)
Plain English
Before training can begin, engineers define the shape of the network: how many layers it has, how wide each layer is, and how information flows through it. At this point the model does not know anything; its weights are random. The architecture is just the blueprint.
Technical details
Most modern LLMs are transformer decoder models.
Main components include:
- Token embeddings (map token IDs to vectors)
- Positional information (so order is preserved)
- Stacked transformer blocks (attention + feed-forward layers)
- Output head (predict next-token probabilities)
At this point, the model weights are usually random. The model only becomes useful after training.
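As a rough sketch of how these components fit together, here is a toy forward pass with random weights (using NumPy, with tiny made-up sizes, and the transformer blocks themselves omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 16, 8  # toy sizes; real models use tens of thousands / thousands

# Random weights: the untrained "blueprint" described above.
embeddings = rng.normal(size=(vocab_size, d_model))   # token embeddings
positions = rng.normal(size=(32, d_model))            # positional vectors
output_head = rng.normal(size=(d_model, vocab_size))  # maps vectors back to vocab scores

def forward(token_ids):
    # 1. Look up embeddings and add positional information.
    x = embeddings[token_ids] + positions[: len(token_ids)]
    # 2. Stacked transformer blocks would transform x here (omitted).
    # 3. Output head: one score (logit) per vocabulary token, at each position.
    return x @ output_head

logits = forward([3, 7, 1])
print(logits.shape)  # (3, 16): next-token scores at each input position
```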
Stage 4: Pretraining (Learning Language Patterns)
Plain English
Pretraining teaches the model to predict what token should come next.
Technical objective
Given tokens t_1 .. t_n, the model learns to estimate:
P(t_k | t_1, t_2, ..., t_{k-1})
Training minimises cross-entropy loss between predicted and actual next token.
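Concretely, the loss at one position is just the negative log of the probability the model assigned to the token that actually came next. A tiny worked example:

```python
import math

def cross_entropy(probs, target_id):
    # Negative log probability of the true next token.
    return -math.log(probs[target_id])

# Toy predicted distribution over a 4-token vocabulary.
probs = [0.1, 0.2, 0.6, 0.1]
print(cross_entropy(probs, 2))  # confident and correct -> low loss (~0.51)
print(cross_entropy(probs, 0))  # low probability on the truth -> high loss (~2.30)
```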
What actually happens during pretraining
For each training step:
- Load token batches
- Run forward pass on graphics processing units (GPUs) and other accelerators
- Compute loss and gradients
- Update weights with an optimiser
- Save checkpoints and evaluate periodically
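The forward/loss/gradient/update cycle can be shown end to end on a deliberately tiny "model": a single vector of logits trained by plain gradient descent to put probability on one target token. This is a sketch of the mechanics only, not of real training infrastructure.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=4)  # the "weights": raw scores over a 4-token vocabulary
target = 2                   # the actual next token in the training data
lr = 0.5

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(200):
    probs = softmax(logits)        # forward pass
    loss = -np.log(probs[target])  # cross-entropy loss
    grad = probs.copy()            # gradient of the loss w.r.t. the logits
    grad[target] -= 1.0
    logits -= lr * grad            # optimiser update (plain SGD)

print(softmax(logits)[target])  # probability of the true token climbs toward 1
```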
At scale, no single machine can hold the model or process data fast enough. Distributed training splits the work across hundreds or thousands of accelerators using strategies such as data parallelism (each machine sees different data), tensor parallelism (each machine holds part of a layer), and pipeline parallelism (each machine handles different layers). Mixed precision and strong fault tolerance are also essential at this scale.
Stage 5: Continued Training and Domain Adaptation
Plain English
After base pretraining, teams often run a shorter additional training phase to improve specific areas. This is cheaper and faster than restarting from scratch, so it is almost always preferred over repeating Stage 4 with new data.
Technical details
Many teams run an intermediate phase after base pretraining. The goal is to adapt to specific domains or newer data.
Examples:
- More high-quality code data
- Regulated-domain corpora (if licensed)
- Additional multilingual coverage
- Fresh data for changing topics
This stage still uses language-model training objectives, usually with adjusted learning rates and data weights.
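One of those adjusted data weights is the sampling mixture itself. A minimal sketch, with entirely hypothetical domain weights, of how training batches might be drawn in proportion to a chosen mixture:

```python
import random

random.seed(0)
# Hypothetical domain weights for a continued-training phase: heavier on code.
mixture = {"web": 0.3, "code": 0.4, "multilingual": 0.2, "fresh_news": 0.1}

def sample_domain():
    # Each training batch is drawn from a domain in proportion to its weight.
    domains, weights = zip(*mixture.items())
    return random.choices(domains, weights=weights, k=1)[0]

draws = [sample_domain() for _ in range(10_000)]
print({d: draws.count(d) / len(draws) for d in mixture})  # roughly matches the weights
```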
Stage 6: Post-Training and Alignment
A pretrained model can be smart but still not a good assistant. Post-training fixes that.
Supervised Fine-Tuning (SFT)
The model is trained on prompt-response examples that demonstrate desired behaviour:
- Follow instructions
- Respect format constraints
- Ask clarifying questions when needed
- Respond in a helpful style
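A single SFT training example is just a prompt paired with a demonstration of the desired behaviour. The exact schema varies by team; the format below is hypothetical but representative of the chat-style message lists many pipelines use:

```python
# Hypothetical SFT training example: a prompt plus a demonstration response
# that respects the requested format and a helpful style.
sft_example = {
    "messages": [
        {"role": "user", "content": "Summarise this article in three bullet points: ..."},
        {"role": "assistant", "content": "- Point one\n- Point two\n- Point three"},
    ]
}

# During SFT, loss is typically computed only on the assistant's tokens,
# so the model learns to produce the response given the prompt.
assistant_turns = [m for m in sft_example["messages"] if m["role"] == "assistant"]
print(len(assistant_turns))  # 1
```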
Preference training: Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO), and similar methods
The model learns from ranked outputs (better vs worse responses) to improve helpfulness and reduce undesirable behaviour.
Safety stack
Safety is usually layered, not a single feature:
- Data filtering
- Alignment fine-tuning
- Moderation/policy classifiers
- Runtime safeguards
- Red-team and jailbreak testing
Stage 7: Evaluation and Release Gates
Plain English
A model that passes training is not automatically ready to ship. Teams run a structured battery of tests first.
Technical details
Before release, candidate models are tested across capability, safety, and systems metrics.
Typical categories:
- Capability: reasoning, coding, language tasks
- Reliability: hallucination rate, stability, instruction adherence
- Safety: policy compliance and adversarial robustness
- Systems: latency, throughput, and serving cost
A model can be very capable but still fail release if it is too slow, too expensive, or too risky.
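Release gates are usually pass/fail thresholds, not averages: one failed gate blocks the release. A sketch of that logic, with illustrative metric names and made-up thresholds:

```python
# Hypothetical release gates. A candidate must clear every threshold;
# metric names and values here are illustrative, not real benchmarks.
gates = {
    "reasoning_score":    (0.80, "min"),
    "hallucination_rate": (0.05, "max"),
    "policy_violations":  (0.01, "max"),
    "p95_latency_ms":     (1200, "max"),
}

def passes_release(metrics: dict) -> list[str]:
    # Return the list of failed gates; an empty list means the model can ship.
    failures = []
    for name, (threshold, kind) in gates.items():
        value = metrics[name]
        ok = value >= threshold if kind == "min" else value <= threshold
        if not ok:
            failures.append(name)
    return failures

candidate = {"reasoning_score": 0.86, "hallucination_rate": 0.04,
             "policy_violations": 0.02, "p95_latency_ms": 900}
print(passes_release(candidate))  # capable and fast, but blocked by one safety gate
```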
Stage 8: Optimising for Production
Plain English
A model that passed evaluation is still not ready for users. The trained checkpoint is often enormous and slow to run at the volume real products demand. This stage is about making the model practical: cheaper, faster, and small enough to deploy efficiently.
Technical details
Large checkpoints are expensive to serve. Teams usually apply optimisations:
- Quantisation: lower precision to reduce memory and cost
- Distillation: train smaller models to mimic stronger models
- Routing and mixture-of-experts (MoE) strategies: activate only relevant model parts
- Inference-engine tuning: better batching and cache reuse
These steps are about practical trade-offs between quality, speed, and cost.
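Quantisation is the easiest of these to show concretely. A minimal sketch of symmetric int8 quantisation: store one floating-point scale plus small integers, cutting memory 4x relative to float32 at the cost of small rounding error.

```python
import numpy as np

def quantise_int8(weights):
    # Symmetric int8 quantisation: one shared scale, integer values in [-127, 127].
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantise(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=1000).astype(np.float32)
q, scale = quantise_int8(w)
restored = dequantise(q, scale)
print(w.nbytes, q.nbytes)                 # 4000 vs 1000 bytes: 4x smaller
print(float(np.abs(w - restored).max()))  # small rounding error, bounded by the scale
```

Production schemes are more sophisticated (per-channel scales, 4-bit formats, calibration data), but the memory-for-precision trade is the same.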
Stage 9: Inference (What Happens When You Send a Prompt)
Plain English
Every time you send a message to an LLM product, the serving infrastructure runs the model on your input and streams the result back. How quickly that first word appears, known as time to first token (TTFT), is one of the most important user experience metrics in production.
Runtime sequence
- Build request context (system/developer/user messages and tool context)
- Tokenise input text
- Run prefill pass and create a key-value (KV) cache
- Decode output tokens step by step
- Convert tokens back to text
- Stream response and execute tool calls when needed
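The prefill-then-decode shape of that sequence can be sketched with a stand-in model. Here `fake_model` is a placeholder that deterministically returns "last token + 1" so the control flow is visible; a real model would attend over the KV cache rather than re-reading the whole prompt each step.

```python
def fake_model(tokens, cache):
    # Stand-in for a real network: pretend to store attention state,
    # then emit a deterministic "next token" for the demo.
    cache.extend(tokens)
    return (tokens[-1] + 1) % 50

def generate(prompt_tokens, max_new_tokens=5, stop_token=99):
    kv_cache = []
    tokens = list(prompt_tokens)
    next_token = fake_model(tokens, kv_cache)  # prefill: process the whole prompt once
    output = []
    for _ in range(max_new_tokens):            # decode: one token per step
        output.append(next_token)
        if next_token == stop_token:
            break
        next_token = fake_model([next_token], kv_cache)
    return output

print(generate([10, 11, 12]))  # [13, 14, 15, 16, 17]
```

Prefill touches many tokens in one pass (it dominates TTFT), while decoding produces one token per step (it dominates total generation time).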
Important runtime controls
These settings change output behaviour at inference time:
- Temperature
- Top-p / top-k
- Max output tokens
- Frequency/presence penalties
- Stop sequences
These are decoding controls, not training controls.
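Temperature and top-p act on the model's output distribution before a token is drawn. A minimal sketch of both, applied to a toy set of logits:

```python
import math, random

def sample(logits, temperature=1.0, top_p=1.0, seed=0):
    rng = random.Random(seed)
    # Temperature rescales logits: low values sharpen, high values flatten.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    probs = [math.exp(l - m) for l in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    # Top-p (nucleus): keep the smallest set of tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.1, -1.0]
print(sample(logits, temperature=0.2, top_p=0.9))  # low temperature: token 0 dominates
```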
Token Economics (Why Product Teams Care)
Tokenisation (Stage 2) has direct consequences for cost. A simplified model:
Total cost ~= (input_tokens * input_price_per_token)
+ (output_tokens * output_price_per_token)
+ infrastructure overhead
This is why prompt length, retrieval quality, and output limits matter so much in production.
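The simplified cost model above is easy to turn into numbers. The per-token prices below are hypothetical (real prices vary by provider and model), but the arithmetic is the point:

```python
# Hypothetical per-token prices; real prices vary by provider and model.
INPUT_PRICE = 3.00 / 1_000_000    # dollars per input token
OUTPUT_PRICE = 15.00 / 1_000_000  # dollars per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# 50,000 requests/day with a 2,000-token prompt and a 500-token reply:
daily = 50_000 * request_cost(2_000, 500)
print(f"${daily:,.2f} per day")  # $675.00 per day at these assumed prices
```

Trimming the prompt or capping output length feeds straight into that number, which is why retrieval quality and output limits get so much attention.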
Common Beginner Confusions
- “The model stores facts like a database”: not exactly. It stores statistical patterns in weights.
- “Tokens are basically words”: not reliable. Token boundaries are subword-based.
- “Alignment equals RLHF”: incomplete. Alignment includes SFT, preference training, safety layers, and runtime policy.
- “Only bigger models matter”: not always. Smaller specialised models can win on latency and cost.
- “I should retrain the model to give it new information”: often not. Retrieval-augmented generation (RAG), fetching relevant documents at inference time, is usually faster and cheaper than either full pretraining from scratch or continued training (Stage 5) when the goal is to add or refresh knowledge. Those forms of retraining make more sense when you need to change behaviour or capabilities, not just inject new facts.
A Practical Way to Think About LLM Systems
Treat an LLM product as three connected engineering products:
- Data product: what you collect, filter, and refresh
- Training product: how you optimise and evaluate model behaviour
- Serving product: how fast, safe, and affordable responses are in production
Most real progress comes from improving the connections between these layers, not only from increasing model size.
Glossary (Quick Reference)
- Token: numeric unit the model processes
- Vocabulary: full set of token IDs known by the model
- Pretraining: large-scale next-token learning phase
- SFT: supervised fine-tuning on instruction examples
- Alignment: shaping behaviour toward human and policy preferences
- Inference: running the trained model on new prompts
- TTFT: time to first token, a key user experience (UX) latency metric
- KV cache: stored attention state used to speed decoding
Final Summary
LLMs come into being through a pipeline, not a single training event. Data quality sets the ceiling, tokenisation sets important efficiency constraints, pretraining builds core capability, post-training shapes behaviour, and inference engineering determines user experience.
If you are just starting out, focus on this order:
- Understand tokenisation and context limits (Stage 2 and Token Economics)
- Learn the difference between pretraining and post-training (Stages 4 and 6)
- Track latency and token cost in any real product (Stages 8 and 9)
That foundation will make the rest of the LLM ecosystem much easier to understand.