How LLMs Actually Work: A Developer's Guide

You use ChatGPT, Claude, Copilot every day. But do you actually know what happens when you press Enter? As a developer using these tools, understanding the basics isn't just curiosity - it makes you better at prompting, debugging, and building with AI. You'll know why your prompt costs what it does, why the model hallucinates sometimes, why it can write poetry but can't multiply large numbers, and how to design your applications to work with these strengths and weaknesses instead of against them.
This isn't an academic paper. Think of it as the explanation I wish someone had given me when I first started building with LLMs - practical, developer-focused, and just deep enough to be useful without drowning in linear algebra.
The 30-Second Version
Before we dive in, here's the entire process in one paragraph. Keep this mental model as you read - everything else is just adding detail to these steps:
- Your text input gets split into tokens (roughly word-pieces).
- Each token is converted into a vector - a list of numbers that represents its meaning.
- These vectors flow through dozens of transformer layers, where attention mechanisms let each token look at every other token to build up context.
- At the end, the model outputs a probability distribution over the entire vocabulary - basically a ranked list of what token is most likely to come next.
- One token is sampled from those probabilities and added to the output.
- The extended sequence flows back through the model, and the loop repeats until the model produces a stop token or hits the max length.
That's it. Every response from every LLM you've ever used was generated this way - one token at a time, each one influenced by everything that came before it. Now let's unpack each step.

Tokenization: How Text Becomes Numbers
Neural networks don't understand text. They understand numbers. So the very first thing that happens when you send a prompt is tokenization - converting your text into a sequence of integers that the model can process.
Here's the key insight: tokens are not words, and they're not characters. They're somewhere in between - subword units. The most common approach is called Byte Pair Encoding (BPE), and it works by finding the most frequently occurring sequences of characters in the training data and turning them into single tokens.
Common words like "the" or "and" become single tokens. Less common words get split into pieces. A word like "tokenization" might become ["token", "ization"] - two tokens. A rare word like "defenestration" might become ["def", "en", "est", "ration"] - four tokens.
You can see this in action. Here's how a sentence gets tokenized by a GPT-style tokenizer:
// Using tiktoken (OpenAI's tokenizer library)
import { encoding_for_model } from "tiktoken";
const enc = encoding_for_model("gpt-4");
const tokens = enc.encode("How do large language models work?");
console.log(tokens);
// [2437, 656, 3544, 4221, 4211, 990, 30]
console.log(tokens.length);
// 7 tokens for 6 words
// Each token maps back to text:
// "How" → 2437
// " do" → 656 (notice the leading space is part of the token)
// " large" → 3544
// " language" → 4221
// " models" → 4211
// " work" → 990
// "?" → 30

Notice something important: the space before a word is often part of the token itself. The model doesn't see "spaces" as separate things - they're baked into the token.
Why tokenization matters for developers
Understanding tokenization directly affects how you build with LLMs:
- Cost: API pricing is per token, not per word. A dense, jargon-heavy prompt costs more than simple English because uncommon words split into multiple tokens.
- Context window: When a model says it supports "128K tokens," that's roughly 96K words of English - but far fewer words in languages like Japanese or Korean, where each character might be its own token.
- Math struggles: The number "12345" might tokenize as ["123", "45"]. The model doesn't see it as one number - it sees two separate tokens. This is a big reason LLMs struggle with arithmetic.
- Prompting: Shorter, simpler words are generally more reliable because they're single tokens. The model has seen them more consistently during training.
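For quick cost estimates you don't even need a tokenizer. A common rule of thumb for English prose is roughly 4 characters per token. The sketch below uses that heuristic with a made-up price - the real count varies by model and content, and real rates are on your provider's pricing page, so treat this as a ballpark, not a billing calculation:

```javascript
// Rough token/cost estimator using the ~4 characters-per-token heuristic.
// Both constants are assumptions, not real pricing.
const CHARS_PER_TOKEN = 4;        // rough average for English prose
const USD_PER_1K_TOKENS = 0.01;   // hypothetical example rate

function estimateTokens(text) {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

function estimateCostUSD(text) {
  return (estimateTokens(text) / 1000) * USD_PER_1K_TOKENS;
}

const prompt = "How do large language models work?";
console.log(estimateTokens(prompt)); // 9 by the heuristic (the real GPT-4 tokenizer gives 7)
```

The heuristic overestimates for common English words and underestimates for code, jargon, and non-Latin scripts - exactly the tokenization effects described above.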
Embeddings: Numbers That Understand Meaning
Once text is tokenized into integers, those integers need to become something the model can actually reason about. That's where embeddings come in.
Each token ID gets looked up in a giant table - the embedding matrix - and converted into a vector: a list of numbers, typically 4096 to 12288 dimensions long (depending on the model). Think of each dimension as capturing some aspect of the token's meaning.
The famous example that made this click for a lot of people comes from Word2Vec (2013):
vector("king") - vector("man") + vector("woman") ≈ vector("queen")

This isn't a parlor trick - it reveals something profound. The model learns that the relationship between "king" and "man" is the same as the relationship between "queen" and "woman." These relationships are encoded as directions in high-dimensional space.
For a developer, the useful intuition is this: similar concepts end up close together in vector space. "JavaScript" and "TypeScript" are close. "JavaScript" and "banana" are far apart. "Python" (the language) and "Python" (the snake) start as the same token but get pushed to different regions of the space as the transformer layers process surrounding context. This is how the model handles ambiguity.
If you've worked with vector databases (Pinecone, pgvector, Weaviate), you're already using embeddings. When you generate an embedding for a document chunk and store it, you're storing a point in this same kind of high-dimensional meaning space. Similarity search works because nearby points mean similar things.
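Similarity search over embeddings boils down to cosine similarity: the angle between two vectors, ignoring their lengths. Here's a minimal sketch with tiny made-up 3-dimensional vectors (real embeddings have thousands of dimensions and come from an embedding model - these numbers are invented purely to illustrate "close" vs "far"):

```javascript
// Cosine similarity: 1 = same direction, 0 = unrelated, -1 = opposite.
function dot(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}
function norm(a) {
  return Math.sqrt(dot(a, a));
}
function cosineSimilarity(a, b) {
  return dot(a, b) / (norm(a) * norm(b));
}

// Made-up toy vectors - real ones come from an embedding model/API.
const javascript = [0.9, 0.8, 0.1];
const typescript = [0.85, 0.82, 0.12];
const banana = [0.1, 0.05, 0.95];

console.log(cosineSimilarity(javascript, typescript)); // close to 1
console.log(cosineSimilarity(javascript, banana));     // much smaller
```

This is the exact operation a vector database performs at scale: your query becomes a vector, and the database returns the stored points with the highest cosine similarity.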

The Transformer Architecture
Now we get to the architecture that changed everything. In 2017, a team at Google published a paper called "Attention Is All You Need" that introduced the transformer. Before this paper, the state of the art for processing sequences (text, audio, time series) was Recurrent Neural Networks (RNNs) and LSTMs. They worked, but they were slow and struggled with long-range dependencies - by the time the model processed the 500th word in a paragraph, it had largely forgotten what the 1st word was.
The transformer solved this with a fundamentally different approach: instead of processing tokens one at a time in sequence, it processes all tokens simultaneously and lets every token attend to every other token. This is the attention mechanism, and it's the single most important idea in modern AI.
The original transformer had two halves: an encoder (reads the input) and a decoder (generates the output). This was designed for translation - encode a French sentence, decode an English sentence. But GPT and most modern LLMs only use the decoder half. They're called "decoder-only" models, and they generate text by predicting the next token given all previous tokens.
A modern LLM is essentially a very deep stack of identical transformer blocks. GPT-3 has 96 layers. Claude and GPT-4 likely have more (the exact architectures aren't public). Each block does two main things:
- Multi-head self-attention: Each token looks at every other token and figures out which ones are relevant to it.
- Feed-forward network: A simple neural network that processes each token independently, transforming its representation.
Between these operations, there are residual connections (add the input of a layer back to its output) and layer normalization. The residual connections are crucial - they create a "highway" that lets information flow from early layers all the way to the end without degrading. Without them, deep networks would be impossible to train.
Here's a simplified pseudocode view of one transformer block:
function transformerBlock(tokens) {
// Step 1: Self-attention (tokens look at each other)
let attended = selfAttention(tokens);
tokens = layerNorm(tokens + attended); // residual connection
// Step 2: Feed-forward (each token processed independently)
let transformed = feedForward(tokens);
tokens = layerNorm(tokens + transformed); // residual connection
return tokens;
}
// The full model is just many of these stacked:
function llm(inputTokens) {
let hidden = embed(inputTokens); // token IDs → vectors
for (let i = 0; i < NUM_LAYERS; i++) { // 96+ layers
hidden = transformerBlock(hidden);
}
return predictNextToken(hidden); // vectors → probability over vocabulary
}

Each layer refines the representation. Early layers tend to capture syntax and basic word relationships. Middle layers build up semantic understanding. Late layers focus on the specific task at hand - what token should come next given all this accumulated context. But this isn't a rigid pipeline; information flows in complex ways through the residual connections.

Attention: The Key Innovation
Attention is the mechanism that makes transformers work, and it's worth understanding well because it explains a lot of LLM behavior you see in practice.
The core idea is simple: when processing a token, the model needs to figure out which other tokens in the sequence are relevant to it. In the sentence "The cat sat on the mat because it was tired," when processing "it," the model needs to realize that "it" refers to "cat," not "mat." Attention is the mechanism that makes this connection.
Query, Key, Value: The Database Analogy
The best analogy I've found is a database query. For each token, the model computes three vectors:
- Query (Q): "What am I looking for?" - this represents what information the current token needs.
- Key (K): "What do I contain?" - this represents what information each token can offer.
- Value (V): "Here's my actual content" - this is the information that gets passed along when a match happens.
The attention score between two tokens is the dot product of one token's Query with another token's Key. High score means they're relevant to each other. These scores get normalized into weights (using softmax), and then each token's output is a weighted sum of all the Values.
// Simplified attention for a single token
function attention(query, keys, values) {
// How relevant is each other token to me?
const scores = keys.map(key => dotProduct(query, key));
// Normalize scores to weights (0 to 1, summing to 1)
const weights = softmax(scores.map(s => s / Math.sqrt(keyDimension)));
// My output is a weighted blend of all values
return weightedSum(weights, values);
}
// Example: processing "it" in "The cat sat on the mat because it was tired"
// query = what "it" is looking for (something animate, a subject)
// keys = what each word advertises (cat: "I'm animate, a subject", mat: "I'm a surface")
// The dot product of "it"'s query with "cat"'s key will be high
// So "it"'s output will be heavily influenced by "cat"'s value

Multi-Head Attention
A single attention operation can only capture one type of relationship. But language has many types of relationships simultaneously - syntactic (subject-verb), semantic (co-reference like "it" → "cat"), positional (adjacent words), and more.
The solution is multi-head attention: run multiple attention operations in parallel, each with its own learned Q/K/V transformations. Each "head" can learn to focus on a different type of relationship. One head might track grammatical structure, another might track co-references, another might focus on semantic similarity.
GPT-3 uses 96 attention heads per layer. That's 96 different "perspectives" on how tokens relate to each other, at every single layer.
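The helper functions in the earlier pseudocode (softmax, dotProduct, the weighted sum) are only a few lines each. Here's a runnable single-head version with tiny made-up 2-dimensional vectors, just to make the mechanics concrete - real models use thousands of dimensions and learned Q/K/V projection matrices:

```javascript
function dotProduct(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

function softmax(scores) {
  const max = Math.max(...scores); // subtract max for numerical stability
  const exps = scores.map(s => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / total);
}

// Single-head attention: one query against a set of keys/values.
function attention(query, keys, values) {
  const dim = query.length;
  const scores = keys.map(key => dotProduct(query, key) / Math.sqrt(dim));
  const weights = softmax(scores);
  // Output is the weighted sum of the value vectors
  return values[0].map((_, i) =>
    weights.reduce((sum, w, j) => sum + w * values[j][i], 0)
  );
}

// Toy example: "it" attending over ["cat", "mat"]. All vectors invented.
const queryIt = [1.0, 0.0];   // "it" is looking for an animate subject
const keys = [[0.9, 0.1],     // "cat" advertises animacy
              [0.1, 0.9]];    // "mat" advertises surface-ness
const values = [[1, 0], [0, 1]];

console.log(attention(queryIt, keys, values));
// First component dominates: "it" blends mostly from "cat"'s value
```

Multi-head attention just runs this operation many times in parallel, each head with its own learned projections, and concatenates the results.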
Why attention beats RNNs
With an RNN, information from token 1 has to travel through every single intermediate step to reach token 500. It's like a game of telephone - the signal degrades. With attention, token 500 can directly look at token 1 in a single operation. The "distance" between any two tokens is always 1 step. This is why transformers handle long documents so much better than older architectures.
How Training Works
Understanding training helps explain both the capabilities and the limitations of LLMs. There are typically two (sometimes three) phases.
Phase 1: Pre-training (Next-Token Prediction)
The foundation of every LLM is a deceptively simple task: predict the next token. Take a massive corpus of text - think: a significant chunk of the internet, books, academic papers, code repositories - and train the model to predict what comes next, one token at a time.
// Training example (simplified):
Input: "The capital of France is"
Target: "Paris"
// The model sees "The capital of France is" and tries to predict "Paris"
// If it predicts "London", the loss function penalizes it
// Gradients flow backward, adjusting billions of parameters slightly
// Repeat this billions of times with different text

The scale of this is hard to overstate. GPT-3 was trained on roughly 300 billion tokens. GPT-4 and Claude likely used trillions. The training run for a frontier model costs tens of millions of dollars in compute - thousands of GPUs running for months.
What's remarkable is that this simple objective - predict the next token - produces something that appears to "understand" language, logic, code, math, and more. The model has to learn grammar, facts, reasoning patterns, and common sense just to be good at prediction. If the training text says "water boils at 100 degrees" enough times, the model learns that fact - not because anyone labeled it as a fact, but because knowing it improves next-token prediction.
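The "loss function penalizes it" step has a concrete form: cross-entropy, which is just the negative log of the probability the model assigned to the correct next token. A quick sketch with invented probabilities:

```javascript
// Cross-entropy loss for next-token prediction: -log(p of the correct token).
// The distribution below is made up for illustration.
function nextTokenLoss(probs, correctToken) {
  return -Math.log(probs[correctToken]);
}

// Hypothetical predicted distribution after "The capital of France is"
const predicted = { " Paris": 0.8, " the": 0.1, " London": 0.05, " a": 0.05 };

console.log(nextTokenLoss(predicted, " Paris").toFixed(3));  // "0.223" - confident and correct: low loss
console.log(nextTokenLoss(predicted, " London").toFixed(3)); // "2.996" - what the loss would be if " London" were the target
```

Training nudges billions of parameters so this number shrinks, averaged over the whole corpus. Everything the model "knows" is a side effect of that pressure.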
Phase 2: Fine-tuning
A pre-trained model is a powerful text predictor, but it's not yet a useful assistant. It might continue your prompt with a plausible-sounding paragraph, but it doesn't know how to follow instructions or have a conversation. Fine-tuning adjusts the model on curated examples of the behavior you want: instruction-response pairs, conversation formats, specific tasks.
Phase 3: RLHF (Reinforcement Learning from Human Feedback)
This is what turned GPT-3 into ChatGPT. The process:
- Generate multiple responses to the same prompt.
- Have humans rank them from best to worst.
- Train a "reward model" to predict human preferences.
- Use reinforcement learning (PPO or similar) to optimize the LLM to produce responses the reward model rates highly.
RLHF is why ChatGPT says "I'm an AI language model" instead of making up a persona, why it tries to be helpful instead of just completing text, and why it refuses certain requests. It's also why different models (Claude, GPT, Gemini) feel different to use - their RLHF training reflects different priorities and values.

Inference: What Happens When You Press Enter
You've typed a prompt and hit send. Here's what happens next, step by step.
The Autoregressive Loop
LLMs generate text one token at a time. This is the autoregressive loop:
- Your entire prompt gets tokenized and fed through the model.
- The model outputs probabilities for what the next token should be.
- A token is selected from those probabilities.
- That token gets appended to the sequence.
- The whole sequence (prompt + generated tokens so far) gets processed again.
- Repeat until done.
This means generating a 500-token response requires 500 forward passes through the model. Conceptually, each pass processes the entire sequence up to that point (in practice the KV cache, covered below, avoids redoing most of that work). This is why longer responses take longer - it's not just streaming delay, the model literally does more computation.
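The loop itself is simple control flow. Here's a sketch with a stubbed-out model - the real `modelForward` is billions of parameters returning a probability distribution, while this stand-in just returns canned tokens to show the shape of the loop:

```javascript
const STOP_TOKEN = "<|end|>";
const MAX_TOKENS = 100;

// Stand-in for the real model. A real forward pass returns a probability
// distribution to sample from; this toy keys off the sequence length
// (assuming a 3-token prompt) to emit canned continuations.
function modelForward(tokens) {
  const canned = ["Paris", ".", STOP_TOKEN];
  return canned[tokens.length - 3] ?? STOP_TOKEN;
}

function generate(promptTokens) {
  const tokens = [...promptTokens];
  for (let i = 0; i < MAX_TOKENS; i++) {
    const next = modelForward(tokens); // one full forward pass per token
    if (next === STOP_TOKEN) break;    // stop token ends generation
    tokens.push(next);                 // append and go around again
  }
  return tokens;
}

console.log(generate(["The", "capital", "is"]));
// ["The", "capital", "is", "Paris", "."]
```

Swap `modelForward` for an API call and `break` for a finish-reason check, and this is essentially what every LLM client loop looks like.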
Temperature and Sampling
The model doesn't output one answer - it outputs a probability distribution over its entire vocabulary (50,000+ tokens). "Temperature" controls how that distribution gets used:
- Temperature = 0: Always pick the most probable token. Deterministic, repetitive, "safe." Use this for factual tasks.
- Temperature = 0.7: Usually pick likely tokens, but allow some randomness. Good for creative writing, conversation. Most APIs default to something in this range.
- Temperature = 1.5+: High randomness. Outputs become creative, surprising, and often incoherent. Rarely useful in production.
// Simplified sampling with temperature
function sample(logits, temperature) {
// Scale the raw scores by temperature
const scaled = logits.map(l => l / temperature);
// Convert to probabilities
const probs = softmax(scaled);
// Low temperature → one token dominates (deterministic)
// High temperature → more uniform distribution (random)
return randomSample(probs);
}
// Temperature 0.1: "The capital of France is Paris" (always)
// Temperature 1.0: "The capital of France is Paris" (usually)
// Temperature 2.0: "The capital of France is magnificent" (creative/wrong)
Top-k and Top-p (Nucleus Sampling)
Temperature alone isn't great - even at reasonable temperatures, there's always a small chance of sampling a wildly unlikely token. Two additional techniques help:
- Top-k: Only consider the top k most probable tokens. If k=50, the model ignores everything outside the top 50 candidates.
- Top-p (nucleus sampling): Only consider tokens whose cumulative probability adds up to p. If p=0.9, keep adding tokens (most probable first) until you hit 90% cumulative probability, then sample from that set.
Top-p is generally preferred because it adapts to the situation. When the model is very confident (one token has 95% probability), top-p with p=0.9 effectively just picks that token. When the model is uncertain (many tokens around 5% each), it allows exploration. Most APIs expose these parameters, and they meaningfully affect output quality.
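Top-p filtering is only a few lines: sort tokens by probability, keep adding them until the cumulative mass reaches p, then renormalize what's left. A sketch (the distribution is made up):

```javascript
// Keep the smallest set of most-probable tokens whose cumulative
// probability reaches p, then renormalize so they sum to 1.
function topPFilter(probs, p) {
  const sorted = Object.entries(probs).sort((a, b) => b[1] - a[1]);
  const nucleus = [];
  let cumulative = 0;
  for (const [token, prob] of sorted) {
    nucleus.push([token, prob]);
    cumulative += prob;
    if (cumulative >= p) break; // stop once we've covered p
  }
  const total = nucleus.reduce((sum, [, prob]) => sum + prob, 0);
  return Object.fromEntries(nucleus.map(([t, prob]) => [t, prob / total]));
}

// Hypothetical next-token distribution
const nextTokenProbs = { " Paris": 0.5, " the": 0.3, " a": 0.15, " banana": 0.05 };

console.log(topPFilter(nextTokenProbs, 0.9));
// { " Paris": ..., " the": ..., " a": ... } - " banana" is cut off
```

Notice the adaptive behavior: if " Paris" alone had probability 0.95, the nucleus would contain just that one token, which is exactly the property described above.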
The KV Cache: Why Context Windows Have Real Cost
Here's an optimization detail that matters for production: the KV cache. Remember that attention computes Keys and Values for every token. When generating the 100th token, the K and V vectors for tokens 1-99 don't change - only the new token needs new K/V vectors. So the model caches them.
This is why "context window" isn't just about capability - it's about memory. A 128K context window means the model needs to store K and V vectors for up to 128K tokens across all layers and all attention heads. This eats GPU memory fast. It's a big part of why API pricing scales with both input and output tokens, and why long context queries are expensive.
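You can estimate this memory cost yourself. The cache holds one K and one V vector per token, per layer, per KV head, so the size is roughly 2 × layers × KV heads × head dimension × sequence length × bytes per value. The model shape below is hypothetical (loosely Llama-class numbers), not any specific model's published architecture:

```javascript
// Rough KV cache size: 2 (K and V) * layers * kvHeads * headDim * seqLen * bytes
function kvCacheBytes({ layers, kvHeads, headDim, seqLen, bytesPerValue }) {
  return 2 * layers * kvHeads * headDim * seqLen * bytesPerValue;
}

// Hypothetical model shape for illustration
const bytes = kvCacheBytes({
  layers: 32,
  kvHeads: 8,       // grouped-query attention: fewer KV heads than query heads
  headDim: 128,
  seqLen: 128_000,  // a full 128K context
  bytesPerValue: 2, // fp16
});

console.log((bytes / 1e9).toFixed(1) + " GB"); // "16.8 GB" - for ONE request's cache
```

Sixteen-plus gigabytes of GPU memory per fully-loaded request is why providers charge for long contexts and why techniques like grouped-query attention (shrinking `kvHeads`) and cache quantization exist.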
Why This Matters for Developers
Now the payoff. Understanding how LLMs work under the hood directly improves how you build with them.
Better prompting through understanding tokens
Knowing about tokenization means you understand why "use simple words" is actually practical advice - simple words are single tokens that the model has seen millions of times. Multi-token words have more room for error. You also understand why formatting matters: markdown, XML tags, and clear delimiters help the model because they create strong, unambiguous token patterns.
Why LLMs hallucinate
LLMs don't "know" things - they predict probable next tokens. If a confident-sounding continuation is statistically likely, the model will generate it even if it's factually wrong. The model has no internal fact-checker. It has no way to say "I'm not sure about this specific detail" at the token-probability level - though RLHF training helps it learn to express uncertainty at a higher level.
This means you should never trust LLM output for critical facts without verification. Build your applications with this assumption. Use RAG (Retrieval-Augmented Generation) to give the model grounded sources. Add validation layers. Show sources to users.
Why LLMs are bad at math
Two reasons, both related to what we covered. First, tokenization: numbers don't tokenize in a way that preserves their mathematical properties. "1847" might be two tokens with no inherent numerical relationship. Second, the model was trained on next-token prediction, not arithmetic. It can do math it's memorized from training data, but it can't reliably compute novel calculations. This is why chain-of-thought prompting helps - it forces the model to work through steps using tokens, turning a computation problem into a text-prediction problem.
Context window management
Understanding the KV cache and attention helps you appreciate why context window management matters. Attention has quadratic complexity with sequence length - processing 100K tokens is not 10x the cost of 10K tokens, it's closer to 100x. Put important information at the beginning and end of long prompts (models attend more strongly to these positions - this is called the "lost in the middle" problem). Trim unnecessary context. Use summarization for conversation history.
Building better AI features
When you understand that LLMs generate one token at a time, you understand why streaming matters for UX - users see output appearing in real time instead of waiting for the full response. When you understand temperature, you know to set it low for code generation and data extraction, higher for creative tasks. When you understand attention, you know why structured prompts (with clear sections and labels) work better than walls of text.
The Landscape in 2026
The field is moving fast. Here are the most important trends shaping how developers work with LLMs right now.
Mixture of Experts (MoE)
Instead of one monolithic model, MoE architectures use multiple specialized "expert" sub-networks and a router that decides which experts to activate for each token. This means you can have a model with hundreds of billions of parameters total but only activate a fraction of them for any given input - much more efficient. Mixtral and GPT-4 (reportedly) use this approach.
Longer context windows
We've gone from GPT-3's 4K tokens to models supporting 1M+ tokens. This is enabled by improvements in positional encoding (like RoPE and ALiBi), more efficient attention mechanisms (flash attention, ring attention), and better KV cache management. For developers, this means you can send entire codebases or document sets in a single prompt - but cost and latency still scale with length, so use it wisely.
Multimodal models
Modern models like GPT-4o, Claude, and Gemini handle images - and in some cases audio and video - alongside text. Architecturally, this works by encoding non-text inputs into the same vector space as text tokens - an image becomes a sequence of "visual tokens" that flow through the same transformer layers. For developers, this opens up entirely new application categories.
On-device inference
Smaller, quantized models running on phones, laptops, and edge devices. Llama 3 and Mistral have shown that you can get useful performance from 7B or even 3B parameter models. Combined with techniques like quantization (reducing precision from 16-bit to 4-bit) and speculative decoding, on-device AI is becoming practical for real applications.
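Quantization in its simplest form maps floats to small integers plus a single scale factor. Here's a minimal symmetric int8 sketch - real schemes (GPTQ, AWQ, 4-bit group quantization) are considerably more sophisticated, but the core trade-off is the same:

```javascript
// Symmetric int8 quantization: store int8 values plus one float scale.
function quantizeInt8(weights) {
  const maxAbs = Math.max(...weights.map(Math.abs));
  const scale = maxAbs / 127; // map [-maxAbs, maxAbs] onto [-127, 127]
  const q = weights.map(w => Math.round(w / scale));
  return { q, scale };
}

function dequantizeInt8({ q, scale }) {
  return q.map(v => v * scale); // lossy: rounding error remains
}

const weights = [0.12, -0.5, 0.33, 0.01];
const packed = quantizeInt8(weights);
const restored = dequantizeInt8(packed);

console.log(packed.q); // small integers: 1 byte each instead of 2-4
console.log(restored); // close to the originals, but not exact
```

That small rounding error, multiplied across billions of weights, is the quality cost you pay for a model that's 2-4x smaller and fits on a phone.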
The model landscape
The days of "just use GPT-4" are over. In 2026, the landscape includes:
- Claude (Anthropic) - excels at complex reasoning, long context, and following nuanced instructions. Strong safety focus.
- GPT-4 / GPT-4o (OpenAI) - the generalist incumbent. Broad capabilities, huge ecosystem.
- Gemini (Google) - deeply integrated with Google's ecosystem. Strong multimodal capabilities.
- Llama (Meta) - open-weight. Run it yourself, fine-tune it, deploy it anywhere. The backbone of the open-source AI ecosystem.
- Mistral - European AI lab producing efficient, open-weight models. Pioneered the MoE approach in open models.
As a developer, the right approach is understanding the trade-offs and picking the right model for each task - or even using different models for different parts of your application.
Resources to Go Deeper
If this article got you interested and you want to go deeper, these are the resources I'd recommend - all free and all excellent:
- Andrej Karpathy's "Intro to Large Language Models" - A one-hour YouTube talk that's the single best introduction to how LLMs work. Karpathy is the former head of AI at Tesla and a founding member of OpenAI, and he explains things brilliantly.
- Andrej Karpathy's "Neural Networks: Zero to Hero" - If you want to build a language model from scratch (literally, character by character), this video series takes you there. Requires basic Python.
- The Illustrated Transformer (Jay Alammar) - The best visual explainer of the transformer architecture. Every diagram you need to understand attention, Q/K/V, and multi-head attention.
- Hugging Face NLP Course - A hands-on course that lets you work with transformer models directly. Great for developers who learn by doing.
- "Attention Is All You Need" (original paper) - The 2017 paper that started it all. Surprisingly readable for an ML paper, especially now that you understand the concepts.
Wrapping Up
You don't need to understand every detail of transformer math to be effective with LLMs. But knowing the basics - that text becomes tokens, tokens become vectors, attention lets tokens communicate, and the whole thing is just predicting the next most likely token - gives you a huge advantage as a developer.
You'll write better prompts because you understand what the model actually sees. You'll debug weird outputs because you know about tokenization edge cases. You'll build better applications because you understand the fundamental strengths and limitations of the technology. And you'll be able to evaluate new models and techniques as they come out because you have the mental framework to understand what's actually changing.
The developers who thrive in the AI era won't be the ones who treat LLMs as magic black boxes. They'll be the ones who understand the machine well enough to use it masterfully.