A large language model (LLM) is an AI system trained on enormous quantities of text, typically hundreds of billions to trillions of words, that learns to predict the next token of language given everything that came before it. That single prediction objective, applied trillions of times across web text, books, code, and structured data, produces a system that can hold conversations, write code, summarize documents, translate languages, draft contracts, and reason through multi-step problems. LLMs are the engines behind ChatGPT, Claude, Gemini, Copilot, and most major text-based generative AI products released in the last three years.
The “large” in large language model is not marketing. The smallest commercially significant LLMs in 2026 have around 7 billion parameters; the frontier models are estimated to exceed 1.5 trillion. Scale matters because the capabilities that make LLMs commercially useful (instruction-following, in-context learning, multi-step reasoning) become reliable only above certain size thresholds. A 1B-parameter model generally cannot match a 70B-parameter model, and a 70B model generally cannot match a 500B+ frontier model. Understanding why scale produces capability, and which capabilities scale unlocks, is the foundation of working with LLMs in 2026.
How a large language model actually works
At the core of every modern LLM sits a transformer architecture, a neural network design introduced by Google researchers in 2017 that uses a mechanism called self-attention to let each token in a sequence draw on other tokens when producing its output (in the decoder-style models behind most LLMs, each token attends to the tokens before it). Before transformers, language models processed text sequentially and lost track of long-range context; transformers process the whole context window in parallel and can use information from thousands of tokens earlier with far less degradation, although recall over very long contexts is still imperfect.
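To make self-attention concrete, here is a minimal sketch in plain Python. It is an illustration under simplifying assumptions, not how any production model is implemented: real transformers apply learned query, key, and value projections and run many attention heads on GPUs, whereas this toy version reuses the raw input vectors for all three roles. The function names and the toy vectors are hypothetical.

```python
import math

def softmax(xs):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_self_attention(queries, keys, values):
    """Scaled dot-product attention where position i only sees positions 0..i."""
    d = len(queries[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Score the current token against itself and every earlier token.
        scores = [dot(q, keys[j]) / math.sqrt(d) for j in range(i + 1)]
        weights = softmax(scores)
        # The output is the attention-weighted average of the visible values.
        out = [sum(w * values[j][k] for j, w in enumerate(weights))
               for k in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy input: 3 tokens with 2-dimensional vectors (hypothetical values).
# Assumption: the same vectors stand in for queries, keys, and values.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(x, x, x)

for i, row in enumerate(weights):
    print(f"token {i} attends over {len(row)} position(s)")
```

The third token's output mixes information from all three positions in a single step, which is the property that lets transformers use distant context without passing it through every intermediate token.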
The training process has three phases. First, pretraining: the model is shown a vast corpus of text and asked to predict the next token at every position. Every prediction adjusts the model’s weights, and the more probability the model put on the wrong token, the larger the adjustment. Run this loop for months on tens of thousands of GPUs and the model learns grammar, facts, world knowledge, code patterns, and the latent structure of arguments. Second, supervised fine-tuning on instruction-response pairs teaches the model to follow human instructions rather than continue arbitrary text. Third, preference tuning, typically RLHF or a related method such as DPO or Constitutional AI, aligns the model’s outputs with what humans actually find helpful, harmless, and honest.
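The pretraining signal can be sketched with the standard cross-entropy loss for a single next-token prediction. This is a minimal illustration, assuming a hypothetical three-token vocabulary and hand-picked scores; it computes the gradient with respect to the model's output scores (logits) only, whereas real training backpropagates that signal through billions of weights.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss_and_grad(logits, target_index):
    """Cross-entropy loss for one position, plus its gradient w.r.t. the logits."""
    probs = softmax(logits)
    loss = -math.log(probs[target_index])
    # For softmax + cross-entropy, the gradient is (predicted probs - one-hot target).
    grad = [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]
    return loss, grad

def magnitude(vec):
    return math.sqrt(sum(g * g for g in vec))

vocab = ["cat", "dog", "mat"]   # hypothetical vocabulary
logits = [0.5, 0.1, 3.0]        # hypothetical scores: the model strongly favors "mat"

# Case 1: the true next token is "mat", so the model was mostly right.
loss_right, grad_right = next_token_loss_and_grad(logits, vocab.index("mat"))
# Case 2: the true next token is "dog", so the model was badly wrong.
loss_wrong, grad_wrong = next_token_loss_and_grad(logits, vocab.index("dog"))

print("mostly right:", round(loss_right, 3), round(magnitude(grad_right), 3))
print("badly wrong: ", round(loss_wrong, 3), round(magnitude(grad_wrong), 3))
```

The wrong prediction yields both a larger loss and a larger gradient, so the update it triggers is correspondingly bigger.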
The output of all this training is a giant statistical model of language. When you send a prompt to an LLM, the system tokenizes your text, runs it through the transformer, and computes a probability distribution over the possible next tokens. It samples one (with some randomness controlled by temperature), appends it to the context, and repeats. The streaming output you see in ChatGPT and Claude is literally this process happening one token at a time.
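The generation loop described above can be sketched as follows. The `toy_model` function is a hypothetical stand-in for the transformer: it simply favors a fixed word order over a four-token vocabulary. Everything here is an illustrative assumption rather than any vendor's implementation, including the choice to treat a temperature of zero as greedy decoding.

```python
import math
import random

VOCAB = ["the", "cat", "sat", "<eos>"]  # hypothetical toy vocabulary

def toy_model(context):
    """Stand-in for a transformer: returns one score (logit) per vocabulary token."""
    last = VOCAB.index(context[-1])
    logits = [0.0] * len(VOCAB)
    logits[(last + 1) % len(VOCAB)] = 4.0  # strongly prefer the "next" word
    return logits

def sample_with_temperature(logits, temperature, rng):
    """Sample a token index; lower temperature makes the choice more deterministic."""
    if temperature <= 0:
        # Assumption: treat zero temperature as greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]  # unnormalized softmax
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def generate(prompt, max_new_tokens, temperature, seed=0):
    rng = random.Random(seed)
    context = list(prompt)
    for _ in range(max_new_tokens):
        logits = toy_model(context)           # 1. score every possible next token
        token = VOCAB[sample_with_temperature(logits, temperature, rng)]  # 2. sample
        if token == "<eos>":                  # stop at the end-of-sequence marker
            break
        context.append(token)                 # 3. append to the context and repeat
    return context

print(generate(["the"], max_new_tokens=10, temperature=0.0))
print(generate(["the"], max_new_tokens=10, temperature=1.5, seed=42))
```

At temperature 0 the loop always follows the highest-scoring token; at higher temperatures lower-scoring tokens get picked more often, which is why the same prompt can produce different outputs.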
Why context windows matter
An LLM can only “see” what fits inside its context window. In early 2023, frontier models typically had 8K- to 32K-token windows; in 2026, frontier windows reach 1M-2M tokens, enough to hold an entire codebase, a 10-hour transcript, or thousands of pages of legal discovery. This single change has reshaped how engineers build with LLMs. Workflows that used to require retrieval-augmented generation for any document over a few thousand words can now often run as straightforward in-context tasks.
But context length is not free. Cost scales linearly with input tokens, and latency scales worse than that for many architectures, because standard self-attention compares every token with every other token and so does work that grows quadratically with sequence length. The pragmatic 2026 pattern is: fit what you need into context when you can, fall back to retrieval when the corpus exceeds what’s economical to load every call, and use specialized long-context models when the task fundamentally requires whole-document reasoning.
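One way to read that three-way pattern is as a small decision rule. The sketch below is only an illustration: the function name, the parameters, and the idea of a fixed per-call cost budget are assumptions introduced here, and a real system would also weigh latency and answer quality.

```python
def choose_strategy(corpus_tokens, context_window, price_per_million_input,
                    max_cost_per_call, requires_whole_document=False):
    """Pick 'in_context', 'long_context_model', or 'retrieval' for one workload.

    All thresholds are caller-supplied assumptions, not recommended values.
    """
    fits = corpus_tokens <= context_window
    cost = corpus_tokens / 1_000_000 * price_per_million_input
    if fits and cost <= max_cost_per_call:
        return "in_context"            # fits the window and the budget
    if requires_whole_document:
        return "long_context_model"    # the task cannot be split into retrieved chunks
    return "retrieval"                 # too big or too expensive to load every call

# Hypothetical workloads: a 50K-token manual and an 800K-token codebase,
# both against a 1M-token window at $3 per million input tokens.
print(choose_strategy(50_000, 1_000_000, 3.0, max_cost_per_call=0.50))
print(choose_strategy(800_000, 1_000_000, 3.0, max_cost_per_call=0.50))
print(choose_strategy(800_000, 1_000_000, 3.0, max_cost_per_call=0.50,
                      requires_whole_document=True))
```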
The model families to know in 2026
Four labs ship the frontier general-purpose LLMs: Anthropic (Claude Opus, Sonnet, Haiku), OpenAI (GPT-5.5 and the o-series reasoning models), Google DeepMind (Gemini 3.x), and Meta (Muse Spark). xAI’s Grok ships at the upper-mid tier. Below the frontier, a robust open-weights ecosystem — Mistral, Qwen, DeepSeek, Llama, Cohere — provides models you can host yourself with capabilities approaching but not matching the closed frontier.
Specialized model families round out the landscape. Coding-tuned models like Claude Sonnet for coding, GPT-5.5-codex, and Qwen-Coder dominate AI coding agent workflows. Vision-language models like GPT-5.5-V and Gemini 3.x handle images and video natively. Audio-native models like GPT-4o-audio and Gemini 2.5 Native Audio handle speech end-to-end without separate ASR/TTS pipelines.
How LLMs differ from older NLP systems
Older natural language processing systems were narrow — a sentiment classifier, a named-entity recognizer, a translation system, each trained on a specific dataset for a specific output format. LLMs are the opposite: one model, trained once, asked at inference time to do whatever the prompt describes. This is sometimes called the “general-purpose model” property, and it is the single most important change LLMs introduced.
The implication for builders is significant. You no longer train a new model for each task. You write a prompt, optionally provide a few examples (few-shot prompting), optionally fine-tune on a small dataset, and you have a working system. The cost structure has flipped from “expensive to build, cheap to run” to “cheap to build, ongoing inference costs.” Whether that’s better for your use case depends on your volumes — but the development velocity has increased by orders of magnitude.
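As a small illustration of few-shot prompting, the sketch below assembles a prompt from an instruction, a handful of worked examples, and a new input. The `Input:`/`Output:` layout and the sentiment task are assumptions chosen for the example; any consistent format works, and the resulting string would be sent to whichever model API you use.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and a new query into one prompt."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Output: {label}")
        lines.append("")
    # End with the new input and an open "Output:" for the model to complete.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

# Hypothetical task: sentiment classification with two examples.
examples = [
    ("The onboarding flow was effortless.", "positive"),
    ("Support never answered my ticket.", "negative"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each input as positive or negative.",
    examples,
    "The new dashboard is slower than the old one.",
)
print(prompt)
```

Swapping the instruction and the examples turns the same model into a different classifier or extractor, with no retraining.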
What LLMs are good at, and where they still fail
LLMs in 2026 reliably handle: drafting documents, summarizing long inputs, code generation and refactoring, structured data extraction, multi-turn dialogue, translation across major languages, basic mathematical reasoning, document classification, and tool-use orchestration as part of agent workflows. Specialized reasoning models handle complex multi-step problems — competition math, advanced coding, scientific literature synthesis — at levels that match or exceed expert humans on benchmark tests.
They still fail at: tasks requiring genuine novelty beyond the training distribution, tasks requiring perfect factual precision when the source isn’t in context, tasks requiring complex multi-modal reasoning across video and physical-world dynamics, and tasks that require sustained agentic operation over hours or days without supervision. Hallucination — confident output of wrong facts — remains a known failure mode, mitigated but not eliminated by retrieval, citation requirements, and post-hoc verification.
The economics of running an LLM
Frontier API pricing in 2026 ranges from roughly $1-3 per million input tokens for fast/cheap models to $15-30 per million for the top reasoning models. A typical paragraph of text is about 100-200 tokens. A 10K-token retrieval-augmented prompt with a 1K-token response might cost two to four cents on a model at the cheaper end of that range, and roughly ten times that on a top reasoning model, where the input alone comes to 15-30 cents. That is small per call but consequential at production volumes.
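The per-call arithmetic is easy to check with a few lines of code. The input prices below fall inside the ranges quoted above; the output-token prices are assumptions made for illustration, since the text does not give them, and the monthly call volume is hypothetical.

```python
def call_cost_usd(input_tokens, output_tokens,
                  input_price_per_million, output_price_per_million):
    """Cost of one API call in US dollars, given per-million-token prices."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000

# A 10K-token prompt with a 1K-token response, as in the example above.
# Assumption: output prices of $10 and $75 per million are placeholders.
cheap = call_cost_usd(10_000, 1_000, input_price_per_million=2.0,
                      output_price_per_million=10.0)
premium = call_cost_usd(10_000, 1_000, input_price_per_million=15.0,
                        output_price_per_million=75.0)

calls_per_month = 1_000_000  # hypothetical production volume
print(f"cheap tier:   ${cheap:.3f} per call, ${cheap * calls_per_month:,.0f} per month")
print(f"premium tier: ${premium:.3f} per call, ${premium * calls_per_month:,.0f} per month")
```

Under these assumed prices, a few cents per call becomes tens of thousands of dollars a month at a million calls, which is why model choice and prompt length matter at production scale.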
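Hosting open-weights models yourself trades API cost for infrastructure cost. A single H200 or Blackwell B200 can serve models in the 7B-70B range, though at the 70B end this usually means quantized weights: 70B parameters at 16-bit precision take about 140 GB before any KV cache, which leaves little or no headroom on one card. Multi-GPU clusters serve 70B-500B models. The break-even between API and self-hosting depends on volume, latency requirements, and how much engineering time you can spend on inference optimization.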
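A rough sizing check for self-hosting starts from the memory the weights occupy: parameter count times bytes per parameter. The sketch below is a back-of-the-envelope estimate only. The 20% overhead for KV cache and activations is an assumption that varies widely with batch size and context length, and the 141 GB figure is the H200's published memory capacity, which you should confirm against the vendor datasheet for your exact card.

```python
def weight_memory_gb(params_billions, bits_per_param):
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_param / 8

def fits_on_gpu(params_billions, bits_per_param, gpu_memory_gb,
                overhead_fraction=0.2):
    """Rough fit check. overhead_fraction is an assumed allowance for
    KV cache and activations, not a measured value."""
    needed = weight_memory_gb(params_billions, bits_per_param) * (1 + overhead_fraction)
    return needed <= gpu_memory_gb

H200_MEMORY_GB = 141  # published capacity; verify for your hardware

for params, bits in [(7, 16), (70, 16), (70, 8), (70, 4)]:
    gb = weight_memory_gb(params, bits)
    ok = fits_on_gpu(params, bits, H200_MEMORY_GB)
    print(f"{params}B at {bits}-bit: {gb:.0f} GB of weights, fits on one card: {ok}")
```

Under these assumptions a 7B model fits easily, while a 70B model fits on a single card only once its weights are quantized to 8 bits or fewer.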
Where to learn next
If you’re new to AI, start with the AI for Beginners 2026 introduction. To pick the right LLM for a specific task, work through the 2026 AI Model Buyer’s Guide. To build production applications with LLMs, the RAG in Production 2026 playbook covers the most common deployment pattern. To get more from any LLM you use, Prompt Engineering 101 is the highest-leverage skill in the field.
The AI Learning Guides Free Library has comprehensive deep-dive playbooks for every major industry that’s deploying LLMs at scale — healthcare, legal, financial services, pharma, manufacturing, retail, marketing, education, and cybersecurity — plus technical playbooks for engineers building with LLMs in production.