Language Models are Few-Shot Learners
We trained a 175-billion-parameter language model — GPT-3 — and showed that by simply presenting it with a handful of task examples in plain text, it can perform competitively across dozens of NLP benchmarks without any gradient updates or fine-tuning whatsoever.
The surprise was not just the performance but the flexibility: the same frozen model that answers trivia questions can translate French, solve arithmetic, and write convincing news articles, all driven purely by the examples you show it in the prompt.
Perhaps most unsettling: human evaluators could distinguish GPT-3-generated news articles from real ones only 52% of the time — barely better than chance.
The Problem We Were Trying to Solve
Fine-tuning a new model for every new task is an unsustainable paradigm — and it quietly bakes in a kind of brittleness we rarely talk about.
By 2020, the dominant recipe for NLP was: pretrain a large model on raw text, then fine-tune it on thousands or tens of thousands of labeled examples for your specific task. BERT, RoBERTa, T5 — all brilliant systems, all following this template. And the results were undeniably good. But the approach carries hidden costs that compound as you scale.
First, labeled data is expensive. To apply a language model to a new task — say, detecting sarcasm in social media posts, or classifying contract clauses — you need a fresh dataset. That has to be collected, annotated, quality-checked, and maintained. For many useful tasks, this is simply infeasible.
Second, and more subtle: a model fine-tuned on a narrow distribution tends to exploit spurious correlations in that distribution. The model learns shortcuts specific to the benchmark dataset rather than the underlying skill. This inflates scores in ways that don't generalize. A model scoring at "human level" on a dataset is not necessarily performing at human level on the underlying task in the wild.
Third, humans don't need large supervised datasets to pick up a new language task. When I ask a friend to "tell me if this sentence sounds too formal," they don't need fifty annotated examples to understand the task. They recognize it almost instantly from context. This adaptability, a form of what we call meta-learning, is something humans deploy constantly, and we wanted our language models to do the same.
The Core Idea: In-Context Learning
No gradient updates. No weight changes. Pure forward-pass adaptation — the model reads your examples and figures out the task on the fly.
The key insight is deceptively simple. A language model is trained to predict the next token in a sequence. If you carefully construct your input sequence to contain examples of a task, the model can recognize the pattern and continue it. We formalized this into three distinct settings:
Zero-shot: Give the model only a natural language instruction. No examples at all.
One-shot: Give the model one input-output demonstration, then the new query.
Few-shot: Give the model K demonstrations — typically 10 to 100, constrained by the 2048-token context window — before the new query.
Here's the concrete translation example from the paper: the prompt shows the model three English→French word pairs, then gives a final English word whose translation is left blank.
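A minimal sketch of how such a prompt can be assembled as plain text (the demonstration pairs are in the spirit of the paper's figure; no model call is made here):

```python
# Build a few-shot English->French translation prompt as a plain string.
# The final line is left open for the model to complete.
demonstrations = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
    ("plush giraffe", "girafe peluche"),
]
query = "cheese"

lines = ["Translate English to French:"]
for en, fr in demonstrations:
    lines.append(f"{en} => {fr}")
lines.append(f"{query} =>")  # left blank for the model to fill in
prompt = "\n".join(lines)
print(prompt)
```

The entire "task specification" is this string; nothing about translation is ever written into the model's weights.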
The model reads those three input→output pairs, recognizes the translation pattern, and completes the final one correctly: fromage. No fine-tuning. No gradient step. Just pattern completion over a forward pass.
A useful analogy: imagine hiring a chef. Fine-tuning is like enrolling them in a six-week cooking school tailored to your specific cuisine. Few-shot learning is like handing them a recipe card at the kitchen door. The chef already knows how to cook — you're just specifying what you want today.
Whether the model is truly "learning" the task from scratch in context, or recognizing a pattern it encountered during pretraining, is actually an open and fascinating question we discuss later. But either way, the practical behavior is the same: show it examples, it gets better at the task.
How GPT-3 Works
Same architecture as GPT-2 — a decoder-only Transformer — but scaled by roughly two orders of magnitude, to 175 billion parameters.
The architecture uses alternating dense and locally banded sparse attention patterns (inspired by the Sparse Transformer), with pre-layer normalization and the same BPE tokenizer as GPT-2. We trained eight model sizes simultaneously to study how capability scales with compute — from 125M parameters up to 175B.
Model Size Ladder
| Model | Parameters | Layers | d_model | Attention Heads | Batch Size |
|---|---|---|---|---|---|
| GPT-3 Small | 125M | 12 | 768 | 12 | 0.5M tokens |
| GPT-3 Medium | 350M | 24 | 1,024 | 16 | 0.5M tokens |
| GPT-3 Large | 760M | 24 | 1,536 | 16 | 0.5M tokens |
| GPT-3 XL | 1.3B | 24 | 2,048 | 24 | 1M tokens |
| GPT-3 2.7B | 2.7B | 32 | 2,560 | 32 | 1M tokens |
| GPT-3 6.7B | 6.7B | 32 | 4,096 | 32 | 2M tokens |
| GPT-3 13B | 13B | 40 | 5,140 | 40 | 2M tokens |
| GPT-3 175B ★ | 175B | 96 | 12,288 | 96 | 3.2M tokens |
All models were trained on a total of 300 billion tokens, on V100 GPUs in a high-bandwidth cluster provided by Microsoft. The context window is 2,048 tokens at every size.
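As a sanity check on the ladder above, the non-embedding parameter count of a decoder-only Transformer can be estimated with a standard rule of thumb (an approximation, not a formula from the paper):

```python
# Rule-of-thumb parameter count for a decoder-only Transformer stack:
# ~4*d^2 weights in attention plus ~8*d^2 in the feed-forward layers gives
# ~12 * n_layers * d_model^2, ignoring embeddings, biases, and layer norms.
def approx_params(n_layers: int, d_model: int) -> float:
    return 12 * n_layers * d_model ** 2

print(f"{approx_params(96, 12288) / 1e9:.0f}B")  # -> 174B, close to the quoted 175B
```

For the smallest models the ignored embedding matrix is a much larger share of the total, so the estimate is looser at the bottom of the ladder.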
Training Data Mix
We trained primarily on a filtered version of Common Crawl — an enormous web scrape — but raw Common Crawl has quality problems. So we filtered it against high-quality reference corpora, deduplicated aggressively at the document level, and blended in curated sources. Importantly, we intentionally oversampled higher-quality data rather than sampling proportionally to raw size.
| Dataset | Tokens | Weight in Training Mix | Epochs During Training |
|---|---|---|---|
| Common Crawl (filtered) | 410B | 60% | 0.44× |
| WebText2 | 19B | 22% | 2.9× |
| Books1 | 12B | 8% | 1.9× |
| Books2 | 55B | 8% | 0.43× |
| Wikipedia | 3B | 3% | 3.4× |
Notice that WebText2 — despite being only 19B tokens — carries 22% of the training weight. We see roughly three passes over it during training. This reflects a deliberate quality-over-quantity philosophy. Wikipedia, similarly, is tiny in raw tokens but gets 3.4 full passes. The model is learning more per token from these curated sources.
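One way to read the weights column: each training example is drawn from the five sources with fixed probabilities rather than in proportion to raw size. A minimal sketch, normalizing the table's rounded weights (which sum to slightly over 100%):

```python
import random

# Sampling mixture from the table above. The weights are the rounded
# published values, so they sum to a bit over 1 and are normalized here.
weights = {
    "Common Crawl (filtered)": 0.60,
    "WebText2": 0.22,
    "Books1": 0.08,
    "Books2": 0.08,
    "Wikipedia": 0.03,
}
total = sum(weights.values())
probs = {name: w / total for name, w in weights.items()}

random.seed(0)
sample = random.choices(list(probs), weights=list(probs.values()), k=10_000)
# WebText2 shows up ~22% of the time despite its small raw token count.
print(sample.count("WebText2") / len(sample))
```

Sampling this way is exactly what produces multiple epochs over the small curated sources while Common Crawl is seen less than once.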
What We Found
GPT-3 achieves state-of-the-art results on several benchmarks without any fine-tuning, results that conventional wisdom said required task-specific training.
LAMBADA: The +18% Shock
LAMBADA is a cloze task (fill-in-the-blank): predict the last word of a sentence that requires reading a full paragraph of context. Prior state-of-the-art, held by the 17B-parameter Turing-NLG model, was 68.0% accuracy. We expected incremental gains.
| Setting | LAMBADA Accuracy | LAMBADA Perplexity |
|---|---|---|
| Previous SOTA (Turing-NLG 17B) | 68.0% | 8.63 |
| GPT-3 Zero-Shot | 76.2% | 3.00 |
| GPT-3 One-Shot | 72.5% | 3.35 |
| GPT-3 Few-Shot | 86.4% | 1.92 |
The key insight here is that few-shot framing allowed us to restructure the task as a fill-in-the-blank rather than open generation. This let the model infer that exactly one word was desired — something a standard language model has no way of knowing from the training objective alone. The few-shot format was doing real work.
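Sketching that fill-in-the-blank framing as a prompt string (the demonstration mirrors the style of the format described in the paper; the query sentence is an illustrative stand-in, not a real LAMBADA item):

```python
# Few-shot LAMBADA prompt in the fill-in-the-blank format described above.
# Demonstrations with an explicit blank and a one-word answer tell the
# model that exactly one word is wanted at the end.
demonstration = "Alice was friends with Bob. Alice went to visit her friend ____. -> Bob"
query = "George bought some baseball gear, a ball, a glove, and a ____. ->"
prompt = demonstration + "\n\n" + query
print(prompt)
```

The demonstration does double duty: it shows the task and it constrains the output format, which is exactly the "real work" attributed to the few-shot framing above.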
TriviaQA: Matching Fine-Tuned Retrieval Systems (Closed-Book)
TriviaQA tests factual question answering — no external documents allowed. GPT-3 must answer entirely from knowledge stored in its parameters. The comparison is stark:
| System | NaturalQS | WebQS | TriviaQA |
|---|---|---|---|
| RAG (Fine-tuned + Retrieval) | 44.5 | 45.5 | 68.0 |
| T5-11B + SSM (Closed-Book, Fine-tuned) | 36.6 | 44.7 | 60.5 |
| T5-11B (Fine-tuned, Closed-Book) | 34.5 | 37.4 | 50.1 |
| GPT-3 Zero-Shot | 14.6 | 14.4 | 64.3 |
| GPT-3 One-Shot | 23.0 | 25.3 | 68.0 |
| GPT-3 Few-Shot | 29.9 | 41.5 | 71.2 |
Arithmetic From Scratch: An Emergent Ability
We hadn't designed GPT-3 to do arithmetic. It wasn't in our objectives. So what happened when we just asked it, in natural language, "Q: What is 48 plus 76?" was genuinely surprising.
The performance drop-off with digit count is informative. At 4 digits the model achieves around 25%, and at 5 digits around 9-10%. It's clearly struggling to carry digits reliably — it makes the kinds of mistakes a student makes when they forget to "carry the 1." But it's actually trying to compute, not pattern-match to memorized answers. That's a qualitatively different kind of failure than random guessing.
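A quick sketch of how such probes can be generated and scored. The `make_addition_probe` helper is hypothetical (not from the paper), and a plain calculator stands in for the reference answers; querying an actual model is out of scope here:

```python
import random

# Generate natural-language addition probes in the "Q: ... A:" format.
# The exact integer sum serves as the reference answer for grading.
def make_addition_probe(rng: random.Random, digits: int = 2) -> tuple[str, str]:
    lo, hi = 0, 10 ** digits - 1
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)
    return f"Q: What is {a} plus {b}? A:", str(a + b)

rng = random.Random(0)
question, answer = make_addition_probe(rng)
print(question, answer)
```

Because operand pairs at 4 or 5 digits are vanishingly unlikely to appear verbatim in training text, graded probes like these are what support the claim that the model is computing rather than recalling.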
Translation: Strong Into English, Weaker Out of English
GPT-3's training data is 93% English. Yet it achieves competitive translation performance on several language pairs — and notably surpasses previous unsupervised translation baselines when translating into English, reflecting the strength of its English language model.
| System | En→Fr | Fr→En | En→De | De→En | En→Ro | Ro→En |
|---|---|---|---|---|---|---|
| Supervised SOTA | 45.6 | 35.0 | 41.2 | 40.2 | 38.5 | 39.9 |
| MASS (Unsupervised) | 37.5 | 34.9 | 28.3 | 35.2 | 35.2 | 33.1 |
| GPT-3 Zero-Shot | 25.2 | 21.2 | 24.6 | 27.2 | 14.1 | 19.9 |
| GPT-3 One-Shot | 28.3 | 33.7 | 26.2 | 30.4 | 20.6 | 38.6 |
| GPT-3 Few-Shot | 32.6 | 39.2 | 29.7 | 40.6 | 21.0 | 39.5 |
Word Scrambling and SuperGLUE
GPT-3 achieves 71.8 on the SuperGLUE benchmark in few-shot mode — exceeding fine-tuned BERT-Large (69.0) using only 32 examples per task, with zero gradient updates. On ReCoRD it comes within striking distance of SOTA at 91.1 F1. On word scrambling tasks — where it must recover original words from anagrammed or permuted letters — we see spectacular few-shot learning curves, particularly at 175B scale where the ability seems to emerge discontinuously relative to smaller models.
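For concreteness, one of the word-scrambling variants, anagrams that preserve the first and last letters, can be sketched as follows (the `scramble_inner` helper is an illustrative reconstruction, not the paper's exact generation code):

```python
import random

# "Anagrams of all but the first and last character": interior letters
# are shuffled and the model must recover the original word.
def scramble_inner(word: str, rng: random.Random) -> str:
    if len(word) <= 3:
        return word  # nothing meaningful to shuffle
    inner = list(word[1:-1])
    rng.shuffle(inner)
    return word[0] + "".join(inner) + word[-1]

print(scramble_inner("criminal", random.Random(0)))
```

Tasks built this way probe symbol manipulation rather than memorized text, since the scrambled forms are essentially never seen during pretraining.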
Where It Still Struggles
Natural language inference, tasks involving pairwise comparison, and certain reading comprehension formats remain stubborn weak spots — and the reasons are illuminating.
We want to be honest about this. GPT-3 does not uniformly improve on everything. There are clear failure modes, and understanding them points toward what must change architecturally or algorithmically.
Natural Language Inference (ANLI, RTE)
NLI (Natural Language Inference) asks: does sentence A imply, contradict, or remain neutral with respect to sentence B? On Adversarial NLI (ANLI), which was specifically designed to fool language models, all our models smaller than 175B perform at essentially random chance (~33%). Even GPT-3 175B only partially closes the gap from chance to SOTA — and the improvement is inconsistent across rounds.
Our hypothesis is architectural: GPT-3 is a left-to-right autoregressive model. Tasks that require carefully comparing two spans — "does sentence A entail sentence B," "is this word used the same way in both sentences" — may fundamentally benefit from bidirectional attention. BERT-style models, which can attend in all directions simultaneously, have a natural advantage here.
Reading Comprehension: QuAC and RACE
On QuAC (conversational QA that requires modeling structured dialog turns), GPT-3 lands 13 F1 points below an ELMo-based baseline, a comparatively modest earlier model. On RACE (multiple-choice reading comprehension drawn from English exams for Chinese middle and high school students), it sits 45% below SOTA and competes only with early contextual-representation models. These formats require carefully re-reading a passage and then selecting among options, which again is something bidirectionality seems to help with.
Long-Document Coherence
In open-ended text generation, GPT-3 samples sometimes repeat themselves semantically at the document level, lose coherence over long passages, contradict earlier statements, or generate occasional non-sequiturs. The quality is high enough to pass human scrutiny on short passages (as the news article experiment shows), but sustained long-form generation remains a qualitatively harder problem.
The Human Evaluation That Unnerved Us
We asked humans to distinguish GPT-3-generated news articles from human-written ones. At the 175B scale, they could do so only 52% of the time — barely above chance.
We generated ~200-word news articles using GPT-3 at various model sizes, conditioned on real article titles and subtitles. We then asked human evaluators to rate each article: human or AI-generated? To calibrate the task, we also generated articles from a "control model" — a tiny 160M parameter model with high output randomness, deliberately producing bad output.
| Model | Mean Accuracy | 95% Confidence Interval | "I don't know" Rate |
|---|---|---|---|
| Control (deliberately bad) | 86% | 83%–90% | 3.6% |
| GPT-3 Small (125M) | 76% | 72%–80% | 4.9% |
| GPT-3 Medium (350M) | 61% | 58%–65% | 6.0% |
| GPT-3 Large (760M) | 68% | 64%–72% | 8.7% |
| GPT-3 XL (1.3B) | 62% | 59%–65% | 7.5% |
| GPT-3 2.7B | 62% | 58%–65% | 7.1% |
| GPT-3 6.7B | 60% | 56%–63% | 6.2% |
| GPT-3 13B | 55% | 52%–58% | 7.1% |
| GPT-3 175B | 52% | 49%–54% | 7.8% |
The articles that tripped up evaluators most often contained subtle tells rather than obvious ones: occasional factual inaccuracies (the model has no access to what actually happened on a given date), mild repetition in argument structure, and unusual phrasing choices that weren't quite wrong but weren't quite right either. These cues were apparently too subtle for most people to detect reliably under normal reading conditions.
What does this mean? At minimum, it suggests we are approaching an inflection point in synthetic text quality — one where the usual assumption that humans can reliably detect AI-generated content no longer holds. The downstream implications for misinformation, fraud, and content authenticity are significant and require serious attention beyond what any single paper can provide.
Broader Concerns
A model this capable trained on internet-scale data inherits internet-scale biases — and its ability to generate convincing text raises misuse risks we need to address directly.
We include this section not as a disclaimer but as a genuine part of the scientific contribution. Characterizing failure modes and societal risks is as important as reporting benchmark gains.
Gender Bias
When prompted with "The [occupation] was a ___", GPT-3 associated 83% of 388 tested occupations with male gender identifiers. High-education roles (legislator, professor, banker) and physically demanding roles (mason, sheriff) skewed heavily male. Roles skewing female included midwife, nurse, receptionist, and housekeeper — which closely track real-world stereotypes in the training data.
Interestingly, when we added "competent" as a modifier, the male lean increased for most occupations. The average occupation bias metric (the log ratio of female- to male-identifier probability, averaged over occupations) shifted from −1.11 with a neutral prompt to −2.14 with the "competent" prompt. Adding "incompetent" returned the value to roughly the neutral-prompt level. The model amplifies existing societal associations rather than correcting for them.
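The metric itself is easy to state: average, over occupations, the log ratio of the probability of a female identifier to that of a male identifier. A sketch with made-up placeholder probabilities (not measured values):

```python
import math

# Average occupation bias: mean log ratio of female- to male-identifier
# probability. Negative values indicate a male lean; zero is neutral.
def occupation_bias(probs: list[tuple[float, float]]) -> float:
    return sum(math.log(p_f / p_m) for p_f, p_m in probs) / len(probs)

# Hypothetical (P_female, P_male) pairs for two occupations.
neutral = [(0.10, 0.30), (0.20, 0.25)]
print(occupation_bias(neutral))  # negative => male lean
```

On this scale the reported shift from −1.11 to −2.14 means the "competent" prompt roughly tripled the female-to-male probability gap in log space.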
Race and Religion
When asked to complete text beginning with "The [race] people are known for...", GPT-3's completions vary systematically by racial group in ways that reflect and in some cases amplify stereotypes present in training corpora. For religion, sentiment analysis of completions about different faiths shows that Christianity and Judaism receive more positive sentiment on average; Islam receives less positive sentiment and is associated more frequently with words relating to violence.
Misuse Risk
Any activity relying on generating high-quality text at scale — misinformation, phishing, fake reviews, academic fraud, social engineering — becomes easier as text generation quality improves. The 52% human-detection result we discuss above is a concrete milestone in that direction. We monitor forums where malicious actors discuss AI tools and found limited evidence of active exploitation as of 2020, but we assess that sufficiently reliable, steerable generation would increase this risk substantially.
Energy Cost
Training GPT-3 consumed several thousand petaflop/s-days of compute — roughly 100× more than GPT-2. This is energy-intensive and contributes to the environmental costs of large-scale ML. We note that inference is comparatively efficient: generating 100 pages of text from the trained model costs on the order of 0.4 kWh. But we cannot fully amortize training cost across all future uses, and the field needs better accounting for these costs as models continue to grow.
What This Opens Up
GPT-3 is a proof of concept for in-context learning at scale — but it also clearly marks where pure autoregressive self-supervised pretraining runs into fundamental walls.
The most obvious next question is: does scaling continue to help? The power-law trend in validation loss held for two additional orders of magnitude, and downstream task performance largely tracks this. There's no obvious bend in the curve suggesting we've hit a ceiling. But raw scaling is unlikely to be the whole answer.
The Bidirectionality Gap
GPT-3's struggles on NLI, WiC, and other comparison tasks suggest that unidirectional left-to-right modeling has structural limitations. A bidirectional model at the scale of GPT-3 — or a hybrid that combines bidirectional encoding with generative decoding — could potentially achieve the "best of both worlds": strong few-shot and zero-shot generalization, plus the pairwise comparison abilities where GPT-3 underperforms. This remains an important open direction.
Grounding and Other Modalities
Language models are not grounded in the physical world. GPT-3 struggles with intuitive physics — "If I put cheese in the fridge, will it melt?" is a question type where it fails despite surface fluency. Real understanding of the world requires more than text: video, embodied interaction, sensorimotor feedback. Extending the pretraining paradigm to include images (as we later explored with CLIP and DALL·E) or other modalities seems like a natural and necessary next step.
The Limits of Prediction as an Objective
Our training objective weights every token equally — predicting "the" carries the same loss as predicting a key named entity. This is a crude objective for a system that should ultimately understand what matters. More targeted objectives, human feedback, or reinforcement learning could help the model focus its capacity on what's actually meaningful. The direction of learning the objective from human feedback — which the field later pursued aggressively — was already visible as a promising path from where we stood in 2020.
Sample Efficiency
GPT-3 still sees far more text during pretraining than a human encounters in a lifetime. Its in-context sample efficiency (learning from 10–100 examples at inference time) is impressive. Its pretraining sample efficiency is not. Closing this gap — achieving human-like learning from the amount of data a human actually encounters — requires fundamentally new ideas, possibly involving richer world models, causal reasoning, or persistent memory.
Citation & Paper Link
The full paper, including all experimental details, appendices, and 500 uncurated sample completions:
arxiv.org/abs/2005.14165 — Language Models are Few-Shot Learners