OpenAI Research · 2020

Language Models are Few-Shot Learners

Originally published May 2020 · This post: an author's reflection
TL;DR

We trained a 175-billion-parameter language model — GPT-3 — and showed that by simply presenting it with a handful of task examples in plain text, it can perform competitively across dozens of NLP benchmarks without any gradient updates or fine-tuning whatsoever.

The surprise was not just the performance but the flexibility: the same frozen model that answers trivia questions can translate French, solve arithmetic, and write convincing news articles, all driven purely by the examples you show it in the prompt.

Perhaps most unsettling: human evaluators could distinguish GPT-3-generated news articles from real ones only 52% of the time — barely better than chance.

The Problem We Were Trying to Solve

Fine-tuning a new model for every new task is an unsustainable paradigm — and it quietly bakes in a kind of brittleness we rarely talk about.

By 2020, the dominant recipe for NLP was: pretrain a large model on raw text, then fine-tune it on thousands or tens of thousands of labeled examples for your specific task. BERT, RoBERTa, T5 — all brilliant systems, all following this template. And the results were undeniably good. But the approach carries hidden costs that compound as you scale.

First, labeled data is expensive. To apply a language model to a new task — say, detecting sarcasm in social media posts, or classifying contract clauses — you need a fresh dataset. That has to be collected, annotated, quality-checked, and maintained. For many useful tasks, this is simply infeasible.

Second, and more subtle: a model fine-tuned on a narrow distribution tends to exploit spurious correlations in that distribution. The model learns shortcuts specific to the benchmark dataset rather than the underlying skill. This inflates scores in ways that don't generalize. A model scoring at "human level" on a dataset is not necessarily performing at human level on the underlying task in the wild.

🤔 The deeper question: Humans can learn a new language task from a single instruction, or a couple of examples shown to us by a colleague. We don't retrain our brains. Why should an AI system need thousands of examples to do the same?

This adaptability — what we call meta-learning — is something humans deploy constantly. When I ask a friend to "tell me if this sentence sounds too formal," they don't need fifty annotated examples to understand the task. They recognize it almost instantly from context. We wanted our language models to do the same.

"When we built GPT-2, we started noticing something interesting in the outputs: the model seemed to recognize tasks from their structure — answering questions in a Q&A format, continuing code when shown a snippet. It wasn't doing this reliably, but the seeds were there. GPT-3 is us asking: what happens if we take those seeds and scale everything by 100x?"

The Core Idea: In-Context Learning

No gradient updates. No weight changes. Pure forward-pass adaptation — the model reads your examples and figures out the task on the fly.

The key insight is deceptively simple. A language model is trained to predict the next token in a sequence. If you carefully construct your input sequence to contain examples of a task, the model can recognize the pattern and continue it. We formalized this into three distinct settings:

Zero-shot: Give the model only a natural language instruction. No examples at all.

One-shot: Give the model one input-output demonstration, then the new query.

Few-shot: Give the model K demonstrations — typically 10 to 100, constrained by the 2048-token context window — before the new query.

Here's the concrete translation example we use in the paper. Imagine you want to translate English into French.

Few-Shot Prompt (English → French)
```
sea otter => loutre de mer
peppermint => menthe poivrée
plush girafe => girafe peluche
cheese =>
```

The model reads those three input→output pairs, recognizes the translation pattern, and completes the final one correctly: fromage. No fine-tuning. No gradient step. Just pattern completion over a forward pass.
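Mechanically, a few-shot prompt is nothing more than string concatenation. A minimal sketch of building one in the paper's `input => output` format (the helper name `few_shot_prompt` is ours; the resulting string would be fed to whatever inference endpoint you use):

```python
# Build a few-shot translation prompt in the "input => output" format.
examples = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
    ("plush girafe", "girafe peluche"),
]

def few_shot_prompt(examples, query):
    """Concatenate K demonstrations, then the query with a trailing separator.

    The model is expected to continue the pattern, i.e. emit the translation
    of `query` as its next tokens. No weights are updated at any point.
    """
    lines = [f"{src} => {tgt}" for src, tgt in examples]
    lines.append(f"{query} =>")
    return "\n".join(lines)

prompt = few_shot_prompt(examples, "cheese")
print(prompt)
```

Everything about the "task" lives in this string; swapping the demonstration pairs changes what the same frozen model does.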

A useful analogy: imagine hiring a chef. Fine-tuning is like enrolling them in a six-week cooking school tailored to your specific cuisine. Few-shot learning is like handing them a recipe card at the kitchen door. The chef already knows how to cook — you're just specifying what you want today.

💡 The important nuance: These "learning curves" involve zero gradient updates. GPT-3's weights never change. The adaptation happens entirely within the forward pass as the model processes the context window — a fundamentally different mechanism from fine-tuning.

Whether the model is truly "learning" the task from scratch in context, or recognizing a pattern it encountered during pretraining, is actually an open and fascinating question we discuss later. But either way, the practical behavior is the same: show it examples, it gets better at the task.

How GPT-3 Works

Same architecture as GPT-2 — a decoder-only Transformer — but scaled by roughly two orders of magnitude, to 175 billion parameters.

The architecture uses alternating dense and locally banded sparse attention patterns (inspired by the Sparse Transformer), with pre-layer normalization and the same BPE tokenizer as GPT-2. We trained eight model sizes, from 125M parameters up to 175B, to study how capability scales with compute.

Model Size Ladder

| Model | Parameters | Layers | d_model | Attention Heads | Batch Size |
|---|---|---|---|---|---|
| GPT-3 Small | 125M | 12 | 768 | 12 | 0.5M tokens |
| GPT-3 Medium | 350M | 24 | 1,024 | 16 | 0.5M tokens |
| GPT-3 Large | 760M | 24 | 1,536 | 16 | 0.5M tokens |
| GPT-3 XL | 1.3B | 24 | 2,048 | 24 | 1M tokens |
| GPT-3 2.7B | 2.7B | 32 | 2,560 | 32 | 1M tokens |
| GPT-3 6.7B | 6.7B | 32 | 4,096 | 32 | 2M tokens |
| GPT-3 13B | 13B | 40 | 5,140 | 40 | 2M tokens |
| GPT-3 175B ★ | 175B | 96 | 12,288 | 96 | 3.2M tokens |

All models trained on 300 billion tokens total, on V100 GPUs across a high-bandwidth cluster provided by Microsoft. The context window is 2,048 tokens for all sizes.
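The parameter counts in the ladder can be roughly reproduced from the layer count and hidden size alone, using standard back-of-envelope Transformer accounting (not a formula from the paper): each block contributes about 12·d_model² weights, and the embedding tables add the rest. The vocabulary and context sizes below are the GPT-2/GPT-3 values.

```python
def approx_params(n_layers, d_model, n_vocab=50257, n_ctx=2048):
    """Rough decoder-only Transformer parameter count.

    Per block: ~4*d^2 for attention (Q, K, V, output projections)
    plus ~8*d^2 for the 4x-wide MLP, i.e. ~12*d^2 per layer.
    Embeddings: token table (n_vocab * d) + learned positions (n_ctx * d).
    Biases and layer norms are omitted (a sub-percent effect at this scale).
    """
    per_layer = 12 * d_model ** 2
    embeddings = (n_vocab + n_ctx) * d_model
    return n_layers * per_layer + embeddings

print(f"GPT-3 Small: {approx_params(12, 768) / 1e6:.0f}M")    # ~125M
print(f"GPT-3 175B:  {approx_params(96, 12288) / 1e9:.0f}B")  # ~175B
```

The estimate lands within about 1% of the table for both the smallest and largest models, which is a useful sanity check when reading scaling papers.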

Training Data Mix

We trained primarily on a filtered version of Common Crawl — an enormous web scrape — but raw Common Crawl has quality problems. So we filtered it against high-quality reference corpora, deduplicated aggressively at the document level, and blended in curated sources. Importantly, we intentionally oversampled higher-quality data rather than sampling proportionally to raw size.

| Dataset | Tokens | Weight in Training Mix | Epochs During Training |
|---|---|---|---|
| Common Crawl (filtered) | 410B | 60% | 0.44× |
| WebText2 | 19B | 22% | 2.9× |
| Books1 | 12B | 8% | 1.9× |
| Books2 | 55B | 8% | 0.43× |
| Wikipedia | 3B | 3% | 3.4× |

Notice that WebText2 — despite being only 19B tokens — carries 22% of the training weight. We see roughly three passes over it during training. This reflects a deliberate quality-over-quantity philosophy. Wikipedia, similarly, is tiny in raw tokens but gets 3.4 full passes. The model is learning more per token from these curated sources.
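The "epochs" column is just arithmetic on the other two: with 300B training tokens total, a dataset with sampling weight w and size T tokens is traversed roughly 300B·w / T times. A quick reconstruction from the table's figures (the published weights and token counts are rounded, so some entries only match the printed column approximately):

```python
TOTAL_TOKENS = 300e9  # total tokens seen during training

# (dataset, tokens in dataset, sampling weight) from the mix table above
mix = [
    ("Common Crawl (filtered)", 410e9, 0.60),
    ("WebText2",                 19e9, 0.22),
    ("Books1",                   12e9, 0.08),
    ("Books2",                   55e9, 0.08),
    ("Wikipedia",                 3e9, 0.03),
]

# epochs ~ (share of the 300B token budget) / (dataset size)
epochs = {name: TOTAL_TOKENS * w / tokens for name, tokens, w in mix}
for name, e in epochs.items():
    print(f"{name:<24} ~{e:.2f} passes")
```

The qualitative picture is the point: the huge Common Crawl slice is never fully traversed, while the small curated sources are seen multiple times.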

📐 On scaling laws: Kaplan et al. (2020) showed that validation loss follows a smooth power law as a function of compute — and this trend continues for GPT-3, extending the curve by two more orders of magnitude with only slight deviation. This gave us confidence that more compute = reliably better model.
Figure: Smooth power-law scaling of compute vs. validation loss. Training curves across all 8 model sizes show that cross-entropy loss continues to decrease predictably with compute, following a power-law trend first identified in the scaling laws literature. GPT-3 extends this curve two orders of magnitude beyond prior work.
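A power law of this kind has the form L(C) ≈ a·C^(−α), a straight line on log-log axes. The sketch below uses constants in the vicinity of the fit reported with this figure (treat `a` and `alpha` as illustrative, not authoritative):

```python
def predicted_loss(compute_pfdays, a=2.57, alpha=0.048):
    """Power-law fit of validation loss vs. training compute.

    loss ~ a * C^(-alpha), with C in petaflop/s-days. The constants
    approximate the published fit and should be treated as illustrative.
    """
    return a * compute_pfdays ** -alpha

# Each order of magnitude of compute buys a predictable loss reduction:
for c in (10.0, 100.0, 1000.0, 3640.0):  # ~3640 PF-days is the 175B training run
    print(f"{c:>7.0f} PF-days -> predicted loss ~{predicted_loss(c):.2f}")
```

The practical force of such a fit is planning: if the curve holds, you can budget compute for a target loss before training a single model.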
"One of the most striking things about running these experiments was watching the in-context learning curves differ by model size. Small models would flat-line as you added more examples in context. GPT-3 175B would keep climbing. The gap between few-shot and zero-shot performance grew dramatically with scale — a pattern that held across almost every task we tested."

What We Found

GPT-3 achieves state-of-the-art on several benchmarks without any fine-tuning — results that should not have been possible under conventional wisdom about large language models.

LAMBADA: The +18% Shock

LAMBADA is a cloze task (fill-in-the-blank): predict the last word of a sentence that requires reading a full paragraph of context. Prior state-of-the-art, held by the 17B-parameter Turing-NLG model, was 68.0% accuracy. We expected incremental gains.

💡 86.4% few-shot on LAMBADA — an 18+ percentage point improvement over the previous state of the art. This one genuinely caught us off guard. Even GPT-3 2.7B in few-shot mode already surpasses the 17B-parameter SOTA.
| Setting | LAMBADA Accuracy | LAMBADA Perplexity |
|---|---|---|
| Previous SOTA (Turing-NLG 17B) | 68.0% | 8.63 |
| GPT-3 Zero-Shot | 76.2% | 3.00 |
| GPT-3 One-Shot | 72.5% | 3.35 |
| GPT-3 Few-Shot | 86.4% | 1.92 |

The key here is that the few-shot format let us present the task to the model explicitly as fill-in-the-blank rather than as open-ended continuation. A standard language model has no way of knowing from the training objective alone that exactly one word is wanted; the demonstrations communicate that constraint. The few-shot format was doing real work.
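Concretely, the fill-in-the-blank framing assembles passages with an explicit blank and a one-word completion (demonstration text in the style of the paper's example; the `->` separator is our rendering):

```python
# LAMBADA few-shot framing: the demonstrations make explicit that exactly
# one word fills the blank, which the raw LM objective never signals.
demos = [
    ("Alice was friends with Bob. Alice went to visit her friend ____.", "Bob"),
]
query = "George bought some baseball equipment, a ball, a glove, and a ____."

prompt = "\n".join(f"{passage} -> {answer}" for passage, answer in demos)
prompt += f"\n{query} ->"
print(prompt)
```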

TriviaQA: Matching Fine-Tuned Retrieval Systems (Closed-Book)

TriviaQA tests factual question answering — no external documents allowed. GPT-3 must answer entirely from knowledge stored in its parameters. The comparison is stark:

| System | NaturalQS | WebQS | TriviaQA |
|---|---|---|---|
| RAG (Fine-tuned + Retrieval) | 44.5 | 45.5 | 68.0 |
| T5-11B + SSM (Closed-Book, Fine-tuned) | 36.6 | 44.7 | 60.5 |
| T5-11B (Fine-tuned, Closed-Book) | 34.5 | 37.4 | 50.1 |
| GPT-3 Zero-Shot | 14.6 | 14.4 | 64.3 |
| GPT-3 One-Shot | 23.0 | 25.3 | 68.0 |
| GPT-3 Few-Shot | 29.9 | 41.5 | 71.2 |
💡 TriviaQA one-shot: 68.0% — matching RAG, a system that fine-tunes on the dataset AND retrieves over a 21-million-document dense vector index. GPT-3 does this with one example, no retrieval, and no fine-tuning.

Arithmetic From Scratch: An Emergent Ability

We hadn't designed GPT-3 to do arithmetic. It wasn't in our objectives. So what happened when we just asked it, in natural language, "Q: What is 48 plus 76?" was genuinely surprising.

💡 100% accuracy on 2-digit addition, 98.9% on 2-digit subtraction — in few-shot. Performance holds at 80%+ for 3-digit addition. These results are not memorization: spot-checks show only 0.8% of the test problems appear anywhere in training data.

The performance drop-off with digit count is informative. At 4 digits the model achieves around 25%, and at 5 digits around 9-10%. It's clearly struggling to carry digits reliably — it makes the kinds of mistakes a student makes when they forget to "carry the 1." But it's actually trying to compute, not pattern-match to memorized answers. That's a qualitatively different kind of failure than random guessing.
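An evaluation harness for this kind of claim is itself simple: generate fresh problems, format them in the Q/A style above, and score exact-match answers. A sketch (the `ask_model` stub stands in for a real inference call; here it computes the true answer so the harness runs end to end):

```python
import random

def make_prompt(a, b):
    """Format an addition problem in the natural-language Q/A style."""
    return f"Q: What is {a} plus {b}?\nA:"

def ask_model(prompt):
    """Stub standing in for a real model call. It parses the operands back
    out of the prompt and returns the true sum, so the harness is testable."""
    words = prompt.split()
    a, b = int(words[3]), int(words[5].rstrip("?"))
    return str(a + b)

def accuracy(n_problems=100, digits=2, seed=0):
    rng = random.Random(seed)
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    correct = 0
    for _ in range(n_problems):
        a, b = rng.randint(lo, hi), rng.randint(lo, hi)
        answer = ask_model(make_prompt(a, b)).strip()
        correct += (answer == str(a + b))  # exact-match scoring
    return correct / n_problems

print(f"2-digit addition (stub model): {accuracy():.0%}")
```

Because the problems are freshly sampled, a harness like this is exactly what makes the memorization spot-check possible: you can search the training data for each generated problem string.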

Translation: Strong Into English, Weaker Out of English

GPT-3's training data is 93% English. Yet it achieves competitive translation performance on several language pairs — and notably surpasses previous unsupervised translation baselines when translating into English, reflecting the strength of its English language model.

| System | En→Fr | Fr→En | En→De | De→En | En→Ro | Ro→En |
|---|---|---|---|---|---|---|
| Supervised SOTA | 45.6 | 35.0 | 41.2 | 40.2 | 38.5 | 39.9 |
| MASS (Unsupervised) | 37.5 | 34.9 | 28.3 | 35.2 | 35.2 | 33.1 |
| GPT-3 Zero-Shot | 25.2 | 21.2 | 24.6 | 27.2 | 14.1 | 19.9 |
| GPT-3 One-Shot | 28.3 | 33.7 | 26.2 | 30.4 | 20.6 | 38.6 |
| GPT-3 Few-Shot | 32.6 | 39.2 | 29.7 | 40.6 | 21.0 | 39.5 |

Word Scrambling and SuperGLUE

GPT-3 achieves 71.8 on the SuperGLUE benchmark in few-shot mode — exceeding fine-tuned BERT-Large (69.0) using only 32 examples per task, with zero gradient updates. On ReCoRD it comes within striking distance of SOTA at 91.1 F1. On word scrambling tasks — where it must recover original words from anagrammed or permuted letters — we see spectacular few-shot learning curves, particularly at 175B scale where the ability seems to emerge discontinuously relative to smaller models.

Figure: In-context learning curves scale sharply with model size. On a simple symbol-removal task, the slope of the "examples in context → performance" curve is dramatically steeper for GPT-3 175B than for any smaller model. Large models don't just perform better at K=0; they learn faster from each additional example in context.

Where It Still Struggles

Natural language inference, tasks involving pairwise comparison, and certain reading comprehension formats remain stubborn weak spots — and the reasons are illuminating.

We want to be honest about this. GPT-3 does not uniformly improve on everything. There are clear failure modes, and understanding them points toward what must change architecturally or algorithmically.

Natural Language Inference (ANLI, RTE)

NLI (Natural Language Inference) asks: does sentence A imply, contradict, or remain neutral with respect to sentence B? On Adversarial NLI (ANLI), which was specifically designed to fool language models, all our models smaller than 175B perform at essentially random chance (~33%). Even GPT-3 175B only partially closes the gap from chance to SOTA — and the improvement is inconsistent across rounds.

⚠️ WiC: 49.4% few-shot — at random chance. Word-in-Context asks whether a word is used the same way in two different sentences. Despite trying many prompt formulations, GPT-3 couldn't reliably solve this. The task requires careful pairwise comparison, which seems to be a structural weakness.

Our hypothesis is architectural: GPT-3 is a left-to-right autoregressive model. Tasks that require carefully comparing two spans — "does sentence A entail sentence B," "is this word used the same way in both sentences" — may fundamentally benefit from bidirectional attention. BERT-style models, which can attend in all directions simultaneously, have a natural advantage here.

Reading Comprehension: QuAC and RACE

On QuAC (conversational QA requiring modeling of structured dialog turns), GPT-3 lands 13 F1 points below an ELMo baseline — a relatively modest 2019 model. On RACE (multiple-choice reading comprehension from Chinese middle and high school exams), it sits 45% below SOTA and competes only with early contextual representation models. These are formats that require carefully re-reading a passage and then selecting from options — again, something bidirectionality seems to help with.

Long-Document Coherence

In open-ended text generation, GPT-3 samples sometimes repeat themselves semantically at the document level, lose coherence over long passages, contradict earlier statements, or generate occasional non-sequiturs. The quality is high enough to pass human scrutiny on short passages (as the news article experiment shows), but sustained long-form generation remains a qualitatively harder problem.

"The WiC result in particular stayed with me. 49.4% — coin flip territory. We tried at least a dozen different prompt formulations. Nothing worked. That kind of clean failure, on a task that feels easy to articulate in English, tells you something structural is missing. It wasn't a prompt engineering problem."

The Human Evaluation That Unnerved Us

We asked humans to distinguish GPT-3-generated news articles from human-written ones. At the 175B scale, they could do so only 52% of the time — barely above chance.

We generated ~200-word news articles using GPT-3 at various model sizes, conditioned on real article titles and subtitles. We then asked human evaluators to rate each article: human or AI-generated? To calibrate the task, we also generated articles from a "control model" — a tiny 160M parameter model with high output randomness, deliberately producing bad output.

| Model | Mean Accuracy | 95% Confidence Interval | "I don't know" Rate |
|---|---|---|---|
| Control (deliberately bad) | 86% | 83%–90% | 3.6% |
| GPT-3 Small (125M) | 76% | 72%–80% | 4.9% |
| GPT-3 Medium (350M) | 61% | 58%–65% | 6.0% |
| GPT-3 Large (760M) | 68% | 64%–72% | 8.7% |
| GPT-3 XL (1.3B) | 62% | 59%–65% | 7.5% |
| GPT-3 2.7B | 62% | 58%–65% | 7.1% |
| GPT-3 6.7B | 60% | 56%–63% | 6.2% |
| GPT-3 13B | 55% | 52%–58% | 7.1% |
| GPT-3 175B | 52% | 49%–54% | 7.8% |
⚠️ 52% — essentially chance. The trend across model sizes is clear and monotonic: as the model grows, humans become less able to distinguish its output from real journalism. We note that individual participants frequently scored below 50%, meaning GPT-3 was outperforming them at appearing human.

The articles that tripped up evaluators most often contained subtle tells rather than obvious ones: occasional factual inaccuracies (the model has no access to what actually happened on a given date), mild repetition in argument structure, and unusual phrasing choices that weren't quite wrong but weren't quite right either. These cues were apparently too subtle for most people to detect reliably under normal reading conditions.

What does this mean? At minimum, it suggests we are approaching an inflection point in synthetic text quality — one where the usual assumption that humans can reliably detect AI-generated content no longer holds. The downstream implications for misinformation, fraud, and content authenticity are significant and require serious attention beyond what any single paper can provide.

"Running this experiment and seeing the 52% number was a genuine inflection point in how I thought about what we'd built. It's one thing to say 'the model generates high-quality text.' It's another to measure the concrete consequence of that — that the people most likely to be targeted by synthetic content can't reliably identify it."

Broader Concerns

A model this capable trained on internet-scale data inherits internet-scale biases — and its ability to generate convincing text raises misuse risks we need to address directly.

We include this section not as a disclaimer but as a genuine part of the scientific contribution. Characterizing failure modes and societal risks is as important as reporting benchmark gains.

Gender Bias

When prompted with "The [occupation] was a ___", GPT-3 associated 83% of 388 tested occupations with male gender identifiers. High-education roles (legislator, professor, banker) and physically demanding roles (mason, sheriff) skewed heavily male. Roles skewing female included midwife, nurse, receptionist, and housekeeper — which closely track real-world stereotypes in the training data.

Interestingly, when we added "competent" as a modifier, the male lean increased for most occupations. The average occupation bias metric (log ratio of female to male probability) shifted from −1.11 (neutral prompt) to −2.14 (competent prompt). Adding "incompetent" returned the value close to neutral. The model is amplifying existing societal associations rather than correcting for them.
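The bias metric here is the average, over occupations, of the log ratio of the model's probability for a female identifier versus a male identifier in the blank; zero is neutral and negative values lean male. A minimal sketch with invented probabilities (the real measurement reads these values from the model's next-token distribution):

```python
import math

def occupation_bias(p_female, p_male):
    """log(P(female identifier) / P(male identifier)).

    0 is neutral; negative values indicate a male lean."""
    return math.log(p_female / p_male)

# Hypothetical next-token probabilities for 'The [occupation] was a ___'
probs = {
    "legislator": (0.04, 0.30),
    "nurse":      (0.35, 0.10),
    "mason":      (0.02, 0.25),
}

scores = {occ: occupation_bias(pf, pm) for occ, (pf, pm) in probs.items()}
average = sum(scores.values()) / len(scores)
print(f"average bias: {average:+.2f}")  # negative => male-leaning overall
```

Because the metric is a log ratio, averaging it over occupations treats a 10× male lean and a 10× female lean symmetrically, which is why a single scalar like −1.11 is a meaningful summary.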

Race and Religion

When asked to complete text beginning with "The [race] people are known for...", GPT-3's completions vary systematically by racial group in ways that reflect and in some cases amplify stereotypes present in training corpora. For religion, sentiment analysis of completions about different faiths shows that Christianity and Judaism receive more positive sentiment on average; Islam receives less positive sentiment and is associated more frequently with words relating to violence.

⚠️ Internet-trained models have internet-scale biases. This is not a surprise in principle, but seeing it manifest so clearly in a model of this capability level is a reminder that bias amplification is not merely a technical footnote — it's a first-order concern for deployment.

Misuse Risk

Any activity relying on generating high-quality text at scale — misinformation, phishing, fake reviews, academic fraud, social engineering — becomes easier as text generation quality improves. The 52% human-detection result we discuss above is a concrete milestone in that direction. We monitor forums where malicious actors discuss AI tools and found limited evidence of active exploitation as of 2020, but we assess that sufficiently reliable, steerable generation would increase this risk substantially.

Energy Cost

Training GPT-3 consumed several thousand petaflop/s-days of compute — roughly 100× more than GPT-2. This is energy-intensive and contributes to the environmental costs of large-scale ML. We note that inference is comparatively efficient: generating 100 pages of text from the trained model costs on the order of 0.4 kWh. But we cannot fully amortize training cost across all future uses, and the field needs better accounting for these costs as models continue to grow.

What This Opens Up

GPT-3 is a proof of concept for in-context learning at scale — but it also clearly marks where pure autoregressive self-supervised pretraining runs into fundamental walls.

The most obvious next question is: does scaling continue to help? The power-law trend in validation loss held for two additional orders of magnitude, and downstream task performance largely tracks this. There's no obvious bend in the curve suggesting we've hit a ceiling. But raw scaling is unlikely to be the whole answer.

The Bidirectionality Gap

GPT-3's struggles on NLI, WiC, and other comparison tasks suggest that unidirectional left-to-right modeling has structural limitations. A bidirectional model at the scale of GPT-3 — or a hybrid that combines bidirectional encoding with generative decoding — could potentially achieve the "best of both worlds": strong few-shot and zero-shot generalization, plus the pairwise comparison abilities where GPT-3 underperforms. This remains an important open direction.

Grounding and Other Modalities

Language models are not grounded in the physical world. GPT-3 struggles with intuitive physics — "If I put cheese in the fridge, will it melt?" is a question type where it fails despite surface fluency. Real understanding of the world requires more than text: video, embodied interaction, sensorimotor feedback. Extending the pretraining paradigm to include images (as we later explored with CLIP and DALL·E) or other modalities seems like a natural and necessary next step.

🔭 Open question: Is GPT-3 "learning" tasks from in-context examples, or recognizing tasks it was implicitly exposed to during pretraining? This distinction matters deeply for understanding what few-shot learning actually is — and what its limits are. Tasks like word scrambling seem genuinely learned in-context. Translation clearly requires pretraining exposure. Most tasks probably lie on a spectrum between these extremes.

The Limits of Prediction as an Objective

Our training objective weights every token equally — predicting "the" carries the same loss as predicting a key named entity. This is a crude objective for a system that should ultimately understand what matters. More targeted objectives, human feedback, or reinforcement learning could help the model focus its capacity on what's actually meaningful. The direction of learning the objective from human feedback — which the field later pursued aggressively — was already visible as a promising path from where we stood in 2020.

Sample Efficiency

GPT-3 still sees far more text during pretraining than a human encounters in a lifetime. Its in-context sample efficiency (learning from 10–100 examples at inference time) is impressive. Its pretraining sample efficiency is not. Closing this gap — achieving human-like learning from the amount of data a human actually encounters — requires fundamentally new ideas, possibly involving richer world models, causal reasoning, or persistent memory.

"What gives me most pause is not the things GPT-3 gets wrong — those are fixable. It's the things it gets right without us fully understanding why. When a 175B parameter model does 2-digit multiplication at 29% accuracy from a handful of examples, having never been trained on arithmetic — that's telling us something deep about what is latent in language. We haven't fully unpacked it yet."

Citation & Paper Link

The full paper, including all experimental details, appendices, and 500 uncurated sample completions:

arxiv.org/abs/2005.14165 — Language Models are Few-Shot Learners

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., ... Amodei, D. (2020). Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems, 33, 1877–1901.