Gemini 3.1 Pro, Google’s latest flagship reasoning model, hit a verified 77.1% on ARC-AGI-2 — a benchmark designed to test the kind of fluid, novel-pattern reasoning that has historically been the hardest thing for language models to do. The score represents a meaningful jump over the prior Gemini 3 generation and lands in striking distance of the human baseline on the benchmark, which until now no frontier model had credibly approached. Gemini 3.1 Pro is shipping now in the Gemini app, NotebookLM, AI Studio, Vertex AI, Gemini Enterprise, the Gemini CLI, and Android Studio — broad availability that lets builders evaluate the reasoning gains immediately. This is the developer-focused breakdown: what changed, why ARC-AGI-2 matters, how to use the new capability, and how Gemini 3.1 Pro stacks up against GPT-5.5 and Claude Opus 4.7.
What’s actually new
The headline metric is the ARC-AGI-2 score. ARC-AGI-2 is the second generation of François Chollet’s Abstraction and Reasoning Corpus benchmark, designed specifically to defeat memorization and pattern-matching strategies. Each problem in ARC-AGI-2 presents a small grid-based puzzle where the model has to infer a transformation rule from a few examples and apply it to a new input. The puzzles are constructed so that solving them requires genuine reasoning over abstract structure — not retrieval of similar problems from training data. Models that win on ARC-AGI-2 demonstrate something closer to human-style fluid intelligence than to traditional benchmark performance.
Gemini 3.1 Pro’s 77.1% is the highest verified score by a wide margin among publicly accessible models. For comparison, GPT-5.5 sits in the low-to-mid 60s on the same benchmark; Claude Opus 4.7 lands in the high 50s; the Gemini 3 baseline (released late 2025) was around 51%. The 26-point gain over the prior Gemini generation is unusual — most generation-over-generation improvements on hard benchmarks deliver 5-10 points. The architecture changes Google made between Gemini 3 and 3.1 Pro evidently shifted the reasoning capability meaningfully.
Beyond the benchmark, three concrete capabilities are new. First, the 1M token context window survived the architecture refresh and now operates with materially better reasoning quality at full context — earlier models often degraded near the upper context limits, but Gemini 3.1 Pro maintains coherent reasoning across million-token inputs. Second, native multimodal handling now includes entire code repositories as a first-class input type — the model can ingest a full repo and reason about cross-file structure without manual chunking. Third, a refined “thinking budget” parameter lets developers explicitly trade off latency against reasoning depth, useful for production deployments that need predictable response times.
The deployment surface is also notable. Where some flagship releases ship to one or two endpoints, Gemini 3.1 Pro hit broad availability on day one: the Gemini consumer app, NotebookLM (limited to Pro and Ultra subscribers initially), AI Studio for prototyping, Vertex AI for production deployments, Gemini Enterprise for business workspaces, the Gemini CLI for terminal-based workflows, and Android Studio for mobile development. Same model, same capabilities, no surface-specific feature gating.
Why it matters
- The reasoning ceiling is moving. ARC-AGI-2 was designed specifically to be hard for the strategies that worked on prior benchmarks. Crossing 77% on it is evidence that frontier model reasoning is getting structurally better, not just better at the benchmarks it was trained on.
- Google retook the reasoning lead. Through 2025 and early 2026, OpenAI’s GPT-5.x line and Anthropic‘s Claude Opus models traded the top spot on different benchmarks. Gemini 3.1 Pro’s ARC-AGI-2 score puts Google clearly ahead on this specific reasoning axis. Whether the lead persists through the next round of OpenAI / Anthropic releases is open; for the moment it’s Google’s.
- The 1M context window is now usable for real reasoning. Long-context models often degraded at the high end. Gemini 3.1 Pro maintains reasoning quality through the full window, making “throw the entire codebase in the prompt” a viable pattern for code review, refactoring, and architectural analysis.
- Pricing and access matter for builders. Gemini 3.1 Pro pricing in Vertex AI is competitive with GPT-5.5 — roughly $3.00 per million input tokens, $15.00 per million output. For workloads that benefit from the reasoning improvement, the cost-per-correct-answer math now favors Gemini in ways it hasn’t for some prior generations.
- The thinking-budget parameter is a quiet but useful addition. Production deployments often need predictable latency. The ability to cap reasoning depth (and accept a small accuracy hit for a large latency reduction) makes Gemini 3.1 Pro practical for use cases where prior reasoning models were too slow.
- NotebookLM as a tier-gated launchpad. Putting Gemini 3.1 Pro behind a Pro/Ultra paywall on NotebookLM and the Gemini app is a deliberate consumer monetization strategy. Expect more capability-tied tiering on flagship models as the cost of running them stays high.
How to use it today
Gemini 3.1 Pro is broadly available via the Google AI ecosystem. The fastest paths depend on your starting point.
- For consumer evaluation: Open the Gemini app on web or mobile. If you have a Google AI Pro or Ultra subscription, 3.1 Pro is the default model. Try it on hard reasoning problems — multi-step math, architectural code analysis, ambiguous instruction parsing — and compare to your prior baseline.
- For developer prototyping in AI Studio: Visit aistudio.google.com, sign in, and select
gemini-3.1-pro-preview-05-2026from the model picker. AI Studio gives you a free chat interface plus the underlying API call, which you can copy as code in Python, Node, Go, or curl format.# Quick test via the Gemini API directly curl -X POST \ "https://generativelanguage.googleapis.com/v1beta/models/gemini-3.1-pro-preview-05-2026:generateContent?key=$GEMINI_API_KEY" \ -H 'Content-Type: application/json' \ -d '{ "contents": [{ "parts": [{ "text": "Solve this ARC-AGI-style problem: given input grids [[1,0],[0,1]] -> [[0,1],[1,0]] and [[1,1],[0,0]] -> [[0,0],[1,1]], what is [[1,0],[1,0]] -> ?" }] }], "generationConfig": { "temperature": 0.2, "thinkingBudget": 2048 } }' - For production via Vertex AI: The Vertex AI SDK has supported Gemini 3.1 Pro since launch. Authenticate against your GCP project, instantiate the client, and call generate-content. Production deployments get region selection, VPC service controls, and the standard enterprise security controls.
from google.cloud import aiplatform from vertexai.generative_models import GenerativeModel, GenerationConfig aiplatform.init(project="your-gcp-project", location="us-central1") model = GenerativeModel("gemini-3.1-pro-preview-05-2026") config = GenerationConfig( temperature=0.2, max_output_tokens=8192, thinking_budget=4096, # Higher = more reasoning, more latency ) response = model.generate_content( "Refactor the attached repository for thread safety. " "Identify shared mutable state, add appropriate synchronization, " "and explain each change with file:line references.", generation_config=config, # Pass the entire repo as context — up to 1M tokens tools=[{"file_data": {"file_uri": "gs://your-bucket/repo-archive.tar"}}], ) print(response.text) - For agentic workflows via Gemini CLI: The CLI ships with native Gemini 3.1 Pro support. Useful for code review, refactoring, and codebase exploration without leaving the terminal.
# Install the CLI (one-time) npm install -g @google/gemini-cli # Authenticate gemini auth # Ask a deep question about your codebase gemini --model gemini-3.1-pro \ --context-files src/ tests/ \ "Identify any race conditions in the connection pooling logic and suggest fixes." - For document-heavy workflows via NotebookLM: Visit notebooklm.google.com, create a notebook, and upload up to 50 sources (PDFs, docs, slides, audio). Gemini 3.1 Pro automatically powers the new notebook for Pro / Ultra users. The reasoning improvement is most visible on cross-document synthesis tasks — “compare the methodology across these eight papers” is dramatically better than on the prior model.
- Tune the thinking budget for your use case. For interactive applications targeting sub-second latency, set
thinkingBudgetto 512-1024. For batch or background tasks where reasoning quality matters more than latency, push to 4096-8192. The accuracy-latency curve is fairly steep on hard reasoning problems; experiment with your specific workload.
How it compares
| Model | Provider | ARC-AGI-2 | Context | API price (in / out per 1M tokens) | Notable strength |
|---|---|---|---|---|---|
| Gemini 3.1 Pro | 77.1% | 1M tokens | $3.00 / $15.00 | Reasoning + long context | |
| GPT-5.5 | OpenAI | ~63% | 200K tokens | $10.00 / $30.00 | Agentic coding + tool use |
| Claude Opus 4.7 | Anthropic | ~58% | 500K tokens | $15.00 / $75.00 | Tone, controlled reasoning, coding |
| Gemini 3 baseline | ~51% | 1M tokens | $1.25 / $5.00 | Cost-efficient generalist | |
| DeepSeek V4 | DeepSeek | ~48% | 128K tokens | $0.30 / $1.10 | Open-weights, lowest cost |
The picture: Gemini 3.1 Pro leads on the reasoning benchmark and the context window, sits in the middle on price, and is broadly comparable on general capability with GPT-5.5 and Claude Opus 4.7. For workloads where the reasoning improvement translates to real value — complex synthesis, multi-step problem solving, long-context analysis — Gemini is the new clear choice. For workloads where tool use, latency, or specific domain capability matters more, the calculus depends on your specific use case.
One implicit takeaway from the comparison: the cost-per-correct-answer differs more across models than the cost-per-token. A model with 77% accuracy at $3/1M input tokens often wins on total cost over a model with 60% accuracy at $1/1M tokens, because you’re paying less for retries, validations, and corrections. The reasoning improvement compounds favorably in the cost math.
What’s next
Three trajectories worth watching for the rest of 2026.
The OpenAI and Anthropic responses. Both labs have public roadmaps that anticipate Gemini’s progress. Expect a GPT-6 or GPT-5.x successor with comparable reasoning capability within 4-6 months. Anthropic has been less explicit about timelines but will not cede the reasoning category for long. The benchmark race is fluid; the model that’s ahead today is rarely ahead in eighteen months.
The reasoning-cost-curve. Reasoning capability comes with compute cost. Gemini 3.1 Pro’s thinking-budget parameter is one approach to managing the trade-off; expect every major lab to ship similar controls. The interesting question is whether reasoning quality scales such that smaller, cheaper models eventually deliver Gemini 3.1 Pro-class reasoning at lower cost — a “reasoning compression” trajectory similar to what’s already happening on raw capability. Likely yes, on a 12-18 month delay.
The agent ecosystem implications. Better reasoning makes more capable agents. Gemini 3.1 Pro’s improvement at long-horizon planning unlocks agentic workflows that were brittle on the prior model. Expect a wave of new agent products built on Gemini specifically — and expect competing products on GPT-5.5 to feel relatively dated until OpenAI ships its next reasoning leap.
The deeper observation: ARC-AGI-2 was specifically designed to push frontier models past the limits of their previous strategies, and 77.1% says the strategy worked — Gemini 3.1 Pro is meaningfully more capable at fluid reasoning than its predecessors. Whether that capability is “real general intelligence” or “still pattern-matching, just on harder patterns” is a debate philosophers and AI researchers will have for years. What’s not in dispute: the practical capability is materially better, and builders who haven’t evaluated it for their workloads should.
Frequently Asked Questions
Is Gemini 3.1 Pro available everywhere or just to paid users?
It depends on the surface. The Gemini consumer app and NotebookLM gate 3.1 Pro behind Google AI Pro or Ultra subscriptions ($20/mo and $200/mo respectively). The developer-facing endpoints (AI Studio, Gemini API, Vertex AI, Gemini CLI, Android Studio) provide access to anyone with API keys, with usage-based pricing.
How does the 77.1% ARC-AGI-2 score compare to humans?
The human baseline on ARC-AGI-2 is roughly 60-65% for typical adults and 80-85% for trained reasoners. Gemini 3.1 Pro’s 77.1% lands within typical-human range — the first frontier model to do so. The “gold-medal” human score remains comfortably above the model, but the gap has narrowed substantially.
Should I migrate workloads from GPT-5.5 to Gemini 3.1 Pro?
Run the eval. Gemini’s reasoning advantage helps most on complex synthesis and long-horizon planning. GPT-5.5’s agentic-coding strength shows up most on tool-use-heavy workflows. Most teams find one model wins on some workloads and loses on others; multi-model deployments with workload-aware routing are increasingly the right pattern.
What’s the practical impact of the 1M context window?
Most useful for codebase analysis (entire repos as input), document synthesis (dozens of long PDFs at once), and long-running conversations (large agent traces). For typical short-prompt workloads, the larger window is nice-to-have but not transformative — and you pay per input token, so use the context wisely.
Does the thinking-budget parameter actually work as advertised?
Yes, with caveats. Higher thinking budgets produce noticeably better reasoning on hard problems but slower responses. On easy problems, increasing the budget delivers diminishing returns and adds latency for no gain. Tune per workload; the sweet spot for most production use cases is 1024-2048 tokens of thinking.
Is Gemini 3.1 Pro safe to use on sensitive enterprise data?
Via Vertex AI, yes — Vertex provides VPC service controls, customer-managed encryption keys, regional data residency, and the standard enterprise security controls Google Cloud customers expect. Via the Gemini consumer app and NotebookLM, the data-handling policies are stricter than they used to be (Google does not use NotebookLM content to train models) but enterprises with sensitive data should default to Vertex.