OpenAI Ships Three New Realtime Voice Models with GPT-5 Reasoning

OpenAI just shipped three new realtime voice models on May 7, the most consequential update to its voice stack since the original Realtime API launched in late 2024. GPT-Realtime-2 brings GPT-5-class reasoning into a streaming voice model with a 128K context window. GPT-Realtime-Translate handles live speech-to-speech translation across 70+ input languages and 13 output languages. GPT-Realtime-Whisper is a streaming transcription model built specifically for low-latency speech-to-text. The Realtime API officially exited beta with the launch, signaling that voice AI is no longer experimental at OpenAI — it’s a first-class commercial product.

What’s actually new in GPT-Realtime-2

The most important shift is reasoning. Previous OpenAI voice models — including the original GPT-4o realtime and the gpt-4o-realtime-preview that shipped in 2024 — could hold a fluid conversation but stumbled on multi-step requests. Ask the old model to “Compare the prices of the three plans, factor in the 20% discount, and recommend which one fits a five-person team,” and it would either lose the thread or produce a confidently wrong answer. GPT-Realtime-2 closes that gap. Internally, the model uses GPT-5-class reasoning during generation, including the ability to think briefly before speaking on harder requests.

The 128K context window is the second meaningful change. The previous voice model’s effective context was much shorter, which meant developers building agentic voice applications had to carefully manage memory and aggressively truncate conversation history. With 128K, you can hold 90+ minutes of conversation, full reference documents, and structured tool definitions in context simultaneously. Long-running voice agents that maintain coherent state across an hour-plus interaction are now practical without complex external memory systems.
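A window this size still has a ceiling, so a lightweight budget check before each turn can be useful for very long sessions. The sketch below is one way to do it, under stated assumptions: `estimate_tokens` uses a rough four-characters-per-token heuristic that ignores audio tokens, and `trim_history`, `CONTEXT_LIMIT`, and `RESERVED_FOR_REPLY` are hypothetical names, not part of any OpenAI SDK.

```python
from typing import Dict, List

CONTEXT_LIMIT = 128_000  # context window cited in the article
RESERVED_FOR_REPLY = 4_000  # hypothetical headroom for the model's response


def estimate_tokens(text: str) -> int:
    """Rough heuristic: about four characters per token (an assumption)."""
    return max(1, len(text) // 4)


def trim_history(
    messages: List[Dict[str, str]], limit: int = CONTEXT_LIMIT
) -> List[Dict[str, str]]:
    """Drop the oldest non-system messages until the estimate fits the budget."""
    budget = limit - RESERVED_FOR_REPLY
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(estimate_tokens(m["content"]) for m in system)
    kept: List[Dict[str, str]] = []
    # Walk backwards so the most recent turns are kept first.
    for message in reversed(rest):
        cost = estimate_tokens(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return system + list(reversed(kept))
```

System messages are always kept, and the remaining turns are dropped oldest-first. A production agent would likely replace the heuristic with real token counts reported by the API.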

GPT-Realtime-Translate is its own model, not a feature of GPT-Realtime-2. It listens to one language and speaks another in near-real-time, with sub-second latency between input and translated output. The 70-language input support covers nearly every commercially significant language; the 13-language output set focuses on the languages with the highest commercial demand — English, Spanish, Mandarin, Hindi, Arabic, Portuguese, French, German, Japanese, Korean, Italian, Russian, and Indonesian.

GPT-Realtime-Whisper is the third model and the most niche. It’s a streaming transcription model with much lower per-minute pricing than the conversational models, designed for use cases where you need accurate text from speech but don’t need the model to respond. Typical examples are live captioning, meeting transcription, voice-to-text for note-taking, and accessibility tools.

Why it matters

  • Voice AI agents become economically viable for high-volume use cases. The previous gpt-4o-realtime-preview was excellent but pricey enough that voice agents cost more than human call-center reps for many workloads. GPT-Realtime-2 maintains the quality while bringing the economics into commercial range. Customer support, voice ordering, voice-driven booking — all become deployable at scale.
  • Real-time translation is now production-grade. Until this week, real-time speech-to-speech translation required stitching together multiple models with seconds of latency. GPT-Realtime-Translate does it in a single model with sub-second delay, opening up live multilingual customer service, multilingual conferences, and accessibility applications.
  • Streaming transcription pricing collapsed. GPT-Realtime-Whisper at $0.017/minute is roughly 3-4x cheaper than the original Whisper API on a per-minute basis for streaming use cases. For applications transcribing hundreds of hours of audio per day, the cost reduction is substantial.
  • The Realtime API is generally available. Exit-of-beta means production SLAs, predictable pricing, and contractual commitments. Enterprise procurement teams that wouldn’t deploy on a beta API can now sign off on voice deployments.
  • The competitive landscape just got harder for ElevenLabs, Deepgram, and others. Many voice-AI startups built their businesses on the assumption that OpenAI’s voice models would lag the rest of its product line. With GPT-Realtime-2’s reasoning quality and the new pricing, their differentiation has narrowed substantially.
  • Translation services and call centers are on notice. Translation as a $50B+ services industry has been one of the slowest sectors to be disrupted by AI. GPT-Realtime-Translate is the model that makes the disruption look near-term rather than theoretical.

How to use GPT-Realtime-2 today

The Realtime API exits beta with a WebSocket-based interface. Here’s a minimal Python client that connects, configures a session, and streams the model’s voice response; microphone capture and audio playback are left to your own audio layer.

  1. Install the OpenAI SDK with realtime support:
    pip install --upgrade "openai[realtime]"  # the realtime extra installs the WebSocket dependency
    pip install pyaudio  # for microphone input
    
  2. Set your API key:
    export OPENAI_API_KEY=sk-...
    
  3. Open a Realtime API session with GPT-Realtime-2:
    import asyncio
    import base64
    
    from openai import AsyncOpenAI
    
    # AsyncOpenAI reads OPENAI_API_KEY from the environment.
    client = AsyncOpenAI()
    
    def play_audio_chunk(pcm_bytes: bytes) -> None:
        """Placeholder: write raw PCM16 bytes to your output device."""
        ...
    
    async def voice_session():
        async with client.beta.realtime.connect(
            model="gpt-realtime-2"
        ) as connection:
            # Configure voice, audio formats, instructions, and turn detection.
            await connection.session.update(
                session={
                    "modalities": ["audio", "text"],
                    "voice": "alloy",
                    "input_audio_format": "pcm16",
                    "output_audio_format": "pcm16",
                    "instructions": (
                        "You are a customer support agent for a travel "
                        "booking company. Greet warmly, ask qualifying "
                        "questions, and help the caller find the right "
                        "trip option."
                    ),
                    "turn_detection": {"type": "server_vad"},
                }
            )
    
            async for event in connection:
                if event.type == "response.audio.delta":
                    # Audio deltas arrive base64-encoded; decode before playback.
                    play_audio_chunk(base64.b64decode(event.delta))
                elif event.type == "response.text.delta":
                    print(event.delta, end="", flush=True)
    
    asyncio.run(voice_session())
    
  4. For live translation, switch the model and configure the target language:
    # Run inside an async function, as in step 3.
    async with client.beta.realtime.connect(
        model="gpt-realtime-translate"
    ) as connection:
        await connection.session.update(
            session={
                "modalities": ["audio"],
                "instructions": "Translate everything you hear into Spanish.",
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
            }
        )
    
  5. For streaming transcription only, use GPT-Realtime-Whisper:
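    from openai import OpenAI
    
    client = OpenAI()
    
    # Open the file in a context manager so it is closed after the request.
    with open("call_recording.wav", "rb") as audio_file:
        stream = client.audio.transcriptions.create(
            model="gpt-realtime-whisper",
            file=audio_file,
            response_format="text",
            stream=True,
        )
        for event in stream:
            # Assumes this model emits the same "transcript.text.delta" events
            # as OpenAI's other streaming transcription models.
            if event.type == "transcript.text.delta":
                print(event.delta, end="", flush=True)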
    
  6. Tool use works the same way it does in GPT-5: define tools in the session config, let the model call them mid-conversation, and handle the results inline. This is what makes GPT-Realtime-2 viable for voice agents that book appointments, look up customer records, or process orders.
    tools = [{
        "type": "function",
        "name": "get_available_flights",
        "description": "Search available flights",
        "parameters": {
            "type": "object",
            "properties": {
                "origin": {"type": "string"},
                "destination": {"type": "string"},
                "date": {"type": "string"},
            },
            "required": ["origin", "destination", "date"],
        },
    }]
    
    await connection.session.update(
        session={"tools": tools, "tool_choice": "auto"}
    )
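
Handling a tool call inline comes down to parsing the arguments the model emits, running your function, and sending the result back as a `function_call_output` conversation item. The sketch below shows that dispatch step in isolation. It assumes the payload shape described in OpenAI's Realtime API documentation at the time of writing; `get_available_flights` is a hypothetical stub, and `TOOL_REGISTRY` and `build_tool_result` are illustrative names rather than SDK functions.

```python
import json
from typing import Any, Callable, Dict


def get_available_flights(origin: str, destination: str, date: str) -> Dict[str, Any]:
    """Hypothetical stub; a real agent would query a booking backend here."""
    return {"origin": origin, "destination": destination, "date": date, "flights": []}


# Map tool names declared in the session config to local Python callables.
TOOL_REGISTRY: Dict[str, Callable[..., Any]] = {
    "get_available_flights": get_available_flights,
}


def build_tool_result(name: str, call_id: str, arguments_json: str) -> Dict[str, Any]:
    """Run the named tool and wrap its result as a conversation item payload."""
    handler = TOOL_REGISTRY.get(name)
    if handler is None:
        output: Any = {"error": f"unknown tool: {name}"}
    else:
        try:
            output = handler(**json.loads(arguments_json))
        except (TypeError, ValueError) as exc:
            # Malformed JSON or wrong arguments: report back instead of crashing.
            output = {"error": str(exc)}
    return {
        "type": "function_call_output",
        "call_id": call_id,
        "output": json.dumps(output),
    }
```

In the event loop, this would typically run when the model finishes emitting a function call. The returned item is then sent back over the connection and a new response is requested. The exact event and method names depend on the SDK version you have installed, so check them against the current reference.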
    

How it compares

Here’s how the three new GPT-Realtime models stack up against each other and against the closest competition.

Model | Provider | Use case | Latency | Pricing
GPT-Realtime-2 | OpenAI | Conversational voice agent with reasoning | ~300-500ms | $32/M input, $64/M output audio tokens
GPT-Realtime-Translate | OpenAI | Live speech-to-speech translation | ~600ms | $0.034/min
GPT-Realtime-Whisper | OpenAI | Streaming transcription only | ~200ms | $0.017/min
Gemini 2.5 Flash Native Audio | Google | Conversational voice agent | ~400ms | $0.30 per 1M input chars (text-equiv)
ElevenLabs Conversational AI | ElevenLabs | Voice agent with custom voice | ~400ms | $0.08-0.30/min depending on tier
Deepgram Nova-3 + Aura-2 | Deepgram | STT + TTS pipeline | ~250ms STT + 150ms TTS | $0.0043/min STT, $0.018/1K chars TTS
AssemblyAI Universal-Streaming | AssemblyAI | Streaming transcription only | ~300ms | $0.025/min

For pure transcription, Deepgram remains cheapest per minute and AssemblyAI competes on accuracy. GPT-Realtime-Whisper sits in the middle on both axes, with the advantage that it lives in the same ecosystem as the rest of the OpenAI stack — which matters operationally.
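
To see what those per-minute rates mean at volume, the arithmetic can be sketched as follows. The rates are the ones quoted in the comparison table; the 200 hours/day workload is a hypothetical figure chosen only for illustration.

```python
# Per-minute streaming transcription rates quoted in the comparison table (USD).
RATES_PER_MINUTE = {
    "Deepgram Nova-3": 0.0043,
    "GPT-Realtime-Whisper": 0.017,
    "AssemblyAI Universal-Streaming": 0.025,
}


def daily_cost(rate_per_minute: float, hours_per_day: float) -> float:
    """Cost in USD of transcribing `hours_per_day` hours of audio."""
    return rate_per_minute * hours_per_day * 60


if __name__ == "__main__":
    hours = 200  # hypothetical workload
    # Print providers from cheapest to most expensive.
    for name, rate in sorted(RATES_PER_MINUTE.items(), key=lambda kv: kv[1]):
        print(f"{name}: ${daily_cost(rate, hours):,.2f}/day")
```

At that volume the quoted rates work out to roughly $52, $204, and $300 per day respectively, before any volume discounts.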

For conversational voice agents with reasoning, GPT-Realtime-2 versus Gemini 2.5 Flash Native Audio will be the main comparison for the next 6-12 months. Both deliver competitive latency and quality. OpenAI’s advantage is the broader ecosystem (function calling, tool use, larger context, well-known SDK). Google’s advantage is multimodal native handling and tight Workspace integration.

For voice cloning and custom voice — where you want the agent to sound like a specific person — ElevenLabs still leads. OpenAI’s voice options are limited to its eight built-in voices.

What’s next

Three threads will play out as the new GPT-Realtime stack moves from launch to production.

First, aggressive price competition. Anthropic, Google, and Meta have voice models in various stages of development. The next 90 days will likely see at least one major price cut on conversational voice as competitors respond to OpenAI’s positioning. Expect $20/M input audio tokens to become the new floor by late summer.

Second, specialized voice agent platforms. With GPT-Realtime-2’s reasoning capability, building a voice agent that handles a specific vertical — healthcare scheduling, real estate showings, restaurant reservations — becomes a 1-week project rather than a 6-month engineering investment. Expect a wave of vertical voice-agent SaaS products through Q3 2026.

Third, regulatory and trust pressure. The combination of human-quality voice, multilingual translation, and reasoning-capable agents is a powerful enabler for legitimate uses and a significant risk surface for deepfakes and fraud. The FTC, FCC, and state attorneys general have all signaled interest in voice-AI disclosure rules. Expect compliance frameworks around voice AI to formalize over the next 12-18 months, and build your deployments with disclosure baked in from day one.

Frequently Asked Questions

Can GPT-Realtime-2 replace a human call-center agent?

For routine inbound and outbound calls (qualification, scheduling, simple support), yes, with appropriate scoping and human escalation paths. For complex, judgment-heavy, or emotionally sensitive calls, no. The 2026 production pattern is hybrid: voice AI handles the first 80% of routine traffic and escalates to humans for the rest. Companies running this pattern report 50-70% reductions in cost-per-call without measurable degradation in customer satisfaction, provided the escalation paths are clean.

What’s the difference between GPT-Realtime-2 and the original gpt-4o-realtime model?

Three main differences. First, reasoning: GPT-Realtime-2 has GPT-5-class reasoning, which the older model lacked. Second, context window: 128K vs ~32K. Third, the older model is on a deprecation track and will be retired in Q3 2026 per OpenAI’s announcement. New deployments should target GPT-Realtime-2.

Is GPT-Realtime-Translate good enough to replace human translators?

For real-time spoken communication, yes for most use cases. For high-stakes translation (legal, medical with safety implications, formal diplomacy), no. The 2026 pattern: AI translation handles real-time conversational scenarios; human translators review and certify formal written work. The two markets are increasingly separate.

Can I use these models for HIPAA-compliant healthcare applications?

Through OpenAI’s standard Realtime API, no — OpenAI doesn’t sign BAAs (Business Associate Agreements) for the standard API. For HIPAA-compliant deployment, route through Microsoft Azure OpenAI Service, which offers BAA terms. The Azure rollout of GPT-Realtime-2 is expected within 4-6 weeks per Microsoft’s typical lag.

What’s the latency over typical residential internet?

The 300-500ms voice-to-voice latency cited above is measured under good network conditions. On residential cable or fiber, expect 400-700ms total. On mobile networks, expect 500-1000ms. Latency above ~700ms starts to feel awkward in conversation. If you’re building for mobile-heavy users, test extensively with real network conditions and consider falling back to text-only mode when network quality is poor.
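
One way to implement the text-only fallback is to keep a rolling window of latency measurements and switch modes when the median crosses the ~700ms mark mentioned above. This is a minimal sketch: `ModeSelector`, the window size, and how you measure latency are all assumptions, not part of the Realtime API.

```python
from collections import deque
from statistics import median
from typing import Deque


class ModeSelector:
    """Chooses "voice" or "text" from a rolling window of latency samples."""

    def __init__(self, threshold_ms: float = 700.0, window: int = 5) -> None:
        self.threshold_ms = threshold_ms
        # Only the most recent `window` samples are retained.
        self.samples: Deque[float] = deque(maxlen=window)

    def record(self, latency_ms: float) -> None:
        """Add one measured voice-to-voice latency sample in milliseconds."""
        self.samples.append(latency_ms)

    def mode(self) -> str:
        # Default to voice until there is at least one measurement.
        if not self.samples:
            return "voice"
        return "text" if median(self.samples) > self.threshold_ms else "voice"
```

Using the median rather than the latest sample keeps a single slow round trip from flipping the session into text mode.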

How do I prevent the voice agent from going off-script?

Strong system instructions, narrow tool definitions, and explicit refusal patterns. The instructions field is your primary control surface; spend serious effort on it. Add explicit guardrails: “If the caller asks about pricing for tier-3 plans, say you’ll have a human follow up. If the caller asks for medical advice, refuse and offer to transfer to a human.” Test extensively before production deployment, and route 5-10% of traffic to a logging-only canary deployment for the first week of any major prompt change.
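
The canary split can be made deterministic by hashing a stable call identifier, so a given call always lands in the same bucket. A minimal sketch follows; `route_call` and the bucket names are hypothetical, and the 5% default is the low end of the range suggested above.

```python
import hashlib


def route_call(call_id: str, canary_percent: float = 5.0) -> str:
    """Return "canary" for roughly `canary_percent`% of call IDs, else "stable"."""
    if not 0.0 <= canary_percent <= 100.0:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(call_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a bucket in [0, 10000).
    bucket = int.from_bytes(digest[:4], "big") % 10_000
    return "canary" if bucket < canary_percent * 100 else "stable"
```

Hashing avoids storing per-call routing state, and raising `canary_percent` only ever moves calls from stable to canary, never the other way.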

What does this mean for ElevenLabs and other voice-AI startups?

OpenAI’s pricing and quality combination just narrowed ElevenLabs’ moat substantially. ElevenLabs still leads on voice cloning, custom voices, and very-low-latency edge deployments. For commodity conversational voice agents, the differentiation has gotten harder. Expect ElevenLabs to push deeper into voice-cloning, broadcast-quality TTS, and specialized verticals where their voice library is differentiated. Some smaller voice startups will be folded into larger AI companies through acquisition over the next 12-18 months.
