Experiments with voice: the speech-to-speech architecture | Part 1


Deploying conversational AI over a WebRTC connection in a browser sandbox is largely a solved problem. Deploying it over SIP trunks to execute stateful, mission-critical database operations is not.

If you are building in this space, you already know the latency math. A standard ASR → LLM → TTS cascade pays a serialization tax at every discrete modality boundary. By the time you account for VAD endpointing, API network hops, LLM time-to-first-token (TTFT), and the TTS generation buffer, hitting a consistent sub-500ms Time-To-First-Audio (TTFA) in production, our target threshold, requires heavy structural optimization.
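
To make that math concrete, here is a back-of-the-envelope budget for a single cascaded turn. Every number below is an illustrative assumption, not a measurement from our stack:

```python
# Illustrative latency budget for one cascaded turn (ASR -> LLM -> TTS).
# All figures are hypothetical round numbers, not measured values.
budget_ms = {
    "vad_endpointing": 200,    # silence window before we decide the caller stopped
    "asr_finalization": 100,   # streaming ASR emits its final hypothesis
    "network_hops": 60,        # managed-API round trips (LLM, TTS)
    "llm_ttft": 250,           # time to first token from the LLM
    "tts_first_chunk": 120,    # TTS synthesizes enough audio to start playback
}

total = sum(budget_ms.values())
for stage, ms in budget_ms.items():
    print(f"{stage:>17}: {ms:4d} ms")
print(f"{'total TTFA':>17}: {total:4d} ms  (target: 500 ms)")
```

Even with generous assumptions, the serialized stages alone blow past the 500ms target, which is why every boundary in the cascade has to be attacked individually.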

Our first architectural experiment aimed to bypass this cascade entirely: evaluating frontier native speech-to-speech (S2S) models.

The magic of the latent space

The theoretical appeal of native S2S (like the architectures underpinning the GPT-Realtime API) is the elimination of discrete text bottlenecks. By tokenizing audio directly (e.g., via neural audio codecs like SoundStream) and mapping those acoustic tokens into a joint multi-modal latent space, the model generates audio autoregressively.
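
A minimal sketch of that flow, with stub classes standing in for a real codec and model (every name, rate, and number here is an assumption for illustration, not an actual API):

```python
import random

class StubCodec:
    """Stand-in for a neural audio codec (SoundStream-style): maps raw audio
    frames to discrete acoustic token ids and back. Purely illustrative."""
    def encode(self, frame: list[float]) -> list[int]:
        return [int(abs(sample) * 1000) % 1024 for sample in frame[:4]]

    def decode(self, tokens: list[int]) -> list[float]:
        return [0.0] * 320  # placeholder PCM samples (~20 ms at 16 kHz)

class StubS2SModel:
    """Stand-in for an S2S model that autoregressively predicts the next
    acoustic token from the running token history; no text in between."""
    def next_token(self, history: list[int]) -> int:
        return (sum(history) + 1) % 1024  # deterministic placeholder

codec, model = StubCodec(), StubS2SModel()

# Inbound audio is tokenized directly; there is no intermediate transcript.
inbound_tokens = codec.encode([random.uniform(-1, 1) for _ in range(320)])

# The model generates response audio token by token in the same latent space.
history, response_tokens = list(inbound_tokens), []
for _ in range(50):                      # ~1 s of audio at an assumed 50 tokens/s
    tok = model.next_token(history)
    history.append(tok)
    response_tokens.append(tok)

pcm_out = codec.decode(response_tokens)  # this is what streams back to the caller
```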

Our initial experiments here were genuinely captivating. Because the model processes the raw acoustic payload directly in the latent space, it successfully captures non-lexical nuances – breathing, coughing, laughing, and subtle cadence shifts – without needing explicit SSML markup or prosody-prediction models.

It feels real. We realized very quickly that humans don't actually expect a perfect, sterile voice AI. They respond better to organic imperfections. When an S2S model starts sounding so natural, it creates an interaction that is beautifully strange and familiar. That profound sense of presence is the undeniable superpower of S2S, and for non-deterministic, open-ended conversational agents, this end-to-end approach yields a highly magical UX.

The production bottlenecks

However, DevRev agents are essentially state machines. They execute complex B2B workflows, query specific database nodes, and trigger engineering webhooks. When we tested native S2S architectures against deterministic enterprise constraints, the system exhibited fatal limitations.

1. Lack of explicit lexical grounding (the WER problem)

In a cascaded pipeline, you can strictly control the ASR layer. You can apply lexical biasing, LM shallow fusion, or strict decoding constraints to perfectly transcribe company-specific acronyms.
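
The exact mechanism varies by vendor, but the simplest illustration of lexical biasing is rescoring the recognizer's n-best hypotheses against a domain lexicon. The lexicon, scores, and boost weight below are invented for illustration:

```python
# Toy n-best rescoring with a domain lexicon. The hypotheses, acoustic scores,
# and the per-term boost are made up; real systems typically bias at decode time.
DOMAIN_TERMS = {"sev-1", "postgres", "webhook", "devrev"}

def rescore(nbest: list[tuple[str, float]], boost: float = 0.5) -> str:
    def biased_score(hyp: str, score: float) -> float:
        hits = sum(term in hyp.lower() for term in DOMAIN_TERMS)
        return score + boost * hits
    return max(nbest, key=lambda pair: biased_score(*pair))[0]

nbest = [
    ("my seven one ticket for the post press cluster", -4.1),  # acoustically best
    ("my sev-1 ticket for the postgres cluster", -4.4),        # lexically correct
]
print(rescore(nbest))  # the domain-biased hypothesis wins
```

Because the cascade exposes an explicit text hypothesis, this kind of correction is cheap to bolt on; native S2S gives you no equivalent seam to intervene at.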

Native S2S processes audio without an explicit intermediate text bottleneck. In our experiments, when a caller used highly specific technical jargon (e.g., "My Sev-1 ticket for the Postgres cluster is dropping payloads"), the S2S model tended to suffer from statistical brittleness on domain-specific entities. Without explicit lexical grounding, the model can misinterpret the acoustic features of a domain-specific term, poisoning the context vector before any logical reasoning can occur.

2. Semantic density and instruction drift

Audio tokens carry far more low-level information than text tokens, capturing pitch, tone, and ambient noise alongside semantics. Consequently, the semantic density per token is much lower.
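
A rough comparison makes the point; both rates below are assumptions for illustration, not benchmarks:

```python
# Back-of-the-envelope token rates for one minute of speech.
acoustic_tokens_per_sec = 50   # assumed order of magnitude for a neural codec
speech_words_per_min = 150     # average conversational pace
text_tokens_per_word = 1.3     # rough BPE expansion factor

audio_tokens = acoustic_tokens_per_sec * 60
text_tokens = int(speech_words_per_min * text_tokens_per_word)
print(f"audio tokens/min: {audio_tokens}, text tokens/min: {text_tokens}")
print(f"~{audio_tokens / text_tokens:.0f}x more tokens carrying the same semantics")
```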

While text-native LLMs are highly optimized for long-context reasoning and strict adherence to complex system prompts via extensive RLHF, the S2S models we evaluated struggled to maintain strict behavioral alignment over a long temporal window. In our tests, S2S agents exhibited what we came to call "instruction drift": they failed to adhere to strict conditional routing logic or policy constraints during multi-turn support calls.
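
One way to make that drift measurable is a scripted multi-turn eval that checks whether a required policy action still fires late in the call. The policy, script, and agent stub below are hypothetical:

```python
# Toy adherence check: the policy says any turn requesting a refund over $500
# must be escalated. `agent_reply` is a stub that simulates drift over turns.
def agent_reply(turn_index: int, user_msg: str) -> str:
    return "ESCALATE" if ("refund" in user_msg and turn_index < 3) else "resolved"

script = [f"turn {i}: refund of $800 requested" for i in range(6)]
violations = [
    i for i, msg in enumerate(script)
    if "refund" in msg and agent_reply(i, msg) != "ESCALATE"
]
print(f"policy violations at turns: {violations}")  # non-empty output means drift
```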

3. Structured generation and the "alignment tax"

The ultimate dealbreaker for S2S in our stack was function calling. To resolve a customer issue, the agent must output a strictly formatted JSON payload to trigger a backend webhook. Native S2S architectures currently struggle with consistent structured data generation.
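
Our real tool schemas are more involved, but the guardrail pattern is simple: validate the model-emitted payload against a strict schema before the webhook ever fires. The `resolve_ticket` fields below are hypothetical:

```python
import json

# Hypothetical schema for a "resolve_ticket" webhook; field names are illustrative.
REQUIRED = {"ticket_id": str, "resolution_code": str, "notify_customer": bool}

def validate_tool_call(raw: str) -> dict:
    """Parse and strictly validate a model-emitted payload before execution."""
    payload = json.loads(raw)                       # raises on malformed JSON
    for field, expected_type in REQUIRED.items():
        if not isinstance(payload.get(field), expected_type):
            raise ValueError(f"bad or missing field: {field}")
    if set(payload) - set(REQUIRED):
        raise ValueError("unexpected fields in payload")
    return payload

# A well-formed call passes; anything malformed is rejected before execution.
print(validate_tool_call(
    '{"ticket_id": "TKT-42", "resolution_code": "FIXED", "notify_customer": true}'
))
```

The check is only possible because there is an explicit text payload to inspect before anything touches the backend, which is exactly what native S2S does not produce.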

Attempting to fix this via model alignment introduces a critical architectural tradeoff. If you heavily penalize an S2S model during training (via RL or supervised fine-tuning) to force strict adherence to rigid JSON grammars, you flatten the variance in the latent space. The model pays a massive "alignment tax" – it loses the high-fidelity prosody, emotional variance, and fluid acoustic entropy that made the S2S architecture so familiar and appealing in the first place, often collapsing into robotic or degraded output.

4. The full-duplex problem

True full-duplex communication isn't just about fast VAD. It requires the system to process inbound audio tokens, maintain isolated KV caches for generation, and stream semantic updates from mid-sentence interruptions while simultaneously generating autoregressive audio output. In the S2S implementations we evaluated, the models struggled to handle these semantic collisions gracefully without clearing the context window or introducing severe acoustic artifacts.
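
A minimal asyncio sketch of the two competing loops shows the shape of the problem; the timing constants and stubs are invented for illustration and are not taken from any real implementation:

```python
import asyncio

async def listen(interrupt: asyncio.Event) -> None:
    """Stand-in for the inbound path: after 0.3 s the caller starts talking
    over the agent, which we treat as a barge-in."""
    await asyncio.sleep(0.3)
    interrupt.set()

async def speak(interrupt: asyncio.Event) -> None:
    """Stand-in for the outbound path: emit audio chunks until interrupted."""
    for chunk in range(10):
        if interrupt.is_set():
            print(f"barge-in at chunk {chunk}: stop playback, keep the context")
            return
        print(f"playing chunk {chunk}")
        await asyncio.sleep(0.1)

async def main() -> None:
    interrupt = asyncio.Event()
    # Both directions run concurrently; that concurrency is the essence of full duplex.
    await asyncio.gather(listen(interrupt), speak(interrupt))

asyncio.run(main())
```

Even this toy version has to decide what happens to the half-spoken sentence and the partially generated context; doing that inside a single autoregressive audio model is where the implementations we tested fell over.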

Ultimately, our decision to drop S2S in its current state came down to a single, non-negotiable metric: trust and safety for voice-driven actions. When an AI agent operates on behalf of an enterprise (modifying databases, accessing user state, and so on), a hallucinated policy or a malformed JSON payload isn't just a UX glitch; it is a critical safety failure. We realized that until native S2S models can guarantee the deterministic reliability and strict behavioral guardrails required to safely execute backend actions, they cannot be deployed in front of enterprise customers.

The principle of experimentation: the death of the sunk cost fallacy

If there is one meta-lesson from navigating the S2S illusion, the telecom traps, and the hallucinations of dynamic pipelines, it is this: In the world of AI, the half-life of a technological advantage is measured in weeks. Building a sub-500ms, enterprise-grade orchestration engine wasn't about getting the architecture right on the first whiteboard session. It was about adopting a ruthless philosophy of experimentation, fundamentally enabled by a shift in how we write code.

Midway through these iterations, agentic coding with Claude reached a tipping point where, in our team's assessment, it began operating at a level comparable to a senior engineer for certain infrastructure tasks. This didn't just accelerate boilerplate generation; it completely inverted our architectural risk profile by killing the psychological sunk cost fallacy.

When a human engineer spends three weeks wrestling with legacy SIP interconnects or writing complex WebRTC C++ bindings, they fight to keep that architecture alive because it was painful to build. But when an AI agent can scaffold, test, and deploy that same complex infrastructure in a fraction of the time, the sunk cost drops dramatically. Code becomes incredibly cheap.

We must engineer for a fast loop: fast feedback, fast execution, fast iteration. We built self-hosted Asterisk servers and threw them away without blinking, because the AI built them fast enough that it didn't hurt to discard them. We evaluated frontier S2S models and moved on when they didn't meet our enterprise reliability requirements. We wrote complex LLM reasoning schemas and ripped them out in favor of static YAML personas.

You cannot be precious about your stack. The teams that will win in conversational AI are not the ones who write the most elegant architecture on day one; they are the ones who utilize AI engineering agents to run through the cycle of hypothesis, deployment, failure, and pivot faster than anyone else.

We remain incredibly optimistic about the future of S2S and expect future iterations of native audio models to crack the alignment tax. Our speculative forecast is that chained hybrid pipelines could become viable within the next year or so, though the pace of progress in this space makes any specific timeline uncertain.

But today, they are unfit for deterministic enterprise orchestration. We needed a solution that could achieve the sub-500ms TTFA of native S2S, but with the strict, JSON-compliant reasoning of a text-native LLM.

In part 2, we detail our fallback to a cascaded pipeline, the resulting latency spikes, and how relying on cloud-managed inference endpoints forced us to completely rebuild our transport layer and execute bare-metal compiler optimizations.