Building Low-Latency Voice Agents with Full-Duplex Audio Architectures

Discover how to orchestrate STT, LLM, and TTS components into a unified pipeline to create seamless conversational interfaces that minimize turn-taking lag.

AI & ML · Intermediate · 12 min read

The Anatomy of Latency in Conversational Pipelines

In a standard conversational AI architecture, the system must navigate three distinct computational phases: transcribing audio, generating a text response, and synthesizing speech. This sequence is often referred to as the STT-LLM-TTS pipeline, and its primary challenge is the additive nature of latency. If each component takes one second to process, the user experiences a three-second delay before the system responds, which feels unnatural and sluggish in a verbal conversation.

To build a truly responsive interface, developers must move away from the traditional request-response model where each stage waits for the previous one to complete fully. Instead, the goal is to implement a streaming architecture where data flows through the pipeline in small, manageable chunks. This approach allows the system to begin synthesizing the start of a response while the Large Language Model is still generating the end of the sentence.

The golden rule of conversational AI latency is that humans perceive delays over 300 milliseconds as a break in the flow of natural conversation, necessitating a shift from batch processing to continuous streaming.

Modern implementations prioritize Time To First Byte over total processing time to ensure the user hears a response as quickly as possible. This requires a deep understanding of how to manage state across asynchronous streams while maintaining the context of the conversation. By optimizing the handoff between components, we can reduce perceived latency even when the underlying models are computationally expensive.

Understanding the Sequential Bottleneck

In a naive implementation, the Speech-to-Text engine waits for a period of silence before sending a complete transcript to the LLM. The LLM then processes the entire prompt and generates a full response before passing that text to the Text-to-Speech engine. This linear dependency is the root cause of high latency in most voice applications.

Breaking this dependency requires a mental shift toward event-driven architectures where partial results trigger immediate actions in downstream components. For example, a partial transcript can be used to pre-warm the LLM, or the first few tokens of an LLM response can be sent to the TTS engine immediately to start audio generation.

Metrics that Matter: TTFT and P99 Latency

When measuring the performance of a voice pipeline, Time to First Token (how quickly the LLM emits its first token) and Time to First Byte (how quickly the first synthesized audio reaches the client) are the most critical metrics to track. Together they bound the duration between the user finishing their sentence and the system emitting its first sound. Optimizing for the average case is insufficient, as latency outliers produce a disjointed and frustrating user experience.

Engineers should focus on P99 latency, which measures the performance of the slowest 1 percent of requests. High P99 values usually indicate issues with network congestion, cold starts in serverless functions, or inefficient memory management in the audio buffering layer.
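As a back-of-the-envelope illustration (independent of any particular monitoring stack), a nearest-rank percentile over a window of voice-to-voice latency samples can be computed like this:

```python
def percentile(samples, pct):
    # Nearest-rank percentile over a list of latency samples (in ms)
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[rank]

# Ten hypothetical voice-to-voice latencies; two cold-start outliers
latencies_ms = [210, 240, 980, 230, 250, 220, 1450, 260, 245, 235]
p50 = percentile(latencies_ms, 50)  # 240 ms: the median looks healthy
p99 = percentile(latencies_ms, 99)  # 1450 ms: the tail tells a different story
```

The gap between the median and P99 here is exactly the pattern that points at cold starts or congestion rather than model speed.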

Optimizing Speech-to-Text for Real-Time Streams

The first stage of the pipeline involves converting raw audio into text using a Speech-to-Text engine. In a real-time scenario, sending a large audio file at the end of a turn is unacceptable because it introduces a massive delay proportional to the length of the speech. Instead, we use streaming STT where audio data is sent in small chunks, typically 20 to 100 milliseconds in length, over a persistent connection.

A critical component of this stage is Voice Activity Detection, which identifies when a user is speaking and when they have stopped. Accurate VAD prevents the system from trying to transcribe background noise or starting the LLM process while the user is simply taking a breath. Tuning the VAD parameters is a delicate balance between responsiveness and preventing accidental interruptions.

  • Endpointing: The logic that determines when a user has finished a thought based on silence duration.
  • Partial Transcripts: Real-time text updates that provide a glimpse of the user's speech before it is finalized.
  • Interim Results: Using unstable transcript segments to trigger early lookups or context preparation in the LLM layer.
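The endpointing logic described above can be sketched with a toy energy-threshold detector. Production systems use neural VAD models rather than raw frame energy, so treat the threshold and frame sizes here as illustrative placeholders:

```python
class SilenceEndpointer:
    """Toy endpointer: declares end-of-turn after sustained silence.

    Real systems use neural VADs; this sketch only shows the
    silence-duration logic that drives the endpointing decision.
    """

    def __init__(self, energy_threshold=500.0, silence_ms=700, frame_ms=20):
        self.energy_threshold = energy_threshold
        self.frames_needed = silence_ms // frame_ms  # silent frames before endpoint
        self.silent_frames = 0
        self.speech_seen = False

    def process_frame(self, frame_energy):
        # Returns True exactly once, when an end-of-turn is detected
        if frame_energy >= self.energy_threshold:
            self.speech_seen = True
            self.silent_frames = 0
            return False
        if not self.speech_seen:
            return False  # leading silence: user has not started speaking yet
        self.silent_frames += 1
        if self.silent_frames >= self.frames_needed:
            self.speech_seen = False
            self.silent_frames = 0
            return True
        return False
```

Raising `silence_ms` makes the agent more patient but slower to respond; lowering it risks cutting the user off mid-breath — the exact trade-off described above.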

Many developers use WebSockets to maintain a full-duplex communication channel between the client and the STT server. This allows the server to push transcript updates to the client as soon as they are available. Handling these partial results correctly is vital for features like real-time UI updates or early intent recognition.
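A minimal sketch of handling those pushed updates, assuming a hypothetical JSON message schema with `partial` and `final` transcript types (your STT provider's actual wire format will differ):

```python
import json

def handle_stt_message(raw, state):
    # Hypothetical schema: {"type": "partial" | "final", "text": "..."}
    msg = json.loads(raw)
    if msg["type"] == "partial":
        state["live_text"] = msg["text"]  # feeds real-time UI updates
        return None                        # nothing finalized yet
    if msg["type"] == "final":
        state["live_text"] = ""
        return msg["text"]                 # finalized text is handed to the LLM
    return None
```

Keeping partial and final handling separate is what lets the UI update live while the LLM is only triggered on finalization.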

Implementing Robust Voice Activity Detection

VAD is more than just measuring volume levels; it involves distinguishing human speech from ambient sounds like keyboard clicks or street noise. Modern VAD models use lightweight neural networks that run on the edge to minimize the amount of audio data that needs to be sent to the cloud. This reduces bandwidth usage and improves the privacy of the application.

When the VAD identifies an endpoint, it signals the STT engine to finalize the transcript. This finalization event is the trigger for the next stage of the pipeline, and its timing must be precise to avoid cutting off the user or waiting too long in silence.

Handling Audio Chunking and Buffering

Audio data must be properly formatted and chunked before transmission to the STT provider. Most engines expect raw PCM or Opus encoded audio at specific sample rates, such as 16kHz or 48kHz. Inconsistent sample rates or mismatched buffer sizes can lead to audio artifacts that degrade transcription accuracy.
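The arithmetic behind chunk sizing is straightforward: bytes per chunk equals samples per chunk times bytes per sample times channel count. A small helper makes the relationship explicit:

```python
def pcm_chunk_bytes(sample_rate_hz, chunk_ms, bytes_per_sample=2, channels=1):
    # Size in bytes of one chunk of raw PCM audio
    samples = sample_rate_hz * chunk_ms // 1000
    return samples * bytes_per_sample * channels

# 20 ms of 16 kHz mono 16-bit PCM
print(pcm_chunk_bytes(16_000, 20))  # 640 bytes
```

Agreeing on this number at both ends of the connection is what prevents the mismatched-buffer artifacts mentioned above.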

Audio Stream Buffer Management (Python)

```python
import asyncio


async def stream_audio_to_stt(audio_queue: asyncio.Queue, stt_client):
    # Process audio chunks from a queue as they arrive from the microphone
    while True:
        audio_chunk = await audio_queue.get()
        if audio_chunk is None:  # sentinel: the microphone stream has closed
            break

        # Send binary audio data to the STT service via WebSocket
        await stt_client.send_bytes(audio_chunk)

        # Mark the item as processed so audio_queue.join() can track completion
        audio_queue.task_done()
```

Bridging STT and LLM with Streaming Tokens

Once the STT engine provides a finalized transcript, it is passed to the LLM to generate a response. To keep latency low, we must use the streaming capability of the LLM provider to receive words as they are generated. If the system waits for the entire paragraph to be finished, the user will face an agonizing wait, especially for long or complex answers.

As tokens stream in from the LLM, they need to be aggregated into sentences or phrases before being sent to the TTS engine. This is because most TTS models produce higher quality audio when they have more context, and synthesizing individual words results in robotic, choppy speech. Finding the right threshold for this aggregation is key to balancing speed and quality.

The orchestrator must also manage the conversational state, ensuring that the new turn is appended to the message history correctly. This context is essential for the LLM to understand references like it or that from previous turns. Managing this history while handling a stream of tokens requires a robust state machine in your application logic.
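A minimal sketch of such a state holder, using the role/content message shape common to chat-style LLM APIs (the `[interrupted by user]` marker is one possible convention, not a standard):

```python
class ConversationState:
    # Minimal message-history manager for multi-turn context
    def __init__(self, system_prompt):
        self.messages = [{"role": "system", "content": system_prompt}]

    def add_user_turn(self, transcript):
        self.messages.append({"role": "user", "content": transcript})

    def add_assistant_turn(self, text, interrupted=False):
        # Mark interrupted turns so the model knows the user cut them off
        if interrupted:
            text += " [interrupted by user]"
        self.messages.append({"role": "assistant", "content": text})
```

Recording what the assistant actually said — including where it was cut off — is what keeps later references like "it" or "that" resolvable.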

Sentence-Level Aggregation Strategies

A common technique for bridging LLMs and TTS is to buffer tokens until a punctuation mark like a period, question mark, or exclamation point is reached. This ensures that the TTS engine has a complete semantic unit to work with, allowing it to apply correct prosody and intonation. However, for very long sentences, you might need to split the text at commas to keep the audio flowing.

The following code demonstrates a simple aggregator that listens to an LLM stream and yields complete sentences to the next stage of the pipeline. This pattern prevents the system from starting the TTS too early with incomplete thoughts.

Managing Context in Real-Time

In a real-time conversation, the user might interrupt the assistant mid-sentence. When this happens, the system must immediately stop the current LLM generation and the TTS playback. The system then needs to decide how to handle the interrupted context: should the partial response be saved in the history, or should it be discarded?
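One way to realize this stop-immediately behavior on the server is to race the response task against a barge-in signal; a sketch using asyncio, where `generate_and_speak` is a stand-in for your combined LLM-and-TTS coroutine:

```python
import asyncio

async def run_turn(generate_and_speak, barge_in_event):
    # Race the response pipeline against a barge-in signal
    speak_task = asyncio.create_task(generate_and_speak())
    barge_task = asyncio.create_task(barge_in_event.wait())
    done, _pending = await asyncio.wait(
        {speak_task, barge_task}, return_when=asyncio.FIRST_COMPLETED
    )
    if barge_task in done:
        speak_task.cancel()  # stop LLM generation and TTS immediately
        try:
            await speak_task
        except asyncio.CancelledError:
            pass
        return "interrupted"
    barge_task.cancel()
    return speak_task.result()
```

Cancellation propagates into whatever the task was awaiting, which is what lets a single signal tear down both the generation and the synthesis stage.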

LLM Token Stream Processor (Python)

```python
async def process_llm_stream(token_generator):
    sentence_buffer = []
    # Iterate over tokens streamed from the LLM API
    async for token in token_generator:
        sentence_buffer.append(token)

        # Sentence-final punctuation marks the end of a complete thought
        if token.rstrip().endswith((".", "?", "!")):
            full_sentence = "".join(sentence_buffer).strip()
            # Yield the complete sentence for the TTS engine to process
            yield full_sentence
            sentence_buffer = []

    # Flush trailing text that never received closing punctuation
    remainder = "".join(sentence_buffer).strip()
    if remainder:
        yield remainder
```

Neural Voice Synthesis and Playback Control

The final stage of the pipeline is converting the generated text back into high-fidelity audio. Modern neural TTS engines can produce human-like speech with realistic emotion and cadence, but they are often the most computationally intensive part of the stack. Just like the previous stages, the TTS process must be streamed to allow the client to start playing audio as soon as the first few bytes are synthesized.

Managing the audio buffer on the client side is a non-trivial task. You must ensure that the player has enough data to prevent underruns, which cause audible glitches, while keeping the buffer small enough to allow for immediate interruptions. If the user starts speaking, the client must be able to clear the buffer and stop playback instantly.

There is a significant trade-off between the complexity of the voice model and the speed of synthesis. While heavy models offer better naturalism, lighter models provide faster response times. For most real-time applications, an optimized model that supports streaming output is the preferred choice to maintain the conversational rhythm.

Streaming TTS Audio via WebSockets

Sending a single large MP3 file is not suitable for real-time interaction. Instead, the server should send raw audio samples or small compressed frames over a WebSocket. The client-side application then uses an API like the Web Audio API to queue these frames in an AudioWorklet or a series of buffers for seamless playback.

This streaming approach allows the user to hear the beginning of a sentence while the server is still calculating the waveform for the end of that same sentence. This overlap is what makes the interaction feel instantaneous and life-like.

Handling Interruptions and Barge-in

Barge-in is the ability for a user to interrupt the AI while it is speaking. Detecting barge-in requires the STT engine to be active even while the TTS is playing audio. This creates an echo cancellation challenge, as the system must distinguish between the assistant's own voice coming through the speakers and the user's new input.

Client-Side Audio Buffer Playback (JavaScript)

```javascript
const audioCtx = new (window.AudioContext || window.webkitAudioContext)();
const sourceQueue = [];

function playReceivedChunk(audioData) {
    // Decode the incoming binary chunk and schedule it for playback
    audioCtx.decodeAudioData(audioData, (buffer) => {
        const source = audioCtx.createBufferSource();
        source.buffer = buffer;
        source.connect(audioCtx.destination);

        // Start where the previous buffer ends, but never in the past
        const queuedEnd = sourceQueue.length > 0
            ? sourceQueue[sourceQueue.length - 1].endTime
            : audioCtx.currentTime;
        const startTime = Math.max(queuedEnd, audioCtx.currentTime);
        source.start(startTime);
        sourceQueue.push({ source, endTime: startTime + buffer.duration });
    });
}

function stopAllPlayback() {
    // Immediately stop all queued audio sources on user interruption
    sourceQueue.forEach((item) => item.source.stop());
    sourceQueue.length = 0;
}
```

Architectural Patterns for Low-Latency Conversational AI

To tie everything together, you need a central orchestrator that manages the lifecycle of a conversation. This orchestrator is responsible for routing data between the STT, LLM, and TTS components and handling edge cases like network timeouts or model errors. A common pattern is to use an asynchronous event loop that processes messages from different queues.
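A stripped-down sketch of that queue-driven loop, where `llm` and `tts` are stand-ins for your streaming LLM client (yielding already-aggregated sentences) and your synthesis call:

```python
import asyncio

async def pipeline(transcripts: asyncio.Queue, audio_out: asyncio.Queue, llm, tts):
    # Central loop: route finalized transcripts through the LLM and TTS stages
    while True:
        transcript = await transcripts.get()
        if transcript is None:            # shutdown sentinel
            await audio_out.put(None)
            break
        async for sentence in llm(transcript):  # streamed, sentence-aggregated
            audio = await tts(sentence)         # synthesize each sentence
            await audio_out.put(audio)          # client drains this queue
```

Because each stage only touches its neighboring queues, timeouts, retries, and cancellation can be handled per stage without the others knowing.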

Choosing the right communication protocol is also vital. While WebSockets are the most common choice for their simplicity and broad support, WebRTC is gaining popularity for voice applications because it is designed for low-latency media streaming. WebRTC can significantly reduce the jitter and delay associated with TCP-based protocols like WebSockets.

Finally, deploying these components geographically close to your users can shave off hundreds of milliseconds of round-trip time. Using an edge computing strategy where the STT and TTS processes are handled at the network edge can make the difference between a clunky interface and a seamless personal assistant.

WebSocket vs WebRTC for Voice

WebSockets run over TCP, which ensures that every packet arrives in order but can cause delays if a packet is lost and needs to be retransmitted. WebRTC uses UDP, which prioritizes speed over perfect reliability, making it ideal for real-time audio where a tiny lost fragment is better than a long delay. For production-grade voice bots, WebRTC is often the superior choice for the media transport layer.

However, WebRTC is more complex to implement and requires specialized infrastructure like STUN and TURN servers to handle NAT traversal. Many developers start with WebSockets for rapid prototyping and migrate to WebRTC as they scale and optimize for the lowest possible latency.
