SALM-Duplex: Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model
I recently read an intriguing paper from the NVIDIA team, SALM-Duplex: Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model (published on arXiv, June 2025), that caught my attention. The paper delves into the world of full-duplex speech-to-speech language models, specifically addressing the challenge of enabling natural, real-time conversations with AI that involve simultaneous listening and speaking, including user interruptions (barge-in). In this blog post, we'll dissect the paper and provide insights into this exciting development.
Premise: Exploring the Motivation Behind the Paper
Current speech language models are limited to turn-based exchanges, operating in a half-duplex manner where only one party speaks at a time. This creates an unnatural conversation flow that lacks:
- Real-time adaptability - inability to handle user barge-in/interruptions
- Natural conversation dynamics - lack of simultaneous listening and speaking
- Low-latency responses - delays in turn-taking
Existing full-duplex speech-to-speech (S2S) models face significant challenges:
- Increased complexity - requiring additional submodules for turn-taking management
- Heavy resource requirements - extensive speech-text pretraining on top of LLM backbones
- Delicate balance - simultaneous codec modeling for both speech perception and generation
- Limited accessibility - no openly available duplex S2S models with training/inference code
The motivation is to create a simpler, more efficient, and accessible full-duplex S2S model that can handle natural conversational behaviors without the overhead of previous approaches.
Research Questions and Hypotheses
Primary Research Questions:
- Can we build a duplex S2S model without requiring speech-text pretraining?
- Hypothesis: Using a pretrained streaming encoder for user input eliminates the need for speech pretraining
- How can we achieve better agent voice quality while reducing computational requirements?
- Hypothesis: Separate architectures for agent and user modeling enable codec fine-tuning for personalized agent voices
- Can we enable robust barge-in and turn-taking capabilities through data augmentation alone?
- Hypothesis: Synthetic duplex data with carefully designed silence patterns and multi-turn structures can teach the model conversational behaviors
- Is it possible to reduce the bitrate requirements while maintaining or improving quality?
- Hypothesis: Codec personalization can achieve better quality at 0.6 kbps compared to previous 1.1+ kbps approaches
Introduce Proposed Method
Architecture Overview
The SALM-Duplex model consists of three main components:
1. Streaming Speech Encoder (User Input)
- 100M parameter CTC-based streaming encoder
- Processes user speech at 80ms frame rate
- Generates continuous embeddings without requiring speech pretraining
- Enables causal, streaming processing
2. Speech Codec (Agent Output)
- NanoCodec with Finite Scalar Quantization (FSQ)
- 4 independent parallel codebooks at 12.5 Hz frame rate
- Operates at ultra-low 0.6 kbps bitrate (half the previous 1.1 kbps)
- Personalization-friendly design - can be fine-tuned for specific agent voices
- Enables parallel prediction of all codebooks with minimal latency
3. Decoder-only LLM Backbone
- Initialized from TinyLlama-1.1B-chat
- Extended vocabulary to include speech codec tokens
- Modality adapter between speech encoder and text LLM
- Processes time-aligned user and agent streams simultaneously
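The codec numbers above are easy to sanity-check: 4 parallel codebooks with ~4,037 entries each need about 12 bits per codebook, and at 12.5 Hz that works out to the stated 0.6 kbps. A quick back-of-the-envelope check in Python, using only the constants reported in the paper:

```python
import math

# Sanity check of the NanoCodec bitrate, using the configuration reported in
# the paper: 4 parallel FSQ codebooks, ~4,037 entries each, 12.5 Hz frame rate.
codebooks = 4
vocab_per_codebook = 4037
frame_rate_hz = 12.5

bits_per_codebook = math.ceil(math.log2(vocab_per_codebook))  # 12 bits
bitrate_bps = codebooks * bits_per_codebook * frame_rate_hz
print(bitrate_bps)  # 600.0 bps, i.e. 0.6 kbps
```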
Key Design Innovations
Dual-Stream Architecture:
- User stream: Continuous embeddings from pretrained encoder
- Agent stream: Text + 4 parallel audio codec channels
- Both streams are time-aligned and summed as input to LLM
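A minimal sketch of the dual-stream fusion, assuming the two streams are already time-aligned at the same frame rate. The nested lists below stand in for real embedding tensors, and `fuse_streams` is an illustrative name, not the paper's API:

```python
# Toy illustration of dual-stream fusion: the user stream (continuous encoder
# embeddings) and the agent stream (embedded text + codec tokens) are
# time-aligned and summed element-wise before entering the LLM.
# Nested lists stand in for [time, dim] embedding tensors.

def fuse_streams(user_emb, agent_emb):
    """Element-wise sum of two time-aligned embedding streams."""
    assert len(user_emb) == len(agent_emb), "streams must be time-aligned"
    return [[u + a for u, a in zip(uf, af)]
            for uf, af in zip(user_emb, agent_emb)]

user = [[0.1, 0.2], [0.3, 0.4]]    # 2 frames, embedding dim 2
agent = [[1.0, 1.0], [2.0, 2.0]]
fused = fuse_streams(user, agent)  # approximately [[1.1, 1.2], [2.3, 2.4]]
```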
Channel Fusion:
- Text and speech tokens aligned at turn level (not word level)
- Small delay (one token) introduced to speech channels to better condition on text
- Separate `<BOS>` and `<EOS>` tokens for text and speech
- Padding between text and speech tokens using the text pad ID
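The one-token delay on the speech channels can be sketched as a simple right shift with front padding. This is a hypothetical illustration; `delay_channel` and the pad ID are my own names, not taken from the released code:

```python
# Hypothetical sketch of the one-token delay on the speech channels, so speech
# prediction at step t can condition on the text token emitted at step t.
# PAD is an illustrative pad ID; the real scheme pads with the text pad ID.

PAD = 0

def delay_channel(tokens, delay=1, pad=PAD):
    """Shift a token sequence right by `delay` steps, padding the front."""
    return [pad] * delay + tokens[: len(tokens) - delay]

speech_channel = [11, 12, 13, 14]
print(delay_channel(speech_channel))  # [0, 11, 12, 13]
```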
Multi-Channel Next Token Prediction:
- Simultaneous prediction of text and speech
- Text and speech loss weighted differently (3:1 ratio)
- Agent learns to generate text first, then speech follows with slight delay
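A toy sketch of how the 3:1 text-to-speech loss weighting might be combined. The inputs are placeholder scalars, not real cross-entropy values, and averaging over the four codec channels is my assumption about the reduction:

```python
# Toy sketch of the 3:1 text-to-speech loss weighting. The inputs are
# placeholder scalars standing in for per-channel cross-entropy losses.

TEXT_WEIGHT, SPEECH_WEIGHT = 3.0, 1.0

def combine_losses(text_loss, speech_channel_losses):
    """Weighted sum: one text channel plus the mean over the codec channels."""
    speech_mean = sum(speech_channel_losses) / len(speech_channel_losses)
    return TEXT_WEIGHT * text_loss + SPEECH_WEIGHT * speech_mean

print(combine_losses(0.5, [0.25, 0.25, 0.25, 0.25]))  # 1.75
```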
Experiment Details
Training Setup
Hardware & Scale:
- 32 A100 (80GB) GPUs
- Batch duration: 1000 seconds per GPU
- Implemented in PyTorch using NeMo Toolkit
Optimization:
- FusedAdam optimizer
- Inverse Square Root Annealing LR schedule
- Initial learning rate: 3e-4 with 2500 step warmup
- Gradient clipping at threshold 1.0
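The reported schedule (peak LR 3e-4, 2500 warmup steps, inverse square root annealing) can be sketched as follows; the exact NeMo implementation may differ in offsets and minimum-LR handling:

```python
# Sketch of linear warmup followed by inverse-square-root annealing, with the
# reported settings (peak LR 3e-4, 2500 warmup steps). NeMo's actual scheduler
# may differ in offset and minimum-LR details.

PEAK_LR, WARMUP_STEPS = 3e-4, 2500

def lr_at(step):
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS        # linear warmup
    return PEAK_LR * (WARMUP_STEPS / step) ** 0.5   # inverse sqrt decay

print(lr_at(2500))   # 0.0003 (peak, end of warmup)
print(lr_at(10000))  # 0.00015 (sqrt(2500/10000) = 0.5 of peak)
```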
Model Initialization:
- Speech encoder: 100M streaming pretrained encoder (80ms right context)
- LLM: TinyLlama-1.1B-chat
- Codec: Personalized 0.6 kbps NanoCodec with 4 channels (vocab size 4,037 each)
- Text tokenizer: 32k SentencePiece
Training Data (26.7k hours total)
1. Single-turn Spoken QA (20.4k hours):
- ASR-QA (20k hours): Multi-speaker TTS synthesis from ASR-labeled data
- MS MARCO (0.2k hours): TTS-synthesized questions/answers
- Alpaca (0.2k hours): TTS-synthesized instruction following
2. Multi-turn Conversations (6.3k hours):
- Internal SFT (3k hours): Real multi-turn spoken QA
- UltraChat (3k hours): 4-turn synthetic conversations
- Topic (0.3k hours): 4-turn conversations on 63 topics
Duplex Data Format:
- Two separate streams (user and agent)
- 0.64s silence between turns
- Barge-in augmentation with 0.64s agent speech retained after cutoff
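The barge-in augmentation can be illustrated with a small helper: the agent's turn is truncated so that 0.64 s of agent speech remains after the user's cut-in point. Times are in seconds here; the real pipeline operates on audio frames, and `apply_barge_in` is an assumed name:

```python
# Illustrative barge-in augmentation: after a user interruption at
# barge_in_time_s, the agent turn is cut so that 0.64 s of agent speech
# remains past the cut-in point. Times in seconds; real data uses frames.

POST_BARGE_IN_S = 0.64

def apply_barge_in(agent_turn_len_s, barge_in_time_s):
    """Truncated agent-turn length after a user barge-in."""
    return min(agent_turn_len_s, barge_in_time_s + POST_BARGE_IN_S)

print(apply_barge_in(5.0, 2.0))  # 2.64: agent keeps talking 0.64 s past cutoff
print(apply_barge_in(5.0, 6.0))  # 5.0: barge-in after the turn already ended
```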
Key Findings
1. Superior Turn-taking and Barge-in Performance
Compared to Moshi:
- 27-percentage-point higher barge-in success rate (83.0% vs 56.0% on UltraChat)
- 0.11-0.12s lower latency for barge-in response
- 0.4-0.5 point improvement in speech quality (UTMOS)
- Zero false alarms
2. Competitive Reasoning Quality
Outperforms Moshi on all datasets despite using:
- Much less training data
- A smaller backbone (1.1B vs 7B)
The margin is especially large on reasoning tasks (ASR-QA: 7.8 vs 1.9).
3. Codec Personalization Breakthrough
Personalized codec at 0.6 kbps outperforms 1.1-1.2 kbps codecs across all metrics:
- Nearly 50% bitrate reduction
- Trained on only 21k utterances from target speaker
- Improves both audio reconstruction and end-to-end S2S quality
4. First Open-Source Duplex S2S Model
- No speech pretraining required
- Publicly available training and inference code
- Simplified pipeline - can build from any LLM
5. Robust Behavior in Challenging Scenarios
- Handles frequent interruptions (3 barge-ins in 15 seconds)
- Unseen reasoning problems
- Multi-turn conversation coherence maintained
Limitations
- Reasoning Gap with Cascaded Systems - Still lags behind optimal cascaded approaches on some datasets
- First Response Latency Artifacts - 0.64s silence creates artificial delay pattern
- Model Scale Limitations - Small 1.1B backbone limits capabilities
- Synthetic Data Dependence - Heavy reliance on TTS-synthesized data
- Evaluation Methodology Constraints - GPT-based scoring may have biases
- Limited Comparison Scope - Only compared against Moshi
- Personalization Trade-offs - Requires 21k utterances per speaker
- Turn-level Alignment - May limit fine-grained timing control
- Missing Real-world Testing - No real user interaction evaluation
- Computational Requirements Not Fully Disclosed - Inference costs unclear
Conclusion
The SALM-Duplex paper presents a significant advancement in full-duplex speech-to-speech modeling by demonstrating that high-quality conversational AI is achievable without speech pretraining, with reduced bitrate, and with simpler architecture. The open-sourcing of code addresses a critical gap in reproducibility for duplex S2S research.