SALM-Duplex: Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model
I recently read an intriguing paper from the NVIDIA team, SALM-Duplex: Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model (published on arXiv, June 2025), that caught my attention. The paper delves into the world of full-duplex speech-to-speech language models, specifically addressing the challenge of enabling natural, real-time conversations with AI that involve simultaneous listening and speaking, including user interruptions (barge-in). In this blog post, we'll dissect the paper and provide insights into this exciting development.
Premise: Exploring the Motivation Behind the Paper
Current speech language models are limited to turn-based exchanges, operating in a half-duplex manner where only one party speaks at a time. This creates an unnatural conversation flow that lacks:
- Real-time adaptability - inability to handle user barge-in/interruptions
- Natural conversation dynamics - lack of simultaneous listening and speaking
- Low-latency responses - delays in turn-taking
Existing full-duplex speech-to-speech (S2S) models face significant challenges:
- Increased complexity - requiring additional submodules for turn-taking management
- Heavy resource requirements - extensive speech-text pretraining on top of LLM backbones
- Delicate balance - simultaneous codec modeling for both speech perception and generation
- Limited accessibility - no openly available duplex S2S models with training/inference code
The motivation is to create a simpler, more efficient, and accessible full-duplex S2S model that can handle natural conversational behaviors without the overhead of previous approaches.
Research Questions and Hypotheses
Primary Research Questions:
- Can we build a duplex S2S model without requiring speech-text pretraining?
- Hypothesis: Using a pretrained streaming encoder for user input eliminates the need for speech pretraining
- How can we achieve better agent voice quality while reducing computational requirements?
- Hypothesis: Separate architectures for agent and user modeling enable codec fine-tuning for personalized agent voices
- Can we enable robust barge-in and turn-taking capabilities through data augmentation alone?
- Hypothesis: Synthetic duplex data with carefully designed silence patterns and multi-turn structures can teach the model conversational behaviors
- Is it possible to reduce the bitrate requirements while maintaining or improving quality?
- Hypothesis: Codec personalization can achieve better quality at 0.6 kbps compared to previous 1.1+ kbps approaches
Introduce Proposed Method
Architecture Overview
The SALM-Duplex model consists of three main components:
1. Streaming Speech Encoder (User Input)
- 100M parameter CTC-based streaming encoder
- Processes user speech at 80ms frame rate
- Generates continuous embeddings without requiring speech pretraining
- Enables causal, streaming processing
2. Speech Codec (Agent Output)
- NanoCodec with Finite Scalar Quantization (FSQ)
- 4 independent parallel codebooks at 12.5 Hz frame rate
- Operates at ultra-low 0.6 kbps bitrate (half the previous 1.1 kbps)
- Personalization-friendly design - can be fine-tuned for specific agent voices
- Enables parallel prediction of all codebooks with minimal latency
3. Decoder-only LLM Backbone
- Initialized from TinyLlama-1.1B-chat
- Extended vocabulary to include speech codec tokens
- Modality adapter between speech encoder and text LLM
- Processes time-aligned user and agent streams simultaneously
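The codec numbers above are easy to sanity-check: 4 parallel codebooks with ~4,037 entries each need about 12 bits per codebook, and at 12.5 Hz that works out to the stated 0.6 kbps. A quick back-of-the-envelope check in Python, using only the constants reported in the paper:

```python
import math

# Sanity check of the NanoCodec bitrate, using the configuration reported in
# the paper: 4 parallel FSQ codebooks, ~4,037 entries each, 12.5 Hz frame rate.
codebooks = 4
vocab_per_codebook = 4037
frame_rate_hz = 12.5

bits_per_codebook = math.ceil(math.log2(vocab_per_codebook))  # 12 bits
bitrate_bps = codebooks * bits_per_codebook * frame_rate_hz
print(bitrate_bps)  # 600.0 bps, i.e. 0.6 kbps
```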
Key Design Innovations
Dual-Stream Architecture:
- User stream: Continuous embeddings from pretrained encoder
- Agent stream: Text + 4 parallel audio codec channels
- Both streams are time-aligned and summed as input to LLM
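A minimal sketch of the dual-stream fusion, assuming the two streams are already time-aligned at the same frame rate. The nested lists below stand in for real embedding tensors, and `fuse_streams` is an illustrative name, not the paper's API:

```python
# Toy illustration of dual-stream fusion: the user stream (continuous encoder
# embeddings) and the agent stream (embedded text + codec tokens) are
# time-aligned and summed element-wise before entering the LLM.
# Nested lists stand in for [time, dim] embedding tensors.

def fuse_streams(user_emb, agent_emb):
    """Element-wise sum of two time-aligned embedding streams."""
    assert len(user_emb) == len(agent_emb), "streams must be time-aligned"
    return [[u + a for u, a in zip(uf, af)]
            for uf, af in zip(user_emb, agent_emb)]

user = [[0.1, 0.2], [0.3, 0.4]]    # 2 frames, embedding dim 2
agent = [[1.0, 1.0], [2.0, 2.0]]
fused = fuse_streams(user, agent)  # approximately [[1.1, 1.2], [2.3, 2.4]]
```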
Channel Fusion:
- Text and speech tokens aligned at turn level (not word level)
- Small delay (one token) introduced to speech channels to better condition on text
- Separate `<BOS>` and `<EOS>` tokens for text and speech
- Padding between text and speech tokens using the text pad ID
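The one-token delay on the speech channels can be sketched as a simple right shift with front padding. This is a hypothetical illustration; `delay_channel` and the pad ID are my own names, not taken from the released code:

```python
# Hypothetical sketch of the one-token delay on the speech channels, so speech
# prediction at step t can condition on the text token emitted at step t.
# PAD is an illustrative pad ID; the real scheme pads with the text pad ID.

PAD = 0

def delay_channel(tokens, delay=1, pad=PAD):
    """Shift a token sequence right by `delay` steps, padding the front."""
    return [pad] * delay + tokens[: len(tokens) - delay]

speech_channel = [11, 12, 13, 14]
print(delay_channel(speech_channel))  # [0, 11, 12, 13]
```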
Multi-Channel Next Token Prediction:
- Simultaneous prediction of text and speech
- Text and speech loss weighted differently (3:1 ratio)
- Agent learns to generate text first, then speech follows with slight delay
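A toy sketch of how the 3:1 text-to-speech loss weighting might be combined. The inputs are placeholder scalars, not real cross-entropy values, and averaging over the four codec channels is my assumption about the reduction:

```python
# Toy sketch of the 3:1 text-to-speech loss weighting. The inputs are
# placeholder scalars standing in for per-channel cross-entropy losses.

TEXT_WEIGHT, SPEECH_WEIGHT = 3.0, 1.0

def combine_losses(text_loss, speech_channel_losses):
    """Weighted sum: one text channel plus the mean over the codec channels."""
    speech_mean = sum(speech_channel_losses) / len(speech_channel_losses)
    return TEXT_WEIGHT * text_loss + SPEECH_WEIGHT * speech_mean

print(combine_losses(0.5, [0.25, 0.25, 0.25, 0.25]))  # 1.75
```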
Experiment Details
Training Setup
Hardware & Scale:
- 32 A100 (80GB) GPUs
- Batch duration: 1000 seconds per GPU
- Implemented in PyTorch using NeMo Toolkit
Optimization:
- FusedAdam optimizer
- Inverse Square Root Annealing LR schedule
- Initial learning rate: 3e-4 with 2500 step warmup
- Gradient clipping at threshold 1.0
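The reported schedule (peak LR 3e-4, 2500 warmup steps, inverse square root annealing) can be sketched as follows; the exact NeMo implementation may differ in offsets and minimum-LR handling:

```python
# Sketch of linear warmup followed by inverse-square-root annealing, with the
# reported settings (peak LR 3e-4, 2500 warmup steps). NeMo's actual scheduler
# may differ in offset and minimum-LR details.

PEAK_LR, WARMUP_STEPS = 3e-4, 2500

def lr_at(step):
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS        # linear warmup
    return PEAK_LR * (WARMUP_STEPS / step) ** 0.5   # inverse sqrt decay

print(lr_at(2500))   # 0.0003 (peak, end of warmup)
print(lr_at(10000))  # 0.00015 (sqrt(2500/10000) = 0.5 of peak)
```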
Model Initialization:
- Speech encoder: 100M streaming pretrained encoder (80ms right context)
- LLM: TinyLlama-1.1B-chat
- Codec: Personalized 0.6 kbps NanoCodec with 4 channels (vocab size 4,037 each)
- Text tokenizer: 32k SentencePiece
Training Data (26.7k hours total)
1. Single-turn Spoken QA (20.4k hours):
- ASR-QA (20k hours): Multi-speaker TTS synthesis from ASR-labeled data
- MS MARCO (0.2k hours): TTS-synthesized questions/answers
- Alpaca (0.2k hours): TTS-synthesized instruction following
2. Multi-turn Conversations (6.3k hours):
- Internal SFT (3k hours): Real multi-turn spoken QA
- UltraChat (3k hours): 4-turn synthetic conversations
- Topic (0.3k hours): 4-turn conversations on 63 topics
Duplex Data Format:
- Two separate streams (user and agent)
- 0.64s silence between turns
- Barge-in augmentation with 0.64s agent speech retained after cutoff
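The barge-in augmentation can be illustrated with a small helper: the agent's turn is truncated so that 0.64 s of agent speech remains after the user's cut-in point. Times are in seconds here; the real pipeline operates on audio frames, and `apply_barge_in` is an assumed name:

```python
# Illustrative barge-in augmentation: after a user interruption at
# barge_in_time_s, the agent turn is cut so that 0.64 s of agent speech
# remains past the cut-in point. Times in seconds; real data uses frames.

POST_BARGE_IN_S = 0.64

def apply_barge_in(agent_turn_len_s, barge_in_time_s):
    """Truncated agent-turn length after a user barge-in."""
    return min(agent_turn_len_s, barge_in_time_s + POST_BARGE_IN_S)

print(apply_barge_in(5.0, 2.0))  # 2.64: agent keeps talking 0.64 s past cutoff
print(apply_barge_in(5.0, 6.0))  # 5.0: barge-in after the turn already ended
```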
Key Findings
1. Superior Turn-taking and Barge-in Performance
Compared to Moshi:
- 27-percentage-point higher barge-in success rate (83.0% vs 56.0% on UltraChat)
- 0.11-0.12s lower latency for barge-in response
- 0.4-0.5 point improvement in speech quality (UTMOS)
- Zero false alarms
2. Competitive Reasoning Quality
Outperforms Moshi on all datasets despite using:
- Much less training data
- A smaller backbone (1.1B vs 7B)
The margin is especially large on reasoning tasks (ASR-QA: 7.8 vs 1.9).
3. Codec Personalization Breakthrough
Personalized codec at 0.6 kbps outperforms 1.1-1.2 kbps codecs across all metrics:
- Nearly 50% bitrate reduction
- Trained on only 21k utterances from target speaker
- Improves both audio reconstruction and end-to-end S2S quality
4. First Open-Source Duplex S2S Model
- No speech pretraining required
- Publicly available training and inference code
- Simplified pipeline - can build from any LLM
5. Robust Behavior in Challenging Scenarios
- Handles frequent interruptions (3 barge-ins in 15 seconds)
- Unseen reasoning problems
- Multi-turn conversation coherence maintained
Limitations
- Reasoning Gap with Cascaded Systems - Still lags behind optimal cascaded approaches on some datasets
- First Response Latency Artifacts - 0.64s silence creates artificial delay pattern
- Model Scale Limitations - Small 1.1B backbone limits capabilities
- Synthetic Data Dependence - Heavy reliance on TTS-synthesized data
- Evaluation Methodology Constraints - GPT-based scoring may have biases
- Limited Comparison Scope - Only compared against Moshi
- Personalization Trade-offs - Requires 21k utterances per speaker
- Turn-level Alignment - May limit fine-grained timing control
- Missing Real-world Testing - No real user interaction evaluation
- Computational Requirements Not Fully Disclosed - Inference costs unclear
Conclusion
The SALM-Duplex paper presents a significant advancement in full-duplex speech-to-speech modeling by demonstrating that high-quality conversational AI is achievable without speech pretraining, with reduced bitrate, and with simpler architecture. The open-sourcing of code addresses a critical gap in reproducibility for duplex S2S research.