I recently read an intriguing paper, SALM-Duplex: Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model, from the NVIDIA team (published on arXiv in June 2025) that caught my attention. The paper delves into the world of full-duplex speech-to-speech language models, specifically addressing the challenge of enabling natural, real-time conversations with AI that support simultaneous listening and speaking, including user interruptions (barge-in). In this blog post, we’ll dissect the paper and provide insights into this exciting development.


Premise: Exploring the Motivation Behind the Paper

Current speech language models are limited to turn-based exchanges, operating in a half-duplex manner where only one party speaks at a time. This creates an unnatural conversation flow that lacks:

  • Real-time adaptability - inability to handle user barge-in/interruptions
  • Natural conversation dynamics - lack of simultaneous listening and speaking
  • Low latency response - delays in turn-taking

Existing full-duplex speech-to-speech (S2S) models face significant challenges:

  1. Increased complexity - requiring additional submodules for turn-taking management
  2. Heavy resource requirements - extensive speech-text pretraining on top of LLM backbones
  3. Delicate balance - simultaneous codec modeling for both speech perception and generation
  4. Limited accessibility - no openly available duplex S2S models with training/inference code

The motivation is to create a simpler, more efficient, and accessible full-duplex S2S model that can handle natural conversational behaviors without the overhead of previous approaches.


Research Questions and Hypotheses

Primary Research Questions:

  1. Can we build a duplex S2S model without requiring speech-text pretraining?
    • Hypothesis: Using a pretrained streaming encoder for user input eliminates the need for speech pretraining
  2. How can we achieve better agent voice quality while reducing computational requirements?
    • Hypothesis: Separate architectures for agent and user modeling enable codec fine-tuning for personalized agent voices
  3. Can we enable robust barge-in and turn-taking capabilities through data augmentation alone?
    • Hypothesis: Synthetic duplex data with carefully designed silence patterns and multi-turn structures can teach the model conversational behaviors
  4. Is it possible to reduce the bitrate requirements while maintaining or improving quality?
    • Hypothesis: Codec personalization can achieve better quality at 0.6 kbps compared to previous 1.1+ kbps approaches

Proposed Method

Architecture Overview

The SALM-Duplex model consists of three main components:

1. Streaming Speech Encoder (User Input)

  • 100M parameter CTC-based streaming encoder
  • Processes user speech at 80ms frame rate
  • Generates continuous embeddings without requiring speech pretraining
  • Enables causal, streaming processing

2. Speech Codec (Agent Output)

  • NanoCodec with Finite Scalar Quantization (FSQ)
  • 4 independent parallel codebooks at 12.5 Hz frame rate
  • Operates at an ultra-low 0.6 kbps bitrate (roughly half the 1.1 kbps of previous approaches)
  • Personalization-friendly design - can be fine-tuned for specific agent voices
  • Enables parallel prediction of all codebooks with minimal latency
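
The 0.6 kbps figure can be sanity-checked from the codec parameters alone. A quick back-of-the-envelope calculation, assuming each codebook index costs ceil(log2(vocab)) bits:

```python
import math

# NanoCodec parameters as reported in the paper
num_codebooks = 4      # independent parallel FSQ codebooks
frame_rate_hz = 12.5   # codec frames per second (one frame = 80 ms)
vocab_size = 4037      # entries per codebook

bits_per_code = math.ceil(math.log2(vocab_size))   # 12 bits per codebook index
bitrate_bps = num_codebooks * frame_rate_hz * bits_per_code
print(bits_per_code, bitrate_bps)  # 12 600.0
```

4 codebooks × 12.5 frames/s × 12 bits = 600 bps, which matches the reported 0.6 kbps.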

3. Decoder-only LLM Backbone

  • Initialized from TinyLlama-1.1B-chat
  • Extended vocabulary to include speech codec tokens
  • Modality adapter between speech encoder and text LLM
  • Processes time-aligned user and agent streams simultaneously

Key Design Innovations

Dual-Stream Architecture:

  • User stream: Continuous embeddings from pretrained encoder
  • Agent stream: Text + 4 parallel audio codec channels
  • Both streams are time-aligned and summed as input to LLM
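
The stream fusion above can be sketched in a few lines. This is an illustrative toy (tiny dimensions, random values, and the exact fusion of the four codec channels is our assumption), not the paper's code:

```python
import random

# Toy dimensions for illustration; a real LLM uses ~2048-dim embeddings
T, d_model, num_codebooks = 5, 8, 4
rng = random.Random(0)

def stream(t, d):
    """A random [t x d] sequence standing in for real embeddings."""
    return [[rng.random() for _ in range(d)] for _ in range(t)]

user_emb = stream(T, d_model)                          # streaming encoder + adapter
agent_text_emb = stream(T, d_model)                    # agent text embeddings
agent_codec_embs = [stream(T, d_model) for _ in range(num_codebooks)]

# Time-aligned channels are summed element-wise into one LLM input sequence
llm_input = [
    [user_emb[t][i] + agent_text_emb[t][i] + sum(c[t][i] for c in agent_codec_embs)
     for i in range(d_model)]
    for t in range(T)
]
print(len(llm_input), len(llm_input[0]))  # 5 8
```

The key point is that fusion by summation keeps the LLM input length equal to the number of time-aligned frames, rather than concatenating channels into a longer sequence.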

Channel Fusion:

  • Text and speech tokens aligned at turn level (not word level)
  • Small delay (one token) introduced to speech channels to better condition on text
  • Separate <BOS> and <EOS> tokens for text and speech
  • Padding between text and speech tokens using text pad ID
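
A hedged sketch of how one fused turn might be laid out (the token names, the empty-slot marker, and the exact padding scheme are illustrative assumptions, not the paper's vocabulary):

```python
# Illustrative special tokens; the paper uses separate <BOS>/<EOS> per modality
TEXT_BOS, TEXT_EOS, TEXT_PAD = "<bos_t>", "<eos_t>", "<pad_t>"
SPEECH_BOS, SPEECH_EOS, SPEECH_EMPTY = "<bos_s>", "<eos_s>", "<empty_s>"

text_tokens = ["hello", "there"]                 # agent text for one turn
speech_codes = ["s0", "s1", "s2", "s3", "s4"]    # codec frames for the same turn

text_channel = [TEXT_BOS] + text_tokens + [TEXT_EOS]
# One-token delay so speech generation can condition on the emitted text
speech_channel = [SPEECH_EMPTY, SPEECH_BOS] + speech_codes + [SPEECH_EOS]

# Pad the shorter (text) channel to the common length with the text pad ID
length = max(len(text_channel), len(speech_channel))
text_channel += [TEXT_PAD] * (length - len(text_channel))

for t, s in zip(text_channel, speech_channel):
    print(f"{t:10s} | {s}")
```

Running this prints the two channels side by side, making the one-token offset between text and speech visible.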

Multi-Channel Next Token Prediction:

  • Simultaneous prediction of text and speech
  • Text and speech loss weighted differently (3:1 ratio)
  • Agent learns to generate text first, then speech follows with slight delay
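
The 3:1 weighting can be written down directly. A minimal sketch (the loss values are made up, and whether the four codebook losses are averaged or summed is our assumption):

```python
TEXT_WEIGHT, SPEECH_WEIGHT = 3.0, 1.0  # 3:1 text:speech ratio from the paper

def combined_loss(text_loss, speech_losses):
    """Weighted sum of the text loss and the mean per-codebook speech loss."""
    speech_loss = sum(speech_losses) / len(speech_losses)
    return TEXT_WEIGHT * text_loss + SPEECH_WEIGHT * speech_loss

# Made-up cross-entropy values for one training step (4 codec channels)
total = combined_loss(text_loss=2.0, speech_losses=[1.0, 1.5, 0.5, 1.0])
print(total)  # 3*2.0 + 1*1.0 = 7.0
```

The heavier text weight reflects the design intent: the model commits to what it will say in text first, and the speech channels follow.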

Experiment Details

Training Setup

Hardware & Scale:

  • 32 A100 (80GB) GPUs
  • Batch duration: 1000 seconds per GPU
  • Implemented in PyTorch using NeMo Toolkit

Optimization:

  • FusedAdam optimizer
  • Inverse Square Root Annealing LR schedule
  • Initial learning rate: 3e-4 with 2500 step warmup
  • Gradient clipping at threshold 1.0
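
These settings can be sketched as a schedule function. The decay is standard inverse-square-root annealing; the linear warmup shape is our assumption:

```python
PEAK_LR = 3e-4       # initial learning rate from the paper
WARMUP_STEPS = 2500  # warmup steps from the paper

def lr_at(step):
    """Inverse square root annealing with (assumed) linear warmup."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS           # linear ramp to peak
    return PEAK_LR * (WARMUP_STEPS / step) ** 0.5      # inverse sqrt decay

print(lr_at(1250), lr_at(2500), lr_at(10000))  # 0.00015 0.0003 0.00015
```

Note that the LR returns to half its peak at 4x the warmup horizon (10,000 steps), since sqrt(2500/10000) = 0.5.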

Model Initialization:

  • Speech encoder: 100M streaming pretrained encoder (80ms right context)
  • LLM: TinyLlama-1.1B-chat
  • Codec: Personalized 0.6 kbps NanoCodec with 4 channels (vocab size 4,037 each)
  • Text tokenizer: 32k SentencePiece

Training Data (26.7k hours total)

1. Single-turn Spoken QA (20.4k hours):

  • ASR-QA (20k hours): Multi-speaker TTS synthesis from ASR-labeled data
  • MS MARCO (0.2k hours): TTS-synthesized questions/answers
  • Alpaca (0.2k hours): TTS-synthesized instruction following

2. Multi-turn Conversations (6.3k hours):

  • Internal SFT (3k hours): Real multi-turn spoken QA
  • UltraChat (3k hours): 4-turn synthetic conversations
  • Topic (0.3k hours): 4-turn conversations on 63 topics

Duplex Data Format:

  • Two separate streams (user and agent)
  • 0.64s silence between turns
  • Barge-in augmentation with 0.64s agent speech retained after cutoff
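
The barge-in augmentation can be sketched at the frame level. This is our reading of the data format (the frame granularity and function name are assumptions):

```python
FRAME_S = 0.08   # one frame = 80 ms (matches the 12.5 Hz codec frame rate)
TAIL_S = 0.64    # agent speech retained after the user interrupts
TAIL_FRAMES = round(TAIL_S / FRAME_S)  # 8 frames

def apply_barge_in(agent_frames, barge_in_frame):
    """Cut the agent stream shortly after the user starts speaking."""
    return agent_frames[:barge_in_frame + TAIL_FRAMES]

agent = list(range(100))  # 100 frames = 8 s of agent speech
truncated = apply_barge_in(agent, barge_in_frame=25)  # user interrupts at 2 s
print(len(truncated))  # 25 + 8 = 33 frames survive
```

Keeping a short tail after the cutoff teaches the model that the agent stops speaking shortly after, not instantly when, the user barges in.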

Key Findings

1. Superior Turn-taking and Barge-in Performance

Compared to Moshi:

  • 27-point higher barge-in success rate (83.0% vs 56.0% on UltraChat)
  • 0.11-0.12s lower latency for barge-in response
  • 0.4-0.5 point improvement in speech quality (UTMOS)
  • Zero false alarms

2. Competitive Reasoning Quality

Outperforms Moshi on all datasets despite using:

  • Much less training data
  • A smaller backbone (1.1B vs 7B parameters)

It is also significantly stronger on reasoning tasks (ASR-QA: 7.8 vs 1.9).

3. Codec Personalization Breakthrough

Personalized codec at 0.6 kbps outperforms 1.1-1.2 kbps codecs across all metrics:

  • Nearly 50% bitrate reduction
  • Trained on only 21k utterances from target speaker
  • Improves both audio reconstruction and end-to-end S2S quality

4. First Open-Source Duplex S2S Model

  • No speech pretraining required
  • Publicly available training and inference code
  • Simplified pipeline - can build from any LLM

5. Robust Behavior in Challenging Scenarios

  • Handles frequent interruptions (3 barge-ins in 15 seconds)
  • Generalizes to unseen reasoning problems
  • Maintains coherence across multi-turn conversations

Limitations

  1. Reasoning Gap with Cascaded Systems - Still lags behind optimal cascaded approaches on some datasets
  2. First Response Latency Artifacts - 0.64s silence creates artificial delay pattern
  3. Model Scale Limitations - Small 1.1B backbone limits capabilities
  4. Synthetic Data Dependence - Heavy reliance on TTS-synthesized data
  5. Evaluation Methodology Constraints - GPT-based scoring may have biases
  6. Limited Comparison Scope - Only compared against Moshi
  7. Personalization Trade-offs - Requires 21k utterances per speaker
  8. Turn-level Alignment - May limit fine-grained timing control
  9. Missing Real-world Testing - No real user interaction evaluation
  10. Computational Requirements Not Fully Disclosed - Inference costs unclear

Conclusion

The SALM-Duplex paper presents a significant advancement in full-duplex speech-to-speech modeling by demonstrating that high-quality conversational AI is achievable without speech pretraining, with reduced bitrate, and with simpler architecture. The open-sourcing of code addresses a critical gap in reproducibility for duplex S2S research.