Review: Streaming SpeechLLM Back-end Architecture for ASR & Translation
A breakdown of the architectural trade-offs of each approach to serving a streaming SpeechLLM (ASR + translation):
Architecture Comparison
1. Triton Only
Audio → Triton Ensemble:
[Preprocessor] → [Encoder model] → [LLM (BLS or TRT-LLM backend)] → Tokens
Strengths:
- Mature, production-grade, single system to operate
- Ensemble pipelines chain stages cleanly
- gRPC streaming support out of the box
- TRT-LLM backend gives excellent throughput on NVIDIA hardware
Weaknesses:
- LLM serving efficiency trails vLLM unless the TRT-LLM backend is used; the stock backends have no PagedAttention-style dynamic KV management
- Encoder → LLM data transfer is in-process but less flexible for inter-GPU RDMA
- Scaling encoder and LLM independently is hard (tied to ensemble topology)
- State management for chunked streaming audio across requests is awkward in BLS
2. Triton + vLLM
Audio → Triton (encoder/preprocessor) --HTTP/gRPC--> vLLM (LLM decoder) → Tokens
Strengths:
- Best-in-class LLM decoder (PagedAttention, continuous batching)
- Clean separation: Triton handles audio, vLLM handles generation
Weaknesses:
- Serialization overhead: Embeddings cross process boundaries over gRPC/HTTP — not GPU-to-GPU direct
- No unified scheduler — Triton and vLLM have no awareness of each other’s load
- Chunked streaming audio requires explicit state management between the two systems
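A rough estimate of that serialization cost, with assumed (not measured) numbers: 100 feature frames/s, 8x encoder downsampling, hidden size 4096, fp16 embeddings:

```python
# Back-of-envelope: size of the embedding payload that must cross the
# Triton -> vLLM process boundary per utterance. All numbers are
# illustrative assumptions, not measurements of any specific model.
def embedding_payload_bytes(seconds: float, frames_per_sec: int = 100,
                            downsample: int = 8, hidden: int = 4096,
                            bytes_per_elem: int = 2) -> int:
    frames = int(seconds * frames_per_sec) // downsample
    return frames * hidden * bytes_per_elem

mb = embedding_payload_bytes(30.0) / 1e6
# ~3 MB per 30 s utterance: trivial over NVLink/RDMA, but a real per-request
# cost once it is copied to CPU, serialized to protobuf, and sent over gRPC.
```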
3. vLLM + Dynamo (current approach)
HTTP → Frontend → Processor
→ ASR Encode Worker (GPU 0) --NIXL RDMA--> PD Worker/vLLM (GPU 1) → Tokens
Strengths:
- NIXL RDMA: GPU-to-GPU embedding transfer with near-zero latency (no serialization, no CPU copy)
- Independent scaling: Encoder and LLM workers can be on separate GPUs, scaled independently
- E/PD (encode + prefill/decode) disaggregation suits ASR well: the audio input is encoded once per utterance, so there is no iterative-prefill benefit to justify a separate prefill stage, and E/PD avoids the extra round-trips of full P/D separation
- Streaming-first design: explicitly designed for chunked audio in future iterations
- KV-aware routing: useful for multi-turn sessions (translation w/ context)
- Unified observability, dynamic GPU scheduling via planner
Weaknesses:
- Most complex operationally (etcd/NATS/ZMQ, service discovery)
- Still pre-release (1.0.0 not on PyPI yet)
- Overkill for a single-GPU deployment
Streaming-Specific Considerations
For real-time streaming ASR (meeting transcription, live translation), the critical path is:
Chunk 1 → Encode → Transfer → Decode partial tokens → stream output
Chunk 2 → Encode → Transfer → Append to KV cache → continue stream
...
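The loop above can be sketched as a generator that appends each chunk's features to a growing KV cache and emits a partial token before the next chunk arrives (pure-Python simulation; `encode` and `decode_step` are stubs standing in for the encoder worker and LLM):

```python
# Sketch of the streaming critical path: each audio chunk is encoded, its
# features appended to the decoder's KV cache, and a partial token streamed
# immediately -- never re-prefilling from scratch between chunks.
def encode(chunk: str) -> list:
    return [f"feat({chunk})"]          # stub encoder output

def decode_step(kv_cache: list) -> str:
    return f"tok{len(kv_cache)}"       # stub decode over the full cache

def stream_transcribe(chunks):
    kv_cache: list = []                # grows monotonically across chunks
    for chunk in chunks:
        kv_cache.extend(encode(chunk)) # incremental KV append
        yield decode_step(kv_cache)    # partial output before the next chunk

tokens = list(stream_transcribe(["c1", "c2", "c3"]))
```

The "Incremental KV updates" row in the table below is exactly the question of whether the serving stack supports this append-and-continue pattern natively.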
| Property | Triton Only | Triton+vLLM | vLLM+Dynamo |
|---|---|---|---|
| Encoder→LLM transfer | In-process | gRPC (CPU) | NIXL RDMA (GPU-GPU) |
| Incremental KV updates | Difficult | Not native | Native (planned) |
| Encoder/LLM GPU isolation | No | Partial | Yes |
| First-token latency | Low–Medium | Medium | Low |
| Multi-stream throughput | Medium | Medium–High | High |
| Operational simplicity | High | Medium | Low |
Recommendation
| Use Case | Best Fit |
|---|---|
| Single GPU, simple deployment | Triton only (with TRT-LLM backend) |
| Max LLM throughput, existing Triton infra | Triton + vLLM |
| Multi-GPU, high concurrency, streaming-first | vLLM + Dynamo (current design) |
The current E/PD Dynamo design is the right long-term architecture for a streaming SpeechLLM at scale — especially once chunked audio streaming is wired in (the encoder worker already accepts incremental requests). The main practical pain point right now is the operational complexity of the stack and the fact that 1.0.0 isn’t packaged yet.
The key architectural gap in the current implementation that would unlock true streaming is VAD (Voice Activity Detection) at the frontend to segment audio into chunks, feeding them incrementally to the encode worker rather than waiting for the full utterance.
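As a sketch of what that frontend segmentation involves, here is a simplified energy-threshold VAD (a stand-in for a real model such as silero-vad; the frame length, threshold, and minimum-duration values are arbitrary):

```python
import math

def energy_vad_segments(samples, frame_len=160, threshold=0.01, min_frames=2):
    """Split audio into speech segments by per-frame RMS energy (a toy
    stand-in for a real VAD model). Returns (start, end) sample indices."""
    segments, start = [], None
    n_frames = len(samples) // frame_len
    for i in range(n_frames + 1):  # extra iteration closes a trailing segment
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = math.sqrt(sum(s * s for s in frame) / frame_len) if frame else 0.0
        if energy >= threshold and start is None:
            start = i * frame_len                      # speech onset
        elif energy < threshold and start is not None:
            if (i * frame_len - start) >= min_frames * frame_len:
                segments.append((start, i * frame_len))  # keep long-enough runs
            start = None
    return segments
```

Each emitted segment would be forwarded to the encode worker as an incremental request instead of buffering the full utterance.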
Additional Candidates
4. NVIDIA RIVA
Mic/Audio → RIVA ASR Service (streaming gRPC) → Text → RIVA NMT → Translation
- Purpose-built for streaming ASR + translation pipelines
- Chunk-based streaming via gRPC bidirectional stream (true word-by-word output)
- TensorRT-optimized Conformer/Citrinet/Whisper models
- Handles VAD, diarization, punctuation natively
- Limitation: not LLM-based — traditional encoder-decoder, not autoregressive generative model
- Best for: production deployments where accuracy of established models is sufficient and you don’t need an LLM backbone
5. SGLang + Dynamo (or standalone)
Audio → Encode Worker → SGLang (prefill+decode) with RadixAttention
- Available as a Dynamo backend (components/src/dynamo/sglang/)
- RadixAttention: automatic prefix KV cache sharing, a strong fit for translation where the system prompt is long and repeated
- Generally comparable throughput to vLLM, sometimes better for high-concurrency with shared prefixes
- More mature multimodal support in recent releases (0.4+)
- Streaming-first design with native SSE
- Best for: translation with long shared context/system prompts, or when you want an alternative to vLLM’s PagedAttention
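A toy model of why RadixAttention pays off here: requests that share a long system prompt only prefill the unshared suffix. The token-level trie below illustrates the idea; it is not SGLang's implementation:

```python
# Toy radix/prefix cache: counts how many tokens of each request actually
# need prefill, given what earlier requests already left in the cache.
class RadixCache:
    def __init__(self):
        self.root = {}  # token -> child node (a nested dict acting as a trie)

    def prefill_cost(self, tokens) -> int:
        """Walk/insert the token path; return how many tokens were NOT cached."""
        node, new = self.root, 0
        for t in tokens:
            if t not in node:
                node[t] = {}
                new += 1       # this token's KV must be computed
            node = node[t]
        return new

cache = RadixCache()
system = ["<sys>"] * 500                    # long shared translation prompt
a = cache.prefill_cost(system + ["hello"])  # cold cache: full prefill
b = cache.prefill_cost(system + ["world"])  # prefix fully shared: 1 new token
```

With a 500-token shared prompt, the second request prefills 1 token instead of 501; at high concurrency this is where the throughput advantage over naive per-request prefill comes from.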
6. TRT-LLM + Triton (without Dynamo)
Audio → Triton Preprocessor → TRT-LLM backend (in-flight batching) → Tokens
- Maximum GPU efficiency: TRT-LLM is typically the fastest LLM inference engine on NVIDIA hardware
- In-flight batching, paged KV cache, speculative decoding all built in
- First-class Triton integration (no glue code needed)
- Limitation: model compilation step is non-trivial, especially for new models like Qwen3-ASR
- Dynamo already supports TRT-LLM as a backend if you want RDMA + disaggregation later
- Best for: known stable models in production where you want peak throughput
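To illustrate what in-flight (continuous) batching buys over static batching, here is a toy step-count simulation (request lengths and batch size are arbitrary; this is not TRT-LLM's scheduler):

```python
# Toy simulation: in-flight batching lets finished sequences leave and waiting
# ones join every decode step, instead of the whole batch draining together.
from collections import deque

def continuous_batching_steps(request_lens, max_batch=4) -> int:
    waiting, active, steps = deque(request_lens), [], 0
    while waiting or active:
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())          # admit new work mid-flight
        steps += 1
        active = [r - 1 for r in active if r > 1]     # one decode step each
    return steps

def static_batching_steps(request_lens, max_batch=4) -> int:
    steps = 0
    for i in range(0, len(request_lens), max_batch):
        steps += max(request_lens[i:i + max_batch])   # batch drains together
    return steps

mixed = [2, 10, 3, 9, 4, 8]   # uneven output lengths, as in real traffic
cont, stat = continuous_batching_steps(mixed), static_batching_steps(mixed)
```

The gap widens as output lengths get more uneven, which is the common case for ASR plus free-form translation.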
7. Faster-Whisper / CTranslate2
Audio → Faster-Whisper (encoder+decoder, CTranslate2 backend) → Tokens streamed
- Specifically for Whisper-family models (encoder-decoder, not decoder-only)
- Up to 4× faster than the reference OpenAI Whisper implementation at comparable accuracy, with roughly half the memory use
- True streaming: yields tokens as they’re generated
- Supports batching, beam search, and VAD integration (silero-vad)
- Limitation: Whisper-architecture models only; Qwen3-ASR is a decoder-only LLM with an audio encoder
- Best for: pure ASR without translation, or when the model is Whisper-based and you want simplicity
8. Ray Serve (composition layer)
Audio → Ray Serve DAG:
[VAD actor] → [Encoder replica pool] → [vLLM replica pool] → Tokens
- Ray Serve handles inter-service communication, autoscaling, and load balancing
- Can wrap any backend (vLLM, TRT-LLM, Faster-Whisper) as Ray actors
- Built-in replica autoscaling based on request queue depth
- Limitation: no RDMA (embeddings go through Ray object store, not GPU-to-GPU), adds latency vs Dynamo’s NIXL
- Best for: teams already on Ray ecosystem, multi-model pipelines where flexibility > raw latency
9. Edge-VAD + Cloud LLM (Hybrid Architecture)
Client (WebRTC/WebSocket):
[Mic] → [VAD + feature extraction] → compressed features → Cloud
Cloud:
Dynamo/vLLM worker pool (receive features, run LLM decoder only)
- Offload VAD and Fbank feature extraction to edge (runs on CPU/mobile efficiently)
- Only send compressed acoustic features over network (much smaller than raw audio)
- Cloud only runs the expensive LLM decoder portion
- Eliminates encoder GPU entirely from the cloud side
- Best for: mobile/meeting apps, when network bandwidth is a concern
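A back-of-envelope check on the bandwidth claim, under assumed numbers (16-bit 16 kHz PCM vs 80-mel features at 100 frames/s quantized to int8; a real deployment might instead send codec-compressed audio, so treat this as illustrative):

```python
# Rough bandwidth comparison for the hybrid design: raw PCM upload vs
# edge-computed log-mel features. All parameters are assumptions.
def pcm_bytes_per_sec(sample_rate=16000, sample_bytes=2) -> int:
    return sample_rate * sample_bytes             # 16-bit 16 kHz PCM

def fbank_bytes_per_sec(n_mels=80, frames_per_sec=100, elem_bytes=1) -> int:
    return n_mels * frames_per_sec * elem_bytes   # int8-quantized features

ratio = pcm_bytes_per_sec() / fbank_bytes_per_sec()  # features are smaller on the wire
```

With these assumptions the feature stream is about 4× smaller than raw PCM; the bigger win is that the cloud side no longer runs the encoder at all.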
Full Summary Matrix
| Architecture | Latency | Throughput | Streaming | Complexity | LLM Quality |
|---|---|---|---|---|---|
| RIVA | Very Low | High | Native | Low | Traditional ASR |
| SGLang + Dynamo | Low | High | Good | High | Full LLM |
| TRT-LLM + Triton | Low | Highest | Good | Medium | Full LLM |
| Faster-Whisper | Low | Medium | Native | Low | Whisper only |
| Ray Serve | Medium | Medium | Via actors | Medium | Any backend |
| Edge-VAD + Cloud | Lowest* | High | Native | High | Full LLM |
| vLLM + Dynamo (current) | Low | High | Planned | High | Full LLM |
Recommendation for Qwen3-ASR
Given the use of Qwen3-ASR (decoder-only LLM with audio encoder):
- Short term / simplest: TRT-LLM + Triton. Compile the model once for maximum efficiency.
- Best long-term for streaming: vLLM + Dynamo (current work). Once chunked audio streaming is implemented, NIXL RDMA makes the encoder→decoder path nearly free.
- If translation context matters: SGLang + Dynamo. RadixAttention prefix sharing pays dividends when you have long shared system prompts for translation tasks.
- If you need truly real-time (sub-300 ms first token for live speech): the Edge-VAD hybrid, which removes the encoder from the hot path entirely.