qwen3_asr_triton vs speechLLM (vLLM): Comparison


Functional Equivalence

Both implementations are functionally equivalent at the protocol level:

Aspectqwen3_asr_tritonspeechLLM (vLLM)
WebSocket protocol✅ start / binary PCM / end✅ start / binary PCM / end
Token rollback✅ re-infer last 2 chunks✅ re-infer last 2 chunks
Text stability✅ LCP-style committed/partial✅ LCP committed/partial
Context reset✅ 300s + inject last_committed✅ 300s + inject last_committed
Inference backendHuggingFace Transformers (direct)vLLM (scheduler-managed)
Serving layerTriton Python BackendFastAPI + Uvicorn

KV Cache Behavior

This is a critical architectural difference.

LLM Decoder KV Cache

Neither implementation maintains a KV cache across chunks.

Both call fresh generation per chunk:

Chunk N arrives:
  1. Re-encode ALL accumulated audio (0..N) → embeddings
  2. Full decoder prefill from scratch (no past_key_values)
  3. Generate max_new_tokens new tokens
  4. Rollback last unfixed_chunk_num chunks' tokens

The decoder always restarts from an empty KV cache on every chunk. This is by design because the transcript grows and the previous decode state is stale — the audio embeddings change (Phase 1) or the prefix changes (token rollback), so cached KV states would be invalid anyway.

Audio Encoder KV Cache (Encoder Output Caching)

PhaseTritonvLLM (speechLLM)Description
Phase 1 (current)✅ implemented✅ implementedFull accumulated audio re-encoded every chunk — O(N²)
Phase 2 (incremental)🚧 test_phase2.py (planned)❌ not plannedCache per-chunk encoder outputs, reuse → O(N)

The Triton implementation has a planned Phase 2 that would cache encoder outputs per-chunk in session.encoder_output_cache and pass inputs_embeds to skip audio_tower on subsequent chunks. This exploits Qwen3-ASR’s windowed attention property (100-frame training / 400-frame inference windows — past chunk encoder outputs don’t change when new chunks arrive).

The vLLM implementation has no equivalent because vLLM’s model execution path doesn’t expose a hook to inject pre-computed encoder embeddings at the per-chunk level without custom model registration.


Benchmark Results

Single Session, 640ms chunks

Audio DurationMetricTransformers/TritonvLLM (speechLLM)Winner
10sTTFT~0.18s~0.12svLLM (marginal)
10sRTF~0.31x~0.28xvLLM (marginal)
20sTTFT~0.21s~0.40sTransformers 1.9x
20sRTF~0.36x~0.41xTransformers
60sTTFT0.224s0.561sTransformers 2.5x
60sTotal24.84s31.68sTransformers 1.28x
60sRTF0.414x0.528xTransformers

RTF (Real-Time Factor) = total_inference_time / audio_duration. Lower is better; <1.0 means faster than real-time.

Concurrency

SessionsTransformers/TritonvLLM (speechLLM)
c=1Works, RTF ~0.41xWorks, RTF ~0.53x
c=4Works, RTF ~1.1xWorks, RTF ~1.3x
c=8Works, RTF ~2.0xFails (scheduler overwhelmed)

Why Transformers Wins at Scale

Both use Phase 1 (O(N²) re-encoding) — each new chunk re-encodes all audio accumulated so far. But the overhead sources differ:

  • Transformers: ~300ms constant overhead per chunk (raw forward pass, no scheduler)
  • vLLM: chunk inference time grows linearly with session length due to PagedAttention scheduler overhead and KV cache management for the growing audio context

vLLM’s scheduler is optimized for batched LLM decode throughput, not for high-frequency small-batch audio encoder workloads. At 640ms chunk intervals, the scheduler overhead dominates.

Per-chunk latency growth (approximate):
  Transformers: latency ≈ C₀ + α × audio_length     (α ≈ small)
  vLLM:         latency ≈ C₀ + β × audio_length     (β > α due to scheduler)

At 60s audio (≈94 chunks of 640ms), this difference compounds to a 1.28x total time advantage.


Summary: Which is Better?

ScenarioWinnerReason
Short audio (<10s)vLLM (marginal)Lower cold-start, vLLM batch warmup helps
Long audio (>20s)Transformers/TritonNo scheduler overhead growth
High concurrency (c≥8)Transformers/TritonvLLM scheduler breaks down
Phase 2 (future)Transformers/TritonIncremental encoder caching already prototyped
LLM decode throughputvLLMPagedAttention for multi-turn batching
Operational simplicityvLLM (speechLLM)Single process, no Triton ensemble
Multi-GPU horizontal scalingDynamo E/PDNIXL RDMA encoder→decoder separation

Bottom line: For streaming ASR specifically, qwen3_asr_triton (Transformers) is faster and more scalable under load. vLLM’s core strengths (PagedAttention, continuous batching) apply to LLM decode token generation — they are largely irrelevant overhead for a repeated small-batch audio encoding workload.


Architecture Diagram

qwen3_asr_triton (Transformers/Triton)
────────────────────────────────────────
Client
  │ WebSocket (PCM16LE + JSON)

FastAPI :8003 (or Triton gRPC :8001)


Per-Session State Machine (SessionState)
  ├── buffer: raw PCM bytes
  ├── audio_accum: float32 np.array (grows with each chunk)
  ├── text: transcript so far
  ├── prompt_raw: fixed system prompt
  └── (NO KV cache stored)

  ▼ _run_inference() — every 640ms chunk

  ├── processor(text=prompt, audio=audio_accum)   ← FULL accumulated audio
  │   └── audio_tower (encoder) — O(N²) total

  └── model.generate(**inputs, max_new_tokens=32)  ← fresh decode, no past_key_values
      └── Qwen3ForCausalLM decoder (full prefill)


speechLLM (vLLM)
────────────────────────────────────────
Client
  │ WebSocket (PCM16LE + JSON)

FastAPI :8004 (Uvicorn)


Per-Session State Machine
  ├── streaming_state: qwen_asr library state
  ├── last_committed: stable prefix
  └── audio_accum_samples: counter

  ▼ streaming_transcribe() — every 320ms chunk (via asyncio.to_thread)

  └── vLLM AsyncLLM scheduler
      ├── audio_tower (encoder) — O(N²) total, + scheduler overhead
      └── Qwen3ForCausalLM decoder — PagedAttention KV (per-request, not cross-chunk)

File References

FileDescription
/home/user/sourcecode/qwen3_asr_triton/triton/model_repository/qwen3_asr_streaming/1/model.pyTriton Python Backend — Phase 1 streaming
/home/user/sourcecode/qwen3_asr_triton/server/qwen3_asr_server.pyFastAPI wrapper over same Transformers model
/home/user/sourcecode/qwen3_asr_triton/server/test_phase2.pyPlanned Phase 2 incremental encoder caching
/home/user/sourcecode/qwen3_asr_triton/TECHNICAL_COMPARISON.mdTriton project’s own benchmark documentation
/home/user/sourcecode/qwen3_asr_triton/server/benchmark_comparison.pyBenchmark runner: Transformers vs vLLM
/home/user/sourcecode/production_asr/speechLLM/qwen_asr_server.pyvLLM-based streaming server (speechLLM)