
Fix Spark-TTS model hanging issue and add further optimization #7

Merged
huwenjie333 merged 2 commits into deploy from fix-tts on Feb 27, 2026

Conversation


@huwenjie333 huwenjie333 commented Feb 25, 2026

This PR fixes the hanging issue of the Spark-TTS model on the Modal platform with the following changes:

  • Replaced vLLM's LLM class with AsyncLLMEngine for thread safety. The hanging occurred because the default LLM class is not thread-safe: when multiple requests arrive concurrently, they can trigger race conditions and deadlock. AsyncLLMEngine, which is also what the vllm serve CLI command uses, is safe for concurrent use.
  • Updated the @modal.fastapi_endpoint(...) implementation of generate to async def generate(...).
  • Updated the prompt generation step to interact with AsyncLLMEngine.generate asynchronously, gathering multiple results concurrently with asyncio.gather.
  • Offloaded the audio detokenization step via asyncio.to_thread so that blocking PyTorch GPU operations do not stall Modal's Python event loop for prolonged durations.
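The combined effect of these changes can be sketched with plain asyncio. This is an illustrative stand-in, not the PR's actual code: fake_generate substitutes for AsyncLLMEngine.generate, and detokenize substitutes for the blocking PyTorch detokenization step.

```python
import asyncio
import time

# Stand-in for AsyncLLMEngine.generate: awaitable per-prompt generation.
async def fake_generate(prompt: str) -> str:
    await asyncio.sleep(0.05)  # simulated token generation
    return f"tokens({prompt})"

# Stand-in for the blocking, GPU-bound detokenization step.
def detokenize(tokens: str) -> bytes:
    time.sleep(0.05)  # simulated blocking work
    return tokens.encode()

async def generate(sentences: list[str]) -> bytes:
    # Generate tokens for all sentences concurrently instead of sequentially.
    token_batches = await asyncio.gather(*(fake_generate(s) for s in sentences))
    # Offload blocking detokenization to threads so the event loop stays free.
    chunks = await asyncio.gather(
        *(asyncio.to_thread(detokenize, t) for t in token_batches)
    )
    return b"".join(chunks)

audio = asyncio.run(generate(["hello", "world"]))
print(audio)  # b'tokens(hello)tokens(world)'
```

Because asyncio.gather preserves input order, the audio chunks are concatenated in sentence order even though they are produced concurrently.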

In addition, further optimizations were implemented:

  • Increased the maximum number of concurrent requests per container from 10 to 100, since the GPU has sufficient memory (24 GB on an NVIDIA L4) to hold the 0.5B model (~1 GB) and still reserve a large KV cache. This avoids waiting out the cold-start time for a new container.
  • Returned a FastAPI Response instead of a StreamingResponse, so that Uvicorn's internal thread pool is not overloaded by chunking the audio into many small byte writes. This reduced total latency from 100+ seconds to a few seconds when a large number of requests arrive concurrently.
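Under these assumptions, the two optimizations might look roughly like the following in the Modal app. The class skeleton and wav_bytes are illustrative; the decorator names follow Modal's public API, but this is a sketch, not the PR's actual diff.

```python
import modal
from fastapi import Response

app = modal.App("spark-tts")

@app.cls(gpu="L4")
@modal.concurrent(max_inputs=100)  # was 10: one warm container absorbs bursts
class SparkTTS:
    @modal.fastapi_endpoint(method="POST")
    async def generate(self, text: str, speaker_id: str) -> Response:
        wav_bytes = ...  # inference + detokenization as described above
        # A single Response body means one write back to the client,
        # with no per-chunk round trips through Uvicorn's thread pool.
        return Response(content=wav_bytes, media_type="audio/wav")
```

The trade-off is that the client receives no audio until generation finishes; for clips of a few seconds this is a net win over streaming overhead.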

A stress test was run by sending 80 requests at 0.3-second intervals, where each request contains 3 sentences (~40 words). All requests were processed successfully in ~4 seconds on a single GPU container.

for i in {1..80}; do
  curl -X POST --get "https://sb-modal-ws--spark-tts-salt-sparktts-generate.modal.run" \
    --data-urlencode "text=I am a nurse who takes care of many people who have cancer.I am a nurse who takes care of many people who have cancer.I am a nurse who takes care of many people who have fever." \
    --data-urlencode "speaker_id=248" \
    --output "output_$i.wav" &
  sleep 0.3
done; wait
(screenshot of the stress-test output omitted)

@huwenjie333 changed the title from "fixed" to "Fix Spark-TTS hanging issues and add future optimization" on Feb 25, 2026
@huwenjie333 changed the title from "Fix Spark-TTS hanging issues and add future optimization" to "Fix Spark-TTS model hanging issue and add further optimization" on Feb 25, 2026

@PatrickCmd PatrickCmd left a comment


LGTM


jqug commented Feb 26, 2026

Thanks for this! LGTM

For the future, we might want to experiment with different settings to see how they affect latency and throughput. Since latency is particularly important, we could check whether increasing max_num_seqs makes the TTFT slower. I noticed this in a vLLM optimisation guide:
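For reference, max_num_seqs is a vLLM engine argument, so the experiment would be a small configuration change. The model name and value below are illustrative assumptions, not settings from this PR.

```python
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# max_num_seqs caps how many sequences the scheduler batches together:
# larger values raise throughput but can lengthen TTFT under load.
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(
        model="SparkAudio/Spark-TTS-0.5B",  # illustrative model id
        max_num_seqs=64,                    # value to sweep in the experiment
    )
)
```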

(screenshot from the vLLM optimisation guide omitted)

@huwenjie333 huwenjie333 merged commit 22e48e8 into deploy Feb 27, 2026
1 check failed


3 participants