Fix Spark-TTS model hanging issue and add further optimization #7
Merged
huwenjie333 merged 2 commits into deploy on Feb 27, 2026
Conversation
**jqug** approved these changes on Feb 26, 2026:

> Thanks for this! LGTM. For the future, we might want to experiment with different settings to see how they affect latency and throughput. Since latency is particularly important, we could check if increasing

This PR fixes the hanging issue of the Spark-TTS model on the Modal platform with the following changes:
- Replaced the `LLM` class with `AsyncLLMEngine` for thread safety. The hanging happens because the default `LLM` class from vLLM is not thread-safe: when multiple requests come in, they can cause race conditions and deadlocks. `AsyncLLMEngine` guarantees thread safety and is also the engine used by the `vllm serve` CLI command.
- Changed the `@modal.fastapi_endpoint(...)` implementation of `generate` to `async def generate(...)`, which calls `AsyncLLMEngine.generate` and gathers multiple results fully asynchronously with `asyncio.gather`.
- Used `asyncio.to_thread` so that standard PyTorch GPU operations do not inadvertently block Modal's Python event loop for prolonged durations (see the sketch after this list).

In addition, further optimizations were implemented:
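A minimal sketch of how these pieces could fit together, assuming vLLM's `AsyncLLMEngine` API; the model id and the `tokens_to_wav` helper are placeholders for illustration, not code from this PR:

```python
import asyncio
import uuid

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

# Thread-safe async engine, the same engine used by `vllm serve`.
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="SparkAudio/Spark-TTS-0.5B")  # placeholder model id
)

def tokens_to_wav(text: str) -> bytes:
    # Placeholder for a blocking PyTorch decode step.
    return text.encode()

async def generate_one(prompt: str) -> bytes:
    # AsyncLLMEngine.generate is an async generator; the last item it
    # yields holds the finished output for this request id.
    params = SamplingParams(max_tokens=1024)
    final = None
    async for output in engine.generate(prompt, params, request_id=str(uuid.uuid4())):
        final = output
    text = final.outputs[0].text
    # Run the blocking PyTorch GPU work off the event loop.
    return await asyncio.to_thread(tokens_to_wav, text)

async def generate_batch(prompts: list[str]) -> list[bytes]:
    # Fan out all prompts concurrently and gather the results.
    return await asyncio.gather(*(generate_one(p) for p in prompts))
```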
- Returned `Response` instead of `StreamingResponse`, so that Uvicorn's internal thread pool is not overloaded by chunking the audio into small byte chunks for streaming. This reduced the total latency from 100+ seconds to a few seconds when a large number of requests arrive concurrently (a sketch follows below).

A stress test was done by sending 80 requests consecutively at 0.3-second intervals, where each request contained 3 sentences and ~40 words. All requests were processed successfully in ~4 seconds using a single GPU container (see the client sketch at the end).
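A sketch of the `Response`-instead-of-`StreamingResponse` change, shown with a plain FastAPI app for illustration; `synthesize` is a hypothetical stand-in for the actual Spark-TTS pipeline:

```python
from fastapi import FastAPI, Response

app = FastAPI()

async def synthesize(text: str) -> bytes:
    # Placeholder for the real TTS pipeline returning WAV bytes.
    return b"RIFF"

@app.post("/generate")
async def generate(text: str) -> Response:
    audio = await synthesize(text)
    # Returning the full payload in one Response avoids tying up
    # Uvicorn's thread pool with many small streamed chunks.
    return Response(content=audio, media_type="audio/wav")
```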
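And a sketch of a stress-test client like the one described above; the endpoint URL and request text are placeholders:

```python
import asyncio
import httpx

async def stress_test() -> None:
    url = "https://example.modal.run/generate"  # placeholder endpoint
    text = "First sentence here. Second sentence here. Third sentence here."
    async with httpx.AsyncClient(timeout=120) as client:
        tasks = []
        for _ in range(80):  # 80 requests, one every 0.3 s
            tasks.append(asyncio.create_task(client.post(url, params={"text": text})))
            await asyncio.sleep(0.3)
        responses = await asyncio.gather(*tasks)
        ok = sum(r.status_code == 200 for r in responses)
        print(f"{ok}/80 requests succeeded")

asyncio.run(stress_test())
```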