
Spark-tts-salt deployment #2

Merged
PatrickCmd merged 8 commits into deploy from spark-tts-salt on Jan 7, 2026


Conversation


@huwenjie333 (Collaborator) commented Dec 17, 2025

This PR deploys the Sunbird/spark-tts-salt model to the Modal platform.

Usage Example:

 curl -X POST --get "https://sb-modal-ws--spark-tts-salt-sparktts-generate.modal.run" \
   --data-urlencode "text=I am a nurse who takes care of many people who have cancer." \
   --data-urlencode "speaker_id=248" \
   --output output.wav

Speaker ID:
241: Acholi (female)
242: Ateso (female)
243: Runyankore (female)
245: Lugbara (female)
246: Swahili (male)
248: Luganda (female)
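For reference, the same request can be issued from Python using only the standard library. This is a sketch assuming the endpoint behaves exactly as in the curl example above (the URL and the `text`/`speaker_id` parameter names are taken from it; like `curl --get -X POST`, parameters are sent in the query string with a forced POST method):

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://sb-modal-ws--spark-tts-salt-sparktts-generate.modal.run"

def build_request(text: str, speaker_id: int) -> Request:
    # Mirror `curl --get --data-urlencode`: parameters go in the query
    # string, but the HTTP method is forced to POST.
    query = urlencode({"text": text, "speaker_id": speaker_id})
    return Request(f"{BASE_URL}?{query}", method="POST")

def synthesize(text: str, speaker_id: int, out_path: str = "output.wav") -> None:
    # Network call; requires the Modal endpoint to be deployed and reachable.
    with urlopen(build_request(text, speaker_id)) as resp:
        with open(out_path, "wb") as f:
            f.write(resp.read())
```

For example, `synthesize("I am a nurse who takes care of many people who have cancer.", 248)` would write `output.wav`, matching the curl invocation above.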

This PR also includes additional code for testing the latency of the Sunflower Ultravox model.

@huwenjie333 changed the title from [WIP] Spark-tts-salt deployment to Spark-tts-salt deployment on Jan 5, 2026
@huwenjie333 requested review from PatrickCmd and jqug on Jan 5, 2026
@jqug

jqug commented Jan 5, 2026

Thanks @huwenjie333 this is great. You mention that generating a sentence takes 3-4 seconds, is it possible to find out which part of the processing is taking this time? It should be a few hundred milliseconds, max 1 second I think. At least that's what I was getting when testing on an RTX4090 GPU.


@huwenjie333 (Collaborator, Author) commented Jan 5, 2026

> Thanks @huwenjie333 this is great. You mention that generating a sentence takes 3-4 seconds, is it possible to find out which part of the processing is taking this time? It should be a few hundred milliseconds, max 1 second I think. At least that's what I was getting when testing on an RTX4090 GPU.

I can get ~1s when enforce_eager=False and cuda graph is built during the model initialization. However, it increases the cold startup time from 1 min to 1.5-2 mins.

Text chunking time: 0.00s 
Prompt preparation time: 0.00s
Model generation time: 1.05s
Audio decoding time: 0.04s 
Audio saving time: 0.00s 
Total generation time: 1.10s

When enforce_eager=True, the inference becomes ~3s for one sentence.

Text chunking time: 0.00s 
Prompt preparation time: 0.00s
Model generation time: 3.13s
Audio decoding time: 0.06s 
Audio saving time: 0.00s 
Total generation time: 3.19s
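The per-stage numbers above can be collected with a small wall-clock timing helper. This is a hedged sketch, not the deployment code itself: the `stage` context manager is hypothetical, and the stage bodies are placeholders for the real chunking/generation calls.

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def stage(name: str):
    # Record wall-clock time for one named pipeline stage.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start

# Usage mirrors the reported breakdown (real calls would go in the bodies):
with stage("Text chunking"):
    pass  # chunk_text(...)
with stage("Model generation"):
    pass  # llm.generate(...)

total = sum(timings.values())
print(f"Total generation time: {total:.2f}s")
```

Printing each entry of `timings` after a request reproduces the per-stage log lines shown above.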

@jqug

jqug commented Jan 5, 2026

Ah OK that makes sense. I think we would want to use enforce_eager=False in practice, as otherwise the latency is too high to make it feel like a natural response. There could be some tricks like to already warm up the endpoint as soon as someone loads the frontend (i.e. it could be starting up in parallel with the LLM).
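The warm-up trick above can be as simple as a fire-and-forget request sent when the frontend loads, so the Modal container spins up in parallel with everything else. A minimal sketch, assuming the endpoint tolerates being pinged (the URL is the one from the PR description; `warm_up_async` is a hypothetical helper, not part of this PR):

```python
import threading
import urllib.request

WARMUP_URL = "https://sb-modal-ws--spark-tts-salt-sparktts-generate.modal.run"

def warm_up(url: str = WARMUP_URL, timeout: float = 5.0) -> None:
    # Fire-and-forget: we only care that the request reaches Modal and
    # triggers a cold start; errors and the response body are ignored.
    try:
        urllib.request.urlopen(url, timeout=timeout).read()
    except Exception:
        pass

def warm_up_async(url: str = WARMUP_URL) -> threading.Thread:
    # Daemon thread so a slow cold start never blocks the caller.
    t = threading.Thread(target=warm_up, args=(url,), daemon=True)
    t.start()
    return t
```

Calling `warm_up_async()` from the frontend's page-load handler would hide most of the 1.5-2 minute cold start behind the user's first interaction with the LLM.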

@jqug

jqug commented Jan 5, 2026

Does Modal have any persistent storage that could be used to cache the compiled model? Much of that increased cold-start time is model setup, which perhaps is possible to avoid doing every time... (Pointer here on vllm caching)
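Modal does offer persistent storage via Volumes, which can be mounted where vLLM keeps its compile cache. The following is a deployment-config sketch only (class name, GPU type, and cache path are assumptions; vLLM's cache location defaults to `~/.cache/vllm`, but verify for the version in use):

```python
import modal

app = modal.App("spark-tts-salt")

# Persistent Volume shared across containers; contents survive cold starts.
cache_vol = modal.Volume.from_name("vllm-compile-cache", create_if_missing=True)

@app.cls(
    gpu="A10G",                               # assumption: match the GPU actually used
    volumes={"/root/.cache/vllm": cache_vol},  # assumption: vLLM's default cache dir
)
class SparkTTS:
    @modal.enter()
    def load(self):
        # With enforce_eager=False, the CUDA-graph / torch.compile artifacts
        # built here land on the Volume, so later cold starts can reuse them
        # instead of recompiling from scratch.
        ...

    @modal.method()
    def generate(self, text: str, speaker_id: int) -> bytes:
        ...
```

Whether this recovers the full 30-60s difference depends on how much of the startup cost is compilation versus weight loading, so it is worth profiling before relying on it.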

@PatrickCmd PatrickCmd merged commit dc9ce99 into deploy Jan 7, 2026
1 check failed

3 participants