Add vLLM cache to TTS model#3

Closed
huwenjie333 wants to merge 28 commits into `main` from `deploy`
Conversation

@huwenjie333
Collaborator

This PR adds the vLLM cache to persistent storage in the deployment at `/root/.cache/vllm`, reducing cold start time from 1.5 minutes to 1 minute.

[Screenshot from 2026-01-09 showing the reduced cold start time]
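For reviewers: the reason persisting `/root/.cache/vllm` helps is that vLLM writes its compilation and other startup artifacts under a cache root that defaults to `~/.cache/vllm` (overridable via the `VLLM_CACHE_ROOT` environment variable). A minimal sketch of that resolution logic, assuming only the documented default and env var (the function name here is our own, not vLLM's):

```python
import os
from pathlib import Path

def resolve_vllm_cache_root() -> Path:
    """Resolve the directory vLLM uses for its on-disk cache.

    Mirrors vLLM's behavior: honor VLLM_CACHE_ROOT if set,
    otherwise fall back to ~/.cache/vllm -- the path this PR
    places on the persistent volume so it survives cold starts.
    """
    return Path(os.environ.get("VLLM_CACHE_ROOT", "~/.cache/vllm")).expanduser()
```

Mounting a persistent volume at whatever path this resolves to means compiled artifacts from a previous container survive into the next one, which is where the cold-start saving comes from.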

PatrickCmd and others added 28 commits November 14, 2025 12:47
…npod-workers#236)

Remove unsupported CUDA versions (12.1-12.3) from hub.json and tests.json
to fix compatibility issues with worker deployment.

- hub.json: remove 12.1, 12.2, 12.3 from `allowedCudaVersions`
- tests.json: remove 12.1, 12.2, 12.3, 12.4 from `allowedCudaVersions`

refs: AE-1452
)

* Enable expert parallel arg for MoE models

* Add ENABLE_EXPERT_PARALLEL to hub config

Sunflower-Ultravox deployment with Modal platform
@huwenjie333 huwenjie333 closed this Jan 9, 2026
5 participants