feat: add vLLM (OpenAI-compatible) provider for local model deployment #28
Sevenal wants to merge 1 commit into paoloanzn:main
Conversation
Add support for local models via vLLM or any OpenAI-compatible API. Users can now point Claude Code at a self-hosted vLLM server instead of the Anthropic API, with full support for streaming responses and tool_use (function calling).

Changes:
- Add vllm-fetch-adapter.ts: fetch interceptor that translates between the Anthropic Messages API and the OpenAI Chat Completions API format
- Add 'vllm' to the APIProvider type, detected via CLAUDE_CODE_USE_VLLM
- Add a vLLM provider branch in client.ts (after the Codex provider)
- Add getVLLMApiKey() and isVLLMSubscriber() auth utilities
- Register vLLM env vars in managedEnvConstants.ts

Environment variables:
- `CLAUDE_CODE_USE_VLLM=1`: enable the vLLM provider
- `VLLM_API_KEY`: API key (falls back to `OPENAI_API_KEY`)
- `VLLM_BASE_URL`: vLLM server URL (default http://localhost:8000)
- `ANTHROPIC_MODEL`: model name passed to vLLM directly

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
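A typical local setup with these variables might look like the following (the model name and port are placeholders for your own deployment, not values mandated by this PR):

```shell
# Start an OpenAI-compatible server first, e.g.:
#   vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000

# Point Claude Code at the local server
export CLAUDE_CODE_USE_VLLM=1
export VLLM_BASE_URL=http://localhost:8000
export VLLM_API_KEY=sk-local   # or rely on OPENAI_API_KEY
export ANTHROPIC_MODEL=Qwen/Qwen2.5-7B-Instruct
```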
📝 Walkthrough
This change adds vLLM (OpenAI-compatible) as a new API provider option. It extends the provider detection system, introduces authentication helpers, registers environment variables for provider routing, and implements a comprehensive fetch adapter that translates Anthropic SDK request/response formats to vLLM/OpenAI formats, including request transformation, streaming response parsing, and error handling.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client as Client Application
    participant SDK as Anthropic SDK
    participant Adapter as vLLM Fetch Adapter
    participant vLLM as vLLM Server
    Client->>SDK: Call Anthropic API with tools
    SDK->>Adapter: POST /v1/messages (Anthropic format)
    Adapter->>Adapter: Transform tools (Anthropic → OpenAI)
    Adapter->>Adapter: Transform messages & system prompt
    Adapter->>vLLM: POST /v1/chat/completions (OpenAI format)
    vLLM-->>Adapter: Stream SSE data
    Adapter->>Adapter: Parse OpenAI streaming response
    Adapter->>Adapter: Transform to Anthropic SSE events
    Adapter-->>SDK: Stream Anthropic SSE format
    SDK-->>Client: Return parsed response with tool calls
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e8b287e134
```ts
? 'openai'
: 'firstParty'
: isEnvTruthy(process.env.CLAUDE_CODE_USE_VLLM)
  ? 'vllm'
```
Add vLLM model mappings before returning provider
Returning 'vllm' here enables a provider value that the model-string/config pipeline does not define (ALL_MODEL_CONFIGS entries are keyed through existing providers only). getBuiltinModelStrings(getAPIProvider()) will therefore produce undefined model IDs for vLLM, and default model resolution can flow into parseUserSpecifiedModel with an undefined value, crashing on .trim() when CLAUDE_CODE_USE_VLLM=1 is set without an explicit model override.
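One way to avoid the crash described above is to give the vllm provider an explicit fallback before model resolution reaches parseUserSpecifiedModel. A minimal sketch, assuming the adapter reads ANTHROPIC_MODEL directly; `resolveVLLMModel` and `DEFAULT_VLLM_MODEL` are illustrative names, not identifiers from this codebase:

```typescript
// Illustrative sketch: fall back to an explicit model string for the
// vllm provider instead of letting an undefined ID reach .trim().
const DEFAULT_VLLM_MODEL = "local-model"; // hypothetical default

function resolveVLLMModel(envModel: string | undefined): string {
  // ANTHROPIC_MODEL is passed through to vLLM verbatim; when it is
  // unset there is no builtin mapping to fall back on, so supply one.
  const candidate = envModel?.trim();
  return candidate && candidate.length > 0 ? candidate : DEFAULT_VLLM_MODEL;
}
```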
```ts
const vllmResponse = await globalThis.fetch(chatCompletionsUrl, {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    Accept: 'text/event-stream',
    Authorization: `Bearer ${apiKey || 'sk-placeholder'}`,
  },
  body: JSON.stringify(vllmBody),
})
```
Forward abort signal when proxying to vLLM
The adapter creates a new fetch request but drops init?.signal (and related request options) from the original Anthropic SDK call. In this codebase, streaming requests are issued with abort signals for user cancel/watchdog timeouts, so dropping the signal means aborted CLI requests can continue running against vLLM until completion, causing hung cancellation behavior and unnecessary backend load.
```ts
if (!url.includes('/v1/messages')) {
  return globalThis.fetch(input, init)
}
```
Limit interception to messages create endpoint
Using url.includes('/v1/messages') also catches /v1/messages/count_tokens, so token-count requests are incorrectly rerouted to chat completions and translated as SSE. The SDK expects a JSON token-count response for countTokens, so this breaks API-based counting under vLLM and forces degraded fallback estimation paths.
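A stricter match on the URL path avoids rerouting the token-count endpoint. A sketch of one possible fix; `isMessagesCreate` is an illustrative helper, not a name from this PR:

```typescript
// Illustrative sketch: intercept only the messages-create endpoint so
// /v1/messages/count_tokens keeps going to the original fetch.
function isMessagesCreate(rawUrl: string): boolean {
  // The base argument makes relative URLs parse too; only the path matters.
  const { pathname } = new URL(rawUrl, "http://localhost");
  return pathname.endsWith("/v1/messages");
}
```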
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/services/api/client.ts (1)
139-145: ⚠️ Potential issue | 🟠 Major
Short-circuit Anthropic auth before the vLLM branch.
Every vLLM client creation still runs checkAndRefreshOAuthTokenIfNeeded() and may execute apiKeyHelper first, even though this path never talks to Anthropic. That adds unrelated login latency/failures to local-model traffic and widens the command-execution surface for the new provider.
Also applies to: 323-337
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/services/api/client.ts` around lines 139 - 145, Only run Anthropic-specific auth/token work for Anthropic consumers: move the call to checkAndRefreshOAuthTokenIfNeeded() (and the configureApiKeyHeaders(defaultHeaders, getIsNonInteractiveSession()) call) inside the branch that is creating/using the Anthropic client (i.e., only when isClaudeAISubscriber() or the code path that instantiates the Anthropic/via-API provider is true) so vLLM/local-model branches never trigger OAuth or apiKeyHelper work; apply the same change to the duplicate token-check block elsewhere in this file (the other block that currently invokes checkAndRefreshOAuthTokenIfNeeded()/configureApiKeyHeaders()).
📒 Files selected for processing (5)
- src/services/api/client.ts
- src/services/api/vllm-fetch-adapter.ts
- src/utils/auth.ts
- src/utils/managedEnvConstants.ts
- src/utils/model/providers.ts
```ts
const vllmBody: Record<string, unknown> = {
  model: claudeModel,
  messages,
  stream: true,
}

if (anthropicTools.length > 0) {
  vllmBody.tools = translateTools(anthropicTools)
}
```
Forward the Anthropic generation controls.
translateToVLLMBody() only sends model, messages, stream, and tools. Dropping max_tokens means the upstream server falls back to its default length, which can truncate or overrun completions relative to the Anthropic request; temperature, stop_sequences, and tool_choice are lost for the same reason.
Suggested mapping:

```diff
 const vllmBody: Record<string, unknown> = {
   model: claudeModel,
   messages,
   stream: true,
+  ...(typeof anthropicBody.max_tokens === 'number'
+    ? { max_tokens: anthropicBody.max_tokens }
+    : {}),
+  ...(typeof anthropicBody.temperature === 'number'
+    ? { temperature: anthropicBody.temperature }
+    : {}),
+  ...(Array.isArray(anthropicBody.stop_sequences)
+    ? { stop: anthropicBody.stop_sequences }
+    : {}),
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/services/api/vllm-fetch-adapter.ts` around lines 210 - 218, The vllmBody
currently only includes model, messages, stream and tools, which drops Anthropic
generation controls; update the code that builds vllmBody (the block that sets
vllmBody, using claudeModel, messages, anthropicTools and translateTools) to
forward Anthropic parameters when present—copy max_tokens, temperature,
stop_sequences, and tool_choice (or map to the vLLM equivalents) into vllmBody,
ensuring proper key names and types and only adding them when defined so
existing behavior isn’t broken.
```ts
// ── Finish reason ───────────────────────────────────
if (choices[0].finish_reason) {
  stopReason = choices[0].finish_reason as string
}
```
Don't discard the upstream finish reason.
You capture choices[0].finish_reason into stopReason, but finishStream() ignores it and always emits tool_use or end_turn. That turns length and stop-sequence terminations into normal turns, so callers can't distinguish a truncated response from a clean stop.
Also applies to: 587-619
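One way to act on this finding is to map the OpenAI finish_reason onto Anthropic stop_reason values when finishing the stream. A sketch under the assumption that the adapter tracks whether a tool call was emitted; `mapFinishReason` is an illustrative helper, not a function in this PR (note that OpenAI's "stop" covers both natural ends and stop sequences, so the two cannot be fully distinguished here):

```typescript
// Illustrative sketch: translate OpenAI finish_reason values into
// Anthropic stop_reason values instead of always "end_turn"/"tool_use".
type AnthropicStopReason = "end_turn" | "max_tokens" | "stop_sequence" | "tool_use";

function mapFinishReason(
  finishReason: string | null,
  sawToolCall: boolean,
): AnthropicStopReason {
  switch (finishReason) {
    case "length":
      return "max_tokens"; // truncated by the token limit
    case "tool_calls":
      return "tool_use";
    case "stop":
    default:
      // OpenAI does not separate stop-sequence from natural completion.
      return sawToolCall ? "tool_use" : "end_turn";
  }
}
```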
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/services/api/vllm-fetch-adapter.ts` around lines 475 - 478, The upstream
finish reason (choices[0].finish_reason) is captured into stopReason but then
discarded because finishStream() always emits only "tool_use" or "end_turn";
update the code paths where stopReason is passed to finishStream() (and the
other similar block around the 587-619 region) to forward the actual
finishReason string (or map it to a meaningful enum) so callers can see
"length", "stop_sequence", etc., instead of always receiving
"tool_use"/"end_turn"; ensure references to choices[0].finish_reason,
stopReason, and finishStream() are updated and that any consumers still handle
the existing values or new mapped ones.
```ts
const vllmResponse = await globalThis.fetch(chatCompletionsUrl, {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    Accept: 'text/event-stream',
    Authorization: `Bearer ${apiKey || 'sk-placeholder'}`,
  },
  body: JSON.stringify(vllmBody),
})
```
Preserve cancellation and transport options when proxying the fetch.
The upstream request is rebuilt from scratch here, so init.signal and any SDK-supplied transport options are dropped. Cancelled streams will keep running on the vLLM server, and proxy/agent configuration from the Anthropic client stops applying to the forwarded call.
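A minimal sketch of the merge this comment asks for, assuming the adapter has the original `init` in scope; `buildProxiedInit` is an illustrative helper, not a name from this codebase:

```typescript
// Illustrative sketch: spread the caller's RequestInit (notably
// init.signal) into the proxied request instead of rebuilding it,
// overriding only the headers the adapter requires.
function buildProxiedInit(
  init: RequestInit | undefined,
  apiKey: string | undefined,
  body: string,
): RequestInit {
  const headers = new Headers(init?.headers);
  headers.set("Content-Type", "application/json");
  headers.set("Accept", "text/event-stream");
  headers.set("Authorization", `Bearer ${apiKey || "sk-placeholder"}`);
  // ...init carries signal and any other transport options forward.
  return { ...init, method: "POST", headers, body };
}
```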
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/services/api/vllm-fetch-adapter.ts` around lines 674 - 682, When proxying
the upstream request for vLLM (the globalThis.fetch call that produces
vllmResponse), preserve the original request's cancellation and transport
options instead of rebuilding options from scratch: merge the original init (or
SDK fetch options) into the fetch call so init.signal and any other
transport-related fields are forwarded, merge/override headers to ensure
Content-Type/Accept/Authorization are set while preserving other headers, and
pass the body (vllmBody) and method through; locate the fetch invocation that
uses chatCompletionsUrl, vllmBody, and Authorization and replace the hard-coded
options with a merged Request/option object that includes init.signal and other
init fields so cancelled streams and SDK transport configuration are preserved.
```ts
export function getVLLMApiKey(): string | undefined {
  return process.env.VLLM_API_KEY || process.env.OPENAI_API_KEY
}

export function isVLLMSubscriber(): boolean {
  return getAPIProvider() === 'vllm' && !!getVLLMApiKey()
}
```
Wire vllm into the existing provider classification.
isVLLMSubscriber() is added here, but the rest of this file still hard-codes only Bedrock/Vertex/Foundry as 3P. Today isAnthropicAuthEnabled() (Line 116), is1PApiCustomer() (Line 1662), and isUsing3PServices() (Line 1808) can all report the wrong state for a vLLM session.
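One shape the fix could take is a single provider set that the classification helpers share, so a new provider only needs to be added in one place. A sketch only; `THIRD_PARTY_PROVIDERS` and the simplified `isUsing3PServices` signature below are illustrative, not the actual definitions in auth.ts:

```typescript
// Illustrative sketch: classify 'vllm' alongside the existing
// third-party providers instead of hard-coding Bedrock/Vertex/Foundry.
type APIProvider = "firstParty" | "bedrock" | "vertex" | "foundry" | "vllm";

const THIRD_PARTY_PROVIDERS: ReadonlySet<APIProvider> = new Set<APIProvider>([
  "bedrock",
  "vertex",
  "foundry",
  "vllm", // newly added so vLLM sessions are reported consistently
]);

function isUsing3PServices(provider: APIProvider): boolean {
  return THIRD_PARTY_PROVIDERS.has(provider);
}
```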
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/utils/auth.ts` around lines 1640 - 1645, The new
getVLLMApiKey()/isVLLMSubscriber() functions are not wired into the
provider-classification helpers, causing functions like isAnthropicAuthEnabled,
is1PApiCustomer, and isUsing3PServices to misreport for vllm; update each of
those functions to treat provider 'vllm' as appropriate (e.g., include
isVLLMSubscriber() or getAPIProvider() === 'vllm' in their truthiness checks) so
vllm sessions are classified consistently alongside other 3P/1P providers.
```ts
'VLLM_API_KEY',
'VLLM_BASE_URL',
```
Don't whitelist VLLM_BASE_URL as a safe env var.
SAFE_ENV_VARS are applied without a trust prompt, and VLLM_BASE_URL chooses the server that receives prompts and tool traffic. That gives managed/project settings the same redirect capability the file already treats as dangerous for ANTHROPIC_BASE_URL.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/utils/managedEnvConstants.ts` around lines 155 - 156, Remove
VLLM_BASE_URL from the SAFE_ENV_VARS whitelist so it is no longer applied
without a trust prompt; update the SAFE_ENV_VARS array (the list that currently
contains 'VLLM_API_KEY' and 'VLLM_BASE_URL') to only include 'VLLM_API_KEY' and
mirror the handling for ANTHROPIC_BASE_URL by treating VLLM_BASE_URL as
sensitive (i.e., not trusted by default) and add a short comment explaining why
VLLM_BASE_URL must remain excluded.