
feat: add vLLM (OpenAI-compatible) provider for local model deployment#28

Open
Sevenal wants to merge 1 commit into paoloanzn:main from Sevenal:feat/vllm-openai-compat-provider

Conversation


@Sevenal Sevenal commented Apr 10, 2026

Add support for local models via vLLM or any OpenAI-compatible API. Users can now point Claude Code at a self-hosted vLLM server instead of the Anthropic API, with full support for streaming responses and tool_use (function calling).

Changes:

  • Add vllm-fetch-adapter.ts: fetch interceptor that translates between Anthropic Messages API and OpenAI Chat Completions API format
  • Add 'vllm' to APIProvider type and detect via CLAUDE_CODE_USE_VLLM
  • Add vLLM provider branch in client.ts (after Codex provider)
  • Add getVLLMApiKey() and isVLLMSubscriber() auth utilities
  • Register vLLM env vars in managedEnvConstants.ts

Environment variables:
  CLAUDE_CODE_USE_VLLM=1       Enable vLLM provider
  VLLM_API_KEY                 API key (or OPENAI_API_KEY)
  VLLM_BASE_URL                vLLM server URL (default http://localhost:8000)
  ANTHROPIC_MODEL              Model name passed to vLLM directly
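As a quick sketch of wiring these variables together (the model name and the `claude` invocation below are illustrative, not prescribed by this PR):

```shell
# Illustrative setup for pointing Claude Code at a local vLLM server.
# The model name is an example; use whatever model your server is serving.
export CLAUDE_CODE_USE_VLLM=1
export VLLM_BASE_URL=http://localhost:8000
export VLLM_API_KEY=sk-local-placeholder   # falls back to OPENAI_API_KEY if unset
export ANTHROPIC_MODEL=meta-llama/Llama-3.1-8B-Instruct
claude
```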

Summary by CodeRabbit

  • New Features
    • Added support for vLLM (OpenAI-compatible) API provider, enabling configuration and use of vLLM-compatible models.
    • Implemented new authentication and provider detection utilities for vLLM integration.
    • Added environment variable support for vLLM configuration and API key management.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai bot commented Apr 10, 2026

📝 Walkthrough

This change adds vLLM (OpenAI-compatible) as a new API provider option. It extends the provider detection system, introduces authentication helpers, registers environment variables for provider routing, and implements a comprehensive fetch adapter that translates Anthropic SDK request/response formats to vLLM/OpenAI formats, including request transformation, streaming response parsing, and error handling.

Changes

Cohort / File(s) — Summary

Provider Detection & Type System — src/utils/model/providers.ts
  Extended the APIProvider union type to include 'vllm'; updated getAPIProvider() to check the CLAUDE_CODE_USE_VLLM env var and return 'vllm' when enabled.

Authentication & Environment Variables — src/utils/auth.ts, src/utils/managedEnvConstants.ts
  Added getVLLMApiKey() and isVLLMSubscriber() helpers; registered CLAUDE_CODE_USE_VLLM, VLLM_BASE_URL, and VLLM_API_KEY in the provider-managed and safe env var lists.

API Client Integration — src/services/api/client.ts
  Added an early-return branch in getAnthropicClient() that initializes the Anthropic client against vLLM when CLAUDE_CODE_USE_VLLM is enabled, using the fetch adapter and env var configuration.

vLLM Fetch Adapter — src/services/api/vllm-fetch-adapter.ts
  New 702-line adapter implementing request/response translation: converts Anthropic /v1/messages calls to vLLM /v1/chat/completions, transforms tool schemas and message payloads, parses streaming responses and emits Anthropic SSE events, and handles errors and non-OK HTTP responses.
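The request-translation direction described above can be sketched as follows. The function name and the narrow types are illustrative, not the adapter's actual exports; only the wire-format field names (system, messages, stream, max_tokens) come from the two public APIs.

```typescript
// Minimal sketch of the Anthropic → OpenAI request translation the adapter
// performs. Anthropic carries the system prompt as a top-level field; the
// OpenAI Chat Completions format expects it as a leading 'system' message.
type AnthropicMessage = { role: 'user' | 'assistant'; content: string };

interface AnthropicBody {
  model: string;
  system?: string;
  messages: AnthropicMessage[];
  max_tokens?: number;
}

function toChatCompletionsBody(body: AnthropicBody): {
  model: string;
  messages: { role: string; content: string }[];
  stream: boolean;
  max_tokens?: number;
} {
  const messages = [
    // Prepend the system prompt as a system-role message when present.
    ...(body.system ? [{ role: 'system', content: body.system }] : []),
    ...body.messages,
  ];
  return {
    model: body.model,
    messages,
    stream: true,
    // Forward generation controls only when the caller set them.
    ...(typeof body.max_tokens === 'number' ? { max_tokens: body.max_tokens } : {}),
  };
}
```

The real adapter additionally translates tool schemas and then reverses the mapping for the streamed response; this sketch covers only the request body.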

Sequence Diagram

sequenceDiagram
    participant Client as Client Application
    participant SDK as Anthropic SDK
    participant Adapter as vLLM Fetch Adapter
    participant vLLM as vLLM Server

    Client->>SDK: Call Anthropic API with tools
    SDK->>Adapter: POST /v1/messages (Anthropic format)
    Adapter->>Adapter: Transform tools (Anthropic → OpenAI)
    Adapter->>Adapter: Transform messages & system prompt
    Adapter->>vLLM: POST /v1/chat/completions (OpenAI format)
    vLLM-->>Adapter: Stream SSE data
    Adapter->>Adapter: Parse OpenAI streaming response
    Adapter->>Adapter: Transform to Anthropic SSE events
    Adapter-->>SDK: Stream Anthropic SSE format
    SDK-->>Client: Return parsed response with tool calls

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 A vLLM tunnel, oh what a sight!
With adapters that dance through the digital night,
Anthropic conversations to OpenAI format we bend,
Request and response, transform end to end,
New providers converge in this code-crafted blend! 🚀

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 42.86%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed — The title clearly and concisely summarizes the main change: adding vLLM support as an OpenAI-compatible provider for local model deployment, which matches the primary objective of the PR.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e8b287e134


Excerpt from src/utils/model/providers.ts (diff context):

        ? 'openai'
        : 'firstParty'
    : isEnvTruthy(process.env.CLAUDE_CODE_USE_VLLM)
      ? 'vllm'

P1 Badge Add vLLM model mappings before returning provider

Returning 'vllm' here enables a provider value that the model-string/config pipeline does not define (ALL_MODEL_CONFIGS entries are keyed through existing providers only). getBuiltinModelStrings(getAPIProvider()) will therefore produce undefined model IDs for vLLM, and default model resolution can flow into parseUserSpecifiedModel with an undefined value, crashing on .trim() when CLAUDE_CODE_USE_VLLM=1 is set without an explicit model override.
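A defensive guard along these lines would avoid the crash described above; the function name and error message are hypothetical, not code from this PR:

```typescript
// Hypothetical guard: prefer the builtin model string when defined, fall back
// to ANTHROPIC_MODEL, and fail with a clear message instead of crashing on
// undefined.trim() deeper in the model-resolution pipeline.
function resolveVLLMModel(builtinModel: string | undefined): string {
  const model = builtinModel ?? process.env.ANTHROPIC_MODEL;
  if (!model) {
    throw new Error(
      'CLAUDE_CODE_USE_VLLM=1 requires ANTHROPIC_MODEL (no builtin vLLM model mapping exists)',
    );
  }
  return model.trim();
}
```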


Comment on lines +674 to +682:

    const vllmResponse = await globalThis.fetch(chatCompletionsUrl, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Accept: 'text/event-stream',
        Authorization: `Bearer ${apiKey || 'sk-placeholder'}`,
      },
      body: JSON.stringify(vllmBody),
    })

P1 Badge Forward abort signal when proxying to vLLM

The adapter creates a new fetch request but drops init?.signal (and related request options) from the original Anthropic SDK call. In this codebase, streaming requests are issued with abort signals for user cancel/watchdog timeouts, so dropping the signal means aborted CLI requests can continue running against vLLM until completion, causing hung cancellation behavior and unnecessary backend load.
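A fix could merge the caller's options instead of rebuilding them; the helper below is a sketch (its name and the exact shape of the adapter's variables are assumptions), showing the original init spread first so signal and transport options survive while the adapter's headers win:

```typescript
// Sketch: forward the caller's RequestInit when proxying to vLLM so that
// init.signal (user cancel / watchdog timeouts) and other transport fields
// are preserved. Header names mirror the quoted adapter code.
function buildProxyInit(
  init: RequestInit | undefined,
  apiKey: string | undefined,
  body: unknown,
): RequestInit {
  return {
    ...init, // keeps signal, agent/dispatcher options, etc.
    method: 'POST',
    headers: {
      // Preserve original headers, then override the ones the adapter owns.
      ...Object.fromEntries(new Headers(init?.headers).entries()),
      'Content-Type': 'application/json',
      Accept: 'text/event-stream',
      Authorization: `Bearer ${apiKey || 'sk-placeholder'}`,
    },
    body: JSON.stringify(body),
  };
}
```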


Comment on lines +652 to +653:

    if (!url.includes('/v1/messages')) {
      return globalThis.fetch(input, init)

P2 Badge Limit interception to messages create endpoint

Using url.includes('/v1/messages') also catches /v1/messages/count_tokens, so token-count requests are incorrectly rerouted to chat completions and translated as SSE. The SDK expects a JSON token-count response for countTokens, so this breaks API-based counting under vLLM and forces degraded fallback estimation paths.
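A narrower match fixes this by comparing the URL path exactly; the helper name here is illustrative:

```typescript
// Only intercept the messages-create endpoint. Matching on the pathname
// (not a substring of the whole URL) lets /v1/messages/count_tokens and any
// other route pass through to the real fetch untouched.
function shouldIntercept(url: string): boolean {
  // The base argument makes relative URLs parseable; it is ignored for
  // absolute URLs.
  const pathname = new URL(url, 'http://localhost').pathname;
  return pathname.endsWith('/v1/messages');
}
```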



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/services/api/client.ts (1)

139-145: ⚠️ Potential issue | 🟠 Major

Short-circuit Anthropic auth before the vLLM branch.

Every vLLM client creation still runs checkAndRefreshOAuthTokenIfNeeded() and may execute apiKeyHelper first, even though this path never talks to Anthropic. That adds unrelated login latency/failures to local-model traffic and widens the command-execution surface for the new provider.

Also applies to: 323-337

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/services/api/client.ts` around lines 139 - 145, Only run
Anthropic-specific auth/token work for Anthropic consumers: move the call to
checkAndRefreshOAuthTokenIfNeeded() (and the
configureApiKeyHeaders(defaultHeaders, getIsNonInteractiveSession()) call)
inside the branch that is creating/using the Anthropic client (i.e., only when
isClaudeAISubscriber() or the code path that instantiates the Anthropic/via-API
provider is true) so vLLM/local-model branches never trigger OAuth or
apiKeyHelper work; apply the same change to the duplicate token-check block
elsewhere in this file (the other block that currently invokes
checkAndRefreshOAuthTokenIfNeeded()/configureApiKeyHeaders()).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/services/api/vllm-fetch-adapter.ts`:
- Around line 475-478: The upstream finish reason (choices[0].finish_reason) is
captured into stopReason but then discarded because finishStream() always emits
only "tool_use" or "end_turn"; update the code paths where stopReason is passed
to finishStream() (and the other similar block around the 587-619 region) to
forward the actual finishReason string (or map it to a meaningful enum) so
callers can see "length", "stop_sequence", etc., instead of always receiving
"tool_use"/"end_turn"; ensure references to choices[0].finish_reason,
stopReason, and finishStream() are updated and that any consumers still handle
the existing values or new mapped ones.
- Around line 674-682: When proxying the upstream request for vLLM (the
globalThis.fetch call that produces vllmResponse), preserve the original
request's cancellation and transport options instead of rebuilding options from
scratch: merge the original init (or SDK fetch options) into the fetch call so
init.signal and any other transport-related fields are forwarded, merge/override
headers to ensure Content-Type/Accept/Authorization are set while preserving
other headers, and pass the body (vllmBody) and method through; locate the fetch
invocation that uses chatCompletionsUrl, vllmBody, and Authorization and replace
the hard-coded options with a merged Request/option object that includes
init.signal and other init fields so cancelled streams and SDK transport
configuration are preserved.
- Around line 210-218: The vllmBody currently only includes model, messages,
stream and tools, which drops Anthropic generation controls; update the code
that builds vllmBody (the block that sets vllmBody, using claudeModel, messages,
anthropicTools and translateTools) to forward Anthropic parameters when
present—copy max_tokens, temperature, stop_sequences, and tool_choice (or map to
the vLLM equivalents) into vllmBody, ensuring proper key names and types and
only adding them when defined so existing behavior isn’t broken.

In `@src/utils/auth.ts`:
- Around line 1640-1645: The new getVLLMApiKey()/isVLLMSubscriber() functions
are not wired into the provider-classification helpers, causing functions like
isAnthropicAuthEnabled, is1PApiCustomer, and isUsing3PServices to misreport for
vllm; update each of those functions to treat provider 'vllm' as appropriate
(e.g., include isVLLMSubscriber() or getAPIProvider() === 'vllm' in their
truthiness checks) so vllm sessions are classified consistently alongside other
3P/1P providers.

In `@src/utils/managedEnvConstants.ts`:
- Around line 155-156: Remove VLLM_BASE_URL from the SAFE_ENV_VARS whitelist so
it is no longer applied without a trust prompt; update the SAFE_ENV_VARS array
(the list that currently contains 'VLLM_API_KEY' and 'VLLM_BASE_URL') to only
include 'VLLM_API_KEY' and mirror the handling for ANTHROPIC_BASE_URL by
treating VLLM_BASE_URL as sensitive (i.e., not trusted by default) and add a
short comment explaining why VLLM_BASE_URL must remain excluded.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 435178e4-f3f6-46a9-b52d-b43a3a53b6e2

📥 Commits

Reviewing files that changed from the base of the PR and between 7dc15d6 and e8b287e.

📒 Files selected for processing (5)
  • src/services/api/client.ts
  • src/services/api/vllm-fetch-adapter.ts
  • src/utils/auth.ts
  • src/utils/managedEnvConstants.ts
  • src/utils/model/providers.ts

Comment on lines +210 to +218:

    const vllmBody: Record<string, unknown> = {
      model: claudeModel,
      messages,
      stream: true,
    }

    if (anthropicTools.length > 0) {
      vllmBody.tools = translateTools(anthropicTools)
    }

⚠️ Potential issue | 🟠 Major

Forward the Anthropic generation controls.

translateToVLLMBody() only sends model, messages, stream, and tools. Dropping max_tokens means the upstream server falls back to its default length, which can truncate or overrun completions relative to the Anthropic request; temperature, stop_sequences, and tool_choice are lost for the same reason.

Suggested mapping:

      const vllmBody: Record<string, unknown> = {
        model: claudeModel,
        messages,
        stream: true,
  +     ...(typeof anthropicBody.max_tokens === 'number'
  +       ? { max_tokens: anthropicBody.max_tokens }
  +       : {}),
  +     ...(typeof anthropicBody.temperature === 'number'
  +       ? { temperature: anthropicBody.temperature }
  +       : {}),
  +     ...(Array.isArray(anthropicBody.stop_sequences)
  +       ? { stop: anthropicBody.stop_sequences }
  +       : {}),
      }

Comment on lines +475 to +478:

    // ── Finish reason ───────────────────────────────────
    if (choices[0].finish_reason) {
      stopReason = choices[0].finish_reason as string
    }

⚠️ Potential issue | 🟠 Major

Don't discard the upstream finish reason.

You capture choices[0].finish_reason into stopReason, but finishStream() ignores it and always emits tool_use or end_turn. That turns length and stop-sequence terminations into normal turns, so callers can't distinguish a truncated response from a clean stop.

Also applies to: 587-619
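A mapping along these lines would preserve the distinction. The OpenAI finish_reason values and Anthropic stop_reason values below are the publicly documented ones; the helper name and the tie-breaking choices are assumptions:

```typescript
// Map an OpenAI finish_reason onto an Anthropic stop_reason instead of
// collapsing everything to tool_use / end_turn.
function mapStopReason(finishReason: string | null, sawToolCall: boolean): string {
  switch (finishReason) {
    case 'length':
      return 'max_tokens'; // truncated by the token limit
    case 'tool_calls':
      return 'tool_use';
    case 'stop':
      // OpenAI's 'stop' covers both a natural end and a stop sequence; without
      // echoing which sequence fired, 'end_turn' is the safer default here.
      return sawToolCall ? 'tool_use' : 'end_turn';
    default:
      return sawToolCall ? 'tool_use' : 'end_turn';
  }
}
```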


Comment on lines +674 to +682:

    const vllmResponse = await globalThis.fetch(chatCompletionsUrl, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Accept: 'text/event-stream',
        Authorization: `Bearer ${apiKey || 'sk-placeholder'}`,
      },
      body: JSON.stringify(vllmBody),
    })

⚠️ Potential issue | 🟠 Major

Preserve cancellation and transport options when proxying the fetch.

The upstream request is rebuilt from scratch here, so init.signal and any SDK-supplied transport options are dropped. Cancelled streams will keep running on the vLLM server, and proxy/agent configuration from the Anthropic client stops applying to the forwarded call.


Comment on lines +1640 to +1645:

    export function getVLLMApiKey(): string | undefined {
      return process.env.VLLM_API_KEY || process.env.OPENAI_API_KEY
    }

    export function isVLLMSubscriber(): boolean {
      return getAPIProvider() === 'vllm' && !!getVLLMApiKey()

⚠️ Potential issue | 🟠 Major

Wire vllm into the existing provider classification.

isVLLMSubscriber() is added here, but the rest of this file still hard-codes only Bedrock/Vertex/Foundry as 3P. Today isAnthropicAuthEnabled() (Line 116), is1PApiCustomer() (Line 1662), and isUsing3PServices() (Line 1808) can all report the wrong state for a vLLM session.
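One possible shape of that wiring is sketched below. Whether vLLM should count as a third-party service, and the exact helper signatures, are assumptions the comment leaves open; the real helpers live in src/utils/auth.ts and read the provider from getAPIProvider() rather than taking it as a parameter:

```typescript
// Illustrative only: classify 'vllm' consistently alongside the existing
// providers instead of leaving it out of the hard-coded Bedrock/Vertex/Foundry
// checks.
type APIProvider = 'firstParty' | 'bedrock' | 'vertex' | 'foundry' | 'vllm';

function isUsing3PServices(provider: APIProvider): boolean {
  // Treat any non-first-party backend, including vLLM, as third-party.
  return provider !== 'firstParty';
}

function isAnthropicAuthEnabled(provider: APIProvider): boolean {
  // vLLM sessions never need Anthropic OAuth or API-key auth.
  return provider === 'firstParty';
}
```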


Comment on lines +155 to +156:

    'VLLM_API_KEY',
    'VLLM_BASE_URL',

⚠️ Potential issue | 🔴 Critical

Don't whitelist VLLM_BASE_URL as a safe env var.

SAFE_ENV_VARS are applied without a trust prompt, and VLLM_BASE_URL chooses the server that receives prompts and tool traffic. That gives managed/project settings the same redirect capability the file already treats as dangerous for ANTHROPIC_BASE_URL.

