vLLM

Pass-through endpoints for vLLM - call provider-specific endpoints, in native format (no translation).

| Feature | Supported | Notes |
|---|---|---|
| Cost Tracking | ❌ | Not supported |
| Logging | ✅ | Works across all integrations |
| End-user Tracking | ❌ | Tell us if you need this |
| Streaming | ✅ | |

Just replace https://my-vllm-server.com with LITELLM_PROXY_BASE_URL/vllm 🚀

Example Usage

curl -L -X GET 'http://0.0.0.0:4000/vllm/metrics' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234'

Supports ALL vLLM endpoints (including streaming).
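
For instance, a streaming chat completion can be sent straight through the pass-through route. A minimal sketch, assuming your vLLM server exposes the OpenAI-compatible /chat/completions route and serves the qwen2.5-7b-instruct model used in the examples below:

# Request a streamed (Server-Sent Events) response via the pass-through route
curl -L -X POST 'http://0.0.0.0:4000/vllm/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
  "model": "qwen2.5-7b-instruct",
  "messages": [{"role": "user", "content": "Hello!"}],
  "stream": true
}'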

Quick Start

Let's call the vLLM /metrics endpoint

  1. Add HOSTED_VLLM_API_BASE to your environment
export HOSTED_VLLM_API_BASE="https://my-vllm-server.com"
  2. Start LiteLLM Proxy
litellm

# RUNNING on http://0.0.0.0:4000
  3. Test it!

Let's call the vLLM /metrics endpoint

curl -L -X GET 'http://0.0.0.0:4000/vllm/metrics' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234'

Examples

Anything after http://0.0.0.0:4000/vllm is treated as a provider-specific route and forwarded to your vLLM server as-is.
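
For instance, a request to /vllm/health is forwarded to https://my-vllm-server.com/health. A quick sketch, assuming your vLLM server exposes the standard /health liveness route:

# GET /vllm/health on the proxy maps to GET /health on the vLLM server
curl -L -X GET 'http://0.0.0.0:4000/vllm/health' \
-H 'Authorization: Bearer sk-1234'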

Key Changes:

| Original Endpoint | Replace With |
|---|---|
| https://my-vllm-server.com | http://0.0.0.0:4000/vllm (LITELLM_PROXY_BASE_URL="http://0.0.0.0:4000") |
| bearer $VLLM_API_KEY | bearer anything (use bearer LITELLM_VIRTUAL_KEY if Virtual Keys are set up on the proxy) |

Example 1: Metrics endpoint

LiteLLM Proxy Call

curl -L -X GET 'http://0.0.0.0:4000/vllm/metrics' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer $LITELLM_VIRTUAL_KEY'

Direct vLLM API Call

curl -L -X GET 'https://my-vllm-server.com/metrics' \
-H 'Content-Type: application/json'
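
vLLM's /metrics endpoint returns Prometheus-format text, so standard shell tools work on the output. A small sketch, assuming vLLM's default vllm:-prefixed metric names:

# Filter the Prometheus output for vLLM-specific metrics
curl -s 'http://0.0.0.0:4000/vllm/metrics' \
-H 'Authorization: Bearer $LITELLM_VIRTUAL_KEY' | grep 'vllm'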

Example 2: Chat API

LiteLLM Proxy Call

curl -L -X POST 'http://0.0.0.0:4000/vllm/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer $LITELLM_VIRTUAL_KEY' \
-d '{
  "messages": [
    {
      "role": "user",
      "content": "I am going to Paris, what should I see?"
    }
  ],
  "max_tokens": 2048,
  "temperature": 0.8,
  "top_p": 0.1,
  "model": "qwen2.5-7b-instruct"
}'

Direct vLLM API Call

curl -L -X POST 'https://my-vllm-server.com/chat/completions' \
-H 'Content-Type: application/json' \
-d '{
  "messages": [
    {
      "role": "user",
      "content": "I am going to Paris, what should I see?"
    }
  ],
  "max_tokens": 2048,
  "temperature": 0.8,
  "top_p": 0.1,
  "model": "qwen2.5-7b-instruct"
}'
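
The same pattern covers any other OpenAI-compatible route your server exposes. A sketch for /embeddings, assuming you are also serving an embedding model (the name my-embedding-model here is a placeholder):

# POST /vllm/embeddings maps to POST /embeddings on the vLLM server
curl -L -X POST 'http://0.0.0.0:4000/vllm/embeddings' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer $LITELLM_VIRTUAL_KEY' \
-d '{
  "model": "my-embedding-model",
  "input": ["I am going to Paris, what should I see?"]
}'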

Advanced - Use with Virtual Keys

Use this to avoid giving developers the raw vLLM API key, while still letting them use vLLM endpoints.

Pre-requisites

LiteLLM Proxy set up with a database (virtual keys are stored in the DB, hence the DATABASE_URL below)

Usage

  1. Set up environment
export DATABASE_URL=""
export LITELLM_MASTER_KEY=""
export HOSTED_VLLM_API_BASE=""
litellm

# RUNNING on http://0.0.0.0:4000
  2. Generate virtual key
curl -X POST 'http://0.0.0.0:4000/key/generate' \
-H 'Authorization: Bearer sk-1234' \
-H 'Content-Type: application/json' \
-d '{}'

Expected Response

{
  ...
  "key": "sk-1234ewknldferwedojwojw"
}
  3. Test it!
curl -L -X POST 'http://0.0.0.0:4000/vllm/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234ewknldferwedojwojw' \
--data '{
  "messages": [
    {
      "role": "user",
      "content": "I am going to Paris, what should I see?"
    }
  ],
  "max_tokens": 2048,
  "temperature": 0.8,
  "top_p": 0.1,
  "model": "qwen2.5-7b-instruct"
}'
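
Virtual keys can also carry constraints at generation time. A sketch, assuming your LiteLLM Proxy version supports the duration parameter on /key/generate:

# Generate a virtual key that automatically expires after 30 days
curl -X POST 'http://0.0.0.0:4000/key/generate' \
-H 'Authorization: Bearer sk-1234' \
-H 'Content-Type: application/json' \
-d '{"duration": "30d"}'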