Merge branch 'BerriAI:main' into main

tombii 2025-04-14 10:37:22 +02:00 committed by GitHub
commit f4f2b6b71b
1155 changed files with 89209 additions and 20481 deletions

File diff suppressed because it is too large


@ -1,13 +1,15 @@
# used by CI/CD testing
openai==1.54.0
openai==1.68.2
python-dotenv
tiktoken
importlib_metadata
cohere
redis
redis==5.2.1
redisvl==0.4.1
anthropic
orjson==3.9.15
pydantic==2.7.1
pydantic==2.10.2
google-cloud-aiplatform==1.43.0
fastapi-sso==0.10.0
fastapi-sso==0.16.0
uvloop==0.21.0
mcp==1.5.0 # for MCP server


@ -6,6 +6,16 @@
<!-- e.g. "Fixes #000" -->
## Pre-Submission checklist
**Please complete all items before asking a LiteLLM maintainer to review your PR**
- [ ] I have added testing in the [`tests/litellm/`](https://github.com/BerriAI/litellm/tree/main/tests/litellm) directory. **Adding at least 1 test is a hard requirement** - [see details](https://docs.litellm.ai/docs/extras/contributing_code)
- [ ] I have added a screenshot of my new test passing locally
- [ ] My PR passes all unit tests on [`make test-unit`](https://docs.litellm.ai/docs/extras/contributing_code)
- [ ] My PR's scope is as isolated as possible; it only solves 1 specific problem
## Type
<!-- Select the type of Pull Request -->
@ -20,10 +30,4 @@
## Changes
<!-- List of changes -->
## [REQUIRED] Testing - Attach a screenshot of any new tests passing locally
If there are UI changes, attach a screenshot/GIF of the working UI fix
<!-- Test procedure -->


@ -80,7 +80,6 @@ jobs:
permissions:
contents: read
packages: write
#
steps:
- name: Checkout repository
uses: actions/checkout@v4
@ -112,7 +111,11 @@ jobs:
with:
context: .
push: true
tags: ${{ steps.meta.outputs.tags }}-${{ github.event.inputs.tag || 'latest' }}, ${{ steps.meta.outputs.tags }}-${{ github.event.inputs.release_type }} # if a tag is provided, use that, otherwise use the release tag, and if neither is available, use 'latest'
tags: |
${{ steps.meta.outputs.tags }}-${{ github.event.inputs.tag || 'latest' }},
${{ steps.meta.outputs.tags }}-${{ github.event.inputs.release_type }}
${{ github.event.inputs.release_type == 'stable' && format('{0}/berriai/litellm:main-{1}', env.REGISTRY, github.event.inputs.tag) || '' }},
${{ github.event.inputs.release_type == 'stable' && format('{0}/berriai/litellm:main-stable', env.REGISTRY) || '' }}
labels: ${{ steps.meta.outputs.labels }}
platforms: local,linux/amd64,linux/arm64,linux/arm64/v8
@ -151,8 +154,12 @@ jobs:
context: .
file: ./docker/Dockerfile.database
push: true
tags: ${{ steps.meta-database.outputs.tags }}-${{ github.event.inputs.tag || 'latest' }}, ${{ steps.meta-database.outputs.tags }}-${{ github.event.inputs.release_type }}
labels: ${{ steps.meta-database.outputs.labels }}
tags: |
${{ steps.meta-database.outputs.tags }}-${{ github.event.inputs.tag || 'latest' }},
${{ steps.meta-database.outputs.tags }}-${{ github.event.inputs.release_type }}
${{ github.event.inputs.release_type == 'stable' && format('{0}/berriai/litellm-database:main-{1}', env.REGISTRY, github.event.inputs.tag) || '' }},
${{ github.event.inputs.release_type == 'stable' && format('{0}/berriai/litellm-database:main-stable', env.REGISTRY) || '' }}
labels: ${{ steps.meta-database.outputs.labels }}
platforms: local,linux/amd64,linux/arm64,linux/arm64/v8
build-and-push-image-non_root:
@ -190,7 +197,11 @@ jobs:
context: .
file: ./docker/Dockerfile.non_root
push: true
tags: ${{ steps.meta-non_root.outputs.tags }}-${{ github.event.inputs.tag || 'latest' }}, ${{ steps.meta-non_root.outputs.tags }}-${{ github.event.inputs.release_type }}
tags: |
${{ steps.meta-non_root.outputs.tags }}-${{ github.event.inputs.tag || 'latest' }},
${{ steps.meta-non_root.outputs.tags }}-${{ github.event.inputs.release_type }}
${{ github.event.inputs.release_type == 'stable' && format('{0}/berriai/litellm-non_root:main-{1}', env.REGISTRY, github.event.inputs.tag) || '' }},
${{ github.event.inputs.release_type == 'stable' && format('{0}/berriai/litellm-non_root:main-stable', env.REGISTRY) || '' }}
labels: ${{ steps.meta-non_root.outputs.labels }}
platforms: local,linux/amd64,linux/arm64,linux/arm64/v8
@ -229,7 +240,11 @@ jobs:
context: .
file: ./litellm-js/spend-logs/Dockerfile
push: true
tags: ${{ steps.meta-spend-logs.outputs.tags }}-${{ github.event.inputs.tag || 'latest' }}, ${{ steps.meta-spend-logs.outputs.tags }}-${{ github.event.inputs.release_type }}
tags: |
${{ steps.meta-spend-logs.outputs.tags }}-${{ github.event.inputs.tag || 'latest' }},
${{ steps.meta-spend-logs.outputs.tags }}-${{ github.event.inputs.release_type }}
${{ github.event.inputs.release_type == 'stable' && format('{0}/berriai/litellm-spend_logs:main-{1}', env.REGISTRY, github.event.inputs.tag) || '' }},
${{ github.event.inputs.release_type == 'stable' && format('{0}/berriai/litellm-spend_logs:main-stable', env.REGISTRY) || '' }}
platforms: local,linux/amd64,linux/arm64,linux/arm64/v8
build-and-push-helm-chart:

.github/workflows/helm_unit_test.yml (new file, 27 lines)

@ -0,0 +1,27 @@
name: Helm unit test
on:
pull_request:
push:
branches:
- main
jobs:
unit-test:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v2
- name: Set up Helm 3.11.1
uses: azure/setup-helm@v1
with:
version: '3.11.1'
- name: Install Helm Unit Test Plugin
run: |
helm plugin install https://github.com/helm-unittest/helm-unittest --version v0.4.4
- name: Run unit tests
run:
helm unittest -f 'tests/*.yaml' deploy/charts/litellm-helm


@ -54,27 +54,29 @@ def interpret_results(csv_file):
def _get_docker_run_command_stable_release(release_version):
return f"""
\n\n
## Docker Run LiteLLM Proxy
\n\n
## Docker Run LiteLLM Proxy
```
docker run \\
-e STORE_MODEL_IN_DB=True \\
-p 4000:4000 \\
ghcr.io/berriai/litellm_stable_release_branch-{release_version}
```
docker run \\
-e STORE_MODEL_IN_DB=True \\
-p 4000:4000 \\
ghcr.io/berriai/litellm:litellm_stable_release_branch-{release_version}
```
"""
def _get_docker_run_command(release_version):
return f"""
\n\n
## Docker Run LiteLLM Proxy
\n\n
## Docker Run LiteLLM Proxy
```
docker run \\
-e STORE_MODEL_IN_DB=True \\
-p 4000:4000 \\
ghcr.io/berriai/litellm:main-{release_version}
```
docker run \\
-e STORE_MODEL_IN_DB=True \\
-p 4000:4000 \\
ghcr.io/berriai/litellm:main-{release_version}
```
"""


@ -8,7 +8,7 @@ class MyUser(HttpUser):
def chat_completion(self):
headers = {
"Content-Type": "application/json",
"Authorization": "Bearer sk-ZoHqrLIs2-5PzJrqBaviAA",
"Authorization": "Bearer sk-8N1tLOOyH8TIxwOLahhIVg",
# Include any additional headers you may need for authentication, etc.
}

.github/workflows/publish-migrations.yml (new file, 206 lines)

@ -0,0 +1,206 @@
name: Publish Prisma Migrations
permissions:
contents: write
pull-requests: write
on:
push:
paths:
- 'schema.prisma' # Check root schema.prisma
branches:
- main
jobs:
publish-migrations:
runs-on: ubuntu-latest
services:
postgres:
image: postgres:14
env:
POSTGRES_DB: temp_db
POSTGRES_USER: postgres
POSTGRES_PASSWORD: postgres
ports:
- 5432:5432
options: >-
--health-cmd pg_isready
--health-interval 10s
--health-timeout 5s
--health-retries 5
# Add shadow database service
postgres_shadow:
image: postgres:14
env:
POSTGRES_DB: shadow_db
POSTGRES_USER: postgres
POSTGRES_PASSWORD: postgres
ports:
- 5433:5432
options: >-
--health-cmd pg_isready
--health-interval 10s
--health-timeout 5s
--health-retries 5
steps:
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.x'
- name: Install Dependencies
run: |
pip install prisma
pip install python-dotenv
- name: Generate Initial Migration if None Exists
env:
DATABASE_URL: "postgresql://postgres:postgres@localhost:5432/temp_db"
DIRECT_URL: "postgresql://postgres:postgres@localhost:5432/temp_db"
SHADOW_DATABASE_URL: "postgresql://postgres:postgres@localhost:5433/shadow_db"
run: |
mkdir -p deploy/migrations
echo 'provider = "postgresql"' > deploy/migrations/migration_lock.toml
if [ -z "$(ls -A deploy/migrations/2* 2>/dev/null)" ]; then
echo "No existing migrations found, creating baseline..."
VERSION=$(date +%Y%m%d%H%M%S)
mkdir -p deploy/migrations/${VERSION}_initial
echo "Generating initial migration..."
# Save raw output for debugging
prisma migrate diff \
--from-empty \
--to-schema-datamodel schema.prisma \
--shadow-database-url "${SHADOW_DATABASE_URL}" \
--script > deploy/migrations/${VERSION}_initial/raw_migration.sql
echo "Raw migration file content:"
cat deploy/migrations/${VERSION}_initial/raw_migration.sql
echo "Cleaning migration file..."
# Clean the file
sed '/^Installing/d' deploy/migrations/${VERSION}_initial/raw_migration.sql > deploy/migrations/${VERSION}_initial/migration.sql
# Verify the migration file
if [ ! -s deploy/migrations/${VERSION}_initial/migration.sql ]; then
echo "ERROR: Migration file is empty after cleaning"
echo "Original content was:"
cat deploy/migrations/${VERSION}_initial/raw_migration.sql
exit 1
fi
echo "Final migration file content:"
cat deploy/migrations/${VERSION}_initial/migration.sql
# Verify it starts with SQL
if ! head -n 1 deploy/migrations/${VERSION}_initial/migration.sql | grep -q "^--\|^CREATE\|^ALTER"; then
echo "ERROR: Migration file does not start with SQL command or comment"
echo "First line is:"
head -n 1 deploy/migrations/${VERSION}_initial/migration.sql
echo "Full content is:"
cat deploy/migrations/${VERSION}_initial/migration.sql
exit 1
fi
echo "Initial migration generated at $(date -u)" > deploy/migrations/${VERSION}_initial/README.md
fi
- name: Compare and Generate Migration
if: success()
env:
DATABASE_URL: "postgresql://postgres:postgres@localhost:5432/temp_db"
DIRECT_URL: "postgresql://postgres:postgres@localhost:5432/temp_db"
SHADOW_DATABASE_URL: "postgresql://postgres:postgres@localhost:5433/shadow_db"
run: |
# Create temporary migration workspace
mkdir -p temp_migrations
# Copy existing migrations (will not fail if directory is empty)
cp -r deploy/migrations/* temp_migrations/ 2>/dev/null || true
VERSION=$(date +%Y%m%d%H%M%S)
# Generate diff against existing migrations or empty state
prisma migrate diff \
--from-migrations temp_migrations \
--to-schema-datamodel schema.prisma \
--shadow-database-url "${SHADOW_DATABASE_URL}" \
--script > temp_migrations/migration_${VERSION}.sql
# Check if there are actual changes
if [ -s temp_migrations/migration_${VERSION}.sql ]; then
echo "Changes detected, creating new migration"
mkdir -p deploy/migrations/${VERSION}_schema_update
mv temp_migrations/migration_${VERSION}.sql deploy/migrations/${VERSION}_schema_update/migration.sql
echo "Migration generated at $(date -u)" > deploy/migrations/${VERSION}_schema_update/README.md
else
echo "No schema changes detected"
exit 0
fi
- name: Verify Migration
if: success()
env:
DATABASE_URL: "postgresql://postgres:postgres@localhost:5432/temp_db"
DIRECT_URL: "postgresql://postgres:postgres@localhost:5432/temp_db"
SHADOW_DATABASE_URL: "postgresql://postgres:postgres@localhost:5433/shadow_db"
run: |
# Create test database
psql "${SHADOW_DATABASE_URL}" -c 'CREATE DATABASE migration_test;'
# Apply all migrations in order to verify
for migration in deploy/migrations/*/migration.sql; do
echo "Applying migration: $migration"
psql "${SHADOW_DATABASE_URL}" -f $migration
done
# Add this step before create-pull-request to debug permissions
- name: Check Token Permissions
run: |
echo "Checking token permissions..."
curl -H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" \
-H "Accept: application/vnd.github.v3+json" \
https://api.github.com/repos/BerriAI/litellm/collaborators
echo "\nChecking if token can create PRs..."
curl -H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" \
-H "Accept: application/vnd.github.v3+json" \
https://api.github.com/repos/BerriAI/litellm
# Add this debug step before git push
- name: Debug Changed Files
run: |
echo "Files staged for commit:"
git diff --name-status --staged
echo "\nAll changed files:"
git status
- name: Create Pull Request
if: success()
uses: peter-evans/create-pull-request@v5
with:
token: ${{ secrets.GITHUB_TOKEN }}
commit-message: "chore: update prisma migrations"
title: "Update Prisma Migrations"
body: |
Auto-generated migration based on schema.prisma changes.
Generated files:
- deploy/migrations/${VERSION}_schema_update/migration.sql
- deploy/migrations/${VERSION}_schema_update/README.md
branch: feat/prisma-migration-${{ env.VERSION }}
base: main
delete-branch: true
- name: Generate and Save Migrations
run: |
# Only add migration files
git add deploy/migrations/
git status # Debug what's being committed
git commit -m "chore: update prisma migrations"

.github/workflows/test-linting.yml (new file, 53 lines)

@ -0,0 +1,53 @@
name: LiteLLM Linting
on:
pull_request:
branches: [ main ]
jobs:
lint:
runs-on: ubuntu-latest
timeout-minutes: 5
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.12'
- name: Install Poetry
uses: snok/install-poetry@v1
- name: Install dependencies
run: |
poetry install --with dev
- name: Run Black formatting
run: |
cd litellm
poetry run black .
cd ..
- name: Run Ruff linting
run: |
cd litellm
poetry run ruff check .
cd ..
- name: Run MyPy type checking
run: |
cd litellm
poetry run mypy . --ignore-missing-imports
cd ..
- name: Check for circular imports
run: |
cd litellm
poetry run python ../tests/documentation_tests/test_circular_imports.py
cd ..
- name: Check import safety
run: |
poetry run python -c "from litellm import *" || (echo '🚨 import failed, this means you introduced unprotected imports! 🚨'; exit 1)

.github/workflows/test-litellm.yml (new file, 35 lines)

@ -0,0 +1,35 @@
name: LiteLLM Mock Tests (folder - tests/litellm)
on:
pull_request:
branches: [ main ]
jobs:
test:
runs-on: ubuntu-latest
timeout-minutes: 5
steps:
- uses: actions/checkout@v4
- name: Thank You Message
run: |
echo "### 🙏 Thank you for contributing to LiteLLM!" >> $GITHUB_STEP_SUMMARY
echo "Your PR is being tested now. We appreciate your help in making LiteLLM better!" >> $GITHUB_STEP_SUMMARY
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.12'
- name: Install Poetry
uses: snok/install-poetry@v1
- name: Install dependencies
run: |
poetry install --with dev,proxy-dev --extras proxy
poetry run pip install pytest-xdist
- name: Run tests
run: |
poetry run pytest tests/litellm -x -vv -n 4

.gitignore (8 additions)

@ -1,3 +1,4 @@
.python-version
.venv
.env
.newenv
@ -77,3 +78,10 @@ litellm/proxy/_experimental/out/404.html
litellm/proxy/_experimental/out/model_hub.html
.mypy_cache/*
litellm/proxy/application.log
tests/llm_translation/vertex_test_account.json
tests/llm_translation/test_vertex_key.json
litellm/proxy/migrations/0_init/migration.sql
litellm/proxy/db/migrations/0_init/migration.sql
litellm/proxy/db/migrations/*
litellm/proxy/migrations/*config.yaml
litellm/proxy/migrations/*


@ -6,44 +6,35 @@ repos:
entry: pyright
language: system
types: [python]
files: ^litellm/
files: ^(litellm/|litellm_proxy_extras/)
- id: isort
name: isort
entry: isort
language: system
types: [python]
files: litellm/.*\.py
files: (litellm/|litellm_proxy_extras/).*\.py
exclude: ^litellm/__init__.py$
- repo: https://github.com/psf/black
rev: 24.2.0
hooks:
- id: black
- id: black
name: black
entry: poetry run black
language: system
types: [python]
files: (litellm/|litellm_proxy_extras/).*\.py
- repo: https://github.com/pycqa/flake8
rev: 7.0.0 # The version of flake8 to use
hooks:
- id: flake8
exclude: ^litellm/tests/|^litellm/proxy/tests/
exclude: ^litellm/tests/|^litellm/proxy/tests/|^litellm/tests/litellm/|^tests/litellm/
additional_dependencies: [flake8-print]
files: litellm/.*\.py
# - id: flake8
# name: flake8 (router.py function length)
# files: ^litellm/router\.py$
# args: [--max-function-length=40]
# # additional_dependencies: [flake8-functions]
files: (litellm/|litellm_proxy_extras/).*\.py
- repo: https://github.com/python-poetry/poetry
rev: 1.8.0
hooks:
- id: poetry-check
files: ^(pyproject.toml|litellm-proxy-extras/pyproject.toml)$
- repo: local
hooks:
- id: check-files-match
name: Check if files match
entry: python3 ci_cd/check_files_match.py
language: system
# - id: check-file-length
# name: Check file length
# entry: python check_file_length.py
# args: ["10000"] # set your desired maximum number of lines
# language: python
# files: litellm/.*\.py
# exclude: ^litellm/tests/
language: system


@ -12,8 +12,7 @@ WORKDIR /app
USER root
# Install build dependencies
RUN apk update && \
apk add --no-cache gcc python3-dev openssl openssl-dev
RUN apk add --no-cache gcc python3-dev openssl openssl-dev
RUN pip install --upgrade pip && \
@ -37,9 +36,6 @@ RUN pip install dist/*.whl
# install dependencies as wheels
RUN pip wheel --no-cache-dir --wheel-dir=/wheels/ -r requirements.txt
# install semantic-cache [Experimental]- we need this here and not in requirements.txt because redisvl pins to pydantic 1.0
RUN pip install redisvl==0.0.7 --no-deps
# ensure pyjwt is used, not jwt
RUN pip uninstall jwt -y
RUN pip uninstall PyJWT -y
@ -55,8 +51,7 @@ FROM $LITELLM_RUNTIME_IMAGE AS runtime
USER root
# Install runtime dependencies
RUN apk update && \
apk add --no-cache openssl
RUN apk add --no-cache openssl
WORKDIR /app
# Copy the current directory contents into the container at /app

Makefile (new file, 35 lines)

@ -0,0 +1,35 @@
# LiteLLM Makefile
# Simple Makefile for running tests and basic development tasks
.PHONY: help test test-unit test-integration lint format
# Default target
help:
@echo "Available commands:"
@echo " make test - Run all tests"
@echo " make test-unit - Run unit tests"
@echo " make test-integration - Run integration tests"
@echo " make test-unit-helm - Run helm unit tests"
install-dev:
poetry install --with dev
install-proxy-dev:
poetry install --with dev,proxy-dev
lint: install-dev
poetry run pip install types-requests types-setuptools types-redis types-PyYAML
cd litellm && poetry run mypy . --ignore-missing-imports
# Testing
test:
poetry run pytest tests/
test-unit:
poetry run pytest tests/litellm/
test-integration:
poetry run pytest tests/ -k "not litellm"
test-unit-helm:
helm unittest -f 'tests/*.yaml' deploy/charts/litellm-helm


@ -16,9 +16,6 @@
<a href="https://pypi.org/project/litellm/" target="_blank">
<img src="https://img.shields.io/pypi/v/litellm.svg" alt="PyPI Version">
</a>
<a href="https://dl.circleci.com/status-badge/redirect/gh/BerriAI/litellm/tree/main" target="_blank">
<img src="https://dl.circleci.com/status-badge/img/gh/BerriAI/litellm/tree/main.svg?style=svg" alt="CircleCI">
</a>
<a href="https://www.ycombinator.com/companies/berriai">
<img src="https://img.shields.io/badge/Y%20Combinator-W23-orange?style=flat-square" alt="Y Combinator W23">
</a>
@ -40,7 +37,7 @@ LiteLLM manages:
[**Jump to LiteLLM Proxy (LLM Gateway) Docs**](https://github.com/BerriAI/litellm?tab=readme-ov-file#openai-proxy---docs) <br>
[**Jump to Supported LLM Providers**](https://github.com/BerriAI/litellm?tab=readme-ov-file#supported-providers-docs)
🚨 **Stable Release:** Use docker images with the `-stable` tag. These have undergone 12 hour load tests, before being published.
🚨 **Stable Release:** Use docker images with the `-stable` tag. These have undergone 12-hour load tests before being published. [More information about the release cycle here](https://docs.litellm.ai/docs/proxy/release_cycle)
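For illustration, here is a minimal docker-compose sketch pinned to the stable proxy image. It assumes the `main-stable` tag pushed by the release workflow earlier in this diff; a versioned `-stable` tag works the same way.

```yaml
# Sketch only: run the proxy from the published stable image.
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-stable
    ports:
      - "4000:4000"
    environment:
      STORE_MODEL_IN_DB: "True"  # allows adding models to the proxy via the UI
```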
Support for more providers. Missing a provider or LLM Platform, raise a [feature request](https://github.com/BerriAI/litellm/issues/new?assignees=&labels=enhancement&projects=&template=feature_request.yml&title=%5BFeature%5D%3A+).
@ -340,64 +337,7 @@ curl 'http://0.0.0.0:4000/key/generate' \
## Contributing
To contribute: Clone the repo locally -> Make a change -> Submit a PR with the change.
Here's how to modify the repo locally:
Step 1: Clone the repo
```
git clone https://github.com/BerriAI/litellm.git
```
Step 2: Navigate into the project, and install dependencies:
```
cd litellm
poetry install -E extra_proxy -E proxy
```
Step 3: Test your change:
```
cd tests # pwd: Documents/litellm/litellm/tests
poetry run flake8
poetry run pytest .
```
Step 4: Submit a PR with your changes! 🚀
- push your fork to your GitHub repo
- submit a PR from there
### Building LiteLLM Docker Image
Follow these instructions if you want to build / run the LiteLLM Docker Image yourself.
Step 1: Clone the repo
```
git clone https://github.com/BerriAI/litellm.git
```
Step 2: Build the Docker Image
Build using Dockerfile.non_root
```
docker build -f docker/Dockerfile.non_root -t litellm_test_image .
```
Step 3: Run the Docker Image
Make sure config.yaml is present in the root directory. This is your litellm proxy config file.
```
docker run \
-v $(pwd)/proxy_config.yaml:/app/config.yaml \
-e DATABASE_URL="postgresql://xxxxxxxx" \
-e LITELLM_MASTER_KEY="sk-1234" \
-p 4000:4000 \
litellm_test_image \
--config /app/config.yaml --detailed_debug
```
Interested in contributing? Contributions to the LiteLLM Python SDK, the Proxy Server, and new LLM integrations are all accepted and highly encouraged! [See our Contribution Guide for more details](https://docs.litellm.ai/docs/extras/contributing_code)
# Enterprise
For companies that need better security, user management and professional support

ci_cd/baseline_db.py (new file, 60 lines)

@ -0,0 +1,60 @@
import subprocess
from pathlib import Path
from datetime import datetime
def create_baseline():
"""Create baseline migration in deploy/migrations"""
try:
# Get paths
root_dir = Path(__file__).parent.parent
deploy_dir = root_dir / "deploy"
migrations_dir = deploy_dir / "migrations"
schema_path = root_dir / "schema.prisma"
# Create migrations directory
migrations_dir.mkdir(parents=True, exist_ok=True)
# Create migration_lock.toml if it doesn't exist
lock_file = migrations_dir / "migration_lock.toml"
if not lock_file.exists():
lock_file.write_text('provider = "postgresql"\n')
# Create timestamp-based migration directory
timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
migration_dir = migrations_dir / f"{timestamp}_baseline"
migration_dir.mkdir(parents=True, exist_ok=True)
# Generate migration SQL
result = subprocess.run(
[
"prisma",
"migrate",
"diff",
"--from-empty",
"--to-schema-datamodel",
str(schema_path),
"--script",
],
capture_output=True,
text=True,
check=True,
)
# Write the SQL to migration.sql
migration_file = migration_dir / "migration.sql"
migration_file.write_text(result.stdout)
print(f"Created baseline migration in {migration_dir}")
return True
except subprocess.CalledProcessError as e:
print(f"Error running prisma command: {e.stderr}")
return False
except Exception as e:
print(f"Error creating baseline migration: {str(e)}")
return False
if __name__ == "__main__":
create_baseline()


@ -0,0 +1,19 @@
#!/bin/bash
# Exit on error
set -e
echo "🚀 Building and publishing litellm-proxy-extras"
# Navigate to litellm-proxy-extras directory
cd "$(dirname "$0")/../litellm-proxy-extras"
# Build the package
echo "📦 Building package..."
poetry build
# Publish to PyPI
echo "🌎 Publishing to PyPI..."
poetry publish
echo "✅ Done! Package published successfully"

ci_cd/run_migration.py (new file, 95 lines)

@ -0,0 +1,95 @@
import os
import subprocess
from pathlib import Path
from datetime import datetime
import testing.postgresql
import shutil
def create_migration(migration_name: str = None):
"""
Create a new migration SQL file in the migrations directory by comparing
current database state with schema
Args:
migration_name (str): Name for the migration
"""
try:
# Get paths
root_dir = Path(__file__).parent.parent
migrations_dir = root_dir / "litellm-proxy-extras" / "litellm_proxy_extras" / "migrations"
schema_path = root_dir / "schema.prisma"
# Create temporary PostgreSQL database
with testing.postgresql.Postgresql() as postgresql:
db_url = postgresql.url()
# Create temporary migrations directory next to schema.prisma
temp_migrations_dir = schema_path.parent / "migrations"
try:
# Copy existing migrations to temp directory
if temp_migrations_dir.exists():
shutil.rmtree(temp_migrations_dir)
shutil.copytree(migrations_dir, temp_migrations_dir)
# Apply existing migrations to temp database
os.environ["DATABASE_URL"] = db_url
subprocess.run(
["prisma", "migrate", "deploy", "--schema", str(schema_path)],
check=True,
)
# Generate diff between current database and schema
result = subprocess.run(
[
"prisma",
"migrate",
"diff",
"--from-url",
db_url,
"--to-schema-datamodel",
str(schema_path),
"--script",
],
capture_output=True,
text=True,
check=True,
)
if result.stdout.strip():
# Generate timestamp and create migration directory
timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
migration_name = migration_name or "unnamed_migration"
migration_dir = migrations_dir / f"{timestamp}_{migration_name}"
migration_dir.mkdir(parents=True, exist_ok=True)
# Write the SQL to migration.sql
migration_file = migration_dir / "migration.sql"
migration_file.write_text(result.stdout)
print(f"Created migration in {migration_dir}")
return True
else:
print("No schema changes detected. Migration not needed.")
return False
finally:
# Clean up: remove temporary migrations directory
if temp_migrations_dir.exists():
shutil.rmtree(temp_migrations_dir)
except subprocess.CalledProcessError as e:
print(f"Error generating migration: {e.stderr}")
return False
except Exception as e:
print(f"Error creating migration: {str(e)}")
return False
if __name__ == "__main__":
# If running directly, can optionally pass migration name as argument
import sys
migration_name = sys.argv[1] if len(sys.argv) > 1 else None
create_migration(migration_name)


@ -6,8 +6,9 @@
"id": "9dKM5k8qsMIj"
},
"source": [
"## LiteLLM HuggingFace\n",
"Docs for huggingface: https://docs.litellm.ai/docs/providers/huggingface"
"## LiteLLM Hugging Face\n",
"\n",
"Docs for huggingface: https://docs.litellm.ai/docs/providers/huggingface\n"
]
},
{
@ -27,23 +28,18 @@
"id": "yp5UXRqtpu9f"
},
"source": [
"## Hugging Face Free Serverless Inference API\n",
"Read more about the Free Serverless Inference API here: https://huggingface.co/docs/api-inference.\n",
"## Serverless Inference Providers\n",
"\n",
"In order to use litellm to call Serverless Inference API:\n",
"Read more about Inference Providers here: https://huggingface.co/blog/inference-providers.\n",
"\n",
"* Browse Serverless Inference compatible models here: https://huggingface.co/models?inference=warm&pipeline_tag=text-generation.\n",
"* Copy the model name from hugging face\n",
"* Set `model = \"huggingface/<model-name>\"`\n",
"In order to use litellm with Hugging Face Inference Providers, you need to set `model=huggingface/<provider>/<model-id>`.\n",
"\n",
"Example set `model=huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct` to call `meta-llama/Meta-Llama-3.1-8B-Instruct`\n",
"\n",
"https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct"
"Example: `huggingface/together/deepseek-ai/DeepSeek-R1` to run DeepSeek-R1 (https://huggingface.co/deepseek-ai/DeepSeek-R1) through Together AI.\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@ -51,107 +47,18 @@
"id": "Pi5Oww8gpCUm",
"outputId": "659a67c7-f90d-4c06-b94e-2c4aa92d897a"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ModelResponse(id='chatcmpl-c54dfb68-1491-4d68-a4dc-35e603ea718a', choices=[Choices(finish_reason='eos_token', index=0, message=Message(content=\"I'm just a computer program, so I don't have feelings, but thank you for asking! How can I assist you today?\", role='assistant', tool_calls=None, function_call=None))], created=1724858285, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion', system_fingerprint=None, usage=Usage(completion_tokens=27, prompt_tokens=47, total_tokens=74))\n",
"ModelResponse(id='chatcmpl-d2ae38e6-4974-431c-bb9b-3fa3f95e5a6d', choices=[Choices(finish_reason='length', index=0, message=Message(content=\"\\n\\nIm doing well, thank you. Ive been keeping busy with work and some personal projects. How about you?\\n\\nI'm doing well, thank you. I've been enjoying some time off and catching up on some reading. How can I assist you today?\\n\\nI'm looking for a good book to read. Do you have any recommendations?\\n\\nOf course! Here are a few book recommendations across different genres:\\n\\n1.\", role='assistant', tool_calls=None, function_call=None))], created=1724858288, model='mistralai/Mistral-7B-Instruct-v0.3', object='chat.completion', system_fingerprint=None, usage=Usage(completion_tokens=85, prompt_tokens=6, total_tokens=91))\n"
]
}
],
"outputs": [],
"source": [
"import os\n",
"import litellm\n",
"from litellm import completion\n",
"\n",
"# Make sure to create an API_KEY with inference permissions at https://huggingface.co/settings/tokens/new?globalPermissions=inference.serverless.write&tokenType=fineGrained\n",
"os.environ[\"HUGGINGFACE_API_KEY\"] = \"\"\n",
"# You can create a HF token here: https://huggingface.co/settings/tokens\n",
"os.environ[\"HF_TOKEN\"] = \"hf_xxxxxx\"\n",
"\n",
"# Call https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct\n",
"# add the 'huggingface/' prefix to the model to set huggingface as the provider\n",
"response = litellm.completion(\n",
" model=\"huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
" messages=[{ \"content\": \"Hello, how are you?\",\"role\": \"user\"}]\n",
")\n",
"print(response)\n",
"\n",
"\n",
"# Call https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3\n",
"response = litellm.completion(\n",
" model=\"huggingface/mistralai/Mistral-7B-Instruct-v0.3\",\n",
" messages=[{ \"content\": \"Hello, how are you?\",\"role\": \"user\"}]\n",
")\n",
"print(response)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-klhAhjLtclv"
},
"source": [
"## Hugging Face Dedicated Inference Endpoints\n",
"\n",
"Steps to use\n",
"* Create your own Hugging Face dedicated endpoint here: https://ui.endpoints.huggingface.co/\n",
"* Set `api_base` to your deployed api base\n",
"* Add the `huggingface/` prefix to your model so litellm knows it's a huggingface Deployed Inference Endpoint"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Lbmw8Gl_pHns",
"outputId": "ea8408bf-1cc3-4670-ecea-f12666d204a8"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"object\": \"chat.completion\",\n",
" \"choices\": [\n",
" {\n",
" \"finish_reason\": \"length\",\n",
" \"index\": 0,\n",
" \"message\": {\n",
" \"content\": \"\\n\\nI am doing well, thank you for asking. How about you?\\nI am doing\",\n",
" \"role\": \"assistant\",\n",
" \"logprobs\": -8.9481967812\n",
" }\n",
" }\n",
" ],\n",
" \"id\": \"chatcmpl-74dc9d89-3916-47ce-9bea-b80e66660f77\",\n",
" \"created\": 1695871068.8413374,\n",
" \"model\": \"glaiveai/glaive-coder-7b\",\n",
" \"usage\": {\n",
" \"prompt_tokens\": 6,\n",
" \"completion_tokens\": 18,\n",
" \"total_tokens\": 24\n",
" }\n",
"}\n"
]
}
],
"source": [
"import os\n",
"import litellm\n",
"\n",
"os.environ[\"HUGGINGFACE_API_KEY\"] = \"\"\n",
"\n",
"# TGI model: Call https://huggingface.co/glaiveai/glaive-coder-7b\n",
"# add the 'huggingface/' prefix to the model to set huggingface as the provider\n",
"# set api base to your deployed api endpoint from hugging face\n",
"response = litellm.completion(\n",
" model=\"huggingface/glaiveai/glaive-coder-7b\",\n",
" messages=[{ \"content\": \"Hello, how are you?\",\"role\": \"user\"}],\n",
" api_base=\"https://wjiegasee9bmqke2.us-east-1.aws.endpoints.huggingface.cloud\"\n",
"# Call DeepSeek-R1 model through Together AI\n",
"response = completion(\n",
" model=\"huggingface/together/deepseek-ai/DeepSeek-R1\",\n",
" messages=[{\"content\": \"How many r's are in the word `strawberry`?\", \"role\": \"user\"}],\n",
")\n",
"print(response)"
]
@ -162,13 +69,12 @@
"id": "EU0UubrKzTFe"
},
"source": [
"## HuggingFace - Streaming (Serveless or Dedicated)\n",
"Set stream = True"
"## Streaming\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@ -176,74 +82,147 @@
"id": "y-QfIvA-uJKX",
"outputId": "b007bb98-00d0-44a4-8264-c8a2caed6768"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<litellm.utils.CustomStreamWrapper object at 0x1278471d0>\n",
"ModelResponse(id='chatcmpl-ffeb4491-624b-4ddf-8005-60358cf67d36', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content='I', role='assistant', function_call=None, tool_calls=None), logprobs=None)], created=1724858353, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion.chunk', system_fingerprint=None)\n",
"ModelResponse(id='chatcmpl-ffeb4491-624b-4ddf-8005-60358cf67d36', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=\"'m\", role=None, function_call=None, tool_calls=None), logprobs=None)], created=1724858353, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion.chunk', system_fingerprint=None)\n",
"ModelResponse(id='chatcmpl-ffeb4491-624b-4ddf-8005-60358cf67d36', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=' just', role=None, function_call=None, tool_calls=None), logprobs=None)], created=1724858353, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion.chunk', system_fingerprint=None)\n",
"ModelResponse(id='chatcmpl-ffeb4491-624b-4ddf-8005-60358cf67d36', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=' a', role=None, function_call=None, tool_calls=None), logprobs=None)], created=1724858353, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion.chunk', system_fingerprint=None)\n",
"ModelResponse(id='chatcmpl-ffeb4491-624b-4ddf-8005-60358cf67d36', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=' computer', role=None, function_call=None, tool_calls=None), logprobs=None)], created=1724858353, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion.chunk', system_fingerprint=None)\n",
"ModelResponse(id='chatcmpl-ffeb4491-624b-4ddf-8005-60358cf67d36', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=' program', role=None, function_call=None, tool_calls=None), logprobs=None)], created=1724858353, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion.chunk', system_fingerprint=None)\n",
"ModelResponse(id='chatcmpl-ffeb4491-624b-4ddf-8005-60358cf67d36', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=',', role=None, function_call=None, tool_calls=None), logprobs=None)], created=1724858353, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion.chunk', system_fingerprint=None)\n",
"ModelResponse(id='chatcmpl-ffeb4491-624b-4ddf-8005-60358cf67d36', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=' so', role=None, function_call=None, tool_calls=None), logprobs=None)], created=1724858353, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion.chunk', system_fingerprint=None)\n",
"ModelResponse(id='chatcmpl-ffeb4491-624b-4ddf-8005-60358cf67d36', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=' I', role=None, function_call=None, tool_calls=None), logprobs=None)], created=1724858353, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion.chunk', system_fingerprint=None)\n",
"ModelResponse(id='chatcmpl-ffeb4491-624b-4ddf-8005-60358cf67d36', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=' don', role=None, function_call=None, tool_calls=None), logprobs=None)], created=1724858353, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion.chunk', system_fingerprint=None)\n",
"ModelResponse(id='chatcmpl-ffeb4491-624b-4ddf-8005-60358cf67d36', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=\"'t\", role=None, function_call=None, tool_calls=None), logprobs=None)], created=1724858353, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion.chunk', system_fingerprint=None)\n",
"ModelResponse(id='chatcmpl-ffeb4491-624b-4ddf-8005-60358cf67d36', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=' have', role=None, function_call=None, tool_calls=None), logprobs=None)], created=1724858353, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion.chunk', system_fingerprint=None)\n",
"ModelResponse(id='chatcmpl-ffeb4491-624b-4ddf-8005-60358cf67d36', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=' feelings', role=None, function_call=None, tool_calls=None), logprobs=None)], created=1724858353, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion.chunk', system_fingerprint=None)\n",
"ModelResponse(id='chatcmpl-ffeb4491-624b-4ddf-8005-60358cf67d36', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=',', role=None, function_call=None, tool_calls=None), logprobs=None)], created=1724858353, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion.chunk', system_fingerprint=None)\n",
"ModelResponse(id='chatcmpl-ffeb4491-624b-4ddf-8005-60358cf67d36', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=' but', role=None, function_call=None, tool_calls=None), logprobs=None)], created=1724858353, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion.chunk', system_fingerprint=None)\n",
"ModelResponse(id='chatcmpl-ffeb4491-624b-4ddf-8005-60358cf67d36', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=' thank', role=None, function_call=None, tool_calls=None), logprobs=None)], created=1724858353, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion.chunk', system_fingerprint=None)\n",
"ModelResponse(id='chatcmpl-ffeb4491-624b-4ddf-8005-60358cf67d36', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=' you', role=None, function_call=None, tool_calls=None), logprobs=None)], created=1724858353, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion.chunk', system_fingerprint=None)\n",
"ModelResponse(id='chatcmpl-ffeb4491-624b-4ddf-8005-60358cf67d36', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=' for', role=None, function_call=None, tool_calls=None), logprobs=None)], created=1724858353, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion.chunk', system_fingerprint=None)\n",
"ModelResponse(id='chatcmpl-ffeb4491-624b-4ddf-8005-60358cf67d36', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=' asking', role=None, function_call=None, tool_calls=None), logprobs=None)], created=1724858353, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion.chunk', system_fingerprint=None)\n",
"ModelResponse(id='chatcmpl-ffeb4491-624b-4ddf-8005-60358cf67d36', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content='!', role=None, function_call=None, tool_calls=None), logprobs=None)], created=1724858353, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion.chunk', system_fingerprint=None)\n",
"ModelResponse(id='chatcmpl-ffeb4491-624b-4ddf-8005-60358cf67d36', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=' How', role=None, function_call=None, tool_calls=None), logprobs=None)], created=1724858353, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion.chunk', system_fingerprint=None)\n",
"ModelResponse(id='chatcmpl-ffeb4491-624b-4ddf-8005-60358cf67d36', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=' can', role=None, function_call=None, tool_calls=None), logprobs=None)], created=1724858353, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion.chunk', system_fingerprint=None)\n",
"ModelResponse(id='chatcmpl-ffeb4491-624b-4ddf-8005-60358cf67d36', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=' I', role=None, function_call=None, tool_calls=None), logprobs=None)], created=1724858353, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion.chunk', system_fingerprint=None)\n",
"ModelResponse(id='chatcmpl-ffeb4491-624b-4ddf-8005-60358cf67d36', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=' assist', role=None, function_call=None, tool_calls=None), logprobs=None)], created=1724858353, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion.chunk', system_fingerprint=None)\n",
"ModelResponse(id='chatcmpl-ffeb4491-624b-4ddf-8005-60358cf67d36', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=' you', role=None, function_call=None, tool_calls=None), logprobs=None)], created=1724858353, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion.chunk', system_fingerprint=None)\n",
"ModelResponse(id='chatcmpl-ffeb4491-624b-4ddf-8005-60358cf67d36', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content=' today', role=None, function_call=None, tool_calls=None), logprobs=None)], created=1724858353, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion.chunk', system_fingerprint=None)\n",
"ModelResponse(id='chatcmpl-ffeb4491-624b-4ddf-8005-60358cf67d36', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content='?', role=None, function_call=None, tool_calls=None), logprobs=None)], created=1724858353, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion.chunk', system_fingerprint=None)\n",
"ModelResponse(id='chatcmpl-ffeb4491-624b-4ddf-8005-60358cf67d36', choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(content='<|eot_id|>', role=None, function_call=None, tool_calls=None), logprobs=None)], created=1724858353, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion.chunk', system_fingerprint=None)\n",
"ModelResponse(id='chatcmpl-ffeb4491-624b-4ddf-8005-60358cf67d36', choices=[StreamingChoices(finish_reason='stop', index=0, delta=Delta(content=None, role=None, function_call=None, tool_calls=None), logprobs=None)], created=1724858353, model='meta-llama/Meta-Llama-3.1-8B-Instruct', object='chat.completion.chunk', system_fingerprint=None)\n"
]
}
],
"outputs": [],
"source": [
"import os\n",
"import litellm\n",
"from litellm import completion\n",
"\n",
"# Make sure to create an API_KEY with inference permissions at https://huggingface.co/settings/tokens/new?globalPermissions=inference.serverless.write&tokenType=fineGrained\n",
"os.environ[\"HUGGINGFACE_API_KEY\"] = \"\"\n",
"os.environ[\"HF_TOKEN\"] = \"hf_xxxxxx\"\n",
"\n",
"# Call https://huggingface.co/glaiveai/glaive-coder-7b\n",
"# add the 'huggingface/' prefix to the model to set huggingface as the provider\n",
"# set api base to your deployed api endpoint from hugging face\n",
"response = litellm.completion(\n",
" model=\"huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
" messages=[{ \"content\": \"Hello, how are you?\",\"role\": \"user\"}],\n",
" stream=True\n",
"response = completion(\n",
" model=\"huggingface/together/deepseek-ai/DeepSeek-R1\",\n",
" messages=[\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": \"How many r's are in the word `strawberry`?\",\n",
" \n",
" }\n",
" ],\n",
" stream=True,\n",
")\n",
"\n",
"print(response)\n",
"\n",
"for chunk in response:\n",
" print(chunk)"
" print(chunk)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## With images as input\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "CKXAnK55zQRl"
},
"metadata": {},
"outputs": [],
"source": []
"source": [
"from litellm import completion\n",
"\n",
"# Set your Hugging Face Token\n",
"os.environ[\"HF_TOKEN\"] = \"hf_xxxxxx\"\n",
"\n",
"messages = [\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\"type\": \"text\", \"text\": \"What's in this image?\"},\n",
" {\n",
" \"type\": \"image_url\",\n",
" \"image_url\": {\n",
" \"url\": \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg\",\n",
" },\n",
" },\n",
" ],\n",
" }\n",
"]\n",
"\n",
"response = completion(\n",
" model=\"huggingface/sambanova/meta-llama/Llama-3.3-70B-Instruct\",\n",
" messages=messages,\n",
")\n",
"print(response.choices[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tools - Function Calling\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from litellm import completion\n",
"\n",
"\n",
"# Set your Hugging Face Token\n",
"os.environ[\"HF_TOKEN\"] = \"hf_xxxxxx\"\n",
"\n",
"tools = [\n",
" {\n",
" \"type\": \"function\",\n",
" \"function\": {\n",
" \"name\": \"get_current_weather\",\n",
" \"description\": \"Get the current weather in a given location\",\n",
" \"parameters\": {\n",
" \"type\": \"object\",\n",
" \"properties\": {\n",
" \"location\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"The city and state, e.g. San Francisco, CA\",\n",
" },\n",
" \"unit\": {\"type\": \"string\", \"enum\": [\"celsius\", \"fahrenheit\"]},\n",
" },\n",
" \"required\": [\"location\"],\n",
" },\n",
" },\n",
" }\n",
"]\n",
"messages = [{\"role\": \"user\", \"content\": \"What's the weather like in Boston today?\"}]\n",
"\n",
"response = completion(\n",
" model=\"huggingface/sambanova/meta-llama/Llama-3.1-8B-Instruct\", messages=messages, tools=tools, tool_choice=\"auto\"\n",
")\n",
"print(response)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Hugging Face Dedicated Inference Endpoints\n",
"\n",
"Steps to use\n",
"\n",
"- Create your own Hugging Face dedicated endpoint here: https://ui.endpoints.huggingface.co/\n",
"- Set `api_base` to your deployed api base\n",
"- set the model to `huggingface/tgi` so that litellm knows it's a huggingface Deployed Inference Endpoint.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import litellm\n",
"\n",
"\n",
"response = litellm.completion(\n",
" model=\"huggingface/tgi\",\n",
" messages=[{\"content\": \"Hello, how are you?\", \"role\": \"user\"}],\n",
" api_base=\"https://my-endpoint.endpoints.huggingface.cloud/v1/\",\n",
")\n",
"print(response)"
]
}
],
"metadata": {
@ -251,7 +230,8 @@
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
@ -264,7 +244,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.2"
"version": "3.12.0"
}
},
"nbformat": 4,


@ -1 +1 @@
litellm==1.55.3
litellm==1.61.15


@ -1,2 +1,11 @@
python3 -m build
twine upload --verbose dist/litellm-1.18.13.dev4.tar.gz -u __token__ -
twine upload --verbose dist/litellm-1.18.13.dev4.tar.gz -u __token__ -
Note: You might need to create a MANIFEST.ini file at the repo root in case the build process fails.
Place this in MANIFEST.ini:
recursive-exclude venv *
recursive-exclude myenv *
recursive-exclude py313_env *
recursive-exclude **/.venv *


@ -18,7 +18,7 @@ type: application
# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 0.3.0
version: 0.4.3
# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to


@ -22,6 +22,8 @@ If `db.useStackgresOperator` is used (not yet implemented):
| Name | Description | Value |
| ---------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----- |
| `replicaCount` | The number of LiteLLM Proxy pods to be deployed | `1` |
| `masterkeySecretName` | The name of the Kubernetes Secret that contains the Master API Key for LiteLLM. If not specified, use the generated secret name. | N/A |
| `masterkeySecretKey` | The key within the Kubernetes Secret that contains the Master API Key for LiteLLM. If not specified, use `masterkey` as the key. | N/A |
| `masterkey` | The Master API Key for LiteLLM. If not specified, a random key is generated. | N/A |
| `environmentSecrets` | An optional array of Secret object names. The keys and values in these secrets will be presented to the LiteLLM proxy pod as environment variables. See below for an example Secret object. | `[]` |
| `environmentConfigMaps` | An optional array of ConfigMap object names. The keys and values in these configmaps will be presented to the LiteLLM proxy pod as environment variables. See below for an example Secret object. | `[]` |
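As a sketch of how `masterkeySecretName` and `masterkeySecretKey` from the table above fit together, a `values.yaml` override that reuses an existing Kubernetes Secret might look like the following (both names are illustrative, not part of the chart):

```yaml
# Point the chart at a pre-existing Secret instead of the autogenerated one.
masterkeySecretName: my-existing-secret  # Secret that already holds the master key
masterkeySecretKey: my-key               # key within that Secret containing the key value
```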


@ -78,8 +78,8 @@ spec:
- name: PROXY_MASTER_KEY
valueFrom:
secretKeyRef:
name: {{ include "litellm.fullname" . }}-masterkey
key: masterkey
name: {{ .Values.masterkeySecretName | default (printf "%s-masterkey" (include "litellm.fullname" .)) }}
key: {{ .Values.masterkeySecretKey | default "masterkey" }}
{{- if .Values.redis.enabled }}
- name: REDIS_HOST
value: {{ include "litellm.redis.serviceName" . }}
@ -97,6 +97,9 @@ spec:
value: {{ $val | quote }}
{{- end }}
{{- end }}
{{- with .Values.extraEnvVars }}
{{- toYaml . | nindent 12 }}
{{- end }}
envFrom:
{{- range .Values.environmentSecrets }}
- secretRef:


@ -48,6 +48,23 @@ spec:
{{- end }}
- name: DISABLE_SCHEMA_UPDATE
value: "false" # always run the migration from the Helm PreSync hook, override the value set
{{- with .Values.volumeMounts }}
volumeMounts:
{{- toYaml . | nindent 12 }}
{{- end }}
{{- with .Values.volumes }}
volumes:
{{- toYaml . | nindent 8 }}
{{- end }}
restartPolicy: OnFailure
{{- with .Values.affinity }}
affinity:
{{- toYaml . | nindent 8 }}
{{- end }}
{{- with .Values.tolerations }}
tolerations:
{{- toYaml . | nindent 8 }}
{{- end }}
ttlSecondsAfterFinished: {{ .Values.migrationJob.ttlSecondsAfterFinished }}
backoffLimit: {{ .Values.migrationJob.backoffLimit }}
{{- end }}


@ -1,3 +1,4 @@
{{- if not .Values.masterkeySecretName }}
{{ $masterkey := (.Values.masterkey | default (randAlphaNum 17)) }}
apiVersion: v1
kind: Secret
@ -5,4 +6,5 @@ metadata:
name: {{ include "litellm.fullname" . }}-masterkey
data:
masterkey: {{ $masterkey | b64enc }}
type: Opaque
type: Opaque
{{- end }}


@ -2,6 +2,10 @@ apiVersion: v1
kind: Service
metadata:
name: {{ include "litellm.fullname" . }}
{{- with .Values.service.annotations }}
annotations:
{{- toYaml . | nindent 4 }}
{{- end }}
labels:
{{- include "litellm.labels" . | nindent 4 }}
spec:


@ -0,0 +1,117 @@
suite: test deployment
templates:
- deployment.yaml
- configmap-litellm.yaml
tests:
- it: should work
template: deployment.yaml
set:
image.tag: test
asserts:
- isKind:
of: Deployment
- matchRegex:
path: metadata.name
pattern: -litellm$
- equal:
path: spec.template.spec.containers[0].image
value: ghcr.io/berriai/litellm-database:test
- it: should work with tolerations
template: deployment.yaml
set:
tolerations:
- key: node-role.kubernetes.io/master
operator: Exists
effect: NoSchedule
asserts:
- equal:
path: spec.template.spec.tolerations[0].key
value: node-role.kubernetes.io/master
- equal:
path: spec.template.spec.tolerations[0].operator
value: Exists
- it: should work with affinity
template: deployment.yaml
set:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: topology.kubernetes.io/zone
operator: In
values:
- antarctica-east1
asserts:
- equal:
path: spec.template.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].key
value: topology.kubernetes.io/zone
- equal:
path: spec.template.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].operator
value: In
- equal:
path: spec.template.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms[0].matchExpressions[0].values[0]
value: antarctica-east1
- it: should work without masterkeySecretName or masterkeySecretKey
template: deployment.yaml
set:
masterkeySecretName: ""
masterkeySecretKey: ""
asserts:
- contains:
path: spec.template.spec.containers[0].env
content:
name: PROXY_MASTER_KEY
valueFrom:
secretKeyRef:
name: RELEASE-NAME-litellm-masterkey
key: masterkey
- it: should work with masterkeySecretName and masterkeySecretKey
template: deployment.yaml
set:
masterkeySecretName: my-secret
masterkeySecretKey: my-key
asserts:
- contains:
path: spec.template.spec.containers[0].env
content:
name: PROXY_MASTER_KEY
valueFrom:
secretKeyRef:
name: my-secret
key: my-key
- it: should work with extraEnvVars
template: deployment.yaml
set:
extraEnvVars:
- name: EXTRA_ENV_VAR
valueFrom:
fieldRef:
fieldPath: metadata.labels['env']
asserts:
- contains:
path: spec.template.spec.containers[0].env
content:
name: EXTRA_ENV_VAR
valueFrom:
fieldRef:
fieldPath: metadata.labels['env']
- it: should work with both extraEnvVars and envVars
template: deployment.yaml
set:
envVars:
ENV_VAR: ENV_VAR_VALUE
extraEnvVars:
- name: EXTRA_ENV_VAR
value: EXTRA_ENV_VAR_VALUE
asserts:
- contains:
path: spec.template.spec.containers[0].env
content:
name: ENV_VAR
value: ENV_VAR_VALUE
- contains:
path: spec.template.spec.containers[0].env
content:
name: EXTRA_ENV_VAR
value: EXTRA_ENV_VAR_VALUE


@ -0,0 +1,18 @@
suite: test masterkey secret
templates:
- secret-masterkey.yaml
tests:
- it: should create a secret if masterkeySecretName is not set
template: secret-masterkey.yaml
set:
masterkeySecretName: ""
asserts:
- isKind:
of: Secret
- it: should not create a secret if masterkeySecretName is set
template: secret-masterkey.yaml
set:
masterkeySecretName: my-secret
asserts:
- hasDocuments:
count: 0


@ -75,6 +75,12 @@ ingress:
# masterkey: changeit
# if set, use this secret for the master key; otherwise, autogenerate a new one
masterkeySecretName: ""
# if set, use this secret key for the master key; otherwise, use the default key
masterkeySecretKey: ""
# The elements within proxy_config are rendered as config.yaml for the proxy
# Examples: https://github.com/BerriAI/litellm/tree/main/litellm/proxy/example_config_yaml
# Reference: https://docs.litellm.ai/docs/proxy/configs
@ -187,10 +193,17 @@ migrationJob:
backoffLimit: 4 # Backoff limit for Job restarts
disableSchemaUpdate: false # Skip schema migrations for specific environments. When True, the job will exit with code 0.
annotations: {}
ttlSecondsAfterFinished: 120
# Additional environment variables to be added to the deployment
# Additional environment variables to be added to the deployment as a map of key-value pairs
envVars: {
# USE_DDTRACE: "true"
}
# Additional environment variables to be added to the deployment as a list of k8s env vars
extraEnvVars: {
# - name: EXTRA_ENV_VAR
# value: EXTRA_ENV_VAR_VALUE
}
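A sketch of the uncommented list form, mirroring the chart's deployment unit tests (the variable names are placeholders):

```yaml
extraEnvVars:
  - name: EXTRA_ENV_VAR
    value: EXTRA_ENV_VAR_VALUE
  - name: POD_ENV_LABEL            # hypothetical name, shown only to illustrate valueFrom
    valueFrom:
      fieldRef:
        fieldPath: metadata.labels['env']
```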


@ -20,10 +20,18 @@ services:
STORE_MODEL_IN_DB: "True" # allows adding models to proxy via UI
env_file:
- .env # Load local .env file
depends_on:
- db # Indicates that this service depends on the 'db' service, ensuring 'db' starts first
healthcheck: # Defines the health check configuration for the container
test: [ "CMD", "curl", "-f", "http://localhost:4000/health/liveliness || exit 1" ] # Command to execute for health check
interval: 30s # Perform health check every 30 seconds
timeout: 10s # Health check command times out after 10 seconds
retries: 3 # Retry up to 3 times if health check fails
start_period: 40s # Wait 40 seconds after container start before beginning health checks
db:
image: postgres
image: postgres:16
restart: always
environment:
POSTGRES_DB: litellm
@ -31,6 +39,8 @@ services:
POSTGRES_PASSWORD: dbpassword9090
ports:
- "5432:5432"
volumes:
- postgres_data:/var/lib/postgresql/data # Persists Postgres data across container restarts
healthcheck:
test: ["CMD-SHELL", "pg_isready -d litellm -U llmproxy"]
interval: 1s
@ -53,6 +63,6 @@ services:
volumes:
prometheus_data:
driver: local
postgres_data:
name: litellm_postgres_data # Named volume for Postgres data persistence
# ...rest of your docker-compose config if any


@ -35,7 +35,7 @@ RUN pip wheel --no-cache-dir --wheel-dir=/wheels/ -r requirements.txt
FROM $LITELLM_RUNTIME_IMAGE AS runtime
# Update dependencies and clean up
RUN apk update && apk upgrade && rm -rf /var/cache/apk/*
RUN apk upgrade --no-cache
WORKDIR /app

View file

@ -12,8 +12,7 @@ WORKDIR /app
USER root
# Install build dependencies
RUN apk update && \
apk add --no-cache gcc python3-dev openssl openssl-dev
RUN apk add --no-cache gcc python3-dev openssl openssl-dev
RUN pip install --upgrade pip && \
@ -44,8 +43,7 @@ FROM $LITELLM_RUNTIME_IMAGE AS runtime
USER root
# Install runtime dependencies
RUN apk update && \
apk add --no-cache openssl
RUN apk add --no-cache openssl
WORKDIR /app
# Copy the current directory contents into the container at /app
@ -59,9 +57,6 @@ COPY --from=builder /wheels/ /wheels/
# Install the built wheel using pip; again using a wildcard if it's the only file
RUN pip install *.whl /wheels/* --no-index --find-links=/wheels/ && rm -f *.whl && rm -rf /wheels
# install semantic-cache [Experimental]- we need this here and not in requirements.txt because redisvl pins to pydantic 1.0
RUN pip install redisvl==0.0.7 --no-deps
# ensure pyjwt is used, not jwt
RUN pip uninstall jwt -y
RUN pip uninstall PyJWT -y

View file

@ -14,7 +14,7 @@ SHELL ["/bin/bash", "-o", "pipefail", "-c"]
# Install build dependencies
RUN apt-get clean && apt-get update && \
apt-get install -y gcc python3-dev && \
apt-get install -y gcc g++ python3-dev && \
rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir --upgrade pip && \
@ -56,10 +56,8 @@ COPY --from=builder /wheels/ /wheels/
# Install the built wheel using pip; again using a wildcard if it's the only file
RUN pip install *.whl /wheels/* --no-index --find-links=/wheels/ && rm -f *.whl && rm -rf /wheels
# install semantic-cache [Experimental]- we need this here and not in requirements.txt because redisvl pins to pydantic 1.0
# ensure pyjwt is used, not jwt
RUN pip install redisvl==0.0.7 --no-deps --no-cache-dir && \
pip uninstall jwt -y && \
RUN pip uninstall jwt -y && \
pip uninstall PyJWT -y && \
pip install PyJWT==2.9.0 --no-cache-dir

View file

@ -0,0 +1,301 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# /v1/messages [BETA]
Use LiteLLM to call all your LLM APIs in the Anthropic `v1/messages` format.
## Overview
| Feature | Supported | Notes |
|-------|-------|-------|
| Cost Tracking | ✅ | |
| Logging | ✅ | works across all integrations |
| End-user Tracking | ✅ | |
| Streaming | ✅ | |
| Fallbacks | ✅ | between anthropic models |
| Loadbalancing | ✅ | between anthropic models |
Planned improvements:
- Vertex AI Anthropic support
- Bedrock Anthropic support
## Usage
---
### LiteLLM Python SDK
#### Non-streaming example
```python showLineNumbers title="Example using LiteLLM Python SDK"
import litellm
response = await litellm.anthropic.messages.acreate(
messages=[{"role": "user", "content": "Hello, can you tell me a short joke?"}],
api_key=api_key,
model="anthropic/claude-3-haiku-20240307",
max_tokens=100,
)
```
Example response:
```json
{
"content": [
{
"text": "Hi! this is a very short joke",
"type": "text"
}
],
"id": "msg_013Zva2CMHLNnXjNJJKqJ2EF",
"model": "claude-3-7-sonnet-20250219",
"role": "assistant",
"stop_reason": "end_turn",
"stop_sequence": null,
"type": "message",
"usage": {
"input_tokens": 2095,
"output_tokens": 503,
"cache_creation_input_tokens": 2095,
"cache_read_input_tokens": 0
}
}
```
#### Streaming example
```python showLineNumbers title="Example using LiteLLM Python SDK"
import litellm
response = await litellm.anthropic.messages.acreate(
messages=[{"role": "user", "content": "Hello, can you tell me a short joke?"}],
api_key=api_key,
model="anthropic/claude-3-haiku-20240307",
max_tokens=100,
stream=True,
)
async for chunk in response:
print(chunk)
```
### LiteLLM Proxy Server
1. Setup config.yaml
```yaml
model_list:
- model_name: anthropic-claude
litellm_params:
model: claude-3-7-sonnet-latest
```
2. Start proxy
```bash
litellm --config /path/to/config.yaml
```
3. Test it!
<Tabs>
<TabItem label="Anthropic Python SDK" value="python">
```python showLineNumbers title="Example using LiteLLM Proxy Server"
import anthropic
# point anthropic sdk to litellm proxy
client = anthropic.Anthropic(
base_url="http://0.0.0.0:4000",
api_key="sk-1234",
)
response = client.messages.create(
messages=[{"role": "user", "content": "Hello, can you tell me a short joke?"}],
model="anthropic-claude",
max_tokens=100,
)
```
</TabItem>
<TabItem label="curl" value="curl">
```bash showLineNumbers title="Example using LiteLLM Proxy Server"
curl -L -X POST 'http://0.0.0.0:4000/v1/messages' \
-H 'content-type: application/json' \
-H 'x-api-key: $LITELLM_API_KEY' \
-H 'anthropic-version: 2023-06-01' \
-d '{
"model": "anthropic-claude",
"messages": [
{
"role": "user",
"content": "Hello, can you tell me a short joke?"
}
],
"max_tokens": 100
}'
```
</TabItem>
</Tabs>
## Request Format
---
Request body will be in the Anthropic messages API format. **litellm follows the Anthropic messages specification for this endpoint.**
#### Example request body
```json
{
"model": "claude-3-7-sonnet-20250219",
"max_tokens": 1024,
"messages": [
{
"role": "user",
"content": "Hello, world"
}
]
}
```
#### Required Fields
- **model** (string):
The model identifier (e.g., `"claude-3-7-sonnet-20250219"`).
- **max_tokens** (integer):
The maximum number of tokens to generate before stopping.
_Note: The model may stop before reaching this limit; value must be greater than 1._
- **messages** (array of objects):
An ordered list of conversational turns.
Each message object must include:
- **role** (enum: `"user"` or `"assistant"`):
Specifies the speaker of the message.
- **content** (string or array of content blocks):
The text or content blocks (e.g., an array containing objects with a `type` such as `"text"`) that form the message.
_Example equivalence:_
```json
{"role": "user", "content": "Hello, Claude"}
```
is equivalent to:
```json
{"role": "user", "content": [{"type": "text", "text": "Hello, Claude"}]}
```
#### Optional Fields
- **metadata** (object):
Contains additional metadata about the request (e.g., `user_id` as an opaque identifier).
- **stop_sequences** (array of strings):
Custom sequences that, when encountered in the generated text, cause the model to stop.
- **stream** (boolean):
Indicates whether to stream the response using server-sent events.
- **system** (string or array):
A system prompt providing context or specific instructions to the model.
- **temperature** (number):
Controls randomness in the model's responses. Valid range: `0 < temperature < 1`.
- **thinking** (object):
Configuration for enabling extended thinking. If enabled, it includes:
- **budget_tokens** (integer):
Minimum of 1024 tokens (and less than `max_tokens`).
- **type** (enum):
E.g., `"enabled"`.
- **tool_choice** (object):
Instructs how the model should utilize any provided tools.
- **tools** (array of objects):
Definitions for tools available to the model. Each tool includes:
- **name** (string):
The tool's name.
- **description** (string):
A detailed description of the tool.
- **input_schema** (object):
A JSON schema describing the expected input format for the tool.
- **top_k** (integer):
Limits sampling to the top K options.
- **top_p** (number):
Enables nucleus sampling with a cumulative probability cutoff. Valid range: `0 < top_p < 1`.
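For a quick illustration of how several of these optional fields fit together, here is a sketch of a non-streaming request. It assumes `litellm.anthropic.messages.acreate` forwards these spec fields as keyword arguments (as in the SDK example above); the API key and model values are placeholders.

```python showLineNumbers title="Sketch: combining several optional fields"
import asyncio

import litellm


async def main():
    response = await litellm.anthropic.messages.acreate(
        model="anthropic/claude-3-haiku-20240307",
        api_key="sk-ant-...",                      # placeholder API key
        max_tokens=200,
        system="You are a concise assistant.",     # optional system prompt
        temperature=0.7,                           # optional, 0 < temperature < 1
        stop_sequences=["###"],                    # optional custom stop sequences
        metadata={"user_id": "user-123"},          # optional opaque end-user id
        messages=[{"role": "user", "content": "Tell me a short joke."}],
    )
    print(response)


asyncio.run(main())
```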
## Response Format
---
Responses will be in the Anthropic messages API format.
#### Example Response
```json
{
"content": [
{
"text": "Hi! My name is Claude.",
"type": "text"
}
],
"id": "msg_013Zva2CMHLNnXjNJJKqJ2EF",
"model": "claude-3-7-sonnet-20250219",
"role": "assistant",
"stop_reason": "end_turn",
"stop_sequence": null,
"type": "message",
"usage": {
"input_tokens": 2095,
"output_tokens": 503,
"cache_creation_input_tokens": 2095,
"cache_read_input_tokens": 0
}
}
```
#### Response fields
- **content** (array of objects):
Contains the generated content blocks from the model. Each block includes:
- **type** (string):
Indicates the type of content (e.g., `"text"`, `"tool_use"`, `"thinking"`, or `"redacted_thinking"`).
- **text** (string):
The generated text from the model.
_Note: Maximum length is 5,000,000 characters._
- **citations** (array of objects or `null`):
Optional field providing citation details. Each citation includes:
- **cited_text** (string):
The excerpt being cited.
- **document_index** (integer):
An index referencing the cited document.
- **document_title** (string or `null`):
The title of the cited document.
- **start_char_index** (integer):
The starting character index for the citation.
- **end_char_index** (integer):
The ending character index for the citation.
- **type** (string):
Typically `"char_location"`.
- **id** (string):
A unique identifier for the response message.
_Note: The format and length of IDs may change over time._
- **model** (string):
Specifies the model that generated the response.
- **role** (string):
Indicates the role of the generated message. For responses, this is always `"assistant"`.
- **stop_reason** (string):
Explains why the model stopped generating text. Possible values include:
- `"end_turn"`: The model reached a natural stopping point.
- `"max_tokens"`: The generation stopped because the maximum token limit was reached.
- `"stop_sequence"`: A custom stop sequence was encountered.
- `"tool_use"`: The model invoked one or more tools.
- **stop_sequence** (string or `null`):
Contains the specific stop sequence that caused the generation to halt, if applicable; otherwise, it is `null`.
- **type** (string):
Denotes the type of response object, which is always `"message"`.
- **usage** (object):
Provides details on token usage for billing and rate limiting. This includes:
- **input_tokens** (integer):
Total number of input tokens processed.
- **output_tokens** (integer):
Total number of output tokens generated.
- **cache_creation_input_tokens** (integer or `null`):
Number of tokens used to create a cache entry.
- **cache_read_input_tokens** (integer or `null`):
Number of tokens read from the cache.
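To make these field descriptions concrete, here is a small sketch that reads a few of them off a response. It continues from the non-streaming SDK example earlier and assumes the returned object can be indexed like the JSON document shown above (access style may vary slightly by SDK version).

```python showLineNumbers title="Sketch: reading documented response fields"
# `response` is the object returned by litellm.anthropic.messages.acreate(...) above.
text_blocks = [b["text"] for b in response["content"] if b["type"] == "text"]
print("assistant:", " ".join(text_blocks))

# "end_turn", "max_tokens", "stop_sequence", or "tool_use"
print("stop_reason:", response["stop_reason"])

usage = response["usage"]
print("input tokens:", usage["input_tokens"], "| output tokens:", usage["output_tokens"])
```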

View file

@ -1,7 +1,7 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Assistants API
# /assistants
Covers Threads, Messages, Assistants.

View file

@ -1,7 +1,7 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# [BETA] Batches API
# /batches
Covers Batches, Files

View file

@ -3,7 +3,7 @@ import TabItem from '@theme/TabItem';
# Caching - In-Memory, Redis, s3, Redis Semantic Cache, Disk
[**See Code**](https://github.com/BerriAI/litellm/blob/main/litellm.caching.caching.py)
[**See Code**](https://github.com/BerriAI/litellm/blob/main/litellm/caching/caching.py)
:::info
@ -26,7 +26,7 @@ Install redis
pip install redis
```
For the hosted version you can setup your own Redis DB here: https://app.redislabs.com/
For the hosted version you can setup your own Redis DB here: https://redis.io/try-free/
```python
import litellm
@ -37,11 +37,11 @@ litellm.cache = Cache(type="redis", host=<host>, port=<port>, password=<password
# Make completion calls
response1 = completion(
model="gpt-3.5-turbo",
model="gpt-3.5-turbo",
messages=[{"role": "user", "content": "Tell me a joke."}]
)
response2 = completion(
model="gpt-3.5-turbo",
model="gpt-3.5-turbo",
messages=[{"role": "user", "content": "Tell me a joke."}]
)
@ -91,12 +91,12 @@ response2 = completion(
<TabItem value="redis-sem" label="redis-semantic cache">
Install redis
Install redisvl client
```shell
pip install redisvl==0.0.7
pip install redisvl==0.4.1
```
For the hosted version you can setup your own Redis DB here: https://app.redislabs.com/
For the hosted version you can setup your own Redis DB here: https://redis.io/try-free/
```python
import litellm
@ -114,6 +114,7 @@ litellm.cache = Cache(
port=os.environ["REDIS_PORT"],
password=os.environ["REDIS_PASSWORD"],
similarity_threshold=0.8, # similarity threshold for cache hits, 0 == no similarity, 1 = exact matches, 0.5 == 50% similarity
ttl=120,
redis_semantic_cache_embedding_model="text-embedding-ada-002", # this model is passed to litellm.embedding(), any litellm.embedding() model is supported here
)
response1 = completion(
@ -471,11 +472,13 @@ def __init__(
password: Optional[str] = None,
namespace: Optional[str] = None,
default_in_redis_ttl: Optional[float] = None,
similarity_threshold: Optional[float] = None,
redis_semantic_cache_use_async=False,
redis_semantic_cache_embedding_model="text-embedding-ada-002",
redis_flush_size=None,
# redis semantic cache params
similarity_threshold: Optional[float] = None,
redis_semantic_cache_embedding_model: str = "text-embedding-ada-002",
redis_semantic_cache_index_name: Optional[str] = None,
# s3 Bucket, boto3 configuration
s3_bucket_name: Optional[str] = None,
s3_region_name: Optional[str] = None,

View file

@ -27,16 +27,18 @@ os.environ["AWS_REGION_NAME"] = ""
# pdf url
image_url = "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
file_url = "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
# model
model = "bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0"
image_content = [
file_content = [
{"type": "text", "text": "What's this file about?"},
{
"type": "image_url",
"image_url": image_url, # OR {"url": image_url}
"type": "file",
"file": {
"file_id": file_url,
}
},
]
@ -46,7 +48,7 @@ if not supports_pdf_input(model, None):
response = completion(
model=model,
messages=[{"role": "user", "content": image_content}],
messages=[{"role": "user", "content": file_content}],
)
assert response is not None
```
@ -80,11 +82,15 @@ curl -X POST 'http://0.0.0.0:4000/chat/completions' \
-d '{
"model": "bedrock-model",
"messages": [
{"role": "user", "content": {"type": "text", "text": "What's this file about?"}},
{
"type": "image_url",
"image_url": "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
}
{"role": "user", "content": [
{"type": "text", "text": "What's this file about?"},
{
"type": "file",
"file": {
"file_id": "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
}
}
]},
]
}'
```
@ -116,11 +122,13 @@ base64_url = f"data:application/pdf;base64,{encoded_file}"
# model
model = "bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0"
image_content = [
file_content = [
{"type": "text", "text": "What's this file about?"},
{
"type": "image_url",
"image_url": base64_url, # OR {"url": base64_url}
"type": "file",
"file": {
"file_data": base64_url,
}
},
]
@ -130,11 +138,53 @@ if not supports_pdf_input(model, None):
response = completion(
model=model,
messages=[{"role": "user", "content": image_content}],
messages=[{"role": "user", "content": file_content}],
)
assert response is not None
```
</TabItem>
<TabItem value="proxy" label="PROXY">
1. Setup config.yaml
```yaml
model_list:
- model_name: bedrock-model
litellm_params:
model: bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0
aws_access_key_id: os.environ/AWS_ACCESS_KEY_ID
aws_secret_access_key: os.environ/AWS_SECRET_ACCESS_KEY
aws_region_name: os.environ/AWS_REGION_NAME
```
2. Start the proxy
```bash
litellm --config /path/to/config.yaml
```
3. Test it!
```bash
curl -X POST 'http://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
"model": "bedrock-model",
"messages": [
{"role": "user", "content": [
{"type": "text", "text": "What's this file about?"},
{
"type": "file",
"file": {
"file_data": "data:application/pdf;base64...",
}
}
]},
]
}'
```
</TabItem>
</Tabs>
## Checking if a model supports pdf input

View file

@ -107,4 +107,76 @@ response = litellm.completion(
</TabItem>
</Tabs>
**additional_drop_params**: List or null - Is a list of openai params you want to drop when making a call to the model.
**additional_drop_params**: List or null - Is a list of openai params you want to drop when making a call to the model.
## Specify allowed openai params in a request
Tell litellm to allow specific openai params in a request. Use this if you get a `litellm.UnsupportedParamsError` and want to allow a param. LiteLLM will pass the param as is to the model.
<Tabs>
<TabItem value="sdk" label="LiteLLM Python SDK">
In this example we pass `allowed_openai_params=["tools"]` to allow the `tools` param.
```python showLineNumbers title="Pass allowed_openai_params to LiteLLM Python SDK"
await litellm.acompletion(
model="azure/o_series/<my-deployment-name>",
api_key="xxxxx",
api_base=api_base,
messages=[{"role": "user", "content": "Hello! return a json object"}],
tools=[{"type": "function", "function": {"name": "get_current_time", "description": "Get the current time in a given location.", "parameters": {"type": "object", "properties": {"location": {"type": "string", "description": "The city name, e.g. San Francisco"}}, "required": ["location"]}}}]
allowed_openai_params=["tools"],
)
```
</TabItem>
<TabItem value="proxy" label="LiteLLM Proxy">
When using litellm proxy you can pass `allowed_openai_params` in two ways:
1. Dynamically pass `allowed_openai_params` in a request
2. Set `allowed_openai_params` on the config.yaml file for a specific model
#### Dynamically pass allowed_openai_params in a request
In this example we pass `allowed_openai_params=["tools"]` to allow the `tools` param for a request sent to the model set on the proxy.
```python showLineNumbers title="Dynamically pass allowed_openai_params in a request"
import openai
client = openai.OpenAI(
api_key="anything",
base_url="http://0.0.0.0:4000"
)
response = client.chat.completions.create(
model="gpt-3.5-turbo",
messages = [
{
"role": "user",
"content": "this is a test request, write a short poem"
}
],
extra_body={
"allowed_openai_params": ["tools"]
}
)
```
#### Set allowed_openai_params on config.yaml
You can also set `allowed_openai_params` on the config.yaml file for a specific model. This means that all requests to this deployment are allowed to pass in the `tools` param.
```yaml showLineNumbers title="Set allowed_openai_params on config.yaml"
model_list:
- model_name: azure-o1-preview
litellm_params:
model: azure/o_series/<my-deployment-name>
api_key: xxxxx
api_base: https://openai-prod-test.openai.azure.com/openai/deployments/o1/chat/completions?api-version=2025-01-01-preview
allowed_openai_params: ["tools"]
```
</TabItem>
</Tabs>

View file

@ -3,7 +3,13 @@ import TabItem from '@theme/TabItem';
# Prompt Caching
For OpenAI + Anthropic + Deepseek, LiteLLM follows the OpenAI prompt caching usage object format:
Supported Providers:
- OpenAI (`openai/`)
- Anthropic API (`anthropic/`)
- Bedrock (`bedrock/`, `bedrock/invoke/`, `bedrock/converse`) ([all Bedrock models that support prompt caching](https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-caching.html))
- Deepseek API (`deepseek/`)
For the supported providers, LiteLLM follows the OpenAI prompt caching usage object format:
```bash
"usage": {
@ -499,4 +505,4 @@ curl -L -X GET 'http://0.0.0.0:4000/v1/model/info' \
</TabItem>
</Tabs>
This checks our maintained [model info/cost map](https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json)
This checks our maintained [model info/cost map](https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json)

View file

@ -46,7 +46,7 @@ from litellm import completion
fallback_dict = {"gpt-3.5-turbo": "gpt-3.5-turbo-16k"}
messages = [{"content": "how does a court case get to the Supreme Court?" * 500, "role": "user"}]
completion(model="gpt-3.5-turbo", messages=messages, context_window_fallback_dict=ctx_window_fallback_dict)
completion(model="gpt-3.5-turbo", messages=messages, context_window_fallback_dict=fallback_dict)
```
### Fallbacks - Switch Models/API Keys/API Bases (SDK)

View file

@ -189,4 +189,138 @@ Expected Response
```
</TabItem>
</Tabs>
</Tabs>
## Explicitly specify image type
If you have images without a mime-type, or if litellm is incorrectly inferring the mime type of your image (e.g. calling `gs://` URLs with Vertex AI), you can set it explicitly via the `format` param.
```python
"image_url": {
"url": "gs://my-gs-image",
"format": "image/jpeg"
}
```
LiteLLM will use this for any API endpoint that supports specifying a mime-type (e.g. Anthropic/Bedrock/Vertex AI).
For others (e.g. openai), it will be ignored.
<Tabs>
<TabItem label="SDK" value="sdk">
```python
import os
from litellm import completion
os.environ["ANTHROPIC_API_KEY"] = "your-api-key"
# openai call
response = completion(
model = "claude-3-7-sonnet-latest",
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "Whats in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
"format": "image/jpeg"
}
}
]
}
],
)
```
</TabItem>
<TabItem label="PROXY" value="proxy">
1. Define vision models on config.yaml
```yaml
model_list:
- model_name: gpt-4-vision-preview # OpenAI gpt-4-vision-preview
litellm_params:
model: openai/gpt-4-vision-preview
api_key: os.environ/OPENAI_API_KEY
- model_name: llava-hf # Custom OpenAI compatible model
litellm_params:
model: openai/llava-hf/llava-v1.6-vicuna-7b-hf
api_base: http://localhost:8000
api_key: fake-key
model_info:
supports_vision: True # set supports_vision to True so /model/info returns this attribute as True
```
2. Run proxy server
```bash
litellm --config config.yaml
```
3. Test it using the OpenAI Python SDK
```python
import os
from openai import OpenAI
client = OpenAI(
api_key="sk-1234", # your litellm proxy api key
)
response = client.chat.completions.create(
model = "gpt-4-vision-preview", # use model="llava-hf" to test your custom OpenAI endpoint
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "Whats in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
"format": "image/jpeg"
}
}
]
}
],
)
```
</TabItem>
</Tabs>
## Spec
```
"image_url": str
OR
"image_url": {
"url": "url OR base64 encoded str",
"detail": "openai-only param",
"format": "specify mime-type of image"
}
```

View file

@ -0,0 +1,308 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Using Web Search
Use web search with litellm
| Feature | Details |
|---------|---------|
| Supported Endpoints | - `/chat/completions` <br/> - `/responses` |
| Supported Providers | `openai` |
| LiteLLM Cost Tracking | ✅ Supported |
| LiteLLM Version | `v1.63.15-nightly` or higher |
## `/chat/completions` (litellm.completion)
### Quick Start
<Tabs>
<TabItem value="sdk" label="SDK">
```python showLineNumbers
from litellm import completion
response = completion(
model="openai/gpt-4o-search-preview",
messages=[
{
"role": "user",
"content": "What was a positive news story from today?",
}
],
)
```
</TabItem>
<TabItem value="proxy" label="PROXY">
1. Setup config.yaml
```yaml
model_list:
- model_name: gpt-4o-search-preview
litellm_params:
model: openai/gpt-4o-search-preview
api_key: os.environ/OPENAI_API_KEY
```
2. Start the proxy
```bash
litellm --config /path/to/config.yaml
```
3. Test it!
```python showLineNumbers
from openai import OpenAI
# Point to your proxy server
client = OpenAI(
api_key="sk-1234",
base_url="http://0.0.0.0:4000"
)
response = client.chat.completions.create(
model="gpt-4o-search-preview",
messages=[
{
"role": "user",
"content": "What was a positive news story from today?"
}
]
)
```
</TabItem>
</Tabs>
### Search context size
<Tabs>
<TabItem value="sdk" label="SDK">
```python showLineNumbers
from litellm import completion
# Customize search context size
response = completion(
model="openai/gpt-4o-search-preview",
messages=[
{
"role": "user",
"content": "What was a positive news story from today?",
}
],
web_search_options={
"search_context_size": "low" # Options: "low", "medium" (default), "high"
}
)
```
</TabItem>
<TabItem value="proxy" label="PROXY">
```python showLineNumbers
from openai import OpenAI
# Point to your proxy server
client = OpenAI(
api_key="sk-1234",
base_url="http://0.0.0.0:4000"
)
# Customize search context size
response = client.chat.completions.create(
model="gpt-4o-search-preview",
messages=[
{
"role": "user",
"content": "What was a positive news story from today?"
}
],
web_search_options={
"search_context_size": "low" # Options: "low", "medium" (default), "high"
}
)
```
</TabItem>
</Tabs>
## `/responses` (litellm.responses)
### Quick Start
<Tabs>
<TabItem value="sdk" label="SDK">
```python showLineNumbers
from litellm import responses
response = responses(
model="openai/gpt-4o",
input=[
{
"role": "user",
"content": "What was a positive news story from today?"
}
],
tools=[{
"type": "web_search_preview" # enables web search with default medium context size
}]
)
```
</TabItem>
<TabItem value="proxy" label="PROXY">
1. Setup config.yaml
```yaml
model_list:
- model_name: gpt-4o
litellm_params:
model: openai/gpt-4o
api_key: os.environ/OPENAI_API_KEY
```
2. Start the proxy
```bash
litellm --config /path/to/config.yaml
```
3. Test it!
```python showLineNumbers
from openai import OpenAI
# Point to your proxy server
client = OpenAI(
api_key="sk-1234",
base_url="http://0.0.0.0:4000"
)
response = client.responses.create(
model="gpt-4o",
tools=[{
"type": "web_search_preview"
}],
input="What was a positive news story from today?",
)
print(response.output_text)
```
</TabItem>
</Tabs>
### Search context size
<Tabs>
<TabItem value="sdk" label="SDK">
```python showLineNumbers
from litellm import responses
# Customize search context size
response = responses(
model="openai/gpt-4o",
input=[
{
"role": "user",
"content": "What was a positive news story from today?"
}
],
tools=[{
"type": "web_search_preview",
"search_context_size": "low" # Options: "low", "medium" (default), "high"
}]
)
```
</TabItem>
<TabItem value="proxy" label="PROXY">
```python showLineNumbers
from openai import OpenAI
# Point to your proxy server
client = OpenAI(
api_key="sk-1234",
base_url="http://0.0.0.0:4000"
)
# Customize search context size
response = client.responses.create(
model="gpt-4o",
tools=[{
"type": "web_search_preview",
"search_context_size": "low" # Options: "low", "medium" (default), "high"
}],
input="What was a positive news story from today?",
)
print(response.output_text)
```
</TabItem>
</Tabs>
## Checking if a model supports web search
<Tabs>
<TabItem label="SDK" value="sdk">
Use `litellm.supports_web_search(model="openai/gpt-4o-search-preview")` -> returns `True` if model can perform web searches
```python showLineNumbers
assert litellm.supports_web_search(model="openai/gpt-4o-search-preview") == True
```
</TabItem>
<TabItem label="PROXY" value="proxy">
1. Define OpenAI models in config.yaml
```yaml
model_list:
- model_name: gpt-4o-search-preview
litellm_params:
model: openai/gpt-4o-search-preview
api_key: os.environ/OPENAI_API_KEY
model_info:
supports_web_search: True
```
2. Run proxy server
```bash
litellm --config config.yaml
```
3. Call `/model_group/info` to check if a model supports web search
```shell
curl -X 'GET' \
'http://localhost:4000/model_group/info' \
-H 'accept: application/json' \
-H 'x-api-key: sk-1234'
```
Expected Response
```json showLineNumbers
{
"data": [
{
"model_group": "gpt-4o-search-preview",
"providers": ["openai"],
"max_tokens": 128000,
"supports_web_search": true, # 👈 supports_web_search is true
}
]
}
```
</TabItem>
</Tabs>

View file

@ -46,7 +46,7 @@ For security inquiries, please contact us at support@berri.ai
|-------------------|-------------------------------------------------------------------------------------------------|
| SOC 2 Type I | Certified. Report available upon request on Enterprise plan. |
| SOC 2 Type II | In progress. Certificate available by April 15th, 2025 |
| ISO27001 | In progress. Certificate available by February 7th, 2025 |
| ISO 27001 | Certified. Report available upon request on Enterprise plan. |
## Supported Data Regions for LiteLLM Cloud
@ -137,7 +137,7 @@ Point of contact email address for general security-related questions: krrish@be
Has the Vendor been audited / certified?
- SOC 2 Type I. Certified. Report available upon request on Enterprise plan.
- SOC 2 Type II. In progress. Certificate available by April 15th, 2025.
- ISO27001. In progress. Certificate available by February 7th, 2025.
- ISO 27001. Certified. Report available upon request on Enterprise plan.
Has an information security management system been implemented?
- Yes - [CodeQL](https://codeql.github.com/) and a comprehensive ISMS covering multiple security domains.

View file

@ -1,7 +1,7 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Embeddings
# /embeddings
## Quick Start
```python

View file

@ -1,3 +1,5 @@
import Image from '@theme/IdealImage';
# Enterprise
For companies that need SSO, user management and professional support for LiteLLM Proxy
@ -7,6 +9,8 @@ Get free 7-day trial key [here](https://www.litellm.ai/#trial)
Includes all enterprise features.
<Image img={require('../img/enterprise_vs_oss.png')} />
[**Procurement available via AWS / Azure Marketplace**](./data_security.md#legalcompliance-faqs)
@ -34,9 +38,9 @@ You can use our cloud product where we setup a dedicated instance for you.
Professional Support can assist with LLM/Provider integrations, deployment, upgrade management, and LLM Provider troubleshooting. We can't solve your own infrastructure-related issues, but we will guide you to fix them.
- 1 hour for Sev0 issues
- 6 hours for Sev1
- 24h for Sev2-Sev3 between 7am 7pm PT (Monday through Saturday)
- 1 hour for Sev0 issues - 100% of production traffic is failing
- 6 hours for Sev1 - <100% of production traffic is failing
- 24h for Sev2-Sev3 between 7am-7pm PT (Monday through Saturday) - setup issues, e.g. Redis works on our end but not on your infrastructure
- 72h SLA for patching vulnerabilities in the software.
**We can offer custom SLAs** based on your needs and the severity of the issue

View file

@ -0,0 +1,106 @@
# Contributing Code
## **Checklist before submitting a PR**
Here are the core requirements for any PR submitted to LiteLLM
- [ ] Add testing, **Adding at least 1 test is a hard requirement** - [see details](#2-adding-testing-to-your-pr)
- [ ] Ensure your PR passes the following tests:
- [ ] [Unit Tests](#3-running-unit-tests)
- [ ] [Formatting / Linting Tests](#35-running-linting-tests)
- [ ] Keep scope as isolated as possible. As a general rule, your changes should address 1 specific problem at a time
## Quick start
## 1. Setup your local dev environment
Here's how to modify the repo locally:
Step 1: Clone the repo
```shell
git clone https://github.com/BerriAI/litellm.git
```
Step 2: Install dev dependencies:
```shell
poetry install --with dev --extras proxy
```
That's it, your local dev environment is ready!
## 2. Adding Testing to your PR
- Add your test to the [`tests/litellm/` directory](https://github.com/BerriAI/litellm/tree/main/tests/litellm)
- This directory maps 1:1 to the `litellm/` directory and can only contain mocked tests.
- Do not add real LLM API calls to this directory.
### 2.1 File Naming Convention for `tests/litellm/`
The `tests/litellm/` directory follows the same directory structure as `litellm/`.
- `tests/litellm/proxy/test_caching_routes.py` maps to `litellm/proxy/caching_routes.py`
- `tests/litellm/test_{filename}.py` maps to `litellm/{filename}.py` (a minimal example test is sketched below)
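For illustration, a minimal mocked test might look like the sketch below. It relies on litellm's `mock_response` testing hook so no real LLM API call is made; the file name `tests/litellm/test_mock_completion.py` is only an example.

```python
import litellm


def test_mock_completion_returns_mock_response():
    # mock_response short-circuits the provider call, so no network request is made
    response = litellm.completion(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Hello"}],
        mock_response="Hi there!",
    )
    assert response.choices[0].message.content == "Hi there!"
```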
## 3. Running Unit Tests
Run the following command from the root of the litellm directory:
```shell
make test-unit
```
## 3.5 Running Linting Tests
Run the following command from the root of the litellm directory:
```shell
make lint
```
LiteLLM uses `mypy` for linting. In CI/CD we also run `black` for formatting.
## 4. Submit a PR with your changes!
- push your fork to your GitHub repo
- submit a PR from there
## Advanced
### Building LiteLLM Docker Image
Some people might want to build the LiteLLM docker image themselves. Follow these instructions if you want to build / run the LiteLLM Docker Image yourself.
Step 1: Clone the repo
```shell
git clone https://github.com/BerriAI/litellm.git
```
Step 2: Build the Docker Image
Build using Dockerfile.non_root
```shell
docker build -f docker/Dockerfile.non_root -t litellm_test_image .
```
Step 3: Run the Docker Image
Make sure config.yaml is present in the root directory. This is your litellm proxy config file.
```shell
docker run \
-v $(pwd)/proxy_config.yaml:/app/config.yaml \
-e DATABASE_URL="postgresql://xxxxxxxx" \
-e LITELLM_MASTER_KEY="sk-1234" \
-p 4000:4000 \
litellm_test_image \
--config /app/config.yaml --detailed_debug
```

View file

@ -2,10 +2,12 @@
import TabItem from '@theme/TabItem';
import Tabs from '@theme/Tabs';
# Files API
# Provider Files Endpoints
Files are used to upload documents that can be used with features like Assistants, Fine-tuning, and Batch API.
Use this to call the provider's `/files` endpoints directly, in the OpenAI format.
## Quick Start
- Upload a File
@ -14,48 +16,105 @@ Files are used to upload documents that can be used with features like Assistant
- Delete File
- Get File Content
<Tabs>
<TabItem value="proxy" label="LiteLLM PROXY Server">
```bash
$ export OPENAI_API_KEY="sk-..."
1. Setup config.yaml
$ litellm
# RUNNING on http://0.0.0.0:4000
```
# for /files endpoints
files_settings:
- custom_llm_provider: azure
api_base: https://exampleopenaiendpoint-production.up.railway.app
api_key: fake-key
api_version: "2023-03-15-preview"
- custom_llm_provider: openai
api_key: os.environ/OPENAI_API_KEY
```
**Upload a File**
2. Start LiteLLM PROXY Server
```bash
curl http://localhost:4000/v1/files \
-H "Authorization: Bearer sk-1234" \
-F purpose="fine-tune" \
-F file="@mydata.jsonl"
litellm --config /path/to/config.yaml
## RUNNING on http://0.0.0.0:4000
```
**List Files**
```bash
curl http://localhost:4000/v1/files \
-H "Authorization: Bearer sk-1234"
3. Use OpenAI's /files endpoints
Upload a File
```python
from openai import OpenAI
client = OpenAI(
api_key="sk-...",
base_url="http://0.0.0.0:4000/v1"
)
client.files.create(
file=wav_data,
purpose="user_data",
extra_body={"custom_llm_provider": "openai"}
)
```
**Retrieve File Information**
```bash
curl http://localhost:4000/v1/files/file-abc123 \
-H "Authorization: Bearer sk-1234"
List Files
```python
from openai import OpenAI
client = OpenAI(
api_key="sk-...",
base_url="http://0.0.0.0:4000/v1"
)
files = client.files.list(extra_body={"custom_llm_provider": "openai"})
print("files=", files)
```
**Delete File**
```bash
curl http://localhost:4000/v1/files/file-abc123 \
-X DELETE \
-H "Authorization: Bearer sk-1234"
Retrieve File Information
```python
from openai import OpenAI
client = OpenAI(
api_key="sk-...",
base_url="http://0.0.0.0:4000/v1"
)
file = client.files.retrieve(file_id="file-abc123", extra_body={"custom_llm_provider": "openai"})
print("file=", file)
```
**Get File Content**
```bash
curl http://localhost:4000/v1/files/file-abc123/content \
-H "Authorization: Bearer sk-1234"
Delete File
```python
from openai import OpenAI
client = OpenAI(
api_key="sk-...",
base_url="http://0.0.0.0:4000/v1"
)
response = client.files.delete(file_id="file-abc123", extra_body={"custom_llm_provider": "openai"})
print("delete response=", response)
```
Get File Content
```python
from openai import OpenAI
client = OpenAI(
api_key="sk-...",
base_url="http://0.0.0.0:4000/v1"
)
content = client.files.content(file_id="file-abc123", extra_body={"custom_llm_provider": "openai"})
print("content=", content)
```
</TabItem>
@ -120,7 +179,7 @@ print("file content=", content)
### [OpenAI](#quick-start)
## [Azure OpenAI](./providers/azure#azure-batches-api)
### [Azure OpenAI](./providers/azure#azure-batches-api)
### [Vertex AI](./providers/vertex#batch-apis)

View file

@ -1,7 +1,7 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# [Beta] Fine-tuning API
# /fine_tuning
:::info

View file

@ -0,0 +1,66 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# SSL Security Settings
If you're in an environment using an older TLS bundle, with older encryption, follow this guide.
LiteLLM uses HTTPX for network requests, unless otherwise specified.
1. Disable SSL verification
<Tabs>
<TabItem value="sdk" label="SDK">
```python
import litellm
litellm.ssl_verify = False
```
</TabItem>
<TabItem value="proxy" label="PROXY">
```yaml
litellm_settings:
ssl_verify: false
```
</TabItem>
<TabItem value="env_var" label="Environment Variables">
```bash
export SSL_VERIFY="False"
```
</TabItem>
</Tabs>
2. Lower security settings
<Tabs>
<TabItem value="sdk" label="SDK">
```python
import litellm
litellm.ssl_security_level = 1
litellm.ssl_certificate = "/path/to/certificate.pem"
```
</TabItem>
<TabItem value="proxy" label="PROXY">
```yaml
litellm_settings:
ssl_security_level: 1
ssl_certificate: "/path/to/certificate.pem"
```
</TabItem>
<TabItem value="env_var" label="Environment Variables">
```bash
export SSL_SECURITY_LEVEL="1"
export SSL_CERTIFICATE="/path/to/certificate.pem"
```
</TabItem>
</Tabs>

View file

@ -111,8 +111,8 @@ from litellm import completion
import os
# auth: run 'gcloud auth application-default'
os.environ["VERTEX_PROJECT"] = "hardy-device-386718"
os.environ["VERTEX_LOCATION"] = "us-central1"
os.environ["VERTEXAI_PROJECT"] = "hardy-device-386718"
os.environ["VERTEXAI_LOCATION"] = "us-central1"
response = completion(
model="vertex_ai/gemini-1.5-pro",

docs/my-website/docs/mcp.md Normal file
View file

@ -0,0 +1,427 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import Image from '@theme/IdealImage';
# /mcp [BETA] - Model Context Protocol
## Expose MCP tools on LiteLLM Proxy Server
This allows you to define tools that can be called by any MCP compatible client. Define your `mcp_servers` with LiteLLM and all your clients can list and call available tools.
<Image
img={require('../img/mcp_2.png')}
style={{width: '100%', display: 'block', margin: '2rem auto'}}
/>
<p style={{textAlign: 'left', color: '#666'}}>
LiteLLM MCP Architecture: Use MCP tools with all LiteLLM supported models
</p>
#### How it works
LiteLLM exposes the following MCP endpoints:
- `/mcp/tools/list` - List all available tools
- `/mcp/tools/call` - Call a specific tool with the provided arguments
When MCP clients connect to LiteLLM they can follow this workflow:
1. Connect to the LiteLLM MCP server
2. List all available tools on LiteLLM
3. Client makes LLM API request with tool call(s)
4. LLM API returns which tools to call and with what arguments
5. MCP client makes MCP tool calls to LiteLLM
6. LiteLLM makes the tool calls to the appropriate MCP server
7. LiteLLM returns the tool call results to the MCP client
#### Usage
#### 1. Define your tools under `mcp_servers` in your config.yaml file.
LiteLLM allows you to define your tools on the `mcp_servers` section in your config.yaml file. All tools listed here will be available to MCP clients (when they connect to LiteLLM and call `list_tools`).
```yaml title="config.yaml" showLineNumbers
model_list:
- model_name: gpt-4o
litellm_params:
model: openai/gpt-4o
api_key: sk-xxxxxxx
mcp_servers:
{
"zapier_mcp": {
"url": "https://actions.zapier.com/mcp/sk-akxxxxx/sse"
},
"fetch": {
"url": "http://localhost:8000/sse"
}
}
```
#### 2. Start LiteLLM Gateway
<Tabs>
<TabItem value="docker" label="Docker Run">
```shell title="Docker Run" showLineNumbers
docker run -d \
-p 4000:4000 \
-e OPENAI_API_KEY=$OPENAI_API_KEY \
--name my-app \
-v $(pwd)/my_config.yaml:/app/config.yaml \
my-app:latest \
--config /app/config.yaml \
--port 4000 \
--detailed_debug
```
</TabItem>
<TabItem value="py" label="litellm pip">
```shell title="litellm pip" showLineNumbers
litellm --config config.yaml --detailed_debug
```
</TabItem>
</Tabs>
#### 3. Make an LLM API request
In this example we will do the following:
1. Use MCP client to list MCP tools on LiteLLM Proxy
2. Use `transform_mcp_tool_to_openai_tool` to convert MCP tools to OpenAI tools
3. Provide the MCP tools to `gpt-4o`
4. Handle tool call from `gpt-4o`
5. Convert OpenAI tool call to MCP tool call
6. Execute tool call on MCP server
```python title="MCP Client List Tools" showLineNumbers
import asyncio
from openai import AsyncOpenAI
from openai.types.chat import ChatCompletionUserMessageParam
from mcp import ClientSession
from mcp.client.sse import sse_client
from litellm.experimental_mcp_client.tools import (
transform_mcp_tool_to_openai_tool,
transform_openai_tool_call_request_to_mcp_tool_call_request,
)
async def main():
# Initialize clients
# point OpenAI client to LiteLLM Proxy
client = AsyncOpenAI(api_key="sk-1234", base_url="http://localhost:4000")
# Point MCP client to LiteLLM Proxy
async with sse_client("http://localhost:4000/mcp/") as (read, write):
async with ClientSession(read, write) as session:
await session.initialize()
# 1. List MCP tools on LiteLLM Proxy
mcp_tools = await session.list_tools()
print("List of MCP tools for MCP server:", mcp_tools.tools)
# Create message
messages = [
ChatCompletionUserMessageParam(
content="Send an email about LiteLLM supporting MCP", role="user"
)
]
# 2. Use `transform_mcp_tool_to_openai_tool` to convert MCP tools to OpenAI tools
# Since OpenAI only supports tools in the OpenAI format, we need to convert the MCP tools to the OpenAI format.
openai_tools = [
transform_mcp_tool_to_openai_tool(tool) for tool in mcp_tools.tools
]
# 3. Provide the MCP tools to `gpt-4o`
response = await client.chat.completions.create(
model="gpt-4o",
messages=messages,
tools=openai_tools,
tool_choice="auto",
)
# 4. Handle tool call from `gpt-4o`
if response.choices[0].message.tool_calls:
tool_call = response.choices[0].message.tool_calls[0]
if tool_call:
# 5. Convert OpenAI tool call to MCP tool call
# Since MCP servers expect tools in the MCP format, we need to convert the OpenAI tool call to the MCP format.
# This is done using litellm.experimental_mcp_client.tools.transform_openai_tool_call_request_to_mcp_tool_call_request
mcp_call = (
transform_openai_tool_call_request_to_mcp_tool_call_request(
openai_tool=tool_call.model_dump()
)
)
# 6. Execute tool call on MCP server
result = await session.call_tool(
name=mcp_call.name, arguments=mcp_call.arguments
)
print("Result:", result)
# Run it
asyncio.run(main())
```
## LiteLLM Python SDK MCP Bridge
The LiteLLM Python SDK acts as an MCP bridge, letting you use MCP tools with all LiteLLM supported models. LiteLLM offers the following features for using MCP
- **List** Available MCP Tools: OpenAI clients can view all available MCP tools
- `litellm.experimental_mcp_client.load_mcp_tools` to list all available MCP tools
- **Call** MCP Tools: OpenAI clients can call MCP tools
- `litellm.experimental_mcp_client.call_openai_tool` to call an OpenAI tool on an MCP server
### 1. List Available MCP Tools
In this example we'll use `litellm.experimental_mcp_client.load_mcp_tools` to list all available MCP tools on any MCP server. This method can be used in two ways:
- `format="mcp"` - (default) Return MCP tools
- Returns: `mcp.types.Tool`
- `format="openai"` - Return MCP tools converted to OpenAI API compatible tools. Allows using with OpenAI endpoints.
- Returns: `openai.types.chat.ChatCompletionToolParam`
<Tabs>
<TabItem value="sdk" label="LiteLLM Python SDK">
```python title="MCP Client List Tools" showLineNumbers
# Create server parameters for stdio connection
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client
import json
import os
import litellm
from litellm import experimental_mcp_client
server_params = StdioServerParameters(
command="python3",
# Make sure to update to the full absolute path to your mcp_server.py file
args=["./mcp_server.py"],
)
async with stdio_client(server_params) as (read, write):
async with ClientSession(read, write) as session:
# Initialize the connection
await session.initialize()
# Get tools
tools = await experimental_mcp_client.load_mcp_tools(session=session, format="openai")
print("MCP TOOLS: ", tools)
messages = [{"role": "user", "content": "what's (3 + 5)"}]
llm_response = await litellm.acompletion(
model="gpt-4o",
api_key=os.getenv("OPENAI_API_KEY"),
messages=messages,
tools=tools,
)
print("LLM RESPONSE: ", json.dumps(llm_response, indent=4, default=str))
```
</TabItem>
<TabItem value="openai" label="OpenAI SDK + LiteLLM Proxy">
In this example we'll walk through how you can use the OpenAI SDK pointed to the LiteLLM proxy to call MCP tools. The key difference here is we use the OpenAI SDK to make the LLM API request
```python title="MCP Client List Tools" showLineNumbers
# Create server parameters for stdio connection
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client
import os
from openai import OpenAI
from litellm import experimental_mcp_client
server_params = StdioServerParameters(
command="python3",
# Make sure to update to the full absolute path to your mcp_server.py file
args=["./mcp_server.py"],
)
async with stdio_client(server_params) as (read, write):
async with ClientSession(read, write) as session:
# Initialize the connection
await session.initialize()
# Get tools using litellm mcp client
tools = await experimental_mcp_client.load_mcp_tools(session=session, format="openai")
print("MCP TOOLS: ", tools)
# Use OpenAI SDK pointed to LiteLLM proxy
client = OpenAI(
api_key="your-api-key", # Your LiteLLM proxy API key
base_url="http://localhost:4000" # Your LiteLLM proxy URL
)
messages = [{"role": "user", "content": "what's (3 + 5)"}]
llm_response = client.chat.completions.create(
model="gpt-4",
messages=messages,
tools=tools
)
print("LLM RESPONSE: ", llm_response)
```
</TabItem>
</Tabs>
### 2. List and Call MCP Tools
In this example we'll use
- `litellm.experimental_mcp_client.load_mcp_tools` to list all available MCP tools on any MCP server
- `litellm.experimental_mcp_client.call_openai_tool` to call an OpenAI tool on an MCP server
The first llm response returns a list of OpenAI tools. We take the first tool call from the LLM response and pass it to `litellm.experimental_mcp_client.call_openai_tool` to call the tool on the MCP server.
#### How `litellm.experimental_mcp_client.call_openai_tool` works
- Accepts an OpenAI Tool Call from the LLM response
- Converts the OpenAI Tool Call to an MCP Tool
- Calls the MCP Tool on the MCP server
- Returns the result of the MCP Tool call
<Tabs>
<TabItem value="sdk" label="LiteLLM Python SDK">
```python title="MCP Client List and Call Tools" showLineNumbers
# Create server parameters for stdio connection
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client
import json
import os
import litellm
from litellm import experimental_mcp_client
server_params = StdioServerParameters(
command="python3",
# Make sure to update to the full absolute path to your mcp_server.py file
args=["./mcp_server.py"],
)
async with stdio_client(server_params) as (read, write):
async with ClientSession(read, write) as session:
# Initialize the connection
await session.initialize()
# Get tools
tools = await experimental_mcp_client.load_mcp_tools(session=session, format="openai")
print("MCP TOOLS: ", tools)
messages = [{"role": "user", "content": "what's (3 + 5)"}]
llm_response = await litellm.acompletion(
model="gpt-4o",
api_key=os.getenv("OPENAI_API_KEY"),
messages=messages,
tools=tools,
)
print("LLM RESPONSE: ", json.dumps(llm_response, indent=4, default=str))
openai_tool = llm_response["choices"][0]["message"]["tool_calls"][0]
# Call the tool using MCP client
call_result = await experimental_mcp_client.call_openai_tool(
session=session,
openai_tool=openai_tool,
)
print("MCP TOOL CALL RESULT: ", call_result)
# send the tool result to the LLM
messages.append(llm_response["choices"][0]["message"])
messages.append(
{
"role": "tool",
"content": str(call_result.content[0].text),
"tool_call_id": openai_tool["id"],
}
)
print("final messages with tool result: ", messages)
llm_response = await litellm.acompletion(
model="gpt-4o",
api_key=os.getenv("OPENAI_API_KEY"),
messages=messages,
tools=tools,
)
print(
"FINAL LLM RESPONSE: ", json.dumps(llm_response, indent=4, default=str)
)
```
</TabItem>
<TabItem value="proxy" label="OpenAI SDK + LiteLLM Proxy">
In this example we'll walk through how you can use the OpenAI SDK pointed to the LiteLLM proxy to call MCP tools. The key difference here is we use the OpenAI SDK to make the LLM API request
```python title="MCP Client with OpenAI SDK" showLineNumbers
# Create server parameters for stdio connection
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client
import os
from openai import OpenAI
from litellm import experimental_mcp_client
server_params = StdioServerParameters(
command="python3",
# Make sure to update to the full absolute path to your mcp_server.py file
args=["./mcp_server.py"],
)
async with stdio_client(server_params) as (read, write):
async with ClientSession(read, write) as session:
# Initialize the connection
await session.initialize()
# Get tools using litellm mcp client
tools = await experimental_mcp_client.load_mcp_tools(session=session, format="openai")
print("MCP TOOLS: ", tools)
# Use OpenAI SDK pointed to LiteLLM proxy
client = OpenAI(
api_key="your-api-key", # Your LiteLLM proxy API key
base_url="http://localhost:8000" # Your LiteLLM proxy URL
)
messages = [{"role": "user", "content": "what's (3 + 5)"}]
llm_response = client.chat.completions.create(
model="gpt-4",
messages=messages,
tools=tools
)
print("LLM RESPONSE: ", llm_response)
# Get the first tool call
tool_call = llm_response.choices[0].message.tool_calls[0]
# Call the tool using MCP client
call_result = await experimental_mcp_client.call_openai_tool(
session=session,
openai_tool=tool_call.model_dump(),
)
print("MCP TOOL CALL RESULT: ", call_result)
# Send the tool result back to the LLM
messages.append(llm_response.choices[0].message.model_dump())
messages.append({
"role": "tool",
"content": str(call_result.content[0].text),
"tool_call_id": tool_call.id,
})
final_response = client.chat.completions.create(
model="gpt-4",
messages=messages,
tools=tools
)
print("FINAL RESPONSE: ", final_response)
```
</TabItem>
</Tabs>

View file

@ -1,7 +1,7 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Moderation
# /moderations
### Usage

View file

@ -1,4 +1,7 @@
import Image from '@theme/IdealImage';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Arize AI
@ -11,6 +14,8 @@ https://github.com/BerriAI/litellm
:::
<Image img={require('../../img/arize.png')} />
## Pre-Requisites
@ -24,7 +29,9 @@ You can also use the instrumentor option instead of the callback, which you can
```python
litellm.callbacks = ["arize"]
```
```python
import litellm
import os
@ -48,7 +55,7 @@ response = litellm.completion(
### Using with LiteLLM Proxy
1. Setup config.yaml
```yaml
model_list:
- model_name: gpt-4
@ -60,13 +67,134 @@ model_list:
litellm_settings:
callbacks: ["arize"]
general_settings:
master_key: "sk-1234" # can also be set as an environment variable
environment_variables:
ARIZE_SPACE_KEY: "d0*****"
ARIZE_API_KEY: "141a****"
ARIZE_ENDPOINT: "https://otlp.arize.com/v1" # OPTIONAL - your custom arize GRPC api endpoint
ARIZE_HTTP_ENDPOINT: "https://otlp.arize.com/v1" # OPTIONAL - your custom arize HTTP api endpoint. Set either this or ARIZE_ENDPOINT
ARIZE_HTTP_ENDPOINT: "https://otlp.arize.com/v1" # OPTIONAL - your custom arize HTTP api endpoint. Set either this or ARIZE_ENDPOINT, or neither (defaults to https://otlp.arize.com/v1 over gRPC)
```
2. Start the proxy
```bash
litellm --config config.yaml
```
3. Test it!
```bash
curl -X POST 'http://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{ "model": "gpt-4", "messages": [{"role": "user", "content": "Hi 👋 - i'm openai"}]}'
```
## Pass Arize Space/Key per-request
Supported parameters:
- `arize_api_key`
- `arize_space_key`
<Tabs>
<TabItem value="sdk" label="SDK">
```python
import litellm
import os
# LLM API Keys
os.environ['OPENAI_API_KEY']=""
# set arize as a callback, litellm will send the data to arize
litellm.callbacks = ["arize"]
# openai call
response = litellm.completion(
model="gpt-3.5-turbo",
messages=[
{"role": "user", "content": "Hi 👋 - i'm openai"}
],
arize_api_key=os.getenv("ARIZE_SPACE_2_API_KEY"),
arize_space_key=os.getenv("ARIZE_SPACE_2_KEY"),
)
```
</TabItem>
<TabItem value="proxy" label="PROXY">
1. Setup config.yaml
```yaml
model_list:
- model_name: gpt-4
litellm_params:
model: openai/fake
api_key: fake-key
api_base: https://exampleopenaiendpoint-production.up.railway.app/
litellm_settings:
callbacks: ["arize"]
general_settings:
master_key: "sk-1234" # can also be set as an environment variable
```
2. Start the proxy
```bash
litellm --config /path/to/config.yaml
```
3. Test it!
<Tabs>
<TabItem value="curl" label="CURL">
```bash
curl -X POST 'http://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "Hi 👋 - i'm openai"}],
"arize_api_key": "ARIZE_SPACE_2_API_KEY",
"arize_space_key": "ARIZE_SPACE_2_KEY"
}'
```
</TabItem>
<TabItem value="openai_python" label="OpenAI Python">
```python
import openai
client = openai.OpenAI(
api_key="anything",
base_url="http://0.0.0.0:4000"
)
# request sent to model set on litellm proxy, `litellm --model`
response = client.chat.completions.create(
model="gpt-3.5-turbo",
messages = [
{
"role": "user",
"content": "this is a test request, write a short poem"
}
],
extra_body={
"arize_api_key": "ARIZE_SPACE_2_API_KEY",
"arize_space_key": "ARIZE_SPACE_2_KEY"
}
)
print(response)
```
</TabItem>
</Tabs>
</TabItem>
</Tabs>
## Support & Talk to Founders
- [Schedule Demo 👋](https://calendly.com/d/4mp-gd3-k5k/berriai-1-1-onboarding-litellm-hosted-version)

View file

@ -78,6 +78,9 @@ Following are the allowed fields in metadata, their types, and their description
* `context: Optional[Union[dict, str]]` - This is the context used as information for the prompt. For RAG applications, this is the "retrieved" data. You may log context as a string or as an object (dictionary).
* `expected_response: Optional[str]` - This is the reference response to compare against for evaluation purposes. This is useful for segmenting inference calls by expected response.
* `user_query: Optional[str]` - This is the user's query. For conversational applications, this is the user's last message.
* `tags: Optional[list]` - This is a list of tags. This is useful for segmenting inference calls by tags.
* `user_feedback: Optional[str]` - The end user's feedback.
* `model_options: Optional[dict]` - This is a dictionary of model options. This is useful for getting insights into how model behavior affects your end users.
* `custom_attributes: Optional[dict]` - This is a dictionary of custom attributes. This is useful for additional information about the inference.
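For illustration, here is a sketch of attaching a few of these fields via the `metadata` argument of `litellm.completion`. The values are placeholders, and it assumes the Athina callback is already configured as described earlier on this page.

```python
import litellm

litellm.success_callback = ["athina"]  # assumes Athina logging is set up as shown above

response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "How do I enable caching?"}],
    metadata={
        "user_query": "How do I enable caching?",   # the user's last message
        "tags": ["docs", "caching"],                # segment inference calls by tags
        "custom_attributes": {"team": "platform"},  # any additional attributes
    },
)
```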
## Using a self hosted deployment of Athina

View file

@ -0,0 +1,95 @@
# OpenAI Passthrough
Pass-through endpoints for `/openai`
## Overview
| Feature | Supported | Notes |
|-------|-------|-------|
| Cost Tracking | ❌ | Not supported |
| Logging | ✅ | Works across all integrations |
| Streaming | ✅ | Fully supported |
### When to use this?
- For 90% of your use cases, you should use the [native LiteLLM OpenAI Integration](https://docs.litellm.ai/docs/providers/openai) (`/chat/completions`, `/embeddings`, `/completions`, `/images`, `/batches`, etc.)
- Use this passthrough to call less popular or newer OpenAI endpoints that LiteLLM doesn't fully support yet, such as `/assistants`, `/threads`, `/vector_stores`
Simply replace `https://api.openai.com` with `LITELLM_PROXY_BASE_URL/openai`
## Usage Examples
### Assistants API
#### Create OpenAI Client
Make sure you do the following:
- Point `base_url` to your `LITELLM_PROXY_BASE_URL/openai`
- Use your `LITELLM_API_KEY` as the `api_key`
```python
import openai
client = openai.OpenAI(
base_url="http://0.0.0.0:4000/openai", # <your-proxy-url>/openai
api_key="sk-anything" # <your-proxy-api-key>
)
```
#### Create an Assistant
```python
# Create an assistant
assistant = client.beta.assistants.create(
name="Math Tutor",
instructions="You are a math tutor. Help solve equations.",
model="gpt-4o",
)
```
#### Create a Thread
```python
# Create a thread
thread = client.beta.threads.create()
```
#### Add a Message to the Thread
```python
# Add a message
message = client.beta.threads.messages.create(
thread_id=thread.id,
role="user",
content="Solve 3x + 11 = 14",
)
```
#### Run the Assistant
```python
# Create a run to get the assistant's response
run = client.beta.threads.runs.create(
thread_id=thread.id,
assistant_id=assistant.id,
)
# Check run status
run_status = client.beta.threads.runs.retrieve(
thread_id=thread.id,
run_id=run.id
)
```
#### Retrieve Messages
```python
# List messages after the run completes
messages = client.beta.threads.messages.list(
thread_id=thread.id
)
```
#### Delete the Assistant
```python
# Delete the assistant when done
client.beta.assistants.delete(assistant.id)
```

View file

@ -15,6 +15,91 @@ Pass-through endpoints for Vertex AI - call provider-specific endpoint, in nativ
Just replace `https://REGION-aiplatform.googleapis.com` with `LITELLM_PROXY_BASE_URL/vertex_ai`
LiteLLM supports 3 flows for calling Vertex AI endpoints via pass-through:
1. **Specific Credentials**: Admin sets passthrough credentials for a specific project/region.
2. **Default Credentials**: Admin sets default credentials.
3. **Client-Side Credentials**: User can send client-side credentials through to Vertex AI (default behavior - if no default or mapped credentials are found, the request is passed through directly).
## Example Usage
<Tabs>
<TabItem value="specific_credentials" label="Specific Project/Region">
```yaml
model_list:
- model_name: gemini-1.0-pro
litellm_params:
model: vertex_ai/gemini-1.0-pro
vertex_project: adroit-crow-413218
vertex_region: us-central1
vertex_credentials: /path/to/credentials.json
use_in_pass_through: true # 👈 KEY CHANGE
```
</TabItem>
<TabItem value="default_credentials" label="Default Credentials">
<Tabs>
<TabItem value="yaml" label="Set in config.yaml">
```yaml
default_vertex_config:
vertex_project: adroit-crow-413218
vertex_region: us-central1
vertex_credentials: /path/to/credentials.json
```
</TabItem>
<TabItem value="env_var" label="Set in environment variables">
```bash
export DEFAULT_VERTEXAI_PROJECT="adroit-crow-413218"
export DEFAULT_VERTEXAI_LOCATION="us-central1"
export DEFAULT_GOOGLE_APPLICATION_CREDENTIALS="/path/to/credentials.json"
```
</TabItem>
</Tabs>
</TabItem>
<TabItem value="client_credentials" label="Client Credentials">
Try Gemini 2.0 Flash (curl)
```bash
MODEL_ID="gemini-2.0-flash-001"
PROJECT_ID="YOUR_PROJECT_ID"
```
```bash
curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
-H "Content-Type: application/json" \
"${LITELLM_PROXY_BASE_URL}/vertex_ai/v1/projects/${PROJECT_ID}/locations/us-central1/publishers/google/models/${MODEL_ID}:streamGenerateContent" -d \
$'{
"contents": {
"role": "user",
"parts": [
{
"fileData": {
"mimeType": "image/png",
"fileUri": "gs://generativeai-downloads/images/scones.jpg"
}
},
{
"text": "Describe this picture."
}
]
}
}'
```
</TabItem>
</Tabs>
#### **Example Usage**
@ -22,7 +107,7 @@ Just replace `https://REGION-aiplatform.googleapis.com` with `LITELLM_PROXY_BASE
<TabItem value="curl" label="curl">
```bash
curl http://localhost:4000/vertex_ai/publishers/google/models/gemini-1.0-pro:generateContent \
curl http://localhost:4000/vertex_ai/vertex_ai/v1/projects/${PROJECT_ID}/locations/us-central1/publishers/google/models/${MODEL_ID}:generateContent \
-H "Content-Type: application/json" \
-H "x-litellm-api-key: Bearer sk-1234" \
-d '{
@ -101,7 +186,7 @@ litellm
Let's call the Google AI Studio token counting endpoint
```bash
curl http://localhost:4000/vertex-ai/publishers/google/models/gemini-1.0-pro:generateContent \
curl http://localhost:4000/vertex-ai/v1/projects/${PROJECT_ID}/locations/us-central1/publishers/google/models/gemini-1.0-pro:generateContent \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-1234" \
-d '{
@ -140,7 +225,7 @@ LiteLLM Proxy Server supports two methods of authentication to Vertex AI:
```shell
curl http://localhost:4000/vertex_ai/publishers/google/models/gemini-1.5-flash-001:generateContent \
curl http://localhost:4000/vertex_ai/v1/projects/${PROJECT_ID}/locations/us-central1/publishers/google/models/gemini-1.5-flash-001:generateContent \
-H "Content-Type: application/json" \
-H "x-litellm-api-key: Bearer sk-1234" \
-d '{"contents":[{"role": "user", "parts":[{"text": "hi"}]}]}'
@ -152,7 +237,7 @@ curl http://localhost:4000/vertex_ai/publishers/google/models/gemini-1.5-flash-0
```shell
curl http://localhost:4000/vertex_ai/publishers/google/models/textembedding-gecko@001:predict \
curl http://localhost:4000/vertex_ai/v1/projects/${PROJECT_ID}/locations/us-central1/publishers/google/models/textembedding-gecko@001:predict \
-H "Content-Type: application/json" \
-H "x-litellm-api-key: Bearer sk-1234" \
-d '{"instances":[{"content": "gm"}]}'
@ -162,7 +247,7 @@ curl http://localhost:4000/vertex_ai/publishers/google/models/textembedding-geck
### Imagen API
```shell
curl http://localhost:4000/vertex_ai/publishers/google/models/imagen-3.0-generate-001:predict \
curl http://localhost:4000/vertex_ai/v1/projects/${PROJECT_ID}/locations/us-central1/publishers/google/models/imagen-3.0-generate-001:predict \
-H "Content-Type: application/json" \
-H "x-litellm-api-key: Bearer sk-1234" \
-d '{"instances":[{"prompt": "make an otter"}], "parameters": {"sampleCount": 1}}'
@ -172,7 +257,7 @@ curl http://localhost:4000/vertex_ai/publishers/google/models/imagen-3.0-generat
### Count Tokens API
```shell
curl http://localhost:4000/vertex_ai/publishers/google/models/gemini-1.5-flash-001:countTokens \
curl http://localhost:4000/vertex_ai/v1/projects/${PROJECT_ID}/locations/us-central1/publishers/google/models/gemini-1.5-flash-001:countTokens \
-H "Content-Type: application/json" \
-H "x-litellm-api-key: Bearer sk-1234" \
-d '{"contents":[{"role": "user", "parts":[{"text": "hi"}]}]}'
@ -183,7 +268,7 @@ Create Fine Tuning Job
```shell
curl http://localhost:4000/vertex_ai/tuningJobs \
curl http://localhost:4000/vertex_ai/v1/projects/${PROJECT_ID}/locations/us-central1/publishers/google/models/gemini-1.5-flash-001:tuningJobs \
-H "Content-Type: application/json" \
-H "x-litellm-api-key: Bearer sk-1234" \
-d '{
@ -243,7 +328,7 @@ Expected Response
```bash
curl http://localhost:4000/vertex_ai/publishers/google/models/gemini-1.0-pro:generateContent \
curl http://localhost:4000/vertex_ai/v1/projects/${PROJECT_ID}/locations/us-central1/publishers/google/models/gemini-1.0-pro:generateContent \
-H "Content-Type: application/json" \
-H "x-litellm-api-key: Bearer sk-1234" \
-d '{
@ -268,7 +353,7 @@ tags: ["vertex-js-sdk", "pass-through-endpoint"]
<TabItem value="curl" label="curl">
```bash
curl http://localhost:4000/vertex-ai/publishers/google/models/gemini-1.0-pro:generateContent \
curl http://localhost:4000/vertex_ai/v1/projects/${PROJECT_ID}/locations/us-central1/publishers/google/models/gemini-1.0-pro:generateContent \
-H "Content-Type: application/json" \
-H "x-litellm-api-key: Bearer sk-1234" \
-H "tags: vertex-js-sdk,pass-through-endpoint" \

View file

@ -0,0 +1,14 @@
# 🐕 Elroy
Elroy is a scriptable AI assistant that remembers and sets goals.
Interact through the command line, share memories via MCP, or build your own tools using Python.
[![Static Badge][github-shield]][github-url]
[![Discord][discord-shield]][discord-url]
[github-shield]: https://img.shields.io/badge/Github-repo-white?logo=github
[github-url]: https://github.com/elroy-bot/elroy
[discord-shield]:https://img.shields.io/discord/1200684659277832293?color=7289DA&label=Discord&logo=discord&logoColor=white
[discord-url]: https://discord.gg/5PJUY4eMce

View file

@ -0,0 +1,5 @@
PDL - A YAML-based approach to prompt programming
Github: https://github.com/IBM/prompt-declaration-language
PDL is a declarative approach to prompt programming, helping users to accumulate messages implicitly, with support for model chaining and tool use.

View file

@ -0,0 +1,9 @@
# pgai
[pgai](https://github.com/timescale/pgai) is a suite of tools to develop RAG, semantic search, and other AI applications more easily with PostgreSQL.
If you don't know what pgai is yet, check out the [README](https://github.com/timescale/pgai)!
If you're already familiar with pgai, you can find litellm specific docs here:
- Litellm for [model calling](https://github.com/timescale/pgai/blob/main/docs/model_calling/litellm.md) in pgai
- Use the [litellm provider](https://github.com/timescale/pgai/blob/main/docs/vectorizer/api-reference.md#aiembedding_litellm) to automatically create embeddings for your data via the pgai vectorizer.

View file

@ -819,6 +819,160 @@ resp = litellm.completion(
print(f"\nResponse: {resp}")
```
## Usage - Thinking / `reasoning_content`
LiteLLM translates OpenAI's `reasoning_effort` to Anthropic's `thinking` parameter. [Code](https://github.com/BerriAI/litellm/blob/23051d89dd3611a81617d84277059cd88b2df511/litellm/llms/anthropic/chat/transformation.py#L298)
| reasoning_effort | thinking |
| ---------------- | -------- |
| "low" | "budget_tokens": 1024 |
| "medium" | "budget_tokens": 2048 |
| "high" | "budget_tokens": 4096 |
<Tabs>
<TabItem value="sdk" label="SDK">
```python
from litellm import completion
resp = completion(
model="anthropic/claude-3-7-sonnet-20250219",
messages=[{"role": "user", "content": "What is the capital of France?"}],
reasoning_effort="low",
)
```
</TabItem>
<TabItem value="proxy" label="PROXY">
1. Setup config.yaml
```yaml
- model_name: claude-3-7-sonnet-20250219
litellm_params:
model: anthropic/claude-3-7-sonnet-20250219
api_key: os.environ/ANTHROPIC_API_KEY
```
2. Start proxy
```bash
litellm --config /path/to/config.yaml
```
3. Test it!
```bash
curl http://0.0.0.0:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <YOUR-LITELLM-KEY>" \
-d '{
"model": "claude-3-7-sonnet-20250219",
"messages": [{"role": "user", "content": "What is the capital of France?"}],
"reasoning_effort": "low"
}'
```
</TabItem>
</Tabs>
**Expected Response**
```python
ModelResponse(
id='chatcmpl-c542d76d-f675-4e87-8e5f-05855f5d0f5e',
created=1740470510,
model='claude-3-7-sonnet-20250219',
object='chat.completion',
system_fingerprint=None,
choices=[
Choices(
finish_reason='stop',
index=0,
message=Message(
content="The capital of France is Paris.",
role='assistant',
tool_calls=None,
function_call=None,
provider_specific_fields={
'citations': None,
'thinking_blocks': [
{
'type': 'thinking',
'thinking': 'The capital of France is Paris. This is a very straightforward factual question.',
'signature': 'EuYBCkQYAiJAy6...'
}
]
}
),
thinking_blocks=[
{
'type': 'thinking',
'thinking': 'The capital of France is Paris. This is a very straightforward factual question.',
'signature': 'EuYBCkQYAiJAy6AGB...'
}
],
reasoning_content='The capital of France is Paris. This is a very straightforward factual question.'
)
],
usage=Usage(
completion_tokens=68,
prompt_tokens=42,
total_tokens=110,
completion_tokens_details=None,
prompt_tokens_details=PromptTokensDetailsWrapper(
audio_tokens=None,
cached_tokens=0,
text_tokens=None,
image_tokens=None
),
cache_creation_input_tokens=0,
cache_read_input_tokens=0
)
)
```
### Pass `thinking` to Anthropic models
You can also pass the `thinking` parameter to Anthropic models.
<Tabs>
<TabItem value="sdk" label="SDK">
```python
response = litellm.completion(
model="anthropic/claude-3-7-sonnet-20250219",
messages=[{"role": "user", "content": "What is the capital of France?"}],
thinking={"type": "enabled", "budget_tokens": 1024},
)
```
</TabItem>
<TabItem value="proxy" label="PROXY">
```bash
curl http://0.0.0.0:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $LITELLM_KEY" \
-d '{
"model": "anthropic/claude-3-7-sonnet-20250219",
"messages": [{"role": "user", "content": "What is the capital of France?"}],
"thinking": {"type": "enabled", "budget_tokens": 1024}
}'
```
</TabItem>
</Tabs>
## **Passing Extra Headers to Anthropic API**
Pass `extra_headers: dict` to `litellm.completion`
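For example, a minimal sketch - the header name and value below are illustrative only; substitute whatever headers your provider or deployment expects:

```python
from litellm import completion

response = completion(
    model="anthropic/claude-3-7-sonnet-20250219",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    # illustrative header - pass any provider-specific headers your setup requires
    extra_headers={"anthropic-beta": "max-tokens-3-5-sonnet-2024-07-15"},
)
print(response)
```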
@ -927,8 +1081,10 @@ response = completion(
"content": [
{"type": "text", "text": "You are a very professional document summarization specialist. Please summarize the given document."},
{
"type": "image_url",
"image_url": f"data:application/pdf;base64,{encoded_file}", # 👈 PDF
"type": "file",
"file": {
"file_data": f"data:application/pdf;base64,{encoded_file}", # 👈 PDF
}
},
],
}
@ -973,8 +1129,10 @@ curl http://0.0.0.0:4000/v1/chat/completions \
"text": "You are a very professional document summarization specialist. Please summarize the given document"
},
{
"type": "image_url",
"image_url": "data:application/pdf;base64,{encoded_file}" # 👈 PDF
"type": "file",
"file": {
"file_data": f"data:application/pdf;base64,{encoded_file}", # 👈 PDF
}
}
}
]
@ -1135,3 +1293,4 @@ curl http://0.0.0.0:4000/v1/chat/completions \
</TabItem>
</Tabs>

View file

@ -291,14 +291,15 @@ response = completion(
)
```
## Azure O1 Models
## O-Series Models
| Model Name | Function Call |
|---------------------|----------------------------------------------------|
| o1-mini | `response = completion(model="azure/<your deployment name>", messages=messages)` |
| o1-preview | `response = completion(model="azure/<your deployment name>", messages=messages)` |
Azure OpenAI O-Series models are supported on LiteLLM.
Set `litellm.enable_preview_features = True` to use Azure O1 Models with streaming support.
LiteLLM routes any deployment name with `o1` or `o3` in the model name to the O-Series [transformation](https://github.com/BerriAI/litellm/blob/91ed05df2962b8eee8492374b048d27cc144d08c/litellm/llms/azure/chat/o1_transformation.py#L4) logic.
To set this explicitly, set `model` to `azure/o_series/<your-deployment-name>`.
**Automatic Routing**
<Tabs>
<TabItem value="sdk" label="SDK">
@ -306,60 +307,112 @@ Set `litellm.enable_preview_features = True` to use Azure O1 Models with streami
```python
import litellm
litellm.enable_preview_features = True # 👈 KEY CHANGE
response = litellm.completion(
model="azure/<your deployment name>",
messages=[{"role": "user", "content": "What is the weather like in Boston?"}],
stream=True
)
for chunk in response:
print(chunk)
litellm.completion(model="azure/my-o3-deployment", messages=[{"role": "user", "content": "Hello, world!"}]) # 👈 Note: 'o3' in the deployment name
```
</TabItem>
<TabItem value="proxy" label="Proxy">
<TabItem value="proxy" label="PROXY">
1. Setup config.yaml
```yaml
model_list:
- model_name: o1-mini
- model_name: o3-mini
litellm_params:
model: azure/o1-mini
api_base: "os.environ/AZURE_API_BASE"
api_key: "os.environ/AZURE_API_KEY"
api_version: "os.environ/AZURE_API_VERSION"
litellm_settings:
enable_preview_features: true # 👈 KEY CHANGE
model: azure/o3-model
api_base: os.environ/AZURE_API_BASE
api_key: os.environ/AZURE_API_KEY
```
2. Start proxy
</TabItem>
</Tabs>
**Explicit Routing**
<Tabs>
<TabItem value="sdk" label="SDK">
```python
import litellm
litellm.completion(model="azure/o_series/my-random-deployment-name", messages=[{"role": "user", "content": "Hello, world!"}]) # 👈 Note: 'o_series/' in the deployment name
```
</TabItem>
<TabItem value="proxy" label="PROXY">
```yaml
model_list:
- model_name: o3-mini
litellm_params:
model: azure/o_series/my-random-deployment-name
api_base: os.environ/AZURE_API_BASE
api_key: os.environ/AZURE_API_KEY
```
</TabItem>
</Tabs>
## Azure Audio Model
<Tabs>
<TabItem value="sdk" label="SDK">
```python
from litellm import completion
import os
os.environ["AZURE_API_KEY"] = ""
os.environ["AZURE_API_BASE"] = ""
os.environ["AZURE_API_VERSION"] = ""
response = completion(
model="azure/azure-openai-4o-audio",
messages=[
{
"role": "user",
"content": "I want to try out speech to speech"
}
],
modalities=["text","audio"],
audio={"voice": "alloy", "format": "wav"}
)
print(response)
```
</TabItem>
<TabItem value="proxy" label="PROXY">
1. Setup config.yaml
```yaml
model_list:
- model_name: azure-openai-4o-audio
litellm_params:
model: azure/azure-openai-4o-audio
api_base: os.environ/AZURE_API_BASE
api_key: os.environ/AZURE_API_KEY
api_version: os.environ/AZURE_API_VERSION
```
2. Start proxy
```bash
litellm --config /path/to/config.yaml
```
3. Test it
3. Test it!
```python
import openai
client = openai.OpenAI(
api_key="anything",
base_url="http://0.0.0.0:4000"
)
response = client.chat.completions.create(model="o1-mini", messages = [
{
"role": "user",
"content": "this is a test request, write a short poem"
}
],
stream=True)
for chunk in response:
print(chunk)
```

```bash
curl http://localhost:4000/v1/chat/completions \
-H "Authorization: Bearer $LITELLM_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "azure-openai-4o-audio",
"messages": [{"role": "user", "content": "I want to try out speech to speech"}],
"modalities": ["text","audio"],
"audio": {"voice": "alloy", "format": "wav"}
}'
```
</TabItem>
</Tabs>
@ -425,7 +478,7 @@ response.stream_to_file(speech_file_path)
## **Authentication**
### Entrata ID - use `azure_ad_token`
### Entra ID - use `azure_ad_token`
This is a walkthrough on how to use Azure Active Directory Tokens - Microsoft Entra ID to make `litellm.completion()` calls
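As a quick sketch (the deployment name, endpoint, and token below are placeholders - the token is whatever valid Entra ID access token you already have, e.g. from `DefaultAzureCredential`):

```python
from litellm import completion

response = completion(
    model="azure/<your-deployment-name>",              # placeholder deployment name
    messages=[{"role": "user", "content": "Hello!"}],
    azure_ad_token="eyJ0eXAiOiJKV1QiLCJhbGciOi...",    # your Entra ID access token
    api_base="https://my-endpoint.openai.azure.com",   # placeholder endpoint
    api_version="2024-02-15-preview",                  # placeholder API version
)
print(response)
```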
@ -492,7 +545,7 @@ model_list:
</TabItem>
</Tabs>
### Entrata ID - use tenant_id, client_id, client_secret
### Entra ID - use tenant_id, client_id, client_secret
Here is an example of setting up `tenant_id`, `client_id`, `client_secret` in your litellm proxy `config.yaml`
```yaml
@ -528,7 +581,7 @@ Example video of using `tenant_id`, `client_id`, `client_secret` with LiteLLM Pr
<iframe width="840" height="500" src="https://www.loom.com/embed/70d3f219ee7f4e5d84778b7f17bba506?sid=04b8ff29-485f-4cb8-929e-6b392722f36d" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>
### Entrata ID - use client_id, username, password
### Entra ID - use client_id, username, password
Here is an example of setting up `client_id`, `azure_username`, `azure_password` in your litellm proxy `config.yaml`
```yaml
@ -948,62 +1001,9 @@ Expected Response:
{"data":[{"id":"batch_R3V...}
```
## O-Series Models
Azure OpenAI O-Series models are supported on LiteLLM.
LiteLLM routes any deployment name with `o1` or `o3` in the model name, to the O-Series [transformation](https://github.com/BerriAI/litellm/blob/91ed05df2962b8eee8492374b048d27cc144d08c/litellm/llms/azure/chat/o1_transformation.py#L4) logic.
To set this explicitly, set `model` to `azure/o_series/<your-deployment-name>`.
**Automatic Routing**
<Tabs>
<TabItem value="sdk" label="SDK">
```python
import litellm
litellm.completion(model="azure/my-o3-deployment", messages=[{"role": "user", "content": "Hello, world!"}]) # 👈 Note: 'o3' in the deployment name
```
</TabItem>
<TabItem value="proxy" label="PROXY">
```yaml
model_list:
- model_name: o3-mini
litellm_params:
model: azure/o3-model
api_base: os.environ/AZURE_API_BASE
api_key: os.environ/AZURE_API_KEY
```
</TabItem>
</Tabs>
**Explicit Routing**
<Tabs>
<TabItem value="sdk" label="SDK">
```python
import litellm
litellm.completion(model="azure/o_series/my-random-deployment-name", messages=[{"role": "user", "content": "Hello, world!"}]) # 👈 Note: 'o_series/' in the deployment name
```
</TabItem>
<TabItem value="proxy" label="PROXY">
```yaml
model_list:
- model_name: o3-mini
litellm_params:
model: azure/o_series/my-random-deployment-name
api_base: os.environ/AZURE_API_BASE
api_key: os.environ/AZURE_API_KEY
```
</TabItem>
</Tabs>
@ -1076,32 +1076,24 @@ print(response)
```
### Parallel Function calling
### Tool Calling / Function Calling
See a detailed walkthrough of parallel function calling with litellm [here](https://docs.litellm.ai/docs/completion/function_call)
<Tabs>
<TabItem value="sdk" label="SDK">
```python
# set Azure env variables
import os
import litellm
import json
os.environ['AZURE_API_KEY'] = "" # litellm reads AZURE_API_KEY from .env and sends the request
os.environ['AZURE_API_BASE'] = "https://openai-gpt-4-test-v-1.openai.azure.com/"
os.environ['AZURE_API_VERSION'] = "2023-07-01-preview"
import litellm
import json
# Example dummy function hard coded to return the same weather
# In production, this could be your backend API or an external API
def get_current_weather(location, unit="fahrenheit"):
"""Get the current weather in a given location"""
if "tokyo" in location.lower():
return json.dumps({"location": "Tokyo", "temperature": "10", "unit": "celsius"})
elif "san francisco" in location.lower():
return json.dumps({"location": "San Francisco", "temperature": "72", "unit": "fahrenheit"})
elif "paris" in location.lower():
return json.dumps({"location": "Paris", "temperature": "22", "unit": "celsius"})
else:
return json.dumps({"location": location, "temperature": "unknown"})
## Step 1: send the conversation and available functions to the model
messages = [{"role": "user", "content": "What's the weather like in San Francisco, Tokyo, and Paris?"}]
tools = [
{
"type": "function",
@ -1125,7 +1117,7 @@ tools = [
response = litellm.completion(
model="azure/chatgpt-functioncalling", # model = azure/<your-azure-deployment-name>
messages=messages,
messages=[{"role": "user", "content": "What's the weather like in San Francisco, Tokyo, and Paris?"}],
tools=tools,
tool_choice="auto", # auto is default, but we'll be explicit
)
@ -1134,8 +1126,49 @@ response_message = response.choices[0].message
tool_calls = response.choices[0].message.tool_calls
print("\nTool Choice:\n", tool_calls)
```
</TabItem>
<TabItem value="proxy" label="PROXY">
1. Setup config.yaml
```yaml
model_list:
- model_name: azure-gpt-3.5
litellm_params:
model: azure/chatgpt-functioncalling
api_base: os.environ/AZURE_API_BASE
api_key: os.environ/AZURE_API_KEY
api_version: "2023-07-01-preview"
```
2. Start proxy
```bash
litellm --config config.yaml
```
3. Test it
```bash
curl -L -X POST 'http://localhost:4000/v1/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
"model": "azure-gpt-3.5",
"messages": [
{
"role": "user",
"content": "Hey, how'\''s it going? Thinking long and hard before replying - what is the meaning of the world and life itself"
}
]
}'
```
</TabItem>
</Tabs>
### Spend Tracking for Azure OpenAI Models (PROXY)
Set the base model for cost tracking on Azure image-gen calls.
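A minimal sketch of what this can look like in `config.yaml` - here `base_model` under `model_info` is assumed to be the field LiteLLM reads for pricing, and the deployment/model names are placeholders:

```yaml
model_list:
  - model_name: azure-dall-e-3
    litellm_params:
      model: azure/dall-e-3-deployment          # placeholder Azure deployment name
      api_base: os.environ/AZURE_API_BASE
      api_key: os.environ/AZURE_API_KEY
    model_info:
      base_model: dall-e-3                       # assumed: public model used for cost calculation
```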

File diff suppressed because it is too large Load diff

View file

@ -23,14 +23,16 @@ import os
os.environ['CEREBRAS_API_KEY'] = ""
response = completion(
model="cerebras/meta/llama3-70b-instruct",
model="cerebras/llama3-70b-instruct",
messages=[
{
"role": "user",
"content": "What's the weather like in Boston today in Fahrenheit?",
"content": "What's the weather like in Boston today in Fahrenheit? (Write in JSON)",
}
],
max_tokens=10,
# The prompt should include JSON if 'json_object' is selected; otherwise, you will get error code 400.
response_format={ "type": "json_object" },
seed=123,
stop=["\n\n"],
@ -50,16 +52,18 @@ import os
os.environ['CEREBRAS_API_KEY'] = ""
response = completion(
model="cerebras/meta/llama3-70b-instruct",
model="cerebras/llama3-70b-instruct",
messages=[
{
"role": "user",
"content": "What's the weather like in Boston today in Fahrenheit?",
"content": "What's the weather like in Boston today in Fahrenheit? (Write in JSON)",
}
],
stream=True,
max_tokens=10,
response_format={ "type": "json_object" },
# The prompt should include JSON if 'json_object' is selected; otherwise, you will get error code 400.
response_format={ "type": "json_object" },
seed=123,
stop=["\n\n"],
temperature=0.2,

View file

@ -108,7 +108,7 @@ response = embedding(
### Usage
LiteLLM supports the v1 and v2 clients for Cohere rerank. By default, the `rerank` endpoint uses the v2 client, but you can specify the v1 client by explicitly calling `v1/rerank`
<Tabs>
<TabItem value="sdk" label="LiteLLM SDK Usage">

View file

@ -1,7 +1,7 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# 🆕 Databricks
# Databricks
LiteLLM supports all models on Databricks
@ -154,7 +154,205 @@ response = completion(
temperature: 0.5
```
## Passings Databricks specific params - 'instruction'
## Usage - Thinking / `reasoning_content`
LiteLLM translates OpenAI's `reasoning_effort` to Anthropic's `thinking` parameter. [Code](https://github.com/BerriAI/litellm/blob/23051d89dd3611a81617d84277059cd88b2df511/litellm/llms/anthropic/chat/transformation.py#L298)
| reasoning_effort | thinking |
| ---------------- | -------- |
| "low" | "budget_tokens": 1024 |
| "medium" | "budget_tokens": 2048 |
| "high" | "budget_tokens": 4096 |
Known Limitations:
- Support for passing thinking blocks back to Claude [Issue](https://github.com/BerriAI/litellm/issues/9790)
<Tabs>
<TabItem value="sdk" label="SDK">
```python
from litellm import completion
import os
# set ENV variables (can also be passed in to .completion() - e.g. `api_base`, `api_key`)
os.environ["DATABRICKS_API_KEY"] = "databricks key"
os.environ["DATABRICKS_API_BASE"] = "databricks base url"
resp = completion(
model="databricks/databricks-claude-3-7-sonnet",
messages=[{"role": "user", "content": "What is the capital of France?"}],
reasoning_effort="low",
)
```
</TabItem>
<TabItem value="proxy" label="PROXY">
1. Setup config.yaml
```yaml
- model_name: claude-3-7-sonnet
litellm_params:
model: databricks/databricks-claude-3-7-sonnet
api_key: os.environ/DATABRICKS_API_KEY
api_base: os.environ/DATABRICKS_API_BASE
```
2. Start proxy
```bash
litellm --config /path/to/config.yaml
```
3. Test it!
```bash
curl http://0.0.0.0:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <YOUR-LITELLM-KEY>" \
-d '{
"model": "claude-3-7-sonnet",
"messages": [{"role": "user", "content": "What is the capital of France?"}],
"reasoning_effort": "low"
}'
```
</TabItem>
</Tabs>
**Expected Response**
```python
ModelResponse(
id='chatcmpl-c542d76d-f675-4e87-8e5f-05855f5d0f5e',
created=1740470510,
model='claude-3-7-sonnet-20250219',
object='chat.completion',
system_fingerprint=None,
choices=[
Choices(
finish_reason='stop',
index=0,
message=Message(
content="The capital of France is Paris.",
role='assistant',
tool_calls=None,
function_call=None,
provider_specific_fields={
'citations': None,
'thinking_blocks': [
{
'type': 'thinking',
'thinking': 'The capital of France is Paris. This is a very straightforward factual question.',
'signature': 'EuYBCkQYAiJAy6...'
}
]
}
),
thinking_blocks=[
{
'type': 'thinking',
'thinking': 'The capital of France is Paris. This is a very straightforward factual question.',
'signature': 'EuYBCkQYAiJAy6AGB...'
}
],
reasoning_content='The capital of France is Paris. This is a very straightforward factual question.'
)
],
usage=Usage(
completion_tokens=68,
prompt_tokens=42,
total_tokens=110,
completion_tokens_details=None,
prompt_tokens_details=PromptTokensDetailsWrapper(
audio_tokens=None,
cached_tokens=0,
text_tokens=None,
image_tokens=None
),
cache_creation_input_tokens=0,
cache_read_input_tokens=0
)
)
```
### Pass `thinking` to Anthropic models
You can also pass the `thinking` parameter to Anthropic models.
<Tabs>
<TabItem value="sdk" label="SDK">
```python
from litellm import completion
import os
# set ENV variables (can also be passed in to .completion() - e.g. `api_base`, `api_key`)
os.environ["DATABRICKS_API_KEY"] = "databricks key"
os.environ["DATABRICKS_API_BASE"] = "databricks base url"
response = litellm.completion(
model="databricks/databricks-claude-3-7-sonnet",
messages=[{"role": "user", "content": "What is the capital of France?"}],
thinking={"type": "enabled", "budget_tokens": 1024},
)
```
</TabItem>
<TabItem value="proxy" label="PROXY">
```bash
curl http://0.0.0.0:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $LITELLM_KEY" \
-d '{
"model": "databricks/databricks-claude-3-7-sonnet",
"messages": [{"role": "user", "content": "What is the capital of France?"}],
"thinking": {"type": "enabled", "budget_tokens": 1024}
}'
```
</TabItem>
</Tabs>
## Supported Databricks Chat Completion Models
:::tip
**We support ALL Databricks models, just set `model=databricks/<any-model-on-databricks>` as a prefix when sending litellm requests**
:::
| Model Name | Command |
|----------------------------|------------------------------------------------------------------|
| databricks/databricks-claude-3-7-sonnet | `completion(model='databricks/databricks-claude-3-7-sonnet', messages=messages)` |
| databricks-meta-llama-3-1-70b-instruct | `completion(model='databricks/databricks-meta-llama-3-1-70b-instruct', messages=messages)` |
| databricks-meta-llama-3-1-405b-instruct | `completion(model='databricks/databricks-meta-llama-3-1-405b-instruct', messages=messages)` |
| databricks-dbrx-instruct | `completion(model='databricks/databricks-dbrx-instruct', messages=messages)` |
| databricks-meta-llama-3-70b-instruct | `completion(model='databricks/databricks-meta-llama-3-70b-instruct', messages=messages)` |
| databricks-llama-2-70b-chat | `completion(model='databricks/databricks-llama-2-70b-chat', messages=messages)` |
| databricks-mixtral-8x7b-instruct | `completion(model='databricks/databricks-mixtral-8x7b-instruct', messages=messages)` |
| databricks-mpt-30b-instruct | `completion(model='databricks/databricks-mpt-30b-instruct', messages=messages)` |
| databricks-mpt-7b-instruct | `completion(model='databricks/databricks-mpt-7b-instruct', messages=messages)` |
## Embedding Models
### Passing Databricks specific params - 'instruction'
For embedding models, databricks lets you pass in an additional param 'instruction'. [Full Spec](https://github.com/BerriAI/litellm/blob/43353c28b341df0d9992b45c6ce464222ebd7984/litellm/llms/databricks.py#L164)
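A minimal sketch, assuming the endpoint name below (`databricks-bge-large-en`) matches an embedding model served in your workspace:

```python
import os
from litellm import embedding

os.environ["DATABRICKS_API_KEY"] = "databricks key"
os.environ["DATABRICKS_API_BASE"] = "databricks base url"

response = embedding(
    model="databricks/databricks-bge-large-en",   # placeholder endpoint name
    input=["good morning from litellm"],
    instruction="Represent this sentence for searching relevant passages:",  # Databricks-specific param
)
print(response)
```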
@ -187,27 +385,6 @@ response = litellm.embedding(
instruction: "Represent this sentence for searching relevant passages:"
```
## Supported Databricks Chat Completion Models
:::tip
**We support ALL Databricks models, just set `model=databricks/<any-model-on-databricks>` as a prefix when sending litellm requests**
:::
| Model Name | Command |
|----------------------------|------------------------------------------------------------------|
| databricks-meta-llama-3-1-70b-instruct | `completion(model='databricks/databricks-meta-llama-3-1-70b-instruct', messages=messages)` |
| databricks-meta-llama-3-1-405b-instruct | `completion(model='databricks/databricks-meta-llama-3-1-405b-instruct', messages=messages)` |
| databricks-dbrx-instruct | `completion(model='databricks/databricks-dbrx-instruct', messages=messages)` |
| databricks-meta-llama-3-70b-instruct | `completion(model='databricks/databricks-meta-llama-3-70b-instruct', messages=messages)` |
| databricks-llama-2-70b-chat | `completion(model='databricks/databricks-llama-2-70b-chat', messages=messages)` |
| databricks-mixtral-8x7b-instruct | `completion(model='databricks/databricks-mixtral-8x7b-instruct', messages=messages)` |
| databricks-mpt-30b-instruct | `completion(model='databricks/databricks-mpt-30b-instruct', messages=messages)` |
| databricks-mpt-7b-instruct | `completion(model='databricks/databricks-mpt-7b-instruct', messages=messages)` |
## Supported Databricks Embedding Models
:::tip

View file

@ -365,7 +365,7 @@ curl -X POST 'http://0.0.0.0:4000/chat/completions' \
</Tabs>
## Specifying Safety Settings
In certain use-cases you may need to make calls to the models and pass [safety settigns](https://ai.google.dev/docs/safety_setting_gemini) different from the defaults. To do so, simple pass the `safety_settings` argument to `completion` or `acompletion`. For example:
In certain use-cases you may need to make calls to the models and pass [safety settings](https://ai.google.dev/docs/safety_setting_gemini) different from the defaults. To do so, simply pass the `safety_settings` argument to `completion` or `acompletion`. For example:
```python
response = completion(
@ -438,6 +438,179 @@ assert isinstance(
```
### Google Search Tool
<Tabs>
<TabItem value="sdk" label="SDK">
```python
from litellm import completion
import os
os.environ["GEMINI_API_KEY"] = ".."
tools = [{"googleSearch": {}}] # 👈 ADD GOOGLE SEARCH
response = completion(
model="gemini/gemini-2.0-flash",
messages=[{"role": "user", "content": "What is the weather in San Francisco?"}],
tools=tools,
)
print(response)
```
</TabItem>
<TabItem value="proxy" label="PROXY">
1. Setup config.yaml
```yaml
model_list:
- model_name: gemini-2.0-flash
litellm_params:
model: gemini/gemini-2.0-flash
api_key: os.environ/GEMINI_API_KEY
```
2. Start Proxy
```bash
$ litellm --config /path/to/config.yaml
```
3. Make Request!
```bash
curl -X POST 'http://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
"model": "gemini-2.0-flash",
"messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
"tools": [{"googleSearch": {}}]
}
'
```
</TabItem>
</Tabs>
### Google Search Retrieval
<Tabs>
<TabItem value="sdk" label="SDK">
```python
from litellm import completion
import os
os.environ["GEMINI_API_KEY"] = ".."
tools = [{"googleSearchRetrieval": {}}] # 👈 ADD GOOGLE SEARCH
response = completion(
model="gemini/gemini-2.0-flash",
messages=[{"role": "user", "content": "What is the weather in San Francisco?"}],
tools=tools,
)
print(response)
```
</TabItem>
<TabItem value="proxy" label="PROXY">
1. Setup config.yaml
```yaml
model_list:
- model_name: gemini-2.0-flash
litellm_params:
model: gemini/gemini-2.0-flash
api_key: os.environ/GEMINI_API_KEY
```
2. Start Proxy
```bash
$ litellm --config /path/to/config.yaml
```
3. Make Request!
```bash
curl -X POST 'http://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
"model": "gemini-2.0-flash",
"messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
"tools": [{"googleSearchRetrieval": {}}]
}
'
```
</TabItem>
</Tabs>
### Code Execution Tool
<Tabs>
<TabItem value="sdk" label="SDK">
```python
from litellm import completion
import os
os.environ["GEMINI_API_KEY"] = ".."
tools = [{"codeExecution": {}}] # 👈 ADD GOOGLE SEARCH
response = completion(
model="gemini/gemini-2.0-flash",
messages=[{"role": "user", "content": "What is the weather in San Francisco?"}],
tools=tools,
)
print(response)
```
</TabItem>
<TabItem value="proxy" label="PROXY">
1. Setup config.yaml
```yaml
model_list:
- model_name: gemini-2.0-flash
litellm_params:
model: gemini/gemini-2.0-flash
api_key: os.environ/GEMINI_API_KEY
```
2. Start Proxy
```bash
$ litellm --config /path/to/config.yaml
```
3. Make Request!
```bash
curl -X POST 'http://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
"model": "gemini-2.0-flash",
"messages": [{"role": "user", "content": "What is the weather in San Francisco?"}],
"tools": [{"codeExecution": {}}]
}
'
```
</TabItem>
</Tabs>
## JSON Mode
<Tabs>
@ -589,8 +762,10 @@ response = litellm.completion(
"content": [
{"type": "text", "text": "Please summarize the audio."},
{
"type": "image_url",
"image_url": "data:audio/mp3;base64,{}".format(encoded_data), # 👈 SET MIME_TYPE + DATA
"type": "file",
"file": {
"file_data": "data:audio/mp3;base64,{}".format(encoded_data), # 👈 SET MIME_TYPE + DATA
}
},
],
}
@ -640,8 +815,11 @@ response = litellm.completion(
"content": [
{"type": "text", "text": "Please summarize the file."},
{
"type": "image_url",
"image_url": "https://storage..." # 👈 SET THE IMG URL
"type": "file",
"file": {
"file_id": "https://storage...", # 👈 SET THE IMG URL
"format": "application/pdf" # OPTIONAL
}
},
],
}
@ -668,8 +846,11 @@ response = litellm.completion(
"content": [
{"type": "text", "text": "Please summarize the file."},
{
"type": "image_url",
"image_url": "gs://..." # 👈 SET THE cloud storage bucket url
"type": "file",
"file": {
"file_id": "gs://storage...", # 👈 SET THE IMG URL
"format": "application/pdf" # OPTIONAL
}
},
],
}
@ -879,3 +1060,54 @@ response = await client.chat.completions.create(
</TabItem>
</Tabs>
## Image Generation
<Tabs>
<TabItem value="sdk" label="SDK">
```python
from litellm import completion
response = completion(
model="gemini/gemini-2.0-flash-exp-image-generation",
messages=[{"role": "user", "content": "Generate an image of a cat"}],
modalities=["image", "text"],
)
assert response.choices[0].message.content is not None # "data:image/png;base64,e4rr.."
```
</TabItem>
<TabItem value="proxy" label="PROXY">
1. Setup config.yaml
```yaml
model_list:
- model_name: gemini-2.0-flash-exp-image-generation
litellm_params:
model: gemini/gemini-2.0-flash-exp-image-generation
api_key: os.environ/GEMINI_API_KEY
```
2. Start proxy
```bash
litellm --config /path/to/config.yaml
```
3. Test it!
```bash
curl -L -X POST 'http://localhost:4000/v1/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
"model": "gemini-2.0-flash-exp-image-generation",
"messages": [{"role": "user", "content": "Generate an image of a cat"}],
"modalities": ["image", "text"]
}'
```
</TabItem>
</Tabs>

View file

@ -0,0 +1,161 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# [BETA] Google AI Studio (Gemini) Files API
Use this to upload files to Google AI Studio (Gemini).
Useful to pass in large media files to Gemini's `/generateContent` endpoint.
| Action | Supported |
|----------|-----------|
| `create` | Yes |
| `delete` | No |
| `retrieve` | No |
| `list` | No |
## Usage
<Tabs>
<TabItem value="sdk" label="SDK">
```python
import base64
import requests
from litellm import completion, create_file
import os
### UPLOAD FILE ###
# Fetch the audio file and convert it to a base64 encoded string
url = "https://cdn.openai.com/API/docs/audio/alloy.wav"
response = requests.get(url)
response.raise_for_status()
wav_data = response.content
encoded_string = base64.b64encode(wav_data).decode('utf-8')
file = create_file(
file=wav_data,
purpose="user_data",
extra_body={"custom_llm_provider": "gemini"},
api_key=os.getenv("GEMINI_API_KEY"),
)
print(f"file: {file}")
assert file is not None
### GENERATE CONTENT ###
completion = completion(
model="gemini-2.0-flash",
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this recording?"
},
{
"type": "file",
"file": {
"file_id": file.id,
"filename": "my-test-name",
"format": "audio/wav"
}
}
]
},
]
)
print(completion.choices[0].message)
```
</TabItem>
<TabItem value="proxy" label="PROXY">
1. Setup config.yaml
```yaml
model_list:
- model_name: "gemini-2.0-flash"
litellm_params:
model: gemini/gemini-2.0-flash
api_key: os.environ/GEMINI_API_KEY
```
2. Start proxy
```bash
litellm --config config.yaml
```
3. Test it
```python
import base64
import requests
from openai import OpenAI
client = OpenAI(
base_url="http://0.0.0.0:4000",
api_key="sk-1234"
)
# Fetch the audio file and convert it to a base64 encoded string
url = "https://cdn.openai.com/API/docs/audio/alloy.wav"
response = requests.get(url)
response.raise_for_status()
wav_data = response.content
encoded_string = base64.b64encode(wav_data).decode('utf-8')
file = client.files.create(
file=wav_data,
purpose="user_data",
extra_body={"target_model_names": "gemini-2.0-flash"}
)
print(f"file: {file}")
assert file is not None
completion = client.chat.completions.create(
model="gemini-2.0-flash",
modalities=["text", "audio"],
audio={"voice": "alloy", "format": "wav"},
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this recording?"
},
{
"type": "file",
"file": {
"file_id": file.id,
"filename": "my-test-name",
"format": "audio/wav"
}
}
]
},
],
extra_body={"drop_params": True}
)
print(completion.choices[0].message)
```
</TabItem>
</Tabs>

View file

@ -2,466 +2,392 @@ import Image from '@theme/IdealImage';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Huggingface
# Hugging Face
LiteLLM supports running inference across multiple services for models hosted on the Hugging Face Hub.
LiteLLM supports the following types of Hugging Face models:
- **Serverless Inference Providers** - Hugging Face offers easy and unified access to serverless AI inference through multiple inference providers, like [Together AI](https://together.ai) and [Sambanova](https://sambanova.ai). This is the fastest way to integrate AI into your products with a maintenance-free and scalable solution. More details in the [Inference Providers documentation](https://huggingface.co/docs/inference-providers/index).
- **Dedicated Inference Endpoints** - a product to easily deploy models to production. Inference is run by Hugging Face in dedicated, fully managed infrastructure on a cloud provider of your choice. You can deploy your model on Hugging Face Inference Endpoints by following [these steps](https://huggingface.co/docs/inference-endpoints/guides/create_endpoint).
- Serverless Inference API (free) - loaded and ready to use: https://huggingface.co/models?inference=warm&pipeline_tag=text-generation
- Dedicated Inference Endpoints (paid) - manual deployment: https://ui.endpoints.huggingface.co/
- All LLMs served via Hugging Face's Inference use [Text-generation-inference](https://huggingface.co/docs/text-generation-inference).
## Supported Models
### Serverless Inference Providers
You can check available models for an inference provider by going to [huggingface.co/models](https://huggingface.co/models), clicking the "Other" filter tab, and selecting your desired provider:
![Filter models by Inference Provider](../../img/hf_filter_inference_providers.png)
For example, you can find all Fireworks supported models [here](https://huggingface.co/models?inference_provider=fireworks-ai&sort=trending).
### Dedicated Inference Endpoints
Refer to the [Inference Endpoints catalog](https://endpoints.huggingface.co/catalog) for a list of available models.
## Usage
<Tabs>
<TabItem value="serverless" label="Serverless Inference Providers">
### Authentication
With a single Hugging Face token, you can access inference through multiple providers. Your calls are routed through Hugging Face and the usage is billed directly to your Hugging Face account at the standard provider API rates.
Simply set the `HF_TOKEN` environment variable with your Hugging Face token; you can create one here: https://huggingface.co/settings/tokens.
```bash
export HF_TOKEN="hf_xxxxxx"
```
or alternatively, you can pass your Hugging Face token as a parameter:
```python
completion(..., api_key="hf_xxxxxx")
```
### Getting Started
To use a Hugging Face model, specify both the provider and model you want to use in the following format:
```
huggingface/<provider>/<hf_org_or_user>/<hf_model>
```
Where `<hf_org_or_user>/<hf_model>` is the Hugging Face model ID and `<provider>` is the inference provider.
By default, if you don't specify a provider, LiteLLM will use the [HF Inference API](https://huggingface.co/docs/api-inference/en/index).
Examples:
```python
# Run DeepSeek-R1 inference through Together AI
completion(model="huggingface/together/deepseek-ai/DeepSeek-R1",...)
# Run Qwen2.5-72B-Instruct inference through Sambanova
completion(model="huggingface/sambanova/Qwen/Qwen2.5-72B-Instruct",...)
# Run Llama-3.3-70B-Instruct inference through HF Inference API
completion(model="huggingface/meta-llama/Llama-3.3-70B-Instruct",...)
```
<a target="_blank" href="https://colab.research.google.com/github/BerriAI/litellm/blob/main/cookbook/LiteLLM_HuggingFace.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
You need to tell LiteLLM when you're calling Huggingface.
This is done by adding the "huggingface/" prefix to `model`, example `completion(model="huggingface/<model_name>",...)`.
<Tabs>
<TabItem value="serverless" label="Serverless Inference API">
By default, LiteLLM will assume a Hugging Face call follows the [Messages API](https://huggingface.co/docs/text-generation-inference/messages_api), which is fully compatible with the OpenAI Chat Completion API.
<Tabs>
<TabItem value="sdk" label="SDK">
### Basic Completion
Here's an example of chat completion using the DeepSeek-R1 model through Together AI:
```python
import os
from litellm import completion
# [OPTIONAL] set env var
os.environ["HUGGINGFACE_API_KEY"] = "huggingface_api_key"
os.environ["HF_TOKEN"] = "hf_xxxxxx"
messages = [{ "content": "There's a llama in my garden 😱 What should I do?","role": "user"}]
# e.g. Call 'https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct' from Serverless Inference API
response = completion(
model="huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct",
messages=[{ "content": "Hello, how are you?","role": "user"}],
model="huggingface/together/deepseek-ai/DeepSeek-R1",
messages=[
{
"role": "user",
"content": "How many r's are in the word 'strawberry'?",
}
],
)
print(response)
```
### Streaming
Now, let's see what a streaming request looks like.
```python
import os
from litellm import completion
os.environ["HF_TOKEN"] = "hf_xxxxxx"
response = completion(
model="huggingface/together/deepseek-ai/DeepSeek-R1",
messages=[
{
"role": "user",
"content": "How many r's are in the word `strawberry`?",
}
],
stream=True,
)
for chunk in response:
print(chunk)
```
### Image Input
You can also pass images when the model supports it. Here is an example using [Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) model through Sambanova.
```python
from litellm import completion
# Set your Hugging Face Token
os.environ["HF_TOKEN"] = "hf_xxxxxx"
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "What's in this image?"},
{
"type": "image_url",
"image_url": {
"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
}
},
],
}
]
response = completion(
model="huggingface/sambanova/meta-llama/Llama-3.2-11B-Vision-Instruct",
messages=messages,
)
print(response.choices[0])
```
### Function Calling
You can extend the model's capabilities by giving them access to tools. Here is an example with function calling using [Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) model through Sambanova.
```python
import os
from litellm import completion
# Set your Hugging Face Token
os.environ["HF_TOKEN"] = "hf_xxxxxx"
tools = [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather in a given location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA",
},
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
},
"required": ["location"],
},
}
}
]
messages = [
{
"role": "user",
"content": "What's the weather like in Boston today?",
}
]
response = completion(
model="huggingface/sambanova/meta-llama/Llama-3.3-70B-Instruct",
messages=messages,
tools=tools,
tool_choice="auto"
)
print(response)
```
</TabItem>
<TabItem value="endpoints" label="Inference Endpoints">
<a target="_blank" href="https://colab.research.google.com/github/BerriAI/litellm/blob/main/cookbook/LiteLLM_HuggingFace.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
### Basic Completion
After you have [deployed your Hugging Face Inference Endpoint](https://endpoints.huggingface.co/new) on dedicated infrastructure, you can run inference on it by providing the endpoint base URL in `api_base`, and indicating `huggingface/tgi` as the model name.
```python
import os
from litellm import completion
os.environ["HF_TOKEN"] = "hf_xxxxxx"
response = completion(
model="huggingface/tgi",
messages=[{"content": "Hello, how are you?", "role": "user"}],
api_base="https://my-endpoint.endpoints.huggingface.cloud/v1/"
)
print(response)
```
### Streaming
```python
import os
from litellm import completion
os.environ["HF_TOKEN"] = "hf_xxxxxx"
response = completion(
model="huggingface/tgi",
messages=[{"content": "Hello, how are you?", "role": "user"}],
api_base="https://my-endpoint.endpoints.huggingface.cloud/v1/",
stream=True
)
print(response)
```
</TabItem>
<TabItem value="proxy" label="PROXY">
1. Add models to your config.yaml
```yaml
model_list:
- model_name: llama-3.1-8B-instruct
litellm_params:
model: huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct
api_key: os.environ/HUGGINGFACE_API_KEY
```
2. Start the proxy
```bash
$ litellm --config /path/to/config.yaml --debug
```
3. Test it!
```shell
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Authorization: Bearer sk-1234' \
--header 'Content-Type: application/json' \
--data '{
"model": "llama-3.1-8B-instruct",
"messages": [
{
"role": "user",
"content": "I like you!"
}
]
}'
```
</TabItem>
</Tabs>
</TabItem>
<TabItem value="classification" label="Text Classification">
Append `text-classification` to the model name
e.g. `huggingface/text-classification/<model-name>`
<Tabs>
<TabItem value="sdk" label="SDK">
```python
import os
from litellm import completion
# [OPTIONAL] set env var
os.environ["HUGGINGFACE_API_KEY"] = "huggingface_api_key"
messages = [{ "content": "I like you, I love you!","role": "user"}]
# e.g. Call 'shahrukhx01/question-vs-statement-classifier' hosted on HF Inference endpoints
response = completion(
model="huggingface/text-classification/shahrukhx01/question-vs-statement-classifier",
messages=messages,
api_base="https://my-endpoint.endpoints.huggingface.cloud",
)
print(response)
```
</TabItem>
<TabItem value="proxy" label="PROXY">
1. Add models to your config.yaml
```yaml
model_list:
- model_name: bert-classifier
litellm_params:
model: huggingface/text-classification/shahrukhx01/question-vs-statement-classifier
api_key: os.environ/HUGGINGFACE_API_KEY
api_base: "https://my-endpoint.endpoints.huggingface.cloud"
```
2. Start the proxy
```bash
$ litellm --config /path/to/config.yaml --debug
```
3. Test it!
```shell
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Authorization: Bearer sk-1234' \
--header 'Content-Type: application/json' \
--data '{
"model": "bert-classifier",
"messages": [
{
"role": "user",
"content": "I like you!"
}
]
}'
```
</TabItem>
</Tabs>
</TabItem>
<TabItem value="dedicated" label="Dedicated Inference Endpoints">
Steps to use
* Create your own Hugging Face dedicated endpoint here: https://ui.endpoints.huggingface.co/
* Set `api_base` to your deployed api base
* Add the `huggingface/` prefix to your model so litellm knows it's a huggingface Deployed Inference Endpoint
<Tabs>
<TabItem value="sdk" label="SDK">
```python
import os
from litellm import completion
os.environ["HUGGINGFACE_API_KEY"] = ""
# TGI model: Call https://huggingface.co/glaiveai/glaive-coder-7b
# add the 'huggingface/' prefix to the model to set huggingface as the provider
# set api base to your deployed api endpoint from hugging face
response = completion(
model="huggingface/glaiveai/glaive-coder-7b",
messages=[{ "content": "Hello, how are you?","role": "user"}],
api_base="https://wjiegasee9bmqke2.us-east-1.aws.endpoints.huggingface.cloud"
)
print(response)
```
</TabItem>
<TabItem value="proxy" label="PROXY">
1. Add models to your config.yaml
```yaml
model_list:
- model_name: glaive-coder
litellm_params:
model: huggingface/glaiveai/glaive-coder-7b
api_key: os.environ/HUGGINGFACE_API_KEY
api_base: "https://wjiegasee9bmqke2.us-east-1.aws.endpoints.huggingface.cloud"
```
2. Start the proxy
```bash
$ litellm --config /path/to/config.yaml --debug
```
3. Test it!
```shell
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Authorization: Bearer sk-1234' \
--header 'Content-Type: application/json' \
--data '{
"model": "glaive-coder",
"messages": [
{
"role": "user",
"content": "I like you!"
}
]
}'
```
</TabItem>
</Tabs>
</TabItem>
</Tabs>
## Streaming
<a target="_blank" href="https://colab.research.google.com/github/BerriAI/litellm/blob/main/cookbook/LiteLLM_HuggingFace.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
You need to tell LiteLLM when you're calling Huggingface.
This is done by adding the "huggingface/" prefix to `model`, example `completion(model="huggingface/<model_name>",...)`.
```python
import os
from litellm import completion
# [OPTIONAL] set env var
os.environ["HUGGINGFACE_API_KEY"] = "huggingface_api_key"
messages = [{ "content": "There's a llama in my garden 😱 What should I do?","role": "user"}]
# e.g. Call 'facebook/blenderbot-400M-distill' hosted on HF Inference endpoints
response = completion(
model="huggingface/facebook/blenderbot-400M-distill",
messages=messages,
api_base="https://my-endpoint.huggingface.cloud",
stream=True
)
print(response)
for chunk in response:
print(chunk)
print(chunk)
```
### Image Input
```python
import os
from litellm import completion
os.environ["HF_TOKEN"] = "hf_xxxxxx"
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "What's in this image?"},
{
"type": "image_url",
"image_url": {
"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
}
},
],
}
]
response = completion(
model="huggingface/tgi",
messages=messages,
api_base="https://my-endpoint.endpoints.huggingface.cloud/v1/""
)
print(response.choices[0])
```
### Function Calling
```python
import os
from litellm import completion
os.environ["HF_TOKEN"] = "hf_xxxxxx"
functions = [{
"name": "get_weather",
"description": "Get the weather in a given location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The location to get weather for"
}
},
"required": ["location"]
}
}]
response = completion(
model="huggingface/tgi",
messages=[{"content": "What's the weather like in San Francisco?", "role": "user"}],
api_base="https://my-endpoint.endpoints.huggingface.cloud/v1/",
functions=functions
)
print(response)
```
</TabItem>
</Tabs>
## LiteLLM Proxy Server with Hugging Face models
You can set up a [LiteLLM Proxy Server](https://docs.litellm.ai/#litellm-proxy-server-llm-gateway) to serve Hugging Face models through any of the supported Inference Providers. Here's how to do it:
### Step 1. Setup the config file
In this case, we are configuring a proxy to serve `DeepSeek R1` from Hugging Face, using Together AI as the backend Inference Provider.
```yaml
model_list:
- model_name: my-r1-model
litellm_params:
model: huggingface/together/deepseek-ai/DeepSeek-R1
api_key: os.environ/HF_TOKEN # ensure you have `HF_TOKEN` in your .env
```
### Step 2. Start the server
```bash
litellm --config /path/to/config.yaml
```
### Step 3. Make a request to the server
<Tabs>
<TabItem value="curl" label="curl">
```shell
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
"model": "my-r1-model",
"messages": [
{
"role": "user",
"content": "Hello, how are you?"
}
]
}'
```
</TabItem>
<TabItem value="python" label="python">
```python
# pip install openai
from openai import OpenAI
client = OpenAI(
base_url="http://0.0.0.0:4000",
api_key="anything",
)
response = client.chat.completions.create(
model="my-r1-model",
messages=[
{"role": "user", "content": "Hello, how are you?"}
]
)
print(response)
```
</TabItem>
</Tabs>
## Embedding
LiteLLM supports Hugging Face's [text-embedding-inference](https://github.com/huggingface/text-embeddings-inference) format.
LiteLLM supports Hugging Face's [text-embedding-inference](https://github.com/huggingface/text-embeddings-inference) models as well.
```python
from litellm import embedding
import os
os.environ['HUGGINGFACE_API_KEY'] = ""
os.environ['HF_TOKEN'] = "hf_xxxxxx"
response = embedding(
model='huggingface/microsoft/codebert-base',
input=["good morning from litellm"]
)
```
## Advanced
### Setting API KEYS + API BASE
If required, you can set the API key and API base in your OS environment. [Code for how it's sent](https://github.com/BerriAI/litellm/blob/0100ab2382a0e720c7978fbf662cc6e6920e7e03/litellm/llms/huggingface_restapi.py#L25)
```python
import os
os.environ["HUGGINGFACE_API_KEY"] = ""
os.environ["HUGGINGFACE_API_BASE"] = ""
```
### Viewing Log probs
#### Using `decoder_input_details` - OpenAI `echo`
The `echo` param is supported by OpenAI Completions. Use `litellm.text_completion()` for this:
```python
from litellm import text_completion
response = text_completion(
model="huggingface/bigcode/starcoder",
prompt="good morning",
max_tokens=10, logprobs=10,
echo=True
)
```
#### Output
```json
{
"id": "chatcmpl-3fc71792-c442-4ba1-a611-19dd0ac371ad",
"object": "text_completion",
"created": 1698801125.936519,
"model": "bigcode/starcoder",
"choices": [
{
"text": ", I'm going to make you a sand",
"index": 0,
"logprobs": {
"tokens": [
"good",
" morning",
",",
" I",
"'m",
" going",
" to",
" make",
" you",
" a",
" s",
"and"
],
"token_logprobs": [
"None",
-14.96875,
-2.2285156,
-2.734375,
-2.0957031,
-2.0917969,
-0.09429932,
-3.1132812,
-1.3203125,
-1.2304688,
-1.6201172,
-0.010292053
]
},
"finish_reason": "length"
}
],
"usage": {
"completion_tokens": 9,
"prompt_tokens": 2,
"total_tokens": 11
}
}
```
### Models with Prompt Formatting
For models with special prompt templates (e.g. Llama2), we format the prompt to fit their template.
#### Models with natively Supported Prompt Templates
| Model Name | Works for Models | Function Call | Required OS Variables |
| ------------------------------------ | ---------------------------------- | ----------------------------------------------------------------------------------------------------------------------- | ----------------------------------- |
| mistralai/Mistral-7B-Instruct-v0.1 | mistralai/Mistral-7B-Instruct-v0.1 | `completion(model='huggingface/mistralai/Mistral-7B-Instruct-v0.1', messages=messages, api_base="your_api_endpoint")` | `os.environ['HUGGINGFACE_API_KEY']` |
| meta-llama/Llama-2-7b-chat | All meta-llama llama2 chat models | `completion(model='huggingface/meta-llama/Llama-2-7b', messages=messages, api_base="your_api_endpoint")` | `os.environ['HUGGINGFACE_API_KEY']` |
| tiiuae/falcon-7b-instruct | All falcon instruct models | `completion(model='huggingface/tiiuae/falcon-7b-instruct', messages=messages, api_base="your_api_endpoint")` | `os.environ['HUGGINGFACE_API_KEY']` |
| mosaicml/mpt-7b-chat | All mpt chat models | `completion(model='huggingface/mosaicml/mpt-7b-chat', messages=messages, api_base="your_api_endpoint")` | `os.environ['HUGGINGFACE_API_KEY']` |
| codellama/CodeLlama-34b-Instruct-hf | All codellama instruct models | `completion(model='huggingface/codellama/CodeLlama-34b-Instruct-hf', messages=messages, api_base="your_api_endpoint")` | `os.environ['HUGGINGFACE_API_KEY']` |
| WizardLM/WizardCoder-Python-34B-V1.0 | All wizardcoder models | `completion(model='huggingface/WizardLM/WizardCoder-Python-34B-V1.0', messages=messages, api_base="your_api_endpoint")` | `os.environ['HUGGINGFACE_API_KEY']` |
| Phind/Phind-CodeLlama-34B-v2 | All phind-codellama models | `completion(model='huggingface/Phind/Phind-CodeLlama-34B-v2', messages=messages, api_base="your_api_endpoint")` | `os.environ['HUGGINGFACE_API_KEY']` |
**What if we don't support a model you need?**
You can also specify your own custom prompt formatting, in case we don't have your model covered yet.
**Does this mean you have to specify a prompt for all models?**
No. By default we'll concatenate your message content to make a prompt.
**Default Prompt Template**
```python
def default_pt(messages):
return " ".join(message["content"] for message in messages)
```
[Code for how prompt formats work in LiteLLM](https://github.com/BerriAI/litellm/blob/main/litellm/llms/prompt_templates/factory.py)
#### Custom prompt templates
```python
import litellm
from litellm import completion

# Register your own custom prompt template
litellm.register_prompt_template(
model="togethercomputer/LLaMA-2-7B-32K",
roles={
"system": {
"pre_message": "[INST] <<SYS>>\n",
"post_message": "\n<</SYS>>\n [/INST]\n"
},
"user": {
"pre_message": "[INST] ",
"post_message": " [/INST]\n"
},
"assistant": {
"post_message": "\n"
}
}
)
def test_huggingface_custom_model():
    model = "huggingface/togethercomputer/LLaMA-2-7B-32K"
    messages = [{"role": "user", "content": "Hello, how are you?"}]
    response = completion(model=model, messages=messages, api_base="https://ecd4sb5n09bo4ei2.us-east-1.aws.endpoints.huggingface.cloud")
print(response['choices'][0]['message']['content'])
return response
test_huggingface_custom_model()
```
[Implementation Code](https://github.com/BerriAI/litellm/blob/c0b3da2c14c791a0b755f0b1e5a9ef065951ecbf/litellm/llms/huggingface_restapi.py#L52)
### Deploying a model on huggingface
You can use any chat/text model from Hugging Face with the following steps:
- Copy your model id/url from Huggingface Inference Endpoints
- [ ] Go to https://ui.endpoints.huggingface.co/
- [ ] Copy the url of the specific model you'd like to use
<Image img={require('../../img/hf_inference_endpoint.png')} alt="HF_Dashboard" style={{ maxWidth: '50%', height: 'auto' }}/>
- Set it as your model name
- Set your HUGGINGFACE_API_KEY as an environment variable
Need help deploying a model on huggingface? [Check out this guide.](https://huggingface.co/docs/inference-endpoints/guides/create_endpoint)
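Once your endpoint is live, calling it follows the same pattern as the table above. A minimal sketch - the endpoint URL and model id below are placeholders for your own deployment:
```python
import os
from litellm import completion

os.environ["HUGGINGFACE_API_KEY"] = "hf_xxxxxx"

response = completion(
    model="huggingface/meta-llama/Llama-2-7b-chat",  # the model you deployed
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    api_base="https://my-endpoint.us-east-1.aws.endpoints.huggingface.cloud",  # your Inference Endpoint URL
)
print(response)
```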
# output
Same as the OpenAI format, but also includes logprobs. [See the code](https://github.com/BerriAI/litellm/blob/b4b2dbf005142e0a483d46a07a88a19814899403/litellm/llms/huggingface_restapi.py#L115)
```json
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"content": "\ud83d\ude31\n\nComment: @SarahSzabo I'm",
"role": "assistant",
"logprobs": -22.697942825499993
}
}
],
"created": 1693436637.38206,
"model": "https://ji16r2iys9a8rjk2.us-east-1.aws.endpoints.huggingface.cloud",
"usage": {
"prompt_tokens": 14,
"completion_tokens": 11,
"total_tokens": 25
}
}
```
# FAQ
**How does billing work with Hugging Face Inference Providers?**
> Billing is centralized on your Hugging Face account, no matter which providers you are using. You are billed the standard provider API rates with no additional markup - Hugging Face simply passes through the provider costs. Note that [Hugging Face PRO](https://huggingface.co/subscribe/pro) users get $2 worth of Inference credits every month that can be used across providers.
**Do I need to create an account for each Inference Provider?**
> No, you don't need to create separate accounts. All requests are routed through Hugging Face, so you only need your HF token. This allows you to easily benchmark different providers and choose the one that best fits your needs.
**Will more inference providers be supported by Hugging Face in the future?**
> Yes! New inference providers (and models) are being added gradually.
**Does this support stop sequences?**
Yes, we support stop sequences - and you can pass as many as allowed by Hugging Face (or any provider!)
**How do you deal with repetition penalty?**
We map the presence penalty parameter in openai to the repetition penalty parameter on Hugging Face. [See code](https://github.com/BerriAI/litellm/blob/b4b2dbf005142e0a483d46a07a88a19814899403/litellm/utils.py#L757).
We welcome any suggestions for improving our Hugging Face integration - Create an [issue](https://github.com/BerriAI/litellm/issues/new/choose)/[Join the Discord](https://discord.com/invite/wuPM9dRgDw)!

View file

@ -1,3 +1,6 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Infinity
| Property | Details |
@ -12,6 +15,9 @@
```python
from litellm import rerank
import os
os.environ["INFINITY_API_BASE"] = "http://localhost:8080"
response = rerank(
model="infinity/rerank",
@ -65,3 +71,114 @@ curl http://0.0.0.0:4000/rerank \
```
## Supported Cohere Rerank API Params
| Param | Type | Description |
|-------|-------|-------|
| `query` | `str` | The query to rerank the documents against |
| `documents` | `list[str]` | The documents to rerank |
| `top_n` | `int` | The number of documents to return |
| `return_documents` | `bool` | Whether to return the documents in the response |
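For example, a short sketch using `top_n` (the other params are shown in the sections below):
```python
from litellm import rerank
import os

os.environ["INFINITY_API_BASE"] = "http://localhost:8080"

response = rerank(
    model="infinity/rerank",
    query="What is the capital of France?",
    documents=["Paris", "London", "Berlin", "Madrid"],
    top_n=2,  # only return the 2 highest-scoring documents
)
print(response)
```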
### Usage - Return Documents
<Tabs>
<TabItem value="sdk" label="SDK">
```python
response = rerank(
model="infinity/rerank",
query="What is the capital of France?",
documents=["Paris", "London", "Berlin", "Madrid"],
return_documents=True,
)
```
</TabItem>
<TabItem value="proxy" label="PROXY">
```bash
curl http://0.0.0.0:4000/rerank \
-H "Authorization: Bearer sk-1234" \
-H "Content-Type: application/json" \
-d '{
"model": "custom-infinity-rerank",
"query": "What is the capital of France?",
"documents": [
"Paris",
"London",
"Berlin",
"Madrid"
],
"return_documents": True,
}'
```
</TabItem>
</Tabs>
## Pass Provider-specific Params
Any unmapped params will be passed to the provider as-is.
<Tabs>
<TabItem value="sdk" label="SDK">
```python
from litellm import rerank
import os
os.environ["INFINITY_API_BASE"] = "http://localhost:8080"
response = rerank(
model="infinity/rerank",
query="What is the capital of France?",
documents=["Paris", "London", "Berlin", "Madrid"],
raw_scores=True, # 👈 PROVIDER-SPECIFIC PARAM
)
```
</TabItem>
<TabItem value="proxy" label="PROXY">
1. Setup config.yaml
```yaml
model_list:
- model_name: custom-infinity-rerank
litellm_params:
model: infinity/rerank
      api_base: http://localhost:8080
raw_scores: True # 👈 EITHER SET PROVIDER-SPECIFIC PARAMS HERE OR IN REQUEST BODY
```
2. Start litellm
```bash
litellm --config /path/to/config.yaml
# RUNNING on http://0.0.0.0:4000
```
3. Test it!
```bash
curl http://0.0.0.0:4000/rerank \
-H "Authorization: Bearer sk-1234" \
-H "Content-Type: application/json" \
-d '{
"model": "custom-infinity-rerank",
"query": "What is the capital of the United States?",
"documents": [
"Carson City is the capital city of the American state of Nevada.",
"The Commonwealth of the Northern Mariana Islands is a group of islands in the Pacific Ocean. Its capital is Saipan.",
"Washington, D.C. is the capital of the United States.",
"Capital punishment has existed in the United States since before it was a country."
],
"raw_scores": True # 👈 PROVIDER-SPECIFIC PARAM
}'
```
</TabItem>
</Tabs>

View file

@ -3,13 +3,15 @@ import TabItem from '@theme/TabItem';
# LiteLLM Proxy (LLM Gateway)
:::tip
[LiteLLM provides a **self hosted** proxy server (AI Gateway)](../simple_proxy) to call all the LLMs in the OpenAI format
| Property | Details |
|-------|-------|
| Description | LiteLLM Proxy is an OpenAI-compatible gateway that allows you to interact with multiple LLM providers through a unified API. Simply use the `litellm_proxy/` prefix before the model name to route your requests through the proxy. |
| Provider Route on LiteLLM | `litellm_proxy/` (add this prefix to the model name, to route any requests to litellm_proxy - e.g. `litellm_proxy/your-model-name`) |
| Setup LiteLLM Gateway | [LiteLLM Gateway ↗](../simple_proxy) |
| Supported Endpoints |`/chat/completions`, `/completions`, `/embeddings`, `/audio/speech`, `/audio/transcriptions`, `/images`, `/rerank` |
:::
**[LiteLLM Proxy](../simple_proxy) is OpenAI compatible** - you just need the `litellm_proxy/` prefix before the model name
## Required Variables
@ -55,7 +57,7 @@ messages = [{ "content": "Hello, how are you?","role": "user"}]
# litellm proxy call
response = completion(
model="litellm_proxy/your-model-name",
messages,
messages=messages,
api_base = "your-litellm-proxy-url",
api_key = "your-litellm-proxy-api-key"
)
@ -74,7 +76,7 @@ messages = [{ "content": "Hello, how are you?","role": "user"}]
# openai call
response = completion(
model="litellm_proxy/your-model-name",
messages,
messages=messages,
api_base = "your-litellm-proxy-url",
stream=True
)
@ -83,7 +85,76 @@ for chunk in response:
print(chunk)
```
## Embeddings
```python
import litellm
response = litellm.embedding(
model="litellm_proxy/your-embedding-model",
input="Hello world",
api_base="your-litellm-proxy-url",
api_key="your-litellm-proxy-api-key"
)
```
## Image Generation
```python
import litellm
response = litellm.image_generation(
model="litellm_proxy/dall-e-3",
prompt="A beautiful sunset over mountains",
api_base="your-litellm-proxy-url",
api_key="your-litellm-proxy-api-key"
)
```
## Audio Transcription
```python
import litellm
response = litellm.transcription(
model="litellm_proxy/whisper-1",
file="your-audio-file",
api_base="your-litellm-proxy-url",
api_key="your-litellm-proxy-api-key"
)
```
## Text to Speech
```python
import litellm
response = litellm.speech(
model="litellm_proxy/tts-1",
input="Hello world",
api_base="your-litellm-proxy-url",
api_key="your-litellm-proxy-api-key"
)
```
## Rerank
```python
import litellm
response = litellm.rerank(
model="litellm_proxy/rerank-english-v2.0",
query="What is machine learning?",
documents=[
"Machine learning is a field of study in artificial intelligence",
"Biology is the study of living organisms"
],
api_base="your-litellm-proxy-url",
api_key="your-litellm-proxy-api-key"
)
```
## **Usage with Langchain, LLamaindex, OpenAI Js, Anthropic SDK, Instructor**
#### [Follow this doc to see how to use litellm proxy with langchain, llamaindex, anthropic etc](../proxy/user_keys)

View file

@ -202,6 +202,67 @@ curl -X POST 'http://0.0.0.0:4000/chat/completions' \
</TabItem>
</Tabs>
## Using Ollama FIM on `/v1/completions`
LiteLLM supports calling Ollama's `/api/generate` endpoint on `/v1/completions` requests.
<Tabs>
<TabItem value="sdk" label="SDK">
```python
import litellm
litellm._turn_on_debug() # turn on debug to see the request
from litellm import text_completion

response = text_completion(
model="ollama/llama3.1",
prompt="Hello, world!",
api_base="http://localhost:11434"
)
print(response)
```
</TabItem>
<TabItem value="proxy" label="PROXY">
1. Setup config.yaml
```yaml
model_list:
- model_name: "llama3.1"
litellm_params:
model: "ollama/llama3.1"
api_base: "http://localhost:11434"
```
2. Start proxy
```bash
litellm --config /path/to/config.yaml --detailed_debug
# RUNNING ON http://0.0.0.0:4000
```
3. Test it!
```python
from openai import OpenAI
client = OpenAI(
api_key="anything", # 👈 PROXY KEY (can be anything, if master_key not set)
base_url="http://0.0.0.0:4000" # 👈 PROXY BASE URL
)
response = client.completions.create(
model="ollama/llama3.1",
prompt="Hello, world!",
api_base="http://localhost:11434"
)
print(response)
```
</TabItem>
</Tabs>
## Using ollama `api/chat`
In order to send ollama requests to `POST /api/chat` on your ollama server, set the model prefix to `ollama_chat`
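For example, a minimal sketch (assuming `llama3.1` is already pulled on your local ollama server):
```python
from litellm import completion

response = completion(
    model="ollama_chat/llama3.1",  # routes the request to POST /api/chat
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    api_base="http://localhost:11434",
)
print(response)
```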

View file

@ -228,6 +228,92 @@ response = completion(
```
## PDF File Parsing
OpenAI has a new `file` message type that allows you to pass in a PDF file and have it parsed into a structured output. [Read more](https://platform.openai.com/docs/guides/pdf-files?api-mode=chat&lang=python)
<Tabs>
<TabItem value="sdk" label="SDK">
```python
import base64
from litellm import completion
with open("draconomicon.pdf", "rb") as f:
data = f.read()
base64_string = base64.b64encode(data).decode("utf-8")
response = completion(
model="gpt-4o",
messages=[
{
"role": "user",
"content": [
{
"type": "file",
"file": {
"filename": "draconomicon.pdf",
"file_data": f"data:application/pdf;base64,{base64_string}",
}
},
{
"type": "text",
"text": "What is the first dragon in the book?",
}
],
},
],
)
print(response.choices[0].message.content)
```
</TabItem>
<TabItem value="proxy" label="PROXY">
1. Setup config.yaml
```yaml
model_list:
- model_name: openai-model
litellm_params:
model: gpt-4o
api_key: os.environ/OPENAI_API_KEY
```
2. Start the proxy
```bash
litellm --config config.yaml
```
3. Test it!
```bash
curl -X POST 'http://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
"model": "openai-model",
"messages": [
{"role": "user", "content": [
{
"type": "file",
"file": {
"filename": "draconomicon.pdf",
"file_data": f"data:application/pdf;base64,{base64_string}",
}
}
]}
]
}'
```
</TabItem>
</Tabs>
## OpenAI Fine Tuned Models
| Model Name | Function Call |
@ -239,6 +325,74 @@ response = completion(
| fine tuned `gpt-3.5-turbo-0613` | `response = completion(model="ft:gpt-3.5-turbo-0613", messages=messages)` |
## OpenAI Audio Transcription
LiteLLM supports the OpenAI Audio Transcription endpoint.
Supported models:
| Model Name | Function Call |
|---------------------------|-----------------------------------------------------------------|
| `whisper-1` | `response = completion(model="whisper-1", file=audio_file)` |
| `gpt-4o-transcribe` | `response = completion(model="gpt-4o-transcribe", file=audio_file)` |
| `gpt-4o-mini-transcribe` | `response = completion(model="gpt-4o-mini-transcribe", file=audio_file)` |
<Tabs>
<TabItem value="sdk" label="SDK">
```python
from litellm import transcription
import os
# set api keys
os.environ["OPENAI_API_KEY"] = ""
audio_file = open("/path/to/audio.mp3", "rb")
response = transcription(model="gpt-4o-transcribe", file=audio_file)
print(f"response: {response}")
```
</TabItem>
<TabItem value="proxy" label="PROXY">
1. Setup config.yaml
```yaml
model_list:
- model_name: gpt-4o-transcribe
litellm_params:
model: gpt-4o-transcribe
api_key: os.environ/OPENAI_API_KEY
model_info:
mode: audio_transcription
general_settings:
master_key: sk-1234
```
2. Start the proxy
```bash
litellm --config config.yaml
```
3. Test it!
```bash
curl --location 'http://0.0.0.0:4000/v1/audio/transcriptions' \
--header 'Authorization: Bearer sk-1234' \
--form 'file=@"/Users/krrishdholakia/Downloads/gettysburg.wav"' \
--form 'model="gpt-4o-transcribe"'
```
</TabItem>
</Tabs>
## Advanced
### Getting OpenAI API Response Headers
@ -449,26 +603,6 @@ response = litellm.acompletion(
)
```
### Using Helicone Proxy with LiteLLM
```python
import os
import litellm
from litellm import completion
os.environ["OPENAI_API_KEY"] = ""
# os.environ["OPENAI_API_BASE"] = ""
litellm.api_base = "https://oai.hconeai.com/v1"
litellm.headers = {
"Helicone-Auth": f"Bearer {os.getenv('HELICONE_API_KEY')}",
"Helicone-Cache-Enabled": "true",
}
messages = [{ "content": "Hello, how are you?","role": "user"}]
# openai call
response = completion("gpt-3.5-turbo", messages)
```
### Using OpenAI Proxy with LiteLLM
```python

View file

@ -10,9 +10,11 @@ LiteLLM supports all the text / chat / vision models from [OpenRouter](https://o
import os
from litellm import completion
os.environ["OPENROUTER_API_KEY"] = ""
os.environ["OPENROUTER_API_BASE"] = "" # [OPTIONAL] defaults to https://openrouter.ai/api/v1
os.environ["OR_SITE_URL"] = "" # optional
os.environ["OR_APP_NAME"] = "" # optional
os.environ["OR_SITE_URL"] = "" # [OPTIONAL]
os.environ["OR_APP_NAME"] = "" # [OPTIONAL]
response = completion(
model="openrouter/google/palm-2-chat-bison",

View file

@ -17,7 +17,7 @@ import os
os.environ['PERPLEXITYAI_API_KEY'] = ""
response = completion(
model="perplexity/mistral-7b-instruct",
model="perplexity/sonar-pro",
messages=messages
)
print(response)
@ -30,7 +30,7 @@ import os
os.environ['PERPLEXITYAI_API_KEY'] = ""
response = completion(
model="perplexity/mistral-7b-instruct",
model="perplexity/sonar-pro",
messages=messages,
stream=True
)
@ -45,19 +45,12 @@ All models listed here https://docs.perplexity.ai/docs/model-cards are supported
| Model Name | Function Call |
|--------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| pplx-7b-chat | `completion(model="perplexity/pplx-7b-chat", messages)` |
| pplx-70b-chat | `completion(model="perplexity/pplx-70b-chat", messages)` |
| pplx-7b-online | `completion(model="perplexity/pplx-7b-online", messages)` |
| pplx-70b-online | `completion(model="perplexity/pplx-70b-online", messages)` |
| codellama-34b-instruct | `completion(model="perplexity/codellama-34b-instruct", messages)` |
| llama-2-13b-chat | `completion(model="perplexity/llama-2-13b-chat", messages)` |
| llama-2-70b-chat | `completion(model="perplexity/llama-2-70b-chat", messages)` |
| mistral-7b-instruct | `completion(model="perplexity/mistral-7b-instruct", messages)` |
| openhermes-2-mistral-7b | `completion(model="perplexity/openhermes-2-mistral-7b", messages)` |
| openhermes-2.5-mistral-7b | `completion(model="perplexity/openhermes-2.5-mistral-7b", messages)` |
| pplx-7b-chat-alpha | `completion(model="perplexity/pplx-7b-chat-alpha", messages)` |
| pplx-70b-chat-alpha | `completion(model="perplexity/pplx-70b-chat-alpha", messages)` |
| sonar-deep-research | `completion(model="perplexity/sonar-deep-research", messages)` |
| sonar-reasoning-pro | `completion(model="perplexity/sonar-reasoning-pro", messages)` |
| sonar-reasoning | `completion(model="perplexity/sonar-reasoning", messages)` |
| sonar-pro | `completion(model="perplexity/sonar-pro", messages)` |
| sonar | `completion(model="perplexity/sonar", messages)` |
| r1-1776 | `completion(model="perplexity/r1-1776", messages)` |

View file

@ -230,7 +230,7 @@ response = completion(
model="predibase/llama-3-8b-instruct",
messages = [{ "content": "Hello, how are you?","role": "user"}],
adapter_id="my_repo/3",
adapter_soruce="pbase",
adapter_source="pbase",
)
```

View file

@ -2,11 +2,11 @@ import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Sambanova
https://community.sambanova.ai/t/create-chat-completion-api/
https://cloud.sambanova.ai/
:::tip
**We support ALL Sambanova models, just set `model=sambanova/<any-model-on-sambanova>` as a prefix when sending litellm requests. For the complete supported model list, visit https://sambanova.ai/technology/models **
**We support ALL Sambanova models, just set `model=sambanova/<any-model-on-sambanova>` as a prefix when sending litellm requests. For the complete supported model list, visit https://docs.sambanova.ai/cloud/docs/get-started/supported-models **
:::
@ -27,12 +27,11 @@ response = completion(
messages=[
{
"role": "user",
"content": "What do you know about sambanova.ai",
"content": "What do you know about sambanova.ai. Give your response in json format",
}
],
max_tokens=10,
response_format={ "type": "json_object" },
seed=123,
stop=["\n\n"],
temperature=0.2,
top_p=0.9,
@ -54,13 +53,12 @@ response = completion(
messages=[
{
"role": "user",
"content": "What do you know about sambanova.ai",
"content": "What do you know about sambanova.ai. Give your response in json format",
}
],
stream=True,
max_tokens=10,
response_format={ "type": "json_object" },
seed=123,
stop=["\n\n"],
temperature=0.2,
top_p=0.9,

View file

@ -0,0 +1,90 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Snowflake
| Property | Details |
|-------|-------|
| Description | The Snowflake Cortex LLM REST API lets you access the COMPLETE function via HTTP POST requests|
| Provider Route on LiteLLM | `snowflake/` |
| Link to Provider Doc | [Snowflake ↗](https://docs.snowflake.com/en/user-guide/snowflake-cortex/cortex-llm-rest-api) |
| Base URL | [https://{account-id}.snowflakecomputing.com/api/v2/cortex/inference:complete/](https://{account-id}.snowflakecomputing.com/api/v2/cortex/inference:complete) |
| Supported OpenAI Endpoints | `/chat/completions`, `/completions` |
Currently, Snowflake's REST API does not have an endpoint for `snowflake-arctic-embed` embedding models. If you want to use these embedding models with LiteLLM, you can call them through our Hugging Face provider.
Find the Arctic Embed models [here](https://huggingface.co/collections/Snowflake/arctic-embed-661fd57d50fab5fc314e4c18) on Hugging Face.
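For example, a sketch using the Hugging Face provider route (the model id below is one of the Arctic Embed models from that collection):
```python
import os
from litellm import embedding

os.environ["HF_TOKEN"] = "hf_xxxxxx"

response = embedding(
    model="huggingface/Snowflake/snowflake-arctic-embed-m",  # served via Hugging Face, not the Snowflake REST API
    input=["good morning from litellm"],
)
print(response)
```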
## Supported OpenAI Parameters
```
"temperature",
"max_tokens",
"top_p",
"response_format"
```
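For example, a sketch passing these params (env vars are set as described in the next section):
```python
import os
from litellm import completion

os.environ["SNOWFLAKE_JWT"] = "YOUR JWT"
os.environ["SNOWFLAKE_ACCOUNT_ID"] = "YOUR ACCOUNT IDENTIFIER"

response = completion(
    model="snowflake/mistral-7b",
    messages=[{"role": "user", "content": "Reply with a short JSON greeting"}],
    temperature=0.2,
    max_tokens=100,
    top_p=0.9,
    response_format={"type": "json_object"},
)
print(response)
```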
## API KEYS
Snowflake does not have API keys. Instead, you access the Snowflake API with your JWT token and account identifier.
```python
import os
os.environ["SNOWFLAKE_JWT"] = "YOUR JWT"
os.environ["SNOWFLAKE_ACCOUNT_ID"] = "YOUR ACCOUNT IDENTIFIER"
```
## Usage
```python
import os
from litellm import completion

## set ENV variables
os.environ["SNOWFLAKE_JWT"] = "YOUR JWT"
os.environ["SNOWFLAKE_ACCOUNT_ID"] = "YOUR ACCOUNT IDENTIFIER"
# Snowflake call
response = completion(
model="snowflake/mistral-7b",
messages = [{ "content": "Hello, how are you?","role": "user"}]
)
```
## Usage with LiteLLM Proxy
#### 1. Required env variables
```bash
export SNOWFLAKE_JWT=""
export SNOWFLAKE_ACCOUNT_ID=""
```
#### 2. Start the proxy
```yaml
model_list:
- model_name: mistral-7b
litellm_params:
model: snowflake/mistral-7b
      api_key: os.environ/SNOWFLAKE_JWT # your JWT from step 1
api_base: https://YOUR-ACCOUNT-ID.snowflakecomputing.com/api/v2/cortex/inference:complete
```
```bash
litellm --config /path/to/config.yaml
```
#### 3. Test it
```shell
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--data ' {
"model": "snowflake/mistral-7b",
"messages": [
{
"role": "user",
"content": "Hello, how are you?"
}
]
}
'
```

View file

@ -347,7 +347,7 @@ Return a `list[Recipe]`
completion(model="vertex_ai/gemini-1.5-flash-preview-0514", messages=messages, response_format={ "type": "json_object" })
```
### **Grounding**
### **Grounding - Web Search**
Add Google Search Result grounding to vertex ai calls.
@ -358,7 +358,7 @@ See the grounding metadata with `response_obj._hidden_params["vertex_ai_groundin
<Tabs>
<TabItem value="sdk" label="SDK">
```python showLineNumbers
from litellm import completion
## SETUP ENVIRONMENT
@ -377,14 +377,36 @@ print(resp)
</TabItem>
<TabItem value="proxy" label="PROXY">
<Tabs>
<TabItem value="openai" label="OpenAI Python SDK">
```python showLineNumbers
from openai import OpenAI
client = OpenAI(
api_key="sk-1234", # pass litellm proxy key, if you're using virtual keys
base_url="http://0.0.0.0:4000/v1/" # point to litellm proxy
)
response = client.chat.completions.create(
model="gemini-pro",
messages=[{"role": "user", "content": "Who won the world cup?"}],
tools=[{"googleSearchRetrieval": {}}],
)
print(response)
```
</TabItem>
<TabItem value="curl" label="cURL">
```bash showLineNumbers
curl http://localhost:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-1234" \
-d '{
"model": "gemini-pro",
"messages": [
{"role": "user", "content": "Hello, Claude!"}
{"role": "user", "content": "Who won the world cup?"}
],
"tools": [
{
@ -394,24 +416,98 @@ curl http://localhost:4000/v1/chat/completions \
}'
```
</TabItem>
</Tabs>
</TabItem>
</Tabs>
You can also use the `enterpriseWebSearch` tool for an [enterprise compliant search](https://cloud.google.com/vertex-ai/generative-ai/docs/grounding/web-grounding-enterprise).
<Tabs>
<TabItem value="sdk" label="SDK">
```python showLineNumbers
from litellm import completion
## SETUP ENVIRONMENT
# !gcloud auth application-default login - run this to add vertex credentials to your env
tools = [{"enterpriseWebSearch": {}}] # 👈 ADD GOOGLE ENTERPRISE SEARCH
resp = completion(
model="vertex_ai/gemini-1.0-pro-001",
messages=[{"role": "user", "content": "Who won the world cup?"}],
tools=tools,
)
print(resp)
```
</TabItem>
<TabItem value="proxy" label="PROXY">
<Tabs>
<TabItem value="openai" label="OpenAI Python SDK">
```python showLineNumbers
from openai import OpenAI
client = OpenAI(
api_key="sk-1234", # pass litellm proxy key, if you're using virtual keys
base_url="http://0.0.0.0:4000/v1/" # point to litellm proxy
)
response = client.chat.completions.create(
model="gemini-pro",
messages=[{"role": "user", "content": "Who won the world cup?"}],
tools=[{"enterpriseWebSearch": {}}],
)
print(response)
```
</TabItem>
<TabItem value="curl" label="cURL">
```bash showLineNumbers
curl http://localhost:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-1234" \
-d '{
"model": "gemini-pro",
"messages": [
{"role": "user", "content": "Who won the world cup?"}
],
"tools": [
{
"enterpriseWebSearch": {}
}
]
}'
```
</TabItem>
</Tabs>
</TabItem>
</Tabs>
#### **Moving from Vertex AI SDK to LiteLLM (GROUNDING)**
If this was your initial VertexAI Grounding code,
```python
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig, Tool, grounding
vertexai.init(project=project_id, location="us-central1")
model = GenerativeModel("gemini-1.5-flash-001")
# Use Google Search for grounding
tool = Tool.from_google_search_retrieval(grounding.GoogleSearchRetrieval(disable_attributon=False))
tool = Tool.from_google_search_retrieval(grounding.GoogleSearchRetrieval())
prompt = "When is the next total solar eclipse in US?"
response = model.generate_content(
@ -428,7 +524,7 @@ print(response)
then, this is what it looks like now
```python
from litellm import completion
# !gcloud auth application-default login - run this to add vertex credentials to your env
@ -852,6 +948,7 @@ litellm.vertex_location = "us-central1 # Your Location
| claude-3-5-sonnet@20240620 | `completion('vertex_ai/claude-3-5-sonnet@20240620', messages)` |
| claude-3-sonnet@20240229 | `completion('vertex_ai/claude-3-sonnet@20240229', messages)` |
| claude-3-haiku@20240307 | `completion('vertex_ai/claude-3-haiku@20240307', messages)` |
| claude-3-7-sonnet@20250219 | `completion('vertex_ai/claude-3-7-sonnet@20250219', messages)` |
### Usage
@ -926,6 +1023,119 @@ curl --location 'http://0.0.0.0:4000/chat/completions' \
</Tabs>
### Usage - `thinking` / `reasoning_content`
<Tabs>
<TabItem value="sdk" label="SDK">
```python
from litellm import completion
resp = completion(
model="vertex_ai/claude-3-7-sonnet-20250219",
messages=[{"role": "user", "content": "What is the capital of France?"}],
thinking={"type": "enabled", "budget_tokens": 1024},
)
```
</TabItem>
<TabItem value="proxy" label="PROXY">
1. Setup config.yaml
```yaml
- model_name: claude-3-7-sonnet-20250219
litellm_params:
model: vertex_ai/claude-3-7-sonnet-20250219
vertex_ai_project: "my-test-project"
vertex_ai_location: "us-west-1"
```
2. Start proxy
```bash
litellm --config /path/to/config.yaml
```
3. Test it!
```bash
curl http://0.0.0.0:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <YOUR-LITELLM-KEY>" \
-d '{
"model": "claude-3-7-sonnet-20250219",
"messages": [{"role": "user", "content": "What is the capital of France?"}],
"thinking": {"type": "enabled", "budget_tokens": 1024}
}'
```
</TabItem>
</Tabs>
**Expected Response**
```python
ModelResponse(
id='chatcmpl-c542d76d-f675-4e87-8e5f-05855f5d0f5e',
created=1740470510,
model='claude-3-7-sonnet-20250219',
object='chat.completion',
system_fingerprint=None,
choices=[
Choices(
finish_reason='stop',
index=0,
message=Message(
content="The capital of France is Paris.",
role='assistant',
tool_calls=None,
function_call=None,
provider_specific_fields={
'citations': None,
'thinking_blocks': [
{
'type': 'thinking',
'thinking': 'The capital of France is Paris. This is a very straightforward factual question.',
'signature': 'EuYBCkQYAiJAy6...'
}
]
}
),
thinking_blocks=[
{
'type': 'thinking',
'thinking': 'The capital of France is Paris. This is a very straightforward factual question.',
'signature': 'EuYBCkQYAiJAy6AGB...'
}
],
reasoning_content='The capital of France is Paris. This is a very straightforward factual question.'
)
],
usage=Usage(
completion_tokens=68,
prompt_tokens=42,
total_tokens=110,
completion_tokens_details=None,
prompt_tokens_details=PromptTokensDetailsWrapper(
audio_tokens=None,
cached_tokens=0,
text_tokens=None,
image_tokens=None
),
cache_creation_input_tokens=0,
cache_read_input_tokens=0
)
)
```
## Llama 3 API
| Model Name | Function Call |
@ -1253,6 +1463,103 @@ curl --location 'http://0.0.0.0:4000/chat/completions' \
</Tabs>
## Gemini Pro
| Model Name | Function Call |
|------------------|--------------------------------------|
| gemini-pro | `completion('gemini-pro', messages)`, `completion('vertex_ai/gemini-pro', messages)` |
## Fine-tuned Models
You can call fine-tuned Vertex AI Gemini models through LiteLLM
| Property | Details |
|----------|---------|
| Provider Route | `vertex_ai/gemini/{MODEL_ID}` |
| Vertex Documentation | [Vertex AI - Fine-tuned Gemini Models](https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini-use-supervised-tuning#test_the_tuned_model_with_a_prompt)|
| Supported Operations | `/chat/completions`, `/completions`, `/embeddings`, `/images` |
To use a model that follows the `/gemini` request/response format, simply set the model parameter as
```python title="Model parameter for calling fine-tuned gemini models"
model="vertex_ai/gemini/<your-finetuned-model>"
```
<Tabs>
<TabItem value="sdk" label="LiteLLM Python SDK">
```python showLineNumbers title="Example"
import litellm
import os
## set ENV variables
os.environ["VERTEXAI_PROJECT"] = "hardy-device-38811"
os.environ["VERTEXAI_LOCATION"] = "us-central1"
response = litellm.completion(
model="vertex_ai/gemini/<your-finetuned-model>", # e.g. vertex_ai/gemini/4965075652664360960
messages=[{ "content": "Hello, how are you?","role": "user"}],
)
```
</TabItem>
<TabItem value="proxy" label="LiteLLM Proxy">
1. Add Vertex Credentials to your env
```bash title="Authenticate to Vertex AI"
!gcloud auth application-default login
```
2. Setup config.yaml
```yaml showLineNumbers title="Add to litellm config"
- model_name: finetuned-gemini
litellm_params:
model: vertex_ai/gemini/<ENDPOINT_ID>
vertex_project: <PROJECT_ID>
vertex_location: <LOCATION>
```
3. Test it!
<Tabs>
<TabItem value="openai" label="OpenAI Python SDK">
```python showLineNumbers title="Example request"
from openai import OpenAI
client = OpenAI(
api_key="your-litellm-key",
base_url="http://0.0.0.0:4000"
)
response = client.chat.completions.create(
model="finetuned-gemini",
messages=[
{"role": "user", "content": "hi"}
]
)
print(response)
```
</TabItem>
<TabItem value="curl" label="curl">
```bash showLineNumbers title="Example request"
curl --location 'https://0.0.0.0:4000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: <LITELLM_KEY>' \
--data '{"model": "finetuned-gemini" ,"messages":[{"role": "user", "content":[{"type": "text", "text": "hi"}]}]}'
```
</TabItem>
</Tabs>
</TabItem>
</Tabs>
## Model Garden
:::tip
@ -1363,67 +1670,6 @@ response = completion(
</Tabs>
## Gemini Pro
| Model Name | Function Call |
|------------------|--------------------------------------|
| gemini-pro | `completion('gemini-pro', messages)`, `completion('vertex_ai/gemini-pro', messages)` |
## Fine-tuned Models
Fine tuned models on vertex have a numerical model/endpoint id.
<Tabs>
<TabItem value="sdk" label="SDK">
```python
from litellm import completion
import os
## set ENV variables
os.environ["VERTEXAI_PROJECT"] = "hardy-device-38811"
os.environ["VERTEXAI_LOCATION"] = "us-central1"
response = completion(
model="vertex_ai/<your-finetuned-model>", # e.g. vertex_ai/4965075652664360960
messages=[{ "content": "Hello, how are you?","role": "user"}],
base_model="vertex_ai/gemini-1.5-pro" # the base model - used for routing
)
```
</TabItem>
<TabItem value="proxy" label="PROXY">
1. Add Vertex Credentials to your env
```bash
!gcloud auth application-default login
```
2. Setup config.yaml
```yaml
- model_name: finetuned-gemini
litellm_params:
model: vertex_ai/<ENDPOINT_ID>
vertex_project: <PROJECT_ID>
vertex_location: <LOCATION>
model_info:
base_model: vertex_ai/gemini-1.5-pro # IMPORTANT
```
3. Test it!
```bash
curl --location 'https://0.0.0.0:4000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: <LITELLM_KEY>' \
--data '{"model": "finetuned-gemini" ,"messages":[{"role": "user", "content":[{"type": "text", "text": "hi"}]}]}'
```
</TabItem>
</Tabs>
## Gemini Pro Vision
| Model Name | Function Call |
@ -1568,15 +1814,25 @@ assert isinstance(
```
## Usage - PDF / Videos / etc. Files
## Usage - PDF / Videos / Audio etc. Files
Pass any file supported by Vertex AI, through LiteLLM.
LiteLLM supports the following file types, passed as a URL / Cloud Storage URI or base64 encoded data.
Using the `file` message type for Vertex AI is live from v1.65.1+.
```
Files with Cloud Storage URIs - gs://cloud-samples-data/generative-ai/image/boats.jpeg
Files with direct links - https://storage.googleapis.com/github-repo/img/gemini/intro/landmark3.jpg
Videos with Cloud Storage URIs - https://storage.googleapis.com/github-repo/img/gemini/multimodality_usecases_overview/pixel8.mp4
Base64 Encoded Local Files
```
<Tabs>
<TabItem value="sdk" label="SDK">
### **Using `gs://`**
### **Using `gs://` or any URL**
```python
from litellm import completion
@ -1588,8 +1844,11 @@ response = completion(
"content": [
{"type": "text", "text": "You are a very professional document summarization specialist. Please summarize the given document."},
{
"type": "image_url",
"image_url": "gs://cloud-samples-data/generative-ai/pdf/2403.05530.pdf", # 👈 PDF
"type": "file",
"file": {
"file_id": "gs://cloud-samples-data/generative-ai/pdf/2403.05530.pdf",
"format": "application/pdf" # OPTIONAL - specify mime-type
}
},
],
}
@ -1623,8 +1882,16 @@ response = completion(
"content": [
{"type": "text", "text": "You are a very professional document summarization specialist. Please summarize the given document."},
{
"type": "image_url",
"image_url": f"data:application/pdf;base64,{encoded_file}", # 👈 PDF
"type": "file",
"file": {
"file_data": f"data:application/pdf;base64,{encoded_file}", # 👈 PDF
}
},
            {
                "type": "audio_input",
                "audio_input": f"data:audio/mp3;base64,{encoded_file}", # 👈 AUDIO file ('file' message type works too)
            },
],
}
@ -1670,8 +1937,11 @@ curl http://0.0.0.0:4000/v1/chat/completions \
"text": "You are a very professional document summarization specialist. Please summarize the given document"
},
{
"type": "image_url",
"image_url": "gs://cloud-samples-data/generative-ai/pdf/2403.05530.pdf" # 👈 PDF
"type": "file",
"file": {
"file_id": "gs://cloud-samples-data/generative-ai/pdf/2403.05530.pdf",
"format": "application/pdf" # OPTIONAL
}
}
}
]
@ -1698,11 +1968,18 @@ curl http://0.0.0.0:4000/v1/chat/completions \
"text": "You are a very professional document summarization specialist. Please summarize the given document"
},
{
"type": "image_url",
"image_url": "data:application/pdf;base64,{encoded_file}" # 👈 PDF
}
}
]
"type": "file",
"file": {
"file_data": f"data:application/pdf;base64,{encoded_file}", # 👈 PDF
},
},
            {
                "type": "audio_input",
                "audio_input": "data:audio/mp3;base64,{encoded_file}" # 👈 AUDIO file ('file' message type works too)
            },
]
}
],
"max_tokens": 300
@ -1712,6 +1989,7 @@ curl http://0.0.0.0:4000/v1/chat/completions \
</TabItem>
</Tabs>
## Chat Models
| Model Name | Function Call |
|------------------|--------------------------------------|
@ -1920,7 +2198,12 @@ print(response)
## **Multi-Modal Embeddings**
Usage
Known Limitations:
- Only supports 1 image / video per request
- Only supports GCS or base64 encoded images / videos
### Usage
<Tabs>
<TabItem value="sdk" label="SDK">
@ -2136,6 +2419,115 @@ print(f"Text Embedding: {embeddings.text_embedding}")
</Tabs>
### Text + Image + Video Embeddings
<Tabs>
<TabItem value="sdk" label="SDK">
Text + Image
```python
response = await litellm.aembedding(
model="vertex_ai/multimodalembedding@001",
input=["hey", "gs://cloud-samples-data/vertex-ai/llm/prompts/landmark1.png"] # will be sent as a gcs image
)
```
Text + Video
```python
response = await litellm.aembedding(
model="vertex_ai/multimodalembedding@001",
input=["hey", "gs://my-bucket/embeddings/supermarket-video.mp4"] # will be sent as a gcs image
)
```
Image + Video
```python
response = await litellm.aembedding(
model="vertex_ai/multimodalembedding@001",
input=["gs://cloud-samples-data/vertex-ai/llm/prompts/landmark1.png", "gs://my-bucket/embeddings/supermarket-video.mp4"] # will be sent as a gcs image
)
```
</TabItem>
<TabItem value="proxy" label="LiteLLM PROXY (Unified Endpoint)">
1. Add model to config.yaml
```yaml
model_list:
- model_name: multimodalembedding@001
litellm_params:
model: vertex_ai/multimodalembedding@001
vertex_project: "adroit-crow-413218"
vertex_location: "us-central1"
vertex_credentials: adroit-crow-413218-a956eef1a2a8.json
litellm_settings:
drop_params: True
```
2. Start Proxy
```
$ litellm --config /path/to/config.yaml
```
3. Make Request use OpenAI Python SDK, Langchain Python SDK
Text + Image
```python
import openai
client = openai.OpenAI(api_key="sk-1234", base_url="http://0.0.0.0:4000")
# # request sent to model set on litellm proxy, `litellm --model`
response = client.embeddings.create(
model="multimodalembedding@001",
input = ["hey", "gs://cloud-samples-data/vertex-ai/llm/prompts/landmark1.png"],
)
print(response)
```
Text + Video
```python
import openai
client = openai.OpenAI(api_key="sk-1234", base_url="http://0.0.0.0:4000")
# # request sent to model set on litellm proxy, `litellm --model`
response = client.embeddings.create(
model="multimodalembedding@001",
input = ["hey", "gs://my-bucket/embeddings/supermarket-video.mp4"],
)
print(response)
```
Image + Video
```python
import openai
client = openai.OpenAI(api_key="sk-1234", base_url="http://0.0.0.0:4000")
# # request sent to model set on litellm proxy, `litellm --model`
response = client.embeddings.create(
model="multimodalembedding@001",
input = ["gs://cloud-samples-data/vertex-ai/llm/prompts/landmark1.png", "gs://my-bucket/embeddings/supermarket-video.mp4"],
)
print(response)
```
</TabItem>
</Tabs>
## **Image Generation Models**
Usage

View file

@ -157,6 +157,98 @@ curl -L -X POST 'http://0.0.0.0:4000/embeddings' \
</TabItem>
</Tabs>
## Send Video URL to VLLM
Example Implementation from VLLM [here](https://github.com/vllm-project/vllm/pull/10020)
There are two ways to send a video url to VLLM:
1. Pass the video url directly
```
{"type": "video_url", "video_url": {"url": video_url}},
```
2. Pass the video data as base64
```
{"type": "video_url", "video_url": {"url": f"data:video/mp4;base64,{video_data_base64}"}}
```
<Tabs>
<TabItem value="sdk" label="SDK">
```python
from litellm import completion
response = completion(
model="hosted_vllm/qwen", # pass the vllm model name
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "Summarize the following video"
},
{
"type": "video_url",
"video_url": {
"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
}
}
]
}
],
api_base="https://hosted-vllm-api.co")
print(response)
```
</TabItem>
<TabItem value="proxy" label="PROXY">
1. Setup config.yaml
```yaml
model_list:
- model_name: my-model
litellm_params:
model: hosted_vllm/qwen # add hosted_vllm/ prefix to route as OpenAI provider
api_base: https://hosted-vllm-api.co # add api base for OpenAI compatible provider
```
2. Start the proxy
```bash
$ litellm --config /path/to/config.yaml
# RUNNING on http://0.0.0.0:4000
```
3. Test it!
```bash
curl -X POST http://0.0.0.0:4000/chat/completions \
-H "Authorization: Bearer sk-1234" \
-H "Content-Type: application/json" \
-d '{
"model": "my-model",
"messages": [
{"role": "user", "content":
[
{"type": "text", "text": "Summarize the following video"},
{"type": "video_url", "video_url": {"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"}}
]
}
]
}'
```
</TabItem>
</Tabs>
## (Deprecated) for `vllm pip package`
### Using - `litellm.completion`

View file

@ -18,13 +18,14 @@ os.environ['XAI_API_KEY']
```
## Sample Usage
```python
```python showLineNumbers title="LiteLLM python sdk usage - Non-streaming"
from litellm import completion
import os
os.environ['XAI_API_KEY'] = ""
response = completion(
model="xai/grok-2-latest",
model="xai/grok-3-mini-beta",
messages=[
{
"role": "user",
@ -45,13 +46,14 @@ print(response)
```
## Sample Usage - Streaming
```python
```python showLineNumbers title="LiteLLM python sdk usage - Streaming"
from litellm import completion
import os
os.environ['XAI_API_KEY'] = ""
response = completion(
model="xai/grok-2-latest",
model="xai/grok-3-mini-beta",
messages=[
{
"role": "user",
@ -75,14 +77,15 @@ for chunk in response:
```
## Sample Usage - Vision
```python
```python showLineNumbers title="LiteLLM python sdk usage - Vision"
import os
from litellm import completion
os.environ["XAI_API_KEY"] = "your-api-key"
response = completion(
model="xai/grok-2-latest",
model="xai/grok-2-vision-latest",
messages=[
{
"role": "user",
@ -110,7 +113,7 @@ Here's how to call a XAI model with the LiteLLM Proxy Server
1. Modify the config.yaml
```yaml
```yaml showLineNumbers
model_list:
- model_name: my-model
litellm_params:
@ -131,7 +134,7 @@ Here's how to call a XAI model with the LiteLLM Proxy Server
<TabItem value="openai" label="OpenAI Python v1.0.0+">
```python
```python showLineNumbers
import openai
client = openai.OpenAI(
api_key="sk-1234", # pass litellm proxy key, if you're using virtual keys
@ -173,3 +176,81 @@ Here's how to call a XAI model with the LiteLLM Proxy Server
</Tabs>
## Reasoning Usage
LiteLLM supports reasoning usage for xAI models.
<Tabs>
<TabItem value="python" label="LiteLLM Python SDK">
```python showLineNumbers title="reasoning with xai/grok-3-mini-beta"
import litellm
response = litellm.completion(
model="xai/grok-3-mini-beta",
messages=[{"role": "user", "content": "What is 101*3?"}],
reasoning_effort="low",
)
print("Reasoning Content:")
print(response.choices[0].message.reasoning_content)
print("\nFinal Response:")
print(response.choices[0].message.content)
print("\nNumber of completion tokens:")
print(response.usage.completion_tokens)
print("\nNumber of reasoning tokens:")
print(response.usage.completion_tokens_details.reasoning_tokens)
```
</TabItem>
<TabItem value="curl" label="LiteLLM Proxy - OpenAI SDK Usage">
```python showLineNumbers title="reasoning with xai/grok-3-mini-beta"
import openai
client = openai.OpenAI(
api_key="sk-1234", # pass litellm proxy key, if you're using virtual keys
base_url="http://0.0.0.0:4000" # litellm-proxy-base url
)
response = client.chat.completions.create(
model="xai/grok-3-mini-beta",
messages=[{"role": "user", "content": "What is 101*3?"}],
reasoning_effort="low",
)
print("Reasoning Content:")
print(response.choices[0].message.reasoning_content)
print("\nFinal Response:")
print(response.choices[0].message.content)
print("\nNumber of completion tokens:")
print(response.usage.completion_tokens)
print("\nNumber of reasoning tokens:")
print(response.usage.completion_tokens_details.reasoning_tokens)
```
</TabItem>
</Tabs>
**Example Response:**
```shell
Reasoning Content:
Let me calculate 101 multiplied by 3:
101 * 3 = 303.
I can double-check that: 100 * 3 is 300, and 1 * 3 is 3, so 300 + 3 = 303. Yes, that's correct.
Final Response:
The result of 101 multiplied by 3 is 303.
Number of completion tokens:
14
Number of reasoning tokens:
310
```

View file

@ -10,17 +10,13 @@ Role-based access control (RBAC) is based on Organizations, Teams and Internal U
## Roles
**Admin Roles**
- `proxy_admin`: admin over the platform
- `proxy_admin_viewer`: can login, view all keys, view all spend. **Cannot** create keys/delete keys/add new users
**Organization Roles**
- `org_admin`: admin over the organization. Can create teams and users within their organization
**Internal User Roles**
- `internal_user`: can login, view/create/delete their own keys, view their spend. **Cannot** add new users.
- `internal_user_viewer`: can login, view their own keys, view their own spend. **Cannot** create/delete keys, add new users.
| Role Type | Role Name | Permissions |
|-----------|-----------|-------------|
| **Admin** | `proxy_admin` | Admin over the platform |
| | `proxy_admin_viewer` | Can login, view all keys, view all spend. **Cannot** create keys/delete keys/add new users |
| **Organization** | `org_admin` | Admin over the organization. Can create teams and users within their organization |
| **Internal User** | `internal_user` | Can login, view/create/delete their own keys, view their spend. **Cannot** add new users |
| | `internal_user_viewer` | Can login, view their own keys, view their own spend. **Cannot** create/delete keys, add new users |
## Onboarding Organizations

View file

@ -147,11 +147,16 @@ Some SSO providers require a specific redirect url for login and logout. You can
- Login: `<your-proxy-base-url>/sso/key/generate`
- Logout: `<your-proxy-base-url>`
Here's the env var to set the logout url on the proxy
```bash
PROXY_LOGOUT_URL="https://www.google.com"
```
#### Step 3. Set `PROXY_BASE_URL` in your .env
Set this in your .env (so the proxy can set the correct redirect url)
```shell
PROXY_BASE_URL=https://litellm-api.up.railway.app/
PROXY_BASE_URL=https://litellm-api.up.railway.app
```
#### Step 4. Test flow

View file

@ -36,7 +36,7 @@ import TabItem from '@theme/TabItem';
- Virtual Key Rate Limit
- User Rate Limit
- Team Limit
- The `_PROXY_track_cost_callback` updates spend / usage in the LiteLLM database. [Here is everything tracked in the DB per request](https://github.com/BerriAI/litellm/blob/ba41a72f92a9abf1d659a87ec880e8e319f87481/schema.prisma#L172)
- The `_ProxyDBLogger` updates spend / usage in the LiteLLM database. [Here is everything tracked in the DB per request](https://github.com/BerriAI/litellm/blob/ba41a72f92a9abf1d659a87ec880e8e319f87481/schema.prisma#L172)
## Frequently Asked Questions

View file

@ -2,7 +2,6 @@ import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Caching
Cache LLM Responses
:::note
@ -10,14 +9,19 @@ For OpenAI/Anthropic Prompt Caching, go [here](../completion/prompt_caching.md)
:::
LiteLLM supports:
Cache LLM Responses. LiteLLM's caching system stores and reuses LLM responses to save costs and reduce latency. When you make the same request twice, the cached response is returned instead of calling the LLM API again.
### Supported Caches
- In Memory Cache
- Redis Cache
- Qdrant Semantic Cache
- Redis Semantic Cache
- s3 Bucket Cache
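As a minimal sketch (assuming `type: local` selects the in-memory cache), caching can be enabled from `config.yaml` without any external infrastructure:
```yaml
litellm_settings:
  cache: True
  cache_params:
    type: local   # in-memory cache; swap for the redis / s3 / semantic caches below
```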
## Quick Start - Redis, s3 Cache, Semantic Cache
## Quick Start
<Tabs>
<TabItem value="redis" label="redis cache">
@ -369,9 +373,9 @@ $ litellm --config /path/to/config.yaml
</Tabs>
## Usage
## Using Caching - /chat/completions
### Basic
<Tabs>
<TabItem value="chat_completions" label="/chat/completions">
@ -416,6 +420,239 @@ curl --location 'http://0.0.0.0:4000/embeddings' \
</TabItem>
</Tabs>
### Dynamic Cache Controls
| Parameter | Type | Description |
|-----------|------|-------------|
| `ttl` | *Optional(int)* | Will cache the response for the user-defined amount of time (in seconds) |
| `s-maxage` | *Optional(int)* | Will only accept cached responses that are within user-defined range (in seconds) |
| `no-cache` | *Optional(bool)* | Will not return a cached response; the actual endpoint is always called |
| `no-store` | *Optional(bool)* | Will not store the response in the cache |
| `namespace` | *Optional(str)* | Will cache the response under a user-defined namespace |
Each cache parameter can be controlled on a per-request basis. Here are examples for each parameter:
### `ttl`
Set how long (in seconds) to cache a response.
<Tabs>
<TabItem value="openai" label="OpenAI Python SDK">
```python
from openai import OpenAI
client = OpenAI(
api_key="your-api-key",
base_url="http://0.0.0.0:4000"
)
chat_completion = client.chat.completions.create(
messages=[{"role": "user", "content": "Hello"}],
model="gpt-3.5-turbo",
extra_body={
"cache": {
"ttl": 300 # Cache response for 5 minutes
}
}
)
```
</TabItem>
<TabItem value="curl" label="curl">
```shell
curl http://localhost:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-1234" \
-d '{
"model": "gpt-3.5-turbo",
"cache": {"ttl": 300},
"messages": [
{"role": "user", "content": "Hello"}
]
}'
```
</TabItem>
</Tabs>
### `s-maxage`
Only accept cached responses that are within the specified age (in seconds).
<Tabs>
<TabItem value="openai" label="OpenAI Python SDK">
```python
from openai import OpenAI
client = OpenAI(
api_key="your-api-key",
base_url="http://0.0.0.0:4000"
)
chat_completion = client.chat.completions.create(
messages=[{"role": "user", "content": "Hello"}],
model="gpt-3.5-turbo",
extra_body={
"cache": {
"s-maxage": 600 # Only use cache if less than 10 minutes old
}
}
)
```
</TabItem>
<TabItem value="curl" label="curl">
```shell
curl http://localhost:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-1234" \
-d '{
"model": "gpt-3.5-turbo",
"cache": {"s-maxage": 600},
"messages": [
{"role": "user", "content": "Hello"}
]
}'
```
</TabItem>
</Tabs>
### `no-cache`
Force a fresh response, bypassing the cache.
<Tabs>
<TabItem value="openai" label="OpenAI Python SDK">
```python
from openai import OpenAI
client = OpenAI(
api_key="your-api-key",
base_url="http://0.0.0.0:4000"
)
chat_completion = client.chat.completions.create(
messages=[{"role": "user", "content": "Hello"}],
model="gpt-3.5-turbo",
extra_body={
"cache": {
"no-cache": True # Skip cache check, get fresh response
}
}
)
```
</TabItem>
<TabItem value="curl" label="curl">
```shell
curl http://localhost:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-1234" \
-d '{
"model": "gpt-3.5-turbo",
"cache": {"no-cache": true},
"messages": [
{"role": "user", "content": "Hello"}
]
}'
```
</TabItem>
</Tabs>
### `no-store`
Will not store the response in cache.
<Tabs>
<TabItem value="openai" label="OpenAI Python SDK">
```python
from openai import OpenAI
client = OpenAI(
api_key="your-api-key",
base_url="http://0.0.0.0:4000"
)
chat_completion = client.chat.completions.create(
messages=[{"role": "user", "content": "Hello"}],
model="gpt-3.5-turbo",
extra_body={
"cache": {
"no-store": True # Don't cache this response
}
}
)
```
</TabItem>
<TabItem value="curl" label="curl">
```shell
curl http://localhost:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-1234" \
-d '{
"model": "gpt-3.5-turbo",
"cache": {"no-store": true},
"messages": [
{"role": "user", "content": "Hello"}
]
}'
```
</TabItem>
</Tabs>
### `namespace`
Store the response under a specific cache namespace.
<Tabs>
<TabItem value="openai" label="OpenAI Python SDK">
```python
from openai import OpenAI
client = OpenAI(
api_key="your-api-key",
base_url="http://0.0.0.0:4000"
)
chat_completion = client.chat.completions.create(
messages=[{"role": "user", "content": "Hello"}],
model="gpt-3.5-turbo",
extra_body={
"cache": {
"namespace": "my-custom-namespace" # Store in custom namespace
}
}
)
```
</TabItem>
<TabItem value="curl" label="curl">
```shell
curl http://localhost:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-1234" \
-d '{
"model": "gpt-3.5-turbo",
"cache": {"namespace": "my-custom-namespace"},
"messages": [
{"role": "user", "content": "Hello"}
]
}'
```
</TabItem>
</Tabs>
## Set cache for proxy, but not on the actual llm api call
Use this if you just want to enable features like rate limiting and load balancing across multiple instances.
@ -501,253 +738,6 @@ litellm_settings:
# /chat/completions, /completions, /embeddings, /audio/transcriptions
```
### **Turn on / off caching per request**
The proxy support 4 cache-controls:
- `ttl`: *Optional(int)* - Will cache the response for the user-defined amount of time (in seconds).
- `s-maxage`: *Optional(int)* Will only accept cached responses that are within user-defined range (in seconds).
- `no-cache`: *Optional(bool)* Will not return a cached response, but instead call the actual endpoint.
- `no-store`: *Optional(bool)* Will not cache the response.
[Let us know if you need more](https://github.com/BerriAI/litellm/issues/1218)
**Turn off caching**
Set `no-cache=True`, this will not return a cached response
<Tabs>
<TabItem value="openai" label="OpenAI Python SDK">
```python
import os
from openai import OpenAI
client = OpenAI(
# This is the default and can be omitted
api_key=os.environ.get("OPENAI_API_KEY"),
base_url="http://0.0.0.0:4000"
)
chat_completion = client.chat.completions.create(
messages=[
{
"role": "user",
"content": "Say this is a test",
}
],
model="gpt-3.5-turbo",
extra_body = { # OpenAI python accepts extra args in extra_body
        "cache": {
"no-cache": True # will not return a cached response
}
}
)
```
</TabItem>
<TabItem value="curl" label="curl">
```shell
curl http://localhost:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-1234" \
-d '{
"model": "gpt-3.5-turbo",
"cache": {"no-cache": True},
"messages": [
{"role": "user", "content": "Say this is a test"}
]
}'
```
</TabItem>
</Tabs>
**Turn on caching**
By default cache is always on
<Tabs>
<TabItem value="openai" label="OpenAI Python SDK">
```python
import os
from openai import OpenAI
client = OpenAI(
# This is the default and can be omitted
api_key=os.environ.get("OPENAI_API_KEY"),
base_url="http://0.0.0.0:4000"
)
chat_completion = client.chat.completions.create(
messages=[
{
"role": "user",
"content": "Say this is a test",
}
],
model="gpt-3.5-turbo"
)
```
</TabItem>
<TabItem value="curl on" label="curl">
```shell
curl http://localhost:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-1234" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
{"role": "user", "content": "Say this is a test"}
]
}'
```
</TabItem>
</Tabs>
**Set `ttl`**
Set `ttl=600` to cache the response for 10 minutes (600 seconds)
<Tabs>
<TabItem value="openai" label="OpenAI Python SDK">
```python
import os
from openai import OpenAI
client = OpenAI(
# This is the default and can be omitted
api_key=os.environ.get("OPENAI_API_KEY"),
base_url="http://0.0.0.0:4000"
)
chat_completion = client.chat.completions.create(
messages=[
{
"role": "user",
"content": "Say this is a test",
}
],
model="gpt-3.5-turbo",
extra_body = { # OpenAI python accepts extra args in extra_body
        "cache": {
"ttl": 600 # caches response for 10 minutes
}
}
)
```
</TabItem>
<TabItem value="curl on" label="curl">
```shell
curl http://localhost:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-1234" \
-d '{
"model": "gpt-3.5-turbo",
"cache": {"ttl": 600},
"messages": [
{"role": "user", "content": "Say this is a test"}
]
}'
```
</TabItem>
</Tabs>
**Set `s-maxage`**
Set `s-maxage=600` to only accept responses cached within the last 10 minutes
<Tabs>
<TabItem value="openai" label="OpenAI Python SDK">
```python
import os
from openai import OpenAI
client = OpenAI(
# This is the default and can be omitted
api_key=os.environ.get("OPENAI_API_KEY"),
base_url="http://0.0.0.0:4000"
)
chat_completion = client.chat.completions.create(
messages=[
{
"role": "user",
"content": "Say this is a test",
}
],
model="gpt-3.5-turbo",
extra_body = { # OpenAI python accepts extra args in extra_body
"cache": {
"s-maxage": 600 # only get responses cached within last 10 minutes
}
}
)
```
</TabItem>
<TabItem value="curl on" label="curl">
```shell
curl http://localhost:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-1234" \
-d '{
"model": "gpt-3.5-turbo",
"cache": {"s-maxage": 600},
"messages": [
{"role": "user", "content": "Say this is a test"}
]
}'
```
</TabItem>
</Tabs>
### Turn on / off caching per Key
1. Add cache params when creating a key [full list](#turn-on--off-caching-per-key)
```bash
curl -X POST 'http://0.0.0.0:4000/key/generate' \
-H 'Authorization: Bearer sk-1234' \
-H 'Content-Type: application/json' \
-d '{
"user_id": "222",
"metadata": {
"cache": {
"no-cache": true
}
}
}'
```
2. Test it!
```bash
curl -X POST 'http://localhost:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <YOUR_NEW_KEY>' \
-d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "bom dia"}]}'
```
### Deleting Cache Keys - `/cache/delete`
To delete a cache key, send a request to `/cache/delete` with the `keys` you want to delete.
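For example, a minimal sketch using Python's `requests` (this assumes the endpoint accepts a JSON body with a `keys` list and the master key for auth; check the payload against your proxy version):
```python
import requests

# hypothetical cache key - copy the real key from your cache / response headers
keys_to_delete = ["<your-cache-key>"]

response = requests.post(
    "http://0.0.0.0:4000/cache/delete",
    headers={"Authorization": "Bearer sk-1234"},
    json={"keys": keys_to_delete},
)
print(response.status_code, response.text)
```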
View file
@ -70,6 +70,21 @@ class MyCustomHandler(CustomLogger): # https://docs.litellm.ai/docs/observabilit
response: str,
):
pass
async def async_post_call_streaming_iterator_hook(
self,
user_api_key_dict: UserAPIKeyAuth,
response: Any,
request_data: dict,
) -> AsyncGenerator[ModelResponseStream, None]:
"""
Passes the entire stream to the guardrail
This is useful for plugins that need to see the entire stream.
"""
async for item in response:
yield item
proxy_handler_instance = MyCustomHandler()
```
View file
@ -147,6 +147,7 @@ general_settings:
|------|------|-------------|
| completion_model | string | The default model to use for completions when `model` is not specified in the request |
| disable_spend_logs | boolean | If true, turns off writing each transaction to the database |
| disable_spend_updates | boolean | If true, turns off all spend updates to the DB. Including key/user/team spend updates. |
| disable_master_key_return | boolean | If true, turns off returning master key on UI. (checked on '/user/info' endpoint) |
| disable_retry_on_max_parallel_request_limit_error | boolean | If true, turns off retries when max parallel request limit is reached |
| disable_reset_budget | boolean | If true, turns off reset budget scheduled task |
@ -159,7 +160,7 @@ general_settings:
| database_url | string | The URL for the database connection [Set up Virtual Keys](virtual_keys) |
| database_connection_pool_limit | integer | The limit for database connection pool [Setting DB Connection Pool limit](#configure-db-pool-limits--connection-timeouts) |
| database_connection_timeout | integer | The timeout for database connections in seconds [Setting DB Connection Pool limit, timeout](#configure-db-pool-limits--connection-timeouts) |
| allow_requests_on_db_unavailable | boolean | If true, allows requests to succeed even if DB is unreachable. **Only use this if running LiteLLM in your VPC** This will allow requests to work even when LiteLLM cannot connect to the DB to verify a Virtual Key |
| allow_requests_on_db_unavailable | boolean | If true, allows requests to succeed even if DB is unreachable. **Only use this if running LiteLLM in your VPC** This will allow requests to work even when LiteLLM cannot connect to the DB to verify a Virtual Key [Doc on graceful db unavailability](prod#5-if-running-litellm-on-vpc-gracefully-handle-db-unavailability) |
| custom_auth | string | Write your own custom authentication logic [Doc Custom Auth](virtual_keys#custom-auth) |
| max_parallel_requests | integer | The max parallel requests allowed per deployment |
| global_max_parallel_requests | integer | The max parallel requests allowed on the proxy overall |
@ -177,7 +178,7 @@ general_settings:
| use_x_forwarded_for | str | If true, uses the X-Forwarded-For header to get the client IP address |
| service_account_settings | List[Dict[str, Any]] | Set `service_account_settings` if you want to create settings that only apply to service account keys [Doc on service accounts](./service_accounts.md) |
| image_generation_model | str | The default model to use for image generation - ignores model set in request |
| store_model_in_db | boolean | If true, allows `/model/new` endpoint to store model information in db. Endpoint disabled by default. [Doc on `/model/new` endpoint](./model_management.md#create-a-new-model) |
| store_model_in_db | boolean | If true, enables storing model + credential information in the DB. |
| store_prompts_in_spend_logs | boolean | If true, allows prompts and responses to be stored in the spend logs table. |
| max_request_size_mb | int | The maximum size for requests in MB. Requests above this size will be rejected. |
| max_response_size_mb | int | The maximum size for responses in MB. LLM Responses above this size will not be sent. |
@ -405,6 +406,7 @@ router_settings:
| HELICONE_API_KEY | API key for Helicone service
| HOSTNAME | Hostname for the server, this will be [emitted to `datadog` logs](https://docs.litellm.ai/docs/proxy/logging#datadog)
| HUGGINGFACE_API_BASE | Base URL for Hugging Face API
| HUGGINGFACE_API_KEY | API key for Hugging Face API
| IAM_TOKEN_DB_AUTH | IAM token for database authentication
| JSON_LOGS | Enable JSON formatted logging
| JWT_AUDIENCE | Expected audience for JWT tokens
@ -447,6 +449,7 @@ router_settings:
| MICROSOFT_CLIENT_ID | Client ID for Microsoft services
| MICROSOFT_CLIENT_SECRET | Client secret for Microsoft services
| MICROSOFT_TENANT | Tenant ID for Microsoft Azure
| MICROSOFT_SERVICE_PRINCIPAL_ID | Service Principal ID for Microsoft Enterprise Application. (This is an advanced feature if you want litellm to auto-assign members to Litellm Teams based on their Microsoft Entra ID Groups)
| NO_DOCS | Flag to disable documentation generation
| NO_PROXY | List of addresses to bypass proxy
| OAUTH_TOKEN_INFO_ENDPOINT | Endpoint for OAuth token info retrieval
@ -466,6 +469,9 @@ router_settings:
| OTEL_SERVICE_NAME | Service name identifier for OpenTelemetry
| OTEL_TRACER_NAME | Tracer name for OpenTelemetry tracing
| PAGERDUTY_API_KEY | API key for PagerDuty Alerting
| PHOENIX_API_KEY | API key for Arize Phoenix
| PHOENIX_COLLECTOR_ENDPOINT | API endpoint for Arize Phoenix
| PHOENIX_COLLECTOR_HTTP_ENDPOINT | API http endpoint for Arize Phoenix
| POD_NAME | Pod name for the server, this will be [emitted to `datadog` logs](https://docs.litellm.ai/docs/proxy/logging#datadog) as `POD_NAME`
| PREDIBASE_API_BASE | Base URL for Predibase API
| PRESIDIO_ANALYZER_API_BASE | Base URL for Presidio Analyzer service
@ -475,7 +481,7 @@ router_settings:
| PROXY_ADMIN_ID | Admin identifier for proxy server
| PROXY_BASE_URL | Base URL for proxy service
| PROXY_LOGOUT_URL | URL for logging out of the proxy service
| PROXY_MASTER_KEY | Master key for proxy authentication
| LITELLM_MASTER_KEY | Master key for proxy authentication
| QDRANT_API_BASE | Base URL for Qdrant API
| QDRANT_API_KEY | API key for Qdrant service
| QDRANT_URL | Connection URL for Qdrant database
@ -496,9 +502,11 @@ router_settings:
| SMTP_USERNAME | Username for SMTP authentication (do not set if SMTP does not require auth)
| SPEND_LOGS_URL | URL for retrieving spend logs
| SSL_CERTIFICATE | Path to the SSL certificate file
| SSL_SECURITY_LEVEL | [BETA] Security level for SSL/TLS connections. E.g. `DEFAULT@SECLEVEL=1`
| SSL_VERIFY | Flag to enable or disable SSL certificate verification
| SUPABASE_KEY | API key for Supabase service
| SUPABASE_URL | Base URL for Supabase instance
| STORE_MODEL_IN_DB | If true, enables storing model + credential information in the DB.
| TEST_EMAIL_ADDRESS | Email address used for testing purposes
| UI_LOGO_PATH | Path to the logo image used in the UI
| UI_PASSWORD | Password for accessing the UI
@ -509,5 +517,5 @@ router_settings:
| UPSTREAM_LANGFUSE_RELEASE | Release version identifier for upstream Langfuse
| UPSTREAM_LANGFUSE_SECRET_KEY | Secret key for upstream Langfuse authentication
| USE_AWS_KMS | Flag to enable AWS Key Management Service for encryption
| USE_PRISMA_MIGRATE | Flag to use prisma migrate instead of prisma db push. Recommended for production environments.
| WEBHOOK_URL | URL for receiving webhooks from external services
View file
@ -448,6 +448,34 @@ model_list:
s/o to [@David Manouchehri](https://www.linkedin.com/in/davidmanouchehri/) for helping with this.
### Centralized Credential Management
Define credentials once and reuse them across multiple models. This helps with:
- Secret rotation
- Reducing config duplication
```yaml
model_list:
- model_name: gpt-4o
litellm_params:
model: azure/gpt-4o
litellm_credential_name: default_azure_credential # Reference credential below
credential_list:
- credential_name: default_azure_credential
credential_values:
api_key: os.environ/AZURE_API_KEY # Load from environment
api_base: os.environ/AZURE_API_BASE
api_version: "2023-05-15"
credential_info:
description: "Production credentials for EU region"
```
#### Key Parameters
- `credential_name`: Unique identifier for the credential set
- `credential_values`: Key-value pairs of credentials/secrets (supports `os.environ/` syntax)
- `credential_info`: Key-value pairs of user-provided information about the credential. No specific keys are required, but the dictionary must exist.
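For example, two deployments can point at the same credential set, so rotating the secret only touches one place (a sketch extending the config above; the second deployment name is illustrative):
```yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o
      litellm_credential_name: default_azure_credential
  - model_name: gpt-4o-mini
    litellm_params:
      model: azure/gpt-4o-mini
      litellm_credential_name: default_azure_credential  # same credential reused
```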
### Load API Keys from Secret Managers (Azure Vault, etc)
[**Using Secret Managers with LiteLLM Proxy**](../secret)
@ -641,4 +669,4 @@ docker run --name litellm-proxy \
ghcr.io/berriai/litellm-database:main-latest
```
</TabItem>
</Tabs>
</Tabs>
View file
@ -6,6 +6,8 @@ import Image from '@theme/IdealImage';
Track spend for keys, users, and teams across 100+ LLMs.
LiteLLM automatically tracks spend for all known models. See our [model cost map](https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json)
### How to Track Spend with LiteLLM
**Step 1**
@ -35,10 +37,10 @@ response = client.chat.completions.create(
"content": "this is a test request, write a short poem"
}
],
user="palantir",
extra_body={
user="palantir", # OPTIONAL: pass user to track spend by user
extra_body={
"metadata": {
"tags": ["jobID:214590dsff09fds", "taskName:run_page_classification"]
"tags": ["jobID:214590dsff09fds", "taskName:run_page_classification"] # ENTERPRISE: pass tags to track spend by tags
}
}
)
@ -63,9 +65,9 @@ curl --location 'http://0.0.0.0:4000/chat/completions' \
"content": "what llm are you"
}
],
"user": "palantir",
"user": "palantir", # OPTIONAL: pass user to track spend by user
"metadata": {
"tags": ["jobID:214590dsff09fds", "taskName:run_page_classification"]
"tags": ["jobID:214590dsff09fds", "taskName:run_page_classification"] # ENTERPRISE: pass tags to track spend by tags
}
}'
```
@ -90,7 +92,7 @@ chat = ChatOpenAI(
user="palantir",
extra_body={
"metadata": {
"tags": ["jobID:214590dsff09fds", "taskName:run_page_classification"]
"tags": ["jobID:214590dsff09fds", "taskName:run_page_classification"] # ENTERPRISE: pass tags to track spend by tags
}
}
)
@ -150,8 +152,112 @@ Navigate to the Usage Tab on the LiteLLM UI (found on https://your-proxy-endpoin
</TabItem>
</Tabs>
## ✨ (Enterprise) API Endpoints to get Spend
### Getting Spend Reports - To Charge Other Teams, Customers, Users
### Allowing Non-Proxy Admins to access `/spend` endpoints
Use this when you want non-proxy admins to access `/spend` endpoints
:::info
Schedule a [meeting with us to get your Enterprise License](https://calendly.com/d/4mp-gd3-k5k/litellm-1-1-onboarding-chat)
:::
##### Create Key
Create a Key with `permissions={"get_spend_routes": true}`
```shell
curl --location 'http://0.0.0.0:4000/key/generate' \
--header 'Authorization: Bearer sk-1234' \
--header 'Content-Type: application/json' \
--data '{
"permissions": {"get_spend_routes": true}
}'
```
##### Use generated key on `/spend` endpoints
Access spend routes with the newly generated key
```shell
curl -X GET 'http://localhost:4000/global/spend/report?start_date=2024-04-01&end_date=2024-06-30' \
-H 'Authorization: Bearer sk-H16BKvrSNConSsBYLGc_7A'
```
#### Reset Team, API Key Spend - MASTER KEY ONLY
Use `/global/spend/reset` if you want to:
- Reset the Spend for all API Keys, Teams. The `spend` for ALL Teams and Keys in `LiteLLM_TeamTable` and `LiteLLM_VerificationToken` will be set to `spend=0`
- LiteLLM will maintain all the logs in `LiteLLMSpendLogs` for Auditing Purposes
##### Request
Only the `LITELLM_MASTER_KEY` you set can access this route
```shell
curl -X POST \
'http://localhost:4000/global/spend/reset' \
-H 'Authorization: Bearer sk-1234' \
-H 'Content-Type: application/json'
```
##### Expected Responses
```shell
{"message":"Spend for all API Keys and Teams reset successfully","status":"success"}
```
## Daily Spend Breakdown API
Retrieve granular daily usage data for a user (by model, provider, and API key) with a single endpoint.
Example Request:
```shell title="Daily Spend Breakdown API" showLineNumbers
curl -L -X GET 'http://localhost:4000/user/daily/activity?start_date=2025-03-20&end_date=2025-03-27' \
-H 'Authorization: Bearer sk-...'
```
```json title="Daily Spend Breakdown API Response" showLineNumbers
{
"results": [
{
"date": "2025-03-27",
"metrics": {
"spend": 0.0177072,
"prompt_tokens": 111,
"completion_tokens": 1711,
"total_tokens": 1822,
"api_requests": 11
},
"breakdown": {
"models": {
"gpt-4o-mini": {
"spend": 1.095e-05,
"prompt_tokens": 37,
"completion_tokens": 9,
"total_tokens": 46,
"api_requests": 1
}
},
"providers": { "openai": { ... }, "azure_ai": { ... } },
"api_keys": { "3126b6eaf1...": { ... } }
}
}
],
"metadata": {
"total_spend": 0.7274667,
"total_prompt_tokens": 280990,
"total_completion_tokens": 376674,
"total_api_requests": 14
}
}
```
### API Reference
See our [Swagger API](https://litellm-api.up.railway.app/#/Budget%20%26%20Spend%20Tracking/get_user_daily_activity_user_daily_activity_get) for more details on the `/user/daily/activity` endpoint
## ✨ (Enterprise) Generate Spend Reports
Use this to charge other teams, customers, users
Use the `/global/spend/report` endpoint to get spend reports
@ -470,105 +576,6 @@ curl -X GET 'http://localhost:4000/global/spend/report?start_date=2024-04-01&end
</Tabs>
### Allowing Non-Proxy Admins to access `/spend` endpoints
Use this when you want non-proxy admins to access `/spend` endpoints
:::info
Schedule a [meeting with us to get your Enterprise License](https://calendly.com/d/4mp-gd3-k5k/litellm-1-1-onboarding-chat)
:::
##### Create Key
Create a Key with `permissions={"get_spend_routes": true}`
```shell
curl --location 'http://0.0.0.0:4000/key/generate' \
--header 'Authorization: Bearer sk-1234' \
--header 'Content-Type: application/json' \
--data '{
"permissions": {"get_spend_routes": true}
}'
```
##### Use generated key on `/spend` endpoints
Access spend routes with the newly generated key
```shell
curl -X GET 'http://localhost:4000/global/spend/report?start_date=2024-04-01&end_date=2024-06-30' \
-H 'Authorization: Bearer sk-H16BKvrSNConSsBYLGc_7A'
```
#### Reset Team, API Key Spend - MASTER KEY ONLY
Use `/global/spend/reset` if you want to:
- Reset the Spend for all API Keys, Teams. The `spend` for ALL Teams and Keys in `LiteLLM_TeamTable` and `LiteLLM_VerificationToken` will be set to `spend=0`
- LiteLLM will maintain all the logs in `LiteLLMSpendLogs` for Auditing Purposes
##### Request
Only the `LITELLM_MASTER_KEY` you set can access this route
```shell
curl -X POST \
'http://localhost:4000/global/spend/reset' \
-H 'Authorization: Bearer sk-1234' \
-H 'Content-Type: application/json'
```
##### Expected Responses
```shell
{"message":"Spend for all API Keys and Teams reset successfully","status":"success"}
```
## Spend Tracking for Azure OpenAI Models
Set the base model for cost tracking of Azure image generation calls.
#### Image Generation
```yaml
model_list:
- model_name: dall-e-3
litellm_params:
model: azure/dall-e-3-test
api_version: 2023-06-01-preview
api_base: https://openai-gpt-4-test-v-1.openai.azure.com/
api_key: os.environ/AZURE_API_KEY
base_model: dall-e-3 # 👈 set dall-e-3 as base model
model_info:
mode: image_generation
```
#### Chat Completions / Embeddings
**Problem**: Azure returns `gpt-4` in the response when `azure/gpt-4-1106-preview` is used. This leads to inaccurate cost tracking
**Solution** ✅ : Set `base_model` on your config so litellm uses the correct model for calculating azure cost
Get the base model name from [here](https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json)
Example config with `base_model`
```yaml
model_list:
- model_name: azure-gpt-3.5
litellm_params:
model: azure/chatgpt-v-2
api_base: os.environ/AZURE_API_BASE
api_key: os.environ/AZURE_API_KEY
api_version: "2023-07-01-preview"
model_info:
base_model: azure/gpt-4-1106-preview
```
## Custom Input/Output Pricing
👉 Head to [Custom Input/Output Pricing](https://docs.litellm.ai/docs/proxy/custom_pricing) to setup custom pricing or your models
## ✨ Custom Spend Log metadata
@ -587,4 +594,5 @@ Logging specific key,value pairs in spend logs metadata is an enterprise feature
Tracking spend with Custom tags is an enterprise feature. [See here](./enterprise.md#tracking-spend-for-custom-tags)
:::
:::
View file
@ -26,10 +26,12 @@ model_list:
- model_name: sagemaker-completion-model
litellm_params:
model: sagemaker/berri-benchmarking-Llama-2-70b-chat-hf-4
model_info:
input_cost_per_second: 0.000420
- model_name: sagemaker-embedding-model
litellm_params:
model: sagemaker/berri-benchmarking-gpt-j-6b-fp16
model_info:
input_cost_per_second: 0.000420
```
@ -55,11 +57,55 @@ model_list:
api_key: os.environ/AZURE_API_KEY
api_base: os.environ/AZURE_API_BASE
api_version: os.environ/AZURE_API_VERSION
model_info:
input_cost_per_token: 0.000421 # 👈 ONLY to track cost per token
output_cost_per_token: 0.000520 # 👈 ONLY to track cost per token
```
### Debugging
## Override Model Cost Map
You can override [our model cost map](https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json) with your own custom pricing for a mapped model.
Just add a `model_info` key to your model in the config, and override the desired keys.
Example: Override Anthropic's model cost map for the `prod/claude-3-5-sonnet-20241022` model.
```yaml
model_list:
- model_name: "prod/claude-3-5-sonnet-20241022"
litellm_params:
model: "anthropic/claude-3-5-sonnet-20241022"
api_key: os.environ/ANTHROPIC_PROD_API_KEY
model_info:
input_cost_per_token: 0.000006
output_cost_per_token: 0.00003
cache_creation_input_token_cost: 0.0000075
cache_read_input_token_cost: 0.0000006
```
## Set 'base_model' for Cost Tracking (e.g. Azure deployments)
**Problem**: Azure returns `gpt-4` in the response when `azure/gpt-4-1106-preview` is used. This leads to inaccurate cost tracking
**Solution** ✅ : Set `base_model` on your config so litellm uses the correct model for calculating azure cost
Get the base model name from [here](https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json)
Example config with `base_model`
```yaml
model_list:
- model_name: azure-gpt-3.5
litellm_params:
model: azure/chatgpt-v-2
api_base: os.environ/AZURE_API_BASE
api_key: os.environ/AZURE_API_KEY
api_version: "2023-07-01-preview"
model_info:
base_model: azure/gpt-4-1106-preview
```
## Debugging
If your custom pricing is not being used or you're seeing errors, please check the following:
View file
@ -0,0 +1,194 @@
import Image from '@theme/IdealImage';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Custom Prompt Management
Connect LiteLLM to your prompt management system with custom hooks.
## Overview
<Image
img={require('../../img/custom_prompt_management.png')}
style={{width: '100%', display: 'block', margin: '2rem auto'}}
/>
## How it works
## Quick Start
### 1. Create Your Custom Prompt Manager
Create a class that inherits from `CustomPromptManagement` to handle prompt retrieval and formatting:
**Example Implementation**
Create a new file called `custom_prompt.py` and add this code. The key method here is `get_chat_completion_prompt`; in it, you can implement custom logic to retrieve and format prompts based on the `prompt_id` and `prompt_variables`.
```python
from typing import List, Tuple, Optional
from litellm.integrations.custom_prompt_management import CustomPromptManagement
from litellm.types.llms.openai import AllMessageValues
from litellm.types.utils import StandardCallbackDynamicParams
class MyCustomPromptManagement(CustomPromptManagement):
def get_chat_completion_prompt(
self,
model: str,
messages: List[AllMessageValues],
non_default_params: dict,
prompt_id: str,
prompt_variables: Optional[dict],
dynamic_callback_params: StandardCallbackDynamicParams,
) -> Tuple[str, List[AllMessageValues], dict]:
"""
Retrieve and format prompts based on prompt_id.
Returns:
- model: The model to use
- messages: The formatted messages
- non_default_params: Optional parameters like temperature
"""
# Example matching the diagram: Add system message for prompt_id "1234"
if prompt_id == "1234":
# Prepend system message while preserving existing messages
new_messages = [
{"role": "system", "content": "Be a good Bot!"},
] + messages
return model, new_messages, non_default_params
# Default: Return original messages if no prompt_id match
return model, messages, non_default_params
prompt_management = MyCustomPromptManagement()
```
### 2. Configure Your Prompt Manager in LiteLLM `config.yaml`
```yaml
model_list:
- model_name: gpt-4
litellm_params:
model: openai/gpt-4
api_key: os.environ/OPENAI_API_KEY
litellm_settings:
callbacks: custom_prompt.prompt_management # sets litellm.callbacks = [prompt_management]
```
### 3. Start LiteLLM Gateway
<Tabs>
<TabItem value="docker" label="Docker Run">
Mount your `custom_prompt.py` on the LiteLLM Docker container.
```shell
docker run -d \
-p 4000:4000 \
-e OPENAI_API_KEY=$OPENAI_API_KEY \
--name my-app \
-v $(pwd)/my_config.yaml:/app/config.yaml \
-v $(pwd)/custom_prompt.py:/app/custom_prompt.py \
my-app:latest \
--config /app/config.yaml \
--port 4000 \
--detailed_debug
```
</TabItem>
<TabItem value="py" label="litellm pip">
```shell
litellm --config config.yaml --detailed_debug
```
</TabItem>
</Tabs>
### 4. Test Your Custom Prompt Manager
When you pass `prompt_id="1234"`, the custom prompt manager will add a system message "Be a good Bot!" to your conversation:
<Tabs>
<TabItem value="openai" label="OpenAI Python v1.0.0+">
```python
from openai import OpenAI
client = OpenAI(
api_key="sk-1234",
base_url="http://0.0.0.0:4000"
)
response = client.chat.completions.create(
model="gemini-1.5-pro",
messages=[{"role": "user", "content": "hi"}],
prompt_id="1234"
)
print(response.choices[0].message.content)
```
</TabItem>
<TabItem value="langchain" label="Langchain">
```python
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage
chat = ChatOpenAI(
model="gpt-4",
openai_api_key="sk-1234",
openai_api_base="http://0.0.0.0:4000",
extra_body={
"prompt_id": "1234"
}
)
messages = []
response = chat(messages)
print(response.content)
```
</TabItem>
<TabItem value="curl" label="Curl">
```shell
curl -X POST http://0.0.0.0:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-1234" \
-d '{
"model": "gemini-1.5-pro",
"messages": [{"role": "user", "content": "hi"}],
"prompt_id": "1234"
}'
```
</TabItem>
</Tabs>
The request will be transformed from:
```json
{
"model": "gemini-1.5-pro",
"messages": [{"role": "user", "content": "hi"}],
"prompt_id": "1234"
}
```
To:
```json
{
"model": "gemini-1.5-pro",
"messages": [
{"role": "system", "content": "Be a good Bot!"},
{"role": "user", "content": "hi"}
]
}
```
View file
@ -0,0 +1,86 @@
import Image from '@theme/IdealImage';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# High Availability Setup (Resolve DB Deadlocks)
Use this setup to resolve database deadlocks seen under high traffic.
## What causes the problem?
LiteLLM writes `UPDATE` and `UPSERT` queries to the DB. When using 10+ instances of LiteLLM, these queries can cause deadlocks since each instance could simultaneously attempt to update the same `user_id`, `team_id`, `key` etc.
## How the high availability setup fixes the problem
- All instances will write to a Redis queue instead of the DB.
- A single instance will acquire a lock on the DB and flush the redis queue to the DB.
## How it works
### Stage 1. Each instance writes updates to redis
Each instance will accumulate the spend updates for a key, user, team, etc. and write the updates to a redis queue.
<Image img={require('../../img/deadlock_fix_1.png')} style={{ width: '900px', height: 'auto' }} />
<p style={{textAlign: 'left', color: '#666'}}>
Each instance writes updates to redis
</p>
### Stage 2. A single instance flushes the redis queue to the DB
A single instance will acquire a lock on the DB and flush all elements in the redis queue to the DB.
- 1 instance will attempt to acquire the lock for the DB update job
- The status of the lock is stored in redis
- If the instance acquires the lock to write to DB
- It will read all updates from redis
- Aggregate all updates into 1 transaction
- Write updates to DB
- Release the lock
- Note: Only 1 instance can acquire the lock at a time; this limits the number of instances that can write to the DB at once
<Image img={require('../../img/deadlock_fix_2.png')} style={{ width: '900px', height: 'auto' }} />
<p style={{textAlign: 'left', color: '#666'}}>
A single instance flushes the redis queue to the DB
</p>
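Conceptually, the flush job behaves like the sketch below. This is a simplified illustration using the `redis` Python client — the key names, queue format, and DB helper are illustrative, not LiteLLM's actual internals:
```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

LOCK_KEY = "litellm:spend_update_lock"    # illustrative names
QUEUE_KEY = "litellm:spend_update_queue"

def flush_spend_updates(pod_id: str, db) -> None:
    # Only one pod can hold the lock; nx=True makes SET succeed only if the key is absent.
    if not r.set(LOCK_KEY, pod_id, nx=True, ex=60):
        return  # another pod is currently flushing

    try:
        # Drain the queue
        updates = []
        while (raw := r.lpop(QUEUE_KEY)) is not None:
            updates.append(json.loads(raw))

        # Aggregate everything into one DB transaction (pseudo DB API)
        if updates:
            with db.transaction():
                for u in updates:
                    db.increment_spend(entity_id=u["entity_id"], amount=u["spend"])
    finally:
        # Release the lock only if this pod still owns it
        if r.get(LOCK_KEY) == pod_id.encode():
            r.delete(LOCK_KEY)
```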
## Usage
### Required components
- Redis
- Postgres
### Setup on LiteLLM config
You can enable using the redis buffer by setting `use_redis_transaction_buffer: true` in the `general_settings` section of your `proxy_config.yaml` file.
Note: This setup requires litellm to be connected to a redis instance.
```yaml showLineNumbers title="litellm proxy_config.yaml"
general_settings:
use_redis_transaction_buffer: true
litellm_settings:
cache: True
cache_params:
type: redis
supported_call_types: [] # Optional: Set cache for proxy, but not on the actual llm api call
```
## Monitoring
LiteLLM emits the following prometheus metrics to monitor the health/status of the in memory buffer and redis buffer.
| Metric Name | Description | Storage Type |
|-----------------------------------------------------|-----------------------------------------------------------------------------|--------------|
| `litellm_pod_lock_manager_size` | Indicates which pod has the lock to write updates to the database. | Redis |
| `litellm_in_memory_daily_spend_update_queue_size` | Number of items in the in-memory daily spend update queue. These are the aggregate spend logs for each user. | In-Memory |
| `litellm_redis_daily_spend_update_queue_size` | Number of items in the Redis daily spend update queue. These are the aggregate spend logs for each user. | Redis |
| `litellm_in_memory_spend_update_queue_size` | In-memory aggregate spend values for keys, users, teams, team members, etc.| In-Memory |
| `litellm_redis_spend_update_queue_size` | Redis aggregate spend values for keys, users, teams, etc. | Redis |
View file
@ -46,18 +46,17 @@ You can see the full DB Schema [here](https://github.com/BerriAI/litellm/blob/ma
| Table Name | Description | Row Insert Frequency |
|------------|-------------|---------------------|
| LiteLLM_SpendLogs | Detailed logs of all API requests. Records token usage, spend, and timing information. Tracks which models and keys were used. | **High - every LLM API request** |
| LiteLLM_ErrorLogs | Captures failed requests and errors. Stores exception details and request information. Helps with debugging and monitoring. | **Medium - on errors only** |
| LiteLLM_SpendLogs | Detailed logs of all API requests. Records token usage, spend, and timing information. Tracks which models and keys were used. | **High - every LLM API request - Success or Failure** |
| LiteLLM_AuditLog | Tracks changes to system configuration. Records who made changes and what was modified. Maintains history of updates to teams, users, and models. | **Off by default**, **High - when enabled** |
## Disable `LiteLLM_SpendLogs` & `LiteLLM_ErrorLogs`
## Disable `LiteLLM_SpendLogs`
You can disable spend_logs and error_logs by setting `disable_spend_logs` and `disable_error_logs` to `True` on the `general_settings` section of your proxy_config.yaml file.
```yaml
general_settings:
disable_spend_logs: True # Disable writing spend logs to DB
disable_error_logs: True # Disable writing error logs to DB
disable_error_logs: True # Only disable writing error logs to DB, regular spend logs will still be written unless `disable_spend_logs: True`
```
### What is the impact of disabling these logs?
View file
@ -23,6 +23,12 @@ In the newly created guard's page, you can find a reference to the prompt policy
You can decide which detections will be enabled, and set the threshold for each detection.
:::info
When using LiteLLM with virtual keys, key-specific policies can be set directly in Aim's guards page by specifying the virtual key alias when creating the guard.
Only the aliases of your virtual keys (and not the actual key secrets) will be sent to Aim.
:::
### 3. Add Aim Guardrail on your LiteLLM config.yaml
Define your guardrails under the `guardrails` section
@ -37,7 +43,7 @@ guardrails:
- guardrail_name: aim-protected-app
litellm_params:
guardrail: aim
mode: pre_call # 'during_call' is also available
mode: [pre_call, post_call] # 'during_call' is also available
api_key: os.environ/AIM_API_KEY
api_base: os.environ/AIM_API_BASE # Optional, use only when using a self-hosted Aim Outpost
```
@ -134,7 +140,7 @@ The above request should not be blocked, and you should receive a regular LLM re
</Tabs>
# Advanced
## Advanced
Aim Guard provides user-specific Guardrail policies, enabling you to apply tailored policies to individual users.
To utilize this feature, include the end-user's email in the request payload by setting the `x-aim-user-email` header of your request.
View file
@ -10,10 +10,12 @@ Use this is you want to write code to run a custom guardrail
### 1. Write a `CustomGuardrail` Class
A CustomGuardrail has 3 methods to enforce guardrails
A CustomGuardrail has 4 methods to enforce guardrails
- `async_pre_call_hook` - (Optional) modify input or reject request before making LLM API call
- `async_moderation_hook` - (Optional) reject request, runs while making LLM API call (help to lower latency)
- `async_post_call_success_hook`- (Optional) apply guardrail on input/output, runs after making LLM API call
- `async_post_call_streaming_iterator_hook` - (Optional) pass the entire stream to the guardrail
**[See detailed spec of methods here](#customguardrail-methods)**
@ -128,6 +130,23 @@ class myCustomGuardrail(CustomGuardrail):
):
raise ValueError("Guardrail failed Coffee Detected")
async def async_post_call_streaming_iterator_hook(
self,
user_api_key_dict: UserAPIKeyAuth,
response: Any,
request_data: dict,
) -> AsyncGenerator[ModelResponseStream, None]:
"""
Passes the entire stream to the guardrail
This is useful for guardrails that need to see the entire response, such as PII masking.
See Aim guardrail implementation for an example - https://github.com/BerriAI/litellm/blob/d0e022cfacb8e9ebc5409bb652059b6fd97b45c0/litellm/proxy/guardrails/guardrail_hooks/aim.py#L168
Triggered by mode: 'post_call'
"""
async for item in response:
yield item
```
View file
@ -17,6 +17,14 @@ model_list:
api_key: os.environ/OPENAI_API_KEY
guardrails:
- guardrail_name: general-guard
litellm_params:
guardrail: aim
mode: [pre_call, post_call]
api_key: os.environ/AIM_API_KEY
api_base: os.environ/AIM_API_BASE
default_on: true # Optional
- guardrail_name: "aporia-pre-guard"
litellm_params:
guardrail: aporia # supported values: "aporia", "lakera"
@ -45,6 +53,7 @@ guardrails:
- `pre_call` Run **before** LLM call, on **input**
- `post_call` Run **after** LLM call, on **input & output**
- `during_call` Run **during** LLM call, on **input**. Same as `pre_call`, but runs in parallel with the LLM call. The response is not returned until the guardrail check completes
- A list of the above values to run multiple modes, e.g. `mode: [pre_call, post_call]`
## 2. Start LiteLLM Gateway
@ -569,4 +578,4 @@ guardrails: Union[
class DynamicGuardrailParams:
extra_body: Dict[str, Any] # Additional parameters for the guardrail
```
```
View file
@ -0,0 +1,21 @@
import Image from '@theme/IdealImage';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Image URL Handling
<Image img={require('../../img/image_handling.png')} style={{ width: '900px', height: 'auto' }} />
Some LLM APIs don't support URLs for images, but do support base64 strings.
For those, LiteLLM will:
1. Detect a URL being passed
2. Check if the LLM API supports a URL
3. If not, download the image and convert it to a base64 string
4. Send the provider the base64 string
LiteLLM also caches this result in-memory to reduce latency for subsequent calls.
The in-memory cache is limited to 1MB.
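For intuition, the URL-to-base64 conversion is roughly equivalent to the sketch below (not LiteLLM's internal code, just an illustration of the transformation):
```python
import base64
import requests

def image_url_to_base64(url: str) -> str:
    """Download an image and return it as a base64 data URI."""
    resp = requests.get(url)
    resp.raise_for_status()
    content_type = resp.headers.get("content-type", "image/jpeg")
    encoded = base64.b64encode(resp.content).decode("utf-8")
    return f"data:{content_type};base64,{encoded}"

# The provider then receives the data URI in place of the original URL.
```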
View file
@ -0,0 +1,279 @@
import TabItem from '@theme/TabItem';
import Tabs from '@theme/Tabs';
import Image from '@theme/IdealImage';
# [BETA] Unified File ID
Reuse the same 'file id' across different providers.
| Feature | Description | Comments |
| --- | --- | --- |
| Proxy | ✅ | |
| SDK | ❌ | Requires postgres DB for storing file ids |
| Available across all providers | ✅ | |
Limitations of LiteLLM Managed Files:
- Only works for `/chat/completions` requests.
- Assumes just 1 model configured per model_name.
Follow [here](https://github.com/BerriAI/litellm/discussions/9632) for updates on multiple models and batch support.
### 1. Setup config.yaml
```yaml
model_list:
- model_name: "gemini-2.0-flash"
litellm_params:
model: vertex_ai/gemini-2.0-flash
vertex_project: my-project-id
vertex_location: us-central1
- model_name: "gpt-4o-mini-openai"
litellm_params:
model: gpt-4o-mini
api_key: os.environ/OPENAI_API_KEY
```
### 2. Start proxy
```bash
litellm --config /path/to/config.yaml
```
### 3. Test it!
Specify `target_model_names` to use the same file id across different providers. This is the list of model_names set via config.yaml (or 'public_model_names' on UI).
```python
target_model_names="gpt-4o-mini-openai, gemini-2.0-flash" # 👈 Specify model_names
```
Check `/v1/models` to see the list of available model names for a key.
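For example, with the OpenAI SDK pointed at the proxy (same client setup as the snippets below):
```python
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:4000", api_key="sk-1234")
print([m.id for m in client.models.list().data])
```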
#### **Store a PDF file**
```python
import requests
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:4000", api_key="sk-1234", max_retries=0)
# Download and save the PDF locally
url = (
"https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/2403.05530.pdf"
)
response = requests.get(url)
response.raise_for_status()
# Save the PDF locally
with open("2403.05530.pdf", "wb") as f:
f.write(response.content)
file = client.files.create(
file=open("2403.05530.pdf", "rb"),
purpose="user_data", # can be any openai 'purpose' value
extra_body={"target_model_names": "gpt-4o-mini-openai, gemini-2.0-flash"}, # 👈 Specify model_names
)
print(f"file id={file.id}")
```
#### **Use the same file id across different providers**
<Tabs>
<TabItem value="openai" label="OpenAI">
```python
completion = client.chat.completions.create(
model="gpt-4o-mini-openai",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "What is in this recording?"},
{
"type": "file",
"file": {
"file_id": file.id,
},
},
],
},
]
)
print(completion.choices[0].message)
```
</TabItem>
<TabItem value="vertex" label="Vertex AI">
```python
completion = client.chat.completions.create(
model="gemini-2.0-flash",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "What is in this recording?"},
{
"type": "file",
"file": {
"file_id": file.id,
},
},
],
},
]
)
print(completion.choices[0].message)
```
</TabItem>
</Tabs>
### Complete Example
```python
import base64
import requests
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:4000", api_key="sk-1234", max_retries=0)
# Download and save the PDF locally
url = (
"https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/2403.05530.pdf"
)
response = requests.get(url)
response.raise_for_status()
# Save the PDF locally
with open("2403.05530.pdf", "wb") as f:
f.write(response.content)
# Read the local PDF file
file = client.files.create(
file=open("2403.05530.pdf", "rb"),
purpose="user_data", # can be any openai 'purpose' value
extra_body={"target_model_names": "gpt-4o-mini-openai, vertex_ai/gemini-2.0-flash"},
)
print(f"file.id: {file.id}") # 👈 Unified file id
## GEMINI CALL ###
completion = client.chat.completions.create(
model="gemini-2.0-flash",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "What is in this recording?"},
{
"type": "file",
"file": {
"file_id": file.id,
},
},
],
},
]
)
print(completion.choices[0].message)
### OPENAI CALL ###
completion = client.chat.completions.create(
model="gpt-4o-mini-openai",
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": "What is in this recording?"},
{
"type": "file",
"file": {
"file_id": file.id,
},
},
],
},
],
)
print(completion.choices[0].message)
```
### Supported Endpoints
#### Create a file - `/files`
```python
import requests
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:4000", api_key="sk-1234", max_retries=0)
# Download and save the PDF locally
url = (
"https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/2403.05530.pdf"
)
response = requests.get(url)
response.raise_for_status()
# Save the PDF locally
with open("2403.05530.pdf", "wb") as f:
f.write(response.content)
# Read the local PDF file
file = client.files.create(
file=open("2403.05530.pdf", "rb"),
purpose="user_data", # can be any openai 'purpose' value
extra_body={"target_model_names": "gpt-4o-mini-openai, vertex_ai/gemini-2.0-flash"},
)
```
#### Retrieve a file - `/files/{file_id}`
```python
client = OpenAI(base_url="http://0.0.0.0:4000", api_key="sk-1234", max_retries=0)
file = client.files.retrieve(file_id=file.id)
```
#### Delete a file - `/files/{file_id}/delete`
```python
client = OpenAI(base_url="http://0.0.0.0:4000", api_key="sk-1234", max_retries=0)
file = client.files.delete(file_id=file.id)
```
### FAQ
**1. Does LiteLLM store the file?**
No, LiteLLM does not store the file. It only stores the file IDs in the postgres DB.
**2. How does LiteLLM know which file to use for a given file id?**
LiteLLM stores a mapping of the litellm file id to the model-specific file id in the postgres DB. When a request comes in, LiteLLM looks up the model-specific file id and uses it in the request to the provider.
**3. How do file deletions work?**
When a file is deleted, LiteLLM deletes the mapping from the postgres DB, and the files on each provider.
### Architecture
<Image img={require('../../img/managed_files_arch.png')} style={{ width: '800px', height: 'auto' }} />
Some files were not shown because too many files have changed in this diff