+# *🚅 litellm*
+[](https://pypi.org/project/litellm/)
+[](https://pypi.org/project/litellm/0.1.1/)
+[](https://github.com/BerriAI/litellm/actions/workflows/tests.yml)
+[](https://github.com/BerriAI/litellm/actions/workflows/publish_pypi.yml) 
-LiteLLM manages:
+[](https://discord.gg/wuPM9dRgDw)
-- Translate inputs to provider's `completion`, `embedding`, and `image_generation` endpoints
-- [Consistent output](https://docs.litellm.ai/docs/completion/output), text responses will always be available at `['choices'][0]['message']['content']`
-- Retry/fallback logic across multiple deployments (e.g. Azure/OpenAI) - [Router](https://docs.litellm.ai/docs/routing)
-- Set Budgets & Rate limits per project, api key, model [LiteLLM Proxy Server (LLM Gateway)](https://docs.litellm.ai/docs/simple_proxy)
+a simple & light 100 line package to call OpenAI, Azure, Cohere, Anthropic API Endpoints
-[**Jump to LiteLLM Proxy (LLM Gateway) Docs**](https://github.com/BerriAI/litellm?tab=readme-ov-file#openai-proxy---docs)
-[**Jump to Supported LLM Providers**](https://github.com/BerriAI/litellm?tab=readme-ov-file#supported-providers-docs)
+litellm manages:
+- translating inputs to completion and embedding endpoints
+- guaranteeing consistent output: text responses are always available at `['choices'][0]['message']['content']`
-🚨 **Stable Release:** Use docker images with the `-stable` tag. These have undergone 12 hour load tests, before being published.
+# usage
-Support for more providers. Missing a provider or LLM Platform, raise a [feature request](https://github.com/BerriAI/litellm/issues/new?assignees=&labels=enhancement&projects=&template=feature_request.yml&title=%5BFeature%5D%3A+).
+Read the docs - https://litellm.readthedocs.io/en/latest/
-# Usage ([**Docs**](https://docs.litellm.ai/docs/))
-
-> [!IMPORTANT]
-> LiteLLM v1.0.0 now requires `openai>=1.0.0`. Migration guide [here](https://docs.litellm.ai/docs/migration)
-> LiteLLM v1.40.14+ now requires `pydantic>=2.0.0`. No changes required.
-
-
-
-
-
-```shell
+## quick start
+```
pip install litellm
```
```python
from litellm import completion
import os
## set ENV variables
-os.environ["OPENAI_API_KEY"] = "your-openai-key"
-os.environ["COHERE_API_KEY"] = "your-cohere-key"
+# ENV variables can be set in .env file, too. Example in .env.example
+os.environ["OPENAI_API_KEY"] = "openai key"
+os.environ["COHERE_API_KEY"] = "cohere key"
messages = [{ "content": "Hello, how are you?","role": "user"}]
@@ -72,304 +35,26 @@ messages = [{ "content": "Hello, how are you?","role": "user"}]
response = completion(model="gpt-3.5-turbo", messages=messages)
# cohere call
-response = completion(model="command-nightly", messages=messages)
-print(response)
+response = completion("command-nightly", messages)
+
+# azure openai call
+response = completion("chatgpt-test", messages, azure=True)
+
+# openrouter call
+response = completion("google/palm-2-codechat-bison", messages)
+```
+Code Sample: [Getting Started Notebook](https://colab.research.google.com/drive/1gR3pY-JzDZahzpVdbGBtrNGDBmzUNJaJ?usp=sharing)
+
+Stable version
+```
+pip install litellm==0.1.1
```
-Call any model supported by a provider, with `model=<provider_name>/<model_name>`. There might be provider-specific details here, so refer to [provider docs for more information](https://docs.litellm.ai/docs/providers)
+# hosted version
+- [Grab time if you want access 👋](https://calendly.com/d/4mp-gd3-k5k/berriai-1-1-onboarding-litellm-hosted-version)
-## Async ([Docs](https://docs.litellm.ai/docs/completion/stream#async-completion))
+# why did I build this
+- **Need for simplicity**: My code started to get extremely complicated managing & translating calls between Azure, OpenAI, Cohere
-```python
-from litellm import acompletion
-import asyncio
-
-async def test_get_response():
-    user_message = "Hello, how are you?"
-    messages = [{"content": user_message, "role": "user"}]
-    response = await acompletion(model="gpt-3.5-turbo", messages=messages)
-    return response
-
-response = asyncio.run(test_get_response())
-print(response)
-```
-
-## Streaming ([Docs](https://docs.litellm.ai/docs/completion/stream))
-
-LiteLLM supports streaming the model response back. Pass `stream=True` to get a streaming iterator in the response.
-Streaming is supported for all models (Bedrock, Huggingface, TogetherAI, Azure, OpenAI, etc.)
-
-```python
-from litellm import completion
-response = completion(model="gpt-3.5-turbo", messages=messages, stream=True)
-for part in response:
-    print(part.choices[0].delta.content or "")
-
-# claude 2
-response = completion('claude-2', messages, stream=True)
-for part in response:
-    print(part.choices[0].delta.content or "")
-```
-
-## Logging Observability ([Docs](https://docs.litellm.ai/docs/observability/callbacks))
-
-LiteLLM exposes predefined callbacks to send data to Lunary, Langfuse, DynamoDB, S3 buckets, Helicone, Promptlayer, Traceloop, Athina, Slack, and MLflow.
-
-```python
-import os
-import litellm
-from litellm import completion
-
-## set env variables for logging tools
-os.environ["LUNARY_PUBLIC_KEY"] = "your-lunary-public-key"
-os.environ["HELICONE_API_KEY"] = "your-helicone-auth-key"
-os.environ["LANGFUSE_PUBLIC_KEY"] = ""
-os.environ["LANGFUSE_SECRET_KEY"] = ""
-os.environ["ATHINA_API_KEY"] = "your-athina-api-key"
-
-os.environ["OPENAI_API_KEY"] = "your-openai-key"
-
-# set callbacks
-litellm.success_callback = ["lunary", "langfuse", "athina", "helicone"] # log input/output to lunary, langfuse, supabase, athina, helicone etc
-
-#openai call
-response = completion(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Hi 👋 - i'm openai"}])
-```
-
-# LiteLLM Proxy Server (LLM Gateway) - ([Docs](https://docs.litellm.ai/docs/simple_proxy))
-
-Track spend + Load Balance across multiple projects
-
-[Hosted Proxy (Preview)](https://docs.litellm.ai/docs/hosted)
-
-The proxy provides:
-
-1. [Hooks for auth](https://docs.litellm.ai/docs/proxy/virtual_keys#custom-auth)
-2. [Hooks for logging](https://docs.litellm.ai/docs/proxy/logging#step-1---create-your-custom-litellm-callback-class)
-3. [Cost tracking](https://docs.litellm.ai/docs/proxy/virtual_keys#tracking-spend)
-4. [Rate Limiting](https://docs.litellm.ai/docs/proxy/users#set-rate-limits)
-
-## 📖 Proxy Endpoints - [Swagger Docs](https://litellm-api.up.railway.app/)
-
-
-## Quick Start Proxy - CLI
-
-```shell
-pip install 'litellm[proxy]'
-```
-
-### Step 1: Start litellm proxy
-
-```shell
-$ litellm --model huggingface/bigcode/starcoder
-
-#INFO: Proxy running on http://0.0.0.0:4000
-```
-
-### Step 2: Make ChatCompletions Request to Proxy
-
-
-> [!IMPORTANT]
-> 💡 [Use LiteLLM Proxy with Langchain (Python, JS), OpenAI SDK (Python, JS) Anthropic SDK, Mistral SDK, LlamaIndex, Instructor, Curl](https://docs.litellm.ai/docs/proxy/user_keys)
-
-```python
-import openai # openai v1.0.0+
-client = openai.OpenAI(api_key="anything",base_url="http://0.0.0.0:4000") # set proxy to base_url
-# request sent to model set on litellm proxy, `litellm --model`
-response = client.chat.completions.create(model="gpt-3.5-turbo", messages = [
-    {
-        "role": "user",
-        "content": "this is a test request, write a short poem"
-    }
-])
-
-print(response)
-```
-
-## Proxy Key Management ([Docs](https://docs.litellm.ai/docs/proxy/virtual_keys))
-
-Connect the proxy with a Postgres DB to create proxy keys
-
-```bash
-# Get the code
-git clone https://github.com/BerriAI/litellm
-
-# Go to folder
-cd litellm
-
-# Add the master key - you can change this after setup
-echo 'LITELLM_MASTER_KEY="sk-1234"' > .env
-
-# Add the litellm salt key - you cannot change this after adding a model
-# It is used to encrypt / decrypt your LLM API Key credentials
-# We recommend a password generator such as https://1password.com/password-generator/
-# to get a random hash for the litellm salt key
-echo 'LITELLM_SALT_KEY="sk-1234"' >> .env
-
-source .env
-
-# Start
-docker-compose up
-```
-
-
-UI on `/ui` on your proxy server
-
-
-Set budgets and rate limits across multiple projects
-`POST /key/generate`
-
-### Request
-
-```shell
-curl 'http://0.0.0.0:4000/key/generate' \
---header 'Authorization: Bearer sk-1234' \
---header 'Content-Type: application/json' \
---data-raw '{"models": ["gpt-3.5-turbo", "gpt-4", "claude-2"], "duration": "20m","metadata": {"user": "ishaan@berri.ai", "team": "core-infra"}}'
-```
-
-### Expected Response
-
-```shell
-{
-    "key": "sk-kdEXbIqZRwEeEiHwdg7sFA", # Bearer token
-    "expires": "2023-11-19T01:38:25.838000+00:00" # datetime object
-}
-```
-
-## Supported Providers ([Docs](https://docs.litellm.ai/docs/providers))
-
-| Provider | [Completion](https://docs.litellm.ai/docs/#basic-usage) | [Streaming](https://docs.litellm.ai/docs/completion/stream#streaming-responses) | [Async Completion](https://docs.litellm.ai/docs/completion/stream#async-completion) | [Async Streaming](https://docs.litellm.ai/docs/completion/stream#async-streaming) | [Async Embedding](https://docs.litellm.ai/docs/embedding/supported_embedding) | [Async Image Generation](https://docs.litellm.ai/docs/image_generation) |
-|-------------------------------------------------------------------------------------|---------------------------------------------------------|---------------------------------------------------------------------------------|-------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------|-------------------------------------------------------------------------------|-------------------------------------------------------------------------|
-| [openai](https://docs.litellm.ai/docs/providers/openai) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| [azure](https://docs.litellm.ai/docs/providers/azure) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| [aws - sagemaker](https://docs.litellm.ai/docs/providers/aws_sagemaker) | ✅ | ✅ | ✅ | ✅ | ✅ | |
-| [aws - bedrock](https://docs.litellm.ai/docs/providers/bedrock) | ✅ | ✅ | ✅ | ✅ | ✅ | |
-| [google - vertex_ai](https://docs.litellm.ai/docs/providers/vertex) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
-| [google - palm](https://docs.litellm.ai/docs/providers/palm) | ✅ | ✅ | ✅ | ✅ | | |
-| [google AI Studio - gemini](https://docs.litellm.ai/docs/providers/gemini) | ✅ | ✅ | ✅ | ✅ | | |
-| [mistral ai api](https://docs.litellm.ai/docs/providers/mistral) | ✅ | ✅ | ✅ | ✅ | ✅ | |
-| [cloudflare AI Workers](https://docs.litellm.ai/docs/providers/cloudflare_workers) | ✅ | ✅ | ✅ | ✅ | | |
-| [cohere](https://docs.litellm.ai/docs/providers/cohere) | ✅ | ✅ | ✅ | ✅ | ✅ | |
-| [anthropic](https://docs.litellm.ai/docs/providers/anthropic) | ✅ | ✅ | ✅ | ✅ | | |
-| [empower](https://docs.litellm.ai/docs/providers/empower) | ✅ | ✅ | ✅ | ✅ | | |
-| [huggingface](https://docs.litellm.ai/docs/providers/huggingface) | ✅ | ✅ | ✅ | ✅ | ✅ | |
-| [replicate](https://docs.litellm.ai/docs/providers/replicate) | ✅ | ✅ | ✅ | ✅ | | |
-| [together_ai](https://docs.litellm.ai/docs/providers/togetherai) | ✅ | ✅ | ✅ | ✅ | | |
-| [openrouter](https://docs.litellm.ai/docs/providers/openrouter) | ✅ | ✅ | ✅ | ✅ | | |
-| [ai21](https://docs.litellm.ai/docs/providers/ai21) | ✅ | ✅ | ✅ | ✅ | | |
-| [baseten](https://docs.litellm.ai/docs/providers/baseten) | ✅ | ✅ | ✅ | ✅ | | |
-| [vllm](https://docs.litellm.ai/docs/providers/vllm) | ✅ | ✅ | ✅ | ✅ | | |
-| [nlp_cloud](https://docs.litellm.ai/docs/providers/nlp_cloud) | ✅ | ✅ | ✅ | ✅ | | |
-| [aleph alpha](https://docs.litellm.ai/docs/providers/aleph_alpha) | ✅ | ✅ | ✅ | ✅ | | |
-| [petals](https://docs.litellm.ai/docs/providers/petals) | ✅ | ✅ | ✅ | ✅ | | |
-| [ollama](https://docs.litellm.ai/docs/providers/ollama) | ✅ | ✅ | ✅ | ✅ | ✅ | |
-| [deepinfra](https://docs.litellm.ai/docs/providers/deepinfra) | ✅ | ✅ | ✅ | ✅ | | |
-| [perplexity-ai](https://docs.litellm.ai/docs/providers/perplexity) | ✅ | ✅ | ✅ | ✅ | | |
-| [Groq AI](https://docs.litellm.ai/docs/providers/groq) | ✅ | ✅ | ✅ | ✅ | | |
-| [Deepseek](https://docs.litellm.ai/docs/providers/deepseek) | ✅ | ✅ | ✅ | ✅ | | |
-| [anyscale](https://docs.litellm.ai/docs/providers/anyscale) | ✅ | ✅ | ✅ | ✅ | | |
-| [IBM - watsonx.ai](https://docs.litellm.ai/docs/providers/watsonx) | ✅ | ✅ | ✅ | ✅ | ✅ | |
-| [voyage ai](https://docs.litellm.ai/docs/providers/voyage) | | | | | ✅ | |
-| [xinference [Xorbits Inference]](https://docs.litellm.ai/docs/providers/xinference) | | | | | ✅ | |
-| [FriendliAI](https://docs.litellm.ai/docs/providers/friendliai) | ✅ | ✅ | ✅ | ✅ | | |
-
-[**Read the Docs**](https://docs.litellm.ai/docs/)
-
-## Contributing
-
-To contribute: Clone the repo locally -> Make a change -> Submit a PR with the change.
-
-Here's how to modify the repo locally:
-Step 1: Clone the repo
-
-```
-git clone https://github.com/BerriAI/litellm.git
-```
-
-Step 2: Navigate into the project, and install dependencies:
-
-```
-cd litellm
-poetry install -E extra_proxy -E proxy
-```
-
-Step 3: Test your change:
-
-```
-cd litellm/tests # pwd: Documents/litellm/litellm/tests
-poetry run flake8
-poetry run pytest .
-```
-
-Step 4: Submit a PR with your changes! 🚀
-
-- push your fork to your GitHub repo
-- submit a PR from there
-
-### Building LiteLLM Docker Image
-
-Follow these instructions if you want to build / run the LiteLLM Docker Image yourself.
-
-Step 1: Clone the repo
-
-```
-git clone https://github.com/BerriAI/litellm.git
-```
-
-Step 2: Build the Docker Image
-
-Build using Dockerfile.non_root
-```
-docker build -f docker/Dockerfile.non_root -t litellm_test_image .
-```
-
-Step 3: Run the Docker Image
-
-Make sure config.yaml is present in the root directory. This is your litellm proxy config file.
-```
-docker run \
- -v $(pwd)/proxy_config.yaml:/app/config.yaml \
- -e DATABASE_URL="postgresql://xxxxxxxx" \
- -e LITELLM_MASTER_KEY="sk-1234" \
- -p 4000:4000 \
- litellm_test_image \
- --config /app/config.yaml --detailed_debug
-```
-
-# Enterprise
-For companies that need better security, user management and professional support
-
-[Talk to founders](https://calendly.com/d/4mp-gd3-k5k/litellm-1-1-onboarding-chat)
-
-This covers:
-- ✅ **Features under the [LiteLLM Commercial License](https://docs.litellm.ai/docs/proxy/enterprise):**
-- ✅ **Feature Prioritization**
-- ✅ **Custom Integrations**
-- ✅ **Professional Support - Dedicated discord + slack**
-- ✅ **Custom SLAs**
-- ✅ **Secure access with Single Sign-On**
-
-# Support / talk with founders
-
-- [Schedule Demo 👋](https://calendly.com/d/4mp-gd3-k5k/berriai-1-1-onboarding-litellm-hosted-version)
-- [Community Discord 💭](https://discord.gg/wuPM9dRgDw)
-- Our numbers 📞 +1 (770) 8783-106 / +1 (412) 618-6238
-- Our emails ✉️ ishaan@berri.ai / krrish@berri.ai
-
-# Why did we build this
-
-- **Need for simplicity**: Our code started to get extremely complicated managing & translating calls between Azure, OpenAI and Cohere.
-
-# Contributors
-
-
-
-
-
-
-
-
-
-
-
-
-
+# Support
+Contact us at ishaan@berri.ai / krrish@berri.ai
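
The README's "consistent output" claim can be illustrated with a minimal sketch. The helper names `to_chat_response` and `get_text` are hypothetical and are not part of litellm; the dictionary shape is the one the README documents (`['choices'][0]['message']['content']`).

```python
from typing import Any, Dict


def to_chat_response(text: str) -> Dict[str, Any]:
    """Wrap raw provider text in the OpenAI-style shape (hypothetical helper)."""
    return {
        "choices": [
            {
                "finish_reason": "stop",
                "index": 0,
                "message": {"content": text, "role": "assistant"},
            }
        ]
    }


def get_text(response: Dict[str, Any]) -> str:
    """Read the reply text from the location the README documents."""
    return response["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Whatever the provider returned, callers read the text the same way.
    print(get_text(to_chat_response("Hello from any provider")))
```

Normalizing every provider's reply into one shape is what lets calling code stay the same when the model name changes.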
diff --git a/build/lib/litellm/__init__.py b/build/lib/litellm/__init__.py
new file mode 100644
index 000000000..fd66e12bf
--- /dev/null
+++ b/build/lib/litellm/__init__.py
@@ -0,0 +1,2 @@
+__version__ = "1.0.0"
+from .main import * # Import all the symbols from main.py
\ No newline at end of file
diff --git a/build/lib/litellm/main.py b/build/lib/litellm/main.py
new file mode 100644
index 000000000..d4fc60053
--- /dev/null
+++ b/build/lib/litellm/main.py
@@ -0,0 +1,429 @@
+import os, openai, cohere, replicate, sys
+from typing import Any
+from func_timeout import func_set_timeout, FunctionTimedOut
+from anthropic import Anthropic, HUMAN_PROMPT, AI_PROMPT
+import json
+import traceback
+import threading
+import dotenv
+import traceback
+import subprocess
+####### ENVIRONMENT VARIABLES ###################
+# Loading env variables using dotenv
+dotenv.load_dotenv()
+set_verbose = False
+
+####### COMPLETION MODELS ###################
+open_ai_chat_completion_models = [
+    'gpt-3.5-turbo',
+    'gpt-4'
+]
+open_ai_text_completion_models = [
+    'text-davinci-003'
+]
+
+cohere_models = [
+    'command-nightly',
+]
+
+anthropic_models = [
+    "claude-2",
+    "claude-instant-1"
+]
+
+####### EMBEDDING MODELS ###################
+open_ai_embedding_models = [
+    'text-embedding-ada-002'
+]
+
+#############################################
+
+
+####### COMPLETION ENDPOINTS ################
+#############################################
+@func_set_timeout(10, allowOverride=True) ## https://pypi.org/project/func-timeout/ - timeouts, in case calls hang (e.g. Azure)
+def completion(model, messages, max_tokens=None, forceTimeout=10, azure=False, logger_fn=None):
+    try:
+        if azure == True:
+            # azure configs
+            openai.api_type = "azure"
+            openai.api_base = os.environ.get("AZURE_API_BASE")
+            openai.api_version = os.environ.get("AZURE_API_VERSION")
+            openai.api_key = os.environ.get("AZURE_API_KEY")
+            ## LOGGING
+            logging(model=model, input=messages, azure=azure, logger_fn=logger_fn)
+            ## COMPLETION CALL
+            response = openai.ChatCompletion.create(
+                engine=model,
+                messages = messages
+            )
+        elif "replicate" in model:
+            # replicate defaults to os.environ.get("REPLICATE_API_TOKEN")
+            # checking in case user set it to REPLICATE_API_KEY instead
+            if not os.environ.get("REPLICATE_API_TOKEN") and os.environ.get("REPLICATE_API_KEY"):
+                replicate_api_token = os.environ.get("REPLICATE_API_KEY")
+                os.environ["REPLICATE_API_TOKEN"] = replicate_api_token
+            prompt = " ".join([message["content"] for message in messages])
+            input = {"prompt": prompt} # dict, so the optional keys below can be added
+            if max_tokens:
+                input["max_length"] = max_tokens # for t5 models
+                input["max_new_tokens"] = max_tokens # for llama2 models
+            ## LOGGING
+            logging(model=model, input=input, azure=azure, additional_args={"max_tokens": max_tokens}, logger_fn=logger_fn)
+            ## COMPLETION CALL
+            output = replicate.run(
+                model,
+                input=input)
+            response = ""
+            for item in output:
+                response += item
+            new_response = {
+                "choices": [
+                    {
+                        "finish_reason": "stop",
+                        "index": 0,
+                        "message": {
+                            "content": response,
+                            "role": "assistant"
+                        }
+                    }
+                ]
+            }
+            response = new_response
+        elif model in anthropic_models:
+            #anthropic defaults to os.environ.get("ANTHROPIC_API_KEY")
+            prompt = f"{HUMAN_PROMPT}"
+            for message in messages:
+                if "role" in message:
+                    if message["role"] == "user":
+                        prompt += f"{HUMAN_PROMPT}{message['content']}"
+                    else:
+                        prompt += f"{AI_PROMPT}{message['content']}"
+                else:
+                    prompt += f"{HUMAN_PROMPT}{message['content']}"
+            prompt += f"{AI_PROMPT}"
+            anthropic = Anthropic()
+            if max_tokens:
+                max_tokens_to_sample = max_tokens
+            else:
+                max_tokens_to_sample = 300 # default in Anthropic docs https://docs.anthropic.com/claude/reference/client-libraries
+            ## LOGGING
+            logging(model=model, input=prompt, azure=azure, additional_args={"max_tokens": max_tokens}, logger_fn=logger_fn)
+            ## COMPLETION CALL
+            completion = anthropic.completions.create(
+                model=model,
+                prompt=prompt,
+                max_tokens_to_sample=max_tokens_to_sample
+            )
+            new_response = {
+                "choices": [
+                    {
+                        "finish_reason": "stop",
+                        "index": 0,
+                        "message": {
+                            "content": completion.completion,
+                            "role": "assistant"
+                        }
+                    }
+                ]
+            }
+            response = new_response
+        elif model in cohere_models:
+            cohere_key = os.environ.get("COHERE_API_KEY")
+            co = cohere.Client(cohere_key)
+            prompt = " ".join([message["content"] for message in messages])
+            ## LOGGING
+            logging(model=model, input=prompt, azure=azure, logger_fn=logger_fn)
+            ## COMPLETION CALL
+            response = co.generate(
+                model=model,
+                prompt = prompt
+            )
+            new_response = {
+                "choices": [
+                    {
+                        "finish_reason": "stop",
+                        "index": 0,
+                        "message": {
+                            "content": response[0].text, # text of the first generation
+                            "role": "assistant"
+                        }
+                    }
+                ],
+            }
+            response = new_response
+
+        elif model in open_ai_chat_completion_models:
+            openai.api_type = "openai"
+            openai.api_base = "https://api.openai.com/v1"
+            openai.api_version = None
+            openai.api_key = os.environ.get("OPENAI_API_KEY")
+            ## LOGGING
+            logging(model=model, input=messages, azure=azure, logger_fn=logger_fn)
+            ## COMPLETION CALL
+            response = openai.ChatCompletion.create(
+                model=model,
+                messages = messages
+            )
+        elif model in open_ai_text_completion_models:
+            openai.api_type = "openai"
+            openai.api_base = "https://api.openai.com/v1"
+            openai.api_version = None
+            openai.api_key = os.environ.get("OPENAI_API_KEY")
+            prompt = " ".join([message["content"] for message in messages])
+            ## LOGGING
+            logging(model=model, input=prompt, azure=azure, logger_fn=logger_fn)
+            ## COMPLETION CALL
+            response = openai.Completion.create(
+                model=model,
+                prompt = prompt
+            )
+        else:
+            logging(model=model, input=messages, azure=azure, logger_fn=logger_fn)
+            raise ValueError(f"No supported provider found for model: {model}") # `response` would otherwise be unbound
+        return response
+    except Exception as e:
+        logging(model=model, input=messages, azure=azure, additional_args={"max_tokens": max_tokens}, logger_fn=logger_fn)
+        raise e
+
+
+### EMBEDDING ENDPOINTS ####################
+@func_set_timeout(60, allowOverride=True) ## https://pypi.org/project/func-timeout/
+def embedding(model, input=[], azure=False, forceTimeout=60, logger_fn=None):
+    response = None
+    if azure == True:
+        # azure configs
+        openai.api_type = "azure"
+        openai.api_base = os.environ.get("AZURE_API_BASE")
+        openai.api_version = os.environ.get("AZURE_API_VERSION")
+        openai.api_key = os.environ.get("AZURE_API_KEY")
+        ## LOGGING
+        logging(model=model, input=input, azure=azure, logger_fn=logger_fn)
+        ## EMBEDDING CALL
+        response = openai.Embedding.create(input=input, engine=model)
+        print_verbose(f"response_value: {str(response)[:50]}")
+    elif model in open_ai_embedding_models:
+        openai.api_type = "openai"
+        openai.api_base = "https://api.openai.com/v1"
+        openai.api_version = None
+        openai.api_key = os.environ.get("OPENAI_API_KEY")
+        ## LOGGING
+        logging(model=model, input=input, azure=azure, logger_fn=logger_fn)
+        ## EMBEDDING CALL
+        response = openai.Embedding.create(input=input, model=model)
+        print_verbose(f"response_value: {str(response)[:50]}")
+    else:
+        logging(model=model, input=input, azure=azure, logger_fn=logger_fn)
+
+    return response
+
+
+### CLIENT CLASS #################### make it easy to push completion/embedding runs to different sources -> sentry/posthog/slack, etc.
+class litellm_client:
+    def __init__(self, success_callback=[], failure_callback=[], verbose=False): # Constructor
+        global set_verbose # update the module-level flag read by print_verbose
+        set_verbose = verbose
+        self.success_callback = success_callback
+        self.failure_callback = failure_callback
+        self.logger_fn = None # if user passes in their own logging function
+        self.callback_list = list(set(self.success_callback + self.failure_callback))
+        self.set_callbacks()
+
+    ## COMPLETION CALL
+    def completion(self, model, messages, max_tokens=None, forceTimeout=10, azure=False, logger_fn=None, additional_details={}) -> Any:
+        try:
+            self.logger_fn = logger_fn
+            response = completion(model=model, messages=messages, max_tokens=max_tokens, forceTimeout=forceTimeout, azure=azure, logger_fn=self.handle_input)
+            my_thread = threading.Thread(target=self.handle_success, args=(model, messages, additional_details)) # don't interrupt execution of main thread
+            my_thread.start()
+            return response
+        except Exception as e:
+            args = locals() # get all the param values
+            self.handle_failure(e, args)
+            raise e
+
+    ## EMBEDDING CALL
+    def embedding(self, model, input=[], azure=False, logger_fn=None, forceTimeout=60, additional_details={}) -> Any:
+        try:
+            self.logger_fn = logger_fn
+            response = embedding(model, input, azure=azure, logger_fn=self.handle_input)
+            my_thread = threading.Thread(target=self.handle_success, args=(model, input, additional_details)) # don't interrupt execution of main thread
+            my_thread.start()
+            return response
+        except Exception as e:
+            args = locals() # get all the param values
+            self.handle_failure(e, args)
+            raise e
+
+
+    def set_callbacks(self): #instantiate any external packages
+        for callback in self.callback_list: # only install what's required
+            if callback == "sentry":
+                try:
+                    import sentry_sdk
+                except ImportError:
+                    print_verbose("Package 'sentry_sdk' is missing. Installing it...")
+                    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'sentry_sdk'])
+                    import sentry_sdk
+                self.sentry_sdk = sentry_sdk
+                self.sentry_sdk.init(dsn=os.environ.get("SENTRY_API_URL"), traces_sample_rate=float(os.environ.get("SENTRY_API_TRACE_RATE")))
+                self.capture_exception = self.sentry_sdk.capture_exception
+                self.add_breadcrumb = self.sentry_sdk.add_breadcrumb
+            elif callback == "posthog":
+                try:
+                    from posthog import Posthog
+                except ImportError:
+                    print_verbose("Package 'posthog' is missing. Installing it...")
+                    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'posthog'])
+                    from posthog import Posthog
+                self.posthog = Posthog(
+                    project_api_key=os.environ.get("POSTHOG_API_KEY"),
+                    host=os.environ.get("POSTHOG_API_URL"))
+            elif callback == "slack":
+                try:
+                    from slack_bolt import App
+                except ImportError:
+                    print_verbose("Package 'slack_bolt' is missing. Installing it...")
+                    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'slack_bolt'])
+                    from slack_bolt import App
+                self.slack_app = App(
+                    token=os.environ.get("SLACK_API_TOKEN"),
+                    signing_secret=os.environ.get("SLACK_API_SECRET")
+                )
+                self.alerts_channel = os.environ["SLACK_API_CHANNEL"]
+
+    def handle_input(self, model_call_details={}):
+        if len(model_call_details.keys()) > 0:
+            model = model_call_details["model"] if "model" in model_call_details else None
+            if model:
+                for callback in self.callback_list:
+                    if callback == "sentry": # add a sentry breadcrumb if user passed in sentry integration
+                        self.add_breadcrumb(
+                            category=f'{model}',
+                            message='Trying request model {} input {}'.format(model, json.dumps(model_call_details)),
+                            level='info',
+                        )
+            if self.logger_fn and callable(self.logger_fn):
+                self.logger_fn(model_call_details)
+
+    def handle_success(self, model, messages, additional_details):
+        success_handler = additional_details.pop("success_handler", None)
+        failure_handler = additional_details.pop("failure_handler", None)
+        additional_details["litellm_model"] = str(model)
+        additional_details["litellm_messages"] = str(messages)
+        for callback in self.success_callback:
+            try:
+                if callback == "posthog":
+                    ph_obj = {}
+                    for detail in additional_details:
+                        ph_obj[detail] = additional_details[detail]
+                    event_name = additional_details["successful_event"] if "successful_event" in additional_details else "litellm.success_query"
+                    if "user_id" in additional_details:
+                        self.posthog.capture(additional_details["user_id"], event_name, ph_obj)
+                    else:
+                        self.posthog.capture(event_name, ph_obj)
+                    pass
+                elif callback == "slack":
+                    slack_msg = ""
+                    if len(additional_details.keys()) > 0:
+                        for detail in additional_details:
+                            slack_msg += f"{detail}: {additional_details[detail]}\n"
+                    slack_msg += "Successful call"
+                    self.slack_app.client.chat_postMessage(channel=self.alerts_channel, text=slack_msg)
+            except:
+                pass
+
+        if success_handler and callable(success_handler):
+            call_details = {
+                "model": model,
+                "messages": messages,
+                "additional_details": additional_details
+            }
+            success_handler(call_details)
+        pass
+
+    def handle_failure(self, exception, args):
+        args.pop("self")
+        additional_details = args.pop("additional_details", {})
+
+        success_handler = additional_details.pop("success_handler", None)
+        failure_handler = additional_details.pop("failure_handler", None)
+
+        for callback in self.failure_callback:
+            try:
+                if callback == "slack":
+                    slack_msg = ""
+                    for param in args:
+                        slack_msg += f"{param}: {args[param]}\n"
+                    if len(additional_details.keys()) > 0:
+                        for detail in additional_details:
+                            slack_msg += f"{detail}: {additional_details[detail]}\n"
+                    slack_msg += f"Traceback: {traceback.format_exc()}"
+                    self.slack_app.client.chat_postMessage(channel=self.alerts_channel, text=slack_msg)
+                elif callback == "sentry":
+                    self.capture_exception(exception)
+                elif callback == "posthog":
+                    if len(additional_details.keys()) > 0:
+                        ph_obj = {}
+                        for param in args:
+                            ph_obj[param] = args[param] # plain assignment; `+=` on an empty dict raises KeyError
+                        for detail in additional_details:
+                            ph_obj[detail] = additional_details[detail]
+                        event_name = additional_details["failed_event"] if "failed_event" in additional_details else "litellm.failed_query"
+                        if "user_id" in additional_details:
+                            self.posthog.capture(additional_details["user_id"], event_name, ph_obj)
+                        else:
+                            self.posthog.capture(event_name, ph_obj)
+                    else:
+                        pass
+            except:
+                print(f"got an error calling {callback} - {traceback.format_exc()}")
+
+        if failure_handler and callable(failure_handler):
+            call_details = {
+                "exception": exception,
+                "additional_details": additional_details
+            }
+            failure_handler(call_details)
+        pass
+####### HELPER FUNCTIONS ################
+
+#Logging function -> log the exact model details + what's being sent | Non-Blocking
+def logging(model, input, azure=False, additional_args={}, logger_fn=None):
+    try:
+        model_call_details = {}
+        model_call_details["model"] = model
+        model_call_details["input"] = input
+        model_call_details["azure"] = azure
+        model_call_details["additional_args"] = additional_args
+        if logger_fn and callable(logger_fn):
+            try:
+                # log additional call details -> api key, etc.
+                if azure == True or model in open_ai_chat_completion_models or model in open_ai_text_completion_models or model in open_ai_embedding_models:
+                    model_call_details["api_type"] = openai.api_type
+                    model_call_details["api_base"] = openai.api_base
+                    model_call_details["api_version"] = openai.api_version
+                    model_call_details["api_key"] = openai.api_key
+                elif "replicate" in model:
+                    model_call_details["api_key"] = os.environ.get("REPLICATE_API_TOKEN")
+                elif model in anthropic_models:
+                    model_call_details["api_key"] = os.environ.get("ANTHROPIC_API_KEY")
+                elif model in cohere_models:
+                    model_call_details["api_key"] = os.environ.get("COHERE_API_KEY")
+
+                logger_fn(model_call_details) # Expectation: any logger function passed in by the user should accept a dict object
+            except:
+                print_verbose(f"Basic model call details: {model_call_details}")
+                print_verbose(f"[Non-Blocking] Exception occurred while logging {traceback.format_exc()}")
+                pass
+        else:
+            print_verbose(f"Basic model call details: {model_call_details}")
+            pass
+    except:
+        pass
+
+## Set verbose to true -> ```litellm.main.set_verbose = True```
+def print_verbose(print_statement):
+    if set_verbose:
+        print(f"LiteLLM: {print_statement}")
+        print("Get help - https://discord.com/invite/wuPM9dRgDw")
\ No newline at end of file
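
The `logging` helper in `main.py` above passes a dict to any user-supplied `logger_fn`, and that dict can include `api_key`. A minimal sketch of a logger that masks the key before recording it follows. The names `redacting_logger` and `captured` are hypothetical and not part of litellm; the only assumption taken from the diff is that the logger receives a single dict.

```python
from typing import Any, Dict, List

# Stand-in sink for this sketch; a real logger might write to a file or service.
captured: List[Dict[str, Any]] = []


def redacting_logger(model_call_details: Dict[str, Any]) -> None:
    """Record call details with the API key masked (hypothetical helper)."""
    safe = dict(model_call_details)  # shallow copy, leaves the caller's dict untouched
    key = safe.get("api_key")
    if key:
        # Keep a short prefix so different keys can still be told apart.
        safe["api_key"] = str(key)[:4] + "****"
    captured.append(safe)


if __name__ == "__main__":
    redacting_logger({"model": "gpt-3.5-turbo", "input": "hi", "api_key": "sk-1234567890"})
    print(captured[-1]["api_key"])
```

Going by the `completion` signature in the diff, such a function would be passed as `completion(model, messages, logger_fn=redacting_logger)`.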
diff --git a/ci_cd/check_file_length.py b/ci_cd/check_file_length.py
deleted file mode 100644
index f23b79add..000000000
--- a/ci_cd/check_file_length.py
+++ /dev/null
@@ -1,28 +0,0 @@
-import sys
-
-
-def check_file_length(max_lines, filenames):
-    bad_files = []
-    for filename in filenames:
-        with open(filename, "r") as file:
-            lines = file.readlines()
-            if len(lines) > max_lines:
-                bad_files.append((filename, len(lines)))
-    return bad_files
-
-
-if __name__ == "__main__":
-    max_lines = int(sys.argv[1])
-    filenames = sys.argv[2:]
-
-    bad_files = check_file_length(max_lines, filenames)
-    if bad_files:
-        bad_files.sort(
-            key=lambda x: x[1], reverse=True
-        )  # Sort files by length in descending order
-        for filename, length in bad_files:
-            print(f"{filename}: {length} lines")
-
-        sys.exit(1)
-    else:
-        sys.exit(0)
diff --git a/ci_cd/check_files_match.py b/ci_cd/check_files_match.py
deleted file mode 100644
index 18b6cf792..000000000
--- a/ci_cd/check_files_match.py
+++ /dev/null
@@ -1,32 +0,0 @@
-import sys
-import filecmp
-import shutil
-
-
-def main(argv=None):
-    print(
-        "Comparing model_prices_and_context_window and litellm/model_prices_and_context_window_backup.json files... checking if they match."
-    )
-
-    file1 = "model_prices_and_context_window.json"
-    file2 = "litellm/model_prices_and_context_window_backup.json"
-
-    cmp_result = filecmp.cmp(file1, file2, shallow=False)
-
-    if cmp_result:
-        print(f"Passed! Files {file1} and {file2} match.")
-        return 0
-    else:
-        print(
-            f"Failed! Files {file1} and {file2} do not match. Copying content from {file1} to {file2}."
-        )
-        copy_content(file1, file2)
-        return 1
-
-
-def copy_content(source, destination):
-    shutil.copy2(source, destination)
-
-
-if __name__ == "__main__":
-    sys.exit(main())
diff --git a/codecov.yaml b/codecov.yaml
deleted file mode 100644
index c25cf0fba..000000000
--- a/codecov.yaml
+++ /dev/null
@@ -1,32 +0,0 @@
-component_management:
-  individual_components:
-    - component_id: "Router"
-      paths:
-        - "router"
-    - component_id: "LLMs"
-      paths:
-        - "*/llms/*"
-    - component_id: "Caching"
-      paths:
-        - "*/caching/*"
-        - ".*redis.*"
-    - component_id: "litellm_logging"
-      paths:
-        - "*/integrations/*"
-        - ".*litellm_logging.*"
-    - component_id: "Proxy_Authentication"
-      paths:
-        - "*/proxy/auth/**"
-comment:
-  layout: "header, diff, flags, components" # show component info in the PR comment
-
-coverage:
-  status:
-    project:
-      default:
-        target: auto
-        threshold: 1% # at maximum allow project coverage to drop by 1%
-    patch:
-      default:
-        target: auto
-        threshold: 0% # patch coverage should be 100%
diff --git a/cookbook/Benchmarking_LLMs_by_use_case.ipynb b/cookbook/Benchmarking_LLMs_by_use_case.ipynb
deleted file mode 100644
index 80d96261b..000000000
--- a/cookbook/Benchmarking_LLMs_by_use_case.ipynb
+++ /dev/null
@@ -1,757 +0,0 @@
-{
- "nbformat": 4,
- "nbformat_minor": 0,
- "metadata": {
- "colab": {
- "provenance": []
- },
- "kernelspec": {
- "name": "python3",
- "display_name": "Python 3"
- },
- "language_info": {
- "name": "python"
- }
- },
- "cells": [
- {
- "cell_type": "markdown",
- "source": [
- "# LiteLLM - Benchmark Llama2, Claude1.2 and GPT3.5 for a use case\n",
- "In this notebook for a given use case we run the same question and view:\n",
- "* LLM Response\n",
- "* Response Time\n",
- "* Response Cost\n",
- "\n",
- "## Sample output for a question\n",
- ""
- ],
- "metadata": {
- "id": "4Cq-_Y-TKf0r"
- }
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "O3ENsWYB27Mb"
- },
- "outputs": [],
- "source": [
- "!pip install litellm"
- ]
- },
- {
- "cell_type": "markdown",
- "source": [
- "## Example Use Case 1 - Code Generator\n",
- "### For this use case enter your system prompt and questions\n"
- ],
- "metadata": {
- "id": "Pk55Mjq_3DiR"
- }
- },
- {
- "cell_type": "code",
- "source": [
- "# enter your system prompt if you have one\n",
- "system_prompt = \"\"\"\n",
- "You are a coding assistant helping users using litellm.\n",
- "litellm is a light package to simplify calling OpenAI, Azure, Cohere, Anthropic, Huggingface API Endpoints\n",
- "--\n",
- "Sample Usage:\n",
- "```\n",
- "pip install litellm\n",
- "from litellm import completion\n",
- "## set ENV variables\n",
- "os.environ[\"OPENAI_API_KEY\"] = \"openai key\"\n",
- "os.environ[\"COHERE_API_KEY\"] = \"cohere key\"\n",
- "messages = [{ \"content\": \"Hello, how are you?\",\"role\": \"user\"}]\n",
- "# openai call\n",
- "response = completion(model=\"gpt-3.5-turbo\", messages=messages)\n",
- "# cohere call\n",
- "response = completion(\"command-nightly\", messages)\n",
- "```\n",
- "\n",
- "\"\"\"\n",
- "\n",
- "\n",
- "# questions/logs you want to run the LLM on\n",
- "questions = [\n",
- " \"what is litellm?\",\n",
- " \"why should I use LiteLLM\",\n",
- " \"does litellm support Anthropic LLMs\",\n",
- " \"write code to make a litellm completion call\",\n",
- "]"
- ],
- "metadata": {
- "id": "_1SZYJFB3HmQ"
- },
- "execution_count": 21,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "source": [
- "## Running questions\n",
- "### Select from 100+ LLMs here: https://docs.litellm.ai/docs/providers"
- ],
- "metadata": {
- "id": "AHH3cqeU3_ZT"
- }
- },
- {
- "cell_type": "code",
- "source": [
- "import litellm\n",
- "from litellm import completion, completion_cost\n",
- "import os\n",
- "import time\n",
- "\n",
- "# optional use litellm dashboard to view logs\n",
- "# litellm.use_client = True\n",
- "# litellm.token = \"ishaan_2@berri.ai\" # set your email\n",
- "\n",
- "\n",
- "# set API keys\n",
- "os.environ['TOGETHERAI_API_KEY'] = \"\"\n",
- "os.environ['OPENAI_API_KEY'] = \"\"\n",
- "os.environ['ANTHROPIC_API_KEY'] = \"\"\n",
- "\n",
- "\n",
- "# select LLMs to benchmark\n",
- "# using https://api.together.xyz/playground for llama2\n",
- "# try any supported LLM here: https://docs.litellm.ai/docs/providers\n",
- "\n",
- "models = ['togethercomputer/llama-2-70b-chat', 'gpt-3.5-turbo', 'claude-instant-1.2']\n",
- "data = []\n",
- "\n",
- "for question in questions: # group by question\n",
- " for model in models:\n",
- " print(f\"running question: {question} for model: {model}\")\n",
- " start_time = time.time()\n",
- " # show response, response time, cost for each question\n",
- " response = completion(\n",
- " model=model,\n",
- " max_tokens=500,\n",
- " messages = [\n",
- " {\n",
- " \"role\": \"system\", \"content\": system_prompt\n",
- " },\n",
- " {\n",
- " \"role\": \"user\", \"content\": question\n",
- " }\n",
- " ],\n",
- " )\n",
- " end = time.time()\n",
- " total_time = end-start_time # response time\n",
- " # print(response)\n",
- " cost = completion_cost(response) # cost for completion\n",
- " raw_response = response['choices'][0]['message']['content'] # response string\n",
- "\n",
- "\n",
- " # add log to pandas df\n",
- " data.append(\n",
- " {\n",
- " 'Model': model,\n",
- " 'Question': question,\n",
- " 'Response': raw_response,\n",
- " 'ResponseTime': total_time,\n",
- " 'Cost': cost\n",
- " })"
- ],
- "metadata": {
- "id": "BpQD4A5339L3"
- },
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "source": [
- "## View Benchmarks for LLMs"
- ],
- "metadata": {
- "id": "apOSV3PBLa5Y"
- }
- },
- {
- "cell_type": "code",
- "source": [
- "from IPython.display import display\n",
- "from IPython.core.interactiveshell import InteractiveShell\n",
- "InteractiveShell.ast_node_interactivity = \"all\"\n",
- "from IPython.display import HTML\n",
- "import pandas as pd\n",
- "\n",
- "df = pd.DataFrame(data)\n",
- "grouped_by_question = df.groupby('Question')\n",
- "\n",
- "for question, group_data in grouped_by_question:\n",
- " print(f\"Question: {question}\")\n",
- " HTML(group_data.to_html())\n"
- ],
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 1000
- },
- "id": "CJqBlqUh_8Ws",
- "outputId": "e02c3427-d8c6-4614-ff07-6aab64247ff6"
- },
- "execution_count": 22,
- "outputs": [
- {
- "output_type": "stream",
- "name": "stdout",
- "text": [
- "Question: does litellm support Anthropic LLMs\n"
- ]
- },
- {
- "output_type": "execute_result",
- "data": {
- "text/plain": [
- ""
- ],
- "text/html": [
- "
\n",
- " \n",
- "
\n",
- "
\n",
- "
Model
\n",
- "
Question
\n",
- "
Response
\n",
- "
ResponseTime
\n",
- "
Cost
\n",
- "
\n",
- " \n",
- " \n",
- "
\n",
- "
6
\n",
- "
togethercomputer/llama-2-70b-chat
\n",
- "
does litellm support Anthropic LLMs
\n",
- "
Yes, litellm supports Anthropic LLMs.\\n\\nIn the example usage you provided, the `completion` function is called with the `model` parameter set to `\"gpt-3.5-turbo\"` for OpenAI and `\"command-nightly\"` for Cohere.\\n\\nTo use an Anthropic LLM with litellm, you would set the `model` parameter to the name of the Anthropic model you want to use, followed by the version number, if applicable. For example:\\n```\\nresponse = completion(model=\"anthropic-gpt-2\", messages=messages)\\n```\\nThis would call the Anthropic GPT-2 model to generate a completion for the given input messages.\\n\\nNote that you will need to set the `ANTHROPIC_API_KEY` environment variable to your Anthropic API key before making the call. You can do this by running the following command in your terminal:\\n```\\nos.environ[\"ANTHROPIC_API_KEY\"] = \"your-anthropic-api-key\"\\n```\\nReplace `\"your-anthropic-api-key\"` with your actual Anthropic API key.\\n\\nOnce you've set the environment variable, you can use the `completion` function with the `model` parameter set to an Anthropic model name to call the Anthropic API and generate a completion.
\n",
- "
21.513009
\n",
- "
0.001347
\n",
- "
\n",
- "
\n",
- "
7
\n",
- "
gpt-3.5-turbo
\n",
- "
does litellm support Anthropic LLMs
\n",
- "
No, currently litellm does not support Anthropic LLMs. It mainly focuses on simplifying the usage of OpenAI, Azure, Cohere, and Huggingface API endpoints.
\n",
- "
8.656510
\n",
- "
0.000342
\n",
- "
\n",
- "
\n",
- "
8
\n",
- "
claude-instant-1.2
\n",
- "
does litellm support Anthropic LLMs
\n",
- "
Yes, litellm supports calling Anthropic LLMs through the completion function.\\n\\nTo use an Anthropic model with litellm:\\n\\n1. Set the ANTHROPIC_API_KEY environment variable with your Anthropic API key\\n\\n2. Pass the model name as the 'model' argument to completion(). Anthropic model names follow the format 'anthropic/<model_name>'\\n\\nFor example:\\n\\n```python \\nimport os\\nfrom litellm import completion\\n\\nos.environ[\"ANTHROPIC_API_KEY\"] = \"your_anthropic_api_key\"\\n\\nmessages = [{\"content\": \"Hello\", \"role\": \"user\"}]\\n\\nresponse = completion(model=\"anthropic/constitutional\", messages=messages)\\n```\\n\\nThis would call the Constitutional AI model from Anthropic.\\n\\nSo in summary, litellm provides a simple interface to call any Anthropic models as long as you specify the model name correctly and set the ANTHROPIC_API_KEY env variable.
Litellm is a lightweight Python package that simplifies calling various AI API endpoints, including OpenAI, Azure, Cohere, Anthropic, and Hugging Face. It provides a convenient interface for making requests to these APIs, allowing developers to easily integrate them into their applications. With Litellm, developers can quickly and easily interact with multiple AI models and services, without having to handle the details of authentication, API calls, and response parsing. This makes it easier to build and deploy AI-powered applications, and can help developers save time and effort.
\n",
- "
13.479644
\n",
- "
0.000870
\n",
- "
\n",
- "
\n",
- "
1
\n",
- "
gpt-3.5-turbo
\n",
- "
what is litellm?
\n",
- "
litellm is a light package that provides a simplified interface for making API calls to various language models and APIs. It abstracts away the complexities of handling network requests, authentication, and response parsing, making it easier for developers to integrate powerful language models into their applications.\\n\\nWith litellm, you can quickly make API calls to models like OpenAI's GPT-3.5 Turbo, Azure's Text Analytics, Cohere's Command API, Anthropic's API, and Huggingface's models. It also supports additional functionality like conversational AI, summarization, translation, and more.\\n\\nBy using litellm, you can focus on your application logic without getting tangled in the details of API integration, allowing you to quickly build intelligent and conversational applications.
\n",
- "
8.324332
\n",
- "
0.000566
\n",
- "
\n",
- "
\n",
- "
2
\n",
- "
claude-instant-1.2
\n",
- "
what is litellm?
\n",
- "
litellm is a Python library that simplifies calling various AI API endpoints like OpenAI, Azure, Cohere, Anthropic, and Huggingface. \\n\\nSome key things to know about litellm:\\n\\n- It provides a consistent interface for completing prompts and generating responses from different AI models through a single method called completion().\\n\\n- You specify the API (e.g. OpenAI, Cohere etc.) and model either by name or by setting environment variables before making the completion call.\\n\\n- This avoids having to use different SDKs or APIs for each provider and standardizes the call structure. \\n\\n- It handles things like setting headers, encoding inputs, parsing responses so the user doesn't have to deal with those details.\\n\\n- The goal is to make it easy to try different AI APIs and models without having to change code or learn different interfaces.\\n\\n- It's lightweight with no other dependencies required besides what's needed for each API (e.g. openai, azure SDKs etc.).\\n\\nSo in summary, litellm is a small library that provides a common way to interact with multiple conversational AI APIs through a single Python method, avoiding the need to directly use each provider's specific SDK.
\\nThere are several reasons why you might want to use LiteLLM:\\n\\n1. Simplified API calls: LiteLLM provides a simple and consistent API for calling various language models, making it easier to use multiple models and switch between them.\\n2. Environment variable configuration: LiteLLM allows you to set environment variables for API keys and model names, making it easier to manage and switch between different models and APIs.\\n3. Support for multiple models and APIs: LiteLLM supports a wide range of language models and APIs, including OpenAI, Azure, Cohere, Anthropic, and Hugging Face.\\n4. Easy integration with popular frameworks: LiteLLM can be easily integrated with popular frameworks such as PyTorch and TensorFlow, making it easy to use with your existing codebase.\\n5. Lightweight: LiteLLM is a lightweight package, making it easy to install and use, even on resource-constrained devices.\\n6. Flexible: LiteLLM allows you to define your own models and APIs, making it easy to use with custom models and APIs.\\n7. Extensive documentation: LiteLLM has extensive documentation, making it easy to get started and learn how to use the package.\\n8. Active community: LiteLLM has an active community of developers and users, making it easy to get help and feedback on your projects.\\n\\nOverall, LiteLLM can help you to simplify your workflow, improve your productivity, and make it easier to work with multiple language models and APIs.
\n",
- "
23.777885
\n",
- "
0.001443
\n",
- "
\n",
- "
\n",
- "
4
\n",
- "
gpt-3.5-turbo
\n",
- "
why should I use LiteLLM
\n",
- "
LiteLLM is a lightweight Python package that simplifies the process of making API calls to various language models. Here are some reasons why you should use LiteLLM:\\n\\n1. **Simplified API Calls**: LiteLLM abstracts away the complexity of making API calls to different language models. It provides a unified interface for invoking models from OpenAI, Azure, Cohere, Anthropic, Huggingface, and more.\\n\\n2. **Easy Integration**: LiteLLM seamlessly integrates with your existing codebase. You can import the package and start making API calls with just a few lines of code.\\n\\n3. **Flexibility**: LiteLLM supports a variety of language models, including GPT-3, GPT-Neo, chatGPT, and more. You can choose the model that suits your requirements and easily switch between them.\\n\\n4. **Convenience**: LiteLLM handles the authentication and connection details for you. You just need to set the relevant environment variables, and the package takes care of the rest.\\n\\n5. **Quick Prototyping**: LiteLLM is ideal for rapid prototyping and experimentation. With its simple API, you can quickly generate text, chat with models, and build interactive applications.\\n\\n6. **Community Support**: LiteLLM is actively maintained and supported by a community of developers. You can find help, share ideas, and collaborate with others to enhance your projects.\\n\\nOverall, LiteLLM simplifies the process of making API calls to language models, saving you time and effort while providing flexibility and convenience.
\n",
- "
12.109881
\n",
- "
0.000881
\n",
- "
\n",
- "
\n",
- "
5
\n",
- "
claude-instant-1.2
\n",
- "
why should I use LiteLLM
\n",
- "
Here are some key reasons why you may want to consider using LiteLLM:\\n\\n- Simplifies calling multiple large language models - LiteLLM provides a unified API to call models from different providers like OpenAI, Azure, HuggingFace, Anthropic etc. This avoids having to deal with different APIs from each provider.\\n\\n- Easy to get started - LiteLLM is very lightweight and simple to install with just one pip install. It has minimal configuration needed with just API keys. This lowers the barrier to start experimenting with LLMs.\\n\\n- Standard interface - The completion API provides a standard way to send messages and get responses back regardless of underlying model. This keeps the code interface consistent. \\n\\n- Model agnostic - Your code doesn't need to change to switch between different models. You can easily compare performance of models from different providers.\\n\\n- Open source - As it is open source, the library can be expanded to support more models and providers over time via community contributions.\\n\\n- Cost effective - LiteLLM helps avoid vendor lock-in. You can optimize costs by testing multiple models and choosing the best suitable and cost-effective option for your needs.\\n\\nSo in summary, L
\\nTo make a completion call using litellm, you can use the `completion` function from the litellm package. Here's an example of how to use it:\\n```\\nfrom litellm import completion\\n\\n# Set ENV variables\\nos.environ[\"OPENAI_API_KEY\"] = \"your_openai_api_key\"\\nos.environ[\"COHERE_API_KEY\"] = \"your_cohere_api_key\"\\n\\n# Define the messages to be completed\\nmessages = [\\n {\\n \"content\": \"Hello, how are you?\",\\n \"role\": \"user\"\\n }\\n]\\n\\n# Make a completion call using OpenAI\\nresponse = completion(model=\"gpt-3.5-turbo\", messages=messages)\\n\\n# Make a completion call using Cohere\\nresponse = completion(\"command-nightly\", messages)\\n```\\nIn this example, we first set the ENV variables for the OpenAI and Cohere API keys. Then, we define a list of messages to be completed, which in this case contains a single message with the content \"Hello, how are you?\" and the role \"user\".\\n\\nNext, we make two completion calls using the `completion` function from litellm. The first call uses the OpenAI model `gpt-3.5-turbo` and passes in the list of messages. The second call uses the Cohere model `command-nightly` and passes in the same list of messages.\\n\\nThe `completion` function returns a response object that contains the completed messages. You can then use the `response.messages` attribute to access the completed messages.\\n\\nHere's an example of how to access the completed messages:\\n```\\n# Print the completed messages\\nprint(response.messages)\\n```\\nThis will print the completed messages, which should look something like this:\\n```\\n[{'content': 'Hello, how are you?', 'role': 'user', 'completion': 'I am doing well, thanks for asking.'}]\\n```\\nNote that the `completion` attribute contains the completed message. You can use this attribute to retrieve the completed message and use it in your application.
\n",
- "
32.192453
\n",
- "
0.001821
\n",
- "
\n",
- "
\n",
- "
10
\n",
- "
gpt-3.5-turbo
\n",
- "
write code to make a litellm completion call
\n",
- "
Sure! Here's an example code to make a completion call using the litellm package:\\n\\n```python\\nfrom litellm import completion\\n\\n# Set the model and messages\\nmodel = \"gpt-3.5-turbo\"\\nmessages = [\\n {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\\n {\"role\": \"user\", \"content\": \"Who won the world series in 2020?\"},\\n {\"role\": \"assistant\", \"content\": \"The Los Angeles Dodgers won the World Series in 2020.\"},\\n {\"role\": \"user\", \"content\": \"Where was it played?\"}\\n]\\n\\n# Make the completion call\\nresponse = completion(model=model, messages=messages)\\n\\n# Print the assistant's reply\\nassistant_reply = response[\"choices\"][0][\"message\"][\"content\"]\\nprint(\"Assistant: \", assistant_reply)\\n```\\n\\nMake sure you have the litellm package installed (`pip install litellm`) and set the necessary environment variables for the API keys before running this code.
\n",
- "
9.377155
\n",
- "
0.000686
\n",
- "
\n",
- "
\n",
- "
11
\n",
- "
claude-instant-1.2
\n",
- "
write code to make a litellm completion call
\n",
- "
Here is an example of making a completion call using litellm:\\n\\n```python\\nimport os\\nfrom litellm import completion\\n\\n# Set API keys as environment variables\\nos.environ[\"OPENAI_API_KEY\"] = \"your openai api key\" \\n\\n# Conversation context \\nmessages = [{\\n \"content\": \"Hello, how can I help you today?\",\\n \"role\": \"assistant\"\\n}]\\n\\n# Make completion call with GPT-3 model\\nresponse = completion(\\n model=\"gpt-3.5-turbo\", \\n messages=messages\\n)\\n\\nprint(response)\\n```\\n\\nTo break it down:\\n\\n- Import completion from litellm\\n- Set the OPENAI_API_KEY env var \\n- Define a messages list with the conversation context\\n- Call completion(), specifying the model (\"gpt-3.5-turbo\") and messages\\n- It will return the response from the API\\n- Print the response\\n\\nThis makes a simple completion call to OpenAI GPT-3 using litellm to handle the API details. You can also call other models like Cohere or Anthropic by specifying their name instead of the OpenAI
\n",
- "
9.839988
\n",
- "
0.001578
\n",
- "
\n",
- " \n",
- "
"
- ]
- },
- "metadata": {},
- "execution_count": 22
- }
- ]
- },
- {
- "cell_type": "markdown",
- "source": [
- "## Use Case 2 - Rewrite user input concisely"
- ],
- "metadata": {
- "id": "bmtAbC1rGVAm"
- }
- },
- {
- "cell_type": "code",
- "source": [
- "# enter your system prompt if you have one\n",
- "system_prompt = \"\"\"\n",
- "For a given user input, rewrite the input to be more concise.\n",
- "\"\"\"\n",
- "\n",
- "# user input for re-writing questions\n",
- "questions = [\n",
- " \"LiteLLM is a lightweight Python package that simplifies the process of making API calls to various language models. Here are some reasons why you should use LiteLLM:\\n\\n1. **Simplified API Calls**: LiteLLM abstracts away the complexity of making API calls to different language models. It provides a unified interface for invoking models from OpenAI, Azure, Cohere, Anthropic, Huggingface, and more.\\n\\n2. **Easy Integration**: LiteLLM seamlessly integrates with your existing codebase. You can import the package and start making API calls with just a few lines of code.\\n\\n3. **Flexibility**: LiteLLM supports a variety of language models, including GPT-3, GPT-Neo, chatGPT, and more. You can choose the model that suits your requirements and easily switch between them.\\n\\n4. **Convenience**: LiteLLM handles the authentication and connection details for you. You just need to set the relevant environment variables, and the package takes care of the rest.\\n\\n5. **Quick Prototyping**: LiteLLM is ideal for rapid prototyping and experimentation. With its simple API, you can quickly generate text, chat with models, and build interactive applications.\\n\\n6. **Community Support**: LiteLLM is actively maintained and supported by a community of developers. You can find help, share ideas, and collaborate with others to enhance your projects.\\n\\nOverall, LiteLLM simplifies the process of making API calls to language models, saving you time and effort while providing flexibility and convenience\",\n",
- " \"Hi everyone! I'm [your name] and I'm currently working on [your project/role involving LLMs]. I came across LiteLLM and was really excited by how it simplifies working with different LLM providers. I'm hoping to use LiteLLM to [build an app/simplify my code/test different models etc]. Before finding LiteLLM, I was struggling with [describe any issues you faced working with multiple LLMs]. With LiteLLM's unified API and automatic translation between providers, I think it will really help me to [goals you have for using LiteLLM]. Looking forward to being part of this community and learning more about how I can build impactful applications powered by LLMs!Let me know if you would like me to modify or expand on any part of this suggested intro. I'm happy to provide any clarification or additional details you need!\",\n",
- " \"Traceloop is a platform for monitoring and debugging the quality of your LLM outputs. It provides you with a way to track the performance of your LLM application; rollout changes with confidence; and debug issues in production. It is based on OpenTelemetry, so it can provide full visibility to your LLM requests, as well vector DB usage, and other infra in your stack.\"\n",
- "]"
- ],
- "metadata": {
- "id": "boiHO1PhGXSL"
- },
- "execution_count": 23,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "source": [
- "## Run Questions"
- ],
- "metadata": {
- "id": "fwNcC_obICUc"
- }
- },
- {
- "cell_type": "code",
- "source": [
- "import litellm\n",
- "from litellm import completion, completion_cost\n",
- "import os\n",
- "import time\n",
- "\n",
- "# optional use litellm dashboard to view logs\n",
- "# litellm.use_client = True\n",
- "# litellm.token = \"ishaan_2@berri.ai\" # set your email\n",
- "\n",
- "os.environ['TOGETHERAI_API_KEY'] = \"\"\n",
- "os.environ['OPENAI_API_KEY'] = \"\"\n",
- "os.environ['ANTHROPIC_API_KEY'] = \"\"\n",
- "\n",
- "models = ['togethercomputer/llama-2-70b-chat', 'gpt-3.5-turbo', 'claude-instant-1.2'] # enter llms to benchmark\n",
- "data_2 = []\n",
- "\n",
- "for question in questions: # group by question\n",
- " for model in models:\n",
- " print(f\"running question: {question} for model: {model}\")\n",
- " start_time = time.time()\n",
- " # show response, response time, cost for each question\n",
- " response = completion(\n",
- " model=model,\n",
- " max_tokens=500,\n",
- " messages = [\n",
- " {\n",
- " \"role\": \"system\", \"content\": system_prompt\n",
- " },\n",
- " {\n",
- " \"role\": \"user\", \"content\": \"User input:\" + question\n",
- " }\n",
- " ],\n",
- " )\n",
- " end = time.time()\n",
- " total_time = end-start_time # response time\n",
- " # print(response)\n",
- " cost = completion_cost(response) # cost for completion\n",
- " raw_response = response['choices'][0]['message']['content'] # response string\n",
- " #print(raw_response, total_time, cost)\n",
- "\n",
- " # add to pandas df\n",
- " data_2.append(\n",
- " {\n",
- " 'Model': model,\n",
- " 'Question': question,\n",
- " 'Response': raw_response,\n",
- " 'ResponseTime': total_time,\n",
- " 'Cost': cost\n",
- " })\n",
- "\n",
- "\n"
- ],
- "metadata": {
- "id": "KtBjZ1mUIBiJ"
- },
- "execution_count": null,
- "outputs": []
- },
- {
- "cell_type": "markdown",
- "source": [
- "## View Logs - Group by Question"
- ],
- "metadata": {
- "id": "-PCYIzG5M0II"
- }
- },
- {
- "cell_type": "code",
- "source": [
- "from IPython.display import display\n",
- "from IPython.core.interactiveshell import InteractiveShell\n",
- "InteractiveShell.ast_node_interactivity = \"all\"\n",
- "from IPython.display import HTML\n",
- "import pandas as pd\n",
- "\n",
- "df = pd.DataFrame(data_2)\n",
- "grouped_by_question = df.groupby('Question')\n",
- "\n",
- "for question, group_data in grouped_by_question:\n",
- " print(f\"Question: {question}\")\n",
- " HTML(group_data.to_html())\n"
- ],
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 1000
- },
- "id": "-3R5-2q8IiL2",
- "outputId": "c4a0d9e5-bb21-4de0-fc4c-9f5e71d0f177"
- },
- "execution_count": 20,
- "outputs": [
- {
- "output_type": "stream",
- "name": "stdout",
- "text": [
- "Question: Hi everyone! I'm [your name] and I'm currently working on [your project/role involving LLMs]. I came across LiteLLM and was really excited by how it simplifies working with different LLM providers. I'm hoping to use LiteLLM to [build an app/simplify my code/test different models etc]. Before finding LiteLLM, I was struggling with [describe any issues you faced working with multiple LLMs]. With LiteLLM's unified API and automatic translation between providers, I think it will really help me to [goals you have for using LiteLLM]. Looking forward to being part of this community and learning more about how I can build impactful applications powered by LLMs!Let me know if you would like me to modify or expand on any part of this suggested intro. I'm happy to provide any clarification or additional details you need!\n"
- ]
- },
- {
- "output_type": "execute_result",
- "data": {
- "text/plain": [
- ""
- ],
- "text/html": [
- "
\n",
- " \n",
- "
\n",
- "
\n",
- "
Model
\n",
- "
Question
\n",
- "
Response
\n",
- "
ResponseTime
\n",
- "
Cost
\n",
- "
\n",
- " \n",
- " \n",
- "
\n",
- "
3
\n",
- "
togethercomputer/llama-2-70b-chat
\n",
- "
Hi everyone! I'm [your name] and I'm currently working on [your project/role involving LLMs]. I came across LiteLLM and was really excited by how it simplifies working with different LLM providers. I'm hoping to use LiteLLM to [build an app/simplify my code/test different models etc]. Before finding LiteLLM, I was struggling with [describe any issues you faced working with multiple LLMs]. With LiteLLM's unified API and automatic translation between providers, I think it will really help me to [goals you have for using LiteLLM]. Looking forward to being part of this community and learning more about how I can build impactful applications powered by LLMs!Let me know if you would like me to modify or expand on any part of this suggested intro. I'm happy to provide any clarification or additional details you need!
\n",
- "
\\nHere's a more concise version of the user input:\\n\\n\"Hi everyone! I'm [your name] and I'm working on [your project/role involving LLMs]. I recently discovered LiteLLM and I'm excited to use it to [build an app/simplify my code/test different models etc]. Before LiteLLM, I struggled with [describe any issues you faced working with multiple LLMs]. I'm looking forward to using LiteLLM's unified API and automatic translation to achieve my goals. I'm eager to learn more about building impactful applications powered by LLMs and to be part of this community. Let me know if you have any questions or need further clarification.\"\\n\\nIn this revised version, we've kept the essential information and removed some of the extraneous language. We've also rephrased some of the sentences to make them more concise and easier to read.
\n",
- "
18.300620
\n",
- "
0.001200
\n",
- "
\n",
- "
\n",
- "
4
\n",
- "
gpt-3.5-turbo
\n",
- "
Hi everyone! I'm [your name] and I'm currently working on [your project/role involving LLMs]. I came across LiteLLM and was really excited by how it simplifies working with different LLM providers. I'm hoping to use LiteLLM to [build an app/simplify my code/test different models etc]. Before finding LiteLLM, I was struggling with [describe any issues you faced working with multiple LLMs]. With LiteLLM's unified API and automatic translation between providers, I think it will really help me to [goals you have for using LiteLLM]. Looking forward to being part of this community and learning more about how I can build impactful applications powered by LLMs!Let me know if you would like me to modify or expand on any part of this suggested intro. I'm happy to provide any clarification or additional details you need!
\n",
- "
User input: Hi, I'm [your name] and I'm excited about using LiteLLM to simplify working with different LLM providers. Before finding LiteLLM, I faced challenges working with multiple LLMs. With LiteLLM's unified API and automatic translation, I believe it will help me achieve my goals of [state your goals]. I look forward to being part of this community and learning how to build impactful applications with LLMs. Let me know if you need any further clarification or details.
\n",
- "
7.385472
\n",
- "
0.000525
\n",
- "
\n",
- "
\n",
- "
5
\n",
- "
claude-instant-1.2
\n",
- "
Hi everyone! I'm [your name] and I'm currently working on [your project/role involving LLMs]. I came across LiteLLM and was really excited by how it simplifies working with different LLM providers. I'm hoping to use LiteLLM to [build an app/simplify my code/test different models etc]. Before finding LiteLLM, I was struggling with [describe any issues you faced working with multiple LLMs]. With LiteLLM's unified API and automatic translation between providers, I think it will really help me to [goals you have for using LiteLLM]. Looking forward to being part of this community and learning more about how I can build impactful applications powered by LLMs!Let me know if you would like me to modify or expand on any part of this suggested intro. I'm happy to provide any clarification or additional details you need!
\n",
- "
Here is a more concise rewrite of the user input:\\n\\nHi everyone, I'm [your name]. I'm currently [your project/role] and came across LiteLLM, which simplifies working with different LLMs through its unified API. I hope to [build an app/simplify code/test models] with LiteLLM since I previously struggled with [issues]. LiteLLM's automatic translation between providers will help me [goals] and build impactful LLM applications. Looking forward to learning more as part of this community. Let me know if you need any clarification on my plans to use LiteLLM.
\n",
- "
8.628217
\n",
- "
0.001022
\n",
- "
\n",
- " \n",
- "
"
- ]
- },
- "metadata": {},
- "execution_count": 20
- },
- {
- "output_type": "stream",
- "name": "stdout",
- "text": [
- "Question: LiteLLM is a lightweight Python package that simplifies the process of making API calls to various language models. Here are some reasons why you should use LiteLLM:\n",
- "\n",
- "1. **Simplified API Calls**: LiteLLM abstracts away the complexity of making API calls to different language models. It provides a unified interface for invoking models from OpenAI, Azure, Cohere, Anthropic, Huggingface, and more.\n",
- "\n",
- "2. **Easy Integration**: LiteLLM seamlessly integrates with your existing codebase. You can import the package and start making API calls with just a few lines of code.\n",
- "\n",
- "3. **Flexibility**: LiteLLM supports a variety of language models, including GPT-3, GPT-Neo, chatGPT, and more. You can choose the model that suits your requirements and easily switch between them.\n",
- "\n",
- "4. **Convenience**: LiteLLM handles the authentication and connection details for you. You just need to set the relevant environment variables, and the package takes care of the rest.\n",
- "\n",
- "5. **Quick Prototyping**: LiteLLM is ideal for rapid prototyping and experimentation. With its simple API, you can quickly generate text, chat with models, and build interactive applications.\n",
- "\n",
- "6. **Community Support**: LiteLLM is actively maintained and supported by a community of developers. You can find help, share ideas, and collaborate with others to enhance your projects.\n",
- "\n",
- "Overall, LiteLLM simplifies the process of making API calls to language models, saving you time and effort while providing flexibility and convenience\n"
- ]
- },
- {
- "output_type": "execute_result",
- "data": {
- "text/plain": [
- ""
- ],
- "text/html": [
- "
\n",
- " \n",
- "
\n",
- "
\n",
- "
Model
\n",
- "
Question
\n",
- "
Response
\n",
- "
ResponseTime
\n",
- "
Cost
\n",
- "
\n",
- " \n",
- " \n",
- "
\n",
- "
0
\n",
- "
togethercomputer/llama-2-70b-chat
\n",
- "
LiteLLM is a lightweight Python package that simplifies the process of making API calls to various language models. Here are some reasons why you should use LiteLLM:\\n\\n1. **Simplified API Calls**: LiteLLM abstracts away the complexity of making API calls to different language models. It provides a unified interface for invoking models from OpenAI, Azure, Cohere, Anthropic, Huggingface, and more.\\n\\n2. **Easy Integration**: LiteLLM seamlessly integrates with your existing codebase. You can import the package and start making API calls with just a few lines of code.\\n\\n3. **Flexibility**: LiteLLM supports a variety of language models, including GPT-3, GPT-Neo, chatGPT, and more. You can choose the model that suits your requirements and easily switch between them.\\n\\n4. **Convenience**: LiteLLM handles the authentication and connection details for you. You just need to set the relevant environment variables, and the package takes care of the rest.\\n\\n5. **Quick Prototyping**: LiteLLM is ideal for rapid prototyping and experimentation. With its simple API, you can quickly generate text, chat with models, and build interactive applications.\\n\\n6. **Community Support**: LiteLLM is actively maintained and supported by a community of developers. You can find help, share ideas, and collaborate with others to enhance your projects.\\n\\nOverall, LiteLLM simplifies the process of making API calls to language models, saving you time and effort while providing flexibility and convenience
\n",
- "
Here's a more concise version of the user input:\\n\\nLiteLLM is a lightweight Python package that simplifies API calls to various language models. It abstracts away complexity, integrates seamlessly, supports multiple models, and handles authentication. It's ideal for rapid prototyping and has community support. It saves time and effort while providing flexibility and convenience.
\n",
- "
11.294250
\n",
- "
0.001251
\n",
- "
\n",
- "
\n",
- "
1
\n",
- "
gpt-3.5-turbo
\n",
- "
LiteLLM is a lightweight Python package that simplifies the process of making API calls to various language models. Here are some reasons why you should use LiteLLM:\\n\\n1. **Simplified API Calls**: LiteLLM abstracts away the complexity of making API calls to different language models. It provides a unified interface for invoking models from OpenAI, Azure, Cohere, Anthropic, Huggingface, and more.\\n\\n2. **Easy Integration**: LiteLLM seamlessly integrates with your existing codebase. You can import the package and start making API calls with just a few lines of code.\\n\\n3. **Flexibility**: LiteLLM supports a variety of language models, including GPT-3, GPT-Neo, chatGPT, and more. You can choose the model that suits your requirements and easily switch between them.\\n\\n4. **Convenience**: LiteLLM handles the authentication and connection details for you. You just need to set the relevant environment variables, and the package takes care of the rest.\\n\\n5. **Quick Prototyping**: LiteLLM is ideal for rapid prototyping and experimentation. With its simple API, you can quickly generate text, chat with models, and build interactive applications.\\n\\n6. **Community Support**: LiteLLM is actively maintained and supported by a community of developers. You can find help, share ideas, and collaborate with others to enhance your projects.\\n\\nOverall, LiteLLM simplifies the process of making API calls to language models, saving you time and effort while providing flexibility and convenience
\n",
- "
LiteLLM is a lightweight Python package that simplifies API calls to various language models. Here's why you should use it:\\n1. Simplified API Calls: Works with multiple models (OpenAI, Azure, Cohere, Anthropic, Huggingface).\\n2. Easy Integration: Import and start using it quickly in your codebase.\\n3. Flexibility: Supports GPT-3, GPT-Neo, chatGPT, etc. easily switch between models.\\n4. Convenience: Handles authentication and connection details, just set environment variables.\\n5. Quick Prototyping: Great for rapid prototyping and building interactive applications.\\n6. Community Support: Actively maintained and supported by a developer community.
\n",
- "
9.778315
\n",
- "
0.000795
\n",
- "
\n",
- "
\n",
- "
2
\n",
- "
claude-instant-1.2
\n",
- "
LiteLLM is a lightweight Python package that simplifies the process of making API calls to various language models. Here are some reasons why you should use LiteLLM:\\n\\n1. **Simplified API Calls**: LiteLLM abstracts away the complexity of making API calls to different language models. It provides a unified interface for invoking models from OpenAI, Azure, Cohere, Anthropic, Huggingface, and more.\\n\\n2. **Easy Integration**: LiteLLM seamlessly integrates with your existing codebase. You can import the package and start making API calls with just a few lines of code.\\n\\n3. **Flexibility**: LiteLLM supports a variety of language models, including GPT-3, GPT-Neo, chatGPT, and more. You can choose the model that suits your requirements and easily switch between them.\\n\\n4. **Convenience**: LiteLLM handles the authentication and connection details for you. You just need to set the relevant environment variables, and the package takes care of the rest.\\n\\n5. **Quick Prototyping**: LiteLLM is ideal for rapid prototyping and experimentation. With its simple API, you can quickly generate text, chat with models, and build interactive applications.\\n\\n6. **Community Support**: LiteLLM is actively maintained and supported by a community of developers. You can find help, share ideas, and collaborate with others to enhance your projects.\\n\\nOverall, LiteLLM simplifies the process of making API calls to language models, saving you time and effort while providing flexibility and convenience
\n",
- "
Here is a concise rewrite of the user input:\\n\\nLiteLLM is a lightweight Python package that simplifies accessing various language models. It provides a unified interface for models from OpenAI, Azure, Cohere, Anthropic, Huggingface, and more. Key benefits include simplified API calls, easy integration, flexibility to use different models, automated handling of authentication, and support for quick prototyping. The actively maintained package saves time by abstracting away complexity while offering convenience and a collaborative community.
\n",
- "
7.697520
\n",
- "
0.001098
\n",
- "
\n",
- " \n",
- "
"
- ]
- },
- "metadata": {},
- "execution_count": 20
- },
- {
- "output_type": "stream",
- "name": "stdout",
- "text": [
- "Question: Traceloop is a platform for monitoring and debugging the quality of your LLM outputs. It provides you with a way to track the performance of your LLM application; rollout changes with confidence; and debug issues in production. It is based on OpenTelemetry, so it can provide full visibility to your LLM requests, as well vector DB usage, and other infra in your stack.\n"
- ]
- },
- {
- "output_type": "execute_result",
- "data": {
- "text/plain": [
- ""
- ],
- "text/html": [
- "
\n",
- " \n",
- "
\n",
- "
\n",
- "
Model
\n",
- "
Question
\n",
- "
Response
\n",
- "
ResponseTime
\n",
- "
Cost
\n",
- "
\n",
- " \n",
- " \n",
- "
\n",
- "
6
\n",
- "
togethercomputer/llama-2-70b-chat
\n",
- "
Traceloop is a platform for monitoring and debugging the quality of your LLM outputs. It provides you with a way to track the performance of your LLM application; rollout changes with confidence; and debug issues in production. It is based on OpenTelemetry, so it can provide full visibility to your LLM requests, as well vector DB usage, and other infra in your stack.
\n",
- "
\\nRewritten input: Traceloop is a platform for monitoring and debugging LLM outputs. It allows users to track performance, rollout changes confidently, and debug issues in production. It uses OpenTelemetry for full visibility into LLM requests, vector DB usage, and other infrastructure.
\n",
- "
9.060444
\n",
- "
0.000525
\n",
- "
\n",
- "
\n",
- "
7
\n",
- "
gpt-3.5-turbo
\n",
- "
Traceloop is a platform for monitoring and debugging the quality of your LLM outputs. It provides you with a way to track the performance of your LLM application; rollout changes with confidence; and debug issues in production. It is based on OpenTelemetry, so it can provide full visibility to your LLM requests, as well vector DB usage, and other infra in your stack.
\n",
- "
Traceloop is a platform for monitoring and debugging the quality of your LLM outputs. It helps track performance, rollout changes, and debug issues in production. It is based on OpenTelemetry, providing visibility to LLM requests, vector DB usage, and other infrastructure in your stack.
\n",
- "
7.304661
\n",
- "
0.000283
\n",
- "
\n",
- "
\n",
- "
8
\n",
- "
claude-instant-1.2
\n",
- "
Traceloop is a platform for monitoring and debugging the quality of your LLM outputs. It provides you with a way to track the performance of your LLM application; rollout changes with confidence; and debug issues in production. It is based on OpenTelemetry, so it can provide full visibility to your LLM requests, as well vector DB usage, and other infra in your stack.
\n",
- "
Here is a more concise rewrite of the user input:\\n\\nTraceloop monitors and debugs LLM quality. It tracks LLM performance, enables confident changes, and debugs production issues. Based on OpenTelemetry, Traceloop provides full visibility into LLM requests, vector DB usage, and other stack infrastructure.
\n"
- ],
- "text/plain": [
- "Model Name claude-instant-1 \\\n",
- "Prompt \n",
- "\\nIs paul graham a writer? Yes, Paul Graham is considered a writer in ad... \n",
- "\\nWhat has Paul Graham done? Paul Graham has made significant contribution... \n",
- "\\nWhat is Paul Graham known for? Paul Graham is known for several things:\\n\\n-... \n",
- "\\nWhere does Paul Graham live? Based on the information provided:\\n\\n- Paul ... \n",
- "\\nWho is Paul Graham? Paul Graham is an influential computer scient... \n",
- "\n",
- "Model Name gpt-3.5-turbo-0613 \\\n",
- "Prompt \n",
- "\\nIs paul graham a writer? Yes, Paul Graham is a writer. He has written s... \n",
- "\\nWhat has Paul Graham done? Paul Graham has achieved several notable accom... \n",
- "\\nWhat is Paul Graham known for? Paul Graham is known for his work on the progr... \n",
- "\\nWhere does Paul Graham live? According to the given information, Paul Graha... \n",
- "\\nWho is Paul Graham? Paul Graham is an English computer scientist, ... \n",
- "\n",
- "Model Name gpt-3.5-turbo-16k-0613 \\\n",
- "Prompt \n",
- "\\nIs paul graham a writer? Yes, Paul Graham is a writer. He has authored ... \n",
- "\\nWhat has Paul Graham done? Paul Graham has made significant contributions... \n",
- "\\nWhat is Paul Graham known for? Paul Graham is known for his work on the progr... \n",
- "\\nWhere does Paul Graham live? Paul Graham currently lives in England, where ... \n",
- "\\nWho is Paul Graham? Paul Graham is an English computer scientist, ... \n",
- "\n",
- "Model Name gpt-4-0613 \\\n",
- "Prompt \n",
- "\\nIs paul graham a writer? Yes, Paul Graham is a writer. He is an essayis... \n",
- "\\nWhat has Paul Graham done? Paul Graham is known for his work on the progr... \n",
- "\\nWhat is Paul Graham known for? Paul Graham is known for his work on the progr... \n",
- "\\nWhere does Paul Graham live? The text does not provide a current place of r... \n",
- "\\nWho is Paul Graham? Paul Graham is an English computer scientist, ... \n",
- "\n",
- "Model Name replicate/llama-2-70b-chat:58d078176e02c219e11eb4da5a02a7830a283b14cf8f94537af893ccff5ee781 \n",
- "Prompt \n",
- "\\nIs paul graham a writer? Yes, Paul Graham is an author. According to t... \n",
- "\\nWhat has Paul Graham done? Paul Graham has had a diverse career in compu... \n",
- "\\nWhat is Paul Graham known for? Paul Graham is known for many things, includi... \n",
- "\\nWhere does Paul Graham live? Based on the information provided, Paul Graha... \n",
- "\\nWho is Paul Graham? Paul Graham is an English computer scientist,... "
- ]
- },
- "execution_count": 17,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "import pandas as pd\n",
- "\n",
- "# Create an empty list to store the row data\n",
- "table_data = []\n",
- "\n",
- "# Iterate through the list and extract the required data\n",
- "for item in result:\n",
- " prompt = item['prompt'][0]['content'].replace(context, \"\") # clean the prompt for easy comparison\n",
- " model = item['response']['model']\n",
- " response = item['response']['choices'][0]['message']['content']\n",
- " table_data.append([prompt, model, response])\n",
- "\n",
- "# Create a DataFrame from the table data\n",
- "df = pd.DataFrame(table_data, columns=['Prompt', 'Model Name', 'Response'])\n",
- "\n",
- "# Pivot the DataFrame to get the desired table format\n",
- "table = df.pivot(index='Prompt', columns='Model Name', values='Response')\n",
- "table"
- ]
- },
- {
- "attachments": {},
- "cell_type": "markdown",
- "metadata": {
- "id": "zOxUM40PINDC"
- },
- "source": [
- "# Load Test endpoint\n",
- "\n",
- "Run 100+ simultaneous queries across multiple providers to see when they fail and how latency is affected"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "ZkQf_wbcIRQ9"
- },
- "outputs": [],
- "source": [
- "models=[\"gpt-3.5-turbo\", \"replicate/llama-2-70b-chat:58d078176e02c219e11eb4da5a02a7830a283b14cf8f94537af893ccff5ee781\", \"claude-instant-1\"]\n",
- "context = \"\"\"Paul Graham (/ɡræm/; born 1964)[3] is an English computer scientist, essayist, entrepreneur, venture capitalist, and author. He is best known for his work on the programming language Lisp, his former startup Viaweb (later renamed Yahoo! Store), cofounding the influential startup accelerator and seed capital firm Y Combinator, his essays, and Hacker News. He is the author of several computer programming books, including: On Lisp,[4] ANSI Common Lisp,[5] and Hackers & Painters.[6] Technology journalist Steven Levy has described Graham as a \"hacker philosopher\".[7] Graham was born in England, where he and his family maintain permanent residence. However he is also a citizen of the United States, where he was educated, lived, and worked until 2016.\"\"\"\n",
- "prompt = \"Where does Paul Graham live?\"\n",
- "final_prompt = context + prompt\n",
- "result = load_test_model(models=models, prompt=final_prompt, num_calls=5)"
- ]
- },
- {
- "attachments": {},
- "cell_type": "markdown",
- "metadata": {
- "id": "8vSNBFC06aXY"
- },
- "source": [
- "## Visualize the data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 19,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 552
- },
- "id": "SZfiKjLV3-n8",
- "outputId": "00f7f589-b3da-43ed-e982-f9420f074b8d"
- },
- "outputs": [
- {
- "data": {
- "image/png": "",
- "text/plain": [
- "
"
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
- "source": [
- "import matplotlib.pyplot as plt\n",
- "\n",
- "## calculate avg response time\n",
- "unique_models = set(r[\"response\"]['model'] for r in result[\"results\"])\n",
- "model_dict = {model: {\"response_time\": []} for model in unique_models}\n",
- "for completion_result in result[\"results\"]:\n",
- " model_dict[completion_result[\"response\"][\"model\"]][\"response_time\"].append(completion_result[\"response_time\"])\n",
- "\n",
- "avg_response_time = {}\n",
- "for model, data in model_dict.items():\n",
- " avg_response_time[model] = sum(data[\"response_time\"]) / len(data[\"response_time\"])\n",
- "\n",
- "models = list(avg_response_time.keys())\n",
- "response_times = list(avg_response_time.values())\n",
- "\n",
- "plt.bar(models, response_times)\n",
- "plt.xlabel('Model', fontsize=10)\n",
- "plt.ylabel('Average Response Time')\n",
- "plt.title('Average Response Times for each Model')\n",
- "\n",
- "plt.xticks(models, [model[:15]+'...' if len(model) > 15 else model for model in models], rotation=45)\n",
- "plt.show()"
- ]
- },
- {
- "attachments": {},
- "cell_type": "markdown",
- "metadata": {
- "id": "inSDIE3_IRds"
- },
- "source": [
- "# Duration Test endpoint\n",
- "\n",
- "Run a load test for 2 minutes, hitting the endpoints with 100+ queries every 15 seconds."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 20,
- "metadata": {
- "id": "ePIqDx2EIURH"
- },
- "outputs": [],
- "source": [
- "models=[\"gpt-3.5-turbo\", \"replicate/llama-2-70b-chat:58d078176e02c219e11eb4da5a02a7830a283b14cf8f94537af893ccff5ee781\", \"claude-instant-1\"]\n",
- "context = \"\"\"Paul Graham (/ɡræm/; born 1964)[3] is an English computer scientist, essayist, entrepreneur, venture capitalist, and author. He is best known for his work on the programming language Lisp, his former startup Viaweb (later renamed Yahoo! Store), cofounding the influential startup accelerator and seed capital firm Y Combinator, his essays, and Hacker News. He is the author of several computer programming books, including: On Lisp,[4] ANSI Common Lisp,[5] and Hackers & Painters.[6] Technology journalist Steven Levy has described Graham as a \"hacker philosopher\".[7] Graham was born in England, where he and his family maintain permanent residence. However he is also a citizen of the United States, where he was educated, lived, and worked until 2016.\"\"\"\n",
- "prompt = \"Where does Paul Graham live?\"\n",
- "final_prompt = context + prompt\n",
- "result = load_test_model(models=models, prompt=final_prompt, num_calls=100, interval=15, duration=120)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 27,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/",
- "height": 552
- },
- "id": "k6rJoELM6t1K",
- "outputId": "f4968b59-3bca-4f78-a88b-149ad55e3cf7"
- },
- "outputs": [
- {
- "data": {
- "image/png": "",
- "text/plain": [
- "
\n"
- ],
- "application/vnd.google.colaboratory.module+javascript": "\n import \"https://ssl.gstatic.com/colaboratory/data_table/881c4a0d49046431/data_table.js\";\n\n const table = window.createDataTable({\n data: [[{\n 'v': 0,\n 'f': \"0\",\n },\n\"how do i call completion() using LiteLLM\",\n\"To call the `completion()` function using LiteLLM, you need to follow these steps:\\n\\n1. Install the `litellm` package by running `pip install litellm` in your terminal.\\n2. Import the `Completion` class from the `litellm` module.\\n3. Initialize an instance of the `Completion` class by providing the required parameters like the API endpoint URL and your API key.\\n4. Call the `complete()` method on the `Completion` instance and pass the text prompt as a string.\\n5. Retrieve the generated completion from the response object and use it as desired.\\n\\nHere's an example:\\n\\n```python\\nfrom litellm.completion import Completion\\n\\n# Initialize the Completion client\\ncompletion_client = Completion(\\n model_name='gpt-3.5-turbo',\\n api_key='your_api_key',\\n endpoint='https://your_endpoint_url'\\n)\\n\\n# Call the completion() method\\nresponse = completion_client.complete(\\\"Once upon a time\\\")\\n\\n# Retrieve the generated completion\\ncompletion = response['choices'][0]['text']\\n\\nprint(completion)\\n```\\n\\nMake sure to replace `'gpt-3.5-turbo'` with the desired model name, `'your_api_key'` with your actual API key, and `'https://your_endpoint_url'` with the correct API endpoint URL provided by your service provider.\\n\\nNote: The above example assumes you have a valid API key and endpoint URL for the OpenAI GPT-3.5-turbo model. Make sure to obtain the necessary credentials according to the API you are using.\",\n\" Here is how you can call the completion() method using LiteLLM:\\n\\nFirst, import LiteLLM:\\n\\n```python\\nimport litellm as lm\\n```\\n\\nThen create a LiteLLM object, specifying the API you want to use (e.g. 
\\\"openai\\\"):\\n\\n```python \\nai = lm.LiteLLM(\\\"openai\\\")\\n```\\n\\nNow you can call the completion() method on the ai object:\\n\\n```python\\nresponse = ai.completion(\\n prompt=\\\"Hello\\\", \\n model=\\\"text-davinci-003\\\",\\n max_tokens=100\\n)\\n```\\n\\nThe completion() method takes parameters like:\\n\\n- prompt (str): The prompt text to complete \\n- model (str): The AI model to use\\n- max_tokens (int): The maximum number of tokens to generate\\n\\nIt returns a Python dictionary with the AI's response.\\n\\nYou can then access the generated text using:\\n\\n```python\\nprint(response[\\\"choices\\\"][0][\\\"text\\\"]) \\n```\\n\\nSo LiteLLM provides a simple unified interface to call the underlying AI APIs. The completion() method works similarly for OpenAI\"],\n [{\n 'v': 1,\n 'f': \"1\",\n },\n\"does LiteLLM support VertexAI\",\n\"Yes, LiteLLM does support Google Cloud Vertex AI. It provides convenient wrappers and simplified functions to call Vertex AI API endpoints for natural language processing tasks such as text classification, entity extraction, sentiment analysis, etc. You can easily integrate LiteLLM with Vertex AI in your code to leverage its capabilities.\",\n\" Unfortunately, LiteLLM does not currently support VertexAI. LiteLLM focuses on providing a simple interface to call the APIs of services like OpenAI, Azure, Cohere, Anthropic, and Hugging Face. \\n\\nVertexAI is Google's managed machine learning platform. Integrating VertexAI would require additional development work to wrap the VertexAI SDK in a simple interface like LiteLLM provides for other services. \\n\\nHowever, LiteLLM is open source, so it is possible for someone to contribute support for VertexAI. The maintainers would likely welcome a pull request to add VertexAI as an option if implemented well. 
But out of the box, LiteLLM does not have built-in support for calling VertexAI APIs.\\n\\nThe key services LiteLLM supports are:\\n\\n- OpenAI (GPT, Codex, DALL-E)\\n- Azure Cognitive Services (Text Analytics, Computer Vision, Speech) \\n- Cohere\\n- Anthropic AI\\n- Hugging Face Transformers\\n\\nSo while it doesn't cover every ML API provider, it does make it easy to use the most popular natural language, speech, and vision APIs through a simple interface. Adding VertexAI\"],\n [{\n 'v': 2,\n 'f': \"2\",\n },\n\"how do I set my keys on replicate llama2?\",\n\"To set your keys on Replicate Llama2, follow these steps:\\n\\n1. Open the Llama2 dashboard in your browser.\\n2. Click on the \\\"Settings\\\" tab in the top menu.\\n3. Scroll down to the \\\"API Keys\\\" section.\\n4. Click on the \\\"Add a Key\\\" button.\\n5. Enter a name for your API key to help you identify it later.\\n6. Select the provider for your API key from the dropdown menu. For example, you can select \\\"OpenAI\\\" for OpenAI GPT-3 access.\\n7. Enter your API key in the provided input field. Make sure to copy it correctly.\\n8. Click on the \\\"Save\\\" button to save your API key.\\n\\nNote: The actual steps may vary slightly depending on the platform or interface you are using to access Llama2.\",\n\" Here are the steps to set your API keys on Replicate for litellm:\\n\\n1. Go to your Replicate project settings and select the Environment tab.\\n\\n2. Under Environment Variables, click Add Variable.\\n\\n3. Add variables for the API keys you want to use. The variable names should match the ones used in litellm:\\n\\n- `OPENAI_API_KEY` for OpenAI \\n- `AZURE_API_KEY` for Azure Cognitive Services\\n- `COHERE_API_KEY` for Cohere\\n- `ANTHROPIC_API_KEY` for Anthropic\\n- `HUGGINGFACE_API_KEY` for Hugging Face\\n\\n4. Set the value to your actual API key for each service. Make sure to treat the values as secrets.\\n\\n5. 
Make sure your litellm code is referencing the environment variable names, for example:\\n\\n```python\\nimport litellm as lm\\n\\nlm.auth(openai_key=os.getenv(\\\"OPENAI_API_KEY\\\")) \\n```\\n\\n6. Restart your Replicate runtime to load the new environment variables.\\n\\nNow litellm will use your\"]],\n columns: [[\"number\", \"index\"], [\"string\", \"Question\"], [\"string\", \"gpt-3.5-turbo\"], [\"string\", \"claude-2\"]],\n columnOptions: [{\"width\": \"1px\", \"className\": \"index_column\"}],\n rowsPerPage: 25,\n helpUrl: \"https://colab.research.google.com/notebooks/data_table.ipynb\",\n suppressOutputScrolling: true,\n minimumWidth: undefined,\n });\n\n function appendQuickchartButton(parentElement) {\n let quickchartButtonContainerElement = document.createElement('div');\n quickchartButtonContainerElement.innerHTML = `\n
-
-"""
-
-# see supported values for "voice" on vertex here:
-# https://console.cloud.google.com/vertex-ai/generative/speech/text-to-speech
-response = client.audio.speech.create(
- model = "vertex-tts",
- input=ssml,
- voice={'languageCode': 'en-US', 'name': 'en-US-Studio-O'},
-)
-print("response from proxy", response)
-```
-
-
-
-
-
-### Forcing SSML Usage
-
-You can force the use of SSML by setting the `use_ssml` parameter to `True`. This is useful when you want to ensure that your input is treated as SSML, even if it doesn't contain the `<speak>` tags.
-
-Here are examples of how to force SSML usage:
-
-
-
-
-
-Vertex AI does not support passing a `model` param - so passing `model=vertex_ai/` is the only required param
-
-
-```python
-speech_file_path = Path(__file__).parent / "speech_vertex.mp3"
-
-
-ssml = """
-
-
-
-"""
-
-# see supported values for "voice" on vertex here:
-# https://console.cloud.google.com/vertex-ai/generative/speech/text-to-speech
-response = client.audio.speech.create(
- model = "vertex-tts",
- input=ssml, # SSML input; the OpenAI SDK requires this param
- voice={'languageCode': 'en-US', 'name': 'en-US-Studio-O'},
- extra_body={"use_ssml": True},
-)
-print("response from proxy", response)
-```
-
-
-
-
-## Extra
-
-### Using `GOOGLE_APPLICATION_CREDENTIALS`
-Here's code that stores your service account credentials in the `GOOGLE_APPLICATION_CREDENTIALS` environment variable:
-
-```python
-import json
-import os
-import tempfile
-
-def load_vertex_ai_credentials():
-    # Define the path to the vertex_key.json file
-    print("loading vertex ai credentials")
-    filepath = os.path.dirname(os.path.abspath(__file__))
-    vertex_key_path = filepath + "/vertex_key.json"
-
-    # Read the existing content of the file or create an empty dictionary
-    try:
-        with open(vertex_key_path, "r") as file:
-            # Read the file content
-            print("Read vertexai file path")
-            content = file.read()
-
-            # If the file is empty, fall back to an empty dictionary
-            if not content or not content.strip():
-                service_account_key_data = {}
-            else:
-                # Attempt to load the existing JSON content
-                file.seek(0)
-                service_account_key_data = json.load(file)
-    except FileNotFoundError:
-        # If the file doesn't exist, create an empty dictionary
-        service_account_key_data = {}
-
-    # Create a temporary file
-    with tempfile.NamedTemporaryFile(mode="w+", delete=False) as temp_file:
-        # Write the updated content to the temporary file
-        json.dump(service_account_key_data, temp_file, indent=2)
-
-    # Export the temporary file as GOOGLE_APPLICATION_CREDENTIALS
-    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = os.path.abspath(temp_file.name)
-```
-
-
-### Using GCP Service Account
-
-:::info
-
-Trying to deploy LiteLLM on Google Cloud Run? Tutorial [here](https://docs.litellm.ai/docs/proxy/deploy#deploy-on-google-cloud-run)
-
-:::
-
-1. Figure out the Service Account bound to the Google Cloud Run service
-
-
-
-2. Get the FULL EMAIL address of the corresponding Service Account
-
-3. Next, go to IAM & Admin > Manage Resources and select the top-level project that houses your Google Cloud Run Service
-
-Click `Add Principal`
-
-
-
-4. Specify the Service Account as the principal and Vertex AI User as the role
-
-
-
-Once that's done, when you deploy the new container in the Google Cloud Run service, LiteLLM will have automatic access to all Vertex AI endpoints.
-
-
-s/o @[Darien Kindlund](https://www.linkedin.com/in/kindlund/) for this tutorial
-
-
-
-
diff --git a/docs/my-website/docs/providers/vllm.md b/docs/my-website/docs/providers/vllm.md
deleted file mode 100644
index 5388a0bb7..000000000
--- a/docs/my-website/docs/providers/vllm.md
+++ /dev/null
@@ -1,199 +0,0 @@
-import Tabs from '@theme/Tabs';
-import TabItem from '@theme/TabItem';
-
-# VLLM
-
-LiteLLM supports all models on VLLM.
-
-# Quick Start
-
-## Usage - litellm.completion (calling vLLM endpoint)
-vLLM provides an OpenAI-compatible endpoint. Here's how to call it with LiteLLM.
-
-To call a hosted vLLM server with LiteLLM, add the following to your completion call:
-
-* `model="hosted_vllm/"`
-* `api_base = "your-hosted-vllm-server"`
-
-```python
-import litellm
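-
-# example input; replace with your own messages
-messages = [{"role": "user", "content": "Hello, how are you?"}]
-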
-response = litellm.completion(
- model="hosted_vllm/facebook/opt-125m", # pass the vllm model name
- messages=messages,
- api_base="https://hosted-vllm-api.co",
- temperature=0.2,
- max_tokens=80)
-
-print(response)
-```
-
-
-## Usage - LiteLLM Proxy Server (calling vLLM endpoint)
-
-Here's how to call an OpenAI-Compatible Endpoint with the LiteLLM Proxy Server
-
-1. Modify the config.yaml
-
- ```yaml
- model_list:
- - model_name: my-model
- litellm_params:
- model: hosted_vllm/facebook/opt-125m # add hosted_vllm/ prefix to route as OpenAI provider
- api_base: https://hosted-vllm-api.co # add api base for OpenAI compatible provider
- ```
-
-2. Start the proxy
-
- ```bash
- $ litellm --config /path/to/config.yaml
- ```
-
-3. Send Request to LiteLLM Proxy Server
-
-
-
-
-
- ```python
- import openai
- client = openai.OpenAI(
- api_key="sk-1234", # pass litellm proxy key, if you're using virtual keys
- base_url="http://0.0.0.0:4000" # litellm-proxy-base url
- )
-
- response = client.chat.completions.create(
- model="my-model",
- messages = [
- {
- "role": "user",
- "content": "what llm are you"
- }
- ],
- )
-
- print(response)
- ```
-
-
-
-
- ```shell
- curl --location 'http://0.0.0.0:4000/chat/completions' \
- --header 'Authorization: Bearer sk-1234' \
- --header 'Content-Type: application/json' \
- --data '{
- "model": "my-model",
- "messages": [
- {
- "role": "user",
- "content": "what llm are you"
- }
- ]
- }'
- ```
-
-
-
-
-
-## Extras - for `vllm pip package`
-### Using - `litellm.completion`
-
-```
-pip install litellm vllm
-```
-```python
-import litellm
-
-response = litellm.completion(
- model="vllm/facebook/opt-125m", # add a vllm prefix so litellm knows the custom_llm_provider==vllm
- messages=[{"role": "user", "content": "Hello, how are you?"}], # example prompt
- temperature=0.2,
- max_tokens=80)
-
-print(response)
-```
-
-
-### Batch Completion
-
-```python
-from litellm import batch_completion
-
-model_name = "facebook/opt-125m"
-provider = "vllm"
-messages = [[{"role": "user", "content": "Hey, how's it going"}] for _ in range(5)]
-
-response_list = batch_completion(
- model=model_name,
- custom_llm_provider=provider, # can easily switch to huggingface, replicate, together ai, sagemaker, etc.
- messages=messages,
- temperature=0.2,
- max_tokens=80,
- )
-print(response_list)
-```
-### Prompt Templates
-
-For models with special prompt templates (e.g. Llama2), we format the prompt to fit their template.
-
-**What if we don't support a model you need?**
-You can also specify your own custom prompt formatting, in case we don't cover your model yet.
-
-**Does this mean you have to specify a prompt for all models?**
-No. By default we'll concatenate your message content to make a prompt (expected format for Bloom, T-5, Llama-2 base models, etc.)
-
-**Default Prompt Template**
-```python
-def default_pt(messages):
- return " ".join(message["content"] for message in messages)
-```
-
-[Code for how prompt templates work in LiteLLM](https://github.com/BerriAI/litellm/blob/main/litellm/llms/prompt_templates/factory.py)
-
-
-#### Models we already have Prompt Templates for
-
-| Model Name | Works for Models | Function Call |
-|--------------------------------------|-----------------------------------|------------------------------------------------------------------------------------------------------------------|
-| meta-llama/Llama-2-7b-chat | All meta-llama llama2 chat models | `completion(model='vllm/meta-llama/Llama-2-7b', messages=messages, api_base="your_api_endpoint")` |
-| tiiuae/falcon-7b-instruct | All falcon instruct models | `completion(model='vllm/tiiuae/falcon-7b-instruct', messages=messages, api_base="your_api_endpoint")` |
-| mosaicml/mpt-7b-chat | All mpt chat models | `completion(model='vllm/mosaicml/mpt-7b-chat', messages=messages, api_base="your_api_endpoint")` |
-| codellama/CodeLlama-34b-Instruct-hf | All codellama instruct models | `completion(model='vllm/codellama/CodeLlama-34b-Instruct-hf', messages=messages, api_base="your_api_endpoint")` |
-| WizardLM/WizardCoder-Python-34B-V1.0 | All wizardcoder models | `completion(model='vllm/WizardLM/WizardCoder-Python-34B-V1.0', messages=messages, api_base="your_api_endpoint")` |
-| Phind/Phind-CodeLlama-34B-v2 | All phind-codellama models | `completion(model='vllm/Phind/Phind-CodeLlama-34B-v2', messages=messages, api_base="your_api_endpoint")` |
-
-#### Custom prompt templates
-
-```python
-import litellm  # create your own custom prompt template
-litellm.register_prompt_template(
- model="togethercomputer/LLaMA-2-7B-32K",
- roles={
- "system": {
- "pre_message": "[INST] <<SYS>>\n",
- "post_message": "\n<</SYS>>\n [/INST]\n"
- },
- "user": {
- "pre_message": "[INST] ",
- "post_message": " [/INST]\n"
- },
- "assistant": {
- "pre_message": "\n",
- "post_message": "\n",
- }
- } # tell LiteLLM how you want to map the openai messages to this model
-)
-
-def test_vllm_custom_model():
- model = "vllm/togethercomputer/LLaMA-2-7B-32K"
- response = litellm.completion(model=model, messages=[{"role": "user", "content": "Hello, how are you?"}])
- print(response['choices'][0]['message']['content'])
- return response
-
-test_vllm_custom_model()
-```
-
-[Implementation Code](https://github.com/BerriAI/litellm/blob/6b3cb1898382f2e4e80fd372308ea232868c78d1/litellm/utils.py#L1414)
-
diff --git a/docs/my-website/docs/providers/volcano.md b/docs/my-website/docs/providers/volcano.md
deleted file mode 100644
index 1742a43d8..000000000
--- a/docs/my-website/docs/providers/volcano.md
+++ /dev/null
@@ -1,98 +0,0 @@
-# Volcano Engine (Volcengine)
-https://www.volcengine.com/docs/82379/1263482
-
-:::tip
-
-**We support ALL Volcengine models, just set `model=volcengine/` as a prefix when sending litellm requests**
-
-:::
-
-## API Key
-```python
-# env variable
-os.environ['VOLCENGINE_API_KEY']
-```
-
-## Sample Usage
-```python
-from litellm import completion
-import os
-
-os.environ['VOLCENGINE_API_KEY'] = ""
-response = completion(
- model="volcengine/",
- messages=[
- {
- "role": "user",
- "content": "What's the weather like in Boston today in Fahrenheit?",
- }
- ],
- temperature=0.2, # optional
- top_p=0.9, # optional
- frequency_penalty=0.1, # optional
- presence_penalty=0.1, # optional
- max_tokens=10, # optional
- stop=["\n\n"], # optional
-)
-print(response)
-```
-
-## Sample Usage - Streaming
-```python
-from litellm import completion
-import os
-
-os.environ['VOLCENGINE_API_KEY'] = ""
-response = completion(
- model="volcengine/",
- messages=[
- {
- "role": "user",
- "content": "What's the weather like in Boston today in Fahrenheit?",
- }
- ],
- stream=True,
- temperature=0.2, # optional
- top_p=0.9, # optional
- frequency_penalty=0.1, # optional
- presence_penalty=0.1, # optional
- max_tokens=10, # optional
- stop=["\n\n"], # optional
-)
-
-for chunk in response:
- print(chunk)
-```
-
-
-## Supported Models - 💥 ALL Volcengine Models Supported!
-We support ALL `volcengine` models; just set `volcengine/` as a prefix when sending completion requests.
-
-## Sample Usage - LiteLLM Proxy
-
-### Config.yaml setting
-
-```yaml
-model_list:
- - model_name: volcengine-model
- litellm_params:
- model: volcengine/
- api_key: os.environ/VOLCENGINE_API_KEY
-```
-
-### Send Request
-
-```shell
-curl --location 'http://localhost:4000/chat/completions' \
- --header 'Authorization: Bearer sk-1234' \
- --header 'Content-Type: application/json' \
- --data '{
- "model": "volcengine-model",
- "messages": [
- {
- "role": "user",
- "content": "here is my api key. openai_api_key=sk-1234"
- }
- ]
-}'
-```
\ No newline at end of file
diff --git a/docs/my-website/docs/providers/voyage.md b/docs/my-website/docs/providers/voyage.md
deleted file mode 100644
index a56a1408e..000000000
--- a/docs/my-website/docs/providers/voyage.md
+++ /dev/null
@@ -1,35 +0,0 @@
-# Voyage AI
-https://docs.voyageai.com/embeddings/
-
-## API Key
-```python
-# env variable
-os.environ['VOYAGE_API_KEY']
-```
-
-## Sample Usage - Embedding
-```python
-from litellm import embedding
-import os
-
-os.environ['VOYAGE_API_KEY'] = ""
-response = embedding(
- model="voyage/voyage-01",
- input=["good morning from litellm"],
-)
-print(response)
-```
-
-## Supported Models
-All models listed here https://docs.voyageai.com/embeddings/#models-and-specifics are supported
-
-| Model Name              | Function Call                                                    |
-|-------------------------|------------------------------------------------------------------|
-| voyage-2                | `embedding(model="voyage/voyage-2", input=input)`                |
-| voyage-large-2          | `embedding(model="voyage/voyage-large-2", input=input)`          |
-| voyage-law-2            | `embedding(model="voyage/voyage-law-2", input=input)`            |
-| voyage-code-2           | `embedding(model="voyage/voyage-code-2", input=input)`           |
-| voyage-lite-02-instruct | `embedding(model="voyage/voyage-lite-02-instruct", input=input)` |
-| voyage-01               | `embedding(model="voyage/voyage-01", input=input)`               |
-| voyage-lite-01          | `embedding(model="voyage/voyage-lite-01", input=input)`          |
-| voyage-lite-01-instruct | `embedding(model="voyage/voyage-lite-01-instruct", input=input)` |
\ No newline at end of file
diff --git a/docs/my-website/docs/providers/watsonx.md b/docs/my-website/docs/providers/watsonx.md
deleted file mode 100644
index 7a42a54ed..000000000
--- a/docs/my-website/docs/providers/watsonx.md
+++ /dev/null
@@ -1,284 +0,0 @@
-import Tabs from '@theme/Tabs';
-import TabItem from '@theme/TabItem';
-
-# IBM watsonx.ai
-
-LiteLLM supports all IBM [watsonx.ai](https://watsonx.ai/) foundational models and embeddings.
-
-## Environment Variables
-```python
-os.environ["WATSONX_URL"] = "" # (required) Base URL of your WatsonX instance
-# (required) either one of the following:
-os.environ["WATSONX_APIKEY"] = "" # IBM cloud API key
-os.environ["WATSONX_TOKEN"] = "" # IAM auth token
-# optional - can also be passed as params to completion() or embedding()
-os.environ["WATSONX_PROJECT_ID"] = "" # Project ID of your WatsonX instance
-os.environ["WATSONX_DEPLOYMENT_SPACE_ID"] = "" # ID of your deployment space to use deployed models
-```
-
-See [here](https://cloud.ibm.com/apidocs/watsonx-ai#api-authentication) for more information on how to get an access token to authenticate to watsonx.ai.
-
-## Usage
-
-
-
-
-
-```python
-import os
-from litellm import completion
-
-os.environ["WATSONX_URL"] = ""
-os.environ["WATSONX_APIKEY"] = ""
-
-response = completion(
- model="watsonx/ibm/granite-13b-chat-v2",
- messages=[{ "content": "what is your favorite colour?","role": "user"}],
- project_id="" # or pass with os.environ["WATSONX_PROJECT_ID"]
-)
-
-response = completion(
- model="watsonx/meta-llama/llama-3-8b-instruct",
- messages=[{ "content": "what is your favorite colour?","role": "user"}],
- project_id=""
-)
-```
-
-## Usage - Streaming
-```python
-import os
-from litellm import completion
-
-os.environ["WATSONX_URL"] = ""
-os.environ["WATSONX_APIKEY"] = ""
-os.environ["WATSONX_PROJECT_ID"] = ""
-
-response = completion(
- model="watsonx/ibm/granite-13b-chat-v2",
- messages=[{ "content": "what is your favorite colour?","role": "user"}],
- stream=True
-)
-for chunk in response:
- print(chunk)
-```
-
-#### Example Streaming Output Chunk
-```json
-{
- "choices": [
- {
- "finish_reason": null,
- "index": 0,
- "delta": {
- "content": "I don't have a favorite color, but I do like the color blue. What's your favorite color?"
- }
- }
- ],
- "created": null,
- "model": "watsonx/ibm/granite-13b-chat-v2",
- "usage": {
- "prompt_tokens": null,
- "completion_tokens": null,
- "total_tokens": null
- }
-}
-```
-
-## Usage - Models in deployment spaces
-
-Models that have been deployed to a deployment space (e.g. tuned models) can be called using the `deployment/<deployment_id>` format (where `<deployment_id>` is the ID of the deployed model in your deployment space).
-
-The ID of your deployment space must also be set in the environment variable `WATSONX_DEPLOYMENT_SPACE_ID` or passed to the function as `space_id=<deployment_space_id>`.
-
-```python
-import litellm
-response = litellm.completion(
- model="watsonx/deployment/",
- messages=[{"content": "Hello, how are you?", "role": "user"}],
- space_id=""
-)
-```
-
-## Usage - Embeddings
-
-LiteLLM also supports requests to IBM watsonx.ai embedding models. The credentials needed are the same as for completion.
-
-```python
-from litellm import embedding
-
-response = embedding(
- model="watsonx/ibm/slate-30m-english-rtrvr",
- input=["What is the capital of France?"],
- project_id=""
-)
-print(response)
-# EmbeddingResponse(model='ibm/slate-30m-english-rtrvr', data=[{'object': 'embedding', 'index': 0, 'embedding': [-0.037463713, -0.02141933, -0.02851813, 0.015519324, ..., -0.0021367231, -0.01704561, -0.001425816, 0.0035238306]}], object='list', usage=Usage(prompt_tokens=8, total_tokens=8))
-```
-
-## OpenAI Proxy Usage
-
-Here's how to call IBM watsonx.ai with the LiteLLM Proxy Server
-
-### 1. Save keys in your environment
-
-```bash
-export WATSONX_URL=""
-export WATSONX_APIKEY=""
-export WATSONX_PROJECT_ID=""
-```
-
-### 2. Start the proxy
-
-
-
-
-```bash
-$ litellm --model watsonx/meta-llama/llama-3-8b-instruct
-
-# Server running on http://0.0.0.0:4000
-```
-
-
-
-
-```yaml
-model_list:
- - model_name: llama-3-8b
- litellm_params:
- # all params accepted by litellm.completion()
- model: watsonx/meta-llama/llama-3-8b-instruct
- api_key: "os.environ/WATSONX_APIKEY" # does os.getenv("WATSONX_APIKEY")
-```
-
-
-
-### 3. Test it
-
-
-
-
-
-```shell
-curl --location 'http://0.0.0.0:4000/chat/completions' \
---header 'Content-Type: application/json' \
---data ' {
- "model": "llama-3-8b",
- "messages": [
- {
- "role": "user",
- "content": "what is your favorite colour?"
- }
- ]
- }
-'
-```
-
-
-
-```python
-import openai
-client = openai.OpenAI(
- api_key="anything",
- base_url="http://0.0.0.0:4000"
-)
-
-# request sent to model set on litellm proxy, `litellm --model`
-response = client.chat.completions.create(model="llama-3-8b", messages=[
- {
- "role": "user",
- "content": "what is your favorite colour?"
- }
-])
-
-print(response)
-
-```
-
-
-
-```python
-from langchain.chat_models import ChatOpenAI
-from langchain.prompts.chat import (
- ChatPromptTemplate,
- HumanMessagePromptTemplate,
- SystemMessagePromptTemplate,
-)
-from langchain.schema import HumanMessage, SystemMessage
-
-chat = ChatOpenAI(
- openai_api_base="http://0.0.0.0:4000", # set openai_api_base to the LiteLLM Proxy
- model = "llama-3-8b",
- temperature=0.1
-)
-
-messages = [
- SystemMessage(
- content="You are a helpful assistant that I'm using to make a test request to."
- ),
- HumanMessage(
- content="test from litellm. tell me why it's amazing in 1 sentence"
- ),
-]
-response = chat(messages)
-
-print(response)
-```
-
-
-
-
-## Authentication
-
-### Passing credentials as parameters
-
-You can also pass the credentials as parameters to the completion and embedding functions.
-
-```python
-import os
-from litellm import completion
-
-response = completion(
- model="watsonx/ibm/granite-13b-chat-v2",
- messages=[{ "content": "What is your favorite color?","role": "user"}],
- url="",
- api_key="",
- project_id=""
-)
-```
-
-
-## Supported IBM watsonx.ai Models
-
-Here are some examples of models available in IBM watsonx.ai that you can use with LiteLLM:
-
-| Model Name                         | Command                                                                                    |
-|------------------------------------|--------------------------------------------------------------------------------------------|
-| Flan T5 XXL                        | `completion(model="watsonx/google/flan-t5-xxl", messages=messages)`                        |
-| Flan Ul2                           | `completion(model="watsonx/google/flan-ul2", messages=messages)`                           |
-| Mt0 XXL                            | `completion(model="watsonx/bigscience/mt0-xxl", messages=messages)`                        |
-| Gpt Neox                           | `completion(model="watsonx/eleutherai/gpt-neox-20b", messages=messages)`                   |
-| Mpt 7B Instruct2                   | `completion(model="watsonx/ibm/mpt-7b-instruct2", messages=messages)`                      |
-| Starcoder                          | `completion(model="watsonx/bigcode/starcoder", messages=messages)`                         |
-| Llama 2 70B Chat                   | `completion(model="watsonx/meta-llama/llama-2-70b-chat", messages=messages)`               |
-| Llama 2 13B Chat                   | `completion(model="watsonx/meta-llama/llama-2-13b-chat", messages=messages)`               |
-| Granite 13B Instruct               | `completion(model="watsonx/ibm/granite-13b-instruct-v1", messages=messages)`               |
-| Granite 13B Chat                   | `completion(model="watsonx/ibm/granite-13b-chat-v1", messages=messages)`                   |
-| Flan T5 XL                         | `completion(model="watsonx/google/flan-t5-xl", messages=messages)`                         |
-| Granite 13B Chat V2                | `completion(model="watsonx/ibm/granite-13b-chat-v2", messages=messages)`                   |
-| Granite 13B Instruct V2            | `completion(model="watsonx/ibm/granite-13b-instruct-v2", messages=messages)`               |
-| Elyza Japanese Llama 2 7B Instruct | `completion(model="watsonx/elyza/elyza-japanese-llama-2-7b-instruct", messages=messages)`  |
-| Mixtral 8X7B Instruct V01 Q        | `completion(model="watsonx/ibm-mistralai/mixtral-8x7b-instruct-v01-q", messages=messages)` |
-
-
-For a list of all available models in watsonx.ai, see [here](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/fm-models.html?context=wx&locale=en&audience=wdp).
-
-
-## Supported IBM watsonx.ai Embedding Models
-
-| Model Name | Function Call |
-|------------|------------------------------------------------------------------------|
-| Slate 30m | `embedding(model="watsonx/ibm/slate-30m-english-rtrvr", input=input)` |
-| Slate 125m | `embedding(model="watsonx/ibm/slate-125m-english-rtrvr", input=input)` |
-
-
-For a list of all available embedding models in watsonx.ai, see [here](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/fm-models-embed.html?context=wx).
\ No newline at end of file
diff --git a/docs/my-website/docs/providers/xai.md b/docs/my-website/docs/providers/xai.md
deleted file mode 100644
index 131c02b3d..000000000
--- a/docs/my-website/docs/providers/xai.md
+++ /dev/null
@@ -1,146 +0,0 @@
-import Tabs from '@theme/Tabs';
-import TabItem from '@theme/TabItem';
-
-# XAI
-
-https://docs.x.ai/docs
-
-:::tip
-
-**We support ALL XAI models, just set `model=xai/` as a prefix when sending litellm requests**
-
-:::
-
-## API Key
-```python
-# env variable
-os.environ['XAI_API_KEY']
-```
-
-## Sample Usage
-```python
-from litellm import completion
-import os
-
-os.environ['XAI_API_KEY'] = ""
-response = completion(
- model="xai/grok-beta",
- messages=[
- {
- "role": "user",
- "content": "What's the weather like in Boston today in Fahrenheit?",
- }
- ],
- max_tokens=10,
- response_format={ "type": "json_object" },
- seed=123,
- stop=["\n\n"],
- temperature=0.2,
- top_p=0.9,
- tool_choice="auto",
- tools=[],
- user="user",
-)
-print(response)
-```
-
-## Sample Usage - Streaming
-```python
-from litellm import completion
-import os
-
-os.environ['XAI_API_KEY'] = ""
-response = completion(
- model="xai/grok-beta",
- messages=[
- {
- "role": "user",
- "content": "What's the weather like in Boston today in Fahrenheit?",
- }
- ],
- stream=True,
- max_tokens=10,
- response_format={ "type": "json_object" },
- seed=123,
- stop=["\n\n"],
- temperature=0.2,
- top_p=0.9,
- tool_choice="auto",
- tools=[],
- user="user",
-)
-
-for chunk in response:
- print(chunk)
-```
-
-
-## Usage with LiteLLM Proxy Server
-
-Here's how to call an XAI model with the LiteLLM Proxy Server
-
-1. Modify the config.yaml
-
- ```yaml
- model_list:
- - model_name: my-model
- litellm_params:
- model: xai/ # add xai/ prefix to route as XAI provider
- api_key: api-key # API key sent with requests to your model
- ```
-
-
-2. Start the proxy
-
- ```bash
- $ litellm --config /path/to/config.yaml
- ```
-
-3. Send Request to LiteLLM Proxy Server
-
-
-
-
-
- ```python
- import openai
- client = openai.OpenAI(
- api_key="sk-1234", # pass litellm proxy key, if you're using virtual keys
- base_url="http://0.0.0.0:4000" # litellm-proxy-base url
- )
-
- response = client.chat.completions.create(
- model="my-model",
- messages = [
- {
- "role": "user",
- "content": "what llm are you"
- }
- ],
- )
-
- print(response)
- ```
-
-
-
-
- ```shell
- curl --location 'http://0.0.0.0:4000/chat/completions' \
- --header 'Authorization: Bearer sk-1234' \
- --header 'Content-Type: application/json' \
- --data '{
- "model": "my-model",
- "messages": [
- {
- "role": "user",
- "content": "what llm are you"
- }
- ]
- }'
- ```
-
-
-
-
-
diff --git a/docs/my-website/docs/providers/xinference.md b/docs/my-website/docs/providers/xinference.md
deleted file mode 100644
index 3686c0209..000000000
--- a/docs/my-website/docs/providers/xinference.md
+++ /dev/null
@@ -1,62 +0,0 @@
-# Xinference [Xorbits Inference]
-https://inference.readthedocs.io/en/latest/index.html
-
-## API Base, Key
-```python
-# env variable
-os.environ['XINFERENCE_API_BASE'] = "http://127.0.0.1:9997/v1"
-os.environ['XINFERENCE_API_KEY'] = "anything" #[optional] no api key required
-```
-
-## Sample Usage - Embedding
-```python
-from litellm import embedding
-import os
-
-os.environ['XINFERENCE_API_BASE'] = "http://127.0.0.1:9997/v1"
-response = embedding(
- model="xinference/bge-base-en",
- input=["good morning from litellm"],
-)
-print(response)
-```
-
-## Sample Usage `api_base` param
-```python
-from litellm import embedding
-import os
-
-response = embedding(
- model="xinference/bge-base-en",
- api_base="http://127.0.0.1:9997/v1",
- input=["good morning from litellm"],
-)
-print(response)
-```
-
-## Supported Models
-All models listed here https://inference.readthedocs.io/en/latest/models/builtin/embedding/index.html are supported
-
-| Model Name                  | Function Call                                                            |
-|-----------------------------|--------------------------------------------------------------------------|
-| bge-base-en                 | `embedding(model="xinference/bge-base-en", input=input)`                 |
-| bge-base-en-v1.5            | `embedding(model="xinference/bge-base-en-v1.5", input=input)`            |
-| bge-base-zh                 | `embedding(model="xinference/bge-base-zh", input=input)`                 |
-| bge-base-zh-v1.5            | `embedding(model="xinference/bge-base-zh-v1.5", input=input)`            |
-| bge-large-en                | `embedding(model="xinference/bge-large-en", input=input)`                |
-| bge-large-en-v1.5           | `embedding(model="xinference/bge-large-en-v1.5", input=input)`           |
-| bge-large-zh                | `embedding(model="xinference/bge-large-zh", input=input)`                |
-| bge-large-zh-noinstruct     | `embedding(model="xinference/bge-large-zh-noinstruct", input=input)`     |
-| bge-large-zh-v1.5           | `embedding(model="xinference/bge-large-zh-v1.5", input=input)`           |
-| bge-small-en-v1.5           | `embedding(model="xinference/bge-small-en-v1.5", input=input)`           |
-| bge-small-zh                | `embedding(model="xinference/bge-small-zh", input=input)`                |
-| bge-small-zh-v1.5           | `embedding(model="xinference/bge-small-zh-v1.5", input=input)`           |
-| e5-large-v2                 | `embedding(model="xinference/e5-large-v2", input=input)`                 |
-| gte-base                    | `embedding(model="xinference/gte-base", input=input)`                    |
-| gte-large                   | `embedding(model="xinference/gte-large", input=input)`                   |
-| jina-embeddings-v2-base-en  | `embedding(model="xinference/jina-embeddings-v2-base-en", input=input)`  |
-| jina-embeddings-v2-small-en | `embedding(model="xinference/jina-embeddings-v2-small-en", input=input)` |
-| multilingual-e5-large       | `embedding(model="xinference/multilingual-e5-large", input=input)`       |
-
-
-
diff --git a/docs/my-website/docs/proxy/access_control.md b/docs/my-website/docs/proxy/access_control.md
deleted file mode 100644
index 3d335380f..000000000
--- a/docs/my-website/docs/proxy/access_control.md
+++ /dev/null
@@ -1,145 +0,0 @@
-# Role-based Access Controls (RBAC)
-
-Role-based access control (RBAC) is based on Organizations, Teams and Internal User Roles
-
-- `Organizations` are the top-level entities that contain Teams.
-- `Team` - A Team is a collection of multiple `Internal Users`
-- `Internal Users` - users that can create keys, make LLM API calls, view usage on LiteLLM
-- `Roles` define the permissions of an `Internal User`
-- `Virtual Keys` - Keys are used for authentication to the LiteLLM API. Keys are tied to an `Internal User` and a `Team`
-
-## Roles
-
-**Admin Roles**
- - `proxy_admin`: admin over the platform
- - `proxy_admin_viewer`: can log in, view all keys, view all spend. **Cannot** create keys/delete keys/add new users
-
-**Organization Roles**
- - `org_admin`: admin over the organization. Can create teams and users within their organization
-
-**Internal User Roles**
- - `internal_user`: can log in, view/create/delete their own keys, view their spend. **Cannot** add new users.
- - `internal_user_viewer`: can log in, view their own keys, view their own spend. **Cannot** create/delete keys or add new users.
-
-
-## Onboarding Organizations
-
-### 1. Creating a new Organization
-
-Any user with role=`proxy_admin` can create a new organization
-
-**Usage**
-
-[**API Reference for /organization/new**](https://litellm-api.up.railway.app/#/organization%20management/new_organization_organization_new_post)
-
-```shell
-curl --location 'http://0.0.0.0:4000/organization/new' \
- --header 'Authorization: Bearer sk-1234' \
- --header 'Content-Type: application/json' \
- --data '{
- "organization_alias": "marketing_department",
- "models": ["gpt-4"],
- "max_budget": 20
- }'
-```
-
-Expected Response
-
-```json
-{
- "organization_id": "ad15e8ca-12ae-46f4-8659-d02debef1b23",
- "organization_alias": "marketing_department",
- "budget_id": "98754244-3a9c-4b31-b2e9-c63edc8fd7eb",
- "metadata": {},
- "models": [
- "gpt-4"
- ],
- "created_by": "109010464461339474872",
- "updated_by": "109010464461339474872",
- "created_at": "2024-10-08T18:30:24.637000Z",
- "updated_at": "2024-10-08T18:30:24.637000Z"
-}
-```
-
-
-### 2. Adding an `org_admin` to an Organization
-
-Create a user (ishaan@berri.ai) as an `org_admin` for the `marketing_department` Organization (from [step 1](#1-creating-a-new-organization))
-
-Users with the following roles can call `/organization/member_add`
-- `proxy_admin`
-- `org_admin` only within their own organization
-
-```shell
-curl -X POST 'http://0.0.0.0:4000/organization/member_add' \
- -H 'Authorization: Bearer sk-1234' \
- -H 'Content-Type: application/json' \
- -d '{"organization_id": "ad15e8ca-12ae-46f4-8659-d02debef1b23", "member": {"role": "org_admin", "user_id": "ishaan@berri.ai"}}'
-```
-
-Now a user with user_id = `ishaan@berri.ai` and role = `org_admin` has been created in the `marketing_department` Organization
-
-Create a Virtual Key for user_id = `ishaan@berri.ai`. The User can then use the Virtual key for their Organization Admin Operations
-
-```shell
-curl --location 'http://0.0.0.0:4000/key/generate' \
- --header 'Authorization: Bearer sk-1234' \
- --header 'Content-Type: application/json' \
- --data '{
- "user_id": "ishaan@berri.ai"
- }'
-```
-
-Expected Response
-
-```json
-{
- "models": [],
- "user_id": "ishaan@berri.ai",
- "key": "sk-7shH8TGMAofR4zQpAAo6kQ",
- "key_name": "sk-...o6kQ"
-}
-```
-
-### 3. `Organization Admin` - Create a Team
-
-The organization admin will use the virtual key created in [step 2](#2-adding-an-org_admin-to-an-organization) to create a `Team` within the `marketing_department` Organization
-
-```shell
-curl --location 'http://0.0.0.0:4000/team/new' \
- --header 'Authorization: Bearer sk-7shH8TGMAofR4zQpAAo6kQ' \
- --header 'Content-Type: application/json' \
- --data '{
- "team_alias": "engineering_team",
- "organization_id": "ad15e8ca-12ae-46f4-8659-d02debef1b23"
- }'
-```
-
-This will create the team `engineering_team` within the `marketing_department` Organization
-
-Expected Response
-
-```json
-{
- "team_alias": "engineering_team",
- "team_id": "01044ee8-441b-45f4-be7d-c70e002722d8",
- "organization_id": "ad15e8ca-12ae-46f4-8659-d02debef1b23"
-}
-```
-
-
-### 4. `Organization Admin` - Add an `Internal User`
-
-The organization admin will use the virtual key created in [step 2](#2-adding-an-org_admin-to-an-organization) to add an Internal User to the `engineering_team` Team.
-
-- We will assign role=`internal_user` so the user can create Virtual Keys for themselves
-- `team_id` is from [step 3](#3-organization-admin---create-a-team)
-
-```shell
-curl -X POST 'http://0.0.0.0:4000/team/member_add' \
- -H 'Authorization: Bearer sk-7shH8TGMAofR4zQpAAo6kQ' \
- -H 'Content-Type: application/json' \
- -d '{"team_id": "01044ee8-441b-45f4-be7d-c70e002722d8", "member": {"role": "internal_user", "user_id": "krrish@berri.ai"}}'
-
-```
-
diff --git a/docs/my-website/docs/proxy/alerting.md b/docs/my-website/docs/proxy/alerting.md
deleted file mode 100644
index a5519157c..000000000
--- a/docs/my-website/docs/proxy/alerting.md
+++ /dev/null
@@ -1,459 +0,0 @@
-import Image from '@theme/IdealImage';
-import Tabs from '@theme/Tabs';
-import TabItem from '@theme/TabItem';
-
-# Alerting / Webhooks
-
-Get alerts for:
-
-- Hanging LLM api calls
-- Slow LLM api calls
-- Failed LLM api calls
-- Budget Tracking per key/user
-- Spend Reports - Weekly & Monthly spend per Team, Tag
-- Failed db read/writes
-- Model outage alerting
-- Daily Reports:
- - **LLM** Top 5 slowest deployments
- - **LLM** Top 5 deployments with most failed requests
-- **Spend** Weekly & Monthly spend per Team, Tag
-
-
-Works across:
-- [Slack](#quick-start)
-- [Discord](#using-discord-webhooks)
-- [Microsoft Teams](#using-ms-teams-webhooks)
-
-## Quick Start
-
-Set up a Slack channel to receive alerts from the proxy.
-
-### Step 1: Add a Slack Webhook URL to env
-
-Get a Slack webhook URL from https://api.slack.com/messaging/webhooks
-
-You can also use Discord webhooks, see [here](#using-discord-webhooks)
-
-
-Set `SLACK_WEBHOOK_URL` in your proxy env to enable Slack alerts.
-
-```bash
-export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/<>/<>/<>"
-```
-
-### Step 2: Setup Proxy
-
-```yaml
-general_settings:
- alerting: ["slack"]
- alerting_threshold: 300 # sends alerts if requests hang for 5min+ and responses take 5min+
- spend_report_frequency: "1d" # [Optional] set as 1d, 2d, 30d .... Specify how often you want a Spend Report to be sent
-```
-
-Start proxy
-```bash
-$ litellm --config /path/to/config.yaml
-```
-
-
-### Step 3: Test it!
-
-
-```bash
-curl -X GET 'http://0.0.0.0:4000/health/services?service=slack' \
--H 'Authorization: Bearer sk-1234'
-```
-
-## Advanced
-
-### Redacting Messages from Alerts
-
-By default, alerts show the `messages/input` passed to the LLM. To redact this from Slack alerts, add the following to your config:
-
-
-```yaml
-general_settings:
- alerting: ["slack"]
- alert_types: ["spend_reports"]
-
-litellm_settings:
- redact_messages_in_exceptions: True
-```
-
-
-### Add Metadata to alerts
-
-Add alerting metadata to proxy calls for debugging.
-
-```python
-import openai
-client = openai.OpenAI(
- api_key="anything",
- base_url="http://0.0.0.0:4000"
-)
-
-# request sent to model set on litellm proxy, `litellm --model`
-response = client.chat.completions.create(
- model="gpt-3.5-turbo",
- messages = [],
- extra_body={
- "metadata": {
- "alerting_metadata": {
- "hello": "world"
- }
- }
- }
-)
-```
-
-**Expected Response**
-
-
-
-### Opting into specific alert types
-
-Set `alert_types` if you want to opt into only specific alert types. When `alert_types` is not set, all default alert types are enabled.
-
-👉 [**See all alert types here**](#all-possible-alert-types)
-
-```yaml
-general_settings:
- alerting: ["slack"]
- alert_types: [
- "llm_exceptions",
- "llm_too_slow",
- "llm_requests_hanging",
- "budget_alerts",
- "spend_reports",
- "db_exceptions",
- "daily_reports",
- "cooldown_deployment",
- "new_model_added",
- ]
-```
-
-### Set specific slack channels per alert type
-
-Use this if you want to set specific channels per alert type
-
-**This allows you to do the following**
-```
-llm_exceptions -> go to slack channel #llm-exceptions
-spend_reports -> go to slack channel #llm-spend-reports
-```
-
-Set `alert_to_webhook_url` on your config.yaml
-
-
-
-
-
-```yaml
-model_list:
- - model_name: gpt-4
- litellm_params:
- model: openai/fake
- api_key: fake-key
- api_base: https://exampleopenaiendpoint-production.up.railway.app/
-
-general_settings:
- master_key: sk-1234
- alerting: ["slack"]
- alerting_threshold: 0.0001 # (Seconds) set an artificially low threshold for testing alerting
- alert_to_webhook_url: {
- "llm_exceptions": "https://hooks.slack.com/services/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",
- "llm_too_slow": "https://hooks.slack.com/services/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",
- "llm_requests_hanging": "https://hooks.slack.com/services/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",
- "budget_alerts": "https://hooks.slack.com/services/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",
- "db_exceptions": "https://hooks.slack.com/services/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",
- "daily_reports": "https://hooks.slack.com/services/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",
- "spend_reports": "https://hooks.slack.com/services/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",
- "cooldown_deployment": "https://hooks.slack.com/services/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",
- "new_model_added": "https://hooks.slack.com/services/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",
- "outage_alerts": "https://hooks.slack.com/services/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",
- }
-
-litellm_settings:
- success_callback: ["langfuse"]
-```
-
-
-
-
-Provide multiple slack channels for a given alert type
-
-```yaml
-model_list:
- - model_name: gpt-4
- litellm_params:
- model: openai/fake
- api_key: fake-key
- api_base: https://exampleopenaiendpoint-production.up.railway.app/
-
-general_settings:
- master_key: sk-1234
- alerting: ["slack"]
- alerting_threshold: 0.0001 # (Seconds) set an artificially low threshold for testing alerting
- alert_to_webhook_url: {
- "llm_exceptions": ["os.environ/SLACK_WEBHOOK_URL", "os.environ/SLACK_WEBHOOK_URL_2"],
- "llm_too_slow": ["https://webhook.site/7843a980-a494-4967-80fb-d502dbc16886", "https://webhook.site/28cfb179-f4fb-4408-8129-729ff55cf213"],
- "llm_requests_hanging": ["os.environ/SLACK_WEBHOOK_URL_5", "os.environ/SLACK_WEBHOOK_URL_6"],
- "budget_alerts": ["os.environ/SLACK_WEBHOOK_URL_7", "os.environ/SLACK_WEBHOOK_URL_8"],
- "db_exceptions": ["os.environ/SLACK_WEBHOOK_URL_9", "os.environ/SLACK_WEBHOOK_URL_10"],
- "daily_reports": ["os.environ/SLACK_WEBHOOK_URL_11", "os.environ/SLACK_WEBHOOK_URL_12"],
- "spend_reports": ["os.environ/SLACK_WEBHOOK_URL_13", "os.environ/SLACK_WEBHOOK_URL_14"],
- "cooldown_deployment": ["os.environ/SLACK_WEBHOOK_URL_15", "os.environ/SLACK_WEBHOOK_URL_16"],
- "new_model_added": ["os.environ/SLACK_WEBHOOK_URL_17", "os.environ/SLACK_WEBHOOK_URL_18"],
- "outage_alerts": ["os.environ/SLACK_WEBHOOK_URL_19", "os.environ/SLACK_WEBHOOK_URL_20"],
- }
-
-litellm_settings:
- success_callback: ["langfuse"]
-```
-
-
-
-
-
-Test it: send a valid LLM request and expect to see an `llm_too_slow` alert in its own Slack channel.
-
-```shell
-curl -i http://localhost:4000/v1/chat/completions \
- -H "Content-Type: application/json" \
- -H "Authorization: Bearer sk-1234" \
- -d '{
- "model": "gpt-4",
- "messages": [
- {"role": "user", "content": "Hello, Claude gm!"}
- ]
-}'
-```
-
-
-### Using MS Teams Webhooks
-
-MS Teams provides a Slack-compatible webhook URL that you can use for alerting.
-
-##### Quick Start
-
-1. [Get a webhook url](https://learn.microsoft.com/en-us/microsoftteams/platform/webhooks-and-connectors/how-to/add-incoming-webhook?tabs=newteams%2Cdotnet#create-an-incoming-webhook) for your Microsoft Teams channel
-
-2. Add it to your .env
-
-```bash
-SLACK_WEBHOOK_URL="https://berriai.webhook.office.com/webhookb2/...6901/IncomingWebhook/b55fa0c2a48647be8e6effedcd540266/e04b1092-4a3e-44a2-ab6b-29a0a4854d1d"
-```
-
-3. Add it to your litellm config
-
-```yaml
-model_list:
-  - model_name: "azure-model"
-    litellm_params:
-      model: "azure/gpt-35-turbo"
-      api_key: "my-bad-key" # 👈 bad key
-
-general_settings:
-  alerting: ["slack"]
-  alerting_threshold: 300 # sends alerts if requests hang for 5min+ and responses take 5min+
-```
-
-4. Run health check!
-
-Call the proxy `/health/services` endpoint to test whether your alerting connection is set up correctly.
-
-```bash
-curl --location 'http://0.0.0.0:4000/health/services?service=slack' \
---header 'Authorization: Bearer sk-1234'
-```
-
-
-**Expected Response**
-
-
-
-### Using Discord Webhooks
-
-Discord provides a Slack-compatible webhook URL that you can use for alerting.
-
-##### Quick Start
-
-1. Get a webhook URL for your Discord channel
-
-2. Append `/slack` to your Discord webhook URL. It should look like this:
-
-```
-"https://discord.com/api/webhooks/1240030362193760286/cTLWt5ATn1gKmcy_982rl5xmYHsrM1IWJdmCL1AyOmU9JdQXazrp8L1_PYgUtgxj8x4f/slack"
-```
-
-3. Add it to your litellm config
-
-```yaml
-model_list:
-  - model_name: "azure-model"
-    litellm_params:
-      model: "azure/gpt-35-turbo"
-      api_key: "my-bad-key" # 👈 bad key
-
-general_settings:
-  alerting: ["slack"]
-  alerting_threshold: 300 # sends alerts if requests hang for 5min+ and responses take 5min+
-
-environment_variables:
-  SLACK_WEBHOOK_URL: "https://discord.com/api/webhooks/1240030362193760286/cTLWt5ATn1gKmcy_982rl5xmYHsrM1IWJdmCL1AyOmU9JdQXazrp8L1_PYgUtgxj8x4f/slack"
-```
-
-
-## [BETA] Webhooks for Budget Alerts
-
-**Note**: This is a beta feature, so the spec might change.
-
-Set a webhook to get notified for budget alerts.
-
-1. Set up config.yaml
-
-Add the URL to your environment. For testing, you can use a link from [webhook.site](https://webhook.site/).
-
-```bash
-export WEBHOOK_URL="https://webhook.site/6ab090e8-c55f-4a23-b075-3209f5c57906"
-```
-
-Add 'webhook' to config.yaml
-```yaml
-general_settings:
- alerting: ["webhook"] # 👈 KEY CHANGE
-```
-
-2. Start proxy
-
-```bash
-litellm --config /path/to/config.yaml
-
-# RUNNING on http://0.0.0.0:4000
-```
-
-3. Test it!
-
-```bash
-curl -X GET --location 'http://0.0.0.0:4000/health/services?service=webhook' \
---header 'Authorization: Bearer sk-1234'
-```
-
-**Expected Response**
-
-```bash
-{
- "spend": 1, # the spend for the 'event_group'
- "max_budget": 0, # the 'max_budget' set for the 'event_group'
- "token": "88dc28d0f030c55ed4ab77ed8faf098196cb1c05df778539800c9f1243fe6b4b",
- "user_id": "default_user_id",
- "team_id": null,
- "user_email": null,
- "key_alias": null,
- "projected_exceeded_date": null,
- "projected_spend": null,
- "event": "budget_crossed", # Literal["budget_crossed", "threshold_crossed", "projected_limit_exceeded"]
- "event_group": "user",
- "event_message": "User Budget: Budget Crossed"
-}
-```
-
-### API Spec for Webhook Event
-
-- `spend` *float*: The current spend amount for the 'event_group'.
-- `max_budget` *float or null*: The maximum allowed budget for the 'event_group'. null if not set.
-- `token` *str*: A hashed value of the key, used for authentication or identification purposes.
-- `customer_id` *str or null*: The ID of the customer associated with the event (optional).
-- `internal_user_id` *str or null*: The ID of the internal user associated with the event (optional).
-- `team_id` *str or null*: The ID of the team associated with the event (optional).
-- `user_email` *str or null*: The email of the internal user associated with the event (optional).
-- `key_alias` *str or null*: An alias for the key associated with the event (optional).
-- `projected_exceeded_date` *str or null*: The date when the budget is projected to be exceeded, returned when 'soft_budget' is set for key (optional).
-- `projected_spend` *float or null*: The projected spend amount, returned when 'soft_budget' is set for key (optional).
-- `event` *Literal["spend_tracked", "budget_crossed", "threshold_crossed", "projected_limit_exceeded"]*: The type of event that triggered the webhook. Possible values are:
-    * "spend_tracked": Emitted whenever spend is tracked for a customer id.
-    * "budget_crossed": Indicates that the spend has exceeded the max budget.
-    * "threshold_crossed": Indicates that spend has crossed a threshold (currently sent when 85% and 95% of budget is reached).
-    * "projected_limit_exceeded": For "key" only - Indicates that the projected spend is expected to exceed the soft budget threshold.
-- `event_group` *Literal["customer", "internal_user", "key", "team", "proxy"]*: The group associated with the event. Possible values are:
-    * "customer": The event is related to a specific customer.
-    * "internal_user": The event is related to a specific internal user.
-    * "key": The event is related to a specific key.
-    * "team": The event is related to a team.
-    * "proxy": The event is related to a proxy.
-
-- `event_message` *str*: A human-readable description of the event.
-
-## Region-outage alerting (✨ Enterprise feature)
-
-:::info
-[Get a free 2-week license](https://forms.gle/P518LXsAZ7PhXpDn8)
-:::
-
-Set up alerts for when a provider region is having an outage.
-
-```yaml
-general_settings:
- alerting: ["slack"]
- alert_types: ["region_outage_alerts"]
-```
-
-By default, this triggers if multiple models in a region fail 5+ requests in 1 minute. Errors with a `400` status code (i.e. BadRequestErrors) are not counted.
-
-Control thresholds with:
-
-```yaml
-general_settings:
-  alerting: ["slack"]
-  alert_types: ["region_outage_alerts"]
-  alerting_args:
-    region_outage_alert_ttl: 60 # time-window in seconds
-    minor_outage_alert_threshold: 5 # number of errors to trigger a minor alert
-    major_outage_alert_threshold: 10 # number of errors to trigger a major alert
-```
-
-## **All Possible Alert Types**
-
-👉 [**Here is how you can set specific alert types**](#opting-into-specific-alert-types)
-
-LLM-related Alerts
-
-| Alert Type | Description | Default On |
-|------------|-------------|---------|
-| `llm_exceptions` | Alerts for LLM API exceptions | ✅ |
-| `llm_too_slow` | Notifications for LLM responses slower than the set threshold | ✅ |
-| `llm_requests_hanging` | Alerts for LLM requests that are not completing | ✅ |
-| `cooldown_deployment` | Alerts when a deployment is put into cooldown | ✅ |
-| `new_model_added` | Notifications when a new model is added to litellm proxy through /model/new| ✅ |
-| `outage_alerts` | Alerts when a specific LLM deployment is facing an outage | ✅ |
-| `region_outage_alerts` | Alerts when a specific LLM region is facing an outage, e.g. us-east-1 | ✅ |
-
-Budget and Spend Alerts
-
-| Alert Type | Description | Default On|
-|------------|-------------|---------|
-| `budget_alerts` | Notifications related to budget limits or thresholds | ✅ |
-| `spend_reports` | Periodic reports on spending across teams or tags | ✅ |
-| `failed_tracking_spend` | Alerts when spend tracking fails | ✅ |
-| `daily_reports` | Daily Spend reports | ✅ |
-| `fallback_reports` | Weekly Reports on LLM fallback occurrences | ✅ |
-
-Database Alerts
-
-| Alert Type | Description | Default On |
-|------------|-------------|---------|
-| `db_exceptions` | Notifications for database-related exceptions | ✅ |
-
-Management Endpoint Alerts - Virtual Key, Team, Internal User
-
-| Alert Type | Description | Default On |
-|------------|-------------|---------|
-| `new_virtual_key_created` | Notifications when a new virtual key is created | ❌ |
-| `virtual_key_updated` | Alerts when a virtual key is modified | ❌ |
-| `virtual_key_deleted` | Notifications when a virtual key is removed | ❌ |
-| `new_team_created` | Alerts for the creation of a new team | ❌ |
-| `team_updated` | Notifications when team details are modified | ❌ |
-| `team_deleted` | Alerts when a team is deleted | ❌ |
-| `new_internal_user_created` | Notifications for new internal user accounts | ❌ |
-| `internal_user_updated` | Alerts when an internal user's details are changed | ❌ |
-| `internal_user_deleted` | Notifications when an internal user account is removed | ❌ |
\ No newline at end of file
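The alert-type tables above can be read as a lookup of default-on flags. The Python sketch below copies those flags into a dictionary and computes which alerts would fire. The helper name `effective_alert_types`, and the rule that an explicit `alert_types` list replaces the defaults, are assumptions made for illustration and are not taken from LiteLLM's implementation.

```python
# Default-on flags copied from the alert-type tables in the alerting doc.
# The helper below is a hypothetical illustration, not LiteLLM's own code.
DEFAULT_ALERT_TYPES = {
    # LLM-related alerts
    "llm_exceptions": True,
    "llm_too_slow": True,
    "llm_requests_hanging": True,
    "cooldown_deployment": True,
    "new_model_added": True,
    "outage_alerts": True,
    "region_outage_alerts": True,
    # Budget and spend alerts
    "budget_alerts": True,
    "spend_reports": True,
    "failed_tracking_spend": True,
    "daily_reports": True,
    "fallback_reports": True,
    # Database alerts
    "db_exceptions": True,
    # Management endpoint alerts (off by default)
    "new_virtual_key_created": False,
    "virtual_key_updated": False,
    "virtual_key_deleted": False,
    "new_team_created": False,
    "team_updated": False,
    "team_deleted": False,
    "new_internal_user_created": False,
    "internal_user_updated": False,
    "internal_user_deleted": False,
}


def effective_alert_types(alert_types=None):
    """Return the sorted list of alert types that would be active.

    Assumption: when `alert_types` is given, it is an explicit opt-in list
    that replaces the defaults instead of extending them.
    """
    if alert_types is None:
        return sorted(name for name, on in DEFAULT_ALERT_TYPES.items() if on)
    unknown = set(alert_types) - set(DEFAULT_ALERT_TYPES)
    if unknown:
        raise ValueError(f"unknown alert types: {sorted(unknown)}")
    return sorted(alert_types)
```

Calling `effective_alert_types()` with no argument returns the default-on set, while `effective_alert_types(["new_team_created"])` opts into a management alert that is off by default.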
diff --git a/docs/my-website/docs/proxy/architecture.md b/docs/my-website/docs/proxy/architecture.md
deleted file mode 100644
index eb4f1ec8d..000000000
--- a/docs/my-website/docs/proxy/architecture.md
+++ /dev/null
@@ -1,39 +0,0 @@
-import Image from '@theme/IdealImage';
-import Tabs from '@theme/Tabs';
-import TabItem from '@theme/TabItem';
-
-# Life of a Request
-
-## High Level architecture
-
-
-
-
-### Request Flow
-
-1. **User Sends Request**: The process begins when a user sends a request to the LiteLLM Proxy Server (Gateway).
-
-2. [**Virtual Keys**](../virtual_keys): At this stage the `Bearer` token in the request is checked to ensure it is valid and under its budget. [Here is the list of checks that run for each request](https://github.com/BerriAI/litellm/blob/ba41a72f92a9abf1d659a87ec880e8e319f87481/litellm/proxy/auth/auth_checks.py#L43)
- - 2.1 Check if the Virtual Key exists in Redis Cache or In Memory Cache
- - 2.2 **If not in Cache**, Lookup Virtual Key in DB
-
-3. **Rate Limiting**: The [MaxParallelRequestsHandler](https://github.com/BerriAI/litellm/blob/main/litellm/proxy/hooks/parallel_request_limiter.py) checks the **rate limit (rpm/tpm)** for the following components:
- - Global Server Rate Limit
- - Virtual Key Rate Limit
- - User Rate Limit
- - Team Limit
-
-4. **LiteLLM `proxy_server.py`**: Contains the `/chat/completions` and `/embeddings` endpoints. Requests to these endpoints are sent through the LiteLLM Router
-
-5. [**LiteLLM Router**](../routing): The LiteLLM Router handles Load balancing, Fallbacks, Retries for LLM API deployments.
-
-6. [**litellm.completion() / litellm.embedding()**:](../index#litellm-python-sdk) The litellm Python SDK is used to call the LLM in the OpenAI API format (Translation and parameter mapping)
-
-7. **Post-Request Processing**: After the response is sent back to the client, the following **asynchronous** tasks are performed:
- - [Logging to LangFuse (logging destination is configurable)](./logging)
- - The [MaxParallelRequestsHandler](https://github.com/BerriAI/litellm/blob/main/litellm/proxy/hooks/parallel_request_limiter.py) updates the rpm/tpm usage for the
- - Global Server Rate Limit
- - Virtual Key Rate Limit
- - User Rate Limit
- - Team Limit
- - The `_PROXY_track_cost_callback` updates spend / usage in the LiteLLM database. [Here is everything tracked in the DB per request](https://github.com/BerriAI/litellm/blob/ba41a72f92a9abf1d659a87ec880e8e319f87481/schema.prisma#L172)
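Step 2 of the request flow above (check the cache, fall back to the DB, then confirm the key is under budget) can be sketched in Python as follows. The `KeyLookup` class, the `authorize` helper, and the `spend` / `max_budget` record fields are hypothetical stand-ins; plain dictionaries play the role of the cache and the database, and the real proxy runs more checks than this.

```python
from typing import Optional


class KeyLookup:
    """Cache-first lookup of virtual key records (illustrative only)."""

    def __init__(self, db: dict):
        self.cache = {}    # stands in for the Redis / in-memory cache
        self.db = db       # stands in for the proxy database
        self.db_reads = 0  # counts how often the DB fallback was used

    def get(self, token: str) -> Optional[dict]:
        # 2.1 Check the cache first
        if token in self.cache:
            return self.cache[token]
        # 2.2 If not in cache, look the key up in the DB
        self.db_reads += 1
        record = self.db.get(token)
        if record is not None:
            self.cache[token] = record
        return record


def authorize(lookup: KeyLookup, token: str) -> bool:
    """Return True when the key exists and is under its budget."""
    record = lookup.get(token)
    if record is None:
        return False  # unknown key
    max_budget = record.get("max_budget")
    # A key without a max budget is treated as unlimited (an assumption)
    return max_budget is None or record["spend"] < max_budget
```

The second call for the same token is served from the cache, which is why `db_reads` stays at 1 after repeated requests with one key.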
diff --git a/docs/my-website/docs/proxy/billing.md b/docs/my-website/docs/proxy/billing.md
deleted file mode 100644
index 902801cd0..000000000
--- a/docs/my-website/docs/proxy/billing.md
+++ /dev/null
@@ -1,319 +0,0 @@
-import Image from '@theme/IdealImage';
-import Tabs from '@theme/Tabs';
-import TabItem from '@theme/TabItem';
-
-# Billing
-
-Bill internal teams and external customers for their usage.
-
-**🚨 Requirements**
-- [Set up Lago](https://docs.getlago.com/guide/self-hosted/docker#run-the-app) for usage-based billing. We recommend following [their Stripe tutorial](https://docs.getlago.com/templates/per-transaction/stripe#step-1-create-billable-metrics-for-transaction)
-
-Steps:
-- Connect the proxy to Lago
-- Set the id you want to bill for (customers, internal users, teams)
-- Start!
-
-## Quick Start
-
-Bill internal teams for their usage
-
-### 1. Connect proxy to Lago
-
-Set 'lago' as a callback on your proxy config.yaml
-
-```yaml
-model_list:
-  - model_name: fake-openai-endpoint
-    litellm_params:
-      model: openai/fake
-      api_key: fake-key
-      api_base: https://exampleopenaiendpoint-production.up.railway.app/
-
-litellm_settings:
-  callbacks: ["lago"] # 👈 KEY CHANGE
-
-general_settings:
-  master_key: sk-1234
-```
-
-Add your Lago keys to the environment
-
-```bash
-export LAGO_API_BASE="http://localhost:3000" # self-host - https://docs.getlago.com/guide/self-hosted/docker#run-the-app
-export LAGO_API_KEY="3e29d607-de54-49aa-a019-ecf585729070" # Get key - https://docs.getlago.com/guide/self-hosted/docker#find-your-api-key
-export LAGO_API_EVENT_CODE="openai_tokens" # name of lago billing code
-export LAGO_API_CHARGE_BY="team_id" # 👈 Charges 'team_id' attached to proxy key
-```
-
-Start proxy
-
-```bash
-litellm --config /path/to/config.yaml
-```
-
-### 2. Create Key for Internal Team
-
-```bash
-curl 'http://0.0.0.0:4000/key/generate' \
---header 'Authorization: Bearer sk-1234' \
---header 'Content-Type: application/json' \
---data-raw '{"team_id": "my-unique-id"}' # 👈 Internal Team's ID
-```
-
-Response Object:
-
-```bash
-{
-  "key": "sk-tXL0wt5-lOOVK9sfY2UacA"
-}
-```
-
-
-### 3. Start billing!
-
-
-
-
-```bash
-# Authenticate with the team's key generated in step 2 (👈 Team's Key)
-curl --location 'http://0.0.0.0:4000/chat/completions' \
---header 'Content-Type: application/json' \
---header 'Authorization: Bearer sk-tXL0wt5-lOOVK9sfY2UacA' \
---data '{
-  "model": "fake-openai-endpoint",
-  "messages": [
-    {
-      "role": "user",
-      "content": "what llm are you"
-    }
-  ]
-}'
-```
-
-
-
-```python
-import openai
-client = openai.OpenAI(
- api_key="sk-tXL0wt5-lOOVK9sfY2UacA", # 👈 Team's Key
- base_url="http://0.0.0.0:4000"
-)
-
-# request sent to model set on litellm proxy, `litellm --model`
-response = client.chat.completions.create(model="gpt-3.5-turbo", messages = [
- {
- "role": "user",
- "content": "this is a test request, write a short poem"
- }
-])
-
-print(response)
-```
-
-
-
-```python
-from langchain.chat_models import ChatOpenAI
-from langchain.prompts.chat import (
- ChatPromptTemplate,
- HumanMessagePromptTemplate,
- SystemMessagePromptTemplate,
-)
-from langchain.schema import HumanMessage, SystemMessage
-import os
-
-os.environ["OPENAI_API_KEY"] = "sk-tXL0wt5-lOOVK9sfY2UacA" # 👈 Team's Key
-
-chat = ChatOpenAI(
- openai_api_base="http://0.0.0.0:4000",
- model = "gpt-3.5-turbo",
- temperature=0.1,
-)
-
-messages = [
- SystemMessage(
- content="You are a helpful assistant that im using to make a test request to."
- ),
- HumanMessage(
- content="test from litellm. tell me why it's amazing in 1 sentence"
- ),
-]
-response = chat(messages)
-
-print(response)
-```
-
-
-
-**See Results on Lago**
-
-
-
-
-## Advanced - Lago Logging object
-
-This is what LiteLLM will log to Lago
-
-```
-{
- "event": {
- "transaction_id": "",
- "external_customer_id": , # either 'end_user_id', 'user_id', or 'team_id'. Default 'end_user_id'.
- "code": os.getenv("LAGO_API_EVENT_CODE"),
- "properties": {
- "input_tokens": ,
- "output_tokens": ,
- "model": ,
- "response_cost": , # 👈 LITELLM CALCULATED RESPONSE COST - https://github.com/BerriAI/litellm/blob/d43f75150a65f91f60dc2c0c9462ce3ffc713c1f/litellm/utils.py#L1473
- }
- }
-}
-```
-
-## Advanced - Bill Customers, Internal Users
-
-For:
-- Customers (id passed via 'user' param in /chat/completions call) = 'end_user_id'
-- Internal Users (id set when [creating keys](https://docs.litellm.ai/docs/proxy/virtual_keys#advanced---spend-tracking)) = 'user_id'
-- Teams (id set when [creating keys](https://docs.litellm.ai/docs/proxy/virtual_keys#advanced---spend-tracking)) = 'team_id'
-
-
-
-
-
-
-1. Set 'LAGO_API_CHARGE_BY' to 'end_user_id'
-
- ```bash
- export LAGO_API_CHARGE_BY="end_user_id"
- ```
-
-2. Test it!
-
-
-
-
- ```shell
- # "user" is whatever your customer id is
- curl --location 'http://0.0.0.0:4000/chat/completions' \
- --header 'Content-Type: application/json' \
- --data '{
-   "model": "gpt-3.5-turbo",
-   "messages": [
-     {
-       "role": "user",
-       "content": "what llm are you"
-     }
-   ],
-   "user": "my_customer_id"
- }'
- ```
-
-
-
- ```python
- import openai
- client = openai.OpenAI(
- api_key="anything",
- base_url="http://0.0.0.0:4000"
- )
-
- # request sent to model set on litellm proxy, `litellm --model`
- response = client.chat.completions.create(model="gpt-3.5-turbo", messages = [
- {
- "role": "user",
- "content": "this is a test request, write a short poem"
- }
- ], user="my_customer_id") # 👈 whatever your customer id is
-
- print(response)
- ```
-
-
-
-
- ```python
- from langchain.chat_models import ChatOpenAI
- from langchain.prompts.chat import (
- ChatPromptTemplate,
- HumanMessagePromptTemplate,
- SystemMessagePromptTemplate,
- )
- from langchain.schema import HumanMessage, SystemMessage
- import os
-
- os.environ["OPENAI_API_KEY"] = "anything"
-
- chat = ChatOpenAI(
- openai_api_base="http://0.0.0.0:4000",
- model = "gpt-3.5-turbo",
- temperature=0.1,
- extra_body={
- "user": "my_customer_id" # 👈 whatever your customer id is
- }
- )
-
- messages = [
- SystemMessage(
- content="You are a helpful assistant that im using to make a test request to."
- ),
- HumanMessage(
- content="test from litellm. tell me why it's amazing in 1 sentence"
- ),
- ]
- response = chat(messages)
-
- print(response)
- ```
-
-
-
-
-
-
-
-1. Set 'LAGO_API_CHARGE_BY' to 'user_id'
-
-```bash
-export LAGO_API_CHARGE_BY="user_id"
-```
-
-2. Create a key for that user
-
-```bash
-curl 'http://0.0.0.0:4000/key/generate' \
---header 'Authorization: Bearer ' \
---header 'Content-Type: application/json' \
---data-raw '{"user_id": "my-unique-id"}' # 👈 Internal User's id
-```
-
-Response Object:
-
-```bash
-{
-  "key": "sk-tXL0wt5-lOOVK9sfY2UacA"
-}
-```
-
-3. Make API Calls with that Key
-
-```python
-import openai
-client = openai.OpenAI(
- api_key="sk-tXL0wt5-lOOVK9sfY2UacA", # 👈 Generated key
- base_url="http://0.0.0.0:4000"
-)
-
-# request sent to model set on litellm proxy, `litellm --model`
-response = client.chat.completions.create(model="gpt-3.5-turbo", messages = [
- {
- "role": "user",
- "content": "this is a test request, write a short poem"
- }
-])
-
-print(response)
-```
-
-
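The Lago logging object and the `LAGO_API_CHARGE_BY` setting described in the billing doc above fit together roughly as in the Python sketch below. The function name `build_lago_event`, the shape of the `call` dictionary, and the use of a random UUID for `transaction_id` are assumptions for illustration; only the payload field names and the three charge-by values come from the doc.

```python
import os
import uuid

# The three ids the billing doc allows charging by
VALID_CHARGE_BY = ("end_user_id", "user_id", "team_id")


def build_lago_event(call: dict, charge_by: str = None) -> dict:
    """Build a Lago-style event payload for one LLM call (illustrative)."""
    # Default 'end_user_id' mirrors the comment in the logging object
    charge_by = charge_by or os.getenv("LAGO_API_CHARGE_BY", "end_user_id")
    if charge_by not in VALID_CHARGE_BY:
        raise ValueError(f"charge_by must be one of {VALID_CHARGE_BY}")

    external_customer_id = call.get(charge_by)
    if external_customer_id is None:
        raise ValueError(f"call has no '{charge_by}' to bill")

    return {
        "event": {
            # Assumption: any unique id works as the transaction id
            "transaction_id": str(uuid.uuid4()),
            "external_customer_id": external_customer_id,
            "code": os.getenv("LAGO_API_EVENT_CODE"),
            "properties": {
                "input_tokens": call["input_tokens"],
                "output_tokens": call["output_tokens"],
                "model": call["model"],
                "response_cost": call["response_cost"],
            },
        }
    }
```

With `charge_by="team_id"`, a call made with the key from the Quick Start would be billed to `my-unique-id`; switching to `end_user_id` bills the value passed in the request's `user` parameter instead.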
diff --git a/docs/my-website/docs/proxy/bucket.md b/docs/my-website/docs/proxy/bucket.md
deleted file mode 100644
index d1b9e6076..000000000
--- a/docs/my-website/docs/proxy/bucket.md
+++ /dev/null
@@ -1,154 +0,0 @@
-
-import Image from '@theme/IdealImage';
-import Tabs from '@theme/Tabs';
-import TabItem from '@theme/TabItem';
-
-# Logging GCS, s3 Buckets
-
-LiteLLM Supports Logging to the following Cloud Buckets
-- (Enterprise) ✨ [Google Cloud Storage Buckets](#logging-proxy-inputoutput-to-google-cloud-storage-buckets)
-- (Free OSS) [Amazon s3 Buckets](#logging-proxy-inputoutput---s3-buckets)
-
-## Google Cloud Storage Buckets
-
-Log LLM Logs to [Google Cloud Storage Buckets](https://cloud.google.com/storage?hl=en)
-
-:::info
-
-✨ This is an Enterprise only feature [Get Started with Enterprise here](https://calendly.com/d/4mp-gd3-k5k/litellm-1-1-onboarding-chat)
-
-:::
-
-
-| Property | Details |
-|----------|---------|
-| Description | Log LLM Input/Output to cloud storage buckets |
-| Load Test Benchmarks | [Benchmarks](https://docs.litellm.ai/docs/benchmarks) |
-| Google Docs on Cloud Storage | [Google Cloud Storage](https://cloud.google.com/storage?hl=en) |
-
-
-
-### Usage
-
-1. Add `gcs_bucket` to LiteLLM Config.yaml
-```yaml
-model_list:
-- litellm_params:
-    api_base: https://openai-function-calling-workers.tasslexyz.workers.dev/
-    api_key: my-fake-key
-    model: openai/my-fake-model
-  model_name: fake-openai-endpoint
-
-litellm_settings:
-  callbacks: ["gcs_bucket"] # 👈 KEY CHANGE
-```
-
-2. Set required env variables
-
-```shell
-GCS_BUCKET_NAME=""
-GCS_PATH_SERVICE_ACCOUNT="/Users/ishaanjaffer/Downloads/adroit-crow-413218-a956eef1a2a8.json" # Add path to service account.json
-```
-
-3. Start Proxy
-
-```
-litellm --config /path/to/config.yaml
-```
-
-4. Test it!
-
-```bash
-curl --location 'http://0.0.0.0:4000/chat/completions' \
---header 'Content-Type: application/json' \
---data '{
-  "model": "fake-openai-endpoint",
-  "messages": [
-    {
-      "role": "user",
-      "content": "what llm are you"
-    }
-  ]
-}'
-```
-
-
-### Expected Logs on GCS Buckets
-
-
-
-### Fields Logged on GCS Buckets
-
-[**The standard logging object is logged on GCS Bucket**](../proxy/logging)
-
-
-### Getting `service_account.json` from Google Cloud Console
-
-1. Go to [Google Cloud Console](https://console.cloud.google.com/)
-2. Search for IAM & Admin
-3. Click on Service Accounts
-4. Select a Service Account
-5. Click on 'Keys' -> Add Key -> Create New Key -> JSON
-6. Save the JSON file and add the path to `GCS_PATH_SERVICE_ACCOUNT`
-
-
-## s3 Buckets
-
-We will use the `--config` to set
-
-- `litellm.success_callback = ["s3"]`
-
-This logs all successful LLM calls to your s3 bucket.
-
-**Step 1** Set AWS Credentials in .env
-
-```shell
-AWS_ACCESS_KEY_ID = ""
-AWS_SECRET_ACCESS_KEY = ""
-AWS_REGION_NAME = ""
-```
-
-**Step 2**: Create a `config.yaml` file and set `litellm_settings`: `success_callback`
-
-```yaml
-model_list:
-  - model_name: gpt-3.5-turbo
-    litellm_params:
-      model: gpt-3.5-turbo
-litellm_settings:
-  success_callback: ["s3"]
-  s3_callback_params:
-    s3_bucket_name: logs-bucket-litellm # AWS Bucket Name for S3
-    s3_region_name: us-west-2 # AWS Region Name for S3
-    s3_aws_access_key_id: os.environ/AWS_ACCESS_KEY_ID # use os.environ/ to pass environment variables. This is AWS Access Key ID for S3
-    s3_aws_secret_access_key: os.environ/AWS_SECRET_ACCESS_KEY # AWS Secret Access Key for S3
-    s3_path: my-test-path # [OPTIONAL] set path in bucket you want to write logs to
-    s3_endpoint_url: https://s3.amazonaws.com # [OPTIONAL] S3 endpoint URL, if you want to use Backblaze/cloudflare s3 buckets
-```
-
-**Step 3**: Start the proxy, make a test request
-
-Start proxy
-
-```shell
-litellm --config config.yaml --debug
-```
-
-Test Request
-
-```shell
-curl --location 'http://0.0.0.0:4000/chat/completions' \
- --header 'Content-Type: application/json' \
- --data ' {
- "model": "gpt-3.5-turbo",
- "messages": [
- {
- "role": "user",
- "content": "what llm are you"
- }
- ]
- }'
-```
-
-Your logs should be available on the specified s3 Bucket
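The `os.environ/` prefix used in `s3_callback_params` above tells the proxy to read a value from the environment. A minimal Python sketch of that substitution is shown below; the function name `resolve_env_refs` and the choice to raise `KeyError` for an unset variable are assumptions, and the proxy's own resolution logic may differ.

```python
import os

ENV_PREFIX = "os.environ/"


def resolve_env_refs(params: dict) -> dict:
    """Replace 'os.environ/NAME' string values with the value of $NAME."""
    resolved = {}
    for key, value in params.items():
        if isinstance(value, str) and value.startswith(ENV_PREFIX):
            env_name = value[len(ENV_PREFIX):]
            # Raises KeyError when the variable is not set (an assumption:
            # failing loudly is safer than silently logging without creds)
            resolved[key] = os.environ[env_name]
        else:
            resolved[key] = value  # literal values pass through unchanged
    return resolved
```

For example, `{"s3_aws_access_key_id": "os.environ/AWS_ACCESS_KEY_ID"}` resolves to the value of `AWS_ACCESS_KEY_ID`, while a literal such as `s3_region_name: us-west-2` is left as written.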
diff --git a/docs/my-website/docs/proxy/caching.md b/docs/my-website/docs/proxy/caching.md
deleted file mode 100644
index 3f5342c7e..000000000
--- a/docs/my-website/docs/proxy/caching.md
+++ /dev/null
@@ -1,945 +0,0 @@
-import Tabs from '@theme/Tabs';
-import TabItem from '@theme/TabItem';
-
-# Caching
-Cache LLM Responses
-
-:::note
-
-For OpenAI/Anthropic Prompt Caching, go [here](../completion/prompt_caching.md)
-
-:::
-
-LiteLLM supports:
-- In Memory Cache
-- Redis Cache
-- Qdrant Semantic Cache
-- Redis Semantic Cache
-- s3 Bucket Cache
-
-## Quick Start - Redis, s3 Cache, Semantic Cache
-
-
-
-
-Caching can be enabled by adding the `cache` key in the `config.yaml`
-
-#### Step 1: Add `cache` to the config.yaml
-```yaml
-model_list:
-  - model_name: gpt-3.5-turbo
-    litellm_params:
-      model: gpt-3.5-turbo
-  - model_name: text-embedding-ada-002
-    litellm_params:
-      model: text-embedding-ada-002
-
-litellm_settings:
-  set_verbose: True
-  cache: True # set cache responses to True, litellm defaults to using a redis cache
-```
-
-#### [OPTIONAL] Step 1.5: Add redis namespaces, default ttl
-
-#### Namespace
-To group your cache keys under a common prefix (like a folder), set a namespace:
-
-```yaml
-litellm_settings:
-  cache: true
-  cache_params: # set cache params for redis
-    type: redis
-    namespace: "litellm.caching.caching"
-```
-
-and keys will be stored like:
-
-```
-litellm.caching.caching:
-```
-
-#### Redis Cluster
-
-
-
-
-
-```yaml
-model_list:
-  - model_name: "*"
-    litellm_params:
-      model: "*"
-
-
-litellm_settings:
-  cache: True
-  cache_params:
-    type: redis
-    redis_startup_nodes: [{"host": "127.0.0.1", "port": "7001"}]
-```
-
-
-
-
-
-You can configure a Redis cluster by setting `REDIS_CLUSTER_NODES` in your .env
-
-**Example `REDIS_CLUSTER_NODES`** value
-
-```
-REDIS_CLUSTER_NODES='[{"host": "127.0.0.1", "port": "7001"}, {"host": "127.0.0.1", "port": "7003"}, {"host": "127.0.0.1", "port": "7004"}, {"host": "127.0.0.1", "port": "7005"}, {"host": "127.0.0.1", "port": "7006"}, {"host": "127.0.0.1", "port": "7007"}]'
-```
-
-:::note
-
-Example python script for setting redis cluster nodes in .env:
-
-```python
-import json
-import os
-
-# List of startup nodes
-startup_nodes = [
- {"host": "127.0.0.1", "port": "7001"},
- {"host": "127.0.0.1", "port": "7003"},
- {"host": "127.0.0.1", "port": "7004"},
- {"host": "127.0.0.1", "port": "7005"},
- {"host": "127.0.0.1", "port": "7006"},
- {"host": "127.0.0.1", "port": "7007"},
-]
-
-# set startup nodes in environment variables
-os.environ["REDIS_CLUSTER_NODES"] = json.dumps(startup_nodes)
-print("REDIS_CLUSTER_NODES", os.environ["REDIS_CLUSTER_NODES"])
-```
-
-:::
-
-
-
-
-
-#### Redis Sentinel
-
-
-
-
-
-
-```yaml
-model_list:
-  - model_name: "*"
-    litellm_params:
-      model: "*"
-
-
-litellm_settings:
-  cache: true
-  cache_params:
-    type: "redis"
-    service_name: "mymaster"
-    sentinel_nodes: [["localhost", 26379]]
-    sentinel_password: "password" # [OPTIONAL]
-```
-
-
-
-
-
-You can configure Redis Sentinel by setting `REDIS_SENTINEL_NODES` in your .env
-
-**Example `REDIS_SENTINEL_NODES`** value
-
-```env
-REDIS_SENTINEL_NODES='[["localhost", 26379]]'
-REDIS_SERVICE_NAME = "mymaster"
-REDIS_SENTINEL_PASSWORD = "password"
-```
-
-:::note
-
-Example python script for setting redis sentinel nodes in .env:
-
-```python
-import json
-import os
-
-# List of sentinel nodes
-sentinel_nodes = [["localhost", 26379]]
-
-# set startup nodes in environment variables
-os.environ["REDIS_SENTINEL_NODES"] = json.dumps(sentinel_nodes)
-print("REDIS_SENTINEL_NODES", os.environ["REDIS_SENTINEL_NODES"])
-```
-
-:::
-
-
-
-
-
-#### TTL
-
-```yaml
-litellm_settings:
-  cache: true
-  cache_params: # set cache params for redis
-    type: redis
-    ttl: 600 # will be cached on redis for 600s
-    # default_in_memory_ttl: Optional[float], default is None. time in seconds.
-    # default_in_redis_ttl: Optional[float], default is None. time in seconds.
-```
-
-
-#### SSL
-
-Set `REDIS_SSL="True"` in your .env, and LiteLLM will pick it up.
-
-```env
-REDIS_SSL="True"
-```
-
-For quick testing, you can also use REDIS_URL, e.g.:
-
-```
-REDIS_URL="rediss://.."
-```
-
-However, we **don't** recommend using REDIS_URL in prod. We've noticed a performance difference between using it and setting redis_host, port, etc. separately.
-
-#### Step 2: Add Redis Credentials to .env
-Set either `REDIS_URL` or `REDIS_HOST` in your OS environment to enable caching.
-
- ```shell
- REDIS_URL = "" # REDIS_URL='redis://username:password@hostname:port/database'
- ## OR ##
- REDIS_HOST = "" # REDIS_HOST='redis-18841.c274.us-east-1-3.ec2.cloud.redislabs.com'
- REDIS_PORT = "" # REDIS_PORT='18841'
- REDIS_PASSWORD = "" # REDIS_PASSWORD='liteLlmIsAmazing'
- ```
-
-**Additional kwargs**
-You can pass in any additional redis.Redis arg, by storing the variable + value in your os environment, like this:
-```shell
-REDIS_ = ""
-```
-
-[**See how it's read from the environment**](https://github.com/BerriAI/litellm/blob/4d7ff1b33b9991dcf38d821266290631d9bcd2dd/litellm/_redis.py#L40)
-#### Step 3: Run proxy with config
-```shell
-$ litellm --config /path/to/config.yaml
-```
-
-
-
-
-
-Caching can be enabled by adding the `cache` key in the `config.yaml`
-
-#### Step 1: Add `cache` to the config.yaml
-```yaml
-model_list:
-  - model_name: fake-openai-endpoint
-    litellm_params:
-      model: openai/fake
-      api_key: fake-key
-      api_base: https://exampleopenaiendpoint-production.up.railway.app/
-  - model_name: openai-embedding
-    litellm_params:
-      model: openai/text-embedding-3-small
-      api_key: os.environ/OPENAI_API_KEY
-
-litellm_settings:
-  set_verbose: True
-  cache: True # set cache responses to True, litellm defaults to using a redis cache
-  cache_params:
-    type: qdrant-semantic
-    qdrant_semantic_cache_embedding_model: openai-embedding # the model should be defined on the model_list
-    qdrant_collection_name: test_collection
-    qdrant_quantization_config: binary
-    similarity_threshold: 0.8 # similarity threshold for semantic cache
-```
-
-#### Step 2: Add Qdrant Credentials to your .env
-
-```shell
-QDRANT_API_KEY = "16rJUMBRx*************"
-QDRANT_API_BASE = "https://5392d382-45*********.cloud.qdrant.io"
-```
-
-#### Step 3: Run proxy with config
-```shell
-$ litellm --config /path/to/config.yaml
-```
-
-
-#### Step 4. Test it
-
-```shell
-curl -i http://localhost:4000/v1/chat/completions \
- -H "Content-Type: application/json" \
- -H "Authorization: Bearer sk-1234" \
- -d '{
- "model": "fake-openai-endpoint",
- "messages": [
- {"role": "user", "content": "Hello"}
- ]
- }'
-```
-
-**Expect to see `x-litellm-semantic-similarity` in the response headers when semantic caching is on**
-
-
-
-
-
-#### Step 1: Add `cache` to the config.yaml
-```yaml
-model_list:
-  - model_name: gpt-3.5-turbo
-    litellm_params:
-      model: gpt-3.5-turbo
-  - model_name: text-embedding-ada-002
-    litellm_params:
-      model: text-embedding-ada-002
-
-litellm_settings:
-  set_verbose: True
-  cache: True # set cache responses to True
-  cache_params: # set cache params for s3
-    type: s3
-    s3_bucket_name: cache-bucket-litellm # AWS Bucket Name for S3
-    s3_region_name: us-west-2 # AWS Region Name for S3
-    s3_aws_access_key_id: os.environ/AWS_ACCESS_KEY_ID # use os.environ/ to pass environment variables. This is AWS Access Key ID for S3
-    s3_aws_secret_access_key: os.environ/AWS_SECRET_ACCESS_KEY # AWS Secret Access Key for S3
-    s3_endpoint_url: https://s3.amazonaws.com # [OPTIONAL] S3 endpoint URL, if you want to use Backblaze/cloudflare s3 buckets
-```
-
-#### Step 2: Run proxy with config
-```shell
-$ litellm --config /path/to/config.yaml
-```
-
-
-
-
-
-Caching can be enabled by adding the `cache` key in the `config.yaml`
-
-#### Step 1: Add `cache` to the config.yaml
-```yaml
-model_list:
-  - model_name: gpt-3.5-turbo
-    litellm_params:
-      model: gpt-3.5-turbo
-  - model_name: azure-embedding-model
-    litellm_params:
-      model: azure/azure-embedding-model
-      api_base: os.environ/AZURE_API_BASE
-      api_key: os.environ/AZURE_API_KEY
-      api_version: "2023-07-01-preview"
-
-litellm_settings:
-  set_verbose: True
-  cache: True # set cache responses to True, litellm defaults to using a redis cache
-  cache_params:
-    type: "redis-semantic"
-    similarity_threshold: 0.8 # similarity threshold for semantic cache
-    redis_semantic_cache_embedding_model: azure-embedding-model # set this to a model_name set in model_list
-```
-
-#### Step 2: Add Redis Credentials to .env
-Set either `REDIS_URL` or the `REDIS_HOST` in your os environment, to enable caching.
-
- ```shell
- REDIS_URL = "" # REDIS_URL='redis://username:password@hostname:port/database'
- ## OR ##
- REDIS_HOST = "" # REDIS_HOST='redis-18841.c274.us-east-1-3.ec2.cloud.redislabs.com'
- REDIS_PORT = "" # REDIS_PORT='18841'
- REDIS_PASSWORD = "" # REDIS_PASSWORD='liteLlmIsAmazing'
- ```
-
-**Additional kwargs**
-You can pass in any additional redis.Redis arg, by storing the variable + value in your os environment, like this:
-```shell
-REDIS_ = ""
-```
-
-#### Step 3: Run proxy with config
-```shell
-$ litellm --config /path/to/config.yaml
-```
-
-
-
-
-
-
-
-
-
-## Using Caching - /chat/completions
-
-
-
-
-Send the same request twice:
-```shell
-curl http://0.0.0.0:4000/v1/chat/completions \
- -H "Content-Type: application/json" \
- -d '{
- "model": "gpt-3.5-turbo",
- "messages": [{"role": "user", "content": "write a poem about litellm!"}],
- "temperature": 0.7
- }'
-
-curl http://0.0.0.0:4000/v1/chat/completions \
- -H "Content-Type: application/json" \
- -d '{
- "model": "gpt-3.5-turbo",
- "messages": [{"role": "user", "content": "write a poem about litellm!"}],
- "temperature": 0.7
- }'
-```
-
-
-
-Send the same request twice:
-```shell
-curl --location 'http://0.0.0.0:4000/embeddings' \
- --header 'Content-Type: application/json' \
- --data ' {
- "model": "text-embedding-ada-002",
- "input": ["write a litellm poem"]
- }'
-
-curl --location 'http://0.0.0.0:4000/embeddings' \
- --header 'Content-Type: application/json' \
- --data ' {
- "model": "text-embedding-ada-002",
- "input": ["write a litellm poem"]
- }'
-```
-
-
-
-## Set cache for proxy, but not on the actual llm api call
-
-Use this if you just want to enable features like rate limiting and load balancing across multiple instances.
-
-Set `supported_call_types: []` to disable caching on the actual api call.
-
-
-```yaml
-litellm_settings:
- cache: True
- cache_params:
- type: redis
- supported_call_types: []
-```
-
-
-## Debugging Caching - `/cache/ping`
-LiteLLM Proxy exposes a `/cache/ping` endpoint to test if the cache is working as expected
-
-**Usage**
-```shell
-curl --location 'http://0.0.0.0:4000/cache/ping' -H "Authorization: Bearer sk-1234"
-```
-
-**Expected Response - when cache healthy**
-```shell
-{
- "status": "healthy",
- "cache_type": "redis",
- "ping_response": true,
- "set_cache_response": "success",
- "litellm_cache_params": {
- "supported_call_types": "['completion', 'acompletion', 'embedding', 'aembedding', 'atranscription', 'transcription']",
- "type": "redis",
- "namespace": "None"
- },
- "redis_cache_params": {
- "redis_client": "Redis>>",
- "redis_kwargs": "{'url': 'redis://:******@redis-16337.c322.us-east-1-2.ec2.cloud.redislabs.com:16337'}",
- "async_redis_conn_pool": "BlockingConnectionPool>",
- "redis_version": "7.2.0"
- }
-}
-```
-
-## Advanced
-
-### Control Call Types Caching is on for - (`/chat/completions`, `/embeddings`, etc.)
-
-By default, caching is on for all call types. You can control which call types caching is on for by setting `supported_call_types` in `cache_params`
-
-**Cache will only be on for the call types specified in `supported_call_types`**
-
-```yaml
-litellm_settings:
- cache: True
- cache_params:
- type: redis
- supported_call_types: ["acompletion", "atext_completion", "aembedding", "atranscription"]
- # /chat/completions, /completions, /embeddings, /audio/transcriptions
-```
-### Set Cache Params on config.yaml
-```yaml
-model_list:
- - model_name: gpt-3.5-turbo
- litellm_params:
- model: gpt-3.5-turbo
- - model_name: text-embedding-ada-002
- litellm_params:
- model: text-embedding-ada-002
-
-litellm_settings:
- set_verbose: True
- cache: True # set cache responses to True, litellm defaults to using a redis cache
- cache_params: # cache_params are optional
- type: "redis" # The type of cache to initialize. Can be "local" or "redis". Defaults to "local".
- host: "localhost" # The host address for the Redis cache. Required if type is "redis".
- port: 6379 # The port number for the Redis cache. Required if type is "redis".
- password: "your_password" # The password for the Redis cache. Required if type is "redis".
-
- # Optional configurations
- supported_call_types: ["acompletion", "atext_completion", "aembedding", "atranscription"]
- # /chat/completions, /completions, /embeddings, /audio/transcriptions
-```
-
-### **Turn on / off caching per request**
-
-The proxy supports 4 cache controls:
-
-- `ttl`: *Optional(int)* - Will cache the response for the user-defined amount of time (in seconds).
-- `s-maxage`: *Optional(int)* Will only accept cached responses that are within user-defined range (in seconds).
-- `no-cache`: *Optional(bool)* Will not return a cached response, but instead call the actual endpoint.
-- `no-store`: *Optional(bool)* Will not cache the response.
-
-[Let us know if you need more](https://github.com/BerriAI/litellm/issues/1218)
-
-**Turn off caching**
-
-Set `no-cache=True`, this will not return a cached response
-
-
-
-
-```python
-import os
-from openai import OpenAI
-
-client = OpenAI(
- # This is the default and can be omitted
- api_key=os.environ.get("OPENAI_API_KEY"),
- base_url="http://0.0.0.0:4000"
-)
-
-chat_completion = client.chat.completions.create(
- messages=[
- {
- "role": "user",
- "content": "Say this is a test",
- }
- ],
- model="gpt-3.5-turbo",
- extra_body = { # OpenAI python accepts extra args in extra_body
- "cache": {
- "no-cache": True # will not return a cached response
- }
- }
-)
-```
-
-
-
-
-```shell
-curl http://localhost:4000/v1/chat/completions \
- -H "Content-Type: application/json" \
- -H "Authorization: Bearer sk-1234" \
- -d '{
- "model": "gpt-3.5-turbo",
- "cache": {"no-cache": true},
- "messages": [
- {"role": "user", "content": "Say this is a test"}
- ]
- }'
-```
-
-
-
-
-
-**Turn on caching**
-
-By default cache is always on
-
-
-
-
-```python
-import os
-from openai import OpenAI
-
-client = OpenAI(
- # This is the default and can be omitted
- api_key=os.environ.get("OPENAI_API_KEY"),
- base_url="http://0.0.0.0:4000"
-)
-
-chat_completion = client.chat.completions.create(
- messages=[
- {
- "role": "user",
- "content": "Say this is a test",
- }
- ],
- model="gpt-3.5-turbo"
-)
-```
-
-
-
-
-```shell
-curl http://localhost:4000/v1/chat/completions \
- -H "Content-Type: application/json" \
- -H "Authorization: Bearer sk-1234" \
- -d '{
- "model": "gpt-3.5-turbo",
- "messages": [
- {"role": "user", "content": "Say this is a test"}
- ]
- }'
-```
-
-
-
-
-
-**Set `ttl`**
-
-Set `ttl=600`; this caches the response for 10 minutes (600 seconds).
-
-
-
-
-```python
-import os
-from openai import OpenAI
-
-client = OpenAI(
- # This is the default and can be omitted
- api_key=os.environ.get("OPENAI_API_KEY"),
- base_url="http://0.0.0.0:4000"
-)
-
-chat_completion = client.chat.completions.create(
- messages=[
- {
- "role": "user",
- "content": "Say this is a test",
- }
- ],
- model="gpt-3.5-turbo",
- extra_body = { # OpenAI python accepts extra args in extra_body
- "cache": {
- "ttl": 600 # caches response for 10 minutes
- }
- }
-)
-```
-
-
-
-
-```shell
-curl http://localhost:4000/v1/chat/completions \
- -H "Content-Type: application/json" \
- -H "Authorization: Bearer sk-1234" \
- -d '{
- "model": "gpt-3.5-turbo",
- "cache": {"ttl": 600},
- "messages": [
- {"role": "user", "content": "Say this is a test"}
- ]
- }'
-```
-
-
-
-
-
-
-
-**Set `s-maxage`**
-
-Set `s-maxage=600`; this only returns responses cached within the last 10 minutes (600 seconds).
-
-
-
-
-```python
-import os
-from openai import OpenAI
-
-client = OpenAI(
- # This is the default and can be omitted
- api_key=os.environ.get("OPENAI_API_KEY"),
- base_url="http://0.0.0.0:4000"
-)
-
-chat_completion = client.chat.completions.create(
- messages=[
- {
- "role": "user",
- "content": "Say this is a test",
- }
- ],
- model="gpt-3.5-turbo",
- extra_body = { # OpenAI python accepts extra args in extra_body
- "cache": {
- "s-maxage": 600 # only get responses cached within last 10 minutes
- }
- }
-)
-```
-
-
-
-
-```shell
-curl http://localhost:4000/v1/chat/completions \
- -H "Content-Type: application/json" \
- -H "Authorization: Bearer sk-1234" \
- -d '{
- "model": "gpt-3.5-turbo",
- "cache": {"s-maxage": 600},
- "messages": [
- {"role": "user", "content": "Say this is a test"}
- ]
- }'
-```
-
-
-
-
-
-
-### Turn on / off caching per Key.
-
-1. Add cache params when creating a key [full list](#turn-on--off-caching-per-key)
-
-```bash
-curl -X POST 'http://0.0.0.0:4000/key/generate' \
--H 'Authorization: Bearer sk-1234' \
--H 'Content-Type: application/json' \
--d '{
- "user_id": "222",
- "metadata": {
- "cache": {
- "no-cache": true
- }
- }
-}'
-```
-
-2. Test it!
-
-```bash
-curl -X POST 'http://localhost:4000/chat/completions' \
--H 'Content-Type: application/json' \
--H 'Authorization: Bearer ' \
--d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "bom dia"}]}'
-```
-
-### Deleting Cache Keys - `/cache/delete`
-In order to delete a cache key, send a request to `/cache/delete` with the `keys` you want to delete
-
-Example
-```shell
-curl -X POST "http://0.0.0.0:4000/cache/delete" \
- -H "Authorization: Bearer sk-1234" \
- -d '{"keys": ["586bf3f3c1bf5aecb55bd9996494d3bbc69eb58397163add6d49537762a7548d", "key2"]}'
-```
-
-```shell
-# {"status":"success"}
-```
-
-#### Viewing Cache Keys from responses
-You can view the cache key in the response headers: on cache hits, it is sent as the `x-litellm-cache-key` response header.
-```shell
-curl -i --location 'http://0.0.0.0:4000/chat/completions' \
- --header 'Authorization: Bearer sk-1234' \
- --header 'Content-Type: application/json' \
- --data '{
- "model": "gpt-3.5-turbo",
- "user": "ishan",
- "messages": [
- {
- "role": "user",
- "content": "what is litellm"
- }
- ]
-}'
-```
-
-Response from litellm proxy
-```json
-date: Thu, 04 Apr 2024 17:37:21 GMT
-content-type: application/json
-x-litellm-cache-key: 586bf3f3c1bf5aecb55bd9996494d3bbc69eb58397163add6d49537762a7548d
-
-{
- "id": "chatcmpl-9ALJTzsBlXR9zTxPvzfFFtFbFtG6T",
- "choices": [
- {
- "finish_reason": "stop",
- "index": 0,
- "message": {
- "content": "I'm sorr..",
- "role": "assistant"
- }
- }
- ],
- "created": 1712252235
-}
-
-```
-
-### **Set Caching Default Off - Opt in only**
-
-1. **Set `mode: default_off` for caching**
-
-```yaml
-model_list:
- - model_name: fake-openai-endpoint
- litellm_params:
- model: openai/fake
- api_key: fake-key
- api_base: https://exampleopenaiendpoint-production.up.railway.app/
-
-# default off mode
-litellm_settings:
- set_verbose: True
- cache: True
- cache_params:
- mode: default_off # 👈 Key change cache is default_off
-```
-
-2. **Opting in to cache when cache is default off**
-
-
-
-
-
-```python
-import os
-from openai import OpenAI
-
-client = OpenAI(api_key="sk-1234", base_url="http://0.0.0.0:4000")  # "sk-1234" is a placeholder for your proxy key
-
-chat_completion = client.chat.completions.create(
- messages=[
- {
- "role": "user",
- "content": "Say this is a test",
- }
- ],
- model="gpt-3.5-turbo",
- extra_body = { # OpenAI python accepts extra args in extra_body
- "cache": {"use-cache": True}
- }
-)
-```
-
-
-
-
-```shell
-curl http://localhost:4000/v1/chat/completions \
- -H "Content-Type: application/json" \
- -H "Authorization: Bearer sk-1234" \
- -d '{
- "model": "gpt-3.5-turbo",
- "cache": {"use-cache": true},
- "messages": [
- {"role": "user", "content": "Say this is a test"}
- ]
- }'
-```
-
-
-
-
-
-
-
-### Turn on `batch_redis_requests`
-
-**What does it do?**
-When a request is made:
-
-- Check whether a key starting with `litellm:::` exists in memory; if not, get the last 100 cached requests for this key and store them
-
-- New requests are stored with this `litellm:..` as the namespace
-
-**Why?**
-Reduce number of redis GET requests. This improved latency by 46% in prod load tests.
-
-**Usage**
-
-```yaml
-litellm_settings:
- cache: true
- cache_params:
- type: redis
- ... # remaining redis args (host, port, etc.)
- callbacks: ["batch_redis_requests"] # 👈 KEY CHANGE!
-```
-
-[**SEE CODE**](https://github.com/BerriAI/litellm/blob/main/litellm/proxy/hooks/batch_redis_get.py)
-
-## Supported `cache_params` on proxy config.yaml
-
-```yaml
-cache_params:
- # ttl
- ttl: Optional[float]
- default_in_memory_ttl: Optional[float]
- default_in_redis_ttl: Optional[float]
-
- # Type of cache (options: "local", "redis", "s3")
- type: s3
-
- # List of litellm call types to cache for
- # Options: "completion", "acompletion", "embedding", "aembedding"
- supported_call_types: ["acompletion", "atext_completion", "aembedding", "atranscription"]
- # /chat/completions, /completions, /embeddings, /audio/transcriptions
-
- # Redis cache parameters
- host: localhost # Redis server hostname or IP address
- port: "6379" # Redis server port (as a string)
- password: secret_password # Redis server password
- namespace: Optional[str] = None,
-
-
- # S3 cache parameters
- s3_bucket_name: your_s3_bucket_name # Name of the S3 bucket
- s3_region_name: us-west-2 # AWS region of the S3 bucket
- s3_api_version: 2006-03-01 # AWS S3 API version
- s3_use_ssl: true # Use SSL for S3 connections (options: true, false)
- s3_verify: true # SSL certificate verification for S3 connections (options: true, false)
- s3_endpoint_url: https://s3.amazonaws.com # S3 endpoint URL
- s3_aws_access_key_id: your_access_key # AWS Access Key ID for S3
- s3_aws_secret_access_key: your_secret_key # AWS Secret Access Key for S3
- s3_aws_session_token: your_session_token # AWS Session Token for temporary credentials
-
-```
-
-## Advanced - user api key cache ttl
-
-Configure how long the in-memory cache stores the key object (prevents db requests)
-
-```yaml
-general_settings:
- user_api_key_cache_ttl: #time in seconds
-```
-
-By default this value is set to 60s.
\ No newline at end of file
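The four per-request cache controls listed in the caching page above (`ttl`, `s-maxage`, `no-cache`, `no-store`) can be read as two separate decisions: whether a stored entry may be served, and whether a fresh response may be written. The sketch below illustrates that reading only. It is not LiteLLM's implementation, and `CacheEntry`, `may_serve`, `should_store` and `ttl_for` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheEntry:
    """Hypothetical cached response plus the time it was stored."""
    value: str
    stored_at: float  # seconds since the epoch


def may_serve(entry: Optional[CacheEntry], controls: dict, now: float) -> bool:
    """Return True if a cached entry may be returned for this request."""
    if entry is None or controls.get("no-cache"):
        return False  # nothing cached, or the caller asked for a fresh call
    max_age = controls.get("s-maxage")
    if max_age is not None and now - entry.stored_at > max_age:
        return False  # entry is older than the caller accepts
    return True


def should_store(controls: dict) -> bool:
    """Return True unless the caller asked not to cache the response."""
    return not controls.get("no-store", False)


def ttl_for(controls: dict, default_ttl: Optional[float] = None) -> Optional[float]:
    """A per-request `ttl` (seconds) wins over the configured default."""
    return controls.get("ttl", default_ttl)
```

Keeping the read decision (`no-cache`, `s-maxage`) apart from the write decision (`no-store`, `ttl`) mirrors how the list describes them: the first pair constrains what may be returned, the second what may be saved.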
diff --git a/docs/my-website/docs/proxy/call_hooks.md b/docs/my-website/docs/proxy/call_hooks.md
deleted file mode 100644
index 6651393ef..000000000
--- a/docs/my-website/docs/proxy/call_hooks.md
+++ /dev/null
@@ -1,314 +0,0 @@
-import Image from '@theme/IdealImage';
-
-# Modify / Reject Incoming Requests
-
-- Modify data before making llm api calls on proxy
-- Reject data before making llm api calls / before returning the response
-- Enforce 'user' param for all openai endpoint calls
-
-See a complete example with our [parallel request rate limiter](https://github.com/BerriAI/litellm/blob/main/litellm/proxy/hooks/parallel_request_limiter.py)
-
-## Quick Start
-
-1. In your Custom Handler add a new `async_pre_call_hook` function
-
-This function is called just before a litellm completion call is made, and allows you to modify the data going into the litellm call [**See Code**](https://github.com/BerriAI/litellm/blob/589a6ca863000ba8e92c897ba0f776796e7a5904/litellm/proxy/proxy_server.py#L1000)
-
-```python
-from litellm.integrations.custom_logger import CustomLogger
-import litellm
-from litellm.proxy.proxy_server import UserAPIKeyAuth, DualCache
-from typing import Optional, Literal
-
-# This file includes the custom callbacks for LiteLLM Proxy
-# Once defined, these can be passed in proxy_config.yaml
-class MyCustomHandler(CustomLogger): # https://docs.litellm.ai/docs/observability/custom_callback#callback-class
-    # Class variables or attributes
-    def __init__(self):
-        pass
-
-    #### CALL HOOKS - proxy only ####
-
-    async def async_pre_call_hook(self, user_api_key_dict: UserAPIKeyAuth, cache: DualCache, data: dict, call_type: Literal[
-            "completion",
-            "text_completion",
-            "embeddings",
-            "image_generation",
-            "moderation",
-            "audio_transcription",
-        ]):
-        data["model"] = "my-new-model"
-        return data
-
-    async def async_post_call_failure_hook(
-        self,
-        request_data: dict,
-        original_exception: Exception,
-        user_api_key_dict: UserAPIKeyAuth
-    ):
-        pass
-
-    async def async_post_call_success_hook(
-        self,
-        data: dict,
-        user_api_key_dict: UserAPIKeyAuth,
-        response,
-    ):
-        pass
-
-    async def async_moderation_hook( # call made in parallel to llm api call
-        self,
-        data: dict,
-        user_api_key_dict: UserAPIKeyAuth,
-        call_type: Literal["completion", "embeddings", "image_generation", "moderation", "audio_transcription"],
-    ):
-        pass
-
-    async def async_post_call_streaming_hook(
-        self,
-        user_api_key_dict: UserAPIKeyAuth,
-        response: str,
-    ):
-        pass
-proxy_handler_instance = MyCustomHandler()
-```
-
-2. Add this file to your proxy config
-
-```yaml
-model_list:
- - model_name: gpt-3.5-turbo
- litellm_params:
- model: gpt-3.5-turbo
-
-litellm_settings:
- callbacks: custom_callbacks.proxy_handler_instance # sets litellm.callbacks = [proxy_handler_instance]
-```
-
-3. Start the server + test the request
-
-```shell
-$ litellm --config /path/to/config.yaml
-```
-```shell
-curl --location 'http://0.0.0.0:4000/chat/completions' \
- --data ' {
- "model": "gpt-3.5-turbo",
- "messages": [
- {
- "role": "user",
- "content": "good morning good sir"
- }
- ],
- "user": "ishaan-app",
- "temperature": 0.2
- }'
-```
-
-
-## [BETA] *NEW* async_moderation_hook
-
-Run a moderation check in parallel to the actual LLM API call.
-
-In your Custom Handler add a new `async_moderation_hook` function
-
-- This is currently only supported for `/chat/completions` calls.
-- This function runs in parallel to the actual LLM API call.
-- If your `async_moderation_hook` raises an Exception, we will return that to the user.
-
-
-:::info
-
-We might need to update the function schema in the future to support multiple endpoints (e.g. accept a call_type). Please keep that in mind while trying this feature.
-
-:::
-
-See a complete example with our [LLM Guard content moderation hook](https://github.com/BerriAI/litellm/blob/main/enterprise/enterprise_hooks/llm_guard.py)
-
-```python
-from litellm.integrations.custom_logger import CustomLogger
-import litellm
-from litellm.proxy.proxy_server import UserAPIKeyAuth, DualCache
-from typing import Literal
-from fastapi import HTTPException
-
-# This file includes the custom callbacks for LiteLLM Proxy
-# Once defined, these can be passed in proxy_config.yaml
-class MyCustomHandler(CustomLogger): # https://docs.litellm.ai/docs/observability/custom_callback#callback-class
-    # Class variables or attributes
-    def __init__(self):
-        pass
-
-    #### ASYNC ####
-
-    async def async_log_stream_event(self, kwargs, response_obj, start_time, end_time):
-        pass
-
-    async def async_log_pre_api_call(self, model, messages, kwargs):
-        pass
-
-    async def async_log_success_event(self, kwargs, response_obj, start_time, end_time):
-        pass
-
-    async def async_log_failure_event(self, kwargs, response_obj, start_time, end_time):
-        pass
-
-    #### CALL HOOKS - proxy only ####
-
-    async def async_pre_call_hook(self, user_api_key_dict: UserAPIKeyAuth, cache: DualCache, data: dict, call_type: Literal["completion", "embeddings"]):
-        data["model"] = "my-new-model"
-        return data
-
-    async def async_moderation_hook( ### 👈 KEY CHANGE ###
-        self,
-        data: dict,
-    ):
-        messages = data["messages"]
-        print(messages)
-        # Compare case-insensitively so the "Hello world" test request is rejected
-        if messages[0]["content"].lower() == "hello world":
-            raise HTTPException(
-                status_code=400, detail={"error": "Violated content safety policy"}
-            )
-
-proxy_handler_instance = MyCustomHandler()
-```
-
-
-2. Add this file to your proxy config
-
-```yaml
-model_list:
- - model_name: gpt-3.5-turbo
- litellm_params:
- model: gpt-3.5-turbo
-
-litellm_settings:
- callbacks: custom_callbacks.proxy_handler_instance # sets litellm.callbacks = [proxy_handler_instance]
-```
-
-3. Start the server + test the request
-
-```shell
-$ litellm --config /path/to/config.yaml
-```
-```shell
-curl --location 'http://0.0.0.0:4000/chat/completions' \
- --data ' {
- "model": "gpt-3.5-turbo",
- "messages": [
- {
- "role": "user",
- "content": "Hello world"
- }
- ]
- }'
-```
-
-## Advanced - Enforce 'user' param
-
-Set `enforce_user_param` to true, to require all calls to the openai endpoints to have the 'user' param.
-
-[**See Code**](https://github.com/BerriAI/litellm/blob/4777921a31c4c70e4d87b927cb233b6a09cd8b51/litellm/proxy/auth/auth_checks.py#L72)
-
-```yaml
-general_settings:
- enforce_user_param: True
-```
-
-**Result**
-
-
-
-## Advanced - Return rejected message as response
-
-For chat completions and text completion calls, you can return a rejected message as a user response.
-
-Do this by returning a string. LiteLLM takes care of returning the response in the correct format depending on the endpoint and if it's streaming/non-streaming.
-
-For non-chat/text completion endpoints, this response is returned as a 400 status code exception.
-
-
-### 1. Create Custom Handler
-
-```python
-from litellm.integrations.custom_logger import CustomLogger
-import litellm
-from litellm.proxy.proxy_server import UserAPIKeyAuth, DualCache
-from litellm.utils import get_formatted_prompt
-from typing import Literal, Optional, Union
-
-# This file includes the custom callbacks for LiteLLM Proxy
-# Once defined, these can be passed in proxy_config.yaml
-class MyCustomHandler(CustomLogger):
-    def __init__(self):
-        pass
-
-    #### CALL HOOKS - proxy only ####
-
-    async def async_pre_call_hook(self, user_api_key_dict: UserAPIKeyAuth, cache: DualCache, data: dict, call_type: Literal[
-            "completion",
-            "text_completion",
-            "embeddings",
-            "image_generation",
-            "moderation",
-            "audio_transcription",
-        ]) -> Optional[Union[dict, str, Exception]]:
-        formatted_prompt = get_formatted_prompt(data=data, call_type=call_type)
-
-        # Returning a string rejects the request; the string becomes the response content
-        if "Hello world" in formatted_prompt:
-            return "This is an invalid response"
-
-        return data
-
-proxy_handler_instance = MyCustomHandler()
-```
-
-### 2. Update config.yaml
-
-```yaml
-model_list:
- - model_name: gpt-3.5-turbo
- litellm_params:
- model: gpt-3.5-turbo
-
-litellm_settings:
- callbacks: custom_callbacks.proxy_handler_instance # sets litellm.callbacks = [proxy_handler_instance]
-```
-
-
-### 3. Test it!
-
-```shell
-$ litellm --config /path/to/config.yaml
-```
-```shell
-curl --location 'http://0.0.0.0:4000/chat/completions' \
- --data ' {
- "model": "gpt-3.5-turbo",
- "messages": [
- {
- "role": "user",
- "content": "Hello world"
- }
- ]
- }'
-```
-
-**Expected Response**
-
-```
-{
- "id": "chatcmpl-d00bbede-2d90-4618-bf7b-11a1c23cf360",
- "choices": [
- {
- "finish_reason": "stop",
- "index": 0,
- "message": {
- "content": "This is an invalid response", # 👈 REJECTED RESPONSE
- "role": "assistant"
- }
- }
- ],
- "created": 1716234198,
- "model": null,
- "object": "chat.completion",
- "system_fingerprint": null,
- "usage": {}
-}
-```
\ No newline at end of file
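The "Return rejected message as response" section above says a string returned from `async_pre_call_hook` is sent back in the normal response format for chat and text completions, and as a 400 error for other endpoints. The sketch below shows one way such wrapping could look for the non-streaming case. It is an illustration under stated assumptions, not LiteLLM's code: `wrap_rejection` and `RejectedRequest` are hypothetical names, the chat shape follows the "Expected Response" shown in that section, and the text completion shape is assumed to follow the OpenAI text completion format.

```python
import time
import uuid


class RejectedRequest(Exception):
    """Hypothetical stand-in for the 400 error used on other endpoints."""
    status_code = 400


def wrap_rejection(message: str, call_type: str) -> dict:
    """Shape a rejection string like a normal non-streaming response."""
    base = {
        "id": f"chatcmpl-{uuid.uuid4()}",
        "created": int(time.time()),
        "model": None,
        "usage": {},
    }
    if call_type == "completion":
        # Chat completions carry the text in choices[0].message.content
        choice = {
            "finish_reason": "stop",
            "index": 0,
            "message": {"content": message, "role": "assistant"},
        }
        return {**base, "object": "chat.completion", "choices": [choice]}
    if call_type == "text_completion":
        # Text completions carry the text in choices[0].text (assumed shape)
        choice = {"finish_reason": "stop", "index": 0, "text": message}
        return {**base, "object": "text_completion", "choices": [choice]}
    # Any other call type is rejected with an error instead of a body
    raise RejectedRequest(message)
```

A client reading `['choices'][0]['message']['content']` therefore sees the rejection text exactly where a model answer would normally appear.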
diff --git a/docs/my-website/docs/proxy/cli.md b/docs/my-website/docs/proxy/cli.md
deleted file mode 100644
index d0c477a4e..000000000
--- a/docs/my-website/docs/proxy/cli.md
+++ /dev/null
@@ -1,186 +0,0 @@
-# CLI Arguments
-CLI arguments, e.g. `--host`, `--port`, `--num_workers`
-
-## --host
- - **Default:** `'0.0.0.0'`
- - The host for the server to listen on.
- - **Usage:**
- ```shell
- litellm --host 127.0.0.1
- ```
- - **Usage - set Environment Variable:** `HOST`
- ```shell
- export HOST=127.0.0.1
- litellm
- ```
-
-## --port
- - **Default:** `4000`
- - The port to bind the server to.
- - **Usage:**
- ```shell
- litellm --port 8080
- ```
- - **Usage - set Environment Variable:** `PORT`
- ```shell
- export PORT=8080
- litellm
- ```
-
-## --num_workers
- - **Default:** `1`
- - The number of uvicorn workers to spin up.
- - **Usage:**
- ```shell
- litellm --num_workers 4
- ```
- - **Usage - set Environment Variable:** `NUM_WORKERS`
- ```shell
- export NUM_WORKERS=4
- litellm
- ```
-
-## --api_base
- - **Default:** `None`
- - The API base for the model litellm should call.
- - **Usage:**
- ```shell
- litellm --model huggingface/tinyllama --api_base https://k58ory32yinf1ly0.us-east-1.aws.endpoints.huggingface.cloud
- ```
-
-## --api_version
- - **Default:** `None`
- - For Azure services, specify the API version.
- - **Usage:**
- ```shell
- litellm --model azure/gpt-deployment --api_version 2023-08-01 --api_base https://"
- ```
-
-## --model or -m
- - **Default:** `None`
- - The model name to pass to LiteLLM.
- - **Usage:**
- ```shell
- litellm --model gpt-3.5-turbo
- ```
-
-## --test
- - **Type:** `bool` (Flag)
- - Make a test request to the proxy's chat completions URL.
- - **Usage:**
- ```shell
- litellm --test
- ```
-
-## --health
- - **Type:** `bool` (Flag)
- - Runs a health check on all models in config.yaml
- - **Usage:**
- ```shell
- litellm --health
- ```
-
-## --alias
- - **Default:** `None`
- - An alias for the model, for user-friendly reference.
- - **Usage:**
- ```shell
- litellm --alias my-gpt-model
- ```
-
-## --debug
- - **Default:** `False`
- - **Type:** `bool` (Flag)
- - Enable debugging mode for the input.
- - **Usage:**
- ```shell
- litellm --debug
- ```
- - **Usage - set Environment Variable:** `DEBUG`
- ```shell
- export DEBUG=True
- litellm
- ```
-
-## --detailed_debug
- - **Default:** `False`
- - **Type:** `bool` (Flag)
- - Enable debugging mode for the input.
- - **Usage:**
- ```shell
- litellm --detailed_debug
- ```
- - **Usage - set Environment Variable:** `DETAILED_DEBUG`
- ```shell
- export DETAILED_DEBUG=True
- litellm
- ```
-
-## --temperature
- - **Default:** `None`
- - **Type:** `float`
- - Set the temperature for the model.
- - **Usage:**
- ```shell
- litellm --temperature 0.7
- ```
-
-## --max_tokens
- - **Default:** `None`
- - **Type:** `int`
- - Set the maximum number of tokens for the model output.
- - **Usage:**
- ```shell
- litellm --max_tokens 50
- ```
-
-## --request_timeout
- - **Default:** `6000`
- - **Type:** `int`
- - Set the timeout in seconds for completion calls.
- - **Usage:**
- ```shell
- litellm --request_timeout 300
- ```
-
-## --drop_params
- - **Type:** `bool` (Flag)
- - Drop any unmapped params.
- - **Usage:**
- ```shell
- litellm --drop_params
- ```
-
-## --add_function_to_prompt
- - **Type:** `bool` (Flag)
- - If a function is passed but unsupported, pass it as part of the prompt.
- - **Usage:**
- ```shell
- litellm --add_function_to_prompt
- ```
-
-## --config
- - Configure LiteLLM by providing a configuration file path.
- - **Usage:**
- ```shell
- litellm --config path/to/config.yaml
- ```
-
-## --telemetry
- - **Default:** `True`
- - **Type:** `bool`
- - Help track usage of this feature.
- - **Usage:**
- ```shell
- litellm --telemetry False
- ```
-
-
-## --log_config
- - **Default:** `None`
- - **Type:** `str`
- - Specify a log configuration file for uvicorn.
- - **Usage:**
- ```shell
- litellm --log_config path/to/log_config.conf
- ```
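Several of the CLI flags above (`--host`, `--port`, `--num_workers`) can also be set through an environment variable (`HOST`, `PORT`, `NUM_WORKERS`). The sketch below shows one way to resolve such a setting, assuming an explicit flag takes precedence over the environment variable, which in turn takes precedence over the documented default. That precedence order is an assumption, `argparse` is used here purely for illustration, and `parse_server_args` is a hypothetical helper rather than part of the `litellm` CLI.

```python
import argparse
import os
from typing import Mapping, Optional, Sequence


def parse_server_args(
    argv: Sequence[str], env: Optional[Mapping[str, str]] = None
) -> argparse.Namespace:
    """Resolve host/port/workers from flags, then env vars, then defaults."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(prog="litellm-sketch")
    # Each default falls back to the env var, then to the documented default
    parser.add_argument("--host", default=env.get("HOST", "0.0.0.0"))
    parser.add_argument("--port", type=int, default=int(env.get("PORT", 4000)))
    parser.add_argument(
        "--num_workers", type=int, default=int(env.get("NUM_WORKERS", 1))
    )
    return parser.parse_args(argv)


# Flag given for the port, env var used for the host
args = parse_server_args(["--port", "8080"], env={"HOST": "127.0.0.1", "PORT": "9000"})
```

Passing `env` explicitly keeps the function easy to exercise without touching the real process environment.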
diff --git a/docs/my-website/docs/proxy/config_management.md b/docs/my-website/docs/proxy/config_management.md
deleted file mode 100644
index 4f7c5775b..000000000
--- a/docs/my-website/docs/proxy/config_management.md
+++ /dev/null
@@ -1,59 +0,0 @@
-# File Management
-
-## `include` external YAML files in a config.yaml
-
-You can use `include` to include external YAML files in a config.yaml.
-
-**Quick Start Usage:**
-
-To include a config file, use `include` with either a single file or a list of files.
-
-Contents of `parent_config.yaml`:
-```yaml
-include:
- - model_config.yaml # 👈 Key change, will include the contents of model_config.yaml
-
-litellm_settings:
- callbacks: ["prometheus"]
-```
-
-
-Contents of `model_config.yaml`:
-```yaml
-model_list:
- - model_name: gpt-4o
- litellm_params:
- model: openai/gpt-4o
- api_base: https://exampleopenaiendpoint-production.up.railway.app/
- - model_name: fake-anthropic-endpoint
- litellm_params:
- model: anthropic/fake
- api_base: https://exampleanthropicendpoint-production.up.railway.app/
-
-```
-
-Start proxy server
-
-This will start the proxy server with config `parent_config.yaml`. Since the `include` directive is used, the server will also include the contents of `model_config.yaml`.
-```
-litellm --config parent_config.yaml --detailed_debug
-```
-
-
-
-
-
-## Examples using `include`
-
-Include a single file:
-```yaml
-include:
- - model_config.yaml
-```
-
-Include multiple files:
-```yaml
-include:
- - model_config.yaml
- - another_config.yaml
-```
\ No newline at end of file
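The `include` directive described above pulls the contents of other YAML files into the parent config. The sketch below illustrates one possible merge, working on already-parsed dictionaries so that it needs no YAML library. It rests on assumptions the page does not state: included files are merged at the top level in order, and keys defined in the parent win on conflict. `resolve_includes` and the `load` callback are hypothetical names, not part of LiteLLM.

```python
from typing import Callable, Union


def resolve_includes(config: dict, load: Callable[[str], dict]) -> dict:
    """Return a copy of `config` with its `include` files merged in."""
    merged: dict = {}
    includes: Union[str, list] = config.get("include", [])
    if isinstance(includes, str):
        includes = [includes]  # a single file is allowed as well as a list
    for path in includes:
        merged.update(load(path))  # later files overwrite earlier ones (assumption)
    # Parent keys win over included keys (assumption); drop the directive itself
    merged.update({k: v for k, v in config.items() if k != "include"})
    return merged


# In-memory stand-ins for parsed YAML files
files = {"model_config.yaml": {"model_list": [{"model_name": "gpt-4o"}]}}
parent = {
    "include": ["model_config.yaml"],
    "litellm_settings": {"callbacks": ["prometheus"]},
}
merged = resolve_includes(parent, files.__getitem__)
```

With the two example files from this page, the merged result holds both `model_list` (from `model_config.yaml`) and `litellm_settings` (from `parent_config.yaml`).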
diff --git a/docs/my-website/docs/proxy/config_settings.md b/docs/my-website/docs/proxy/config_settings.md
deleted file mode 100644
index c762a0716..000000000
--- a/docs/my-website/docs/proxy/config_settings.md
+++ /dev/null
@@ -1,507 +0,0 @@
-# All settings
-
-
-```yaml
-environment_variables: {}
-
-model_list:
- - model_name: string
- litellm_params: {}
- model_info:
- id: string
- mode: embedding
- input_cost_per_token: 0
- output_cost_per_token: 0
- max_tokens: 2048
- base_model: gpt-4-1106-preview
- additionalProp1: {}
-
-litellm_settings:
- # Logging/Callback settings
- success_callback: ["langfuse"] # list of success callbacks
- failure_callback: ["sentry"] # list of failure callbacks
- callbacks: ["otel"] # list of callbacks - runs on success and failure
- service_callbacks: ["datadog", "prometheus"] # logs redis, postgres failures on datadog, prometheus
- turn_off_message_logging: boolean # prevent messages and responses from being logged to your callbacks; request metadata will still be logged.
- redact_user_api_key_info: boolean # Redact information about the user api key (hashed token, user_id, team id, etc.), from logs. Currently supported for Langfuse, OpenTelemetry, Logfire, ArizeAI logging.
- langfuse_default_tags: ["cache_hit", "cache_key", "proxy_base_url", "user_api_key_alias", "user_api_key_user_id", "user_api_key_user_email", "user_api_key_team_alias", "semantic-similarity", "proxy_base_url"] # default tags for Langfuse Logging
-
- # Networking settings
- request_timeout: 10 # (int) LLM request timeout in seconds. Raise Timeout error if call takes longer than 10s. Sets litellm.request_timeout
- force_ipv4: boolean # If true, litellm will force ipv4 for all LLM requests. Some users have seen httpx ConnectionError when using ipv6 + Anthropic API
-
- set_verbose: boolean # sets litellm.set_verbose=True to view verbose debug logs. DO NOT LEAVE THIS ON IN PRODUCTION
- json_logs: boolean # if true, logs will be in json format
-
- # Fallbacks, reliability
- default_fallbacks: ["claude-opus"] # set default_fallbacks, in case a specific model group is misconfigured / bad.
- content_policy_fallbacks: [{"gpt-3.5-turbo-small": ["claude-opus"]}] # fallbacks for ContentPolicyErrors
- context_window_fallbacks: [{"gpt-3.5-turbo-small": ["gpt-3.5-turbo-large", "claude-opus"]}] # fallbacks for ContextWindowExceededErrors
-
-
-
- # Caching settings
- cache: true
- cache_params: # set cache params for redis
- type: redis # type of cache to initialize
-
- # Optional - Redis Settings
- host: "localhost" # The host address for the Redis cache. Required if type is "redis".
- port: 6379 # The port number for the Redis cache. Required if type is "redis".
- password: "your_password" # The password for the Redis cache. Required if type is "redis".
- namespace: "litellm.caching.caching" # namespace for redis cache
-
- # Optional - Redis Cluster Settings
- redis_startup_nodes: [{"host": "127.0.0.1", "port": "7001"}]
-
- # Optional - Redis Sentinel Settings
- service_name: "mymaster"
- sentinel_nodes: [["localhost", 26379]]
-
- # Optional - Qdrant Semantic Cache Settings
- qdrant_semantic_cache_embedding_model: openai-embedding # the model should be defined on the model_list
- qdrant_collection_name: test_collection
- qdrant_quantization_config: binary
- similarity_threshold: 0.8 # similarity threshold for semantic cache
-
- # Optional - S3 Cache Settings
- s3_bucket_name: cache-bucket-litellm # AWS Bucket Name for S3
- s3_region_name: us-west-2 # AWS Region Name for S3
- s3_aws_access_key_id: os.environ/AWS_ACCESS_KEY_ID # use os.environ/ to pass environment variables. This is AWS Access Key ID for S3
- s3_aws_secret_access_key: os.environ/AWS_SECRET_ACCESS_KEY # AWS Secret Access Key for S3
- s3_endpoint_url: https://s3.amazonaws.com # [OPTIONAL] S3 endpoint URL, if you want to use Backblaze/cloudflare s3 bucket
-
- # Common Cache settings
- # Optional - Supported call types for caching
- supported_call_types: ["acompletion", "atext_completion", "aembedding", "atranscription"]
- # /chat/completions, /completions, /embeddings, /audio/transcriptions
- mode: default_off # if default_off, you need to opt in to caching on a per call basis
- ttl: 600 # ttl for caching
-
-
-callback_settings:
- otel:
- message_logging: boolean # OTEL logging callback specific settings
-
-general_settings:
- completion_model: string
- disable_spend_logs: boolean # turn off writing each transaction to the db
- disable_master_key_return: boolean # turn off returning master key on UI (checked on '/user/info' endpoint)
- disable_retry_on_max_parallel_request_limit_error: boolean # turn off retries when max parallel request limit is reached
- disable_reset_budget: boolean # turn off reset budget scheduled task
- disable_adding_master_key_hash_to_db: boolean # turn off storing master key hash in db, for spend tracking
- enable_jwt_auth: boolean # allow proxy admin to auth in via jwt tokens with 'litellm_proxy_admin' in claims
- enforce_user_param: boolean # requires all openai endpoint requests to have a 'user' param
- allowed_routes: ["route1", "route2"] # list of allowed proxy API routes - a user can access. (currently JWT-Auth only)
- key_management_system: google_kms # either google_kms or azure_kms
- master_key: string
-
- # Database Settings
- database_url: string
- database_connection_pool_limit: 0 # default 100
- database_connection_timeout: 0 # default 60s
- allow_requests_on_db_unavailable: boolean # if true, requests still succeed when the DB cannot be reached to verify the Virtual Key
-
- custom_auth: string
- max_parallel_requests: 0 # the max parallel requests allowed per deployment
- global_max_parallel_requests: 0 # the max parallel requests allowed on the proxy all up
- infer_model_from_keys: true
- background_health_checks: true
- health_check_interval: 300
- alerting: ["slack", "email"]
- alerting_threshold: 0
- use_client_credentials_pass_through_routes: boolean # use client credentials for all pass through routes like "/vertex-ai", /bedrock/. When this is True Virtual Key auth will not be applied on these endpoints
-```
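The `os.environ/` prefix used in the S3 cache settings above tells the proxy to read the value from an environment variable instead of taking it literally. A minimal sketch of that convention follows; `resolve_value` and `ENV_PREFIX` are hypothetical names used for illustration and are not LiteLLM's own loader.

```python
import os

ENV_PREFIX = "os.environ/"  # prefix shown in the config above


def resolve_value(value):
    """Hypothetical helper: swap an 'os.environ/NAME' string for the value of NAME."""
    if isinstance(value, str) and value.startswith(ENV_PREFIX):
        name = value[len(ENV_PREFIX):]
        # Returns None when the variable is not set.
        return os.environ.get(name)
    # Anything without the prefix is treated as a literal value.
    return value


# Example value set only for this demonstration.
os.environ["AWS_ACCESS_KEY_ID"] = "example-key-id"

print(resolve_value("os.environ/AWS_ACCESS_KEY_ID"))
print(resolve_value("us-west-2"))
```

Keeping secrets in the environment this way means the config file itself can be committed without exposing credentials.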
-
-### litellm_settings - Reference
-
-| Name | Type | Description |
-|------|------|-------------|
-| success_callback | array of strings | List of success callbacks. [Doc Proxy logging callbacks](logging), [Doc Metrics](prometheus) |
-| failure_callback | array of strings | List of failure callbacks [Doc Proxy logging callbacks](logging), [Doc Metrics](prometheus) |
-| callbacks | array of strings | List of callbacks - runs on success and failure [Doc Proxy logging callbacks](logging), [Doc Metrics](prometheus) |
-| service_callbacks | array of strings | System health monitoring - Logs redis, postgres failures on specified services (e.g. datadog, prometheus) [Doc Metrics](prometheus) |
-| turn_off_message_logging | boolean | If true, prevents messages and responses from being logged to callbacks, but request metadata will still be logged [Proxy Logging](logging) |
-| modify_params | boolean | If true, allows modifying the parameters of the request before it is sent to the LLM provider |
-| enable_preview_features | boolean | If true, enables preview features - e.g. Azure O1 Models with streaming support.|
-| redact_user_api_key_info | boolean | If true, redacts information about the user api key from logs [Proxy Logging](logging#redacting-userapikeyinfo) |
-| langfuse_default_tags | array of strings | Default tags for Langfuse Logging. Use this if you want to control which LiteLLM-specific fields are logged as tags by the LiteLLM proxy. By default LiteLLM Proxy logs no LiteLLM-specific fields as tags. [Further docs](./logging#litellm-specific-tags-on-langfuse---cache_hit-cache_key) |
-| set_verbose | boolean | If true, sets litellm.set_verbose=True to view verbose debug logs. DO NOT LEAVE THIS ON IN PRODUCTION |
-| json_logs | boolean | If true, logs will be in JSON format. Set `litellm.json_logs = True` if you need to store the logs as JSON. Currently only the raw POST request from litellm is logged as JSON. [Further docs](./debugging) |
-| default_fallbacks | array of strings | List of fallback models to use if a specific model group is misconfigured / bad. [Further docs](./reliability#default-fallbacks) |
-| request_timeout | integer | The timeout for requests in seconds. If not set, the default value is `6000 seconds`. [For reference OpenAI Python SDK defaults to `600 seconds`.](https://github.com/openai/openai-python/blob/main/src/openai/_constants.py) |
-| force_ipv4 | boolean | If true, litellm will force ipv4 for all LLM requests. Some users have seen httpx ConnectionError when using ipv6 + Anthropic API |
-| content_policy_fallbacks | array of objects | Fallbacks to use when a ContentPolicyViolationError is encountered. [Further docs](./reliability#content-policy-fallbacks) |
-| context_window_fallbacks | array of objects | Fallbacks to use when a ContextWindowExceededError is encountered. [Further docs](./reliability#context-window-fallbacks) |
-| cache | boolean | If true, enables caching. [Further docs](./caching) |
-| cache_params | object | Parameters for the cache. [Further docs](./caching) |
-| cache_params.type | string | The type of cache to initialize. Can be one of ["local", "redis", "redis-semantic", "s3", "disk", "qdrant-semantic"]. Defaults to "redis". [Further docs](./caching) |
-| cache_params.host | string | The host address for the Redis cache. Required if type is "redis". |
-| cache_params.port | integer | The port number for the Redis cache. Required if type is "redis". |
-| cache_params.password | string | The password for the Redis cache. Required if type is "redis". |
-| cache_params.namespace | string | The namespace for the Redis cache. |
-| cache_params.redis_startup_nodes | array of objects | Redis Cluster Settings. [Further docs](./caching) |
-| cache_params.service_name | string | Redis Sentinel Settings. [Further docs](./caching) |
-| cache_params.sentinel_nodes | array of arrays | Redis Sentinel Settings. [Further docs](./caching) |
-| cache_params.ttl | integer | The time (in seconds) to store entries in cache. |
-| cache_params.qdrant_semantic_cache_embedding_model | string | The embedding model to use for qdrant semantic cache. |
-| cache_params.qdrant_collection_name | string | The name of the collection to use for qdrant semantic cache. |
-| cache_params.qdrant_quantization_config | string | The quantization configuration for the qdrant semantic cache. |
-| cache_params.similarity_threshold | float | The similarity threshold for the semantic cache. |
-| cache_params.s3_bucket_name | string | The name of the S3 bucket to use for the S3 cache. |
-| cache_params.s3_region_name | string | The region name for the S3 bucket. |
-| cache_params.s3_aws_access_key_id | string | The AWS access key ID for the S3 bucket. |
-| cache_params.s3_aws_secret_access_key | string | The AWS secret access key for the S3 bucket. |
-| cache_params.s3_endpoint_url | string | Optional - The endpoint URL for the S3 bucket. |
-| cache_params.supported_call_types | array of strings | The types of calls to cache. [Further docs](./caching) |
-| cache_params.mode | string | The mode of the cache. [Further docs](./caching) |
-| disable_end_user_cost_tracking | boolean | If true, turns off end user cost tracking on prometheus metrics + litellm spend logs table on proxy. |
-| key_generation_settings | object | Restricts who can generate keys. [Further docs](./virtual_keys.md#restricting-key-generation) |
-
-### general_settings - Reference
-
-| Name | Type | Description |
-|------|------|-------------|
-| completion_model | string | The default model to use for completions when `model` is not specified in the request |
-| disable_spend_logs | boolean | If true, turns off writing each transaction to the database |
-| disable_master_key_return | boolean | If true, turns off returning master key on UI. (checked on '/user/info' endpoint) |
-| disable_retry_on_max_parallel_request_limit_error | boolean | If true, turns off retries when max parallel request limit is reached |
-| disable_reset_budget | boolean | If true, turns off reset budget scheduled task |
-| disable_adding_master_key_hash_to_db | boolean | If true, turns off storing master key hash in db |
-| enable_jwt_auth | boolean | If true, allows the proxy admin to authenticate via JWT tokens with 'litellm_proxy_admin' in claims. [Doc on JWT Tokens](token_auth) |
-| enforce_user_param | boolean | If true, requires all OpenAI endpoint requests to have a 'user' param. [Doc on call hooks](call_hooks)|
-| allowed_routes | array of strings | List of allowed proxy API routes a user can access [Doc on controlling allowed routes](enterprise#control-available-public-private-routes)|
-| key_management_system | string | Specifies the key management system. [Doc Secret Managers](../secret) |
-| master_key | string | The master key for the proxy [Set up Virtual Keys](virtual_keys) |
-| database_url | string | The URL for the database connection [Set up Virtual Keys](virtual_keys) |
-| database_connection_pool_limit | integer | The limit for database connection pool [Setting DB Connection Pool limit](#configure-db-pool-limits--connection-timeouts) |
-| database_connection_timeout | integer | The timeout for database connections in seconds [Setting DB Connection Pool limit, timeout](#configure-db-pool-limits--connection-timeouts) |
-| allow_requests_on_db_unavailable | boolean | If true, allows requests to succeed even if DB is unreachable. **Only use this if running LiteLLM in your VPC** This will allow requests to work even when LiteLLM cannot connect to the DB to verify a Virtual Key |
-| custom_auth | string | Write your own custom authentication logic [Doc Custom Auth](virtual_keys#custom-auth) |
-| max_parallel_requests | integer | The max parallel requests allowed per deployment |
-| global_max_parallel_requests | integer | The max parallel requests allowed on the proxy overall |
-| infer_model_from_keys | boolean | If true, infers the model from the provided keys |
-| background_health_checks | boolean | If true, enables background health checks. [Doc on health checks](health) |
-| health_check_interval | integer | The interval for health checks in seconds [Doc on health checks](health) |
-| alerting | array of strings | List of alerting methods [Doc on Slack Alerting](alerting) |
-| alerting_threshold | integer | The threshold for triggering alerts [Doc on Slack Alerting](alerting) |
-| use_client_credentials_pass_through_routes | boolean | If true, uses client credentials for all pass-through routes. [Doc on pass through routes](pass_through) |
-| health_check_details | boolean | If false, hides health check details (e.g. remaining rate limit). [Doc on health checks](health) |
-| public_routes | List[str] | (Enterprise Feature) Control list of public routes |
-| alert_types | List[str] | Control list of alert types to send to Slack [Doc on alert types](./alerting.md) |
-| enforced_params | List[str] | (Enterprise Feature) List of params that must be included in all requests to the proxy |
-| enable_oauth2_auth | boolean | (Enterprise Feature) If true, enables oauth2.0 authentication |
-| use_x_forwarded_for | str | If true, uses the X-Forwarded-For header to get the client IP address |
-| service_account_settings | List[Dict[str, Any]] | Set `service_account_settings` if you want to create settings that only apply to service account keys [Doc on service accounts](./service_accounts.md) |
-| image_generation_model | str | The default model to use for image generation - ignores model set in request |
-| store_model_in_db | boolean | If true, allows `/model/new` endpoint to store model information in db. Endpoint disabled by default. [Doc on `/model/new` endpoint](./model_management.md#create-a-new-model) |
-| max_request_size_mb | int | The maximum size for requests in MB. Requests above this size will be rejected. |
-| max_response_size_mb | int | The maximum size for responses in MB. LLM Responses above this size will not be sent. |
-| proxy_budget_rescheduler_min_time | int | The minimum time (in seconds) to wait before checking db for budget resets. **Default is 597 seconds** |
-| proxy_budget_rescheduler_max_time | int | The maximum time (in seconds) to wait before checking db for budget resets. **Default is 605 seconds** |
-| proxy_batch_write_at | int | Time (in seconds) to wait before batch writing spend logs to the db. **Default is 10 seconds** |
-| alerting_args | dict | Args for Slack Alerting [Doc on Slack Alerting](./alerting.md) |
-| custom_key_generate | str | Custom function for key generation [Doc on custom key generation](./virtual_keys.md#custom--key-generate) |
-| allowed_ips | List[str] | List of IPs allowed to access the proxy. If not set, all IPs are allowed. |
-| embedding_model | str | The default model to use for embeddings - ignores model set in request |
-| default_team_disabled | boolean | If true, users cannot create 'personal' keys (keys with no team_id). |
-| alert_to_webhook_url | Dict[str] | [Specify a webhook url for each alert type.](./alerting.md#set-specific-slack-channels-per-alert-type) |
-| key_management_settings | List[Dict[str, Any]] | Settings for key management system (e.g. AWS KMS, Azure Key Vault) [Doc on key management](../secret.md) |
-| allow_user_auth | boolean | (Deprecated) old approach for user authentication. |
-| user_api_key_cache_ttl | int | The time (in seconds) to cache user api keys in memory. |
-| disable_prisma_schema_update | boolean | If true, turns off automatic schema updates to DB |
-| litellm_key_header_name | str | If set, allows passing LiteLLM keys as a custom header. [Doc on custom headers](./virtual_keys.md#custom-headers) |
-| moderation_model | str | The default model to use for moderation. |
-| custom_sso | str | Path to a python file that implements custom SSO logic. [Doc on custom SSO](./custom_sso.md) |
-| allow_client_side_credentials | boolean | If true, allows passing client side credentials to the proxy. (Useful when testing finetuning models) [Doc on client side credentials](./virtual_keys.md#client-side-credentials) |
-| admin_only_routes | List[str] | (Enterprise Feature) List of routes that are only accessible to admin users. [Doc on admin only routes](./enterprise#control-available-public-private-routes) |
-| use_azure_key_vault | boolean | If true, load keys from azure key vault |
-| use_google_kms | boolean | If true, load keys from google kms |
-| spend_report_frequency | str | Specify how often you want a Spend Report to be sent (e.g. "1d", "2d", "30d") [More on this](./alerting.md#spend-report-frequency) |
-| ui_access_mode | Literal["admin_only"] | If set, restricts access to the UI to admin users only. [Docs](./ui.md#restrict-ui-access) |
-| litellm_jwtauth | Dict[str, Any] | Settings for JWT authentication. [Docs](./token_auth.md) |
-| litellm_license | str | The license key for the proxy. [Docs](../enterprise.md#how-does-deployment-with-enterprise-license-work) |
-| oauth2_config_mappings | Dict[str, str] | Define the OAuth2 config mappings |
-| pass_through_endpoints | List[Dict[str, Any]] | Define the pass through endpoints. [Docs](./pass_through) |
-| enable_oauth2_proxy_auth | boolean | (Enterprise Feature) If true, enables oauth2.0 authentication |
-| forward_openai_org_id | boolean | If true, forwards the OpenAI Organization ID to the backend LLM call (if it's OpenAI). |
-| forward_client_headers_to_llm_api | boolean | If true, forwards the client headers (any `x-` headers) to the backend LLM call |
-
-### router_settings - Reference
-
-:::info
-
-Most values can also be set via `litellm_settings`. If you see overlapping values, settings on `router_settings` will override those on `litellm_settings`.
-:::
-
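The precedence rule in the note above can be pictured as a dictionary overlay: values from `litellm_settings` act as the base, and any key that also appears in `router_settings` replaces it. The sketch below only illustrates that rule; `effective_router_config` is a hypothetical helper, not the proxy's actual merge code.

```python
def effective_router_config(litellm_settings: dict, router_settings: dict) -> dict:
    """Hypothetical helper: router_settings wins over litellm_settings on overlap."""
    merged = dict(litellm_settings)   # start from the litellm_settings values
    merged.update(router_settings)    # overlapping keys are overridden
    return merged


litellm_settings = {
    "default_fallbacks": ["claude-opus"],
    "context_window_fallbacks": [{"gpt-3.5-turbo-small": ["gpt-3.5-turbo-large"]}],
}
router_settings = {"default_fallbacks": ["my-fallback-model"]}

effective = effective_router_config(litellm_settings, router_settings)
print(effective["default_fallbacks"])         # taken from router_settings
print(effective["context_window_fallbacks"])  # only set in litellm_settings, so kept
```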
-```yaml
-router_settings:
- routing_strategy: usage-based-routing-v2 # Literal["simple-shuffle", "least-busy", "usage-based-routing","latency-based-routing"], default="simple-shuffle"
- redis_host: # string
- redis_password: # string
- redis_port: # string
- enable_pre_call_check: true # bool - Before call is made check if a call is within model context window
- allowed_fails: 3 # cooldown model if it fails > allowed_fails calls in a minute.
- cooldown_time: 30 # (in seconds) how long to cooldown model if fails/min > allowed_fails
- disable_cooldowns: True # bool - Disable cooldowns for all models
- enable_tag_filtering: True # bool - Use tag based routing for requests
- retry_policy: { # Dict[str, int]: retry policy for different types of exceptions
- "AuthenticationErrorRetries": 3,
- "TimeoutErrorRetries": 3,
- "RateLimitErrorRetries": 3,
- "ContentPolicyViolationErrorRetries": 4,
- "InternalServerErrorRetries": 4
- }
- allowed_fails_policy: {
- "BadRequestErrorAllowedFails": 1000, # Allow 1000 BadRequestErrors before cooling down a deployment
- "AuthenticationErrorAllowedFails": 10, # int
- "TimeoutErrorAllowedFails": 12, # int
- "RateLimitErrorAllowedFails": 10000, # int
- "ContentPolicyViolationErrorAllowedFails": 15, # int
- "InternalServerErrorAllowedFails": 20, # int
- }
- content_policy_fallbacks: [{"claude-2": ["my-fallback-model"]}] # List[Dict[str, List[str]]]: Fallback model for content policy violations
- fallbacks: [{"claude-2": ["my-fallback-model"]}] # List[Dict[str, List[str]]]: Fallback model for all errors
-```
-
-| Name | Type | Description |
-|------|------|-------------|
-| routing_strategy | string | The strategy used for routing requests. Options: "simple-shuffle", "least-busy", "usage-based-routing", "latency-based-routing". Default is "simple-shuffle". [More information here](../routing) |
-| redis_host | string | The host address for the Redis server. **Only set this if you have multiple instances of LiteLLM Proxy and want current tpm/rpm tracking to be shared across them** |
-| redis_password | string | The password for the Redis server. **Only set this if you have multiple instances of LiteLLM Proxy and want current tpm/rpm tracking to be shared across them** |
-| redis_port | string | The port number for the Redis server. **Only set this if you have multiple instances of LiteLLM Proxy and want current tpm/rpm tracking to be shared across them**|
-| enable_pre_call_check | boolean | If true, checks if a call is within the model's context window before making the call. [More information here](reliability) |
-| content_policy_fallbacks | array of objects | Specifies fallback models for content policy violations. [More information here](reliability) |
-| fallbacks | array of objects | Specifies fallback models for all types of errors. [More information here](reliability) |
-| enable_tag_filtering | boolean | If true, uses tag based routing for requests [Tag Based Routing](tag_routing) |
-| cooldown_time | integer | The duration (in seconds) to cooldown a model if it exceeds the allowed failures. |
-| disable_cooldowns | boolean | If true, disables cooldowns for all models. [More information here](reliability) |
-| retry_policy | object | Specifies the number of retries for different types of exceptions. [More information here](reliability) |
-| allowed_fails | integer | The number of failures allowed before cooling down a model. [More information here](reliability) |
-| allowed_fails_policy | object | Specifies the number of allowed failures for different error types before cooling down a deployment. [More information here](reliability) |
-| default_max_parallel_requests | Optional[int] | The default maximum number of parallel requests for a deployment. |
-| default_priority | (Optional[int]) | The default priority for a request. Only for '.scheduler_acompletion()'. Default is None. |
-| polling_interval | (Optional[float]) | frequency of polling queue. Only for '.scheduler_acompletion()'. Default is 3ms. |
-| max_fallbacks | Optional[int] | The maximum number of fallbacks to try before exiting the call. Defaults to 5. |
-| default_litellm_params | Optional[dict] | The default litellm parameters to add to all requests (e.g. `temperature`, `max_tokens`). |
-| timeout | Optional[float] | The default timeout for a request. |
-| debug_level | Literal["DEBUG", "INFO"] | The debug level for the logging library in the router. Defaults to "INFO". |
-| client_ttl | int | Time-to-live for cached clients in seconds. Defaults to 3600. |
-| cache_kwargs | dict | Additional keyword arguments for the cache initialization. |
-| routing_strategy_args | dict | Additional keyword arguments for the routing strategy - e.g. lowest latency routing default ttl |
-| model_group_alias | dict | Model group alias mapping. E.g. `{"claude-3-haiku": "claude-3-haiku-20240229"}` |
-| num_retries | int | Number of retries for a request. Defaults to 3. |
-| default_fallbacks | Optional[List[str]] | Fallbacks to try if no model group-specific fallbacks are defined. |
-| caching_groups | Optional[List[tuple]] | List of model groups for caching across model groups. Defaults to None. - e.g. caching_groups=[("openai-gpt-3.5-turbo", "azure-gpt-3.5-turbo")]|
-| alerting_config | AlertingConfig | [SDK-only arg] Slack alerting configuration. Defaults to None. [Further Docs](../routing.md#alerting-) |
-| assistants_config | AssistantsConfig | Set on proxy via `assistant_settings`. [Further docs](../assistants.md) |
-| set_verbose | boolean | [DEPRECATED PARAM - see debug docs](./debugging.md) If true, sets the logging level to verbose. |
-| retry_after | int | Time to wait before retrying a request in seconds. Defaults to 0. If `x-retry-after` is received from LLM API, this value is overridden. |
-| provider_budget_config | ProviderBudgetConfig | Provider budget configuration. Use this to set llm_provider budget limits, e.g. $100/day to OpenAI, $100/day to Azure. Defaults to None. [Further Docs](./provider_budget_routing.md) |
-| enable_pre_call_checks | boolean | If true, checks if a call is within the model's context window before making the call. [More information here](reliability) |
-| model_group_retry_policy | Dict[str, RetryPolicy] | [SDK-only arg] Set retry policy for model groups. |
-| context_window_fallbacks | List[Dict[str, List[str]]] | Fallback models for context window violations. |
-| redis_url | str | URL for Redis server. **Known performance issue with Redis URL.** |
-| cache_responses | boolean | Flag to enable caching LLM Responses, if cache set under `router_settings`. If true, caches responses. Defaults to False. |
-| router_general_settings | RouterGeneralSettings | [SDK-Only] Router general settings - contains optimizations like 'async_only_mode'. [Docs](../routing.md#router-general-settings) |
-
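To make the interaction between `allowed_fails` and `cooldown_time` concrete, here is a minimal sketch of the behaviour the settings describe: a deployment is cooled down for `cooldown_time` seconds once its failures within a minute exceed `allowed_fails`. The `CooldownTracker` class and its methods are hypothetical and written only for illustration; the router's real bookkeeping may differ (for example, it can apply per-error limits from `allowed_fails_policy`).

```python
from collections import defaultdict, deque


class CooldownTracker:
    """Hypothetical sketch of allowed_fails / cooldown_time; not LiteLLM's code."""

    def __init__(self, allowed_fails: int = 3, cooldown_time: float = 30.0):
        self.allowed_fails = allowed_fails
        self.cooldown_time = cooldown_time
        self._failures = defaultdict(deque)  # deployment -> recent failure timestamps
        self._cooldown_until = {}            # deployment -> time the cooldown ends

    def record_failure(self, deployment: str, now: float) -> None:
        window = self._failures[deployment]
        window.append(now)
        # Keep only failures from the last 60 seconds.
        while window and now - window[0] > 60:
            window.popleft()
        # Cool down once failures per minute exceed allowed_fails.
        if len(window) > self.allowed_fails:
            self._cooldown_until[deployment] = now + self.cooldown_time
            window.clear()

    def is_available(self, deployment: str, now: float) -> bool:
        return now >= self._cooldown_until.get(deployment, 0.0)


tracker = CooldownTracker(allowed_fails=3, cooldown_time=30)
for t in (0, 1, 2):
    tracker.record_failure("azure-gpt", now=t)
print(tracker.is_available("azure-gpt", now=3))   # True: 3 failures is not > 3

tracker.record_failure("azure-gpt", now=3)        # 4th failure within the minute
print(tracker.is_available("azure-gpt", now=4))   # False: cooling down until t=33
print(tracker.is_available("azure-gpt", now=33))  # True: cooldown has elapsed
```

Timestamps are passed in explicitly so the behaviour can be checked without waiting on a real clock.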
-### environment variables - Reference
-
-| Name | Description |
-|------|-------------|
-| ACTIONS_ID_TOKEN_REQUEST_TOKEN | Token for requesting ID in GitHub Actions
-| ACTIONS_ID_TOKEN_REQUEST_URL | URL for requesting ID token in GitHub Actions
-| AISPEND_ACCOUNT_ID | Account ID for AI Spend
-| AISPEND_API_KEY | API Key for AI Spend
-| ALLOWED_EMAIL_DOMAINS | List of email domains allowed for access
-| ARIZE_API_KEY | API key for Arize platform integration
-| ARIZE_SPACE_KEY | Space key for Arize platform
-| ARGILLA_BATCH_SIZE | Batch size for Argilla logging
-| ARGILLA_API_KEY | API key for Argilla platform
-| ARGILLA_SAMPLING_RATE | Sampling rate for Argilla logging
-| ARGILLA_DATASET_NAME | Dataset name for Argilla logging
-| ARGILLA_BASE_URL | Base URL for Argilla service
-| ATHINA_API_KEY | API key for Athina service
-| AUTH_STRATEGY | Strategy used for authentication (e.g., OAuth, API key)
-| AWS_ACCESS_KEY_ID | Access Key ID for AWS services
-| AWS_PROFILE_NAME | AWS CLI profile name to be used
-| AWS_REGION_NAME | Default AWS region for service interactions
-| AWS_ROLE_NAME | Role name for AWS IAM usage
-| AWS_SECRET_ACCESS_KEY | Secret Access Key for AWS services
-| AWS_SESSION_NAME | Name for AWS session
-| AWS_WEB_IDENTITY_TOKEN | Web identity token for AWS
-| AZURE_API_VERSION | Version of the Azure API being used
-| AZURE_AUTHORITY_HOST | Azure authority host URL
-| AZURE_CLIENT_ID | Client ID for Azure services
-| AZURE_CLIENT_SECRET | Client secret for Azure services
-| AZURE_FEDERATED_TOKEN_FILE | File path to Azure federated token
-| AZURE_KEY_VAULT_URI | URI for Azure Key Vault
-| AZURE_TENANT_ID | Tenant ID for Azure Active Directory
-| BERRISPEND_ACCOUNT_ID | Account ID for BerriSpend service
-| BRAINTRUST_API_KEY | API key for Braintrust integration
-| CIRCLE_OIDC_TOKEN | OpenID Connect token for CircleCI
-| CIRCLE_OIDC_TOKEN_V2 | Version 2 of the OpenID Connect token for CircleCI
-| CONFIG_FILE_PATH | File path for configuration file
-| CUSTOM_TIKTOKEN_CACHE_DIR | Custom directory for Tiktoken cache
-| DATABASE_HOST | Hostname for the database server
-| DATABASE_NAME | Name of the database
-| DATABASE_PASSWORD | Password for the database user
-| DATABASE_PORT | Port number for database connection
-| DATABASE_SCHEMA | Schema name used in the database
-| DATABASE_URL | Connection URL for the database
-| DATABASE_USER | Username for database connection
-| DATABASE_USERNAME | Alias for database user
-| DATABRICKS_API_BASE | Base URL for Databricks API
-| DD_BASE_URL | Base URL for Datadog integration
-| DATADOG_BASE_URL | (Alternative to DD_BASE_URL) Base URL for Datadog integration
-| _DATADOG_BASE_URL | (Alternative to DD_BASE_URL) Base URL for Datadog integration
-| DD_API_KEY | API key for Datadog integration
-| DD_SITE | Site URL for Datadog (e.g., datadoghq.com)
-| DD_SOURCE | Source identifier for Datadog logs
-| DD_ENV | Environment identifier for Datadog logs. Only supported for `datadog_llm_observability` callback
-| DD_SERVICE | Service identifier for Datadog logs. Defaults to "litellm-server"
-| DD_VERSION | Version identifier for Datadog logs. Defaults to "unknown"
-| DEBUG_OTEL | Enable debug mode for OpenTelemetry
-| DIRECT_URL | Direct URL for service endpoint
-| DISABLE_ADMIN_UI | Toggle to disable the admin UI
-| DISABLE_SCHEMA_UPDATE | Toggle to disable schema updates
-| DOCS_DESCRIPTION | Description text for documentation pages
-| DOCS_FILTERED | Flag indicating filtered documentation
-| DOCS_TITLE | Title of the documentation pages
-| DOCS_URL | The path to the Swagger API documentation. **By default this is "/"**
-| EMAIL_SUPPORT_CONTACT | Support contact email address
-| GCS_BUCKET_NAME | Name of the Google Cloud Storage bucket
-| GCS_PATH_SERVICE_ACCOUNT | Path to the Google Cloud service account JSON file
-| GCS_FLUSH_INTERVAL | Flush interval for GCS logging (in seconds). Specify how often you want a log to be sent to GCS. **Default is 20 seconds**
-| GCS_BATCH_SIZE | Batch size for GCS logging. Specify after how many logs you want to flush to GCS. If `BATCH_SIZE` is set to 10, logs are flushed every 10 logs. **Default is 2048**
-| GENERIC_AUTHORIZATION_ENDPOINT | Authorization endpoint for generic OAuth providers
-| GENERIC_CLIENT_ID | Client ID for generic OAuth providers
-| GENERIC_CLIENT_SECRET | Client secret for generic OAuth providers
-| GENERIC_CLIENT_STATE | State parameter for generic client authentication
-| GENERIC_INCLUDE_CLIENT_ID | Include client ID in requests for OAuth
-| GENERIC_SCOPE | Scope settings for generic OAuth providers
-| GENERIC_TOKEN_ENDPOINT | Token endpoint for generic OAuth providers
-| GENERIC_USER_DISPLAY_NAME_ATTRIBUTE | Attribute for user's display name in generic auth
-| GENERIC_USER_EMAIL_ATTRIBUTE | Attribute for user's email in generic auth
-| GENERIC_USER_FIRST_NAME_ATTRIBUTE | Attribute for user's first name in generic auth
-| GENERIC_USER_ID_ATTRIBUTE | Attribute for user ID in generic auth
-| GENERIC_USER_LAST_NAME_ATTRIBUTE | Attribute for user's last name in generic auth
-| GENERIC_USER_PROVIDER_ATTRIBUTE | Attribute specifying the user's provider
-| GENERIC_USER_ROLE_ATTRIBUTE | Attribute specifying the user's role
-| GENERIC_USERINFO_ENDPOINT | Endpoint to fetch user information in generic OAuth
-| GALILEO_BASE_URL | Base URL for Galileo platform
-| GALILEO_PASSWORD | Password for Galileo authentication
-| GALILEO_PROJECT_ID | Project ID for Galileo usage
-| GALILEO_USERNAME | Username for Galileo authentication
-| GREENSCALE_API_KEY | API key for Greenscale service
-| GREENSCALE_ENDPOINT | Endpoint URL for Greenscale service
-| GOOGLE_APPLICATION_CREDENTIALS | Path to Google Cloud credentials JSON file
-| GOOGLE_CLIENT_ID | Client ID for Google OAuth
-| GOOGLE_CLIENT_SECRET | Client secret for Google OAuth
-| GOOGLE_KMS_RESOURCE_NAME | Name of the resource in Google KMS
-| HF_API_BASE | Base URL for Hugging Face API
-| HELICONE_API_KEY | API key for Helicone service
-| HUGGINGFACE_API_BASE | Base URL for Hugging Face API
-| IAM_TOKEN_DB_AUTH | IAM token for database authentication
-| JSON_LOGS | Enable JSON formatted logging
-| JWT_AUDIENCE | Expected audience for JWT tokens
-| JWT_PUBLIC_KEY_URL | URL to fetch public key for JWT verification
-| LAGO_API_BASE | Base URL for Lago API
-| LAGO_API_CHARGE_BY | Parameter to determine charge basis in Lago
-| LAGO_API_EVENT_CODE | Event code for Lago API events
-| LAGO_API_KEY | API key for accessing Lago services
-| LANGFUSE_DEBUG | Toggle debug mode for Langfuse
-| LANGFUSE_FLUSH_INTERVAL | Interval for flushing Langfuse logs
-| LANGFUSE_HOST | Host URL for Langfuse service
-| LANGFUSE_PUBLIC_KEY | Public key for Langfuse authentication
-| LANGFUSE_RELEASE | Release version of Langfuse integration
-| LANGFUSE_SECRET_KEY | Secret key for Langfuse authentication
-| LANGSMITH_API_KEY | API key for Langsmith platform
-| LANGSMITH_BASE_URL | Base URL for Langsmith service
-| LANGSMITH_BATCH_SIZE | Batch size for operations in Langsmith
-| LANGSMITH_DEFAULT_RUN_NAME | Default name for Langsmith run
-| LANGSMITH_PROJECT | Project name for Langsmith integration
-| LANGSMITH_SAMPLING_RATE | Sampling rate for Langsmith logging
-| LANGTRACE_API_KEY | API key for Langtrace service
-| LITERAL_API_KEY | API key for Literal integration
-| LITERAL_API_URL | API URL for Literal service
-| LITERAL_BATCH_SIZE | Batch size for Literal operations
-| LITELLM_DONT_SHOW_FEEDBACK_BOX | Flag to hide feedback box in LiteLLM UI
-| LITELLM_DROP_PARAMS | Parameters to drop in LiteLLM requests
-| LITELLM_EMAIL | Email associated with LiteLLM account
-| LITELLM_GLOBAL_MAX_PARALLEL_REQUEST_RETRIES | Maximum retries for parallel requests in LiteLLM
-| LITELLM_GLOBAL_MAX_PARALLEL_REQUEST_RETRY_TIMEOUT | Timeout for retries of parallel requests in LiteLLM
-| LITELLM_HOSTED_UI | URL of the hosted UI for LiteLLM
-| LITELLM_LICENSE | License key for LiteLLM usage
-| LITELLM_LOCAL_MODEL_COST_MAP | Local configuration for model cost mapping in LiteLLM
-| LITELLM_LOG | Enable detailed logging for LiteLLM
-| LITELLM_MODE | Operating mode for LiteLLM (e.g., production, development)
-| LITELLM_SALT_KEY | Salt key for encryption in LiteLLM
-| LITELLM_SECRET_AWS_KMS_LITELLM_LICENSE | AWS KMS encrypted license for LiteLLM
-| LITELLM_TOKEN | Access token for LiteLLM integration
-| LOGFIRE_TOKEN | Token for Logfire logging service
-| MICROSOFT_CLIENT_ID | Client ID for Microsoft services
-| MICROSOFT_CLIENT_SECRET | Client secret for Microsoft services
-| MICROSOFT_TENANT | Tenant ID for Microsoft Azure
-| NO_DOCS | Flag to disable documentation generation
-| NO_PROXY | List of addresses to bypass proxy
-| OAUTH_TOKEN_INFO_ENDPOINT | Endpoint for OAuth token info retrieval
-| OPENAI_API_BASE | Base URL for OpenAI API
-| OPENAI_API_KEY | API key for OpenAI services
-| OPENAI_ORGANIZATION | Organization identifier for OpenAI
-| OPENID_BASE_URL | Base URL for OpenID Connect services
-| OPENID_CLIENT_ID | Client ID for OpenID Connect authentication
-| OPENID_CLIENT_SECRET | Client secret for OpenID Connect authentication
-| OPENMETER_API_ENDPOINT | API endpoint for OpenMeter integration
-| OPENMETER_API_KEY | API key for OpenMeter services
-| OPENMETER_EVENT_TYPE | Type of events sent to OpenMeter
-| OTEL_ENDPOINT | OpenTelemetry endpoint for traces
-| OTEL_ENVIRONMENT_NAME | Environment name for OpenTelemetry
-| OTEL_EXPORTER | Exporter type for OpenTelemetry
-| OTEL_HEADERS | Headers for OpenTelemetry requests
-| OTEL_SERVICE_NAME | Service name identifier for OpenTelemetry
-| OTEL_TRACER_NAME | Tracer name for OpenTelemetry tracing
-| PREDIBASE_API_BASE | Base URL for Predibase API
-| PRESIDIO_ANALYZER_API_BASE | Base URL for Presidio Analyzer service
-| PRESIDIO_ANONYMIZER_API_BASE | Base URL for Presidio Anonymizer service
-| PROMETHEUS_URL | URL for Prometheus service
-| PROMPTLAYER_API_KEY | API key for PromptLayer integration
-| PROXY_ADMIN_ID | Admin identifier for proxy server
-| PROXY_BASE_URL | Base URL for proxy service
-| PROXY_LOGOUT_URL | URL for logging out of the proxy service
-| PROXY_MASTER_KEY | Master key for proxy authentication
-| QDRANT_API_BASE | Base URL for Qdrant API
-| QDRANT_API_KEY | API key for Qdrant service
-| QDRANT_URL | Connection URL for Qdrant database
-| REDIS_HOST | Hostname for Redis server
-| REDIS_PASSWORD | Password for Redis service
-| REDIS_PORT | Port number for Redis server
-| REDOC_URL | The path to the Redoc Fast API documentation. **By default this is "/redoc"**
-| SERVER_ROOT_PATH | Root path for the server application
-| SET_VERBOSE | Flag to enable verbose logging
-| SLACK_DAILY_REPORT_FREQUENCY | Frequency of daily Slack reports (e.g., daily, weekly)
-| SLACK_WEBHOOK_URL | Webhook URL for Slack integration
-| SMTP_HOST | Hostname for the SMTP server
-| SMTP_PASSWORD | Password for SMTP authentication
-| SMTP_PORT | Port number for SMTP server
-| SMTP_SENDER_EMAIL | Email address used as the sender in SMTP transactions
-| SMTP_SENDER_LOGO | Logo used in emails sent via SMTP
-| SMTP_TLS | Flag to enable or disable TLS for SMTP connections
-| SMTP_USERNAME | Username for SMTP authentication
-| SPEND_LOGS_URL | URL for retrieving spend logs
-| SSL_CERTIFICATE | Path to the SSL certificate file
-| SSL_VERIFY | Flag to enable or disable SSL certificate verification
-| SUPABASE_KEY | API key for Supabase service
-| SUPABASE_URL | Base URL for Supabase instance
-| TEST_EMAIL_ADDRESS | Email address used for testing purposes
-| UI_LOGO_PATH | Path to the logo image used in the UI
-| UI_PASSWORD | Password for accessing the UI
-| UI_USERNAME | Username for accessing the UI
-| UPSTREAM_LANGFUSE_DEBUG | Flag to enable debugging for upstream Langfuse
-| UPSTREAM_LANGFUSE_HOST | Host URL for upstream Langfuse service
-| UPSTREAM_LANGFUSE_PUBLIC_KEY | Public key for upstream Langfuse authentication
-| UPSTREAM_LANGFUSE_RELEASE | Release version identifier for upstream Langfuse
-| UPSTREAM_LANGFUSE_SECRET_KEY | Secret key for upstream Langfuse authentication
-| USE_AWS_KMS | Flag to enable AWS Key Management Service for encryption
-| WEBHOOK_URL | URL for receiving webhooks from external services
-
diff --git a/docs/my-website/docs/proxy/configs.md b/docs/my-website/docs/proxy/configs.md
deleted file mode 100644
index 7876c9dec..000000000
--- a/docs/my-website/docs/proxy/configs.md
+++ /dev/null
@@ -1,618 +0,0 @@
-import Image from '@theme/IdealImage';
-import Tabs from '@theme/Tabs';
-import TabItem from '@theme/TabItem';
-
-# Overview
-Set the model list, `api_base`, `api_key`, `temperature` & proxy server settings (`master_key`) in the config.yaml.
-
-| Param Name | Description |
-|----------------------|---------------------------------------------------------------|
-| `model_list` | List of supported models on the server, with model-specific configs |
-| `router_settings` | litellm Router settings, example `routing_strategy="least-busy"` [**see all**](#router-settings)|
-| `litellm_settings` | litellm Module settings, example `litellm.drop_params=True`, `litellm.set_verbose=True`, `litellm.api_base`, `litellm.cache` [**see all**](#all-settings)|
-| `general_settings` | Server settings, example setting `master_key: sk-my_special_key` |
-| `environment_variables` | Environment Variables example, `REDIS_HOST`, `REDIS_PORT` |
-
-**Complete List:** Check the Swagger UI docs on `/#/config.yaml` (e.g. http://0.0.0.0:4000/#/config.yaml), for everything you can pass in the config.yaml.
-
-
-## Quick Start
-
-Set a model alias for your deployments.
-
-In the `config.yaml`, the `model_name` parameter is the user-facing name for your deployment.
-
-In the config below:
-- `model_name`: the name to pass TO litellm from the external client
-- `litellm_params.model`: the model string passed to the litellm.completion() function
-
-E.g.:
-- `model=vllm-models` will route to `openai/facebook/opt-125m`.
-- `model=gpt-3.5-turbo` will load balance between `azure/gpt-turbo-small-eu` and `azure/gpt-turbo-small-ca`
-
-```yaml
-model_list:
-  - model_name: gpt-3.5-turbo ### RECEIVED MODEL NAME ###
-    litellm_params: # all params accepted by litellm.completion() - https://docs.litellm.ai/docs/completion/input
-      model: azure/gpt-turbo-small-eu ### MODEL NAME sent to `litellm.completion()` ###
-      api_base: https://my-endpoint-europe-berri-992.openai.azure.com/
-      api_key: "os.environ/AZURE_API_KEY_EU" # does os.getenv("AZURE_API_KEY_EU")
-      rpm: 6 # [OPTIONAL] Rate limit for this deployment: in requests per minute (rpm)
-  - model_name: bedrock-claude-v1
-    litellm_params:
-      model: bedrock/anthropic.claude-instant-v1
-  - model_name: gpt-3.5-turbo
-    litellm_params:
-      model: azure/gpt-turbo-small-ca
-      api_base: https://my-endpoint-canada-berri992.openai.azure.com/
-      api_key: "os.environ/AZURE_API_KEY_CA"
-      rpm: 6
-  - model_name: anthropic-claude
-    litellm_params:
-      model: bedrock/anthropic.claude-instant-v1
-      ### [OPTIONAL] SET AWS REGION ###
-      aws_region_name: us-east-1
-  - model_name: vllm-models
-    litellm_params:
-      model: openai/facebook/opt-125m # the `openai/` prefix tells litellm it's openai compatible
-      api_base: http://0.0.0.0:4000/v1
-      api_key: none
-      rpm: 1440
-    model_info:
-      version: 2
-
-  # Use this if you want to make requests to `claude-3-haiku-20240307`,`claude-3-opus-20240229`,`claude-2.1` without defining them on the config.yaml
-  # Default models
-  # Works for ALL Providers and needs the default provider credentials in .env
-  - model_name: "*"
-    litellm_params:
-      model: "*"
-
-litellm_settings: # module level litellm settings - https://github.com/BerriAI/litellm/blob/main/litellm/__init__.py
-  drop_params: True
-  success_callback: ["langfuse"] # OPTIONAL - if you want to start sending LLM Logs to Langfuse. Make sure to set `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY` in your env
-
-general_settings:
-  master_key: sk-1234 # [OPTIONAL] Only use this if you want to require all calls to contain this key (Authorization: Bearer sk-1234)
-  alerting: ["slack"] # [OPTIONAL] If you want Slack Alerts for Hanging LLM requests, Slow LLM responses, Budget Alerts. Make sure to set `SLACK_WEBHOOK_URL` in your env
-```
-:::info
-
-For more provider-specific info, [go here](../providers/)
-
-:::
-
-#### Step 2: Start Proxy with config
-
-```shell
-$ litellm --config /path/to/config.yaml
-```
-
-:::tip
-
-Run with `--detailed_debug` if you need detailed debug logs
-
-```shell
-$ litellm --config /path/to/config.yaml --detailed_debug
-```
-
-:::
-
-#### Step 3: Test it
-
-This sends a request to the model where `model_name=gpt-3.5-turbo` on config.yaml.
-
-If multiple deployments share `model_name=gpt-3.5-turbo`, the proxy does [Load Balancing](https://docs.litellm.ai/docs/proxy/load_balancing) across them.
-
-**[Langchain, OpenAI SDK Usage Examples](../proxy/user_keys#request-format)**
-
-```shell
-curl --location 'http://0.0.0.0:4000/chat/completions' \
---header 'Content-Type: application/json' \
---data ' {
-      "model": "gpt-3.5-turbo",
-      "messages": [
-        {
-          "role": "user",
-          "content": "what llm are you"
-        }
-      ]
-    }
-'
-```
-
-## LLM configs `model_list`
-
-### Model-specific params (API Base, Keys, Temperature, Max Tokens, Organization, Headers etc.)
-You can use the config to save model-specific information like api_base, api_key, temperature, max_tokens, etc.
-
-[**All input params**](https://docs.litellm.ai/docs/completion/input#input-params-1)
-
-**Step 1**: Create a `config.yaml` file
-```yaml
-model_list:
-  - model_name: gpt-4-team1
-    litellm_params: # params for litellm.completion() - https://docs.litellm.ai/docs/completion/input#input---request-body
-      model: azure/chatgpt-v-2
-      api_base: https://openai-gpt-4-test-v-1.openai.azure.com/
-      api_version: "2023-05-15"
-      azure_ad_token: eyJ0eXAiOiJ
-      seed: 12
-      max_tokens: 20
-  - model_name: gpt-4-team2
-    litellm_params:
-      model: azure/gpt-4
-      api_key: sk-123
-      api_base: https://openai-gpt-4-test-v-2.openai.azure.com/
-      temperature: 0.2
-  - model_name: openai-gpt-3.5
-    litellm_params:
-      model: openai/gpt-3.5-turbo
-      extra_headers: {"AI-Resource Group": "ishaan-resource"}
-      api_key: sk-123
-      organization: org-ikDc4ex8NB
-      temperature: 0.2
-  - model_name: mistral-7b
-    litellm_params:
-      model: ollama/mistral
-      api_base: your_ollama_api_base
-```
-
-**Step 2**: Start server with config
-
-```shell
-$ litellm --config /path/to/config.yaml
-```
-
-**Expected Logs:**
-
-Look for this line in your console logs to confirm the config.yaml was loaded correctly.
-```
-LiteLLM: Proxy initialized with Config, Set models:
-```
-
-### Embedding Models - Use Sagemaker, Bedrock, Azure, OpenAI, XInference
-
-See supported Embedding Providers & Models [here](https://docs.litellm.ai/docs/embedding/supported_embedding)
-
-
-
-
-
-```yaml
-model_list:
-  - model_name: bedrock-cohere
-    litellm_params:
-      model: "bedrock/cohere.command-text-v14"
-      aws_region_name: "us-west-2"
-  - model_name: bedrock-cohere
-    litellm_params:
-      model: "bedrock/cohere.command-text-v14"
-      aws_region_name: "us-east-2"
-  - model_name: bedrock-cohere
-    litellm_params:
-      model: "bedrock/cohere.command-text-v14"
-      aws_region_name: "us-east-1"
-
-```
-
-
-
-
-
-Here's how to route between GPT-J embedding (sagemaker endpoint), Amazon Titan embedding (Bedrock) and Azure OpenAI embedding on the proxy server:
-
-```yaml
-model_list:
-  - model_name: sagemaker-embeddings
-    litellm_params:
-      model: "sagemaker/berri-benchmarking-gpt-j-6b-fp16"
-  - model_name: amazon-embeddings
-    litellm_params:
-      model: "bedrock/amazon.titan-embed-text-v1"
-  - model_name: azure-embeddings
-    litellm_params:
-      model: "azure/azure-embedding-model"
-      api_base: "os.environ/AZURE_API_BASE" # os.getenv("AZURE_API_BASE")
-      api_key: "os.environ/AZURE_API_KEY" # os.getenv("AZURE_API_KEY")
-      api_version: "2023-07-01-preview"
-
-general_settings:
-  master_key: sk-1234 # [OPTIONAL] if set all calls to proxy will require either this key or a valid generated token
-```
-
-
-
-
-LiteLLM Proxy supports all Feature-Extraction Embedding models.
-
-```yaml
-model_list:
-  - model_name: deployed-codebert-base
-    litellm_params:
-      # send request to deployed hugging face inference endpoint
-      model: huggingface/microsoft/codebert-base # add huggingface prefix so it routes to hugging face
-      api_key: hf_LdS # api key for hugging face inference endpoint
-      api_base: https://uysneno1wv2wd4lw.us-east-1.aws.endpoints.huggingface.cloud # your hf inference endpoint
-  - model_name: codebert-base
-    litellm_params:
-      # no api_base set, sends request to hugging face free inference api https://api-inference.huggingface.co/models/
-      model: huggingface/microsoft/codebert-base # add huggingface prefix so it routes to hugging face
-      api_key: hf_LdS # api key for hugging face
-
-```
-
-
-
-
-
-```yaml
-model_list:
-  - model_name: azure-embedding-model # model group
-    litellm_params:
-      model: azure/azure-embedding-model # model name for litellm.embedding(model=azure/azure-embedding-model) call
-      api_base: your-azure-api-base
-      api_key: your-api-key
-      api_version: 2023-07-01-preview
-```
-
-
-
-
-
-```yaml
-model_list:
-- model_name: text-embedding-ada-002 # model group
-  litellm_params:
-    model: text-embedding-ada-002 # model name for litellm.embedding(model=text-embedding-ada-002)
-    api_key: your-api-key-1
-- model_name: text-embedding-ada-002
-  litellm_params:
-    model: text-embedding-ada-002
-    api_key: your-api-key-2
-```
-
-
-
-
-
-
-https://docs.litellm.ai/docs/providers/xinference
-
-**Note: add the `xinference/` prefix to `litellm_params`: `model` so litellm knows to route the request to Xinference's OpenAI-compatible endpoint**
-
-```yaml
-model_list:
-- model_name: embedding-model # model group
-  litellm_params:
-    model: xinference/bge-base-en # model name for litellm.embedding(model=xinference/bge-base-en)
-    api_base: http://0.0.0.0:9997/v1
-```
-
-
-
-
-
-