diff --git a/docs/my-website/docs/proxy/load_balancing.md b/docs/my-website/docs/proxy/load_balancing.md index 691592cb6..ff3a351c6 100644 --- a/docs/my-website/docs/proxy/load_balancing.md +++ b/docs/my-website/docs/proxy/load_balancing.md @@ -1,4 +1,4 @@ -# Load Balancing - Config Setup +# Multiple Instances Load balance multiple instances of the same model The proxy will handle routing requests (using LiteLLM's Router). **Set `rpm` in the config if you want maximize throughput** @@ -10,75 +10,6 @@ For more details on routing strategies / params, see [Routing](../routing.md) ::: -## Quick Start - Load Balancing -### Step 1 - Set deployments on config - -**Example config below**. Here requests with `model=gpt-3.5-turbo` will be routed across multiple instances of `azure/gpt-3.5-turbo` -```yaml -model_list: - - model_name: gpt-3.5-turbo - litellm_params: - model: azure/ - api_base: - api_key: - rpm: 6 # Rate limit for this deployment: in requests per minute (rpm) - - model_name: gpt-3.5-turbo - litellm_params: - model: azure/gpt-turbo-small-ca - api_base: https://my-endpoint-canada-berri992.openai.azure.com/ - api_key: - rpm: 6 - - model_name: gpt-3.5-turbo - litellm_params: - model: azure/gpt-turbo-large - api_base: https://openai-france-1234.openai.azure.com/ - api_key: - rpm: 1440 -``` - -### Step 2: Start Proxy with config - -```shell -$ litellm --config /path/to/config.yaml -``` - -### Step 3: Use proxy - Call a model group [Load Balancing] -Curl Command -```shell -curl --location 'http://0.0.0.0:4000/chat/completions' \ ---header 'Content-Type: application/json' \ ---data ' { - "model": "gpt-3.5-turbo", - "messages": [ - { - "role": "user", - "content": "what llm are you" - } - ], - } -' -``` - -### Usage - Call a specific model deployment -If you want to call a specific model defined in the `config.yaml`, you can call the `litellm_params: model` - -In this example it will call `azure/gpt-turbo-small-ca`. Defined in the config on Step 1 - -```bash -curl --location 'http://0.0.0.0:4000/chat/completions' \ ---header 'Content-Type: application/json' \ ---data ' { - "model": "azure/gpt-turbo-small-ca", - "messages": [ - { - "role": "user", - "content": "what llm are you" - } - ], - } -' -``` - ## Load Balancing using multiple litellm instances (Kubernetes, Auto Scaling) LiteLLM Proxy supports sharing rpm/tpm shared across multiple litellm instances, pass `redis_host`, `redis_password` and `redis_port` to enable this. (LiteLLM will use Redis to track rpm/tpm usage ) diff --git a/docs/my-website/docs/proxy/reliability.md b/docs/my-website/docs/proxy/reliability.md index 7527a3d5b..51e90fe39 100644 --- a/docs/my-website/docs/proxy/reliability.md +++ b/docs/my-website/docs/proxy/reliability.md @@ -2,7 +2,9 @@ import Image from '@theme/IdealImage'; import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; -# Fallbacks, Retries, Timeouts, Cooldowns +# 🔥 Fallbacks, Retries, Timeouts, Load Balancing + +Retry call with multiple instances of the same model. If a call fails after num_retries, fall back to another model group. @@ -10,6 +12,77 @@ If the error is a context window exceeded error, fall back to a larger model gro [**See Code**](https://github.com/BerriAI/litellm/blob/main/litellm/router.py) +## Quick Start - Load Balancing +### Step 1 - Set deployments on config + +**Example config below**. 
Here requests with `model=gpt-3.5-turbo` will be routed across multiple instances of `azure/gpt-3.5-turbo` +```yaml +model_list: + - model_name: gpt-3.5-turbo + litellm_params: + model: azure/ + api_base: + api_key: + rpm: 6 # Rate limit for this deployment: in requests per minute (rpm) + - model_name: gpt-3.5-turbo + litellm_params: + model: azure/gpt-turbo-small-ca + api_base: https://my-endpoint-canada-berri992.openai.azure.com/ + api_key: + rpm: 6 + - model_name: gpt-3.5-turbo + litellm_params: + model: azure/gpt-turbo-large + api_base: https://openai-france-1234.openai.azure.com/ + api_key: + rpm: 1440 +``` + +### Step 2: Start Proxy with config + +```shell +$ litellm --config /path/to/config.yaml +``` + +### Step 3: Use proxy - Call a model group [Load Balancing] +Curl Command +```shell +curl --location 'http://0.0.0.0:4000/chat/completions' \ +--header 'Content-Type: application/json' \ +--data ' { + "model": "gpt-3.5-turbo", + "messages": [ + { + "role": "user", + "content": "what llm are you" + } + ], + } +' +``` + +### Usage - Call a specific model deployment +If you want to call a specific model defined in the `config.yaml`, you can call the `litellm_params: model` + +In this example it will call `azure/gpt-turbo-small-ca`. Defined in the config on Step 1 + +```bash +curl --location 'http://0.0.0.0:4000/chat/completions' \ +--header 'Content-Type: application/json' \ +--data ' { + "model": "azure/gpt-turbo-small-ca", + "messages": [ + { + "role": "user", + "content": "what llm are you" + } + ], + } +' +``` + +## Fallbacks + Retries + Timeouts + Cooldowns + **Set via config** ```yaml model_list: @@ -63,7 +136,143 @@ curl --location 'http://0.0.0.0:4000/chat/completions' \ ' ``` -## Custom Timeouts, Stream Timeouts - Per Model +## Advanced - Context Window Fallbacks + +**Before call is made** check if a call is within model context window with **`enable_pre_call_checks: true`**. + +[**See Code**](https://github.com/BerriAI/litellm/blob/c9e6b05cfb20dfb17272218e2555d6b496c47f6f/litellm/router.py#L2163) + +**1. Setup config** + +For azure deployments, set the base model. Pick the base model from [this list](https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json), all the azure models start with azure/. + + + + + +Filter older instances of a model (e.g. gpt-3.5-turbo) with smaller context windows + +```yaml +router_settings: + enable_pre_call_checks: true # 1. Enable pre-call checks + +model_list: + - model_name: gpt-3.5-turbo + litellm_params: + model: azure/chatgpt-v-2 + api_base: os.environ/AZURE_API_BASE + api_key: os.environ/AZURE_API_KEY + api_version: "2023-07-01-preview" + model_info: + base_model: azure/gpt-4-1106-preview # 2. 👈 (azure-only) SET BASE MODEL + + - model_name: gpt-3.5-turbo + litellm_params: + model: gpt-3.5-turbo-1106 + api_key: os.environ/OPENAI_API_KEY +``` + +**2. Start proxy** + +```bash +litellm --config /path/to/config.yaml + +# RUNNING on http://0.0.0.0:4000 +``` + +**3. Test it!** + +```python +import openai +client = openai.OpenAI( + api_key="anything", + base_url="http://0.0.0.0:4000" +) + +text = "What is the meaning of 42?" * 5000 + +# request sent to model set on litellm proxy, `litellm --model` +response = client.chat.completions.create( + model="gpt-3.5-turbo", + messages = [ + {"role": "system", "content": text}, + {"role": "user", "content": "Who was Alexander?"}, + ], +) + +print(response) +``` + + + + + +Fallback to larger models if current model is too small. 
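If you want to sanity-check that a request really will overflow the smaller deployment before relying on the fallback, litellm's `token_counter` utility gives a quick client-side estimate. A minimal sketch, assuming the published 16,385-token window of a `gpt-3.5-turbo-1106` class deployment (check the model list linked above for your deployment's real limit):

```python
import litellm

# Build an oversized prompt (same idea as the "Test it!" step below)
messages = [{"role": "user", "content": "What is the meaning of 42? " * 5000}]

# Count prompt tokens the way litellm would
tokens = litellm.token_counter(model="gpt-3.5-turbo-1106", messages=messages)
print(f"prompt tokens: {tokens}")

# 16_385 is the assumed context window for this deployment class - adjust for yours
if tokens > 16_385:
    print("Too large for the small deployment - expect the proxy to fall back to a larger group")
```

The proxy config for the fallback itself: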
+ +```yaml +router_settings: + enable_pre_call_checks: true # 1. Enable pre-call checks + +model_list: + - model_name: gpt-3.5-turbo-small + litellm_params: + model: azure/chatgpt-v-2 + api_base: os.environ/AZURE_API_BASE + api_key: os.environ/AZURE_API_KEY + api_version: "2023-07-01-preview" + model_info: + base_model: azure/gpt-4-1106-preview # 2. 👈 (azure-only) SET BASE MODEL + + - model_name: gpt-3.5-turbo-large + litellm_params: + model: gpt-3.5-turbo-1106 + api_key: os.environ/OPENAI_API_KEY + + - model_name: claude-opus + litellm_params: + model: claude-3-opus-20240229 + api_key: os.environ/ANTHROPIC_API_KEY + +litellm_settings: + context_window_fallbacks: [{"gpt-3.5-turbo-small": ["gpt-3.5-turbo-large", "claude-opus"]}] +``` + +**2. Start proxy** + +```bash +litellm --config /path/to/config.yaml + +# RUNNING on http://0.0.0.0:4000 +``` + +**3. Test it!** + +```python +import openai +client = openai.OpenAI( + api_key="anything", + base_url="http://0.0.0.0:4000" +) + +text = "What is the meaning of 42?" * 5000 + +# request sent to model set on litellm proxy, `litellm --model` +response = client.chat.completions.create( + model="gpt-3.5-turbo", + messages = [ + {"role": "system", "content": text}, + {"role": "user", "content": "Who was Alexander?"}, + ], +) + +print(response) +``` + + + + + +## Advanced - Custom Timeouts, Stream Timeouts - Per Model For each model you can set `timeout` & `stream_timeout` under `litellm_params` ```yaml model_list: @@ -92,7 +301,7 @@ $ litellm --config /path/to/config.yaml ``` -## Setting Dynamic Timeouts - Per Request +## Advanced - Setting Dynamic Timeouts - Per Request LiteLLM Proxy supports setting a `timeout` per request diff --git a/docs/my-website/docs/routing.md b/docs/my-website/docs/routing.md index fb16c4f08..404c72e44 100644 --- a/docs/my-website/docs/routing.md +++ b/docs/my-website/docs/routing.md @@ -567,10 +567,14 @@ from litellm import Router router = Router(model_list=model_list, enable_pre_call_checks=True) # 👈 Set to True ``` -**2. (Azure-only) Set base model** + +**2. Set Model List** For azure deployments, set the base model. Pick the base model from [this list](https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json), all the azure models start with `azure/`. 
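To double-check the context window litellm will use for a given base model, you can query its model map directly. A small sketch using `litellm.get_model_info` (the exact keys in the returned dict can vary by litellm version; `max_tokens` is the field assumed here):

```python
import litellm

# Look up the metadata litellm keeps for the base model you plan to set
info = litellm.get_model_info(model="azure/gpt-4-1106-preview")

# "max_tokens" is the context-window field assumed here
print(info.get("max_tokens"))
```

With that confirmed, define the model list: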
+ + + ```python model_list = [ { @@ -582,7 +586,7 @@ model_list = [ "api_base": os.getenv("AZURE_API_BASE"), }, "model_info": { - "base_model": "azure/gpt-35-turbo", # 👈 SET BASE MODEL + "base_model": "azure/gpt-35-turbo", # 👈 (Azure-only) SET BASE MODEL } }, { @@ -593,8 +597,51 @@ model_list = [ }, }, ] + +router = Router(model_list=model_list, enable_pre_call_checks=True) ``` + + + + +```python +model_list = [ + { + "model_name": "gpt-3.5-turbo-small", # model group name + "litellm_params": { # params for litellm completion/embedding call + "model": "azure/chatgpt-v-2", + "api_key": os.getenv("AZURE_API_KEY"), + "api_version": os.getenv("AZURE_API_VERSION"), + "api_base": os.getenv("AZURE_API_BASE"), + }, + "model_info": { + "base_model": "azure/gpt-35-turbo", # 👈 (Azure-only) SET BASE MODEL + } + }, + { + "model_name": "gpt-3.5-turbo-large", # model group name + "litellm_params": { # params for litellm completion/embedding call + "model": "gpt-3.5-turbo-1106", + "api_key": os.getenv("OPENAI_API_KEY"), + }, + }, + { + "model_name": "claude-opus", + "litellm_params": { call + "model": "claude-3-opus-20240229", + "api_key": os.getenv("ANTHROPIC_API_KEY"), + }, + }, + ] + +router = Router(model_list=model_list, enable_pre_call_checks=True, context_window_fallbacks=[{"gpt-3.5-turbo-small": ["gpt-3.5-turbo-large", "claude-opus"]}]) +``` + + + + + **3. Test it!** ```python @@ -646,60 +693,9 @@ print(f"response: {response}") -**1. Setup config** - -For azure deployments, set the base model. Pick the base model from [this list](https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json), all the azure models start with azure/. - -```yaml -router_settings: - enable_pre_call_checks: true # 1. Enable pre-call checks - -model_list: - - model_name: gpt-3.5-turbo - litellm_params: - model: azure/chatgpt-v-2 - api_base: os.environ/AZURE_API_BASE - api_key: os.environ/AZURE_API_KEY - api_version: "2023-07-01-preview" - model_info: - base_model: azure/gpt-4-1106-preview # 2. 👈 (azure-only) SET BASE MODEL - - - model_name: gpt-3.5-turbo - litellm_params: - model: gpt-3.5-turbo-1106 - api_key: os.environ/OPENAI_API_KEY -``` - -**2. Start proxy** - -```bash -litellm --config /path/to/config.yaml - -# RUNNING on http://0.0.0.0:4000 -``` - -**3. Test it!** - -```python -import openai -client = openai.OpenAI( - api_key="anything", - base_url="http://0.0.0.0:4000" -) - -text = "What is the meaning of 42?" 
* 5000 - -# request sent to model set on litellm proxy, `litellm --model` -response = client.chat.completions.create( - model="gpt-3.5-turbo", - messages = [ - {"role": "system", "content": text}, - {"role": "user", "content": "Who was Alexander?"}, - ], -) - -print(response) -``` +:::info +Go [here](./proxy/reliability.md#advanced---context-window-fallbacks) for how to do this on the proxy +::: diff --git a/docs/my-website/sidebars.js b/docs/my-website/sidebars.js index 63f3fbb02..5d5e24371 100644 --- a/docs/my-website/sidebars.js +++ b/docs/my-website/sidebars.js @@ -31,24 +31,25 @@ const sidebars = { "proxy/quick_start", "proxy/deploy", "proxy/prod", - "proxy/configs", { type: "link", - label: "📖 All Endpoints", + label: "📖 All Endpoints (Swagger)", href: "https://litellm-api.up.railway.app/", }, - "proxy/enterprise", - "proxy/user_keys", - "proxy/virtual_keys", + "proxy/configs", + "proxy/reliability", "proxy/users", + "proxy/user_keys", + "proxy/enterprise", + "proxy/virtual_keys", "proxy/team_based_routing", "proxy/ui", "proxy/cost_tracking", "proxy/token_auth", { type: "category", - label: "🔥 Load Balancing", - items: ["proxy/load_balancing", "proxy/reliability"], + label: "Extra Load Balancing", + items: ["proxy/load_balancing"], }, "proxy/model_management", "proxy/health", diff --git a/litellm/router.py b/litellm/router.py index 18aa83369..616633340 100644 --- a/litellm/router.py +++ b/litellm/router.py @@ -2170,7 +2170,7 @@ class Router: Filter out model in model group, if: - model context window < message length - - function call and model doesn't support function calling + - [TODO] function call and model doesn't support function calling """ verbose_router_logger.debug( f"Starting Pre-call checks for deployments in model={model}"
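
The idea behind `enable_pre_call_checks` (touched in the `router.py` hunk above) is simply to drop deployments that cannot hold the prompt before the request is routed. Below is a self-contained sketch of that filtering logic, not the Router's actual implementation — the deployment dicts and the `max_input_tokens` field name are illustrative only:

```python
from typing import Dict, List


def filter_by_context_window(deployments: List[Dict], prompt_tokens: int) -> List[Dict]:
    """Keep deployments whose context window can hold the prompt.

    Illustrative sketch of a pre-call check - not litellm.Router's real code.
    """
    eligible = []
    for deployment in deployments:
        limit = deployment.get("max_input_tokens")  # assumed field name
        if limit is None or prompt_tokens <= limit:
            # Deployments with unknown limits are kept rather than filtered out
            eligible.append(deployment)
    return eligible


deployments = [
    {"model": "azure/chatgpt-v-2", "max_input_tokens": 4096},
    {"model": "gpt-3.5-turbo-1106", "max_input_tokens": 16385},
]

# A 12k-token prompt only fits the 16k deployment
print(filter_by_context_window(deployments, prompt_tokens=12_000))
```

Context-window fallbacks (configured earlier in this doc) cover the case where nothing in the model group survives a check like this.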