import Image from '@theme/IdealImage';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Router - Load Balancing, Fallbacks

LiteLLM manages:

- Load balancing across multiple deployments (e.g. Azure/OpenAI)
- Prioritizing important requests so they don't fail (i.e. queueing)
- Basic reliability logic - cooldowns, fallbacks, timeouts and retries (fixed + exponential backoff) across multiple deployments/providers.

In production, LiteLLM supports using Redis to track cooled-down deployments and usage (for managing tpm/rpm limits).

:::info

If you want a server to load balance across different LLM APIs, use our [OpenAI Proxy Server](./proxy/load_balancing.md)

:::

## Load Balancing

(s/o [@paulpierre](https://www.linkedin.com/in/paulpierre/) and [sweep proxy](https://docs.sweep.dev/blogs/openai-proxy) for their contributions to this implementation)

[**See Code**](https://github.com/BerriAI/litellm/blob/main/litellm/router.py)

### Quick Start

```python
import os
import asyncio
from litellm import Router

model_list = [{ # list of model deployments
    "model_name": "gpt-3.5-turbo", # model alias
    "litellm_params": { # params for litellm completion/embedding call
        "model": "azure/chatgpt-v-2", # actual model name
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo",
    "litellm_params": { # params for litellm completion/embedding call
        "model": "azure/chatgpt-functioncalling",
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo",
    "litellm_params": { # params for litellm completion/embedding call
        "model": "gpt-3.5-turbo",
        "api_key": os.getenv("OPENAI_API_KEY"),
    }
}]

router = Router(model_list=model_list)

# openai.ChatCompletion.create replacement
async def main():
    response = await router.acompletion(model="gpt-3.5-turbo",
                    messages=[{"role": "user", "content": "Hey, how's it going?"}])
    print(response)

asyncio.run(main())
```


### Available Endpoints
- `router.completion()` - chat completions endpoint to call 100+ LLMs
- `router.acompletion()` - async chat completion calls
- `router.embeddings()` - embedding endpoint for Azure, OpenAI, Huggingface endpoints
- `router.aembeddings()` - async embeddings endpoint
- `router.text_completion()` - completion calls in the old OpenAI `/v1/completions` endpoint format

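For example, embeddings calls take the same OpenAI-style `model` / `input` arguments. A minimal sketch, assuming your `model_list` contains an embedding deployment aliased `text-embedding-ada-002` (the alias and input text here are illustrative):

```python
import os
from litellm import Router

model_list = [{
    "model_name": "text-embedding-ada-002",  # model alias
    "litellm_params": {
        "model": "text-embedding-ada-002",   # actual model name
        "api_key": os.getenv("OPENAI_API_KEY"),
    }
}]

router = Router(model_list=model_list)

response = router.embeddings(
    model="text-embedding-ada-002",
    input=["good morning from litellm"]
)
print(response)
```
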
### Advanced

#### Routing Strategies - Weighted Pick, Rate Limit Aware

Router provides 2 strategies for routing your calls across multiple deployments:

<Tabs>

<TabItem value="simple-shuffle" label="Weighted Pick">

**Default** Picks a deployment based on the provided **Requests per minute (rpm) or Tokens per minute (tpm)**

If `rpm` or `tpm` is not provided, it randomly picks a deployment.

```python
import os
import asyncio
from litellm import Router

model_list = [{ # list of model deployments
    "model_name": "gpt-3.5-turbo", # model alias
    "litellm_params": { # params for litellm completion/embedding call
        "model": "azure/chatgpt-v-2", # actual model name
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE"),
        "rpm": 900, # requests per minute for this API
    }
}, {
    "model_name": "gpt-3.5-turbo",
    "litellm_params": { # params for litellm completion/embedding call
        "model": "azure/chatgpt-functioncalling",
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE"),
        "rpm": 10,
    }
}, {
    "model_name": "gpt-3.5-turbo",
    "litellm_params": { # params for litellm completion/embedding call
        "model": "gpt-3.5-turbo",
        "api_key": os.getenv("OPENAI_API_KEY"),
        "rpm": 10,
    }
}]

# init router
router = Router(model_list=model_list, routing_strategy="simple-shuffle")

async def router_acompletion():
    response = await router.acompletion(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Hey, how's it going?"}]
    )
    print(response)
    return response

asyncio.run(router_acompletion())
```

</TabItem>
<TabItem value="usage-based" label="Rate-Limit Aware">

This will route to the deployment with the lowest TPM usage for that minute.

In production, we use Redis to track usage (TPM/RPM) across multiple deployments.

If you pass in the deployment's tpm/rpm limits, this will also check against those, and filter out any deployment whose limits would be exceeded.

For Azure, your RPM = TPM/6 (e.g. a deployment provisioned with 240,000 TPM corresponds to roughly 40,000 RPM).

```python
import os
import asyncio
from litellm import Router


model_list = [{ # list of model deployments
    "model_name": "gpt-3.5-turbo", # model alias
    "litellm_params": { # params for litellm completion/embedding call
        "model": "azure/chatgpt-v-2", # actual model name
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    },
    "tpm": 100000,
    "rpm": 10000,
}, {
    "model_name": "gpt-3.5-turbo",
    "litellm_params": { # params for litellm completion/embedding call
        "model": "azure/chatgpt-functioncalling",
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    },
    "tpm": 100000,
    "rpm": 1000,
}, {
    "model_name": "gpt-3.5-turbo",
    "litellm_params": { # params for litellm completion/embedding call
        "model": "gpt-3.5-turbo",
        "api_key": os.getenv("OPENAI_API_KEY"),
    },
    "tpm": 100000,
    "rpm": 1000,
}]

router = Router(model_list=model_list,
                redis_host=os.environ["REDIS_HOST"],
                redis_password=os.environ["REDIS_PASSWORD"],
                redis_port=os.environ["REDIS_PORT"],
                routing_strategy="usage-based-routing")

async def router_acompletion():
    response = await router.acompletion(model="gpt-3.5-turbo",
                    messages=[{"role": "user", "content": "Hey, how's it going?"}])
    print(response)
    return response

asyncio.run(router_acompletion())
```

</TabItem>
</Tabs>

## Basic Reliability

### Timeouts

The timeout set on the router applies to the entire length of the call, and is passed down to the underlying completion() call as well.

```python
from litellm import Router

model_list = [{...}]

router = Router(model_list=model_list,
                timeout=30) # raise timeout error if call takes > 30s

response = router.completion(model="gpt-3.5-turbo",
                             messages=[{"role": "user", "content": "Hey, how's it going?"}])
print(response)
```
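
Since each deployment's `litellm_params` are forwarded to the underlying `litellm.completion()` call, a tighter timeout can in principle also be set on an individual deployment. A sketch under that assumption (the 10s value is illustrative):

```python
import os
from litellm import Router

model_list = [{
    "model_name": "gpt-3.5-turbo",
    "litellm_params": {
        "model": "gpt-3.5-turbo",
        "api_key": os.getenv("OPENAI_API_KEY"),
        "timeout": 10,  # per-deployment timeout (seconds), forwarded with the other litellm_params
    }
}]

router = Router(model_list=model_list,
                timeout=30) # default for deployments that don't set their own
```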

### Cooldowns

Set the limit for how many calls a model is allowed to fail in a minute, before being cooled down for a minute.

```python
from litellm import Router

model_list = [{...}]

router = Router(model_list=model_list,
                allowed_fails=1) # cooldown model if it fails > 1 call in a minute.

user_message = "Hello, whats the weather in San Francisco??"
messages = [{"content": user_message, "role": "user"}]

# normal call
response = router.completion(model="gpt-3.5-turbo", messages=messages)

print(f"response: {response}")
```

### Retries

For both async + sync functions, we support retrying failed requests.

For RateLimitError we implement exponential backoff.

For generic errors, we retry immediately.

Here's a quick look at how we can set `num_retries = 3`:

```python
from litellm import Router

model_list = [{...}]

router = Router(model_list=model_list,
                num_retries=3)

user_message = "Hello, whats the weather in San Francisco??"
messages = [{"content": user_message, "role": "user"}]

# normal call
response = router.completion(model="gpt-3.5-turbo", messages=messages)

print(f"response: {response}")
```

### Fallbacks

If a call fails after `num_retries`, fall back to another model group.

If the error is a context window exceeded error, fall back to a larger model group (if given).

```python
import os
from litellm import Router

model_list = [
    { # list of model deployments
        "model_name": "azure/gpt-3.5-turbo", # openai model name
        "litellm_params": { # params for litellm completion/embedding call
            "model": "azure/chatgpt-v-2",
            "api_key": "bad-key",
            "api_version": os.getenv("AZURE_API_VERSION"),
            "api_base": os.getenv("AZURE_API_BASE")
        },
        "tpm": 240000,
        "rpm": 1800
    },
    { # list of model deployments
        "model_name": "azure/gpt-3.5-turbo-context-fallback", # openai model name
        "litellm_params": { # params for litellm completion/embedding call
            "model": "azure/chatgpt-v-2",
            "api_key": "bad-key",
            "api_version": os.getenv("AZURE_API_VERSION"),
            "api_base": os.getenv("AZURE_API_BASE")
        },
        "tpm": 240000,
        "rpm": 1800
    },
    {
        "model_name": "azure/gpt-3.5-turbo", # openai model name
        "litellm_params": { # params for litellm completion/embedding call
            "model": "azure/chatgpt-functioncalling",
            "api_key": "bad-key",
            "api_version": os.getenv("AZURE_API_VERSION"),
            "api_base": os.getenv("AZURE_API_BASE")
        },
        "tpm": 240000,
        "rpm": 1800
    },
    {
        "model_name": "gpt-3.5-turbo", # openai model name
        "litellm_params": { # params for litellm completion/embedding call
            "model": "gpt-3.5-turbo",
            "api_key": os.getenv("OPENAI_API_KEY"),
        },
        "tpm": 1000000,
        "rpm": 9000
    },
    {
        "model_name": "gpt-3.5-turbo-16k", # openai model name
        "litellm_params": { # params for litellm completion/embedding call
            "model": "gpt-3.5-turbo-16k",
            "api_key": os.getenv("OPENAI_API_KEY"),
        },
        "tpm": 1000000,
        "rpm": 9000
    }
]


router = Router(model_list=model_list,
                fallbacks=[{"azure/gpt-3.5-turbo": ["gpt-3.5-turbo"]}],
                context_window_fallbacks=[{"azure/gpt-3.5-turbo-context-fallback": ["gpt-3.5-turbo-16k"]}, {"gpt-3.5-turbo": ["gpt-3.5-turbo-16k"]}],
                set_verbose=True)


user_message = "Hello, whats the weather in San Francisco??"
messages = [{"content": user_message, "role": "user"}]

# normal fallback call
response = router.completion(model="azure/gpt-3.5-turbo", messages=messages)

# context window fallback call
response = router.completion(model="azure/gpt-3.5-turbo-context-fallback", messages=messages)

print(f"response: {response}")
```

### Caching

In production, we recommend using a Redis cache. For quickly testing things locally, we also support simple in-memory caching.

**In-memory Cache**

```python
router = Router(model_list=model_list,
                cache_responses=True)

response = router.completion(model="gpt-3.5-turbo",
                             messages=[{"role": "user", "content": "Hey, how's it going?"}])
print(response)
```

**Redis Cache**
```python
router = Router(model_list=model_list,
                redis_host=os.getenv("REDIS_HOST"),
                redis_password=os.getenv("REDIS_PASSWORD"),
                redis_port=os.getenv("REDIS_PORT"),
                cache_responses=True)

response = router.completion(model="gpt-3.5-turbo",
                             messages=[{"role": "user", "content": "Hey, how's it going?"}])
print(response)
```

**Pass in Redis URL, additional kwargs**
```python
router = Router(model_list=model_list,
                ## CACHING ##
                redis_url=os.getenv("REDIS_URL"),
                cache_kwargs={}, # additional kwargs to pass to RedisCache (see caching.py)
                cache_responses=True)
```

#### Caching across model groups

If you want to cache across 2 different model groups (e.g. Azure deployments and OpenAI), use caching groups.

```python
import os, asyncio, time, traceback
import litellm
from litellm import Router

# set os env
os.environ["OPENAI_API_KEY"] = ""
os.environ["AZURE_API_KEY"] = ""
os.environ["AZURE_API_BASE"] = ""
os.environ["AZURE_API_VERSION"] = ""

async def test_acompletion_caching_on_router_caching_groups():
    # tests acompletion + caching on router
    try:
        litellm.set_verbose = True
        model_list = [
            {
                "model_name": "openai-gpt-3.5-turbo",
                "litellm_params": {
                    "model": "gpt-3.5-turbo-0613",
                    "api_key": os.getenv("OPENAI_API_KEY"),
                },
            },
            {
                "model_name": "azure-gpt-3.5-turbo",
                "litellm_params": {
                    "model": "azure/chatgpt-v-2",
                    "api_key": os.getenv("AZURE_API_KEY"),
                    "api_base": os.getenv("AZURE_API_BASE"),
                    "api_version": os.getenv("AZURE_API_VERSION")
                },
            }
        ]

        messages = [
            {"role": "user", "content": f"write a one sentence poem {time.time()}?"}
        ]
        start_time = time.time()
        router = Router(model_list=model_list,
                        cache_responses=True,
                        caching_groups=[("openai-gpt-3.5-turbo", "azure-gpt-3.5-turbo")])
        response1 = await router.acompletion(model="openai-gpt-3.5-turbo", messages=messages, temperature=1)
        print(f"response1: {response1}")
        await asyncio.sleep(1) # add cache is async, async sleep for cache to get set
        response2 = await router.acompletion(model="azure-gpt-3.5-turbo", messages=messages, temperature=1)
        assert response1.id == response2.id
        assert len(response1.choices[0].message.content) > 0
        assert response1.choices[0].message.content == response2.choices[0].message.content
    except Exception as e:
        traceback.print_exc()

asyncio.run(test_acompletion_caching_on_router_caching_groups())
```

#### Default litellm.completion/embedding params

You can also set default params for litellm completion/embedding calls. Here's how to do that:

```python
from litellm import Router

fallback_dict = {"gpt-3.5-turbo": "gpt-3.5-turbo-16k"}

router = Router(model_list=model_list,
                default_litellm_params={"context_window_fallback_dict": fallback_dict})

user_message = "Hello, whats the weather in San Francisco??"
messages = [{"content": user_message, "role": "user"}]

# normal call
response = router.completion(model="gpt-3.5-turbo", messages=messages)

print(f"response: {response}")
```
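
Any key set in `default_litellm_params` is merged into every completion/embedding call, unless the deployment's `litellm_params` or the individual request already provides it. A minimal sketch (the parameter values here are illustrative):

```python
from litellm import Router

router = Router(model_list=model_list,
                default_litellm_params={
                    "temperature": 0.2, # used whenever a request doesn't set its own temperature
                    "max_tokens": 256,  # shared output cap across all deployments
                })

# per-request kwargs still take precedence over the defaults
response = router.completion(model="gpt-3.5-turbo",
                             messages=[{"role": "user", "content": "Summarize the router docs in one line."}],
                             temperature=0.9)
print(response)
```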


## Deploy Router

If you want a server to load balance across different LLM APIs, use our [OpenAI Proxy Server](./simple_proxy#load-balancing---multiple-instances-of-1-model)