(docs) simple proxy

https://github.com/BerriAI/litellm, commit 032cd0121b (parent 3cc8305ec6)
1 changed file with 137 additions and 131 deletions

### Load Balancing - Multiple Instances of 1 model

**LiteLLM Proxy can handle 1k+ requests/second**. Use this config to load balance between multiple instances of the same model.

The proxy will handle routing requests (using LiteLLM's Router).

In the config below, requests with `model=gpt-3.5-turbo` will be routed across multiple instances of `azure/gpt-3.5-turbo`.

```yaml
model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/gpt-turbo-small-eu
      api_base: https://my-endpoint-europe-berri-992.openai.azure.com/
      api_key:
      rpm: 6      # Rate limit for this deployment: in requests per minute (rpm)
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/gpt-turbo-small-ca
      api_base: https://my-endpoint-canada-berri992.openai.azure.com/
      api_key:
      rpm: 6
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/gpt-turbo-large
      api_base: https://openai-france-1234.openai.azure.com/
      api_key:
      rpm: 1440
```

#### Step 2: Start Proxy with config

```shell
$ litellm --config /path/to/config.yaml
```

#### Step 3: Use proxy

Curl Command
```shell
curl --location 'http://0.0.0.0:8000/chat/completions' \
--header 'Content-Type: application/json' \
--data ' {
      "model": "gpt-3.5-turbo",
      "messages": [
        {
          "role": "user",
          "content": "what llm are you"
        }
      ]
    }
'
```
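
Since the proxy exposes an OpenAI-compatible `/chat/completions` endpoint, you can also call it with the OpenAI Python SDK. This is a minimal sketch, assuming the proxy from Step 2 is running locally on port 8000; the `api_key` value is only a placeholder since no auth is configured in this example:

```python
# pip install openai  (v1.x)
from openai import OpenAI

# Point the OpenAI client at the local LiteLLM proxy.
# "sk-anything" is a placeholder - no virtual keys are configured here.
client = OpenAI(base_url="http://0.0.0.0:8000", api_key="sk-anything")

# The proxy load balances this request across the three azure/gpt-3.5-turbo deployments.
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "what llm are you"}],
)
print(response.choices[0].message.content)
```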
|
||||||
|
|
||||||
|
### Fallbacks + Cooldowns + Retries + Timeouts
|
||||||
|
|
||||||
|
If a call fails after num_retries, fall back to another model group.
|
||||||
|
|
||||||
|
If the error is a context window exceeded error, fall back to a larger model group (if given).
|
||||||
|
|
||||||
|
[**See Code**](https://github.com/BerriAI/litellm/blob/main/litellm/router.py)
|
||||||
|
|
||||||
|
**Set via config**
|
||||||
|
```yaml
model_list:
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8001
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8002
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: http://0.0.0.0:8003
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo
      api_key: <my-openai-key>
  - model_name: gpt-3.5-turbo-16k
    litellm_params:
      model: gpt-3.5-turbo-16k
      api_key: <my-openai-key>

litellm_settings:
  num_retries: 3 # retry call 3 times on each model_name (e.g. zephyr-beta)
  request_timeout: 10 # raise Timeout error if call takes longer than 10s
  fallbacks: [{"zephyr-beta": ["gpt-3.5-turbo"]}] # fallback to gpt-3.5-turbo if call fails num_retries
  context_window_fallbacks: [{"zephyr-beta": ["gpt-3.5-turbo-16k"]}, {"gpt-3.5-turbo": ["gpt-3.5-turbo-16k"]}] # fallback to gpt-3.5-turbo-16k if context window error
  allowed_fails: 3 # cooldown the deployment if it fails more than 3 calls in a minute
```
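
For reference, the same behaviour can be sketched directly against LiteLLM's Router in Python. This is only an illustration: the constructor parameters shown here (`model_list`, `fallbacks`, `context_window_fallbacks`, `num_retries`, `allowed_fails`, `timeout`) are assumed to match your installed litellm version, so check the router code linked above if in doubt.

```python
# Sketch of the equivalent settings using litellm's Router directly (not the proxy).
# Parameter names are assumed to match the router.py linked above.
from litellm import Router

model_list = [
    {
        "model_name": "zephyr-beta",
        "litellm_params": {
            "model": "huggingface/HuggingFaceH4/zephyr-7b-beta",
            "api_base": "http://0.0.0.0:8001",
        },
    },
    {
        "model_name": "gpt-3.5-turbo",
        "litellm_params": {"model": "gpt-3.5-turbo", "api_key": "<my-openai-key>"},
    },
    {
        "model_name": "gpt-3.5-turbo-16k",
        "litellm_params": {"model": "gpt-3.5-turbo-16k", "api_key": "<my-openai-key>"},
    },
]

router = Router(
    model_list=model_list,
    num_retries=3,
    timeout=10,
    fallbacks=[{"zephyr-beta": ["gpt-3.5-turbo"]}],
    context_window_fallbacks=[{"zephyr-beta": ["gpt-3.5-turbo-16k"]}],
    allowed_fails=3,
)

response = router.completion(
    model="zephyr-beta",
    messages=[{"role": "user", "content": "what llm are you"}],
)
print(response)
```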

**Set dynamically**

```bash
curl --location 'http://0.0.0.0:8000/chat/completions' \
--header 'Content-Type: application/json' \
--data ' {
      "model": "zephyr-beta",
      "messages": [
        {
          "role": "user",
          "content": "what llm are you"
        }
      ],
      "fallbacks": [{"zephyr-beta": ["gpt-3.5-turbo"]}],
      "context_window_fallbacks": [{"zephyr-beta": ["gpt-3.5-turbo"]}],
      "num_retries": 2,
      "request_timeout": 10
    }
'
```
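
The same per-request overrides can be sent through the OpenAI Python SDK by passing the extra fields via `extra_body`. This is a sketch; it assumes the proxy accepts the same field names used in the curl example above:

```python
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000", api_key="sk-anything")  # placeholder key

# extra_body forwards the non-standard fields (fallbacks, retries, timeout) to the proxy.
response = client.chat.completions.create(
    model="zephyr-beta",
    messages=[{"role": "user", "content": "what llm are you"}],
    extra_body={
        "fallbacks": [{"zephyr-beta": ["gpt-3.5-turbo"]}],
        "context_window_fallbacks": [{"zephyr-beta": ["gpt-3.5-turbo"]}],
        "num_retries": 2,
        "request_timeout": 10,
    },
)
print(response.choices[0].message.content)
```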

### Config for Embedding Models - xorbitsai/inference

Here's how you can use multiple LLMs with one proxy `config.yaml`.

Here is how [LiteLLM calls OpenAI-compatible embedding models](https://docs.litellm.ai/docs/embedding/supported_embedding#openai-compatible-embedding-models).

#### Config
```yaml
model_list:
  - model_name: custom_embedding_model
    litellm_params:
      model: openai/custom_embedding  # the `openai/` prefix tells litellm it's openai compatible
      api_base: http://0.0.0.0:8000/
  - model_name: custom_embedding_model
    litellm_params:
      model: openai/custom_embedding  # the `openai/` prefix tells litellm it's openai compatible
      api_base: http://0.0.0.0:8001/
```

Run the proxy using this config:

```shell
$ litellm --config /path/to/config.yaml
```
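
Requests to `model=custom_embedding_model` are then load balanced across the two backends. Below is a minimal sketch of calling the proxy's OpenAI-compatible `/embeddings` endpoint with the OpenAI Python SDK, assuming the proxy is reachable on port 8000 (adjust the port if it clashes with one of the backend servers above):

```python
from openai import OpenAI

# Placeholder key; point the client at the LiteLLM proxy, not at a backend directly.
client = OpenAI(base_url="http://0.0.0.0:8000", api_key="sk-anything")

response = client.embeddings.create(
    model="custom_embedding_model",
    input=["write a litellm poem"],
)
print(len(response.data[0].embedding))
```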

### Managing Auth - Virtual Keys

Grant others temporary access to your proxy, with keys that expire after a set duration.
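
For illustration only, here is a hedged sketch of generating such a key via the proxy's `/key/generate` endpoint. The endpoint path, master-key header, and request fields shown here are assumptions from memory, so check the LiteLLM key management docs for the authoritative schema:

```python
import requests

# Assumed endpoint and fields - verify against the key management docs.
resp = requests.post(
    "http://0.0.0.0:8000/key/generate",
    headers={"Authorization": "Bearer sk-1234"},  # the proxy's master key
    json={"models": ["gpt-3.5-turbo"], "duration": "20m"},  # key expires after 20 minutes
)
print(resp.json())  # response contains the generated virtual key
```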

s/o to [@David Manouchehri](https://www.linkedin.com/in/davidmanouchehri/) for helping with this.

### Config for setting Model Aliases

Set a model alias for your deployments.