import Image from '@theme/IdealImage';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Quick Start

Quick start CLI, Config, Docker

LiteLLM Server manages:

* **Unified Interface**: Calling 100+ LLMs [Huggingface/Bedrock/TogetherAI/etc.](#other-supported-models) in the OpenAI `ChatCompletions` & `Completions` format
* **Load Balancing**: between [Multiple Models](#multiple-models---quick-start) + [Deployments of the same model](#multiple-instances-of-1-model) - LiteLLM proxy can handle 1.5k+ requests/second during load tests.
* **Cost Tracking**: Authentication & spend tracking via [Virtual Keys](#managing-auth---virtual-keys)

[**See LiteLLM Proxy code**](https://github.com/BerriAI/litellm/tree/main/litellm/proxy)

#### 📖 Proxy Endpoints - [Swagger Docs](https://litellm-api.up.railway.app/)

View all the supported args for the Proxy CLI [here](https://docs.litellm.ai/docs/simple_proxy#proxy-cli-arguments)

```shell
$ pip install 'litellm[proxy]'
```

## Quick Start - LiteLLM Proxy CLI

Run the following command to start the litellm proxy

```shell
$ litellm --model huggingface/bigcode/starcoder

#INFO: Proxy running on http://0.0.0.0:8000
```

### Test

In a new shell, run the following. This makes an `openai.chat.completions` request. Ensure you're using openai v1.0.0+

```shell
litellm --test
```

This will automatically route any request for `gpt-3.5-turbo` to `bigcode/starcoder`, hosted on Hugging Face Inference Endpoints.
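
Under the hood, `litellm --test` just sends an OpenAI-format request to the proxy. If you prefer to test by hand, a minimal equivalent request (assuming the proxy is on the default `http://0.0.0.0:8000`) looks like this:

```shell
curl http://0.0.0.0:8000/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "hello, which model is this?"}]
  }'
```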

### Supported LLMs

All LiteLLM supported LLMs are supported on the Proxy. See all [supported LLMs](https://docs.litellm.ai/docs/providers)

<Tabs>
<TabItem value="bedrock" label="AWS Bedrock">

```shell
$ export AWS_ACCESS_KEY_ID=
$ export AWS_REGION_NAME=
$ export AWS_SECRET_ACCESS_KEY=
```

```shell
$ litellm --model bedrock/anthropic.claude-v2
```

</TabItem>
<TabItem value="azure" label="Azure OpenAI">

```shell
$ export AZURE_API_KEY=my-api-key
$ export AZURE_API_BASE=my-api-base
```

```shell
$ litellm --model azure/my-deployment-name
```

</TabItem>
<TabItem value="openai" label="OpenAI">

```shell
$ export OPENAI_API_KEY=my-api-key
```

```shell
$ litellm --model gpt-3.5-turbo
```

</TabItem>
<TabItem value="ollama" label="Ollama">

```shell
$ litellm --model ollama/<ollama-model-name>
```

</TabItem>
<TabItem value="openai-proxy" label="OpenAI Compatible Endpoint">

```shell
$ export OPENAI_API_KEY=my-api-key
```

```shell
$ litellm --model openai/<your model name> --api_base <your-api-base> # e.g. http://0.0.0.0:3000
```

</TabItem>

<TabItem value="vertex-ai" label="Vertex AI [Gemini]">

```shell
$ export VERTEX_PROJECT="hardy-project"
$ export VERTEX_LOCATION="us-west"
```

```shell
$ litellm --model vertex_ai/gemini-pro
```

</TabItem>

<TabItem value="huggingface" label="Huggingface (TGI) Deployed">

```shell
$ export HUGGINGFACE_API_KEY=my-api-key #[OPTIONAL]
```

```shell
$ litellm --model huggingface/<your model name> --api_base <your-api-base> # e.g. http://0.0.0.0:3000
```

</TabItem>
<TabItem value="huggingface-local" label="Huggingface (TGI) Local">

```shell
$ litellm --model huggingface/<your model name> --api_base http://0.0.0.0:8001
```

</TabItem>
<TabItem value="aws-sagemaker" label="AWS Sagemaker">

```shell
export AWS_ACCESS_KEY_ID=
export AWS_REGION_NAME=
export AWS_SECRET_ACCESS_KEY=
```

```shell
$ litellm --model sagemaker/jumpstart-dft-meta-textgeneration-llama-2-7b
```

</TabItem>
<TabItem value="anthropic" label="Anthropic">

```shell
$ export ANTHROPIC_API_KEY=my-api-key
```

```shell
$ litellm --model claude-instant-1
```

</TabItem>
<TabItem value="vllm-local" label="VLLM">

Assuming you're running vllm locally

```shell
$ litellm --model vllm/facebook/opt-125m
```

</TabItem>
<TabItem value="together_ai" label="TogetherAI">

```shell
$ export TOGETHERAI_API_KEY=my-api-key
```

```shell
$ litellm --model together_ai/lmsys/vicuna-13b-v1.5-16k
```

</TabItem>

<TabItem value="replicate" label="Replicate">

```shell
$ export REPLICATE_API_KEY=my-api-key
```

```shell
$ litellm \
  --model replicate/meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3
```

</TabItem>

<TabItem value="petals" label="Petals">

```shell
$ litellm --model petals/meta-llama/Llama-2-70b-chat-hf
```

</TabItem>

<TabItem value="palm" label="Palm">

```shell
$ export PALM_API_KEY=my-palm-key
```

```shell
$ litellm --model palm/chat-bison
```

</TabItem>

<TabItem value="ai21" label="AI21">

```shell
$ export AI21_API_KEY=my-api-key
```

```shell
$ litellm --model j2-light
```

</TabItem>

<TabItem value="cohere" label="Cohere">

```shell
$ export COHERE_API_KEY=my-api-key
```

```shell
$ litellm --model command-nightly
```

</TabItem>

</Tabs>

### Using LiteLLM Proxy - Curl Request, OpenAI Package, Langchain

<Tabs>
<TabItem value="Curl" label="Curl Request">

```shell
curl --location 'http://0.0.0.0:8000/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {
        "role": "user",
        "content": "what llm are you"
      }
    ]
  }'
```
</TabItem>
<TabItem value="openai" label="OpenAI v1.0.0+">

```python
import openai

client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:8000"
)

# request sent to model set on litellm proxy, `litellm --model`
response = client.chat.completions.create(model="gpt-3.5-turbo", messages=[
    {
        "role": "user",
        "content": "this is a test request, write a short poem"
    }
])

print(response)
```
</TabItem>
<TabItem value="langchain" label="Langchain">

```python
from langchain.chat_models import ChatOpenAI
from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)
from langchain.schema import HumanMessage, SystemMessage

chat = ChatOpenAI(
    openai_api_base="http://0.0.0.0:8000",  # set openai_api_base to the LiteLLM Proxy
    model="gpt-3.5-turbo",
    temperature=0.1
)

messages = [
    SystemMessage(
        content="You are a helpful assistant that im using to make a test request to."
    ),
    HumanMessage(
        content="test from litellm. tell me why it's amazing in 1 sentence"
    ),
]
response = chat(messages)

print(response)
```

</TabItem>
<TabItem value="langchain-embedding" label="Langchain Embeddings">

```python
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="sagemaker-embeddings", openai_api_base="http://0.0.0.0:8000", openai_api_key="temp-key")

text = "This is a test document."

query_result = embeddings.embed_query(text)

print("SAGEMAKER EMBEDDINGS")
print(query_result[:5])

embeddings = OpenAIEmbeddings(model="bedrock-embeddings", openai_api_base="http://0.0.0.0:8000", openai_api_key="temp-key")

text = "This is a test document."

query_result = embeddings.embed_query(text)

print("BEDROCK EMBEDDINGS")
print(query_result[:5])

embeddings = OpenAIEmbeddings(model="bedrock-titan-embeddings", openai_api_base="http://0.0.0.0:8000", openai_api_key="temp-key")

text = "This is a test document."

query_result = embeddings.embed_query(text)

print("TITAN EMBEDDINGS")
print(query_result[:5])
```
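
Note: the model names above (`sagemaker-embeddings`, `bedrock-embeddings`, `bedrock-titan-embeddings`) are not provider model IDs; they are aliases you define in the proxy config. A hedged sketch of what such entries might look like (the `model` values below are placeholders, not exact provider IDs - see the config docs for your provider):

```yaml
model_list:
  - model_name: sagemaker-embeddings        # alias the client requests
    litellm_params:
      model: sagemaker/<your-sagemaker-embedding-endpoint>  # placeholder endpoint name
  - model_name: bedrock-embeddings
    litellm_params:
      model: bedrock/<your-bedrock-embedding-model>         # placeholder model id
```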

</TabItem>
</Tabs>

## Quick Start - LiteLLM Proxy + Config.yaml

The config allows you to create a model list and set `api_base`, `max_tokens` (all litellm params). See more details about the config [here](https://docs.litellm.ai/docs/proxy/configs)

### Create a Config for LiteLLM Proxy

Example config

```yaml
model_list:
  - model_name: gpt-3.5-turbo # user-facing model alias
    litellm_params: # all params accepted by litellm.completion() - https://docs.litellm.ai/docs/completion/input
      model: azure/<your-deployment-name>
      api_base: <your-azure-api-endpoint>
      api_key: <your-azure-api-key>
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/gpt-turbo-small-ca
      api_base: https://my-endpoint-canada-berri992.openai.azure.com/
      api_key: <your-azure-api-key>
  - model_name: vllm-model
    litellm_params:
      model: openai/<your-model-name>
      api_base: <your-api-base> # e.g. http://0.0.0.0:3000
```

### Run proxy with config

```shell
litellm --config your_config.yaml
```
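
Because the two Azure entries above share the alias `gpt-3.5-turbo`, the proxy can load-balance requests for that name across both deployments, while `vllm-model` is reached by its own alias. A quick sanity check against the aliases (assuming the proxy is on the default port `8000`):

```shell
# routed across the two azure gpt-3.5-turbo deployments
curl http://0.0.0.0:8000/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hi"}]}'

# routed to the vllm-model entry
curl http://0.0.0.0:8000/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "vllm-model", "messages": [{"role": "user", "content": "hi"}]}'
```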

[**More Info**](./configs.md)

## 📖 Proxy Endpoints - [Swagger Docs](https://litellm-api.up.railway.app/)

- POST `/chat/completions` - chat completions endpoint to call 100+ LLMs
- POST `/completions` - completions endpoint
- POST `/embeddings` - embedding endpoint for Azure, OpenAI, Huggingface endpoints
- GET `/models` - available models on server
- POST `/key/generate` - generate a key to access the proxy
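
For example, you can list the models the proxy serves and generate a virtual key against it. The master key `sk-1234` below is only an illustration; use the master key you configured (and add the same `Authorization` header to the other endpoints if auth is enabled):

```shell
# list available models on the proxy
curl http://0.0.0.0:8000/models

# generate a virtual key (assumes a master key of sk-1234 is configured)
curl http://0.0.0.0:8000/key/generate \
  -H "Authorization: Bearer sk-1234" \
  -H "Content-Type: application/json" \
  -d '{"models": ["gpt-3.5-turbo"]}'
```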

## Quick Start Docker Image: Github Container Registry

### Pull the litellm ghcr docker image

See the latest available ghcr docker image here:
https://github.com/berriai/litellm/pkgs/container/litellm

```shell
docker pull ghcr.io/berriai/litellm:main-v1.16.13
```

### Run the Docker Image

```shell
docker run ghcr.io/berriai/litellm:main-v1.16.13
```

#### Run the Docker Image with LiteLLM CLI args

See all supported CLI args [here](https://docs.litellm.ai/docs/proxy/cli):

Here's how you can run the docker image and pass your config to `litellm`

```shell
docker run ghcr.io/berriai/litellm:main-v1.16.13 --config your_config.yaml
```

Here's how you can run the docker image and start litellm on port 8002 with `num_workers=8`

```shell
docker run ghcr.io/berriai/litellm:main-v1.16.13 --port 8002 --num_workers 8
```

#### Run the Docker Image using docker compose

**Step 1**

- (Recommended) Use the example file `docker-compose.example.yml` in the project root, e.g. https://github.com/BerriAI/litellm/blob/main/docker-compose.example.yml
- Rename the file `docker-compose.example.yml` to `docker-compose.yml`.

Here's an example `docker-compose.yml` file

```yaml
version: "3.9"
services:
  litellm:
    image: ghcr.io/berriai/litellm:main
    ports:
      - "8000:8000" # Map the container port to the host, change the host port if necessary
    volumes:
      - ./litellm-config.yaml:/app/config.yaml # Mount the local configuration file
    # You can change the port or number of workers as per your requirements or pass any other supported CLI argument. Make sure the port passed here matches the container port defined above under `ports`.
    command: [ "--config", "/app/config.yaml", "--port", "8000", "--num_workers", "8" ]

# ...rest of your docker-compose config if any
```

**Step 2**

Create a `litellm-config.yaml` file with your LiteLLM config relative to your `docker-compose.yml` file.

Check the config doc [here](https://docs.litellm.ai/docs/proxy/configs)
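
If you don't have a config yet, a minimal `litellm-config.yaml` sketch to get the container serving requests might look like the following (the OpenAI entry and the `os.environ/...` key reference are examples; swap in any provider from the config docs):

```yaml
model_list:
  - model_name: gpt-3.5-turbo              # alias clients will request
    litellm_params:
      model: gpt-3.5-turbo                 # provider model to route to
      api_key: os.environ/OPENAI_API_KEY   # read the key from the container's environment
```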

**Step 3**

Run the command `docker-compose up` or `docker compose up` as per your docker installation.

> Use `-d` flag to run the container in detached mode (background) e.g. `docker compose up -d`

Your LiteLLM container should now be running on the defined port, e.g. `8000`.

## Using with OpenAI compatible projects

Set `base_url` to the LiteLLM Proxy server

<Tabs>
<TabItem value="openai" label="OpenAI v1.0.0+">

```python
import openai

client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:8000"
)

# request sent to model set on litellm proxy, `litellm --model`
response = client.chat.completions.create(model="gpt-3.5-turbo", messages=[
    {
        "role": "user",
        "content": "this is a test request, write a short poem"
    }
])

print(response)
```
</TabItem>
<TabItem value="librechat" label="LibreChat">

#### Start the LiteLLM proxy

```shell
litellm --model gpt-3.5-turbo

#INFO: Proxy running on http://0.0.0.0:8000
```

#### 1. Clone the repo

```shell
git clone https://github.com/danny-avila/LibreChat.git
```

#### 2. Modify Librechat's `docker-compose.yml`

Since the LiteLLM Proxy is running on port `8000`, point LibreChat's reverse proxy at port `8000`:

```yaml
OPENAI_REVERSE_PROXY=http://host.docker.internal:8000/v1/chat/completions
```

#### 3. Save fake OpenAI key in Librechat's `.env`

Copy Librechat's `.env.example` to `.env` and overwrite the default OPENAI_API_KEY (by default it requires the user to pass a key).

```env
OPENAI_API_KEY=sk-1234
```

#### 4. Run LibreChat:

```shell
docker compose up
```
</TabItem>

<TabItem value="continue-dev" label="ContinueDev">

Continue-Dev brings ChatGPT to VSCode. See how to [install it here](https://continue.dev/docs/quickstart).

In the [config.py](https://continue.dev/docs/reference/Models/openai) set this as your default model.

```python
default=OpenAI(
    api_key="IGNORED",
    model="fake-model-name",
    context_length=2048,  # customize if needed for your model
    api_base="http://localhost:8000"  # your proxy server url
),
```

Credits [@vividfog](https://github.com/jmorganca/ollama/issues/305#issuecomment-1751848077) for this tutorial.

</TabItem>

<TabItem value="aider" label="Aider">

```shell
$ pip install aider-chat

$ aider --openai-api-base http://0.0.0.0:8000 --openai-api-key fake-key
```
</TabItem>
<TabItem value="autogen" label="AutoGen">

```shell
pip install pyautogen
```

```python
from autogen import AssistantAgent, UserProxyAgent, oai

config_list = [
    {
        "model": "my-fake-model",
        "api_base": "http://localhost:8000",  # litellm compatible endpoint
        "api_type": "open_ai",
        "api_key": "NULL",  # just a placeholder
    }
]

response = oai.Completion.create(config_list=config_list, prompt="Hi")
print(response)  # works fine

llm_config = {
    "config_list": config_list,
}

assistant = AssistantAgent("assistant", llm_config=llm_config)
user_proxy = UserProxyAgent("user_proxy")
user_proxy.initiate_chat(assistant, message="Plot a chart of META and TESLA stock price change YTD.", config_list=config_list)
```

Credits [@victordibia](https://github.com/microsoft/autogen/issues/45#issuecomment-1749921972) for this tutorial.
</TabItem>

<TabItem value="guidance" label="guidance">

A guidance language for controlling large language models.
https://github.com/guidance-ai/guidance

**NOTE:** Guidance sends additional params like `stop_sequences` which can cause some models to fail if they don't support it.

**Fix**: Start your proxy using the `--drop_params` flag

```shell
litellm --model ollama/codellama --temperature 0.3 --max_tokens 2048 --drop_params
```

```python
import guidance

# set api_base to your proxy
# set api_key to anything
gpt4 = guidance.llms.OpenAI("gpt-4", api_base="http://0.0.0.0:8000", api_key="anything")

experts = guidance('''
{{#system~}}
You are a helpful and terse assistant.
{{~/system}}

{{#user~}}
I want a response to the following question:
{{query}}
Name 3 world-class experts (past or present) who would be great at answering this?
Don't answer the question yet.
{{~/user}}

{{#assistant~}}
{{gen 'expert_names' temperature=0 max_tokens=300}}
{{~/assistant}}
''', llm=gpt4)

result = experts(query='How can I be more productive?')
print(result)
```
</TabItem>
</Tabs>

## Debugging Proxy

Events that occur during normal operation

```shell
litellm --model gpt-3.5-turbo --debug
```

Detailed information

```shell
litellm --model gpt-3.5-turbo --detailed_debug
```

### Set Debug Level using env variables

Events that occur during normal operation

```shell
export LITELLM_LOG=INFO
```

Detailed information

```shell
export LITELLM_LOG=DEBUG
```

No Logs

```shell
export LITELLM_LOG=None
```