import Image from '@theme/IdealImage';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Quick Start

Quick start CLI, Config, Docker

LiteLLM Server manages:

* **Unified Interface**: Calling 100+ LLMs [Huggingface/Bedrock/TogetherAI/etc.](#other-supported-models) in the OpenAI `ChatCompletions` & `Completions` format
* **Cost tracking**: Authentication, Spend Tracking & Budgets [Virtual Keys](https://docs.litellm.ai/docs/proxy/virtual_keys)
* **Load Balancing**: between [Multiple Models](#multiple-models---quick-start) + [Deployments of the same model](#multiple-instances-of-1-model) - LiteLLM proxy can handle 1.5k+ requests/second during load tests.
```shell
$ pip install 'litellm[proxy]'
```
## Quick Start - LiteLLM Proxy CLI

Run the following command to start the litellm proxy:

```shell
$ litellm --model huggingface/bigcode/starcoder

#INFO: Proxy running on http://0.0.0.0:4000
```
### Test

In a new shell, run the following command. It makes an `openai.chat.completions` request to the proxy. Ensure you're using openai v1.0.0+.

```shell
litellm --test
```

This will automatically route any request for gpt-3.5-turbo to bigcode starcoder, hosted on Huggingface Inference Endpoints.
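You can also exercise the same routing by hand. A minimal `curl` sketch, assuming the proxy above is still running on the default `http://0.0.0.0:4000`: the request names `gpt-3.5-turbo`, and the proxy forwards it to the `huggingface/bigcode/starcoder` model it was started with.

```shell
# request names gpt-3.5-turbo; the proxy routes it to the model set via `litellm --model`
curl --location 'http://0.0.0.0:4000/chat/completions' \
    --header 'Content-Type: application/json' \
    --data '{
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": "say hello in one sentence"}]
    }'
```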
### Supported LLMs

All LiteLLM supported LLMs are supported on the Proxy. See all [supported LLMs](https://docs.litellm.ai/docs/providers).

<Tabs>
<TabItem value="bedrock" label="AWS Bedrock">

```shell
$ export AWS_ACCESS_KEY_ID=
$ export AWS_REGION_NAME=
$ export AWS_SECRET_ACCESS_KEY=
```

```shell
$ litellm --model bedrock/anthropic.claude-v2
```

</TabItem>
<TabItem value="azure" label="Azure OpenAI">
|
|
|
|
```shell
|
|
$ export AZURE_API_KEY=my-api-key
|
|
$ export AZURE_API_BASE=my-api-base
|
|
```
|
|
```
|
|
$ litellm --model azure/my-deployment-name
|
|
```
|
|
|
|
</TabItem>
|
|
<TabItem value="openai" label="OpenAI">
|
|
|
|
```shell
|
|
$ export OPENAI_API_KEY=my-api-key
|
|
```
|
|
|
|
```shell
|
|
$ litellm --model gpt-3.5-turbo
|
|
```
|
|
</TabItem>
|
|
<TabItem value="ollama" label="Ollama">
|
|
|
|
```
|
|
$ litellm --model ollama/<ollama-model-name>
|
|
```
|
|
|
|
</TabItem>
|
|
<TabItem value="openai-proxy" label="OpenAI Compatible Endpoint">
|
|
|
|
```shell
|
|
$ export OPENAI_API_KEY=my-api-key
|
|
```
|
|
|
|
```shell
|
|
$ litellm --model openai/<your model name> --api_base <your-api-base> # e.g. http://0.0.0.0:3000
|
|
```
|
|
</TabItem>
|
|
|
|
<TabItem value="vertex-ai" label="Vertex AI [Gemini]">
|
|
|
|
```shell
|
|
$ export VERTEX_PROJECT="hardy-project"
|
|
$ export VERTEX_LOCATION="us-west"
|
|
```
|
|
|
|
```shell
|
|
$ litellm --model vertex_ai/gemini-pro
|
|
```
|
|
</TabItem>
|
|
|
|
<TabItem value="huggingface" label="Huggingface (TGI) Deployed">
|
|
|
|
```shell
|
|
$ export HUGGINGFACE_API_KEY=my-api-key #[OPTIONAL]
|
|
```
|
|
```shell
|
|
$ litellm --model huggingface/<your model name> --api_base <your-api-base> # e.g. http://0.0.0.0:3000
|
|
```
|
|
|
|
</TabItem>
|
|
<TabItem value="huggingface-local" label="Huggingface (TGI) Local">
|
|
|
|
```shell
|
|
$ litellm --model huggingface/<your model name> --api_base http://0.0.0.0:8001
|
|
```
|
|
|
|
</TabItem>
|
|
<TabItem value="aws-sagemaker" label="AWS Sagemaker">
|
|
|
|
```shell
|
|
export AWS_ACCESS_KEY_ID=
|
|
export AWS_REGION_NAME=
|
|
export AWS_SECRET_ACCESS_KEY=
|
|
```
|
|
|
|
```shell
|
|
$ litellm --model sagemaker/jumpstart-dft-meta-textgeneration-llama-2-7b
|
|
```
|
|
|
|
</TabItem>
|
|
<TabItem value="anthropic" label="Anthropic">
|
|
|
|
```shell
|
|
$ export ANTHROPIC_API_KEY=my-api-key
|
|
```
|
|
```shell
|
|
$ litellm --model claude-instant-1
|
|
```
|
|
|
|
</TabItem>
|
|
<TabItem value="vllm-local" label="VLLM">
|
|
Assuming you're running vllm locally
|
|
|
|
```shell
|
|
$ litellm --model vllm/facebook/opt-125m
|
|
```
|
|
</TabItem>
|
|
<TabItem value="together_ai" label="TogetherAI">
|
|
|
|
```shell
|
|
$ export TOGETHERAI_API_KEY=my-api-key
|
|
```
|
|
```shell
|
|
$ litellm --model together_ai/lmsys/vicuna-13b-v1.5-16k
|
|
```
|
|
|
|
</TabItem>
|
|
|
|
<TabItem value="replicate" label="Replicate">
|
|
|
|
```shell
|
|
$ export REPLICATE_API_KEY=my-api-key
|
|
```
|
|
```shell
|
|
$ litellm \
|
|
--model replicate/meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3
|
|
```
|
|
|
|
</TabItem>
|
|
|
|
<TabItem value="petals" label="Petals">
|
|
|
|
```shell
|
|
$ litellm --model petals/meta-llama/Llama-2-70b-chat-hf
|
|
```
|
|
|
|
</TabItem>
|
|
|
|
<TabItem value="palm" label="Palm">
|
|
|
|
```shell
|
|
$ export PALM_API_KEY=my-palm-key
|
|
```
|
|
```shell
|
|
$ litellm --model palm/chat-bison
|
|
```
|
|
|
|
</TabItem>
|
|
|
|
<TabItem value="ai21" label="AI21">
|
|
|
|
```shell
|
|
$ export AI21_API_KEY=my-api-key
|
|
```
|
|
|
|
```shell
|
|
$ litellm --model j2-light
|
|
```
|
|
|
|
</TabItem>
|
|
|
|
<TabItem value="cohere" label="Cohere">
|
|
|
|
```shell
|
|
$ export COHERE_API_KEY=my-api-key
|
|
```
|
|
|
|
```shell
|
|
$ litellm --model command-nightly
|
|
```
|
|
|
|
</TabItem>
|
|
|
|
</Tabs>
|
|
|
|
## Quick Start - LiteLLM Proxy + Config.yaml

The config allows you to create a model list and set `api_base`, `max_tokens` (all litellm params). See more details about the config [here](https://docs.litellm.ai/docs/proxy/configs).

### Create a Config for LiteLLM Proxy

Example config:

```yaml
model_list:
  - model_name: gpt-3.5-turbo # user-facing model alias
    litellm_params: # all params accepted by litellm.completion() - https://docs.litellm.ai/docs/completion/input
      model: azure/<your-deployment-name>
      api_base: <your-azure-api-endpoint>
      api_key: <your-azure-api-key>
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/gpt-turbo-small-ca
      api_base: https://my-endpoint-canada-berri992.openai.azure.com/
      api_key: <your-azure-api-key>
  - model_name: vllm-model
    litellm_params:
      model: openai/<your-model-name>
      api_base: <your-api-base> # e.g. http://0.0.0.0:3000
```

### Run proxy with config

```shell
litellm --config your_config.yaml
```
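Clients call the user-facing `model_name` alias from the config, and LiteLLM resolves it to the matching `litellm_params`. A minimal sketch, assuming the proxy started above is listening on the default port `4000`: because two entries share the `gpt-3.5-turbo` alias, requests for it are balanced across both Azure deployments.

```shell
# "gpt-3.5-turbo" is the alias from config.yaml, not the underlying deployment name
curl --location 'http://0.0.0.0:4000/chat/completions' \
    --header 'Content-Type: application/json' \
    --data '{
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": "which deployment handled this?"}]
    }'
```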
## Using LiteLLM Proxy - Curl Request, OpenAI Package, Langchain

<Tabs>
<TabItem value="Curl" label="Curl Request">

```shell
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model": "gpt-3.5-turbo",
    "messages": [
        {
            "role": "user",
            "content": "what llm are you"
        }
    ]
}'
```

</TabItem>
<TabItem value="openai" label="OpenAI v1.0.0+">
|
|
|
|
```python
|
|
import openai
|
|
client = openai.OpenAI(
|
|
api_key="anything",
|
|
base_url="http://0.0.0.0:4000"
|
|
)
|
|
|
|
# request sent to model set on litellm proxy, `litellm --model`
|
|
response = client.chat.completions.create(model="gpt-3.5-turbo", messages = [
|
|
{
|
|
"role": "user",
|
|
"content": "this is a test request, write a short poem"
|
|
}
|
|
])
|
|
|
|
print(response)
|
|
|
|
```
|
|
</TabItem>
|
|
<TabItem value="litellm" label="LiteLLM SDK">
|
|
|
|
```python
|
|
from litellm import completion
|
|
|
|
response = completion(
|
|
model="openai/gpt-3.5-turbo",
|
|
messages = [
|
|
{
|
|
"role": "user",
|
|
"content": "this is a test request, write a short poem"
|
|
}
|
|
],
|
|
api_key="anything",
|
|
base_url="http://0.0.0.0:4000"
|
|
)
|
|
|
|
print(response)
|
|
|
|
```
|
|
</TabItem>
|
|
<TabItem value="langchain" label="Langchain">
|
|
|
|
```python
|
|
from langchain.chat_models import ChatOpenAI
|
|
from langchain.prompts.chat import (
|
|
ChatPromptTemplate,
|
|
HumanMessagePromptTemplate,
|
|
SystemMessagePromptTemplate,
|
|
)
|
|
from langchain.schema import HumanMessage, SystemMessage
|
|
|
|
chat = ChatOpenAI(
|
|
openai_api_base="http://0.0.0.0:4000", # set openai_api_base to the LiteLLM Proxy
|
|
model = "gpt-3.5-turbo",
|
|
temperature=0.1
|
|
)
|
|
|
|
messages = [
|
|
SystemMessage(
|
|
content="You are a helpful assistant that im using to make a test request to."
|
|
),
|
|
HumanMessage(
|
|
content="test from litellm. tell me why it's amazing in 1 sentence"
|
|
),
|
|
]
|
|
response = chat(messages)
|
|
|
|
print(response)
|
|
```
|
|
|
|
</TabItem>
|
|
<TabItem value="langchain-embedding" label="Langchain Embeddings">
|
|
|
|
```python
|
|
from langchain.embeddings import OpenAIEmbeddings
|
|
|
|
embeddings = OpenAIEmbeddings(model="sagemaker-embeddings", openai_api_base="http://0.0.0.0:4000", openai_api_key="temp-key")
|
|
|
|
|
|
text = "This is a test document."
|
|
|
|
query_result = embeddings.embed_query(text)
|
|
|
|
print(f"SAGEMAKER EMBEDDINGS")
|
|
print(query_result[:5])
|
|
|
|
embeddings = OpenAIEmbeddings(model="bedrock-embeddings", openai_api_base="http://0.0.0.0:4000", openai_api_key="temp-key")
|
|
|
|
text = "This is a test document."
|
|
|
|
query_result = embeddings.embed_query(text)
|
|
|
|
print(f"BEDROCK EMBEDDINGS")
|
|
print(query_result[:5])
|
|
|
|
embeddings = OpenAIEmbeddings(model="bedrock-titan-embeddings", openai_api_base="http://0.0.0.0:4000", openai_api_key="temp-key")
|
|
|
|
text = "This is a test document."
|
|
|
|
query_result = embeddings.embed_query(text)
|
|
|
|
print(f"TITAN EMBEDDINGS")
|
|
print(query_result[:5])
|
|
```
|
|
</TabItem>
|
|
</Tabs>
|
|
|
|
[**More Info**](./configs.md)
|
|
|
|
|
|
|
|
## 📖 Proxy Endpoints - [Swagger Docs](https://litellm-api.up.railway.app/)

- POST `/chat/completions` - chat completions endpoint to call 100+ LLMs
- POST `/completions` - completions endpoint
- POST `/embeddings` - embedding endpoint for Azure, OpenAI, Huggingface endpoints
- GET `/models` - available models on server
- POST `/key/generate` - generate a key to access the proxy
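For example, the sketches below list the models the proxy is serving and generate a virtual key. They assume the proxy is on the default `http://0.0.0.0:4000` and that a master key of `sk-1234` has been configured (see [Virtual Keys](https://docs.litellm.ai/docs/proxy/virtual_keys)); adjust both to your setup.

```shell
# list the models available on the proxy (Authorization header assumes a master key is set)
curl http://0.0.0.0:4000/models \
    --header 'Authorization: Bearer sk-1234'

# generate a virtual key scoped to one model alias (authenticated with the master key)
curl http://0.0.0.0:4000/key/generate \
    --header 'Authorization: Bearer sk-1234' \
    --header 'Content-Type: application/json' \
    --data '{"models": ["gpt-3.5-turbo"]}'
```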
## Using with OpenAI compatible projects

Set `base_url` to the LiteLLM Proxy server.

<Tabs>
<TabItem value="openai" label="OpenAI v1.0.0+">

```python
import openai

client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

# request sent to model set on litellm proxy, `litellm --model`
response = client.chat.completions.create(model="gpt-3.5-turbo", messages=[
    {
        "role": "user",
        "content": "this is a test request, write a short poem"
    }
])

print(response)
```

</TabItem>
<TabItem value="librechat" label="LibreChat">
|
|
|
|
#### Start the LiteLLM proxy
|
|
```shell
|
|
litellm --model gpt-3.5-turbo
|
|
|
|
#INFO: Proxy running on http://0.0.0.0:4000
|
|
```
|
|
|
|
#### 1. Clone the repo
|
|
|
|
```shell
|
|
git clone https://github.com/danny-avila/LibreChat.git
|
|
```
|
|
|
|
|
|
#### 2. Modify Librechat's `docker-compose.yml`
|
|
LiteLLM Proxy is running on port `4000`, set `4000` as the proxy below
|
|
```yaml
|
|
OPENAI_REVERSE_PROXY=http://host.docker.internal:4000/v1/chat/completions
|
|
```
|
|
|
|
#### 3. Save fake OpenAI key in Librechat's `.env`
|
|
|
|
Copy Librechat's `.env.example` to `.env` and overwrite the default OPENAI_API_KEY (by default it requires the user to pass a key).
|
|
```env
|
|
OPENAI_API_KEY=sk-1234
|
|
```
|
|
|
|
#### 4. Run LibreChat:
|
|
```shell
|
|
docker compose up
|
|
```
|
|
</TabItem>
|
|
|
|
<TabItem value="continue-dev" label="ContinueDev">
|
|
|
|
Continue-Dev brings ChatGPT to VSCode. See how to [install it here](https://continue.dev/docs/quickstart).
|
|
|
|
In the [config.py](https://continue.dev/docs/reference/Models/openai) set this as your default model.
|
|
```python
|
|
default=OpenAI(
|
|
api_key="IGNORED",
|
|
model="fake-model-name",
|
|
context_length=2048, # customize if needed for your model
|
|
api_base="http://localhost:4000" # your proxy server url
|
|
),
|
|
```
|
|
|
|
Credits [@vividfog](https://github.com/ollama/ollama/issues/305#issuecomment-1751848077) for this tutorial.
|
|
</TabItem>
|
|
|
|
<TabItem value="aider" label="Aider">
|
|
|
|
```shell
|
|
$ pip install aider
|
|
|
|
$ aider --openai-api-base http://0.0.0.0:4000 --openai-api-key fake-key
|
|
```
|
|
</TabItem>
|
|
<TabItem value="autogen" label="AutoGen">
|
|
|
|
```python
|
|
pip install pyautogen
|
|
```
|
|
|
|
```python
|
|
from autogen import AssistantAgent, UserProxyAgent, oai
|
|
config_list=[
|
|
{
|
|
"model": "my-fake-model",
|
|
"api_base": "http://localhost:4000", #litellm compatible endpoint
|
|
"api_type": "open_ai",
|
|
"api_key": "NULL", # just a placeholder
|
|
}
|
|
]
|
|
|
|
response = oai.Completion.create(config_list=config_list, prompt="Hi")
|
|
print(response) # works fine
|
|
|
|
llm_config={
|
|
"config_list": config_list,
|
|
}
|
|
|
|
assistant = AssistantAgent("assistant", llm_config=llm_config)
|
|
user_proxy = UserProxyAgent("user_proxy")
|
|
user_proxy.initiate_chat(assistant, message="Plot a chart of META and TESLA stock price change YTD.", config_list=config_list)
|
|
```
|
|
|
|
Credits [@victordibia](https://github.com/microsoft/autogen/issues/45#issuecomment-1749921972) for this tutorial.
|
|
</TabItem>
|
|
|
|
<TabItem value="guidance" label="guidance">
|
|
A guidance language for controlling large language models.
|
|
https://github.com/guidance-ai/guidance
|
|
|
|
**NOTE:** Guidance sends additional params like `stop_sequences` which can cause some models to fail if they don't support it.
|
|
|
|
**Fix**: Start your proxy using the `--drop_params` flag
|
|
|
|
```shell
|
|
litellm --model ollama/codellama --temperature 0.3 --max_tokens 2048 --drop_params
|
|
```
|
|
|
|
```python
|
|
import guidance
|
|
|
|
# set api_base to your proxy
|
|
# set api_key to anything
|
|
gpt4 = guidance.llms.OpenAI("gpt-4", api_base="http://0.0.0.0:4000", api_key="anything")
|
|
|
|
experts = guidance('''
|
|
{{#system~}}
|
|
You are a helpful and terse assistant.
|
|
{{~/system}}
|
|
|
|
{{#user~}}
|
|
I want a response to the following question:
|
|
{{query}}
|
|
Name 3 world-class experts (past or present) who would be great at answering this?
|
|
Don't answer the question yet.
|
|
{{~/user}}
|
|
|
|
{{#assistant~}}
|
|
{{gen 'expert_names' temperature=0 max_tokens=300}}
|
|
{{~/assistant}}
|
|
''', llm=gpt4)
|
|
|
|
result = experts(query='How can I be more productive?')
|
|
print(result)
|
|
```
|
|
</TabItem>
|
|
</Tabs>
|
|
|
|
## Debugging Proxy

Events that occur during normal operation:

```shell
litellm --model gpt-3.5-turbo --debug
```

Detailed information:

```shell
litellm --model gpt-3.5-turbo --detailed_debug
```

### Set Debug Level using env variables

Events that occur during normal operation:

```shell
export LITELLM_LOG=INFO
```

Detailed information:

```shell
export LITELLM_LOG=DEBUG
```

No logs:

```shell
export LITELLM_LOG=None
```