phoenix/litellm

Fork 0

forked from phoenix/litellm-mirror

Krrish Dholakia 8963a507f7 docs(quick_start.md): add disclaimer

2024-04-20 08:49:14 -07:00

13 KiB

Raw Blame History

import Image from '@theme/IdealImage'; import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem';

Quick Start

Quick start CLI, Config, Docker

LiteLLM Server manages:

Unified Interface: Calling 100+ LLMs Huggingface/Bedrock/TogetherAI/etc. in the OpenAI ChatCompletions & Completions format
Cost tracking: Authentication, Spend Tracking & Budgets Virtual Keys
Load Balancing: between Multiple Models + Deployments of the same model - LiteLLM proxy can handle 1.5k+ requests/second during load tests.

$ pip install 'litellm[proxy]'

Quick Start - LiteLLM Proxy CLI

Run the following command to start the litellm proxy

$ litellm --model huggingface/bigcode/starcoder

#INFO: Proxy running on http://0.0.0.0:4000

Test

In a new shell, run, this will make an openai.chat.completions request. Ensure you're using openai v1.0.0+

litellm --test

This will now automatically route any requests for gpt-3.5-turbo to bigcode starcoder, hosted on huggingface inference endpoints.

Supported LLMs

All LiteLLM supported LLMs are supported on the Proxy. Seel all supported llms

$ export AWS_ACCESS_KEY_ID=
$ export AWS_REGION_NAME=
$ export AWS_SECRET_ACCESS_KEY=

$ litellm --model bedrock/anthropic.claude-v2

$ export AZURE_API_KEY=my-api-key
$ export AZURE_API_BASE=my-api-base

$ litellm --model azure/my-deployment-name

$ export OPENAI_API_KEY=my-api-key

$ litellm --model gpt-3.5-turbo

$ litellm --model ollama/<ollama-model-name>

$ export OPENAI_API_KEY=my-api-key

$ litellm --model openai/<your model name> --api_base <your-api-base> # e.g. http://0.0.0.0:3000

$ export VERTEX_PROJECT="hardy-project"
$ export VERTEX_LOCATION="us-west"

$ litellm --model vertex_ai/gemini-pro

$ export HUGGINGFACE_API_KEY=my-api-key #[OPTIONAL]

$ litellm --model huggingface/<your model name> --api_base <your-api-base> # e.g. http://0.0.0.0:3000

$ litellm --model huggingface/<your model name> --api_base http://0.0.0.0:8001

export AWS_ACCESS_KEY_ID=
export AWS_REGION_NAME=
export AWS_SECRET_ACCESS_KEY=

$ litellm --model sagemaker/jumpstart-dft-meta-textgeneration-llama-2-7b

$ export ANTHROPIC_API_KEY=my-api-key

$ litellm --model claude-instant-1

Assuming you're running vllm locally

$ litellm --model vllm/facebook/opt-125m

$ export TOGETHERAI_API_KEY=my-api-key

$ litellm --model together_ai/lmsys/vicuna-13b-v1.5-16k

$ export REPLICATE_API_KEY=my-api-key

$ litellm \
  --model replicate/meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3

$ litellm --model petals/meta-llama/Llama-2-70b-chat-hf

$ export PALM_API_KEY=my-palm-key

$ litellm --model palm/chat-bison

$ export AI21_API_KEY=my-api-key

$ litellm --model j2-light

$ export COHERE_API_KEY=my-api-key

$ litellm --model command-nightly

Quick Start - LiteLLM Proxy + Config.yaml

The config allows you to create a model list and set api_base, max_tokens (all litellm params). See more details about the config here

Create a Config for LiteLLM Proxy

Example config

model_list: 
  - model_name: gpt-3.5-turbo # user-facing model alias
    litellm_params: # all params accepted by litellm.completion() - https://docs.litellm.ai/docs/completion/input
      model: azure/<your-deployment-name>
      api_base: <your-azure-api-endpoint>
      api_key: <your-azure-api-key>
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/gpt-turbo-small-ca
      api_base: https://my-endpoint-canada-berri992.openai.azure.com/
      api_key: <your-azure-api-key>
  - model_name: vllm-model
    litellm_params:
      model: openai/<your-model-name>
      api_base: <your-api-base> # e.g. http://0.0.0.0:3000

Run proxy with config

litellm --config your_config.yaml

Using LiteLLM Proxy - Curl Request, OpenAI Package, Langchain

curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--data ' {
      "model": "gpt-3.5-turbo",
      "messages": [
        {
          "role": "user",
          "content": "what llm are you"
        }
      ]
    }
'

import openai
client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

# request sent to model set on litellm proxy, `litellm --model`
response = client.chat.completions.create(model="gpt-3.5-turbo", messages = [
    {
        "role": "user",
        "content": "this is a test request, write a short poem"
    }
])

print(response)

from langchain.chat_models import ChatOpenAI
from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
    SystemMessagePromptTemplate,
)
from langchain.schema import HumanMessage, SystemMessage

chat = ChatOpenAI(
    openai_api_base="http://0.0.0.0:4000", # set openai_api_base to the LiteLLM Proxy
    model = "gpt-3.5-turbo",
    temperature=0.1
)

messages = [
    SystemMessage(
        content="You are a helpful assistant that im using to make a test request to."
    ),
    HumanMessage(
        content="test from litellm. tell me why it's amazing in 1 sentence"
    ),
]
response = chat(messages)

print(response)

from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="sagemaker-embeddings", openai_api_base="http://0.0.0.0:4000", openai_api_key="temp-key")


text = "This is a test document."

query_result = embeddings.embed_query(text)

print(f"SAGEMAKER EMBEDDINGS")
print(query_result[:5])

embeddings = OpenAIEmbeddings(model="bedrock-embeddings", openai_api_base="http://0.0.0.0:4000", openai_api_key="temp-key")

text = "This is a test document."

query_result = embeddings.embed_query(text)

print(f"BEDROCK EMBEDDINGS")
print(query_result[:5])

embeddings = OpenAIEmbeddings(model="bedrock-titan-embeddings", openai_api_base="http://0.0.0.0:4000", openai_api_key="temp-key")

text = "This is a test document."

query_result = embeddings.embed_query(text)

print(f"TITAN EMBEDDINGS")
print(query_result[:5])

This is not recommended. There is duplicate logic as the proxy also uses the sdk, which might lead to unexpected errors.

from litellm import completion 

response = completion(
    model="openai/gpt-3.5-turbo", 
    messages = [
        {
            "role": "user",
            "content": "this is a test request, write a short poem"
        }
    ], 
    api_key="anything", 
    base_url="http://0.0.0.0:4000"
    )

print(response)

More Info

📖 Proxy Endpoints - Swagger Docs

POST /chat/completions - chat completions endpoint to call 100+ LLMs
POST /completions - completions endpoint
POST /embeddings - embedding endpoint for Azure, OpenAI, Huggingface endpoints
GET /models - available models on server
POST /key/generate - generate a key to access the proxy

Using with OpenAI compatible projects

Set base_url to the LiteLLM Proxy server

import openai
client = openai.OpenAI(
    api_key="anything",
    base_url="http://0.0.0.0:4000"
)

# request sent to model set on litellm proxy, `litellm --model`
response = client.chat.completions.create(model="gpt-3.5-turbo", messages = [
    {
        "role": "user",
        "content": "this is a test request, write a short poem"
    }
])

print(response)

Start the LiteLLM proxy

litellm --model gpt-3.5-turbo

#INFO: Proxy running on http://0.0.0.0:4000

1. Clone the repo

git clone https://github.com/danny-avila/LibreChat.git

2. Modify Librechat's `docker-compose.yml`

LiteLLM Proxy is running on port 4000, set 4000 as the proxy below

OPENAI_REVERSE_PROXY=http://host.docker.internal:4000/v1/chat/completions

3. Save fake OpenAI key in Librechat's `.env`

Copy Librechat's .env.example to .env and overwrite the default OPENAI_API_KEY (by default it requires the user to pass a key).

OPENAI_API_KEY=sk-1234

4. Run LibreChat:

docker compose up

Continue-Dev brings ChatGPT to VSCode. See how to install it here.

In the config.py set this as your default model.

  default=OpenAI(
      api_key="IGNORED",
      model="fake-model-name",
      context_length=2048, # customize if needed for your model
      api_base="http://localhost:4000" # your proxy server url
  ),

Credits @vividfog for this tutorial.

$ pip install aider 

$ aider --openai-api-base http://0.0.0.0:4000 --openai-api-key fake-key

pip install pyautogen

from autogen import AssistantAgent, UserProxyAgent, oai
config_list=[
    {
        "model": "my-fake-model",
        "api_base": "http://localhost:4000",  #litellm compatible endpoint
        "api_type": "open_ai",
        "api_key": "NULL", # just a placeholder
    }
]

response = oai.Completion.create(config_list=config_list, prompt="Hi")
print(response) # works fine

llm_config={
    "config_list": config_list,
}

assistant = AssistantAgent("assistant", llm_config=llm_config)
user_proxy = UserProxyAgent("user_proxy")
user_proxy.initiate_chat(assistant, message="Plot a chart of META and TESLA stock price change YTD.", config_list=config_list)

Credits @victordibia for this tutorial.

A guidance language for controlling large language models. https://github.com/guidance-ai/guidance

NOTE: Guidance sends additional params like stop_sequences which can cause some models to fail if they don't support it.

Fix: Start your proxy using the --drop_params flag

litellm --model ollama/codellama --temperature 0.3 --max_tokens 2048 --drop_params

import guidance

# set api_base to your proxy
# set api_key to anything
gpt4 = guidance.llms.OpenAI("gpt-4", api_base="http://0.0.0.0:4000", api_key="anything")

experts = guidance('''
{{#system~}}
You are a helpful and terse assistant.
{{~/system}}

{{#user~}}
I want a response to the following question:
{{query}}
Name 3 world-class experts (past or present) who would be great at answering this?
Don't answer the question yet.
{{~/user}}

{{#assistant~}}
{{gen 'expert_names' temperature=0 max_tokens=300}}
{{~/assistant}}
''', llm=gpt4)

result = experts(query='How can I be more productive?')
print(result)

Debugging Proxy

Events that occur during normal operation

litellm --model gpt-3.5-turbo --debug

Detailed information

litellm --model gpt-3.5-turbo --detailed_debug

Set Debug Level using env variables