almost working llmonitor

This commit is contained in:
Vince Lwt 2023-08-21 16:26:47 +02:00
parent 22c7e38de5
commit 3675d3e029
5 changed files with 425 additions and 326 deletions


@@ -1,6 +1,7 @@
# liteLLM Proxy Server: 50+ LLM Models, Error Handling, Caching
### Azure, Llama2, OpenAI, Claude, Hugging Face, Replicate Models
[![PyPI Version](https://img.shields.io/pypi/v/litellm.svg)](https://pypi.org/project/litellm/)
[![PyPI Version](https://img.shields.io/badge/stable%20version-v0.1.345-blue?color=green&link=https://pypi.org/project/litellm/0.1.1/)](https://pypi.org/project/litellm/0.1.1/)
![Downloads](https://img.shields.io/pypi/dm/litellm)
@@ -11,34 +12,36 @@
![4BC6491E-86D0-4833-B061-9F54524B2579](https://github.com/BerriAI/litellm/assets/17561003/f5dd237b-db5e-42e1-b1ac-f05683b1d724)

## What does liteLLM proxy do
- Make `/chat/completions` requests for 50+ LLM models **Azure, OpenAI, Replicate, Anthropic, Hugging Face**

  Example: for `model` use `claude-2`, `gpt-3.5`, `gpt-4`, `command-nightly`, `stabilityai/stablecode-completion-alpha-3b-4k`

  ```json
  {
    "model": "replicate/llama-2-70b-chat:2c1608e18606fad2812020dc541930f2d0495ce32eee50074220b87300bc16e1",
    "messages": [
      {
        "content": "Hello, whats the weather in San Francisco??",
        "role": "user"
      }
    ]
  }
  ```
- **Consistent Input/Output** Format
  - Call all models using the OpenAI format - `completion(model, messages)`
  - Text responses will always be available at `['choices'][0]['message']['content']`
- **Error Handling** Using Model Fallbacks (if `GPT-4` fails, try `llama2`)
- **Logging** - Log Requests, Responses and Errors to `Supabase`, `Posthog`, `Mixpanel`, `Sentry`, `LLMonitor`, `Helicone` (Any of the supported providers here: https://litellm.readthedocs.io/en/latest/advanced/)

  **Example: Logs sent to Supabase**
  <img width="1015" alt="Screenshot 2023-08-11 at 4 02 46 PM" src="https://github.com/ishaan-jaff/proxy-server/assets/29436595/237557b8-ba09-4917-982c-8f3e1b2c8d08">

- **Token Usage & Spend** - Track Input + Completion tokens used + Spend/model
- **Caching** - Implementation of Semantic Caching
- **Streaming & Async Support** - Return generators to stream text responses
## API Endpoints

### `/chat/completions` (POST)
@@ -46,34 +49,37 @@
This endpoint is used to generate chat completions for 50+ supported LLM API models. Use llama2, GPT-4, Claude2 etc.

#### Input
This API endpoint accepts all inputs in raw JSON and expects the following inputs
- `model` (string, required): ID of the model to use for chat completions. See all supported models [here](https://litellm.readthedocs.io/en/latest/supported/),
  e.g. `gpt-3.5-turbo`, `gpt-4`, `claude-2`, `command-nightly`, `stabilityai/stablecode-completion-alpha-3b-4k`
- `messages` (array, required): A list of messages representing the conversation context. Each message should have a `role` (system, user, assistant, or function), `content` (message text), and `name` (for function role).
- Additional Optional parameters: `temperature`, `functions`, `function_call`, `top_p`, `n`, `stream`. See the full list of supported inputs here: https://litellm.readthedocs.io/en/latest/input/

#### Example JSON body
For claude-2
```json
{
  "model": "claude-2",
  "messages": [
    {
      "content": "Hello, whats the weather in San Francisco??",
      "role": "user"
    }
  ]
}
```

### Making an API request to the Proxy Server
```python
import requests
import json

# TODO: use your URL
url = "http://localhost:5000/chat/completions"
payload = json.dumps({
@@ -94,34 +100,38 @@ print(response.text)
```
### Output [Response Format]
All responses from the server are returned in the following format (for all LLM models). More info on output here: https://litellm.readthedocs.io/en/latest/output/
```json
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "I'm sorry, but I don't have the capability to provide real-time weather information. However, you can easily check the weather in San Francisco by searching online or using a weather app on your phone.",
        "role": "assistant"
      }
    }
  ],
  "created": 1691790381,
  "id": "chatcmpl-7mUFZlOEgdohHRDx2UpYPRTejirzb",
  "model": "gpt-3.5-turbo-0613",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 41,
    "prompt_tokens": 16,
    "total_tokens": 57
  }
}
```

## Installation & Usage
### Running Locally
1. Clone liteLLM repository to your local machine:
   ```
   git clone https://github.com/BerriAI/liteLLM-proxy
@@ -141,24 +151,24 @@ All responses from the server are returned in the following format (for all LLM
   python main.py
   ```

## Deploying
1. Quick Start: Deploy on Railway

   [![Deploy on Railway](https://railway.app/button.svg)](https://railway.app/template/DYqQAW?referralCode=t3ukrU)

2. `GCP`, `AWS`, `Azure`
   This project includes a `Dockerfile` allowing you to build and deploy a Docker project on your providers

# Support / Talk with founders
- [Our calendar 👋](https://calendly.com/d/4mp-gd3-k5k/berriai-1-1-onboarding-litellm-hosted-version)
- [Community Discord 💭](https://discord.gg/wuPM9dRgDw)
- Our numbers 📞 +1 (770) 8783-106 / +1 (412) 618-6238
- Our emails ✉️ ishaan@berri.ai / krrish@berri.ai

## Roadmap
- [ ] Support hosted db (e.g. Supabase)
- [ ] Easily send data to places like posthog and sentry.
- [ ] Add a hot-cache for project spend logs - enables fast checks for user + project limitings


@@ -5,6 +5,7 @@ import traceback
import dotenv
import os
import requests

dotenv.load_dotenv()  # Loading env variables using dotenv
@@ -14,45 +15,34 @@ class LLMonitorLogger:
        # Instance variables
        self.api_url = os.getenv(
            "LLMONITOR_API_URL") or "https://app.llmonitor.com"
        self.app_id = os.getenv("LLMONITOR_APP_ID")

    def log_event(self, type, run_id, error, usage, model, messages,
                  response_obj, user_id, time, print_verbose):
        # Method definition
        try:
            print_verbose(
                f"LLMonitor Logging - Enters logging function for model {model}"
            )

            print(type, model, messages, response_obj, time, end_user)

            headers = {'Content-Type': 'application/json'}

            data = {
                "type": "llm",
                "name": model,
                "runId": run_id,
                "app": self.app_id,
                "error": error,
                "event": type,
                "timestamp": time.isoformat(),
                "userId": user_id,
                "input": messages,
                "output": response_obj['choices'][0]['message']['content'],
            }

            print_verbose(f"LLMonitor Logging - final data object: {data}")
            # response = requests.post(url, headers=headers, json=data)
        except:
            # traceback.print_exc()
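The POST itself is still commented out in this commit. Below is a rough, standalone sketch of what shipping the assembled event could look like; the `/api/report` path and the placeholder field values are assumptions for illustration only — just the base URL, header, and payload keys come from the integration above.

```python
import os
import requests

# Sketch only (not part of the commit): posting an event like the one built in log_event().
api_url = os.getenv("LLMONITOR_API_URL") or "https://app.llmonitor.com"
headers = {"Content-Type": "application/json"}
data = {
    "type": "llm",
    "name": "command-nightly",           # placeholder model name
    "runId": "example-run-id",           # placeholder run id
    "app": os.getenv("LLMONITOR_APP_ID"),
    "event": "end",
    "input": [{"role": "user", "content": "Hello"}],
    "output": "Hi there!",
}
# "/api/report" is an assumed ingestion path; the commit leaves the real endpoint commented out.
response = requests.post(f"{api_url}/api/report", headers=headers, json=data)
print(f"LLMonitor event POST returned {response.status_code}")
```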


@@ -1,28 +1,36 @@
#### What this tests ####
# This tests if logging to the llmonitor integration actually works

import sys
import os
import traceback
import pytest

# Adds the parent directory to the system path
sys.path.insert(0, os.path.abspath('../..'))

from litellm import completion
import litellm

litellm.input_callback = ["llmonitor"]
litellm.success_callback = ["llmonitor"]
litellm.error_callback = ["llmonitor"]

litellm.set_verbose = True

user_message = "Hello, how are you?"
messages = [{"content": user_message, "role": "user"}]

# openai call
# response = completion(model="gpt-3.5-turbo",
#                       messages=[{
#                           "role": "user",
#                           "content": "Hi 👋 - i'm openai"
#                       }])
# print(response)

# #bad request call
# response = completion(model="chatgpt-test", messages=[{"role": "user", "content": "Hi 👋 - i'm a bad request"}])

# cohere call
response = completion(model="command-nightly",
                      messages=[{
                          "role": "user",
                          "content": "Hi 👋 - i'm cohere"
                      }])
print(response)
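For the test above to report anywhere, the logger reads its configuration from the environment. A minimal sketch of the variables involved follows; the LLMonitor names come straight from the integration, while the Cohere key is an assumption needed for the `command-nightly` call, and all values are placeholders.

```python
import os

# Placeholders only - set these before running the test above.
os.environ["LLMONITOR_APP_ID"] = "<your-llmonitor-app-id>"
# Optional: points at a self-hosted instance; defaults to https://app.llmonitor.com
os.environ["LLMONITOR_API_URL"] = "https://app.llmonitor.com"
# Assumed: the command-nightly call needs a Cohere credential.
os.environ["COHERE_API_KEY"] = "<your-cohere-api-key>"
```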


@@ -1,20 +1,7 @@
import aiohttp
import subprocess
import importlib
from typing import List, Dict, Union, Optional
from .exceptions import (
    AuthenticationError,
    InvalidRequestError,
@@ -22,7 +9,32 @@ from .exceptions import (
    ServiceUnavailableError,
    OpenAIError,
)
from openai.openai_object import OpenAIObject
from openai.error import OpenAIError as OriginalError
from .integrations.llmonitor import LLMonitorLogger
from .integrations.litedebugger import LiteDebugger
from .integrations.supabase import Supabase
from .integrations.berrispend import BerriSpendLogger
from .integrations.aispend import AISpendLogger
from .integrations.helicone import HeliconeLogger
import pkg_resources
import sys
import dotenv
import json
import traceback
import threading
import subprocess
import os
import litellm
import openai
import random
import uuid
import requests
import datetime
import time
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

####### ENVIRONMENT VARIABLES ###################
dotenv.load_dotenv()  # Loading env variables using dotenv
@@ -37,6 +49,7 @@ aispendLogger = None
berrispendLogger = None
supabaseClient = None
liteDebuggerClient = None
llmonitorLogger = None
callback_list: Optional[List[str]] = []
user_logger_fn = None
additional_details: Optional[Dict[str, str]] = {}
@@ -63,6 +76,7 @@ local_cache: Optional[Dict[str, str]] = {}
class Message(OpenAIObject):

    def __init__(self, content="default", role="assistant", **params):
        super(Message, self).__init__(**params)
        self.content = content
@@ -70,7 +84,12 @@ class Message(OpenAIObject):
class Choices(OpenAIObject):

    def __init__(self,
                 finish_reason="stop",
                 index=0,
                 message=Message(),
                 **params):
        super(Choices, self).__init__(**params)
        self.finish_reason = finish_reason
        self.index = index
@@ -78,20 +97,22 @@ class Choices(OpenAIObject):
class ModelResponse(OpenAIObject):

    def __init__(self,
                 choices=None,
                 created=None,
                 model=None,
                 usage=None,
                 **params):
        super(ModelResponse, self).__init__(**params)
        self.choices = choices if choices else [Choices()]
        self.created = created
        self.model = model
        self.usage = (usage if usage else {
            "prompt_tokens": None,
            "completion_tokens": None,
            "total_tokens": None,
        })

    def to_dict_recursive(self):
        d = super().to_dict_recursive()
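As a quick illustration of the response shape these classes produce — a sketch, assuming `Choices` also stores its `message` argument (the hunk above cuts off before showing that assignment):

```python
# Build an empty response and read the field the README points callers at:
# response['choices'][0]['message']['content']
resp = ModelResponse(model="gpt-3.5-turbo")
first_choice = resp.choices[0]       # defaults to a single Choices() entry
text = first_choice.message.content  # "default" until a real completion fills it in
print(text)
```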
@@ -108,8 +129,6 @@ def print_verbose(print_statement):
####### Package Import Handler ###################
def install_and_import(package: str):
@@ -139,6 +158,7 @@ def install_and_import(package: str):
# Logging function -> log the exact model details + what's being sent | Non-Blocking
class Logging:
    global supabaseClient, liteDebuggerClient

    def __init__(self, model, messages, optional_params, litellm_params):
        self.model = model
        self.messages = messages
@@ -146,20 +166,20 @@
        self.litellm_params = litellm_params
        self.logger_fn = litellm_params["logger_fn"]
        self.model_call_details = {
            "model": model,
            "messages": messages,
            "optional_params": self.optional_params,
            "litellm_params": self.litellm_params,
        }

    def pre_call(self, input, api_key, additional_args={}):
        try:
            print(f"logging pre call for model: {self.model}")
            self.model_call_details["input"] = input
            self.model_call_details["api_key"] = api_key
            self.model_call_details["additional_args"] = additional_args

            # User Logging -> if you pass in a custom logging function
            print_verbose(
                f"Logging Details: logger_fn - {self.logger_fn} | callable(logger_fn) - {callable(self.logger_fn)}"
            )
@@ -173,7 +193,7 @@ class Logging:
                    f"LiteLLM.LoggingError: [Non-Blocking] Exception occurred while logging {traceback.format_exc()}"
                )

            # Input Integration Logging -> If you want to log the fact that an attempt to call the model was made
            for callback in litellm.input_callback:
                try:
                    if callback == "supabase":
@@ -185,7 +205,21 @@ class Logging:
                            model=model,
                            messages=messages,
                            end_user=litellm._thread_context.user,
                            litellm_call_id=self.
                            litellm_params["litellm_call_id"],
                            print_verbose=print_verbose,
                        )
                    elif callback == "llmonitor":
                        print_verbose("reaches llmonitor for logging!")
                        model = self.model
                        messages = self.messages
                        print(f"liteDebuggerClient: {liteDebuggerClient}")
                        llmonitorLogger.log_event(
                            type="start",
                            model=model,
                            messages=messages,
                            user_id=litellm._thread_context.user,
                            run_id=self.litellm_params["litellm_call_id"],
                            print_verbose=print_verbose,
                        )
                    elif callback == "lite_debugger":
@@ -197,15 +231,18 @@ class Logging:
                            model=model,
                            messages=messages,
                            end_user=litellm._thread_context.user,
                            litellm_call_id=self.
                            litellm_params["litellm_call_id"],
                            print_verbose=print_verbose,
                        )
                except Exception as e:
                    print_verbose(
                        f"LiteLLM.LoggingError: [Non-Blocking] Exception occurred while input logging with integrations {traceback.format_exc()}"
                    )
                    print_verbose(
                        f"LiteLLM.Logging: is sentry capture exception initialized {capture_exception}"
                    )
                    if capture_exception:  # log this error to sentry for debugging
                        capture_exception(e)
        except:
            print_verbose(
@@ -214,9 +251,9 @@ class Logging:
            print_verbose(
                f"LiteLLM.Logging: is sentry capture exception initialized {capture_exception}"
            )
            if capture_exception:  # log this error to sentry for debugging
                capture_exception(e)

    def post_call(self, input, api_key, original_response, additional_args={}):
        # Do something here
        try:
@@ -224,8 +261,8 @@ class Logging:
            self.model_call_details["api_key"] = api_key
            self.model_call_details["original_response"] = original_response
            self.model_call_details["additional_args"] = additional_args

            # User Logging -> if you pass in a custom logging function
            print_verbose(
                f"Logging Details: logger_fn - {self.logger_fn} | callable(logger_fn) - {callable(self.logger_fn)}"
            )
@@ -243,9 +280,9 @@ class Logging:
                f"LiteLLM.LoggingError: [Non-Blocking] Exception occurred while logging {traceback.format_exc()}"
            )
            pass

    # Add more methods as needed


def exception_logging(
    additional_args={},
@@ -257,7 +294,7 @@ def exception_logging(
        if exception:
            model_call_details["exception"] = exception
        model_call_details["additional_args"] = additional_args

        # User Logging -> if you pass in a custom logging function or want to use sentry breadcrumbs
        print_verbose(
            f"Logging Details: logger_fn - {logger_fn} | callable(logger_fn) - {callable(logger_fn)}"
        )
@@ -280,20 +317,20 @@ def exception_logging(
####### CLIENT ###################
# make it easy to log if completion/embedding runs succeeded or failed + see what happened | Non-Blocking
def client(original_function):

    def function_setup(
        *args, **kwargs
    ):  # just run once to check if user wants to send their data anywhere - PostHog/Sentry/Slack/etc.
        try:
            global callback_list, add_breadcrumb, user_logger_fn
            if (len(litellm.input_callback) > 0
                    or len(litellm.success_callback) > 0
                    or len(litellm.failure_callback)
                    > 0) and len(callback_list) == 0:
                callback_list = list(
                    set(litellm.input_callback + litellm.success_callback +
                        litellm.failure_callback))
                set_callbacks(callback_list=callback_list, )
            if add_breadcrumb:
                add_breadcrumb(
                    category="litellm.llm_call",
@@ -310,12 +347,11 @@ def client(original_function):
        if litellm.telemetry:
            try:
                model = args[0] if len(args) > 0 else kwargs["model"]
                exception = kwargs[
                    "exception"] if "exception" in kwargs else None
                custom_llm_provider = (kwargs["custom_llm_provider"]
                                       if "custom_llm_provider" in kwargs else
                                       None)
                safe_crash_reporting(
                    model=model,
                    exception=exception,
@@ -340,15 +376,12 @@ def client(original_function):
    def check_cache(*args, **kwargs):
        try:  # never block execution
            prompt = get_prompt(*args, **kwargs)
            if (prompt != None and prompt
                    in local_cache):  # check if messages / prompt exists
                if litellm.caching_with_models:
                    # if caching with model names is enabled, key is prompt + model name
                    if ("model" in kwargs and kwargs["model"]
                            in local_cache[prompt]["models"]):
                        cache_key = prompt + kwargs["model"]
                        return local_cache[cache_key]
                else:  # caching only with prompts
@@ -363,10 +396,8 @@ def client(original_function):
        try:  # never block execution
            prompt = get_prompt(*args, **kwargs)
            if litellm.caching_with_models:  # caching with model + prompt
                if ("model" in kwargs
                        and kwargs["model"] in local_cache[prompt]["models"]):
                    cache_key = prompt + kwargs["model"]
                    local_cache[cache_key] = result
            else:  # caching based only on prompts
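A sketch of how the cache switches above are meant to be used from calling code (assumed usage; the keying follows `check_cache`/`add_cache`):

```python
import litellm
from litellm import completion

litellm.caching = True                # key the local cache on the prompt alone
# litellm.caching_with_models = True  # or key it on prompt + model name

messages = [{"role": "user", "content": "Hello, whats the weather in San Francisco??"}]
first = completion(model="gpt-3.5-turbo", messages=messages)   # hits the provider
second = completion(model="gpt-3.5-turbo", messages=messages)  # served from local_cache
```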
@@ -381,24 +412,24 @@ def client(original_function):
            function_setup(*args, **kwargs)
            litellm_call_id = str(uuid.uuid4())
            kwargs["litellm_call_id"] = litellm_call_id

            # [OPTIONAL] CHECK CACHE
            start_time = datetime.datetime.now()
            if (litellm.caching or litellm.caching_with_models) and (
                    cached_result := check_cache(*args, **kwargs)) is not None:
                result = cached_result
            else:
                # MODEL CALL
                result = original_function(*args, **kwargs)
            end_time = datetime.datetime.now()

            # Add response to CACHE
            if litellm.caching:
                add_cache(result, *args, **kwargs)

            # LOG SUCCESS
            crash_reporting(*args, **kwargs)
            my_thread = threading.Thread(
                target=handle_success,
                args=(args, kwargs, result, start_time,
                      end_time))  # don't interrupt execution of main thread
            my_thread.start()
            return result
        except Exception as e:
@@ -407,7 +438,8 @@ def client(original_function):
            end_time = datetime.datetime.now()
            my_thread = threading.Thread(
                target=handle_failure,
                args=(e, traceback_exception, start_time, end_time, args,
                      kwargs),
            )  # don't interrupt execution of main thread
            my_thread.start()
            raise e
@@ -432,18 +464,18 @@ def token_counter(model, text):
    return num_tokens


def cost_per_token(model="gpt-3.5-turbo",
                   prompt_tokens=0,
                   completion_tokens=0):
    # given
    prompt_tokens_cost_usd_dollar = 0
    completion_tokens_cost_usd_dollar = 0
    model_cost_ref = litellm.model_cost
    if model in model_cost_ref:
        prompt_tokens_cost_usd_dollar = (
            model_cost_ref[model]["input_cost_per_token"] * prompt_tokens)
        completion_tokens_cost_usd_dollar = (
            model_cost_ref[model]["output_cost_per_token"] * completion_tokens)
        return prompt_tokens_cost_usd_dollar, completion_tokens_cost_usd_dollar
    else:
        # calculate average input cost
@@ -464,8 +496,9 @@ def completion_cost(model="gpt-3.5-turbo", prompt="", completion=""):
    prompt_tokens = token_counter(model=model, text=prompt)
    completion_tokens = token_counter(model=model, text=completion)
    prompt_tokens_cost_usd_dollar, completion_tokens_cost_usd_dollar = cost_per_token(
        model=model,
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens)
    return prompt_tokens_cost_usd_dollar + completion_tokens_cost_usd_dollar
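To make the arithmetic concrete, here is a worked sketch of what `completion_cost` computes; the per-token prices are made-up placeholders, not values from `litellm.model_cost`.

```python
# Assumed example prices (USD per token) - placeholders, not litellm's real cost table.
input_cost_per_token = 0.0000015
output_cost_per_token = 0.000002

prompt_tokens = 16
completion_tokens = 41

prompt_cost = prompt_tokens * input_cost_per_token               # 0.000024
completion_cost_usd = completion_tokens * output_cost_per_token  # 0.000082
total_cost = prompt_cost + completion_cost_usd                   # 0.000106
print(f"${total_cost:.6f}")
```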
@@ -557,9 +590,8 @@ def get_optional_params(
        optional_params["max_tokens"] = max_tokens
        if frequency_penalty != 0:
            optional_params["frequency_penalty"] = frequency_penalty
    elif (model == "chat-bison"
          ):  # chat-bison has diff args from chat-bison@001 ty Google
        if temperature != 1:
            optional_params["temperature"] = temperature
        if top_p != 1:
@@ -619,7 +651,10 @@ def load_test_model(
    test_prompt = prompt
    if num_calls:
        test_calls = num_calls
    messages = [[{
        "role": "user",
        "content": test_prompt
    }] for _ in range(test_calls)]
    start_time = time.time()
    try:
        litellm.batch_completion(
@@ -649,7 +684,7 @@ def load_test_model(
def set_callbacks(callback_list):
    global sentry_sdk_instance, capture_exception, add_breadcrumb, posthog, slack_app, alerts_channel, heliconeLogger, aispendLogger, berrispendLogger, supabaseClient, liteDebuggerClient, llmonitorLogger
    try:
        for callback in callback_list:
            print(f"callback: {callback}")
@@ -657,17 +692,15 @@ def set_callbacks(callback_list):
                try:
                    import sentry_sdk
                except ImportError:
                    print_verbose(
                        "Package 'sentry_sdk' is missing. Installing it...")
                    subprocess.check_call(
                        [sys.executable, "-m", "pip", "install", "sentry_sdk"])
                    import sentry_sdk
                sentry_sdk_instance = sentry_sdk
                sentry_trace_rate = (os.environ.get("SENTRY_API_TRACE_RATE")
                                     if "SENTRY_API_TRACE_RATE" in os.environ
                                     else "1.0")
                sentry_sdk_instance.init(
                    dsn=os.environ.get("SENTRY_API_URL"),
                    traces_sample_rate=float(sentry_trace_rate),
@@ -678,10 +711,10 @@ def set_callbacks(callback_list):
                try:
                    from posthog import Posthog
                except ImportError:
                    print_verbose(
                        "Package 'posthog' is missing. Installing it...")
                    subprocess.check_call(
                        [sys.executable, "-m", "pip", "install", "posthog"])
                    from posthog import Posthog
                posthog = Posthog(
                    project_api_key=os.environ.get("POSTHOG_API_KEY"),
@@ -691,10 +724,10 @@ def set_callbacks(callback_list):
                try:
                    from slack_bolt import App
                except ImportError:
                    print_verbose(
                        "Package 'slack_bolt' is missing. Installing it...")
                    subprocess.check_call(
                        [sys.executable, "-m", "pip", "install", "slack_bolt"])
                    from slack_bolt import App
                slack_app = App(
                    token=os.environ.get("SLACK_API_TOKEN"),
@@ -704,6 +737,8 @@ def set_callbacks(callback_list):
                print_verbose(f"Initialized Slack App: {slack_app}")
            elif callback == "helicone":
                heliconeLogger = HeliconeLogger()
            elif callback == "llmonitor":
                llmonitorLogger = LLMonitorLogger()
            elif callback == "aispend":
                aispendLogger = AISpendLogger()
            elif callback == "berrispend":
@@ -718,7 +753,8 @@ def set_callbacks(callback_list):
        raise e
def handle_failure(exception, traceback_exception, start_time, end_time, args,
                   kwargs):
    global sentry_sdk_instance, capture_exception, add_breadcrumb, posthog, slack_app, alerts_channel, aispendLogger, berrispendLogger, supabaseClient, liteDebuggerClient
    try:
        # print_verbose(f"handle_failure args: {args}")
@@ -728,8 +764,7 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
        failure_handler = additional_details.pop("failure_handler", None)

        additional_details["Event_Name"] = additional_details.pop(
            "failed_event_name", "litellm.failed_query")
        print_verbose(f"self.failure_callback: {litellm.failure_callback}")

        # print_verbose(f"additional_details: {additional_details}")
@@ -746,9 +781,8 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
                for detail in additional_details:
                    slack_msg += f"{detail}: {additional_details[detail]}\n"
                slack_msg += f"Traceback: {traceback_exception}"
                slack_app.client.chat_postMessage(channel=alerts_channel,
                                                  text=slack_msg)
            elif callback == "sentry":
                capture_exception(exception)
            elif callback == "posthog":
@@ -767,9 +801,8 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
                print_verbose(f"ph_obj: {ph_obj}")
                print_verbose(f"PostHog Event Name: {event_name}")
                if "user_id" in additional_details:
                    posthog.capture(additional_details["user_id"],
                                    event_name, ph_obj)
                else:  # PostHog calls require a unique id to identify a user - https://posthog.com/docs/libraries/python
                    unique_id = str(uuid.uuid4())
                    posthog.capture(unique_id, event_name)
@@ -783,10 +816,10 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
                    "created": time.time(),
                    "error": traceback_exception,
                    "usage": {
                        "prompt_tokens":
                        prompt_token_calculator(model, messages=messages),
                        "completion_tokens":
                        0,
                    },
                }
                berrispendLogger.log_event(
@@ -805,10 +838,10 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
                    "model": model,
                    "created": time.time(),
                    "usage": {
                        "prompt_tokens":
                        prompt_token_calculator(model, messages=messages),
                        "completion_tokens":
                        0,
                    },
                }
                aispendLogger.log_event(
@@ -818,6 +851,27 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
                    end_time=end_time,
                    print_verbose=print_verbose,
                )
            elif callback == "llmonitor":
                print_verbose("reaches llmonitor for logging!")
                model = args[0] if len(args) > 0 else kwargs["model"]
                messages = args[1] if len(args) > 1 else kwargs["messages"]
                usage = {
                    "prompt_tokens":
                    prompt_token_calculator(model, messages=messages),
                    "completion_tokens":
                    0,
                }
                llmonitorLogger.log_event(
                    type="error",
                    user_id=litellm._thread_context.user,
                    model=model,
                    error=traceback_exception,
                    response_obj=result,
                    run_id=kwargs["litellm_call_id"],
                    timestamp=end_time,
                    usage=usage,
                    print_verbose=print_verbose,
                )
            elif callback == "supabase":
                print_verbose("reaches supabase for logging!")
                print_verbose(f"supabaseClient: {supabaseClient}")
@@ -828,10 +882,10 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
                    "created": time.time(),
                    "error": traceback_exception,
                    "usage": {
                        "prompt_tokens":
                        prompt_token_calculator(model, messages=messages),
                        "completion_tokens":
                        0,
                    },
                }
                supabaseClient.log_event(
@@ -854,10 +908,10 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
                    "created": time.time(),
                    "error": traceback_exception,
                    "usage": {
                        "prompt_tokens":
                        prompt_token_calculator(model, messages=messages),
                        "completion_tokens":
                        0,
                    },
                }
                liteDebuggerClient.log_event(
@@ -884,19 +938,18 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
            failure_handler(call_details)
            pass
    except Exception as e:
        # LOGGING
        exception_logging(logger_fn=user_logger_fn, exception=e)
        pass
def handle_success(args, kwargs, result, start_time, end_time):
    global heliconeLogger, aispendLogger, supabaseClient, liteDebuggerClient, llmonitorLogger
    try:
        success_handler = additional_details.pop("success_handler", None)
        failure_handler = additional_details.pop("failure_handler", None)
        additional_details["Event_Name"] = additional_details.pop(
            "successful_event_name", "litellm.succes_query")
        for callback in litellm.success_callback:
            try:
                if callback == "posthog":
@@ -905,9 +958,8 @@ def handle_success(args, kwargs, result, start_time, end_time):
                        ph_obj[detail] = additional_details[detail]
                    event_name = additional_details["Event_Name"]
                    if "user_id" in additional_details:
                        posthog.capture(additional_details["user_id"],
                                        event_name, ph_obj)
                    else:  # PostHog calls require a unique id to identify a user - https://posthog.com/docs/libraries/python
                        unique_id = str(uuid.uuid4())
                        posthog.capture(unique_id, event_name, ph_obj)
@@ -916,9 +968,8 @@ def handle_success(args, kwargs, result, start_time, end_time):
                    slack_msg = ""
                    for detail in additional_details:
                        slack_msg += f"{detail}: {additional_details[detail]}\n"
                    slack_app.client.chat_postMessage(channel=alerts_channel,
                                                      text=slack_msg)
                elif callback == "helicone":
                    print_verbose("reaches helicone for logging!")
                    model = args[0] if len(args) > 0 else kwargs["model"]
@@ -931,6 +982,22 @@ def handle_success(args, kwargs, result, start_time, end_time):
                        end_time=end_time,
                        print_verbose=print_verbose,
                    )
                elif callback == "llmonitor":
                    print_verbose("reaches llmonitor for logging!")
                    model = args[0] if len(args) > 0 else kwargs["model"]
                    messages = args[1] if len(args) > 1 else kwargs["messages"]
                    usage = kwargs["usage"]
                    llmonitorLogger.log_event(
                        type="end",
                        model=model,
                        messages=messages,
                        user_id=litellm._thread_context.user,
                        response_obj=result,
                        time=end_time,
                        usage=usage,
                        run_id=kwargs["litellm_call_id"],
                        print_verbose=print_verbose,
                    )
                elif callback == "aispend":
                    print_verbose("reaches aispend for logging!")
                    model = args[0] if len(args) > 0 else kwargs["model"]
@@ -984,7 +1051,7 @@ def handle_success(args, kwargs, result, start_time, end_time):
                    print_verbose=print_verbose,
                )
            except Exception as e:
                # LOGGING
                exception_logging(logger_fn=user_logger_fn, exception=e)
                print_verbose(
                    f"[Non-Blocking] Success Callback Error - {traceback.format_exc()}"
@@ -995,7 +1062,7 @@ def handle_success(args, kwargs, result, start_time, end_time):
            success_handler(args, kwargs)
            pass
    except Exception as e:
        # LOGGING
        exception_logging(logger_fn=user_logger_fn, exception=e)
        print_verbose(
            f"[Non-Blocking] Success Callback Error - {traceback.format_exc()}"
@@ -1046,33 +1113,36 @@ def exception_type(model, original_exception, custom_llm_provider):
        exception_type = ""
        if "claude" in model:  # one of the anthropics
            if hasattr(original_exception, "status_code"):
                print_verbose(
                    f"status_code: {original_exception.status_code}")
                if original_exception.status_code == 401:
                    exception_mapping_worked = True
                    raise AuthenticationError(
                        message=
                        f"AnthropicException - {original_exception.message}",
                        llm_provider="anthropic",
                    )
                elif original_exception.status_code == 400:
                    exception_mapping_worked = True
                    raise InvalidRequestError(
                        message=
                        f"AnthropicException - {original_exception.message}",
                        model=model,
                        llm_provider="anthropic",
                    )
                elif original_exception.status_code == 429:
                    exception_mapping_worked = True
                    raise RateLimitError(
                        message=
                        f"AnthropicException - {original_exception.message}",
                        llm_provider="anthropic",
                    )
            elif ("Could not resolve authentication method. Expected either api_key or auth_token to be set."
                  in error_str):
                exception_mapping_worked = True
                raise AuthenticationError(
                    message=
                    f"AnthropicException - {original_exception.message}",
                    llm_provider="anthropic",
                )
        elif "replicate" in model:
@@ -1096,35 +1166,36 @@ def exception_type(model, original_exception, custom_llm_provider):
                    llm_provider="replicate",
                )
            elif (
                    exception_type == "ReplicateError"
            ):  # ReplicateError implies an error on Replicate server side, not user side
                raise ServiceUnavailableError(
                    message=f"ReplicateException - {error_str}",
                    llm_provider="replicate",
                )
        elif model == "command-nightly":  # Cohere
            if ("invalid api token" in error_str
                    or "No API key provided." in error_str):
                exception_mapping_worked = True
                raise AuthenticationError(
                    message=
                    f"CohereException - {original_exception.message}",
                    llm_provider="cohere",
                )
            elif "too many tokens" in error_str:
                exception_mapping_worked = True
                raise InvalidRequestError(
                    message=
                    f"CohereException - {original_exception.message}",
                    model=model,
                    llm_provider="cohere",
                )
            elif (
                    "CohereConnectionError" in exception_type
            ):  # cohere seems to fire these errors when we load test it (1k+ messages / min)
                exception_mapping_worked = True
                raise RateLimitError(
                    message=
                    f"CohereException - {original_exception.message}",
                    llm_provider="cohere",
                )
        elif custom_llm_provider == "huggingface":
@@ -1132,27 +1203,30 @@ def exception_type(model, original_exception, custom_llm_provider):
            if original_exception.status_code == 401:
                exception_mapping_worked = True
                raise AuthenticationError(
                    message=
                    f"HuggingfaceException - {original_exception.message}",
                    llm_provider="huggingface",
                )
            elif original_exception.status_code == 400:
                exception_mapping_worked = True
                raise InvalidRequestError(
                    message=
                    f"HuggingfaceException - {original_exception.message}",
                    model=model,
                    llm_provider="huggingface",
                )
            elif original_exception.status_code == 429:
                exception_mapping_worked = True
                raise RateLimitError(
                    message=
                    f"HuggingfaceException - {original_exception.message}",
                    llm_provider="huggingface",
                )
            raise original_exception  # base case - return the original exception
        else:
            raise original_exception
    except Exception as e:
        # LOGGING
        exception_logging(
            logger_fn=user_logger_fn,
            additional_args={
@@ -1173,7 +1247,7 @@ def safe_crash_reporting(model=None, exception=None, custom_llm_provider=None):
        "exception": str(exception),
        "custom_llm_provider": custom_llm_provider,
    }
    threading.Thread(target=litellm_telemetry, args=(data, )).start()
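Since `exception_type` re-raises provider errors as litellm's own exception classes, a caller-side sketch of handling them looks like this (assumed usage; the exception names come from the `.exceptions` import at the top of the file):

```python
from litellm import completion
from litellm.exceptions import AuthenticationError, InvalidRequestError, RateLimitError

try:
    response = completion(model="claude-2",
                          messages=[{"role": "user", "content": "Hi 👋"}])
except AuthenticationError:
    print("check the Anthropic API key")        # mapped from a 401
except RateLimitError:
    print("rate limited - back off and retry")  # mapped from a 429
except InvalidRequestError as e:
    print(f"bad request: {e}")                  # mapped from a 400
```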
def litellm_telemetry(data): def litellm_telemetry(data):
@ -1223,11 +1297,13 @@ def get_secret(secret_name):
if litellm.secret_manager_client != None: if litellm.secret_manager_client != None:
# TODO: check which secret manager is being used # TODO: check which secret manager is being used
# currently only supports Infisical # currently only supports Infisical
secret = litellm.secret_manager_client.get_secret(secret_name).secret_value secret = litellm.secret_manager_client.get_secret(
secret_name).secret_value
if secret != None: if secret != None:
return secret # if secret found in secret manager return it return secret # if secret found in secret manager return it
else: else:
raise ValueError(f"Secret '{secret_name}' not found in secret manager") raise ValueError(
f"Secret '{secret_name}' not found in secret manager")
elif litellm.api_key != None: # if users use litellm default key elif litellm.api_key != None: # if users use litellm default key
return litellm.api_key return litellm.api_key
else: else:
@ -1238,6 +1314,7 @@ def get_secret(secret_name):
# wraps the completion stream to return the correct format for the model # wraps the completion stream to return the correct format for the model
# replicate/anthropic/cohere # replicate/anthropic/cohere
class CustomStreamWrapper: class CustomStreamWrapper:
def __init__(self, completion_stream, model, custom_llm_provider=None): def __init__(self, completion_stream, model, custom_llm_provider=None):
self.model = model self.model = model
self.custom_llm_provider = custom_llm_provider self.custom_llm_provider = custom_llm_provider
@@ -1288,7 +1365,8 @@ class CustomStreamWrapper:
        elif self.model == "replicate":
            chunk = next(self.completion_stream)
            completion_obj["content"] = chunk
        elif (self.model == "together_ai") or ("togethercomputer"
                                               in self.model):
            chunk = next(self.completion_stream)
            text_data = self.handle_together_ai_chunk(chunk)
            if text_data == "":
@@ -1321,12 +1399,11 @@ def read_config_args(config_path):

########## ollama implementation ############################
import aiohttp


async def get_ollama_response_stream(api_base="http://localhost:11434",
                                     model="llama2",
                                     prompt="Why is the sky blue?"):
    session = aiohttp.ClientSession()
    url = f"{api_base}/api/generate"
    data = {
@@ -1349,7 +1426,11 @@ async def get_ollama_response_stream(
                    "content": "",
                }
                completion_obj["content"] = j["response"]
                yield {
                    "choices": [{
                        "delta": completion_obj
                    }]
                }
                # self.responses.append(j["response"])
                # yield "blank"
        except Exception as e:
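For orientation, here is a minimal sketch of how this async generator might be consumed. The driver below is an illustration, not part of the diff: the import path is an assumption, and it presumes an Ollama server is reachable at the default `api_base` with the `llama2` model pulled.

```python
import asyncio

# Hypothetical import path for illustration; adjust to wherever
# get_ollama_response_stream actually lives in your checkout.
from litellm.utils import get_ollama_response_stream


async def main():
    # Each yielded chunk mimics the OpenAI streaming shape:
    # {"choices": [{"delta": {"role": "assistant", "content": "..."}}]}
    async for chunk in get_ollama_response_stream(
            model="llama2", prompt="Why is the sky blue?"):
        print(chunk["choices"][0]["delta"]["content"], end="", flush=True)


if __name__ == "__main__":
    asyncio.run(main())
```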

View file

@@ -1,6 +1,7 @@
# liteLLM Proxy Server: 50+ LLM Models, Error Handling, Caching
### Azure, Llama2, OpenAI, Claude, Hugging Face, Replicate Models
[![PyPI Version](https://img.shields.io/pypi/v/litellm.svg)](https://pypi.org/project/litellm/)
[![PyPI Version](https://img.shields.io/badge/stable%20version-v0.1.345-blue?color=green&link=https://pypi.org/project/litellm/0.1.1/)](https://pypi.org/project/litellm/0.1.1/)
![Downloads](https://img.shields.io/pypi/dm/litellm)
@@ -11,34 +12,36 @@
![4BC6491E-86D0-4833-B061-9F54524B2579](https://github.com/BerriAI/litellm/assets/17561003/f5dd237b-db5e-42e1-b1ac-f05683b1d724)

## What does liteLLM proxy do
- Make `/chat/completions` requests for 50+ LLM models **Azure, OpenAI, Replicate, Anthropic, Hugging Face**

  Example: for `model` use `claude-2`, `gpt-3.5`, `gpt-4`, `command-nightly`, `stabilityai/stablecode-completion-alpha-3b-4k`

```json
{
  "model": "replicate/llama-2-70b-chat:2c1608e18606fad2812020dc541930f2d0495ce32eee50074220b87300bc16e1",
  "messages": [
    {
      "content": "Hello, whats the weather in San Francisco??",
      "role": "user"
    }
  ]
}
```
- **Consistent Input/Output** Format
  - Call all models using the OpenAI format - `completion(model, messages)`
  - Text responses will always be available at `['choices'][0]['message']['content']`
- **Error Handling** Using Model Fallbacks (if `GPT-4` fails, try `llama2`)
- **Logging** - Log Requests, Responses and Errors to `Supabase`, `Posthog`, `Mixpanel`, `Sentry`, `Helicone`, `LLMonitor` (any of the supported providers here: https://litellm.readthedocs.io/en/latest/advanced/)

**Example: Logs sent to Supabase**
<img width="1015" alt="Screenshot 2023-08-11 at 4 02 46 PM" src="https://github.com/ishaan-jaff/proxy-server/assets/29436595/237557b8-ba09-4917-982c-8f3e1b2c8d08">
- **Token Usage & Spend** - Track Input + Completion tokens used + Spend/model
- **Caching** - Implementation of Semantic Caching
- **Streaming & Async Support** - Return generators to stream text responses (see the sketch after this list)
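A rough sketch of what streaming a response from the proxy could look like. The exact chunk framing depends on how the server is deployed, so treat this as an illustration rather than a guaranteed wire format:

```python
import requests

# Assumes the proxy is running locally on the default port used in this README.
url = "http://localhost:5000/chat/completions"

payload = {
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Write a haiku about proxies."}],
    "stream": True,  # ask the proxy to stream chunks back instead of one response
}

with requests.post(url, json=payload, stream=True) as response:
    for line in response.iter_lines():
        if line:
            # Each line is a raw chunk; how it is framed (plain JSON, SSE, etc.)
            # depends on the deployment, so inspect it before parsing.
            print(line.decode("utf-8"))
```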
## API Endpoints
### `/chat/completions` (POST)
@@ -46,34 +49,37 @@
This endpoint is used to generate chat completions for 50+ supported LLM API models. Use llama2, GPT-4, Claude2, etc.

#### Input
This API endpoint accepts all inputs in raw JSON and expects the following inputs:
- `model` (string, required): ID of the model to use for chat completions. See all supported models [here](https://litellm.readthedocs.io/en/latest/supported/),
  e.g. `gpt-3.5-turbo`, `gpt-4`, `claude-2`, `command-nightly`, `stabilityai/stablecode-completion-alpha-3b-4k`
- `messages` (array, required): A list of messages representing the conversation context. Each message should have a `role` (system, user, assistant, or function), `content` (message text), and `name` (for the function role).
- Additional optional parameters: `temperature`, `functions`, `function_call`, `top_p`, `n`, `stream`. See the full list of supported inputs here: https://litellm.readthedocs.io/en/latest/input/
#### Example JSON body
For claude-2
```json
{
  "model": "claude-2",
  "messages": [
    {
      "content": "Hello, whats the weather in San Francisco??",
      "role": "user"
    }
  ]
}
```
### Making an API request to the Proxy Server
```python
import requests
import json

# TODO: use your URL
url = "http://localhost:5000/chat/completions"

payload = json.dumps({
@@ -94,34 +100,38 @@ print(response.text)
```

### Output [Response Format]
All responses from the server are returned in the following format (for all LLM models). More info on output here: https://litellm.readthedocs.io/en/latest/output/
```json
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "I'm sorry, but I don't have the capability to provide real-time weather information. However, you can easily check the weather in San Francisco by searching online or using a weather app on your phone.",
        "role": "assistant"
      }
    }
  ],
  "created": 1691790381,
  "id": "chatcmpl-7mUFZlOEgdohHRDx2UpYPRTejirzb",
  "model": "gpt-3.5-turbo-0613",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 41,
    "prompt_tokens": 16,
    "total_tokens": 57
  }
}
```
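Because every model is normalized to this shape, client code can read the reply and token usage from fixed paths. A small sketch, assuming the proxy is running locally as in the request example above:

```python
import requests

url = "http://localhost:5000/chat/completions"  # local proxy from the example above

payload = {
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello, whats the weather in San Francisco??"}],
}

data = requests.post(url, json=payload).json()

# The assistant text is always at the same path, regardless of the underlying model.
print(data["choices"][0]["message"]["content"])

# Token accounting comes back in the `usage` block.
print(data["usage"]["total_tokens"])
```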
## Installation & Usage
### Running Locally
1. Clone liteLLM repository to your local machine:
```
git clone https://github.com/BerriAI/liteLLM-proxy
```
@@ -141,24 +151,24 @@ All responses from the server are returned in the following format (for all LLM
```
python main.py
```
## Deploying
1. Quick Start: Deploy on Railway

   [![Deploy on Railway](https://railway.app/button.svg)](https://railway.app/template/DYqQAW?referralCode=t3ukrU)

2. `GCP`, `AWS`, `Azure`

   This project includes a `Dockerfile`, allowing you to build and deploy a Docker image on the cloud provider of your choice.
# Support / Talk with founders
- [Our calendar 👋](https://calendly.com/d/4mp-gd3-k5k/berriai-1-1-onboarding-litellm-hosted-version)
- [Community Discord 💭](https://discord.gg/wuPM9dRgDw)
- Our numbers 📞 +1 (770) 8783-106 / +1 (412) 618-6238
- Our emails ✉️ ishaan@berri.ai / krrish@berri.ai
## Roadmap
- [ ] Support hosted db (e.g. Supabase)
- [ ] Easily send data to places like Posthog and Sentry
- [ ] Add a hot-cache for project spend logs - enables fast checks for user + project limits