almost working llmonitor

Vince Lwt 2023-08-21 16:26:47 +02:00
parent 22c7e38de5
commit 3675d3e029
5 changed files with 425 additions and 326 deletions

View file

@ -1,6 +1,7 @@
# liteLLM Proxy Server: 50+ LLM Models, Error Handling, Caching
### Azure, Llama2, OpenAI, Claude, Hugging Face, Replicate Models
[![PyPI Version](https://img.shields.io/pypi/v/litellm.svg)](https://pypi.org/project/litellm/)
[![PyPI Version](https://img.shields.io/badge/stable%20version-v0.1.345-blue?color=green&link=https://pypi.org/project/litellm/0.1.1/)](https://pypi.org/project/litellm/0.1.1/)
![Downloads](https://img.shields.io/pypi/dm/litellm)
@ -11,9 +12,11 @@
![4BC6491E-86D0-4833-B061-9F54524B2579](https://github.com/BerriAI/litellm/assets/17561003/f5dd237b-db5e-42e1-b1ac-f05683b1d724)
## What does liteLLM proxy do
- Make `/chat/completions` requests for 50+ LLM models **Azure, OpenAI, Replicate, Anthropic, Hugging Face**
Example: for `model` use `claude-2`, `gpt-3.5`, `gpt-4`, `command-nightly`, `stabilityai/stablecode-completion-alpha-3b-4k`
```json
{
"model": "replicate/llama-2-70b-chat:2c1608e18606fad2812020dc541930f2d0495ce32eee50074220b87300bc16e1",
@ -25,11 +28,12 @@
]
}
```
- **Consistent Input/Output** Format
- Call all models using the OpenAI format - `completion(model, messages)`
- Text responses will always be available at `['choices'][0]['message']['content']`
- **Error Handling** Using Model Fallbacks (if `GPT-4` fails, try `llama2`)
- **Logging** - Log Requests, Responses and Errors to `Supabase`, `Posthog`, `Mixpanel`, `Sentry`, `Helicone` (Any of the supported providers here: https://litellm.readthedocs.io/en/latest/advanced/
- **Logging** - Log Requests, Responses and Errors to `Supabase`, `Posthog`, `Mixpanel`, `Sentry`, `LLMonitor`, `Helicone` (any of the supported providers here: https://litellm.readthedocs.io/en/latest/advanced/); a setup snippet follows this feature list
**Example: Logs sent to Supabase**
<img width="1015" alt="Screenshot 2023-08-11 at 4 02 46 PM" src="https://github.com/ishaan-jaff/proxy-server/assets/29436595/237557b8-ba09-4917-982c-8f3e1b2c8d08">
@ -38,7 +42,6 @@
- **Caching** - Implementation of Semantic Caching
- **Streaming & Async Support** - Return generators to stream text responses
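For example, turning on LLMonitor logging only takes a few lines with the underlying `litellm` client. This is a minimal sketch based on the test added in this commit; the app id value is a placeholder, and `LLMONITOR_APP_ID` / `LLMONITOR_API_URL` are read from the environment by the integration:
```python
import os
import litellm
from litellm import completion

# Placeholder credentials - the integration reads these from the environment.
os.environ["LLMONITOR_APP_ID"] = "<your-llmonitor-app-id>"

# Route request, success and failure logs to LLMonitor.
litellm.input_callback = ["llmonitor"]
litellm.success_callback = ["llmonitor"]
litellm.failure_callback = ["llmonitor"]

response = completion(model="gpt-3.5-turbo",
                      messages=[{"role": "user", "content": "Hello, how are you?"}])
print(response['choices'][0]['message']['content'])
```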
## API Endpoints
### `/chat/completions` (POST)
@ -46,15 +49,18 @@
This endpoint is used to generate chat completions for 50+ supported LLM API models, e.g. llama2, GPT-4, Claude2.
#### Input
This API endpoint accepts all inputs in raw JSON and expects the following inputs
- `model` (string, required): ID of the model to use for chat completions. See all supported models [here](https://litellm.readthedocs.io/en/latest/supported/),
e.g. `gpt-3.5-turbo`, `gpt-4`, `claude-2`, `command-nightly`, `stabilityai/stablecode-completion-alpha-3b-4k`
- `messages` (array, required): A list of messages representing the conversation context. Each message should have a `role` (system, user, assistant, or function), `content` (message text), and `name` (for function role).
- Additional Optional parameters: `temperature`, `functions`, `function_call`, `top_p`, `n`, `stream`. See the full list of supported inputs here: https://litellm.readthedocs.io/en/latest/input/
#### Example JSON body
For claude-2
```json
{
"model": "claude-2",
@ -64,11 +70,11 @@ For claude-2
"role": "user"
}
]
}
```
### Making an API request to the Proxy Server
```python
import requests
import json
@ -94,8 +100,10 @@ print(response.text)
```
### Output [Response Format]
Responses from the server are given in the following format.
All responses from the server are returned in the following format (for all LLM models). More info on output here: https://litellm.readthedocs.io/en/latest/output/
```json
{
"choices": [
@ -121,7 +129,9 @@ All responses from the server are returned in the following format (for all LLM
```
## Installation & Usage
### Running Locally
1. Clone liteLLM repository to your local machine:
```
git clone https://github.com/BerriAI/liteLLM-proxy
@ -141,24 +151,24 @@ All responses from the server are returned in the following format (for all LLM
python main.py
```
## Deploying
1. Quick Start: Deploy on Railway
[![Deploy on Railway](https://railway.app/button.svg)](https://railway.app/template/DYqQAW?referralCode=t3ukrU)
2. `GCP`, `AWS`, `Azure`
This project includes a `Dockerfile` allowing you to build and deploy a Docker Project on your providers
This project includes a `Dockerfile`, allowing you to build and deploy a Docker image on your cloud provider, for example:
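A minimal sketch (the image name and port mapping are assumptions; pass any provider API keys as environment variables):
```
docker build -t litellm-proxy .
docker run -p 8000:8000 litellm-proxy
```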
# Support / Talk with founders
- [Our calendar 👋](https://calendly.com/d/4mp-gd3-k5k/berriai-1-1-onboarding-litellm-hosted-version)
- [Community Discord 💭](https://discord.gg/wuPM9dRgDw)
- Our numbers 📞 +1 (770) 8783-106 / +1 (412) 618-6238
- Our emails ✉️ ishaan@berri.ai / krrish@berri.ai
## Roadmap
- [ ] Support hosted db (e.g. Supabase)
- [ ] Easily send data to places like posthog and sentry.
- [ ] Add a hot-cache for project spend logs - enables fast checks for user + project limits

View file

@ -5,6 +5,7 @@ import traceback
import dotenv
import os
import requests
dotenv.load_dotenv() # Loading env variables using dotenv
@ -14,45 +15,34 @@ class LLMonitorLogger:
# Instance variables
self.api_url = os.getenv(
"LLMONITOR_API_URL") or "https://app.llmonitor.com"
self.account_id = os.getenv("LLMONITOR_APP_ID")
self.app_id = os.getenv("LLMONITOR_APP_ID")
def log_event(self, model, messages, response_obj, start_time, end_time, print_verbose):
# Optional defaults let start/end/error events pass only the fields they have
def log_event(self, type, run_id=None, error=None, usage=None, model=None,
messages=None, response_obj=None, user_id=None, time=None, print_verbose=None):
# Method definition
try:
print_verbose(
f"LLMonitor Logging - Enters logging function for model {model}")
f"LLMonitor Logging - Enters logging function for model {model}"
)
print(model, messages, response_obj, start_time, end_time)
print(type, model, messages, response_obj, time, user_id)
# headers = {
# 'Content-Type': 'application/json'
# }
headers = {'Content-Type': 'application/json'}
# prompt_tokens_cost_usd_dollar, completion_tokens_cost_usd_dollar = self.price_calculator(
# model, response_obj, start_time, end_time)
# total_cost = prompt_tokens_cost_usd_dollar + completion_tokens_cost_usd_dollar
data = {
"type": "llm",
"name": model,
"runId": run_id,
"app": self.app_id,
"error": error,
"event": type,
"timestamp": time.isoformat(),
"userId": user_id,
"input": messages,
"output": response_obj['choices'][0]['message']['content'],
}
# response_time = (end_time-start_time).total_seconds()
# if "response" in response_obj:
# data = [{
# "response_time": response_time,
# "model_id": response_obj["model"],
# "total_cost": total_cost,
# "messages": messages,
# "response": response_obj['choices'][0]['message']['content'],
# "account_id": self.account_id
# }]
# elif "error" in response_obj:
# data = [{
# "response_time": response_time,
# "model_id": response_obj["model"],
# "total_cost": total_cost,
# "messages": messages,
# "error": response_obj['error'],
# "account_id": self.account_id
# }]
# print_verbose(f"BerriSpend Logging - final data object: {data}")
print_verbose(f"LLMonitor Logging - final data object: {data}")
# response = requests.post(url, headers=headers, json=data)
except:
# traceback.print_exc()
pass

View file

@ -1,28 +1,36 @@
#### What this tests ####
# This tests if logging to the helicone integration actually works
from litellm import embedding, completion
import litellm
# This tests if logging to the llmonitor integration actually works
# Adds the parent directory to the system path
import sys
import os
import traceback
import pytest
# Adds the parent directory to the system path
sys.path.insert(0, os.path.abspath('../..'))
from litellm import completion
import litellm
litellm.input_callback = ["llmonitor"]
litellm.success_callback = ["llmonitor"]
litellm.error_callback = ["llmonitor"]
litellm.set_verbose = True
user_message = "Hello, how are you?"
messages = [{"content": user_message, "role": "user"}]
# openai call
response = completion(model="gpt-3.5-turbo",
messages=[{"role": "user", "content": "Hi 👋 - i'm openai"}])
# response = completion(model="gpt-3.5-turbo",
# messages=[{
# "role": "user",
# "content": "Hi 👋 - i'm openai"
# }])
# print(response)
# #bad request call
# response = completion(model="chatgpt-test", messages=[{"role": "user", "content": "Hi 👋 - i'm a bad request"}])
# cohere call
# response = completion(model="command-nightly",
# messages=[{"role": "user", "content": "Hi 👋 - i'm cohere"}])
response = completion(model="command-nightly",
messages=[{
"role": "user",
"content": "Hi 👋 - i'm cohere"
}])
print(response)

View file

@ -1,20 +1,7 @@
import sys
import dotenv, json, traceback, threading
import subprocess, os
import litellm, openai
import random, uuid, requests
import datetime, time
import tiktoken
encoding = tiktoken.get_encoding("cl100k_base")
import pkg_resources
from .integrations.helicone import HeliconeLogger
from .integrations.aispend import AISpendLogger
from .integrations.berrispend import BerriSpendLogger
from .integrations.supabase import Supabase
from .integrations.litedebugger import LiteDebugger
from openai.error import OpenAIError as OriginalError
from openai.openai_object import OpenAIObject
import aiohttp
import subprocess
import importlib
from typing import List, Dict, Union, Optional
from .exceptions import (
AuthenticationError,
InvalidRequestError,
@ -22,7 +9,32 @@ from .exceptions import (
ServiceUnavailableError,
OpenAIError,
)
from typing import List, Dict, Union, Optional
from openai.openai_object import OpenAIObject
from openai.error import OpenAIError as OriginalError
from .integrations.llmonitor import LLMonitorLogger
from .integrations.litedebugger import LiteDebugger
from .integrations.supabase import Supabase
from .integrations.berrispend import BerriSpendLogger
from .integrations.aispend import AISpendLogger
from .integrations.helicone import HeliconeLogger
import pkg_resources
import sys
import dotenv
import json
import traceback
import threading
import subprocess
import os
import litellm
import openai
import random
import uuid
import requests
import datetime
import time
import tiktoken
encoding = tiktoken.get_encoding("cl100k_base")
####### ENVIRONMENT VARIABLES ###################
dotenv.load_dotenv() # Loading env variables using dotenv
@ -37,6 +49,7 @@ aispendLogger = None
berrispendLogger = None
supabaseClient = None
liteDebuggerClient = None
llmonitorLogger = None
callback_list: Optional[List[str]] = []
user_logger_fn = None
additional_details: Optional[Dict[str, str]] = {}
@ -63,6 +76,7 @@ local_cache: Optional[Dict[str, str]] = {}
class Message(OpenAIObject):
def __init__(self, content="default", role="assistant", **params):
super(Message, self).__init__(**params)
self.content = content
@ -70,7 +84,12 @@ class Message(OpenAIObject):
class Choices(OpenAIObject):
def __init__(self, finish_reason="stop", index=0, message=Message(), **params):
def __init__(self,
finish_reason="stop",
index=0,
message=Message(),
**params):
super(Choices, self).__init__(**params)
self.finish_reason = finish_reason
self.index = index
@ -78,20 +97,22 @@ class Choices(OpenAIObject):
class ModelResponse(OpenAIObject):
def __init__(self, choices=None, created=None, model=None, usage=None, **params):
def __init__(self,
choices=None,
created=None,
model=None,
usage=None,
**params):
super(ModelResponse, self).__init__(**params)
self.choices = choices if choices else [Choices()]
self.created = created
self.model = model
self.usage = (
usage
if usage
else {
self.usage = (usage if usage else {
"prompt_tokens": None,
"completion_tokens": None,
"total_tokens": None,
}
)
})
def to_dict_recursive(self):
d = super().to_dict_recursive()
@ -108,8 +129,6 @@ def print_verbose(print_statement):
####### Package Import Handler ###################
import importlib
import subprocess
def install_and_import(package: str):
@ -139,6 +158,7 @@ def install_and_import(package: str):
# Logging function -> log the exact model details + what's being sent | Non-Blocking
class Logging:
global supabaseClient, liteDebuggerClient
def __init__(self, model, messages, optional_params, litellm_params):
self.model = model
self.messages = messages
@ -159,7 +179,7 @@ class Logging:
self.model_call_details["api_key"] = api_key
self.model_call_details["additional_args"] = additional_args
## User Logging -> if you pass in a custom logging function
# User Logging -> if you pass in a custom logging function
print_verbose(
f"Logging Details: logger_fn - {self.logger_fn} | callable(logger_fn) - {callable(self.logger_fn)}"
)
@ -173,7 +193,7 @@ class Logging:
f"LiteLLM.LoggingError: [Non-Blocking] Exception occurred while logging {traceback.format_exc()}"
)
## Input Integration Logging -> If you want to log the fact that an attempt to call the model was made
# Input Integration Logging -> If you want to log the fact that an attempt to call the model was made
for callback in litellm.input_callback:
try:
if callback == "supabase":
@ -185,7 +205,21 @@ class Logging:
model=model,
messages=messages,
end_user=litellm._thread_context.user,
litellm_call_id=self.litellm_params["litellm_call_id"],
litellm_call_id=self.
litellm_params["litellm_call_id"],
print_verbose=print_verbose,
)
elif callback == "llmonitor":
print_verbose("reaches llmonitor for logging!")
model = self.model
messages = self.messages
print(f"liteDebuggerClient: {liteDebuggerClient}")
llmonitorLogger.log_event(
type="start",
model=model,
messages=messages,
user_id=litellm._thread_context.user,
run_id=self.litellm_params["litellm_call_id"],
print_verbose=print_verbose,
)
elif callback == "lite_debugger":
@ -197,11 +231,14 @@ class Logging:
model=model,
messages=messages,
end_user=litellm._thread_context.user,
litellm_call_id=self.litellm_params["litellm_call_id"],
litellm_call_id=self.
litellm_params["litellm_call_id"],
print_verbose=print_verbose,
)
except Exception as e:
print_verbose(f"LiteLLM.LoggingError: [Non-Blocking] Exception occurred while input logging with integrations {traceback.format_exc()}")
print_verbose(
f"LiteLLM.LoggingError: [Non-Blocking] Exception occurred while input logging with integrations {traceback.format_exc()}"
)
print_verbose(
f"LiteLLM.Logging: is sentry capture exception initialized {capture_exception}"
)
@ -225,7 +262,7 @@ class Logging:
self.model_call_details["original_response"] = original_response
self.model_call_details["additional_args"] = additional_args
## User Logging -> if you pass in a custom logging function
# User Logging -> if you pass in a custom logging function
print_verbose(
f"Logging Details: logger_fn - {self.logger_fn} | callable(logger_fn) - {callable(self.logger_fn)}"
)
@ -257,7 +294,7 @@ def exception_logging(
if exception:
model_call_details["exception"] = exception
model_call_details["additional_args"] = additional_args
## User Logging -> if you pass in a custom logging function or want to use sentry breadcrumbs
# User Logging -> if you pass in a custom logging function or want to use sentry breadcrumbs
print_verbose(
f"Logging Details: logger_fn - {logger_fn} | callable(logger_fn) - {callable(logger_fn)}"
)
@ -280,20 +317,20 @@ def exception_logging(
####### CLIENT ###################
# make it easy to log if completion/embedding runs succeeded or failed + see what happened | Non-Blocking
def client(original_function):
def function_setup(
*args, **kwargs
): # just run once to check if user wants to send their data anywhere - PostHog/Sentry/Slack/etc.
try:
global callback_list, add_breadcrumb, user_logger_fn
if (
len(litellm.input_callback) > 0 or len(litellm.success_callback) > 0 or len(litellm.failure_callback) > 0
) and len(callback_list) == 0:
if (len(litellm.input_callback) > 0
or len(litellm.success_callback) > 0
or len(litellm.failure_callback)
> 0) and len(callback_list) == 0:
callback_list = list(
set(litellm.input_callback + litellm.success_callback + litellm.failure_callback)
)
set_callbacks(
callback_list=callback_list,
)
set(litellm.input_callback + litellm.success_callback +
litellm.failure_callback))
set_callbacks(callback_list=callback_list, )
if add_breadcrumb:
add_breadcrumb(
category="litellm.llm_call",
@ -310,12 +347,11 @@ def client(original_function):
if litellm.telemetry:
try:
model = args[0] if len(args) > 0 else kwargs["model"]
exception = kwargs["exception"] if "exception" in kwargs else None
custom_llm_provider = (
kwargs["custom_llm_provider"]
if "custom_llm_provider" in kwargs
else None
)
exception = kwargs[
"exception"] if "exception" in kwargs else None
custom_llm_provider = (kwargs["custom_llm_provider"]
if "custom_llm_provider" in kwargs else
None)
safe_crash_reporting(
model=model,
exception=exception,
@ -340,15 +376,12 @@ def client(original_function):
def check_cache(*args, **kwargs):
try: # never block execution
prompt = get_prompt(*args, **kwargs)
if (
prompt != None and prompt in local_cache
): # check if messages / prompt exists
if (prompt != None and prompt
in local_cache): # check if messages / prompt exists
if litellm.caching_with_models:
# if caching with model names is enabled, key is prompt + model name
if (
"model" in kwargs
and kwargs["model"] in local_cache[prompt]["models"]
):
if ("model" in kwargs and kwargs["model"]
in local_cache[prompt]["models"]):
cache_key = prompt + kwargs["model"]
return local_cache[cache_key]
else: # caching only with prompts
@ -363,10 +396,8 @@ def client(original_function):
try: # never block execution
prompt = get_prompt(*args, **kwargs)
if litellm.caching_with_models: # caching with model + prompt
if (
"model" in kwargs
and kwargs["model"] in local_cache[prompt]["models"]
):
if ("model" in kwargs
and kwargs["model"] in local_cache[prompt]["models"]):
cache_key = prompt + kwargs["model"]
local_cache[cache_key] = result
else: # caching based only on prompts
@ -381,24 +412,24 @@ def client(original_function):
function_setup(*args, **kwargs)
litellm_call_id = str(uuid.uuid4())
kwargs["litellm_call_id"] = litellm_call_id
## [OPTIONAL] CHECK CACHE
# [OPTIONAL] CHECK CACHE
start_time = datetime.datetime.now()
if (litellm.caching or litellm.caching_with_models) and (
cached_result := check_cache(*args, **kwargs)
) is not None:
cached_result := check_cache(*args, **kwargs)) is not None:
result = cached_result
else:
## MODEL CALL
# MODEL CALL
result = original_function(*args, **kwargs)
end_time = datetime.datetime.now()
## Add response to CACHE
# Add response to CACHE
if litellm.caching:
add_cache(result, *args, **kwargs)
## LOG SUCCESS
# LOG SUCCESS
crash_reporting(*args, **kwargs)
my_thread = threading.Thread(
target=handle_success, args=(args, kwargs, result, start_time, end_time)
) # don't interrupt execution of main thread
target=handle_success,
args=(args, kwargs, result, start_time,
end_time)) # don't interrupt execution of main thread
my_thread.start()
return result
except Exception as e:
@ -407,7 +438,8 @@ def client(original_function):
end_time = datetime.datetime.now()
my_thread = threading.Thread(
target=handle_failure,
args=(e, traceback_exception, start_time, end_time, args, kwargs),
args=(e, traceback_exception, start_time, end_time, args,
kwargs),
) # don't interrupt execution of main thread
my_thread.start()
raise e
@ -432,18 +464,18 @@ def token_counter(model, text):
return num_tokens
def cost_per_token(model="gpt-3.5-turbo", prompt_tokens=0, completion_tokens=0):
## given
def cost_per_token(model="gpt-3.5-turbo",
prompt_tokens=0,
completion_tokens=0):
# given
prompt_tokens_cost_usd_dollar = 0
completion_tokens_cost_usd_dollar = 0
model_cost_ref = litellm.model_cost
if model in model_cost_ref:
prompt_tokens_cost_usd_dollar = (
model_cost_ref[model]["input_cost_per_token"] * prompt_tokens
)
model_cost_ref[model]["input_cost_per_token"] * prompt_tokens)
completion_tokens_cost_usd_dollar = (
model_cost_ref[model]["output_cost_per_token"] * completion_tokens
)
model_cost_ref[model]["output_cost_per_token"] * completion_tokens)
return prompt_tokens_cost_usd_dollar, completion_tokens_cost_usd_dollar
else:
# calculate average input cost
@ -464,8 +496,9 @@ def completion_cost(model="gpt-3.5-turbo", prompt="", completion=""):
prompt_tokens = token_counter(model=model, text=prompt)
completion_tokens = token_counter(model=model, text=completion)
prompt_tokens_cost_usd_dollar, completion_tokens_cost_usd_dollar = cost_per_token(
model=model, prompt_tokens=prompt_tokens, completion_tokens=completion_tokens
)
model=model,
prompt_tokens=prompt_tokens,
completion_tokens=completion_tokens)
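# Illustrative example (made-up prices, not real model costs): with
# input_cost_per_token = 1.5e-06 and output_cost_per_token = 2e-06,
# 100 prompt tokens + 50 completion tokens cost
# 100 * 1.5e-06 + 50 * 2e-06 = 0.00025 USD.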
return prompt_tokens_cost_usd_dollar + completion_tokens_cost_usd_dollar
@ -557,8 +590,7 @@ def get_optional_params(
optional_params["max_tokens"] = max_tokens
if frequency_penalty != 0:
optional_params["frequency_penalty"] = frequency_penalty
elif (
model == "chat-bison"
elif (model == "chat-bison"
): # chat-bison has diff args from chat-bison@001 ty Google
if temperature != 1:
optional_params["temperature"] = temperature
@ -619,7 +651,10 @@ def load_test_model(
test_prompt = prompt
if num_calls:
test_calls = num_calls
messages = [[{"role": "user", "content": test_prompt}] for _ in range(test_calls)]
messages = [[{
"role": "user",
"content": test_prompt
}] for _ in range(test_calls)]
start_time = time.time()
try:
litellm.batch_completion(
@ -649,7 +684,7 @@ def load_test_model(
def set_callbacks(callback_list):
global sentry_sdk_instance, capture_exception, add_breadcrumb, posthog, slack_app, alerts_channel, heliconeLogger, aispendLogger, berrispendLogger, supabaseClient, liteDebuggerClient
global sentry_sdk_instance, capture_exception, add_breadcrumb, posthog, slack_app, alerts_channel, heliconeLogger, aispendLogger, berrispendLogger, supabaseClient, liteDebuggerClient, llmonitorLogger
try:
for callback in callback_list:
print(f"callback: {callback}")
@ -657,17 +692,15 @@ def set_callbacks(callback_list):
try:
import sentry_sdk
except ImportError:
print_verbose("Package 'sentry_sdk' is missing. Installing it...")
print_verbose(
"Package 'sentry_sdk' is missing. Installing it...")
subprocess.check_call(
[sys.executable, "-m", "pip", "install", "sentry_sdk"]
)
[sys.executable, "-m", "pip", "install", "sentry_sdk"])
import sentry_sdk
sentry_sdk_instance = sentry_sdk
sentry_trace_rate = (
os.environ.get("SENTRY_API_TRACE_RATE")
sentry_trace_rate = (os.environ.get("SENTRY_API_TRACE_RATE")
if "SENTRY_API_TRACE_RATE" in os.environ
else "1.0"
)
else "1.0")
sentry_sdk_instance.init(
dsn=os.environ.get("SENTRY_API_URL"),
traces_sample_rate=float(sentry_trace_rate),
@ -678,10 +711,10 @@ def set_callbacks(callback_list):
try:
from posthog import Posthog
except ImportError:
print_verbose("Package 'posthog' is missing. Installing it...")
print_verbose(
"Package 'posthog' is missing. Installing it...")
subprocess.check_call(
[sys.executable, "-m", "pip", "install", "posthog"]
)
[sys.executable, "-m", "pip", "install", "posthog"])
from posthog import Posthog
posthog = Posthog(
project_api_key=os.environ.get("POSTHOG_API_KEY"),
@ -691,10 +724,10 @@ def set_callbacks(callback_list):
try:
from slack_bolt import App
except ImportError:
print_verbose("Package 'slack_bolt' is missing. Installing it...")
print_verbose(
"Package 'slack_bolt' is missing. Installing it...")
subprocess.check_call(
[sys.executable, "-m", "pip", "install", "slack_bolt"]
)
[sys.executable, "-m", "pip", "install", "slack_bolt"])
from slack_bolt import App
slack_app = App(
token=os.environ.get("SLACK_API_TOKEN"),
@ -704,6 +737,8 @@ def set_callbacks(callback_list):
print_verbose(f"Initialized Slack App: {slack_app}")
elif callback == "helicone":
heliconeLogger = HeliconeLogger()
elif callback == "llmonitor":
llmonitorLogger = LLMonitorLogger()
elif callback == "aispend":
aispendLogger = AISpendLogger()
elif callback == "berrispend":
@ -718,7 +753,8 @@ def set_callbacks(callback_list):
raise e
def handle_failure(exception, traceback_exception, start_time, end_time, args, kwargs):
def handle_failure(exception, traceback_exception, start_time, end_time, args,
kwargs):
global sentry_sdk_instance, capture_exception, add_breadcrumb, posthog, slack_app, alerts_channel, aispendLogger, berrispendLogger, supabaseClient, liteDebuggerClient
try:
# print_verbose(f"handle_failure args: {args}")
@ -728,8 +764,7 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
failure_handler = additional_details.pop("failure_handler", None)
additional_details["Event_Name"] = additional_details.pop(
"failed_event_name", "litellm.failed_query"
)
"failed_event_name", "litellm.failed_query")
print_verbose(f"self.failure_callback: {litellm.failure_callback}")
# print_verbose(f"additional_details: {additional_details}")
@ -746,9 +781,8 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
for detail in additional_details:
slack_msg += f"{detail}: {additional_details[detail]}\n"
slack_msg += f"Traceback: {traceback_exception}"
slack_app.client.chat_postMessage(
channel=alerts_channel, text=slack_msg
)
slack_app.client.chat_postMessage(channel=alerts_channel,
text=slack_msg)
elif callback == "sentry":
capture_exception(exception)
elif callback == "posthog":
@ -767,9 +801,8 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
print_verbose(f"ph_obj: {ph_obj}")
print_verbose(f"PostHog Event Name: {event_name}")
if "user_id" in additional_details:
posthog.capture(
additional_details["user_id"], event_name, ph_obj
)
posthog.capture(additional_details["user_id"],
event_name, ph_obj)
else: # PostHog calls require a unique id to identify a user - https://posthog.com/docs/libraries/python
unique_id = str(uuid.uuid4())
posthog.capture(unique_id, event_name)
@ -783,10 +816,10 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
"created": time.time(),
"error": traceback_exception,
"usage": {
"prompt_tokens": prompt_token_calculator(
model, messages=messages
),
"completion_tokens": 0,
"prompt_tokens":
prompt_token_calculator(model, messages=messages),
"completion_tokens":
0,
},
}
berrispendLogger.log_event(
@ -805,10 +838,10 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
"model": model,
"created": time.time(),
"usage": {
"prompt_tokens": prompt_token_calculator(
model, messages=messages
),
"completion_tokens": 0,
"prompt_tokens":
prompt_token_calculator(model, messages=messages),
"completion_tokens":
0,
},
}
aispendLogger.log_event(
@ -818,6 +851,27 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
end_time=end_time,
print_verbose=print_verbose,
)
elif callback == "llmonitor":
print_verbose("reaches llmonitor for logging!")
model = args[0] if len(args) > 0 else kwargs["model"]
messages = args[1] if len(args) > 1 else kwargs["messages"]
usage = {
"prompt_tokens":
prompt_token_calculator(model, messages=messages),
"completion_tokens":
0,
}
llmonitorLogger.log_event(
type="error",
user_id=litellm._thread_context.user,
model=model,
error=traceback_exception,
response_obj=None,  # no model response is available on the failure path
run_id=kwargs["litellm_call_id"],
time=end_time,
usage=usage,
print_verbose=print_verbose,
)
elif callback == "supabase":
print_verbose("reaches supabase for logging!")
print_verbose(f"supabaseClient: {supabaseClient}")
@ -828,10 +882,10 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
"created": time.time(),
"error": traceback_exception,
"usage": {
"prompt_tokens": prompt_token_calculator(
model, messages=messages
),
"completion_tokens": 0,
"prompt_tokens":
prompt_token_calculator(model, messages=messages),
"completion_tokens":
0,
},
}
supabaseClient.log_event(
@ -854,10 +908,10 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
"created": time.time(),
"error": traceback_exception,
"usage": {
"prompt_tokens": prompt_token_calculator(
model, messages=messages
),
"completion_tokens": 0,
"prompt_tokens":
prompt_token_calculator(model, messages=messages),
"completion_tokens":
0,
},
}
liteDebuggerClient.log_event(
@ -884,19 +938,18 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
failure_handler(call_details)
pass
except Exception as e:
## LOGGING
# LOGGING
exception_logging(logger_fn=user_logger_fn, exception=e)
pass
def handle_success(args, kwargs, result, start_time, end_time):
global heliconeLogger, aispendLogger, supabaseClient, liteDebuggerClient
global heliconeLogger, aispendLogger, supabaseClient, liteDebuggerClient, llmonitorLogger
try:
success_handler = additional_details.pop("success_handler", None)
failure_handler = additional_details.pop("failure_handler", None)
additional_details["Event_Name"] = additional_details.pop(
"successful_event_name", "litellm.succes_query"
)
"successful_event_name", "litellm.succes_query")
for callback in litellm.success_callback:
try:
if callback == "posthog":
@ -905,9 +958,8 @@ def handle_success(args, kwargs, result, start_time, end_time):
ph_obj[detail] = additional_details[detail]
event_name = additional_details["Event_Name"]
if "user_id" in additional_details:
posthog.capture(
additional_details["user_id"], event_name, ph_obj
)
posthog.capture(additional_details["user_id"],
event_name, ph_obj)
else: # PostHog calls require a unique id to identify a user - https://posthog.com/docs/libraries/python
unique_id = str(uuid.uuid4())
posthog.capture(unique_id, event_name, ph_obj)
@ -916,9 +968,8 @@ def handle_success(args, kwargs, result, start_time, end_time):
slack_msg = ""
for detail in additional_details:
slack_msg += f"{detail}: {additional_details[detail]}\n"
slack_app.client.chat_postMessage(
channel=alerts_channel, text=slack_msg
)
slack_app.client.chat_postMessage(channel=alerts_channel,
text=slack_msg)
elif callback == "helicone":
print_verbose("reaches helicone for logging!")
model = args[0] if len(args) > 0 else kwargs["model"]
@ -931,6 +982,22 @@ def handle_success(args, kwargs, result, start_time, end_time):
end_time=end_time,
print_verbose=print_verbose,
)
elif callback == "llmonitor":
print_verbose("reaches llmonitor for logging!")
model = args[0] if len(args) > 0 else kwargs["model"]
messages = args[1] if len(args) > 1 else kwargs["messages"]
usage = result["usage"] if "usage" in result else None
llmonitorLogger.log_event(
type="end",
model=model,
messages=messages,
user_id=litellm._thread_context.user,
response_obj=result,
time=end_time,
usage=usage,
run_id=kwargs["litellm_call_id"],
print_verbose=print_verbose,
)
elif callback == "aispend":
print_verbose("reaches aispend for logging!")
model = args[0] if len(args) > 0 else kwargs["model"]
@ -984,7 +1051,7 @@ def handle_success(args, kwargs, result, start_time, end_time):
print_verbose=print_verbose,
)
except Exception as e:
## LOGGING
# LOGGING
exception_logging(logger_fn=user_logger_fn, exception=e)
print_verbose(
f"[Non-Blocking] Success Callback Error - {traceback.format_exc()}"
@ -995,7 +1062,7 @@ def handle_success(args, kwargs, result, start_time, end_time):
success_handler(args, kwargs)
pass
except Exception as e:
## LOGGING
# LOGGING
exception_logging(logger_fn=user_logger_fn, exception=e)
print_verbose(
f"[Non-Blocking] Success Callback Error - {traceback.format_exc()}"
@ -1046,33 +1113,36 @@ def exception_type(model, original_exception, custom_llm_provider):
exception_type = ""
if "claude" in model: # one of the anthropics
if hasattr(original_exception, "status_code"):
print_verbose(f"status_code: {original_exception.status_code}")
print_verbose(
f"status_code: {original_exception.status_code}")
if original_exception.status_code == 401:
exception_mapping_worked = True
raise AuthenticationError(
message=f"AnthropicException - {original_exception.message}",
message=
f"AnthropicException - {original_exception.message}",
llm_provider="anthropic",
)
elif original_exception.status_code == 400:
exception_mapping_worked = True
raise InvalidRequestError(
message=f"AnthropicException - {original_exception.message}",
message=
f"AnthropicException - {original_exception.message}",
model=model,
llm_provider="anthropic",
)
elif original_exception.status_code == 429:
exception_mapping_worked = True
raise RateLimitError(
message=f"AnthropicException - {original_exception.message}",
message=
f"AnthropicException - {original_exception.message}",
llm_provider="anthropic",
)
elif (
"Could not resolve authentication method. Expected either api_key or auth_token to be set."
in error_str
):
elif ("Could not resolve authentication method. Expected either api_key or auth_token to be set."
in error_str):
exception_mapping_worked = True
raise AuthenticationError(
message=f"AnthropicException - {original_exception.message}",
message=
f"AnthropicException - {original_exception.message}",
llm_provider="anthropic",
)
elif "replicate" in model:
@ -1097,25 +1167,25 @@ def exception_type(model, original_exception, custom_llm_provider):
)
elif (
exception_type == "ReplicateError"
): ## ReplicateError implies an error on Replicate server side, not user side
): # ReplicateError implies an error on Replicate server side, not user side
raise ServiceUnavailableError(
message=f"ReplicateException - {error_str}",
llm_provider="replicate",
)
elif model == "command-nightly": # Cohere
if (
"invalid api token" in error_str
or "No API key provided." in error_str
):
if ("invalid api token" in error_str
or "No API key provided." in error_str):
exception_mapping_worked = True
raise AuthenticationError(
message=f"CohereException - {original_exception.message}",
message=
f"CohereException - {original_exception.message}",
llm_provider="cohere",
)
elif "too many tokens" in error_str:
exception_mapping_worked = True
raise InvalidRequestError(
message=f"CohereException - {original_exception.message}",
message=
f"CohereException - {original_exception.message}",
model=model,
llm_provider="cohere",
)
@ -1124,7 +1194,8 @@ def exception_type(model, original_exception, custom_llm_provider):
): # cohere seems to fire these errors when we load test it (1k+ messages / min)
exception_mapping_worked = True
raise RateLimitError(
message=f"CohereException - {original_exception.message}",
message=
f"CohereException - {original_exception.message}",
llm_provider="cohere",
)
elif custom_llm_provider == "huggingface":
@ -1132,27 +1203,30 @@ def exception_type(model, original_exception, custom_llm_provider):
if original_exception.status_code == 401:
exception_mapping_worked = True
raise AuthenticationError(
message=f"HuggingfaceException - {original_exception.message}",
message=
f"HuggingfaceException - {original_exception.message}",
llm_provider="huggingface",
)
elif original_exception.status_code == 400:
exception_mapping_worked = True
raise InvalidRequestError(
message=f"HuggingfaceException - {original_exception.message}",
message=
f"HuggingfaceException - {original_exception.message}",
model=model,
llm_provider="huggingface",
)
elif original_exception.status_code == 429:
exception_mapping_worked = True
raise RateLimitError(
message=f"HuggingfaceException - {original_exception.message}",
message=
f"HuggingfaceException - {original_exception.message}",
llm_provider="huggingface",
)
raise original_exception # base case - return the original exception
else:
raise original_exception
except Exception as e:
## LOGGING
# LOGGING
exception_logging(
logger_fn=user_logger_fn,
additional_args={
@ -1173,7 +1247,7 @@ def safe_crash_reporting(model=None, exception=None, custom_llm_provider=None):
"exception": str(exception),
"custom_llm_provider": custom_llm_provider,
}
threading.Thread(target=litellm_telemetry, args=(data,)).start()
threading.Thread(target=litellm_telemetry, args=(data, )).start()
def litellm_telemetry(data):
@ -1223,11 +1297,13 @@ def get_secret(secret_name):
if litellm.secret_manager_client != None:
# TODO: check which secret manager is being used
# currently only supports Infisical
secret = litellm.secret_manager_client.get_secret(secret_name).secret_value
secret = litellm.secret_manager_client.get_secret(
secret_name).secret_value
if secret != None:
return secret # if secret found in secret manager return it
else:
raise ValueError(f"Secret '{secret_name}' not found in secret manager")
raise ValueError(
f"Secret '{secret_name}' not found in secret manager")
elif litellm.api_key != None: # if users use litellm default key
return litellm.api_key
else:
@ -1238,6 +1314,7 @@ def get_secret(secret_name):
# wraps the completion stream to return the correct format for the model
# replicate/anthropic/cohere
class CustomStreamWrapper:
def __init__(self, completion_stream, model, custom_llm_provider=None):
self.model = model
self.custom_llm_provider = custom_llm_provider
@ -1288,7 +1365,8 @@ class CustomStreamWrapper:
elif self.model == "replicate":
chunk = next(self.completion_stream)
completion_obj["content"] = chunk
elif (self.model == "together_ai") or ("togethercomputer" in self.model):
elif (self.model == "together_ai") or ("togethercomputer"
in self.model):
chunk = next(self.completion_stream)
text_data = self.handle_together_ai_chunk(chunk)
if text_data == "":
@ -1321,12 +1399,11 @@ def read_config_args(config_path):
########## ollama implementation ############################
import aiohttp
async def get_ollama_response_stream(
api_base="http://localhost:11434", model="llama2", prompt="Why is the sky blue?"
):
async def get_ollama_response_stream(api_base="http://localhost:11434",
model="llama2",
prompt="Why is the sky blue?"):
session = aiohttp.ClientSession()
url = f"{api_base}/api/generate"
data = {
@ -1349,7 +1426,11 @@ async def get_ollama_response_stream(
"content": "",
}
completion_obj["content"] = j["response"]
yield {"choices": [{"delta": completion_obj}]}
yield {
"choices": [{
"delta": completion_obj
}]
}
# self.responses.append(j["response"])
# yield "blank"
except Exception as e:

View file

@ -1,6 +1,7 @@
# liteLLM Proxy Server: 50+ LLM Models, Error Handling, Caching
### Azure, Llama2, OpenAI, Claude, Hugging Face, Replicate Models
[![PyPI Version](https://img.shields.io/pypi/v/litellm.svg)](https://pypi.org/project/litellm/)
[![PyPI Version](https://img.shields.io/badge/stable%20version-v0.1.345-blue?color=green&link=https://pypi.org/project/litellm/0.1.1/)](https://pypi.org/project/litellm/0.1.1/)
![Downloads](https://img.shields.io/pypi/dm/litellm)
@ -11,9 +12,11 @@
![4BC6491E-86D0-4833-B061-9F54524B2579](https://github.com/BerriAI/litellm/assets/17561003/f5dd237b-db5e-42e1-b1ac-f05683b1d724)
## What does liteLLM proxy do
- Make `/chat/completions` requests for 50+ LLM models **Azure, OpenAI, Replicate, Anthropic, Hugging Face**
Example: for `model` use `claude-2`, `gpt-3.5`, `gpt-4`, `command-nightly`, `stabilityai/stablecode-completion-alpha-3b-4k`
```json
{
"model": "replicate/llama-2-70b-chat:2c1608e18606fad2812020dc541930f2d0495ce32eee50074220b87300bc16e1",
@ -25,11 +28,12 @@
]
}
```
- **Consistent Input/Output** Format
- Call all models using the OpenAI format - `completion(model, messages)`
- Text responses will always be available at `['choices'][0]['message']['content']` (see the request snippet after this feature list)
- **Error Handling** Using Model Fallbacks (if `GPT-4` fails, try `llama2`)
- **Logging** - Log Requests, Responses and Errors to `Supabase`, `Posthog`, `Mixpanel`, `Sentry`, `Helicone` (Any of the supported providers here: https://litellm.readthedocs.io/en/latest/advanced/
- **Logging** - Log Requests, Responses and Errors to `Supabase`, `Posthog`, `Mixpanel`, `Sentry`, `Helicone`, `LLMonitor` (any of the supported providers here: https://litellm.readthedocs.io/en/latest/advanced/)
**Example: Logs sent to Supabase**
<img width="1015" alt="Screenshot 2023-08-11 at 4 02 46 PM" src="https://github.com/ishaan-jaff/proxy-server/assets/29436595/237557b8-ba09-4917-982c-8f3e1b2c8d08">
@ -38,7 +42,6 @@
- **Caching** - Implementation of Semantic Caching
- **Streaming & Async Support** - Return generators to stream text responses
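To illustrate the consistent output format, here is a minimal sketch of calling the proxy and reading the reply text; the host and port are placeholders for wherever the server is running:
```python
import requests

# Placeholder URL - point this at your running proxy instance.
url = "http://0.0.0.0:5000/chat/completions"

payload = {
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Hello, how are you?"}]
}

response = requests.post(url, json=payload)
# The reply text lives at the same path for every supported model.
print(response.json()['choices'][0]['message']['content'])
```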
## API Endpoints
### `/chat/completions` (POST)
@ -46,15 +49,18 @@
This endpoint is used to generate chat completions for 50+ supported LLM API models, e.g. llama2, GPT-4, Claude2.
#### Input
This API endpoint accepts all inputs in raw JSON and expects the following inputs
- `model` (string, required): ID of the model to use for chat completions. See all supported models [here](https://litellm.readthedocs.io/en/latest/supported/),
e.g. `gpt-3.5-turbo`, `gpt-4`, `claude-2`, `command-nightly`, `stabilityai/stablecode-completion-alpha-3b-4k`
- `messages` (array, required): A list of messages representing the conversation context. Each message should have a `role` (system, user, assistant, or function), `content` (message text), and `name` (for function role).
- Additional Optional parameters: `temperature`, `functions`, `function_call`, `top_p`, `n`, `stream`. See the full list of supported inputs here: https://litellm.readthedocs.io/en/latest/input/
#### Example JSON body
For claude-2
```json
{
"model": "claude-2",
@ -64,11 +70,11 @@ For claude-2
"role": "user"
}
]
}
```
### Making an API request to the Proxy Server
```python
import requests
import json
@ -94,8 +100,10 @@ print(response.text)
```
### Output [Response Format]
Responses from the server are given in the following format.
All responses from the server are returned in the following format (for all LLM models). More info on output here: https://litellm.readthedocs.io/en/latest/output/
```json
{
"choices": [
@ -121,7 +129,9 @@ All responses from the server are returned in the following format (for all LLM
```
## Installation & Usage
### Running Locally
1. Clone liteLLM repository to your local machine:
```
git clone https://github.com/BerriAI/liteLLM-proxy
@ -141,24 +151,24 @@ All responses from the server are returned in the following format (for all LLM
python main.py
```
## Deploying
1. Quick Start: Deploy on Railway
[![Deploy on Railway](https://railway.app/button.svg)](https://railway.app/template/DYqQAW?referralCode=t3ukrU)
2. `GCP`, `AWS`, `Azure`
This project includes a `Dockerfile` allowing you to build and deploy a Docker Project on your providers
This project includes a `Dockerfile`, allowing you to build and deploy a Docker image on your cloud provider.
# Support / Talk with founders
- [Our calendar 👋](https://calendly.com/d/4mp-gd3-k5k/berriai-1-1-onboarding-litellm-hosted-version)
- [Community Discord 💭](https://discord.gg/wuPM9dRgDw)
- Our numbers 📞 +1 (770) 8783-106 / +1 (412) 618-6238
- Our emails ✉️ ishaan@berri.ai / krrish@berri.ai
## Roadmap
- [ ] Support hosted db (e.g. Supabase)
- [ ] Easily send data to places like posthog and sentry.
- [ ] Add a hot-cache for project spend logs - enables fast checks for user + project limits