forked from phoenix/litellm-mirror
almost working llmonitor
This commit is contained in:
parent
22c7e38de5
commit
3675d3e029
5 changed files with 425 additions and 326 deletions
|
@ -1,6 +1,7 @@
|
|||
|
||||
# liteLLM Proxy Server: 50+ LLM Models, Error Handling, Caching
|
||||
|
||||
### Azure, Llama2, OpenAI, Claude, Hugging Face, Replicate Models
|
||||
|
||||
[](https://pypi.org/project/litellm/)
|
||||
[](https://pypi.org/project/litellm/0.1.1/)
|
||||

|
||||
|
@ -11,34 +12,36 @@
|
|||

|
||||
|
||||
## What does liteLLM proxy do
|
||||
|
||||
- Make `/chat/completions` requests for 50+ LLM models **Azure, OpenAI, Replicate, Anthropic, Hugging Face**
|
||||
|
||||
|
||||
Example: for `model` use `claude-2`, `gpt-3.5`, `gpt-4`, `command-nightly`, `stabilityai/stablecode-completion-alpha-3b-4k`
|
||||
|
||||
```json
{
  "model": "replicate/llama-2-70b-chat:2c1608e18606fad2812020dc541930f2d0495ce32eee50074220b87300bc16e1",
  "messages": [
    {
      "content": "Hello, whats the weather in San Francisco??",
      "role": "user"
    }
  ]
}
```
|
||||
- **Consistent Input/Output** Format
|
||||
- Call all models using the OpenAI format - `completion(model, messages)`
|
||||
- Text responses will always be available at `['choices'][0]['message']['content']`
|
||||
- **Error Handling** Using Model Fallbacks (if `GPT-4` fails, try `llama2`)
|
||||
- **Logging** - Log Requests, Responses and Errors to `Supabase`, `Posthog`, `Mixpanel`, `Sentry`, `Helicone` (Any of the supported providers here: https://litellm.readthedocs.io/en/latest/advanced/)
|
||||
|
||||
**Example: Logs sent to Supabase**
|
||||
- **Consistent Input/Output** Format
|
||||
- Call all models using the OpenAI format - `completion(model, messages)`
|
||||
- Text responses will always be available at `['choices'][0]['message']['content']` (see the snippet after this list)
|
||||
- **Error Handling** Using Model Fallbacks (if `GPT-4` fails, try `llama2`)
|
||||
- **Logging** - Log Requests, Responses and Errors to `Supabase`, `Posthog`, `Mixpanel`, `Sentry`, `LLMonitor`, `Helicone` (Any of the supported providers here: https://litellm.readthedocs.io/en/latest/advanced/)
|
||||
|
||||
**Example: Logs sent to Supabase**
|
||||
<img width="1015" alt="Screenshot 2023-08-11 at 4 02 46 PM" src="https://github.com/ishaan-jaff/proxy-server/assets/29436595/237557b8-ba09-4917-982c-8f3e1b2c8d08">
|
||||
|
||||
- **Token Usage & Spend** - Track Input + Completion tokens used + Spend/model
|
||||
- **Caching** - Implementation of Semantic Caching
|
||||
- **Streaming & Async Support** - Return generators to stream text responses
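Because every model is returned in the OpenAI format, reading the generated text out of a proxy response is identical for all of them. A minimal sketch (the localhost URL mirrors the local-run example later in this README and is an assumption about where your proxy is deployed):

```python
import requests

# Assumed local proxy address - replace with wherever your proxy server runs
url = "http://localhost:5000/chat/completions"

payload = {
    "model": "gpt-3.5-turbo",  # any supported model id works here
    "messages": [{"role": "user", "content": "Hello, whats the weather in San Francisco??"}],
}

response = requests.post(url, json=payload).json()

# Same access path for every provider, as noted in the list above
print(response["choices"][0]["message"]["content"])
```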
|
||||
|
||||
|
||||
## API Endpoints
|
||||
|
||||
### `/chat/completions` (POST)
|
||||
|
@ -46,34 +49,37 @@
|
|||
This endpoint is used to generate chat completions for 50+ supported LLM API models. Use llama2, GPT-4, Claude2, etc.
|
||||
|
||||
#### Input
|
||||
|
||||
This API endpoint accepts all inputs in raw JSON and expects the following inputs
|
||||
- `model` (string, required): ID of the model to use for chat completions. See all supported models [here](https://litellm.readthedocs.io/en/latest/supported/):
|
||||
eg `gpt-3.5-turbo`, `gpt-4`, `claude-2`, `command-nightly`, `stabilityai/stablecode-completion-alpha-3b-4k`
|
||||
|
||||
- `model` (string, required): ID of the model to use for chat completions. See all supported models [here](https://litellm.readthedocs.io/en/latest/supported/):
|
||||
eg `gpt-3.5-turbo`, `gpt-4`, `claude-2`, `command-nightly`, `stabilityai/stablecode-completion-alpha-3b-4k`
|
||||
- `messages` (array, required): A list of messages representing the conversation context. Each message should have a `role` (system, user, assistant, or function), `content` (message text), and `name` (for function role).
|
||||
- Additional Optional parameters: `temperature`, `functions`, `function_call`, `top_p`, `n`, `stream`. See the full list of supported inputs here: https://litellm.readthedocs.io/en/latest/input/
|
||||
|
||||
|
||||
#### Example JSON body
|
||||
|
||||
For claude-2
|
||||
|
||||
```json
{
  "model": "claude-2",
  "messages": [
    {
      "content": "Hello, whats the weather in San Francisco??",
      "role": "user"
    }
  ]
}
```
|
||||
|
||||
### Making an API request to the Proxy Server
|
||||
|
||||
```python
|
||||
import requests
|
||||
import json
|
||||
|
||||
# TODO: use your URL
|
||||
# TODO: use your URL
|
||||
url = "http://localhost:5000/chat/completions"
|
||||
|
||||
payload = json.dumps({
|
||||
|
@ -94,34 +100,38 @@ print(response.text)
|
|||
```
|
||||
|
||||
### Output [Response Format]
|
||||
Responses from the server are given in the following format.
|
||||
|
||||
Responses from the server are given in the following format.
|
||||
All responses from the server are returned in the following format (for all LLM models). More info on output here: https://litellm.readthedocs.io/en/latest/output/
|
||||
|
||||
```json
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "I'm sorry, but I don't have the capability to provide real-time weather information. However, you can easily check the weather in San Francisco by searching online or using a weather app on your phone.",
        "role": "assistant"
      }
    }
  ],
  "created": 1691790381,
  "id": "chatcmpl-7mUFZlOEgdohHRDx2UpYPRTejirzb",
  "model": "gpt-3.5-turbo-0613",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 41,
    "prompt_tokens": 16,
    "total_tokens": 57
  }
}
```
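The `usage` block is what makes the **Token Usage & Spend** tracking above easy to consume client-side. A small sketch of turning it into an approximate cost (the per-token prices below are placeholders, not real rates):

```python
# Parsed response from the proxy, with the usage fields shown in the example above
response = {"usage": {"completion_tokens": 41, "prompt_tokens": 16, "total_tokens": 57}}

usage = response["usage"]

# Placeholder USD-per-token prices - substitute the actual rates for your model
INPUT_PRICE, OUTPUT_PRICE = 0.0000015, 0.000002

cost = usage["prompt_tokens"] * INPUT_PRICE + usage["completion_tokens"] * OUTPUT_PRICE
print(f"{usage['total_tokens']} tokens, approx ${cost:.6f}")
```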
|
||||
|
||||
## Installation & Usage
|
||||
|
||||
### Running Locally
|
||||
|
||||
1. Clone liteLLM repository to your local machine:
|
||||
```
|
||||
git clone https://github.com/BerriAI/liteLLM-proxy
|
||||
|
@ -141,24 +151,24 @@ All responses from the server are returned in the following format (for all LLM
|
|||
python main.py
|
||||
```
|
||||
|
||||
|
||||
|
||||
## Deploying
|
||||
|
||||
1. Quick Start: Deploy on Railway
|
||||
|
||||
[](https://railway.app/template/DYqQAW?referralCode=t3ukrU)
|
||||
|
||||
2. `GCP`, `AWS`, `Azure`
|
||||
This project includes a `Dockerfile` allowing you to build and deploy a Docker Project on your providers
|
||||
|
||||
2. `GCP`, `AWS`, `Azure`
|
||||
This project includes a `Dockerfile` allowing you to build and deploy a Docker Project on your providers
|
||||
|
||||
# Support / Talk with founders
|
||||
|
||||
- [Our calendar 👋](https://calendly.com/d/4mp-gd3-k5k/berriai-1-1-onboarding-litellm-hosted-version)
|
||||
- [Community Discord 💭](https://discord.gg/wuPM9dRgDw)
|
||||
- Our numbers 📞 +1 (770) 8783-106 / +1 (412) 618-6238
|
||||
- Our emails ✉️ ishaan@berri.ai / krrish@berri.ai
|
||||
|
||||
|
||||
## Roadmap
|
||||
|
||||
- [ ] Support hosted db (e.g. Supabase)
|
||||
- [ ] Easily send data to places like posthog and sentry.
|
||||
- [ ] Add a hot-cache for project spend logs - enables fast checks for user + project limits
|
||||
|
|
|
@ -5,6 +5,7 @@ import traceback
|
|||
import dotenv
|
||||
import os
|
||||
import requests
|
||||
|
||||
dotenv.load_dotenv() # Loading env variables using dotenv
|
||||
|
||||
|
||||
|
@ -14,45 +15,34 @@ class LLMonitorLogger:
|
|||
# Instance variables
|
||||
self.api_url = os.getenv(
|
||||
"LLMONITOR_API_URL") or "https://app.llmonitor.com"
|
||||
self.account_id = os.getenv("LLMONITOR_APP_ID")
|
||||
self.app_id = os.getenv("LLMONITOR_APP_ID")
|
||||
|
||||
def log_event(self, model, messages, response_obj, start_time, end_time, print_verbose):
|
||||
def log_event(self, type, run_id, error, usage, model, messages,
|
||||
response_obj, user_id, time, print_verbose):
|
||||
# Method definition
|
||||
try:
|
||||
print_verbose(
|
||||
f"LLMonitor Logging - Enters logging function for model {model}")
|
||||
f"LLMonitor Logging - Enters logging function for model {model}"
|
||||
)
|
||||
|
||||
print(model, messages, response_obj, start_time, end_time)
|
||||
print(type, model, messages, response_obj, time, user_id)
|
||||
|
||||
# headers = {
|
||||
# 'Content-Type': 'application/json'
|
||||
# }
|
||||
headers = {'Content-Type': 'application/json'}
|
||||
|
||||
# prompt_tokens_cost_usd_dollar, completion_tokens_cost_usd_dollar = self.price_calculator(
|
||||
# model, response_obj, start_time, end_time)
|
||||
# total_cost = prompt_tokens_cost_usd_dollar + completion_tokens_cost_usd_dollar
|
||||
data = {
|
||||
"type": "llm",
|
||||
"name": model,
|
||||
"runId": run_id,
|
||||
"app": self.app_id,
|
||||
"error": error,
|
||||
"event": type,
|
||||
"timestamp": time.isoformat(),
|
||||
"userId": user_id,
|
||||
"input": messages,
|
||||
"output": response_obj['choices'][0]['message']['content'],
|
||||
}
|
||||
|
||||
# response_time = (end_time-start_time).total_seconds()
|
||||
# if "response" in response_obj:
|
||||
# data = [{
|
||||
# "response_time": response_time,
|
||||
# "model_id": response_obj["model"],
|
||||
# "total_cost": total_cost,
|
||||
# "messages": messages,
|
||||
# "response": response_obj['choices'][0]['message']['content'],
|
||||
# "account_id": self.account_id
|
||||
# }]
|
||||
# elif "error" in response_obj:
|
||||
# data = [{
|
||||
# "response_time": response_time,
|
||||
# "model_id": response_obj["model"],
|
||||
# "total_cost": total_cost,
|
||||
# "messages": messages,
|
||||
# "error": response_obj['error'],
|
||||
# "account_id": self.account_id
|
||||
# }]
|
||||
|
||||
# print_verbose(f"BerriSpend Logging - final data object: {data}")
|
||||
print_verbose(f"LLMonitor Logging - final data object: {data}")
|
||||
# response = requests.post(url, headers=headers, json=data)
|
||||
except:
    # traceback.print_exc()
    pass
|
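# The POST to LLMonitor is still commented out in this commit. Below is a hedged
# sketch of what the send could look like once wired up - the "/api/report" path and
# the {"events": [...]} envelope are assumptions for illustration, not confirmed
# LLMonitor API details. `requests` is already imported at the top of this module.
def _send_llmonitor_event(api_url, data, print_verbose):
    headers = {'Content-Type': 'application/json'}
    try:
        response = requests.post(f"{api_url}/api/report",
                                 headers=headers,
                                 json={"events": [data]},
                                 timeout=5)
        print_verbose(f"LLMonitor Logging - got response {response.status_code}")
    except Exception:
        pass  # keep logging non-blocking, mirroring the bare except above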
||||
|
|
|
@ -1,28 +1,36 @@
|
|||
#### What this tests ####
|
||||
# This tests if logging to the helicone integration actually works
|
||||
|
||||
from litellm import embedding, completion
|
||||
import litellm
|
||||
# This tests if logging to the llmonitor integration actually works
|
||||
# Adds the parent directory to the system path
|
||||
import sys
|
||||
import os
|
||||
import traceback
|
||||
import pytest
|
||||
|
||||
# Adds the parent directory to the system path
|
||||
sys.path.insert(0, os.path.abspath('../..'))
|
||||
|
||||
from litellm import completion
|
||||
import litellm
|
||||
|
||||
litellm.input_callback = ["llmonitor"]
|
||||
litellm.success_callback = ["llmonitor"]
|
||||
litellm.error_callback = ["llmonitor"]
|
||||
|
||||
litellm.set_verbose = True
|
||||
|
||||
user_message = "Hello, how are you?"
|
||||
messages = [{"content": user_message, "role": "user"}]
|
||||
|
||||
|
||||
# openai call
|
||||
response = completion(model="gpt-3.5-turbo",
|
||||
messages=[{"role": "user", "content": "Hi 👋 - i'm openai"}])
|
||||
# response = completion(model="gpt-3.5-turbo",
|
||||
# messages=[{
|
||||
# "role": "user",
|
||||
# "content": "Hi 👋 - i'm openai"
|
||||
# }])
|
||||
|
||||
# print(response)
|
||||
|
||||
# #bad request call
|
||||
# response = completion(model="chatgpt-test", messages=[{"role": "user", "content": "Hi 👋 - i'm a bad request"}])
|
||||
|
||||
# cohere call
|
||||
# response = completion(model="command-nightly",
|
||||
# messages=[{"role": "user", "content": "Hi 👋 - i'm cohere"}])
|
||||
response = completion(model="command-nightly",
|
||||
messages=[{
|
||||
"role": "user",
|
||||
"content": "Hi 👋 - i'm cohere"
|
||||
}])
|
||||
print(response)
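# A hedged sketch of also exercising litellm.error_callback: the commented-out
# "bad request call" above uses an invalid model name, so re-enabling it inside a
# try/except should make the failure handler fire the "llmonitor" error callback
# while letting this script keep running.
try:
    completion(model="chatgpt-test",
               messages=[{"role": "user", "content": "Hi 👋 - i'm a bad request"}])
except Exception as e:
    print(f"expected failure: {e}")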
|
||||
|
|
437
litellm/utils.py
437
litellm/utils.py
|
@ -1,20 +1,7 @@
|
|||
import sys
|
||||
import dotenv, json, traceback, threading
|
||||
import subprocess, os
|
||||
import litellm, openai
|
||||
import random, uuid, requests
|
||||
import datetime, time
|
||||
import tiktoken
|
||||
|
||||
encoding = tiktoken.get_encoding("cl100k_base")
|
||||
import pkg_resources
|
||||
from .integrations.helicone import HeliconeLogger
|
||||
from .integrations.aispend import AISpendLogger
|
||||
from .integrations.berrispend import BerriSpendLogger
|
||||
from .integrations.supabase import Supabase
|
||||
from .integrations.litedebugger import LiteDebugger
|
||||
from openai.error import OpenAIError as OriginalError
|
||||
from openai.openai_object import OpenAIObject
|
||||
import aiohttp
|
||||
import subprocess
|
||||
import importlib
|
||||
from typing import List, Dict, Union, Optional
|
||||
from .exceptions import (
|
||||
AuthenticationError,
|
||||
InvalidRequestError,
|
||||
|
@ -22,7 +9,32 @@ from .exceptions import (
|
|||
ServiceUnavailableError,
|
||||
OpenAIError,
|
||||
)
|
||||
from typing import List, Dict, Union, Optional
|
||||
from openai.openai_object import OpenAIObject
|
||||
from openai.error import OpenAIError as OriginalError
|
||||
from .integrations.llmonitor import LLMonitorLogger
|
||||
from .integrations.litedebugger import LiteDebugger
|
||||
from .integrations.supabase import Supabase
|
||||
from .integrations.berrispend import BerriSpendLogger
|
||||
from .integrations.aispend import AISpendLogger
|
||||
from .integrations.helicone import HeliconeLogger
|
||||
import pkg_resources
|
||||
import sys
|
||||
import dotenv
|
||||
import json
|
||||
import traceback
|
||||
import threading
|
||||
import subprocess
|
||||
import os
|
||||
import litellm
|
||||
import openai
|
||||
import random
|
||||
import uuid
|
||||
import requests
|
||||
import datetime
|
||||
import time
|
||||
import tiktoken
|
||||
|
||||
encoding = tiktoken.get_encoding("cl100k_base")
|
||||
|
||||
####### ENVIRONMENT VARIABLES ###################
|
||||
dotenv.load_dotenv() # Loading env variables using dotenv
|
||||
|
@ -37,6 +49,7 @@ aispendLogger = None
|
|||
berrispendLogger = None
|
||||
supabaseClient = None
|
||||
liteDebuggerClient = None
|
||||
llmonitorLogger = None
|
||||
callback_list: Optional[List[str]] = []
|
||||
user_logger_fn = None
|
||||
additional_details: Optional[Dict[str, str]] = {}
|
||||
|
@ -63,6 +76,7 @@ local_cache: Optional[Dict[str, str]] = {}
|
|||
|
||||
|
||||
class Message(OpenAIObject):
|
||||
|
||||
def __init__(self, content="default", role="assistant", **params):
|
||||
super(Message, self).__init__(**params)
|
||||
self.content = content
|
||||
|
@ -70,7 +84,12 @@ class Message(OpenAIObject):
|
|||
|
||||
|
||||
class Choices(OpenAIObject):
|
||||
def __init__(self, finish_reason="stop", index=0, message=Message(), **params):
|
||||
|
||||
def __init__(self,
|
||||
finish_reason="stop",
|
||||
index=0,
|
||||
message=Message(),
|
||||
**params):
|
||||
super(Choices, self).__init__(**params)
|
||||
self.finish_reason = finish_reason
|
||||
self.index = index
|
||||
|
@ -78,20 +97,22 @@ class Choices(OpenAIObject):
|
|||
|
||||
|
||||
class ModelResponse(OpenAIObject):
|
||||
def __init__(self, choices=None, created=None, model=None, usage=None, **params):
|
||||
|
||||
def __init__(self,
|
||||
choices=None,
|
||||
created=None,
|
||||
model=None,
|
||||
usage=None,
|
||||
**params):
|
||||
super(ModelResponse, self).__init__(**params)
|
||||
self.choices = choices if choices else [Choices()]
|
||||
self.created = created
|
||||
self.model = model
|
||||
self.usage = (
|
||||
usage
|
||||
if usage
|
||||
else {
|
||||
"prompt_tokens": None,
|
||||
"completion_tokens": None,
|
||||
"total_tokens": None,
|
||||
}
|
||||
)
|
||||
self.usage = (usage if usage else {
|
||||
"prompt_tokens": None,
|
||||
"completion_tokens": None,
|
||||
"total_tokens": None,
|
||||
})
|
||||
|
||||
def to_dict_recursive(self):
|
||||
d = super().to_dict_recursive()
|
||||
|
@ -108,8 +129,6 @@ def print_verbose(print_statement):
|
|||
|
||||
|
||||
####### Package Import Handler ###################
|
||||
import importlib
|
||||
import subprocess
|
||||
|
||||
|
||||
def install_and_import(package: str):
|
||||
|
@ -139,6 +158,7 @@ def install_and_import(package: str):
|
|||
# Logging function -> log the exact model details + what's being sent | Non-Blocking
|
||||
class Logging:
|
||||
global supabaseClient, liteDebuggerClient
|
||||
|
||||
def __init__(self, model, messages, optional_params, litellm_params):
|
||||
self.model = model
|
||||
self.messages = messages
|
||||
|
@ -146,20 +166,20 @@ class Logging:
|
|||
self.litellm_params = litellm_params
|
||||
self.logger_fn = litellm_params["logger_fn"]
|
||||
self.model_call_details = {
|
||||
"model": model,
|
||||
"messages": messages,
|
||||
"model": model,
|
||||
"messages": messages,
|
||||
"optional_params": self.optional_params,
|
||||
"litellm_params": self.litellm_params,
|
||||
}
|
||||
|
||||
|
||||
def pre_call(self, input, api_key, additional_args={}):
|
||||
try:
|
||||
print(f"logging pre call for model: {self.model}")
|
||||
self.model_call_details["input"] = input
|
||||
self.model_call_details["api_key"] = api_key
|
||||
self.model_call_details["additional_args"] = additional_args
|
||||
|
||||
## User Logging -> if you pass in a custom logging function
|
||||
|
||||
# User Logging -> if you pass in a custom logging function
|
||||
print_verbose(
|
||||
f"Logging Details: logger_fn - {self.logger_fn} | callable(logger_fn) - {callable(self.logger_fn)}"
|
||||
)
|
||||
|
@ -173,7 +193,7 @@ class Logging:
|
|||
f"LiteLLM.LoggingError: [Non-Blocking] Exception occurred while logging {traceback.format_exc()}"
|
||||
)
|
||||
|
||||
## Input Integration Logging -> If you want to log the fact that an attempt to call the model was made
|
||||
# Input Integration Logging -> If you want to log the fact that an attempt to call the model was made
|
||||
for callback in litellm.input_callback:
|
||||
try:
|
||||
if callback == "supabase":
|
||||
|
@ -185,7 +205,21 @@ class Logging:
|
|||
model=model,
|
||||
messages=messages,
|
||||
end_user=litellm._thread_context.user,
|
||||
litellm_call_id=self.litellm_params["litellm_call_id"],
|
||||
litellm_call_id=self.
|
||||
litellm_params["litellm_call_id"],
|
||||
print_verbose=print_verbose,
|
||||
)
|
||||
elif callback == "llmonitor":
|
||||
print_verbose("reaches llmonitor for logging!")
|
||||
model = self.model
|
||||
messages = self.messages
|
||||
print(f"liteDebuggerClient: {liteDebuggerClient}")
|
||||
llmonitorLogger.log_event(
|
||||
type="start",
|
||||
model=model,
|
||||
messages=messages,
|
||||
user_id=litellm._thread_context.user,
|
||||
run_id=self.litellm_params["litellm_call_id"],
|
||||
print_verbose=print_verbose,
|
||||
)
|
||||
elif callback == "lite_debugger":
|
||||
|
@ -197,15 +231,18 @@ class Logging:
|
|||
model=model,
|
||||
messages=messages,
|
||||
end_user=litellm._thread_context.user,
|
||||
litellm_call_id=self.litellm_params["litellm_call_id"],
|
||||
litellm_call_id=self.
|
||||
litellm_params["litellm_call_id"],
|
||||
print_verbose=print_verbose,
|
||||
)
|
||||
except Exception as e:
|
||||
print_verbose(f"LiteLLM.LoggingError: [Non-Blocking] Exception occurred while input logging with integrations {traceback.format_exc()}")
|
||||
print_verbose(
|
||||
f"LiteLLM.LoggingError: [Non-Blocking] Exception occurred while input logging with integrations {traceback.format_exc()}"
|
||||
)
|
||||
print_verbose(
|
||||
f"LiteLLM.Logging: is sentry capture exception initialized {capture_exception}"
|
||||
)
|
||||
if capture_exception: # log this error to sentry for debugging
|
||||
if capture_exception: # log this error to sentry for debugging
|
||||
capture_exception(e)
|
||||
except:
|
||||
print_verbose(
|
||||
|
@ -214,9 +251,9 @@ class Logging:
|
|||
print_verbose(
|
||||
f"LiteLLM.Logging: is sentry capture exception initialized {capture_exception}"
|
||||
)
|
||||
if capture_exception: # log this error to sentry for debugging
|
||||
if capture_exception: # log this error to sentry for debugging
|
||||
capture_exception(e)
|
||||
|
||||
|
||||
def post_call(self, input, api_key, original_response, additional_args={}):
|
||||
# Do something here
|
||||
try:
|
||||
|
@ -224,8 +261,8 @@ class Logging:
|
|||
self.model_call_details["api_key"] = api_key
|
||||
self.model_call_details["original_response"] = original_response
|
||||
self.model_call_details["additional_args"] = additional_args
|
||||
|
||||
## User Logging -> if you pass in a custom logging function
|
||||
|
||||
# User Logging -> if you pass in a custom logging function
|
||||
print_verbose(
|
||||
f"Logging Details: logger_fn - {self.logger_fn} | callable(logger_fn) - {callable(self.logger_fn)}"
|
||||
)
|
||||
|
@ -243,9 +280,9 @@ class Logging:
|
|||
f"LiteLLM.LoggingError: [Non-Blocking] Exception occurred while logging {traceback.format_exc()}"
|
||||
)
|
||||
pass
|
||||
|
||||
|
||||
# Add more methods as needed
|
||||
|
||||
|
||||
|
||||
def exception_logging(
|
||||
additional_args={},
|
||||
|
@ -257,7 +294,7 @@ def exception_logging(
|
|||
if exception:
|
||||
model_call_details["exception"] = exception
|
||||
model_call_details["additional_args"] = additional_args
|
||||
## User Logging -> if you pass in a custom logging function or want to use sentry breadcrumbs
|
||||
# User Logging -> if you pass in a custom logging function or want to use sentry breadcrumbs
|
||||
print_verbose(
|
||||
f"Logging Details: logger_fn - {logger_fn} | callable(logger_fn) - {callable(logger_fn)}"
|
||||
)
|
||||
|
@ -280,20 +317,20 @@ def exception_logging(
|
|||
####### CLIENT ###################
|
||||
# make it easy to log if completion/embedding runs succeeded or failed + see what happened | Non-Blocking
|
||||
def client(original_function):
|
||||
|
||||
def function_setup(
|
||||
*args, **kwargs
|
||||
): # just run once to check if user wants to send their data anywhere - PostHog/Sentry/Slack/etc.
|
||||
try:
|
||||
global callback_list, add_breadcrumb, user_logger_fn
|
||||
if (
|
||||
len(litellm.input_callback) > 0 or len(litellm.success_callback) > 0 or len(litellm.failure_callback) > 0
|
||||
) and len(callback_list) == 0:
|
||||
if (len(litellm.input_callback) > 0
|
||||
or len(litellm.success_callback) > 0
|
||||
or len(litellm.failure_callback)
|
||||
> 0) and len(callback_list) == 0:
|
||||
callback_list = list(
|
||||
set(litellm.input_callback + litellm.success_callback + litellm.failure_callback)
|
||||
)
|
||||
set_callbacks(
|
||||
callback_list=callback_list,
|
||||
)
|
||||
set(litellm.input_callback + litellm.success_callback +
|
||||
litellm.failure_callback))
|
||||
set_callbacks(callback_list=callback_list, )
|
||||
if add_breadcrumb:
|
||||
add_breadcrumb(
|
||||
category="litellm.llm_call",
|
||||
|
@ -310,12 +347,11 @@ def client(original_function):
|
|||
if litellm.telemetry:
|
||||
try:
|
||||
model = args[0] if len(args) > 0 else kwargs["model"]
|
||||
exception = kwargs["exception"] if "exception" in kwargs else None
|
||||
custom_llm_provider = (
|
||||
kwargs["custom_llm_provider"]
|
||||
if "custom_llm_provider" in kwargs
|
||||
else None
|
||||
)
|
||||
exception = kwargs[
|
||||
"exception"] if "exception" in kwargs else None
|
||||
custom_llm_provider = (kwargs["custom_llm_provider"]
|
||||
if "custom_llm_provider" in kwargs else
|
||||
None)
|
||||
safe_crash_reporting(
|
||||
model=model,
|
||||
exception=exception,
|
||||
|
@ -340,15 +376,12 @@ def client(original_function):
|
|||
def check_cache(*args, **kwargs):
|
||||
try: # never block execution
|
||||
prompt = get_prompt(*args, **kwargs)
|
||||
if (
|
||||
prompt != None and prompt in local_cache
|
||||
): # check if messages / prompt exists
|
||||
if (prompt != None and prompt
|
||||
in local_cache): # check if messages / prompt exists
|
||||
if litellm.caching_with_models:
|
||||
# if caching with model names is enabled, key is prompt + model name
|
||||
if (
|
||||
"model" in kwargs
|
||||
and kwargs["model"] in local_cache[prompt]["models"]
|
||||
):
|
||||
if ("model" in kwargs and kwargs["model"]
|
||||
in local_cache[prompt]["models"]):
|
||||
cache_key = prompt + kwargs["model"]
|
||||
return local_cache[cache_key]
|
||||
else: # caching only with prompts
|
||||
|
@ -363,10 +396,8 @@ def client(original_function):
|
|||
try: # never block execution
|
||||
prompt = get_prompt(*args, **kwargs)
|
||||
if litellm.caching_with_models: # caching with model + prompt
|
||||
if (
|
||||
"model" in kwargs
|
||||
and kwargs["model"] in local_cache[prompt]["models"]
|
||||
):
|
||||
if ("model" in kwargs
|
||||
and kwargs["model"] in local_cache[prompt]["models"]):
|
||||
cache_key = prompt + kwargs["model"]
|
||||
local_cache[cache_key] = result
|
||||
else: # caching based only on prompts
|
||||
|
@ -381,24 +412,24 @@ def client(original_function):
|
|||
function_setup(*args, **kwargs)
|
||||
litellm_call_id = str(uuid.uuid4())
|
||||
kwargs["litellm_call_id"] = litellm_call_id
|
||||
## [OPTIONAL] CHECK CACHE
|
||||
# [OPTIONAL] CHECK CACHE
|
||||
start_time = datetime.datetime.now()
|
||||
if (litellm.caching or litellm.caching_with_models) and (
|
||||
cached_result := check_cache(*args, **kwargs)
|
||||
) is not None:
|
||||
cached_result := check_cache(*args, **kwargs)) is not None:
|
||||
result = cached_result
|
||||
else:
|
||||
## MODEL CALL
|
||||
# MODEL CALL
|
||||
result = original_function(*args, **kwargs)
|
||||
end_time = datetime.datetime.now()
|
||||
## Add response to CACHE
|
||||
# Add response to CACHE
|
||||
if litellm.caching:
|
||||
add_cache(result, *args, **kwargs)
|
||||
## LOG SUCCESS
|
||||
# LOG SUCCESS
|
||||
crash_reporting(*args, **kwargs)
|
||||
my_thread = threading.Thread(
|
||||
target=handle_success, args=(args, kwargs, result, start_time, end_time)
|
||||
) # don't interrupt execution of main thread
|
||||
target=handle_success,
|
||||
args=(args, kwargs, result, start_time,
|
||||
end_time)) # don't interrupt execution of main thread
|
||||
my_thread.start()
|
||||
return result
|
||||
except Exception as e:
|
||||
|
@ -407,7 +438,8 @@ def client(original_function):
|
|||
end_time = datetime.datetime.now()
|
||||
my_thread = threading.Thread(
|
||||
target=handle_failure,
|
||||
args=(e, traceback_exception, start_time, end_time, args, kwargs),
|
||||
args=(e, traceback_exception, start_time, end_time, args,
|
||||
kwargs),
|
||||
) # don't interrupt execution of main thread
|
||||
my_thread.start()
|
||||
raise e
|
||||
|
@ -432,18 +464,18 @@ def token_counter(model, text):
|
|||
return num_tokens
|
||||
|
||||
|
||||
def cost_per_token(model="gpt-3.5-turbo", prompt_tokens=0, completion_tokens=0):
|
||||
## given
|
||||
def cost_per_token(model="gpt-3.5-turbo",
|
||||
prompt_tokens=0,
|
||||
completion_tokens=0):
|
||||
# given
|
||||
prompt_tokens_cost_usd_dollar = 0
|
||||
completion_tokens_cost_usd_dollar = 0
|
||||
model_cost_ref = litellm.model_cost
|
||||
if model in model_cost_ref:
|
||||
prompt_tokens_cost_usd_dollar = (
|
||||
model_cost_ref[model]["input_cost_per_token"] * prompt_tokens
|
||||
)
|
||||
model_cost_ref[model]["input_cost_per_token"] * prompt_tokens)
|
||||
completion_tokens_cost_usd_dollar = (
|
||||
model_cost_ref[model]["output_cost_per_token"] * completion_tokens
|
||||
)
|
||||
model_cost_ref[model]["output_cost_per_token"] * completion_tokens)
|
||||
return prompt_tokens_cost_usd_dollar, completion_tokens_cost_usd_dollar
|
||||
else:
|
||||
# calculate average input cost
|
||||
|
@ -464,8 +496,9 @@ def completion_cost(model="gpt-3.5-turbo", prompt="", completion=""):
|
|||
prompt_tokens = token_counter(model=model, text=prompt)
|
||||
completion_tokens = token_counter(model=model, text=completion)
|
||||
prompt_tokens_cost_usd_dollar, completion_tokens_cost_usd_dollar = cost_per_token(
|
||||
model=model, prompt_tokens=prompt_tokens, completion_tokens=completion_tokens
|
||||
)
|
||||
model=model,
|
||||
prompt_tokens=prompt_tokens,
|
||||
completion_tokens=completion_tokens)
|
||||
return prompt_tokens_cost_usd_dollar + completion_tokens_cost_usd_dollar
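# For context, a hedged usage sketch of the two cost helpers above. This is not part
# of the original commit; the model name, token counts, and strings are examples, and
# the per-token prices come from litellm.model_cost.
def _example_cost_usage():
    prompt_usd, completion_usd = cost_per_token(model="gpt-3.5-turbo",
                                                prompt_tokens=16,
                                                completion_tokens=41)
    print(prompt_usd + completion_usd)

    # completion_cost tokenizes the raw strings itself via token_counter
    print(completion_cost(model="gpt-3.5-turbo",
                          prompt="Hello, whats the weather in San Francisco??",
                          completion="It is sunny in San Francisco."))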
|
||||
|
||||
|
||||
|
@ -557,9 +590,8 @@ def get_optional_params(
|
|||
optional_params["max_tokens"] = max_tokens
|
||||
if frequency_penalty != 0:
|
||||
optional_params["frequency_penalty"] = frequency_penalty
|
||||
elif (
|
||||
model == "chat-bison"
|
||||
): # chat-bison has diff args from chat-bison@001 ty Google
|
||||
elif (model == "chat-bison"
|
||||
): # chat-bison has diff args from chat-bison@001 ty Google
|
||||
if temperature != 1:
|
||||
optional_params["temperature"] = temperature
|
||||
if top_p != 1:
|
||||
|
@ -619,7 +651,10 @@ def load_test_model(
|
|||
test_prompt = prompt
|
||||
if num_calls:
|
||||
test_calls = num_calls
|
||||
messages = [[{"role": "user", "content": test_prompt}] for _ in range(test_calls)]
|
||||
messages = [[{
|
||||
"role": "user",
|
||||
"content": test_prompt
|
||||
}] for _ in range(test_calls)]
|
||||
start_time = time.time()
|
||||
try:
|
||||
litellm.batch_completion(
|
||||
|
@ -649,7 +684,7 @@ def load_test_model(
|
|||
|
||||
|
||||
def set_callbacks(callback_list):
|
||||
global sentry_sdk_instance, capture_exception, add_breadcrumb, posthog, slack_app, alerts_channel, heliconeLogger, aispendLogger, berrispendLogger, supabaseClient, liteDebuggerClient
|
||||
global sentry_sdk_instance, capture_exception, add_breadcrumb, posthog, slack_app, alerts_channel, heliconeLogger, aispendLogger, berrispendLogger, supabaseClient, liteDebuggerClient, llmonitorLogger
|
||||
try:
|
||||
for callback in callback_list:
|
||||
print(f"callback: {callback}")
|
||||
|
@ -657,17 +692,15 @@ def set_callbacks(callback_list):
|
|||
try:
|
||||
import sentry_sdk
|
||||
except ImportError:
|
||||
print_verbose("Package 'sentry_sdk' is missing. Installing it...")
|
||||
print_verbose(
|
||||
"Package 'sentry_sdk' is missing. Installing it...")
|
||||
subprocess.check_call(
|
||||
[sys.executable, "-m", "pip", "install", "sentry_sdk"]
|
||||
)
|
||||
[sys.executable, "-m", "pip", "install", "sentry_sdk"])
|
||||
import sentry_sdk
|
||||
sentry_sdk_instance = sentry_sdk
|
||||
sentry_trace_rate = (
|
||||
os.environ.get("SENTRY_API_TRACE_RATE")
|
||||
if "SENTRY_API_TRACE_RATE" in os.environ
|
||||
else "1.0"
|
||||
)
|
||||
sentry_trace_rate = (os.environ.get("SENTRY_API_TRACE_RATE")
|
||||
if "SENTRY_API_TRACE_RATE" in os.environ
|
||||
else "1.0")
|
||||
sentry_sdk_instance.init(
|
||||
dsn=os.environ.get("SENTRY_API_URL"),
|
||||
traces_sample_rate=float(sentry_trace_rate),
|
||||
|
@ -678,10 +711,10 @@ def set_callbacks(callback_list):
|
|||
try:
|
||||
from posthog import Posthog
|
||||
except ImportError:
|
||||
print_verbose("Package 'posthog' is missing. Installing it...")
|
||||
print_verbose(
|
||||
"Package 'posthog' is missing. Installing it...")
|
||||
subprocess.check_call(
|
||||
[sys.executable, "-m", "pip", "install", "posthog"]
|
||||
)
|
||||
[sys.executable, "-m", "pip", "install", "posthog"])
|
||||
from posthog import Posthog
|
||||
posthog = Posthog(
|
||||
project_api_key=os.environ.get("POSTHOG_API_KEY"),
|
||||
|
@ -691,10 +724,10 @@ def set_callbacks(callback_list):
|
|||
try:
|
||||
from slack_bolt import App
|
||||
except ImportError:
|
||||
print_verbose("Package 'slack_bolt' is missing. Installing it...")
|
||||
print_verbose(
|
||||
"Package 'slack_bolt' is missing. Installing it...")
|
||||
subprocess.check_call(
|
||||
[sys.executable, "-m", "pip", "install", "slack_bolt"]
|
||||
)
|
||||
[sys.executable, "-m", "pip", "install", "slack_bolt"])
|
||||
from slack_bolt import App
|
||||
slack_app = App(
|
||||
token=os.environ.get("SLACK_API_TOKEN"),
|
||||
|
@ -704,6 +737,8 @@ def set_callbacks(callback_list):
|
|||
print_verbose(f"Initialized Slack App: {slack_app}")
|
||||
elif callback == "helicone":
|
||||
heliconeLogger = HeliconeLogger()
|
||||
elif callback == "llmonitor":
|
||||
llmonitorLogger = LLMonitorLogger()
|
||||
elif callback == "aispend":
|
||||
aispendLogger = AISpendLogger()
|
||||
elif callback == "berrispend":
|
||||
|
@ -718,7 +753,8 @@ def set_callbacks(callback_list):
|
|||
raise e
|
||||
|
||||
|
||||
def handle_failure(exception, traceback_exception, start_time, end_time, args, kwargs):
|
||||
def handle_failure(exception, traceback_exception, start_time, end_time, args,
|
||||
kwargs):
|
||||
global sentry_sdk_instance, capture_exception, add_breadcrumb, posthog, slack_app, alerts_channel, aispendLogger, berrispendLogger, supabaseClient, liteDebuggerClient
|
||||
try:
|
||||
# print_verbose(f"handle_failure args: {args}")
|
||||
|
@ -728,8 +764,7 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
|
|||
failure_handler = additional_details.pop("failure_handler", None)
|
||||
|
||||
additional_details["Event_Name"] = additional_details.pop(
|
||||
"failed_event_name", "litellm.failed_query"
|
||||
)
|
||||
"failed_event_name", "litellm.failed_query")
|
||||
print_verbose(f"self.failure_callback: {litellm.failure_callback}")
|
||||
|
||||
# print_verbose(f"additional_details: {additional_details}")
|
||||
|
@ -746,9 +781,8 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
|
|||
for detail in additional_details:
|
||||
slack_msg += f"{detail}: {additional_details[detail]}\n"
|
||||
slack_msg += f"Traceback: {traceback_exception}"
|
||||
slack_app.client.chat_postMessage(
|
||||
channel=alerts_channel, text=slack_msg
|
||||
)
|
||||
slack_app.client.chat_postMessage(channel=alerts_channel,
|
||||
text=slack_msg)
|
||||
elif callback == "sentry":
|
||||
capture_exception(exception)
|
||||
elif callback == "posthog":
|
||||
|
@ -767,9 +801,8 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
|
|||
print_verbose(f"ph_obj: {ph_obj}")
|
||||
print_verbose(f"PostHog Event Name: {event_name}")
|
||||
if "user_id" in additional_details:
|
||||
posthog.capture(
|
||||
additional_details["user_id"], event_name, ph_obj
|
||||
)
|
||||
posthog.capture(additional_details["user_id"],
|
||||
event_name, ph_obj)
|
||||
else: # PostHog calls require a unique id to identify a user - https://posthog.com/docs/libraries/python
|
||||
unique_id = str(uuid.uuid4())
|
||||
posthog.capture(unique_id, event_name)
|
||||
|
@ -783,10 +816,10 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
|
|||
"created": time.time(),
|
||||
"error": traceback_exception,
|
||||
"usage": {
|
||||
"prompt_tokens": prompt_token_calculator(
|
||||
model, messages=messages
|
||||
),
|
||||
"completion_tokens": 0,
|
||||
"prompt_tokens":
|
||||
prompt_token_calculator(model, messages=messages),
|
||||
"completion_tokens":
|
||||
0,
|
||||
},
|
||||
}
|
||||
berrispendLogger.log_event(
|
||||
|
@ -805,10 +838,10 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
|
|||
"model": model,
|
||||
"created": time.time(),
|
||||
"usage": {
|
||||
"prompt_tokens": prompt_token_calculator(
|
||||
model, messages=messages
|
||||
),
|
||||
"completion_tokens": 0,
|
||||
"prompt_tokens":
|
||||
prompt_token_calculator(model, messages=messages),
|
||||
"completion_tokens":
|
||||
0,
|
||||
},
|
||||
}
|
||||
aispendLogger.log_event(
|
||||
|
@ -818,6 +851,27 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
|
|||
end_time=end_time,
|
||||
print_verbose=print_verbose,
|
||||
)
|
||||
elif callback == "llmonitor":
|
||||
print_verbose("reaches llmonitor for logging!")
|
||||
model = args[0] if len(args) > 0 else kwargs["model"]
|
||||
messages = args[1] if len(args) > 1 else kwargs["messages"]
|
||||
usage = {
|
||||
"prompt_tokens":
|
||||
prompt_token_calculator(model, messages=messages),
|
||||
"completion_tokens":
|
||||
0,
|
||||
}
|
||||
llmonitorLogger.log_event(
|
||||
type="error",
|
||||
user_id=litellm._thread_context.user,
|
||||
model=model,
|
||||
error=traceback_exception,
|
||||
response_obj=None,  # no response object is available in the failure path
|
||||
run_id=kwargs["litellm_call_id"],
|
||||
timestamp=end_time,
|
||||
usage=usage,
|
||||
print_verbose=print_verbose,
|
||||
)
|
||||
elif callback == "supabase":
|
||||
print_verbose("reaches supabase for logging!")
|
||||
print_verbose(f"supabaseClient: {supabaseClient}")
|
||||
|
@ -828,10 +882,10 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
|
|||
"created": time.time(),
|
||||
"error": traceback_exception,
|
||||
"usage": {
|
||||
"prompt_tokens": prompt_token_calculator(
|
||||
model, messages=messages
|
||||
),
|
||||
"completion_tokens": 0,
|
||||
"prompt_tokens":
|
||||
prompt_token_calculator(model, messages=messages),
|
||||
"completion_tokens":
|
||||
0,
|
||||
},
|
||||
}
|
||||
supabaseClient.log_event(
|
||||
|
@ -854,10 +908,10 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
|
|||
"created": time.time(),
|
||||
"error": traceback_exception,
|
||||
"usage": {
|
||||
"prompt_tokens": prompt_token_calculator(
|
||||
model, messages=messages
|
||||
),
|
||||
"completion_tokens": 0,
|
||||
"prompt_tokens":
|
||||
prompt_token_calculator(model, messages=messages),
|
||||
"completion_tokens":
|
||||
0,
|
||||
},
|
||||
}
|
||||
liteDebuggerClient.log_event(
|
||||
|
@ -884,19 +938,18 @@ def handle_failure(exception, traceback_exception, start_time, end_time, args, k
|
|||
failure_handler(call_details)
|
||||
pass
|
||||
except Exception as e:
|
||||
## LOGGING
|
||||
# LOGGING
|
||||
exception_logging(logger_fn=user_logger_fn, exception=e)
|
||||
pass
|
||||
|
||||
|
||||
def handle_success(args, kwargs, result, start_time, end_time):
|
||||
global heliconeLogger, aispendLogger, supabaseClient, liteDebuggerClient
|
||||
global heliconeLogger, aispendLogger, supabaseClient, liteDebuggerClient, llmonitorLogger
|
||||
try:
|
||||
success_handler = additional_details.pop("success_handler", None)
|
||||
failure_handler = additional_details.pop("failure_handler", None)
|
||||
additional_details["Event_Name"] = additional_details.pop(
|
||||
"successful_event_name", "litellm.succes_query"
|
||||
)
|
||||
"successful_event_name", "litellm.succes_query")
|
||||
for callback in litellm.success_callback:
|
||||
try:
|
||||
if callback == "posthog":
|
||||
|
@ -905,9 +958,8 @@ def handle_success(args, kwargs, result, start_time, end_time):
|
|||
ph_obj[detail] = additional_details[detail]
|
||||
event_name = additional_details["Event_Name"]
|
||||
if "user_id" in additional_details:
|
||||
posthog.capture(
|
||||
additional_details["user_id"], event_name, ph_obj
|
||||
)
|
||||
posthog.capture(additional_details["user_id"],
|
||||
event_name, ph_obj)
|
||||
else: # PostHog calls require a unique id to identify a user - https://posthog.com/docs/libraries/python
|
||||
unique_id = str(uuid.uuid4())
|
||||
posthog.capture(unique_id, event_name, ph_obj)
|
||||
|
@ -916,9 +968,8 @@ def handle_success(args, kwargs, result, start_time, end_time):
|
|||
slack_msg = ""
|
||||
for detail in additional_details:
|
||||
slack_msg += f"{detail}: {additional_details[detail]}\n"
|
||||
slack_app.client.chat_postMessage(
|
||||
channel=alerts_channel, text=slack_msg
|
||||
)
|
||||
slack_app.client.chat_postMessage(channel=alerts_channel,
|
||||
text=slack_msg)
|
||||
elif callback == "helicone":
|
||||
print_verbose("reaches helicone for logging!")
|
||||
model = args[0] if len(args) > 0 else kwargs["model"]
|
||||
|
@ -931,6 +982,22 @@ def handle_success(args, kwargs, result, start_time, end_time):
|
|||
end_time=end_time,
|
||||
print_verbose=print_verbose,
|
||||
)
|
||||
elif callback == "llmonitor":
|
||||
print_verbose("reaches llmonitor for logging!")
|
||||
model = args[0] if len(args) > 0 else kwargs["model"]
|
||||
messages = args[1] if len(args) > 1 else kwargs["messages"]
|
||||
usage = kwargs["usage"]
|
||||
llmonitorLogger.log_event(
|
||||
type="end",
|
||||
model=model,
|
||||
messages=messages,
|
||||
user_id=litellm._thread_context.user,
|
||||
response_obj=result,
|
||||
time=end_time,
|
||||
usage=usage,
|
||||
run_id=kwargs["litellm_call_id"],
|
||||
print_verbose=print_verbose,
|
||||
)
|
||||
elif callback == "aispend":
|
||||
print_verbose("reaches aispend for logging!")
|
||||
model = args[0] if len(args) > 0 else kwargs["model"]
|
||||
|
@ -984,7 +1051,7 @@ def handle_success(args, kwargs, result, start_time, end_time):
|
|||
print_verbose=print_verbose,
|
||||
)
|
||||
except Exception as e:
|
||||
## LOGGING
|
||||
# LOGGING
|
||||
exception_logging(logger_fn=user_logger_fn, exception=e)
|
||||
print_verbose(
|
||||
f"[Non-Blocking] Success Callback Error - {traceback.format_exc()}"
|
||||
|
@ -995,7 +1062,7 @@ def handle_success(args, kwargs, result, start_time, end_time):
|
|||
success_handler(args, kwargs)
|
||||
pass
|
||||
except Exception as e:
|
||||
## LOGGING
|
||||
# LOGGING
|
||||
exception_logging(logger_fn=user_logger_fn, exception=e)
|
||||
print_verbose(
|
||||
f"[Non-Blocking] Success Callback Error - {traceback.format_exc()}"
|
||||
|
@ -1046,33 +1113,36 @@ def exception_type(model, original_exception, custom_llm_provider):
|
|||
exception_type = ""
|
||||
if "claude" in model: # one of the anthropics
|
||||
if hasattr(original_exception, "status_code"):
|
||||
print_verbose(f"status_code: {original_exception.status_code}")
|
||||
print_verbose(
|
||||
f"status_code: {original_exception.status_code}")
|
||||
if original_exception.status_code == 401:
|
||||
exception_mapping_worked = True
|
||||
raise AuthenticationError(
|
||||
message=f"AnthropicException - {original_exception.message}",
|
||||
message=
|
||||
f"AnthropicException - {original_exception.message}",
|
||||
llm_provider="anthropic",
|
||||
)
|
||||
elif original_exception.status_code == 400:
|
||||
exception_mapping_worked = True
|
||||
raise InvalidRequestError(
|
||||
message=f"AnthropicException - {original_exception.message}",
|
||||
message=
|
||||
f"AnthropicException - {original_exception.message}",
|
||||
model=model,
|
||||
llm_provider="anthropic",
|
||||
)
|
||||
elif original_exception.status_code == 429:
|
||||
exception_mapping_worked = True
|
||||
raise RateLimitError(
|
||||
message=f"AnthropicException - {original_exception.message}",
|
||||
message=
|
||||
f"AnthropicException - {original_exception.message}",
|
||||
llm_provider="anthropic",
|
||||
)
|
||||
elif (
|
||||
"Could not resolve authentication method. Expected either api_key or auth_token to be set."
|
||||
in error_str
|
||||
):
|
||||
elif ("Could not resolve authentication method. Expected either api_key or auth_token to be set."
|
||||
in error_str):
|
||||
exception_mapping_worked = True
|
||||
raise AuthenticationError(
|
||||
message=f"AnthropicException - {original_exception.message}",
|
||||
message=
|
||||
f"AnthropicException - {original_exception.message}",
|
||||
llm_provider="anthropic",
|
||||
)
|
||||
elif "replicate" in model:
|
||||
|
@ -1096,35 +1166,36 @@ def exception_type(model, original_exception, custom_llm_provider):
|
|||
llm_provider="replicate",
|
||||
)
|
||||
elif (
|
||||
exception_type == "ReplicateError"
|
||||
): ## ReplicateError implies an error on Replicate server side, not user side
|
||||
exception_type == "ReplicateError"
|
||||
): # ReplicateError implies an error on Replicate server side, not user side
|
||||
raise ServiceUnavailableError(
|
||||
message=f"ReplicateException - {error_str}",
|
||||
llm_provider="replicate",
|
||||
)
|
||||
elif model == "command-nightly": # Cohere
|
||||
if (
|
||||
"invalid api token" in error_str
|
||||
or "No API key provided." in error_str
|
||||
):
|
||||
if ("invalid api token" in error_str
|
||||
or "No API key provided." in error_str):
|
||||
exception_mapping_worked = True
|
||||
raise AuthenticationError(
|
||||
message=f"CohereException - {original_exception.message}",
|
||||
message=
|
||||
f"CohereException - {original_exception.message}",
|
||||
llm_provider="cohere",
|
||||
)
|
||||
elif "too many tokens" in error_str:
|
||||
exception_mapping_worked = True
|
||||
raise InvalidRequestError(
|
||||
message=f"CohereException - {original_exception.message}",
|
||||
message=
|
||||
f"CohereException - {original_exception.message}",
|
||||
model=model,
|
||||
llm_provider="cohere",
|
||||
)
|
||||
elif (
|
||||
"CohereConnectionError" in exception_type
|
||||
"CohereConnectionError" in exception_type
|
||||
): # cohere seems to fire these errors when we load test it (1k+ messages / min)
|
||||
exception_mapping_worked = True
|
||||
raise RateLimitError(
|
||||
message=f"CohereException - {original_exception.message}",
|
||||
message=
|
||||
f"CohereException - {original_exception.message}",
|
||||
llm_provider="cohere",
|
||||
)
|
||||
elif custom_llm_provider == "huggingface":
|
||||
|
@ -1132,27 +1203,30 @@ def exception_type(model, original_exception, custom_llm_provider):
|
|||
if original_exception.status_code == 401:
|
||||
exception_mapping_worked = True
|
||||
raise AuthenticationError(
|
||||
message=f"HuggingfaceException - {original_exception.message}",
|
||||
message=
|
||||
f"HuggingfaceException - {original_exception.message}",
|
||||
llm_provider="huggingface",
|
||||
)
|
||||
elif original_exception.status_code == 400:
|
||||
exception_mapping_worked = True
|
||||
raise InvalidRequestError(
|
||||
message=f"HuggingfaceException - {original_exception.message}",
|
||||
message=
|
||||
f"HuggingfaceException - {original_exception.message}",
|
||||
model=model,
|
||||
llm_provider="huggingface",
|
||||
)
|
||||
elif original_exception.status_code == 429:
|
||||
exception_mapping_worked = True
|
||||
raise RateLimitError(
|
||||
message=f"HuggingfaceException - {original_exception.message}",
|
||||
message=
|
||||
f"HuggingfaceException - {original_exception.message}",
|
||||
llm_provider="huggingface",
|
||||
)
|
||||
raise original_exception # base case - return the original exception
|
||||
else:
|
||||
raise original_exception
|
||||
except Exception as e:
|
||||
## LOGGING
|
||||
# LOGGING
|
||||
exception_logging(
|
||||
logger_fn=user_logger_fn,
|
||||
additional_args={
|
||||
|
@ -1173,7 +1247,7 @@ def safe_crash_reporting(model=None, exception=None, custom_llm_provider=None):
|
|||
"exception": str(exception),
|
||||
"custom_llm_provider": custom_llm_provider,
|
||||
}
|
||||
threading.Thread(target=litellm_telemetry, args=(data,)).start()
|
||||
threading.Thread(target=litellm_telemetry, args=(data, )).start()
|
||||
|
||||
|
||||
def litellm_telemetry(data):
|
||||
|
@ -1223,11 +1297,13 @@ def get_secret(secret_name):
|
|||
if litellm.secret_manager_client != None:
|
||||
# TODO: check which secret manager is being used
|
||||
# currently only supports Infisical
|
||||
secret = litellm.secret_manager_client.get_secret(secret_name).secret_value
|
||||
secret = litellm.secret_manager_client.get_secret(
|
||||
secret_name).secret_value
|
||||
if secret != None:
|
||||
return secret # if secret found in secret manager return it
|
||||
else:
|
||||
raise ValueError(f"Secret '{secret_name}' not found in secret manager")
|
||||
raise ValueError(
|
||||
f"Secret '{secret_name}' not found in secret manager")
|
||||
elif litellm.api_key != None: # if users use litellm default key
|
||||
return litellm.api_key
|
||||
else:
|
||||
|
@ -1238,6 +1314,7 @@ def get_secret(secret_name):
|
|||
# wraps the completion stream to return the correct format for the model
|
||||
# replicate/anthropic/cohere
|
||||
class CustomStreamWrapper:
|
||||
|
||||
def __init__(self, completion_stream, model, custom_llm_provider=None):
|
||||
self.model = model
|
||||
self.custom_llm_provider = custom_llm_provider
|
||||
|
@ -1288,7 +1365,8 @@ class CustomStreamWrapper:
|
|||
elif self.model == "replicate":
|
||||
chunk = next(self.completion_stream)
|
||||
completion_obj["content"] = chunk
|
||||
elif (self.model == "together_ai") or ("togethercomputer" in self.model):
|
||||
elif (self.model == "together_ai") or ("togethercomputer"
|
||||
in self.model):
|
||||
chunk = next(self.completion_stream)
|
||||
text_data = self.handle_together_ai_chunk(chunk)
|
||||
if text_data == "":
|
||||
|
@ -1321,12 +1399,11 @@ def read_config_args(config_path):
|
|||
|
||||
|
||||
########## ollama implementation ############################
|
||||
import aiohttp
|
||||
|
||||
|
||||
async def get_ollama_response_stream(
|
||||
api_base="http://localhost:11434", model="llama2", prompt="Why is the sky blue?"
|
||||
):
|
||||
async def get_ollama_response_stream(api_base="http://localhost:11434",
|
||||
model="llama2",
|
||||
prompt="Why is the sky blue?"):
|
||||
session = aiohttp.ClientSession()
|
||||
url = f"{api_base}/api/generate"
|
||||
data = {
|
||||
|
@ -1349,7 +1426,11 @@ async def get_ollama_response_stream(
|
|||
"content": "",
|
||||
}
|
||||
completion_obj["content"] = j["response"]
|
||||
yield {"choices": [{"delta": completion_obj}]}
|
||||
yield {
|
||||
"choices": [{
|
||||
"delta": completion_obj
|
||||
}]
|
||||
}
|
||||
# self.responses.append(j["response"])
|
||||
# yield "blank"
|
||||
except Exception as e:
|
||||
|
|
|
@ -1,6 +1,7 @@
|
|||
|
||||
# liteLLM Proxy Server: 50+ LLM Models, Error Handling, Caching
|
||||
|
||||
### Azure, Llama2, OpenAI, Claude, Hugging Face, Replicate Models
|
||||
|
||||
[](https://pypi.org/project/litellm/)
|
||||
[](https://pypi.org/project/litellm/0.1.1/)
|
||||

|
||||
|
@ -11,34 +12,36 @@
|
|||

|
||||
|
||||
## What does liteLLM proxy do
|
||||
|
||||
- Make `/chat/completions` requests for 50+ LLM models **Azure, OpenAI, Replicate, Anthropic, Hugging Face**
|
||||
|
||||
|
||||
Example: for `model` use `claude-2`, `gpt-3.5`, `gpt-4`, `command-nightly`, `stabilityai/stablecode-completion-alpha-3b-4k`
|
||||
|
||||
```json
{
  "model": "replicate/llama-2-70b-chat:2c1608e18606fad2812020dc541930f2d0495ce32eee50074220b87300bc16e1",
  "messages": [
    {
      "content": "Hello, whats the weather in San Francisco??",
      "role": "user"
    }
  ]
}
```
|
||||
- **Consistent Input/Output** Format
|
||||
- Call all models using the OpenAI format - `completion(model, messages)`
|
||||
- Text responses will always be available at `['choices'][0]['message']['content']`
|
||||
- **Error Handling** Using Model Fallbacks (if `GPT-4` fails, try `llama2`)
|
||||
- **Logging** - Log Requests, Responses and Errors to `Supabase`, `Posthog`, `Mixpanel`, `Sentry`, `Helicone` (Any of the supported providers here: https://litellm.readthedocs.io/en/latest/advanced/
|
||||
|
||||
**Example: Logs sent to Supabase**
|
||||
- **Consistent Input/Output** Format
|
||||
- Call all models using the OpenAI format - `completion(model, messages)`
|
||||
- Text responses will always be available at `['choices'][0]['message']['content']`
|
||||
- **Error Handling** Using Model Fallbacks (if `GPT-4` fails, try `llama2`)
|
||||
- **Logging** - Log Requests, Responses and Errors to `Supabase`, `Posthog`, `Mixpanel`, `Sentry`, `Helicone`, `LLMonitor` (Any of the supported providers here: https://litellm.readthedocs.io/en/latest/advanced/)
|
||||
|
||||
**Example: Logs sent to Supabase**
|
||||
<img width="1015" alt="Screenshot 2023-08-11 at 4 02 46 PM" src="https://github.com/ishaan-jaff/proxy-server/assets/29436595/237557b8-ba09-4917-982c-8f3e1b2c8d08">
|
||||
|
||||
- **Token Usage & Spend** - Track Input + Completion tokens used + Spend/model
|
||||
- **Caching** - Implementation of Semantic Caching
|
||||
- **Streaming & Async Support** - Return generators to stream text responses
|
||||
|
||||
|
||||
## API Endpoints
|
||||
|
||||
### `/chat/completions` (POST)
|
||||
|
@ -46,34 +49,37 @@
|
|||
This endpoint is used to generate chat completions for 50+ supported LLM API models. Use llama2, GPT-4, Claude2, etc.
|
||||
|
||||
#### Input
|
||||
|
||||
This API endpoint accepts all inputs in raw JSON and expects the following inputs
|
||||
- `model` (string, required): ID of the model to use for chat completions. See all supported models [here](https://litellm.readthedocs.io/en/latest/supported/):
|
||||
eg `gpt-3.5-turbo`, `gpt-4`, `claude-2`, `command-nightly`, `stabilityai/stablecode-completion-alpha-3b-4k`
|
||||
|
||||
- `model` (string, required): ID of the model to use for chat completions. See all supported models [here](https://litellm.readthedocs.io/en/latest/supported/):
|
||||
eg `gpt-3.5-turbo`, `gpt-4`, `claude-2`, `command-nightly`, `stabilityai/stablecode-completion-alpha-3b-4k`
|
||||
- `messages` (array, required): A list of messages representing the conversation context. Each message should have a `role` (system, user, assistant, or function), `content` (message text), and `name` (for function role).
|
||||
- Additional Optional parameters: `temperature`, `functions`, `function_call`, `top_p`, `n`, `stream`. See the full list of supported inputs here: https://litellm.readthedocs.io/en/latest/input/
|
||||
|
||||
|
||||
#### Example JSON body
|
||||
|
||||
For claude-2
|
||||
|
||||
```json
{
  "model": "claude-2",
  "messages": [
    {
      "content": "Hello, whats the weather in San Francisco??",
      "role": "user"
    }
  ]
}
```
|
||||
|
||||
### Making an API request to the Proxy Server
|
||||
|
||||
```python
|
||||
import requests
|
||||
import json
|
||||
|
||||
# TODO: use your URL
|
||||
# TODO: use your URL
|
||||
url = "http://localhost:5000/chat/completions"
|
||||
|
||||
payload = json.dumps({
|
||||
|
@ -94,34 +100,38 @@ print(response.text)
|
|||
```
|
||||
|
||||
### Output [Response Format]
|
||||
Responses from the server are given in the following format.
|
||||
|
||||
Responses from the server are given in the following format.
|
||||
All responses from the server are returned in the following format (for all LLM models). More info on output here: https://litellm.readthedocs.io/en/latest/output/
|
||||
|
||||
```json
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "I'm sorry, but I don't have the capability to provide real-time weather information. However, you can easily check the weather in San Francisco by searching online or using a weather app on your phone.",
        "role": "assistant"
      }
    }
  ],
  "created": 1691790381,
  "id": "chatcmpl-7mUFZlOEgdohHRDx2UpYPRTejirzb",
  "model": "gpt-3.5-turbo-0613",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 41,
    "prompt_tokens": 16,
    "total_tokens": 57
  }
}
```
|
||||
|
||||
## Installation & Usage
|
||||
|
||||
### Running Locally
|
||||
|
||||
1. Clone liteLLM repository to your local machine:
|
||||
```
|
||||
git clone https://github.com/BerriAI/liteLLM-proxy
|
||||
|
@ -141,24 +151,24 @@ All responses from the server are returned in the following format (for all LLM
|
|||
python main.py
|
||||
```
|
||||
|
||||
|
||||
|
||||
## Deploying
|
||||
|
||||
1. Quick Start: Deploy on Railway
|
||||
|
||||
[](https://railway.app/template/DYqQAW?referralCode=t3ukrU)
|
||||
|
||||
2. `GCP`, `AWS`, `Azure`
|
||||
This project includes a `Dockerfile` allowing you to build and deploy a Docker Project on your providers
|
||||
|
||||
2. `GCP`, `AWS`, `Azure`
|
||||
This project includes a `Dockerfile` allowing you to build and deploy a Docker Project on your providers
|
||||
|
||||
# Support / Talk with founders
|
||||
|
||||
- [Our calendar 👋](https://calendly.com/d/4mp-gd3-k5k/berriai-1-1-onboarding-litellm-hosted-version)
|
||||
- [Community Discord 💭](https://discord.gg/wuPM9dRgDw)
|
||||
- Our numbers 📞 +1 (770) 8783-106 / +1 (412) 618-6238
|
||||
- Our emails ✉️ ishaan@berri.ai / krrish@berri.ai
|
||||
|
||||
|
||||
## Roadmap
|
||||
|
||||
- [ ] Support hosted db (e.g. Supabase)
|
||||
- [ ] Easily send data to places like posthog and sentry.
|
||||
- [ ] Add a hot-cache for project spend logs - enables fast checks for user + project limits
|
||||
|
|