forked from phoenix/litellm-mirror
Merge branch 'main' into main
This commit is contained in:
commit
d770df2259
198 changed files with 10972 additions and 7448 deletions
3
.github/workflows/interpret_load_test.py
vendored
|
@ -77,6 +77,9 @@ if __name__ == "__main__":
|
|||
new_release_body = (
|
||||
existing_release_body
|
||||
+ "\n\n"
|
||||
+ "### Don't want to maintain your internal proxy? get in touch 🎉"
|
||||
+ "\nHosted Proxy Alpha: https://calendly.com/d/4mp-gd3-k5k/litellm-1-1-onboarding-chat"
|
||||
+ "\n\n"
|
||||
+ "## Load Test LiteLLM Proxy Results"
|
||||
+ "\n\n"
|
||||
+ markdown_table
|
||||
|
|
2
.gitignore
vendored
|
@ -50,3 +50,5 @@ kub.yaml
|
|||
loadtest_kub.yaml
|
||||
litellm/proxy/_new_secret_config.yaml
|
||||
litellm/proxy/_new_secret_config.yaml
|
||||
litellm/proxy/_super_secret_config.yaml
|
||||
litellm/proxy/_super_secret_config.yaml
|
||||
|
|
|
@ -5,7 +5,7 @@
|
|||
<p align="center">Call all LLM APIs using the OpenAI format [Bedrock, Huggingface, VertexAI, TogetherAI, Azure, OpenAI, etc.]
|
||||
<br>
|
||||
</p>
|
||||
<h4 align="center"><a href="https://docs.litellm.ai/docs/simple_proxy" target="_blank">OpenAI Proxy Server</a> | <a href="https://docs.litellm.ai/docs/enterprise" target="_blank">Enterprise Tier</a></h4>
<h4 align="center"><a href="https://docs.litellm.ai/docs/simple_proxy" target="_blank">OpenAI Proxy Server</a> | <a href="https://docs.litellm.ai/docs/hosted" target="_blank">Hosted Proxy (Preview)</a> | <a href="https://docs.litellm.ai/docs/enterprise" target="_blank">Enterprise Tier</a></h4>
|
||||
<h4 align="center">
|
||||
<a href="https://pypi.org/project/litellm/" target="_blank">
|
||||
<img src="https://img.shields.io/pypi/v/litellm.svg" alt="PyPI Version">
|
||||
|
@ -128,7 +128,9 @@ response = completion(model="gpt-3.5-turbo", messages=[{"role": "user", "content
|
|||
|
||||
# OpenAI Proxy - ([Docs](https://docs.litellm.ai/docs/simple_proxy))
|
||||
|
||||
Set Budgets & Rate limits across multiple projects
|
||||
Track spend + Load Balance across multiple projects
|
||||
|
||||
[Hosted Proxy (Preview)](https://docs.litellm.ai/docs/hosted)
|
||||
|
||||
The proxy provides:
|
||||
|
||||
|
@ -225,6 +227,7 @@ curl 'http://0.0.0.0:4000/key/generate' \
|
|||
| [perplexity-ai](https://docs.litellm.ai/docs/providers/perplexity) | ✅ | ✅ | ✅ | ✅ |
|
||||
| [Groq AI](https://docs.litellm.ai/docs/providers/groq) | ✅ | ✅ | ✅ | ✅ |
|
||||
| [anyscale](https://docs.litellm.ai/docs/providers/anyscale) | ✅ | ✅ | ✅ | ✅ |
|
||||
| [IBM - watsonx.ai](https://docs.litellm.ai/docs/providers/watsonx) | ✅ | ✅ | ✅ | ✅ | ✅ |
|
||||
| [voyage ai](https://docs.litellm.ai/docs/providers/voyage) | | | | | ✅ |
|
||||
| [xinference [Xorbits Inference]](https://docs.litellm.ai/docs/providers/xinference) | | | | | ✅ |
|
||||
|
||||
|
|
300
cookbook/liteLLM_IBM_Watsonx.ipynb
vendored
Normal file
File diff suppressed because one or more lines are too long
|
@ -23,6 +23,14 @@ response = completion(model="gpt-3.5-turbo", messages=messages)
|
|||
response = completion("command-nightly", messages)
|
||||
```
|
||||
|
||||
## JSON Logs
|
||||
|
||||
If you need to store the logs as JSON, just set `litellm.json_logs = True`.
|
||||
|
||||
We currently just log the raw POST request from litellm as JSON - [**See Code**].
|
||||
|
||||
[Share feedback here](https://github.com/BerriAI/litellm/issues)
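For illustration, here's a minimal sketch of enabling JSON logs (the model name and message are placeholders):

```python
import litellm
from litellm import completion

# Emit LiteLLM's raw request logs as JSON
litellm.json_logs = True

response = completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}],
)
```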
|
||||
|
||||
## Logger Function
|
||||
But sometimes all you care about is seeing exactly what's being sent in your API call and what's being returned - e.g. if the API call is failing, why is that happening? What are the exact params being set?
|
||||
|
||||
|
|
|
@ -8,12 +8,13 @@ For companies that need SSO, user management and professional support for LiteLL
|
|||
:::
|
||||
|
||||
This covers:
|
||||
- ✅ **Features under the [LiteLLM Commercial License](https://docs.litellm.ai/docs/proxy/enterprise):**
|
||||
- ✅ **Features under the [LiteLLM Commercial License (Content Mod, Custom Tags, etc.)](https://docs.litellm.ai/docs/proxy/enterprise)**
|
||||
- ✅ **Feature Prioritization**
|
||||
- ✅ **Custom Integrations**
|
||||
- ✅ **Professional Support - Dedicated discord + slack**
|
||||
- ✅ **Custom SLAs**
|
||||
- ✅ **Secure access with Single Sign-On**
|
||||
- ✅ [**Secure UI access with Single Sign-On**](../docs/proxy/ui.md#setup-ssoauth-for-ui)
|
||||
- ✅ [**JWT-Auth**](../docs/proxy/token_auth.md)
|
||||
|
||||
|
||||
## Frequently Asked Questions
|
||||
|
|
49
docs/my-website/docs/hosted.md
Normal file
|
@ -0,0 +1,49 @@
|
|||
import Image from '@theme/IdealImage';
|
||||
|
||||
# Hosted LiteLLM Proxy
|
||||
|
||||
LiteLLM maintains the proxy, so you can focus on your core products.
|
||||
|
||||
## [**Get Onboarded**](https://calendly.com/d/4mp-gd3-k5k/litellm-1-1-onboarding-chat)
|
||||
|
||||
This is in alpha. Schedule a call with us, and we'll give you a hosted proxy within 30 minutes.
|
||||
|
||||
[**🚨 Schedule Call**](https://calendly.com/d/4mp-gd3-k5k/litellm-1-1-onboarding-chat)
|
||||
|
||||
### **Status**: Alpha
|
||||
|
||||
Our proxy is already used in production by customers.
|
||||
|
||||
See our status page for [**live reliability**](https://status.litellm.ai/)
|
||||
|
||||
### **Benefits**
|
||||
- **No Maintenance, No Infra**: We'll maintain the proxy, and spin up any additional infrastructure (e.g.: separate server for spend logs) to make sure you can load balance + track spend across multiple LLM projects.
|
||||
- **Reliable**: Our hosted proxy is tested on 1k requests per second, making it reliable for high load.
|
||||
- **Secure**: LiteLLM is currently undergoing SOC-2 compliance, to make sure your data is as secure as possible.
|
||||
|
||||
### Pricing
|
||||
|
||||
Pricing is based on usage. We can figure out a price that works for your team, on the call.
|
||||
|
||||
[**🚨 Schedule Call**](https://calendly.com/d/4mp-gd3-k5k/litellm-1-1-onboarding-chat)
|
||||
|
||||
## **Screenshots**
|
||||
|
||||
### 1. Create keys
|
||||
|
||||
<Image img={require('../img/litellm_hosted_ui_create_key.png')} />
|
||||
|
||||
### 2. Add Models
|
||||
|
||||
<Image img={require('../img/litellm_hosted_ui_add_models.png')}/>
|
||||
|
||||
### 3. Track spend
|
||||
|
||||
<Image img={require('../img/litellm_hosted_usage_dashboard.png')} />
|
||||
|
||||
|
||||
### 4. Configure load balancing
|
||||
|
||||
<Image img={require('../img/litellm_hosted_ui_router.png')} />
|
||||
|
||||
#### [**🚨 Schedule Call**](https://calendly.com/d/4mp-gd3-k5k/litellm-1-1-onboarding-chat)
|
|
@ -213,3 +213,349 @@ asyncio.run(loadtest_fn())
|
|||
|
||||
```
|
||||
|
||||
## Multi-Instance TPM/RPM Load Test (Router)
|
||||
|
||||
Test if your defined tpm/rpm limits are respected across multiple instances of the Router object.
|
||||
|
||||
In our test:
|
||||
- Max RPM per deployment = 100 requests per minute
|
||||
- Max Throughput / min on router = 200 requests per minute (2 deployments)
|
||||
- Load we'll send through router = 600 requests per minute
|
||||
|
||||
:::info
|
||||
|
||||
If you don't want to call a real LLM API endpoint, you can set up a fake OpenAI server. [See code](#extra---setup-fake-openai-server)
|
||||
|
||||
:::
|
||||
|
||||
### Code
|
||||
|
||||
Let's hit the router with 600 requests per minute.
|
||||
|
||||
Copy this script 👇. Save it as `test_loadtest_router.py` AND run it with `python3 test_loadtest_router.py`
|
||||
|
||||
|
||||
```python
|
||||
from litellm import Router
|
||||
import litellm
|
||||
litellm.suppress_debug_info = True
|
||||
litellm.set_verbose = False
|
||||
import logging
|
||||
logging.basicConfig(level=logging.CRITICAL)
|
||||
import os, random, uuid, time, asyncio
|
||||
|
||||
# Model list for OpenAI and Anthropic models
|
||||
model_list = [
|
||||
{
|
||||
"model_name": "fake-openai-endpoint",
|
||||
"litellm_params": {
|
||||
"model": "gpt-3.5-turbo",
|
||||
"api_key": "my-fake-key",
|
||||
"api_base": "http://0.0.0.0:8080",
|
||||
"rpm": 100
|
||||
},
|
||||
},
|
||||
{
|
||||
"model_name": "fake-openai-endpoint",
|
||||
"litellm_params": {
|
||||
"model": "gpt-3.5-turbo",
|
||||
"api_key": "my-fake-key",
|
||||
"api_base": "http://0.0.0.0:8081",
|
||||
"rpm": 100
|
||||
},
|
||||
},
|
||||
]
|
||||
|
||||
router_1 = Router(model_list=model_list, num_retries=0, enable_pre_call_checks=True, routing_strategy="usage-based-routing-v2", redis_host=os.getenv("REDIS_HOST"), redis_port=os.getenv("REDIS_PORT"), redis_password=os.getenv("REDIS_PASSWORD"))
|
||||
router_2 = Router(model_list=model_list, num_retries=0, routing_strategy="usage-based-routing-v2", enable_pre_call_checks=True, redis_host=os.getenv("REDIS_HOST"), redis_port=os.getenv("REDIS_PORT"), redis_password=os.getenv("REDIS_PASSWORD"))
|
||||
|
||||
|
||||
|
||||
async def router_completion_non_streaming():
|
||||
try:
|
||||
client: Router = random.sample([router_1, router_2], 1)[0] # randomly pick b/w clients
|
||||
# print(f"client={client}")
|
||||
response = await client.acompletion(
|
||||
model="fake-openai-endpoint", # [CHANGE THIS] (if you call it something else on your proxy)
|
||||
messages=[{"role": "user", "content": f"This is a test: {uuid.uuid4()}"}],
|
||||
)
|
||||
return response
|
||||
except Exception as e:
|
||||
# print(e)
|
||||
return None
|
||||
|
||||
async def loadtest_fn():
|
||||
start = time.time()
|
||||
n = 600 # Number of concurrent tasks
|
||||
tasks = [router_completion_non_streaming() for _ in range(n)]
|
||||
chat_completions = await asyncio.gather(*tasks)
|
||||
successful_completions = [c for c in chat_completions if c is not None]
|
||||
print(n, time.time() - start, len(successful_completions))
|
||||
|
||||
def get_utc_datetime():
|
||||
import datetime as dt
|
||||
from datetime import datetime
|
||||
|
||||
if hasattr(dt, "UTC"):
|
||||
return datetime.now(dt.UTC) # type: ignore
|
||||
else:
|
||||
return datetime.utcnow() # type: ignore
|
||||
|
||||
|
||||
# Run the event loop to execute the async function
|
||||
async def parent_fn():
|
||||
for _ in range(10):
|
||||
dt = get_utc_datetime()
|
||||
current_minute = dt.strftime("%H-%M")
|
||||
print(f"triggered new batch - {current_minute}")
|
||||
await loadtest_fn()
|
||||
await asyncio.sleep(10)
|
||||
|
||||
asyncio.run(parent_fn())
|
||||
```
|
||||
## Multi-Instance TPM/RPM Load Test (Proxy)
|
||||
|
||||
Test if your defined tpm/rpm limits are respected across multiple instances.
|
||||
|
||||
The quickest way to do this is by testing the [proxy](./proxy/quick_start.md). The proxy uses the [router](./routing.md) under the hood, so if you're using either of them, this test should work for you.
|
||||
|
||||
In our test:
|
||||
- Max RPM per deployment = 100 requests per minute
|
||||
- Max Throughput / min on proxy = 200 requests per minute (2 deployments)
|
||||
- Load we'll send to proxy = 600 requests per minute
|
||||
|
||||
|
||||
So we'll send 600 requests per minute, but expect only 200 requests per minute to succeed.
|
||||
|
||||
:::info
|
||||
|
||||
If you don't want to call a real LLM API endpoint, you can set up a fake OpenAI server. [See code](#extra---setup-fake-openai-server)
|
||||
|
||||
:::
|
||||
|
||||
### 1. Setup config
|
||||
|
||||
```yaml
|
||||
model_list:
|
||||
- litellm_params:
|
||||
api_base: http://0.0.0.0:8080
|
||||
api_key: my-fake-key
|
||||
model: openai/my-fake-model
|
||||
rpm: 100
|
||||
model_name: fake-openai-endpoint
|
||||
- litellm_params:
|
||||
api_base: http://0.0.0.0:8081
|
||||
api_key: my-fake-key
|
||||
model: openai/my-fake-model-2
|
||||
rpm: 100
|
||||
model_name: fake-openai-endpoint
|
||||
router_settings:
|
||||
num_retries: 0
|
||||
enable_pre_call_checks: true
|
||||
redis_host: os.environ/REDIS_HOST ## 👈 IMPORTANT! Setup the proxy w/ redis
|
||||
redis_password: os.environ/REDIS_PASSWORD
|
||||
redis_port: os.environ/REDIS_PORT
|
||||
routing_strategy: usage-based-routing-v2
|
||||
```
|
||||
|
||||
### 2. Start proxy 2 instances
|
||||
|
||||
**Instance 1**
|
||||
```bash
|
||||
litellm --config /path/to/config.yaml --port 4000
|
||||
|
||||
## RUNNING on http://0.0.0.0:4000
|
||||
```
|
||||
|
||||
**Instance 2**
|
||||
```bash
|
||||
litellm --config /path/to/config.yaml --port 4001
|
||||
|
||||
## RUNNING on http://0.0.0.0:4001
|
||||
```
|
||||
|
||||
### 3. Run Test
|
||||
|
||||
Let's hit the proxy with 600 requests per minute.
|
||||
|
||||
Copy this script 👇. Save it as `test_loadtest_proxy.py` AND run it with `python3 test_loadtest_proxy.py`
|
||||
|
||||
```python
|
||||
from openai import AsyncOpenAI, AsyncAzureOpenAI
|
||||
import random, uuid
|
||||
import time, asyncio, litellm
|
||||
# import logging
|
||||
# logging.basicConfig(level=logging.DEBUG)
|
||||
#### LITELLM PROXY ####
|
||||
litellm_client = AsyncOpenAI(
|
||||
api_key="sk-1234", # [CHANGE THIS]
|
||||
base_url="http://0.0.0.0:4000"
|
||||
)
|
||||
litellm_client_2 = AsyncOpenAI(
|
||||
api_key="sk-1234", # [CHANGE THIS]
|
||||
base_url="http://0.0.0.0:4001"
|
||||
)
|
||||
|
||||
async def proxy_completion_non_streaming():
|
||||
try:
|
||||
client = random.sample([litellm_client, litellm_client_2], 1)[0] # randomly pick b/w clients
|
||||
# print(f"client={client}")
|
||||
response = await client.chat.completions.create(
|
||||
model="fake-openai-endpoint", # [CHANGE THIS] (if you call it something else on your proxy)
|
||||
messages=[{"role": "user", "content": f"This is a test: {uuid.uuid4()}"}],
|
||||
)
|
||||
return response
|
||||
except Exception as e:
|
||||
# print(e)
|
||||
return None
|
||||
|
||||
async def loadtest_fn():
|
||||
start = time.time()
|
||||
n = 600 # Number of concurrent tasks
|
||||
tasks = [proxy_completion_non_streaming() for _ in range(n)]
|
||||
chat_completions = await asyncio.gather(*tasks)
|
||||
successful_completions = [c for c in chat_completions if c is not None]
|
||||
print(n, time.time() - start, len(successful_completions))
|
||||
|
||||
def get_utc_datetime():
|
||||
import datetime as dt
|
||||
from datetime import datetime
|
||||
|
||||
if hasattr(dt, "UTC"):
|
||||
return datetime.now(dt.UTC) # type: ignore
|
||||
else:
|
||||
return datetime.utcnow() # type: ignore
|
||||
|
||||
|
||||
# Run the event loop to execute the async function
|
||||
async def parent_fn():
|
||||
for _ in range(10):
|
||||
dt = get_utc_datetime()
|
||||
current_minute = dt.strftime("%H-%M")
|
||||
print(f"triggered new batch - {current_minute}")
|
||||
await loadtest_fn()
|
||||
await asyncio.sleep(10)
|
||||
|
||||
asyncio.run(parent_fn())
|
||||
|
||||
```
|
||||
|
||||
|
||||
### Extra - Setup Fake OpenAI Server
|
||||
|
||||
Let's set up a fake OpenAI server with an RPM limit of 100.
|
||||
|
||||
Let's call our file `fake_openai_server.py`.
|
||||
|
||||
```python
|
||||
# import sys, os
|
||||
# sys.path.insert(
|
||||
# 0, os.path.abspath("../")
|
||||
# ) # Adds the parent directory to the system path
|
||||
from fastapi import FastAPI, Request, status, HTTPException, Depends
|
||||
from fastapi.responses import StreamingResponse
|
||||
from fastapi.security import OAuth2PasswordBearer
|
||||
from fastapi.middleware.cors import CORSMiddleware
|
||||
from fastapi.responses import JSONResponse
|
||||
from fastapi import FastAPI, Request, HTTPException, UploadFile, File
|
||||
import httpx, os, json
|
||||
from openai import AsyncOpenAI
|
||||
from typing import Optional
|
||||
from slowapi import Limiter
|
||||
from slowapi.util import get_remote_address
|
||||
from slowapi.errors import RateLimitExceeded
|
||||
from fastapi import FastAPI, Request, HTTPException
|
||||
from fastapi.responses import PlainTextResponse
|
||||
|
||||
|
||||
class ProxyException(Exception):
|
||||
# NOTE: DO NOT MODIFY THIS
|
||||
# This is used to map exactly to OPENAI Exceptions
|
||||
def __init__(
|
||||
self,
|
||||
message: str,
|
||||
type: str,
|
||||
param: Optional[str],
|
||||
code: Optional[int],
|
||||
):
|
||||
self.message = message
|
||||
self.type = type
|
||||
self.param = param
|
||||
self.code = code
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
"""Converts the ProxyException instance to a dictionary."""
|
||||
return {
|
||||
"message": self.message,
|
||||
"type": self.type,
|
||||
"param": self.param,
|
||||
"code": self.code,
|
||||
}
|
||||
|
||||
|
||||
limiter = Limiter(key_func=get_remote_address)
|
||||
app = FastAPI()
|
||||
app.state.limiter = limiter
|
||||
|
||||
@app.exception_handler(RateLimitExceeded)
|
||||
async def _rate_limit_exceeded_handler(request: Request, exc: RateLimitExceeded):
|
||||
return JSONResponse(status_code=429,
|
||||
content={"detail": "Rate Limited!"})
|
||||
|
||||
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
|
||||
|
||||
app.add_middleware(
|
||||
CORSMiddleware,
|
||||
allow_origins=["*"],
|
||||
allow_credentials=True,
|
||||
allow_methods=["*"],
|
||||
allow_headers=["*"],
|
||||
)
|
||||
|
||||
# for completion
|
||||
@app.post("/chat/completions")
|
||||
@app.post("/v1/chat/completions")
|
||||
@limiter.limit("100/minute")
|
||||
async def completion(request: Request):
|
||||
# raise HTTPException(status_code=429, detail="Rate Limited!")
|
||||
return {
|
||||
"id": "chatcmpl-123",
|
||||
"object": "chat.completion",
|
||||
"created": 1677652288,
|
||||
"model": None,
|
||||
"system_fingerprint": "fp_44709d6fcb",
|
||||
"choices": [{
|
||||
"index": 0,
|
||||
"message": {
|
||||
"role": "assistant",
|
||||
"content": "\n\nHello there, how may I assist you today?",
|
||||
},
|
||||
"logprobs": None,
|
||||
"finish_reason": "stop"
|
||||
}],
|
||||
"usage": {
|
||||
"prompt_tokens": 9,
|
||||
"completion_tokens": 12,
|
||||
"total_tokens": 21
|
||||
}
|
||||
}
|
||||
|
||||
if __name__ == "__main__":
|
||||
import socket
|
||||
import uvicorn
|
||||
port = 8080
|
||||
while True:
|
||||
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
|
||||
result = sock.connect_ex(('0.0.0.0', port))
|
||||
if result != 0:
|
||||
print(f"Port {port} is available, starting server...")
|
||||
break
|
||||
else:
|
||||
port += 1
|
||||
|
||||
uvicorn.run(app, host="0.0.0.0", port=port)
|
||||
```
|
||||
|
||||
```bash
|
||||
python3 fake_openai_server.py
|
||||
```
|
||||
|
|
|
@ -331,49 +331,25 @@ response = litellm.completion(model="gpt-3.5-turbo", messages=messages, metadata
|
|||
## Examples
|
||||
|
||||
### Custom Callback to track costs for Streaming + Non-Streaming
|
||||
By default, the response cost is accessible in the logging object via `kwargs["response_cost"]` on success (sync + async)
|
||||
```python
|
||||
|
||||
# Step 1. Write your custom callback function
|
||||
def track_cost_callback(
|
||||
kwargs, # kwargs to completion
|
||||
completion_response, # response from completion
|
||||
start_time, end_time # start/end time
|
||||
):
|
||||
try:
|
||||
# init logging config
|
||||
logging.basicConfig(
|
||||
filename='cost.log',
|
||||
level=logging.INFO,
|
||||
format='%(asctime)s - %(message)s',
|
||||
datefmt='%Y-%m-%d %H:%M:%S'
|
||||
)
|
||||
|
||||
# check if it has collected an entire stream response
|
||||
if "complete_streaming_response" in kwargs:
|
||||
# for tracking streaming cost we pass the "messages" and the output_text to litellm.completion_cost
|
||||
completion_response=kwargs["complete_streaming_response"]
|
||||
input_text = kwargs["messages"]
|
||||
output_text = completion_response["choices"][0]["message"]["content"]
|
||||
response_cost = litellm.completion_cost(
|
||||
model = kwargs["model"],
|
||||
messages = input_text,
|
||||
completion=output_text
|
||||
)
|
||||
print("streaming response_cost", response_cost)
|
||||
logging.info(f"Model {kwargs['model']} Cost: ${response_cost:.8f}")
|
||||
|
||||
# for non streaming responses
|
||||
else:
|
||||
# we pass the completion_response obj
|
||||
if kwargs["stream"] != True:
|
||||
response_cost = litellm.completion_cost(completion_response=completion_response)
|
||||
print("regular response_cost", response_cost)
|
||||
logging.info(f"Model {completion_response.model} Cost: ${response_cost:.8f}")
|
||||
response_cost = kwargs["response_cost"] # litellm calculates response cost for you
|
||||
print("regular response_cost", response_cost)
|
||||
except:
|
||||
pass
|
||||
|
||||
# Assign the custom callback function
|
||||
# Step 2. Assign the custom callback function
|
||||
litellm.success_callback = [track_cost_callback]
|
||||
|
||||
# Step 3. Make litellm.completion call
|
||||
response = completion(
|
||||
model="gpt-3.5-turbo",
|
||||
messages=[
|
||||
|
|
68
docs/my-website/docs/observability/greenscale_integration.md
Normal file
|
@ -0,0 +1,68 @@
|
|||
# Greenscale Tutorial
|
||||
|
||||
[Greenscale](https://greenscale.ai/) is a production monitoring platform for your LLM-powered app that gives you granular insights into your GenAI spend and responsible usage. Greenscale only captures metadata to minimize the exposure risk of personally identifiable information (PII).
|
||||
|
||||
## Getting Started
|
||||
|
||||
Use Greenscale to log requests across all LLM Providers
|
||||
|
||||
liteLLM provides `callbacks`, making it easy for you to log data depending on the status of your responses.
|
||||
|
||||
## Using Callbacks
|
||||
|
||||
First, email `hello@greenscale.ai` to get an API_KEY.
|
||||
|
||||
Use just 1 line of code, to instantly log your responses **across all providers** with Greenscale:
|
||||
|
||||
```python
|
||||
litellm.success_callback = ["greenscale"]
|
||||
```
|
||||
|
||||
### Complete code
|
||||
|
||||
```python
|
||||
import os
import litellm
from litellm import completion
|
||||
|
||||
## set env variables
|
||||
os.environ['GREENSCALE_API_KEY'] = 'your-greenscale-api-key'
|
||||
os.environ['GREENSCALE_ENDPOINT'] = 'greenscale-endpoint'
|
||||
os.environ["OPENAI_API_KEY"]= ""
|
||||
|
||||
# set callback
|
||||
litellm.success_callback = ["greenscale"]
|
||||
|
||||
#openai call
|
||||
response = completion(
|
||||
model="gpt-3.5-turbo",
|
||||
messages=[{"role": "user", "content": "Hi 👋 - i'm openai"}]
|
||||
metadata={
|
||||
"greenscale_project": "acme-project",
|
||||
"greenscale_application": "acme-application"
|
||||
}
|
||||
)
|
||||
```
|
||||
|
||||
## Additional information in metadata
|
||||
|
||||
You can send additional information to Greenscale by using the `metadata` field in `completion` with a `greenscale_` prefix. This is useful for attaching metadata about the request, such as the project and application name, customer_id, environment, or any other information you want to track. `greenscale_project` and `greenscale_application` are required fields.
|
||||
|
||||
```python
|
||||
#openai call with additional metadata
|
||||
response = completion(
|
||||
model="gpt-3.5-turbo",
|
||||
messages=[
|
||||
{"role": "user", "content": "Hi 👋 - i'm openai"}
|
||||
],
|
||||
metadata={
|
||||
"greenscale_project": "acme-project",
|
||||
"greenscale_application": "acme-application",
|
||||
"greenscale_customer_id": "customer-123"
|
||||
}
|
||||
)
|
||||
```
|
||||
|
||||
## Support & Talk with Greenscale Team
|
||||
|
||||
- [Schedule Demo 👋](https://calendly.com/nandesh/greenscale)
|
||||
- [Website 💻](https://greenscale.ai)
|
||||
- Our email ✉️ `hello@greenscale.ai`
|
|
@ -121,10 +121,12 @@ response = completion(
|
|||
metadata={
|
||||
"generation_name": "ishaan-test-generation", # set langfuse Generation Name
|
||||
"generation_id": "gen-id22", # set langfuse Generation ID
|
||||
"trace_id": "trace-id22", # set langfuse Trace ID
|
||||
"trace_user_id": "user-id2", # set langfuse Trace User ID
|
||||
"session_id": "session-1", # set langfuse Session ID
|
||||
"tags": ["tag1", "tag2"] # set langfuse Tags
|
||||
"trace_id": "trace-id22", # set langfuse Trace ID
|
||||
### OR ###
|
||||
"existing_trace_id": "trace-id22", # if generation is continuation of past trace. This prevents default behaviour of setting a trace name
|
||||
},
|
||||
)
|
||||
|
||||
|
@ -167,6 +169,9 @@ messages = [
|
|||
chat(messages)
|
||||
```
|
||||
|
||||
## Redacting Messages, Response Content from Langfuse Logging
|
||||
|
||||
Set `litellm.turn_off_message_logging=True`. This prevents messages and responses from being logged to Langfuse, but request metadata will still be logged.
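For illustration, a minimal sketch of using this flag alongside the Langfuse callback (the model name and message are placeholders):

```python
import litellm
from litellm import completion

# assumes LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY (and your LLM API key) are set in the env
litellm.success_callback = ["langfuse"]

# Redact message/response content from Langfuse logs; request metadata is still sent
litellm.turn_off_message_logging = True

response = completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "sensitive prompt"}],
)
```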
|
||||
|
||||
## Troubleshooting & Errors
|
||||
### Data not getting logged to Langfuse ?
|
||||
|
|
|
@ -57,7 +57,7 @@ os.environ["LANGSMITH_API_KEY"] = ""
|
|||
os.environ['OPENAI_API_KEY']=""
|
||||
|
||||
# set langfuse as a callback, litellm will send the data to langfuse
|
||||
litellm.success_callback = ["langfuse"]
|
||||
litellm.success_callback = ["langsmith"]
|
||||
|
||||
response = litellm.completion(
|
||||
model="gpt-3.5-turbo",
|
||||
|
@ -76,4 +76,4 @@ print(response)
|
|||
- [Schedule Demo 👋](https://calendly.com/d/4mp-gd3-k5k/berriai-1-1-onboarding-litellm-hosted-version)
|
||||
- [Community Discord 💭](https://discord.gg/wuPM9dRgDw)
|
||||
- Our numbers 📞 +1 (770) 8783-106 / +1 (412) 618-6238
|
||||
- Our emails ✉️ ishaan@berri.ai / krrish@berri.ai
|
||||
- Our emails ✉️ ishaan@berri.ai / krrish@berri.ai
|
||||
|
|
|
@ -40,5 +40,9 @@ response = completion(model="gpt-3.5-turbo", messages=[{"role": "user", "content
|
|||
print(response)
|
||||
```
|
||||
|
||||
## Redacting Messages, Response Content from Sentry Logging
|
||||
|
||||
Set `litellm.turn_off_message_logging=True`. This prevents messages and responses from being logged to Sentry, but request metadata will still be logged.
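As a rough sketch (assuming Sentry is already configured as a LiteLLM callback per the setup earlier on this page; model name and message are placeholders):

```python
import litellm
from litellm import completion

# Redact message/response content; request metadata is still logged
litellm.turn_off_message_logging = True

response = completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "sensitive prompt"}],
)
```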
|
||||
|
||||
[Let us know](https://github.com/BerriAI/litellm/issues/new?assignees=&labels=enhancement&projects=&template=feature_request.yml&title=%5BFeature%5D%3A+) if you need any additional options from Sentry.
|
||||
|
||||
|
|
|
@ -224,6 +224,91 @@ assert isinstance(
|
|||
```
|
||||
|
||||
|
||||
### Parallel Function Calling
|
||||
|
||||
Here's how to pass the result of a function call back to an anthropic model:
|
||||
|
||||
```python
|
||||
import litellm
from litellm import completion
|
||||
import os
|
||||
|
||||
os.environ["ANTHROPIC_API_KEY"] = "sk-ant.."
|
||||
|
||||
|
||||
litellm.set_verbose = True
|
||||
|
||||
### 1ST FUNCTION CALL ###
|
||||
tools = [
|
||||
{
|
||||
"type": "function",
|
||||
"function": {
|
||||
"name": "get_current_weather",
|
||||
"description": "Get the current weather in a given location",
|
||||
"parameters": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"location": {
|
||||
"type": "string",
|
||||
"description": "The city and state, e.g. San Francisco, CA",
|
||||
},
|
||||
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
|
||||
},
|
||||
"required": ["location"],
|
||||
},
|
||||
},
|
||||
}
|
||||
]
|
||||
messages = [
|
||||
{
|
||||
"role": "user",
|
||||
"content": "What's the weather like in Boston today in Fahrenheit?",
|
||||
}
|
||||
]
|
||||
try:
|
||||
# test without max tokens
|
||||
response = completion(
|
||||
model="anthropic/claude-3-opus-20240229",
|
||||
messages=messages,
|
||||
tools=tools,
|
||||
tool_choice="auto",
|
||||
)
|
||||
# Add any assertions, here to check response args
|
||||
print(response)
|
||||
assert isinstance(response.choices[0].message.tool_calls[0].function.name, str)
|
||||
assert isinstance(
|
||||
response.choices[0].message.tool_calls[0].function.arguments, str
|
||||
)
|
||||
|
||||
messages.append(
|
||||
response.choices[0].message.model_dump()
|
||||
) # Add assistant tool invokes
|
||||
tool_result = (
|
||||
'{"location": "Boston", "temperature": "72", "unit": "fahrenheit"}'
|
||||
)
|
||||
# Add user submitted tool results in the OpenAI format
|
||||
messages.append(
|
||||
{
|
||||
"tool_call_id": response.choices[0].message.tool_calls[0].id,
|
||||
"role": "tool",
|
||||
"name": response.choices[0].message.tool_calls[0].function.name,
|
||||
"content": tool_result,
|
||||
}
|
||||
)
|
||||
### 2ND FUNCTION CALL ###
|
||||
# In the second response, Claude should deduce answer from tool results
|
||||
second_response = completion(
|
||||
model="anthropic/claude-3-opus-20240229",
|
||||
messages=messages,
|
||||
tools=tools,
|
||||
tool_choice="auto",
|
||||
)
|
||||
print(second_response)
|
||||
except Exception as e:
|
||||
print(f"An error occurred - {str(e)}")
|
||||
```
|
||||
|
||||
s/o @[Shekhar Patnaik](https://www.linkedin.com/in/patnaikshekhar) for requesting this!
|
||||
|
||||
## Usage - Vision
|
||||
|
||||
```python
|
||||
|
|
|
@ -23,7 +23,7 @@ In certain use-cases you may need to make calls to the models and pass [safety s
|
|||
```python
|
||||
response = completion(
|
||||
model="gemini/gemini-pro",
|
||||
messages=[{"role": "user", "content": "write code for saying hi from LiteLLM"}]
|
||||
messages=[{"role": "user", "content": "write code for saying hi from LiteLLM"}],
|
||||
safety_settings=[
|
||||
{
|
||||
"category": "HARM_CATEGORY_HARASSMENT",
|
||||
|
|
|
@ -48,6 +48,8 @@ We support ALL Groq models, just set `groq/` as a prefix when sending completion
|
|||
|
||||
| Model Name | Function Call |
|
||||
|--------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| llama3-8b-8192 | `completion(model="groq/llama3-8b-8192", messages)` |
|
||||
| llama3-70b-8192 | `completion(model="groq/llama3-70b-8192", messages)` |
|
||||
| llama2-70b-4096 | `completion(model="groq/llama2-70b-4096", messages)` |
|
||||
| mixtral-8x7b-32768 | `completion(model="groq/mixtral-8x7b-32768", messages)` |
|
||||
| gemma-7b-it | `completion(model="groq/gemma-7b-it", messages)` |
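For illustration, a minimal sketch of calling one of the models above (assuming `GROQ_API_KEY` is the environment variable LiteLLM reads for Groq):

```python
import os
from litellm import completion

os.environ["GROQ_API_KEY"] = ""  # your Groq API key

# the `groq/` prefix tells LiteLLM to route the call to Groq
response = completion(
    model="groq/llama3-8b-8192",
    messages=[{"role": "user", "content": "Hello from LiteLLM!"}],
)
print(response)
```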
|
||||
|
|
|
@ -53,6 +53,50 @@ All models listed here https://docs.mistral.ai/platform/endpoints are supported.
|
|||
| open-mixtral-8x22b | `completion(model="mistral/open-mixtral-8x22b", messages)` |
|
||||
|
||||
|
||||
## Function Calling
|
||||
|
||||
```python
|
||||
import os
from litellm import completion
|
||||
|
||||
# set env
|
||||
os.environ["MISTRAL_API_KEY"] = "your-api-key"
|
||||
|
||||
tools = [
|
||||
{
|
||||
"type": "function",
|
||||
"function": {
|
||||
"name": "get_current_weather",
|
||||
"description": "Get the current weather in a given location",
|
||||
"parameters": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"location": {
|
||||
"type": "string",
|
||||
"description": "The city and state, e.g. San Francisco, CA",
|
||||
},
|
||||
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
|
||||
},
|
||||
"required": ["location"],
|
||||
},
|
||||
},
|
||||
}
|
||||
]
|
||||
messages = [{"role": "user", "content": "What's the weather like in Boston today?"}]
|
||||
|
||||
response = completion(
|
||||
model="mistral/mistral-large-latest",
|
||||
messages=messages,
|
||||
tools=tools,
|
||||
tool_choice="auto",
|
||||
)
|
||||
# Add any assertions, here to check response args
|
||||
print(response)
|
||||
assert isinstance(response.choices[0].message.tool_calls[0].function.name, str)
|
||||
assert isinstance(
|
||||
response.choices[0].message.tool_calls[0].function.arguments, str
|
||||
)
|
||||
```
|
||||
|
||||
## Sample Usage - Embedding
|
||||
```python
|
||||
from litellm import embedding
|
||||
|
|
|
@ -1,7 +1,16 @@
|
|||
import Tabs from '@theme/Tabs';
|
||||
import TabItem from '@theme/TabItem';
|
||||
|
||||
# Replicate
|
||||
|
||||
LiteLLM supports all models on Replicate.
|
||||
|
||||
|
||||
## Usage
|
||||
|
||||
<Tabs>
|
||||
<TabItem value="sdk" label="SDK">
|
||||
|
||||
### API KEYS
|
||||
```python
|
||||
import os
|
||||
|
@ -16,14 +25,175 @@ import os
|
|||
## set ENV variables
|
||||
os.environ["REPLICATE_API_KEY"] = "replicate key"
|
||||
|
||||
# replicate llama-2 call
|
||||
# replicate llama-3 call
|
||||
response = completion(
|
||||
model="replicate/llama-2-70b-chat:2796ee9483c3fd7aa2e171d38f4ca12251a30609463dcfd4cd76703f22e96cdf",
|
||||
model="replicate/meta/meta-llama-3-8b-instruct",
|
||||
messages = [{ "content": "Hello, how are you?","role": "user"}]
|
||||
)
|
||||
```
|
||||
|
||||
### Example - Calling Replicate Deployments
|
||||
</TabItem>
|
||||
<TabItem value="proxy" label="PROXY">
|
||||
|
||||
1. Add models to your config.yaml
|
||||
|
||||
```yaml
|
||||
model_list:
|
||||
- model_name: llama-3
|
||||
litellm_params:
|
||||
model: replicate/meta/meta-llama-3-8b-instruct
|
||||
api_key: os.environ/REPLICATE_API_KEY
|
||||
```
|
||||
|
||||
|
||||
|
||||
2. Start the proxy
|
||||
|
||||
```bash
|
||||
$ litellm --config /path/to/config.yaml --debug
|
||||
```
|
||||
|
||||
3. Send Request to LiteLLM Proxy Server
|
||||
|
||||
<Tabs>
|
||||
|
||||
<TabItem value="openai" label="OpenAI Python v1.0.0+">
|
||||
|
||||
```python
|
||||
import openai
|
||||
client = openai.OpenAI(
|
||||
api_key="sk-1234", # pass litellm proxy key, if you're using virtual keys
|
||||
base_url="http://0.0.0.0:4000" # litellm-proxy-base url
|
||||
)
|
||||
|
||||
response = client.chat.completions.create(
|
||||
model="llama-3",
|
||||
messages = [
|
||||
{
|
||||
"role": "system",
|
||||
"content": "Be a good human!"
|
||||
},
|
||||
{
|
||||
"role": "user",
|
||||
"content": "What do you know about earth?"
|
||||
}
|
||||
]
|
||||
)
|
||||
|
||||
print(response)
|
||||
```
|
||||
|
||||
</TabItem>
|
||||
|
||||
<TabItem value="curl" label="curl">
|
||||
|
||||
```shell
|
||||
curl --location 'http://0.0.0.0:4000/chat/completions' \
|
||||
--header 'Authorization: Bearer sk-1234' \
|
||||
--header 'Content-Type: application/json' \
|
||||
--data '{
|
||||
"model": "llama-3",
|
||||
"messages": [
|
||||
{
|
||||
"role": "system",
|
||||
"content": "Be a good human!"
|
||||
},
|
||||
{
|
||||
"role": "user",
|
||||
"content": "What do you know about earth?"
|
||||
}
|
||||
],
|
||||
}'
|
||||
```
|
||||
</TabItem>
|
||||
|
||||
</Tabs>
|
||||
|
||||
|
||||
### Expected Replicate Call
|
||||
|
||||
This is the call litellm will make to replicate, from the above example:
|
||||
|
||||
```bash
|
||||
|
||||
POST Request Sent from LiteLLM:
|
||||
curl -X POST \
|
||||
https://api.replicate.com/v1/models/meta/meta-llama-3-8b-instruct \
|
||||
-H 'Authorization: Token your-api-key' -H 'Content-Type: application/json' \
|
||||
-d '{'version': 'meta/meta-llama-3-8b-instruct', 'input': {'prompt': '<|start_header_id|>system<|end_header_id|>\n\nBe a good human!<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat do you know about earth?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'}}'
|
||||
```
|
||||
|
||||
</TabItem>
|
||||
|
||||
</Tabs>
|
||||
|
||||
## Advanced Usage - Prompt Formatting
|
||||
|
||||
LiteLLM has prompt template mappings for all `meta-llama` llama3 instruct models. [**See Code**](https://github.com/BerriAI/litellm/blob/4f46b4c3975cd0f72b8c5acb2cb429d23580c18a/litellm/llms/prompt_templates/factory.py#L1360)
|
||||
|
||||
To apply a custom prompt template:
|
||||
|
||||
<Tabs>
|
||||
<TabItem value="sdk" label="SDK">
|
||||
|
||||
```python
|
||||
import litellm
|
||||
|
||||
import os
|
||||
os.environ["REPLICATE_API_KEY"] = ""
|
||||
|
||||
# Create your own custom prompt template
|
||||
litellm.register_prompt_template(
    model="togethercomputer/LLaMA-2-7B-32K",
    initial_prompt_value="You are a good assistant", # [OPTIONAL]
    roles={
        "system": {
            "pre_message": "[INST] <<SYS>>\n", # [OPTIONAL]
            "post_message": "\n<</SYS>>\n [/INST]\n" # [OPTIONAL]
        },
        "user": {
            "pre_message": "[INST] ", # [OPTIONAL]
            "post_message": " [/INST]" # [OPTIONAL]
        },
        "assistant": {
            "pre_message": "\n", # [OPTIONAL]
            "post_message": "\n" # [OPTIONAL]
        }
    },
    final_prompt_value="Now answer as best you can:" # [OPTIONAL]
)
|
||||
|
||||
def test_replicate_custom_model():
    model = "replicate/togethercomputer/LLaMA-2-7B-32K"
    messages = [{"role": "user", "content": "Hello, how are you?"}]
    response = litellm.completion(model=model, messages=messages)
    print(response['choices'][0]['message']['content'])
    return response
|
||||
|
||||
test_replicate_custom_model()
|
||||
```
|
||||
</TabItem>
|
||||
<TabItem value="proxy" label="PROXY">
|
||||
|
||||
```yaml
|
||||
# Model-specific parameters
|
||||
model_list:
|
||||
- model_name: mistral-7b # model alias
|
||||
litellm_params: # actual params for litellm.completion()
|
||||
model: "replicate/mistralai/Mistral-7B-Instruct-v0.1"
|
||||
api_key: os.environ/REPLICATE_API_KEY
|
||||
initial_prompt_value: "\n"
|
||||
roles: {"system":{"pre_message":"<|im_start|>system\n", "post_message":"<|im_end|>"}, "assistant":{"pre_message":"<|im_start|>assistant\n","post_message":"<|im_end|>"}, "user":{"pre_message":"<|im_start|>user\n","post_message":"<|im_end|>"}}
|
||||
final_prompt_value: "\n"
|
||||
bos_token: "<s>"
|
||||
eos_token: "</s>"
|
||||
max_tokens: 4096
|
||||
```
|
||||
|
||||
</TabItem>
|
||||
|
||||
</Tabs>
|
||||
|
||||
## Advanced Usage - Calling Replicate Deployments
|
||||
Calling a [deployed Replicate LLM](https://replicate.com/deployments)
Add the `replicate/deployments/` prefix to your model so LiteLLM calls the `deployments` endpoint. This calls the `ishaan-jaff/ishaan-mistral` deployment on Replicate, as sketched below.
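A minimal sketch of that call (uses the deployment name from the table further down; swap in your own deployment):

```python
import os
from litellm import completion

os.environ["REPLICATE_API_KEY"] = ""

# the `replicate/deployments/` prefix routes the request to Replicate's deployments endpoint
response = completion(
    model="replicate/deployments/ishaan-jaff/ishaan-mistral",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(response)
```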
|
||||
|
||||
|
@ -40,7 +210,7 @@ Replicate responses can take 3-5 mins due to replicate cold boots, if you're try
|
|||
|
||||
:::
|
||||
|
||||
### Replicate Models
|
||||
## Replicate Models
|
||||
liteLLM supports all replicate LLMs
|
||||
|
||||
For replicate models ensure to add a `replicate/` prefix to the `model` arg. liteLLM detects it using this arg.
|
||||
|
@ -49,15 +219,15 @@ Below are examples on how to call replicate LLMs using liteLLM
|
|||
|
||||
Model Name | Function Call | Required OS Variables |
|
||||
-----------------------------|----------------------------------------------------------------|--------------------------------------|
|
||||
replicate/llama-2-70b-chat | `completion(model='replicate/llama-2-70b-chat:2796ee9483c3fd7aa2e171d38f4ca12251a30609463dcfd4cd76703f22e96cdf', messages, supports_system_prompt=True)` | `os.environ['REPLICATE_API_KEY']` |
|
||||
a16z-infra/llama-2-13b-chat| `completion(model='replicate/a16z-infra/llama-2-13b-chat:2a7f981751ec7fdf87b5b91ad4db53683a98082e9ff7bfd12c8cd5ea85980a52', messages, supports_system_prompt=True)`| `os.environ['REPLICATE_API_KEY']` |
|
||||
replicate/llama-2-70b-chat | `completion(model='replicate/llama-2-70b-chat:2796ee9483c3fd7aa2e171d38f4ca12251a30609463dcfd4cd76703f22e96cdf', messages)` | `os.environ['REPLICATE_API_KEY']` |
|
||||
a16z-infra/llama-2-13b-chat| `completion(model='replicate/a16z-infra/llama-2-13b-chat:2a7f981751ec7fdf87b5b91ad4db53683a98082e9ff7bfd12c8cd5ea85980a52', messages)`| `os.environ['REPLICATE_API_KEY']` |
|
||||
replicate/vicuna-13b | `completion(model='replicate/vicuna-13b:6282abe6a492de4145d7bb601023762212f9ddbbe78278bd6771c8b3b2f2a13b', messages)` | `os.environ['REPLICATE_API_KEY']` |
|
||||
daanelson/flan-t5-large | `completion(model='replicate/daanelson/flan-t5-large:ce962b3f6792a57074a601d3979db5839697add2e4e02696b3ced4c022d4767f', messages)` | `os.environ['REPLICATE_API_KEY']` |
|
||||
custom-llm | `completion(model='replicate/custom-llm-version-id', messages)` | `os.environ['REPLICATE_API_KEY']` |
|
||||
replicate deployment | `completion(model='replicate/deployments/ishaan-jaff/ishaan-mistral', messages)` | `os.environ['REPLICATE_API_KEY']` |
|
||||
|
||||
|
||||
### Passing additional params - max_tokens, temperature
|
||||
## Passing additional params - max_tokens, temperature
|
||||
See all litellm.completion supported params [here](https://docs.litellm.ai/docs/completion/input)
|
||||
|
||||
```python
|
||||
|
@ -73,11 +243,22 @@ response = completion(
|
|||
messages = [{ "content": "Hello, how are you?","role": "user"}],
|
||||
max_tokens=20,
|
||||
temperature=0.5
|
||||
|
||||
)
|
||||
```
|
||||
|
||||
### Passing Replicate-specific params
|
||||
**proxy**
|
||||
|
||||
```yaml
|
||||
model_list:
|
||||
- model_name: llama-3
|
||||
litellm_params:
|
||||
model: replicate/meta/meta-llama-3-8b-instruct
|
||||
api_key: os.environ/REPLICATE_API_KEY
|
||||
max_tokens: 20
|
||||
temperature: 0.5
|
||||
```
|
||||
|
||||
## Passing Replicate-specific params
|
||||
Send params [not supported by `litellm.completion()`](https://docs.litellm.ai/docs/completion/input) but supported by Replicate by passing them to `litellm.completion`
|
||||
|
||||
For example, `seed` and `min_tokens` are Replicate-specific params.
|
||||
|
@ -98,3 +279,15 @@ response = completion(
|
|||
top_k=20,
|
||||
)
|
||||
```
|
||||
|
||||
**proxy**
|
||||
|
||||
```yaml
|
||||
model_list:
|
||||
- model_name: llama-3
|
||||
litellm_params:
|
||||
model: replicate/meta/meta-llama-3-8b-instruct
|
||||
api_key: os.environ/REPLICATE_API_KEY
|
||||
min_tokens: 2
|
||||
top_k: 20
|
||||
```
|
||||
|
|
|
@ -4,6 +4,13 @@ LiteLLM supports all models on VLLM.
|
|||
|
||||
🚀[Code Tutorial](https://github.com/BerriAI/litellm/blob/main/cookbook/VLLM_Model_Testing.ipynb)
|
||||
|
||||
|
||||
:::info
|
||||
|
||||
To call a HOSTED VLLM Endpoint use [these docs](./openai_compatible.md)
|
||||
|
||||
:::
|
||||
|
||||
### Quick Start
|
||||
```
|
||||
pip install litellm vllm
|
||||
|
|
284
docs/my-website/docs/providers/watsonx.md
Normal file
|
@ -0,0 +1,284 @@
|
|||
import Tabs from '@theme/Tabs';
|
||||
import TabItem from '@theme/TabItem';
|
||||
|
||||
# IBM watsonx.ai
|
||||
|
||||
LiteLLM supports all IBM [watsonx.ai](https://watsonx.ai/) foundational models and embeddings.
|
||||
|
||||
## Environment Variables
|
||||
```python
|
||||
os.environ["WATSONX_URL"] = "" # (required) Base URL of your WatsonX instance
|
||||
# (required) either one of the following:
|
||||
os.environ["WATSONX_APIKEY"] = "" # IBM cloud API key
|
||||
os.environ["WATSONX_TOKEN"] = "" # IAM auth token
|
||||
# optional - can also be passed as params to completion() or embedding()
|
||||
os.environ["WATSONX_PROJECT_ID"] = "" # Project ID of your WatsonX instance
|
||||
os.environ["WATSONX_DEPLOYMENT_SPACE_ID"] = "" # ID of your deployment space to use deployed models
|
||||
```
|
||||
|
||||
See [here](https://cloud.ibm.com/apidocs/watsonx-ai#api-authentication) for more information on how to get an access token to authenticate to watsonx.ai.
|
||||
|
||||
## Usage
|
||||
|
||||
<a target="_blank" href="https://colab.research.google.com/github/BerriAI/litellm/blob/main/cookbook/liteLLM_IBM_Watsonx.ipynb">
|
||||
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
|
||||
</a>
|
||||
|
||||
```python
|
||||
import os
|
||||
from litellm import completion
|
||||
|
||||
os.environ["WATSONX_URL"] = ""
|
||||
os.environ["WATSONX_APIKEY"] = ""
|
||||
|
||||
response = completion(
|
||||
model="watsonx/ibm/granite-13b-chat-v2",
|
||||
messages=[{ "content": "what is your favorite colour?","role": "user"}],
|
||||
project_id="<my-project-id>" # or pass with os.environ["WATSONX_PROJECT_ID"]
|
||||
)
|
||||
|
||||
response = completion(
|
||||
model="watsonx/meta-llama/llama-3-8b-instruct",
|
||||
messages=[{ "content": "what is your favorite colour?","role": "user"}],
|
||||
project_id="<my-project-id>"
|
||||
)
|
||||
```
|
||||
|
||||
## Usage - Streaming
|
||||
```python
|
||||
import os
|
||||
from litellm import completion
|
||||
|
||||
os.environ["WATSONX_URL"] = ""
|
||||
os.environ["WATSONX_APIKEY"] = ""
|
||||
os.environ["WATSONX_PROJECT_ID"] = ""
|
||||
|
||||
response = completion(
|
||||
model="watsonx/ibm/granite-13b-chat-v2",
|
||||
messages=[{ "content": "what is your favorite colour?","role": "user"}],
|
||||
stream=True
|
||||
)
|
||||
for chunk in response:
|
||||
print(chunk)
|
||||
```
|
||||
|
||||
#### Example Streaming Output Chunk
|
||||
```json
|
||||
{
|
||||
"choices": [
|
||||
{
|
||||
"finish_reason": null,
|
||||
"index": 0,
|
||||
"delta": {
|
||||
"content": "I don't have a favorite color, but I do like the color blue. What's your favorite color?"
|
||||
}
|
||||
}
|
||||
],
|
||||
"created": null,
|
||||
"model": "watsonx/ibm/granite-13b-chat-v2",
|
||||
"usage": {
|
||||
"prompt_tokens": null,
|
||||
"completion_tokens": null,
|
||||
"total_tokens": null
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Usage - Models in deployment spaces
|
||||
|
||||
Models that have been deployed to a deployment space (e.g.: tuned models) can be called using the `deployment/<deployment_id>` format (where `<deployment_id>` is the ID of the deployed model in your deployment space).
|
||||
|
||||
The ID of your deployment space must also be set in the environment variable `WATSONX_DEPLOYMENT_SPACE_ID` or passed to the function as `space_id=<deployment_space_id>`.
|
||||
|
||||
```python
|
||||
import litellm
|
||||
response = litellm.completion(
|
||||
model="watsonx/deployment/<deployment_id>",
|
||||
messages=[{"content": "Hello, how are you?", "role": "user"}],
|
||||
space_id="<deployment_space_id>"
|
||||
)
|
||||
```
|
||||
|
||||
## Usage - Embeddings
|
||||
|
||||
LiteLLM also supports making requests to IBM watsonx.ai embedding models. The credentials needed are the same as for completion.
|
||||
|
||||
```python
|
||||
from litellm import embedding
|
||||
|
||||
response = embedding(
|
||||
model="watsonx/ibm/slate-30m-english-rtrvr",
|
||||
input=["What is the capital of France?"],
|
||||
project_id="<my-project-id>"
|
||||
)
|
||||
print(response)
|
||||
# EmbeddingResponse(model='ibm/slate-30m-english-rtrvr', data=[{'object': 'embedding', 'index': 0, 'embedding': [-0.037463713, -0.02141933, -0.02851813, 0.015519324, ..., -0.0021367231, -0.01704561, -0.001425816, 0.0035238306]}], object='list', usage=Usage(prompt_tokens=8, total_tokens=8))
|
||||
```
|
||||
|
||||
## OpenAI Proxy Usage
|
||||
|
||||
Here's how to call IBM watsonx.ai with the LiteLLM Proxy Server
|
||||
|
||||
### 1. Save keys in your environment
|
||||
|
||||
```bash
|
||||
export WATSONX_URL=""
|
||||
export WATSONX_APIKEY=""
|
||||
export WATSONX_PROJECT_ID=""
|
||||
```
|
||||
|
||||
### 2. Start the proxy
|
||||
|
||||
<Tabs>
|
||||
<TabItem value="cli" label="CLI">
|
||||
|
||||
```bash
|
||||
$ litellm --model watsonx/meta-llama/llama-3-8b-instruct
|
||||
|
||||
# Server running on http://0.0.0.0:4000
|
||||
```
|
||||
|
||||
</TabItem>
|
||||
<TabItem value="config" label="config.yaml">
|
||||
|
||||
```yaml
|
||||
model_list:
|
||||
- model_name: llama-3-8b
|
||||
litellm_params:
|
||||
# all params accepted by litellm.completion()
|
||||
model: watsonx/meta-llama/llama-3-8b-instruct
|
||||
api_key: "os.environ/WATSONX_API_KEY" # does os.getenv("WATSONX_API_KEY")
|
||||
```
|
||||
</TabItem>
|
||||
</Tabs>
|
||||
|
||||
### 3. Test it
|
||||
|
||||
|
||||
<Tabs>
|
||||
<TabItem value="Curl" label="Curl Request">
|
||||
|
||||
```shell
|
||||
curl --location 'http://0.0.0.0:4000/chat/completions' \
|
||||
--header 'Content-Type: application/json' \
|
||||
--data ' {
|
||||
"model": "llama-3-8b",
|
||||
"messages": [
|
||||
{
|
||||
"role": "user",
|
||||
"content": "what is your favorite colour?"
|
||||
}
|
||||
]
|
||||
}
|
||||
'
|
||||
```
|
||||
</TabItem>
|
||||
<TabItem value="openai" label="OpenAI v1.0.0+">
|
||||
|
||||
```python
|
||||
import openai
|
||||
client = openai.OpenAI(
|
||||
api_key="anything",
|
||||
base_url="http://0.0.0.0:4000"
|
||||
)
|
||||
|
||||
# request sent to model set on litellm proxy, `litellm --model`
|
||||
response = client.chat.completions.create(model="llama-3-8b", messages=[
|
||||
{
|
||||
"role": "user",
|
||||
"content": "what is your favorite colour?"
|
||||
}
|
||||
])
|
||||
|
||||
print(response)
|
||||
|
||||
```
|
||||
</TabItem>
|
||||
<TabItem value="langchain" label="Langchain">
|
||||
|
||||
```python
|
||||
from langchain.chat_models import ChatOpenAI
|
||||
from langchain.prompts.chat import (
|
||||
ChatPromptTemplate,
|
||||
HumanMessagePromptTemplate,
|
||||
SystemMessagePromptTemplate,
|
||||
)
|
||||
from langchain.schema import HumanMessage, SystemMessage
|
||||
|
||||
chat = ChatOpenAI(
|
||||
openai_api_base="http://0.0.0.0:4000", # set openai_api_base to the LiteLLM Proxy
|
||||
model = "llama-3-8b",
|
||||
temperature=0.1
|
||||
)
|
||||
|
||||
messages = [
|
||||
SystemMessage(
|
||||
content="You are a helpful assistant that im using to make a test request to."
|
||||
),
|
||||
HumanMessage(
|
||||
content="test from litellm. tell me why it's amazing in 1 sentence"
|
||||
),
|
||||
]
|
||||
response = chat(messages)
|
||||
|
||||
print(response)
|
||||
```
|
||||
</TabItem>
|
||||
</Tabs>
|
||||
|
||||
|
||||
## Authentication
|
||||
|
||||
### Passing credentials as parameters
|
||||
|
||||
You can also pass the credentials as parameters to the completion and embedding functions.
|
||||
|
||||
```python
|
||||
import os
|
||||
from litellm import completion
|
||||
|
||||
response = completion(
|
||||
model="watsonx/ibm/granite-13b-chat-v2",
|
||||
messages=[{ "content": "What is your favorite color?","role": "user"}],
|
||||
url="",
|
||||
api_key="",
|
||||
project_id=""
|
||||
)
|
||||
```
|
||||
|
||||
|
||||
## Supported IBM watsonx.ai Models
|
||||
|
||||
Here are some examples of models available in IBM watsonx.ai that you can use with LiteLLM:
|
||||
|
||||
| Model Name | Command |
|
||||
| ---------- | --------- |
|
||||
| Flan T5 XXL | `completion(model=watsonx/google/flan-t5-xxl, messages=messages)` |
|
||||
| Flan Ul2 | `completion(model=watsonx/google/flan-ul2, messages=messages)` |
|
||||
| Mt0 XXL | `completion(model=watsonx/bigscience/mt0-xxl, messages=messages)` |
|
||||
| Gpt Neox | `completion(model=watsonx/eleutherai/gpt-neox-20b, messages=messages)` |
|
||||
| Mpt 7B Instruct2 | `completion(model=watsonx/ibm/mpt-7b-instruct2, messages=messages)` |
|
||||
| Starcoder | `completion(model=watsonx/bigcode/starcoder, messages=messages)` |
|
||||
| Llama 2 70B Chat | `completion(model=watsonx/meta-llama/llama-2-70b-chat, messages=messages)` |
|
||||
| Llama 2 13B Chat | `completion(model=watsonx/meta-llama/llama-2-13b-chat, messages=messages)` |
|
||||
| Granite 13B Instruct | `completion(model=watsonx/ibm/granite-13b-instruct-v1, messages=messages)` |
|
||||
| Granite 13B Chat | `completion(model=watsonx/ibm/granite-13b-chat-v1, messages=messages)` |
|
||||
| Flan T5 XL | `completion(model=watsonx/google/flan-t5-xl, messages=messages)` |
|
||||
| Granite 13B Chat V2 | `completion(model=watsonx/ibm/granite-13b-chat-v2, messages=messages)` |
|
||||
| Granite 13B Instruct V2 | `completion(model=watsonx/ibm/granite-13b-instruct-v2, messages=messages)` |
|
||||
| Elyza Japanese Llama 2 7B Instruct | `completion(model=watsonx/elyza/elyza-japanese-llama-2-7b-instruct, messages=messages)` |
|
||||
| Mixtral 8X7B Instruct V01 Q | `completion(model=watsonx/ibm-mistralai/mixtral-8x7b-instruct-v01-q, messages=messages)` |
|
||||
|
||||
|
||||
For a list of all available models in watsonx.ai, see [here](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/fm-models.html?context=wx&locale=en&audience=wdp).
|
||||
|
||||
|
||||
## Supported IBM watsonx.ai Embedding Models
|
||||
|
||||
| Model Name | Function Call |
|
||||
|----------------------|---------------------------------------------|
|
||||
| Slate 30m | `embedding(model="watsonx/ibm/slate-30m-english-rtrvr", input=input)` |
|
||||
| Slate 125m | `embedding(model="watsonx/ibm/slate-125m-english-rtrvr", input=input)` |
|
||||
|
||||
|
||||
For a list of all available embedding models in watsonx.ai, see [here](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/fm-models-embed.html?context=wx).
|
|
@ -1,13 +1,13 @@
|
|||
# Slack Alerting
|
||||
# 🚨 Alerting
|
||||
|
||||
Get alerts for:
|
||||
- hanging LLM api calls
|
||||
- failed LLM api calls
|
||||
- slow LLM api calls
|
||||
- budget Tracking per key/user:
|
||||
- Hanging LLM api calls
|
||||
- Failed LLM api calls
|
||||
- Slow LLM api calls
|
||||
- Budget Tracking per key/user:
|
||||
- When a User/Key crosses their Budget
|
||||
- When a User/Key is 15% away from crossing their Budget
|
||||
- failed db read/writes
|
||||
- Failed db read/writes
|
||||
|
||||
## Quick Start
|
||||
|
||||
|
|
|
@ -61,6 +61,22 @@ litellm_settings:
|
|||
ttl: 600 # will be cached on redis for 600s
|
||||
```
|
||||
|
||||
|
||||
## SSL
|
||||
|
||||
Just set `REDIS_SSL="True"` in your .env, and LiteLLM will pick this up.
|
||||
|
||||
```env
|
||||
REDIS_SSL="True"
|
||||
```
|
||||
|
||||
For quick testing, you can also use `REDIS_URL`, e.g.:
|
||||
|
||||
```env
|
||||
REDIS_URL="rediss://.."
|
||||
```
|
||||
|
||||
However, we **don't** recommend using `REDIS_URL` in production. We've noticed a performance difference between using it vs. `redis_host`, `redis_port`, etc.
|
||||
#### Step 2: Add Redis Credentials to .env
|
||||
Set either `REDIS_URL` or `REDIS_HOST` in your OS environment to enable caching.
|
||||
|
||||
|
|
|
@ -62,9 +62,11 @@ model_list:
|
|||
|
||||
litellm_settings: # module level litellm settings - https://github.com/BerriAI/litellm/blob/main/litellm/__init__.py
|
||||
drop_params: True
|
||||
success_callback: ["langfuse"] # OPTIONAL - if you want to start sending LLM Logs to Langfuse. Make sure to set `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY` in your env
|
||||
|
||||
general_settings:
|
||||
master_key: sk-1234 # [OPTIONAL] Only use this if you to require all calls to contain this key (Authorization: Bearer sk-1234)
|
||||
alerting: ["slack"] # [OPTIONAL] If you want Slack Alerts for Hanging LLM requests, Slow llm responses, Budget Alerts. Make sure to set `SLACK_WEBHOOK_URL` in your env
|
||||
```
|
||||
:::info
|
||||
|
||||
|
@ -600,6 +602,7 @@ general_settings:
|
|||
"general_settings": {
|
||||
"completion_model": "string",
|
||||
"disable_spend_logs": "boolean", # turn off writing each transaction to the db
|
||||
"disable_master_key_return": "boolean", # turn off returning master key on UI (checked on '/user/info' endpoint)
|
||||
"disable_reset_budget": "boolean", # turn off reset budget scheduled task
|
||||
"enable_jwt_auth": "boolean", # allow proxy admin to auth in via jwt tokens with 'litellm_proxy_admin' in claims
|
||||
"enforce_user_param": "boolean", # requires all openai endpoint requests to have a 'user' param
|
||||
|
|
|
@ -231,13 +231,16 @@ Your OpenAI proxy server is now running on `http://127.0.0.1:4000`.
|
|||
| Docs | When to Use |
|
||||
| --- | --- |
|
||||
| [Quick Start](#quick-start) | call 100+ LLMs + Load Balancing |
|
||||
| [Deploy with Database](#deploy-with-database) | + use Virtual Keys + Track Spend |
|
||||
| [Deploy with Database](#deploy-with-database) | + use Virtual Keys + Track Spend (Note: when deploying with a database, `DATABASE_URL` and `LITELLM_MASTER_KEY` are required in your env) |
|
||||
| [LiteLLM container + Redis](#litellm-container--redis) | + load balance across multiple litellm containers |
|
||||
| [LiteLLM Database container + PostgresDB + Redis](#litellm-database-container--postgresdb--redis) | + use Virtual Keys + Track Spend + load balance across multiple litellm containers |
|
||||
|
||||
## Deploy with Database
|
||||
### Docker, Kubernetes, Helm Chart
|
||||
|
||||
Requirements:
|
||||
- You need a Postgres database (e.g. [Supabase](https://supabase.com/), [Neon](https://neon.tech/), etc.). Set `DATABASE_URL=postgresql://<user>:<password>@<host>:<port>/<dbname>` in your env
|
||||
- Set a `LITELLM_MASTER_KEY`. This is your Proxy Admin key - you can use it to create other keys (🚨 it must start with `sk-`)
|
||||
|
||||
<Tabs>
|
||||
|
||||
|
@ -252,6 +255,8 @@ docker pull ghcr.io/berriai/litellm-database:main-latest
|
|||
```shell
|
||||
docker run \
|
||||
-v $(pwd)/litellm_config.yaml:/app/config.yaml \
|
||||
-e LITELLM_MASTER_KEY=sk-1234 \
|
||||
-e DATABASE_URL=postgresql://<user>:<password>@<host>:<port>/<dbname> \
|
||||
-e AZURE_API_KEY=d6*********** \
|
||||
-e AZURE_API_BASE=https://openai-***********/ \
|
||||
-p 4000:4000 \
|
||||
|
@ -267,26 +272,63 @@ Your OpenAI proxy server is now running on `http://0.0.0.0:4000`.
|
|||
#### Step 1. Create deployment.yaml
|
||||
|
||||
```yaml
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: litellm-deployment
|
||||
spec:
|
||||
replicas: 1
|
||||
selector:
|
||||
matchLabels:
|
||||
app: litellm
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: litellm
|
||||
spec:
|
||||
containers:
|
||||
- name: litellm-container
|
||||
image: ghcr.io/berriai/litellm-database:main-latest
|
||||
env:
|
||||
- name: DATABASE_URL
|
||||
value: postgresql://<user>:<password>@<host>:<port>/<dbname>
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: litellm-deployment
|
||||
spec:
|
||||
replicas: 3
|
||||
selector:
|
||||
matchLabels:
|
||||
app: litellm
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: litellm
|
||||
spec:
|
||||
containers:
|
||||
- name: litellm-container
|
||||
image: ghcr.io/berriai/litellm:main-latest
|
||||
imagePullPolicy: Always
|
||||
env:
|
||||
- name: AZURE_API_KEY
|
||||
value: "d6******"
|
||||
- name: AZURE_API_BASE
|
||||
value: "https://ope******"
|
||||
- name: LITELLM_MASTER_KEY
|
||||
value: "sk-1234"
|
||||
- name: DATABASE_URL
|
||||
value: "po**********"
|
||||
args:
|
||||
- "--config"
|
||||
- "/app/proxy_config.yaml" # Update the path to mount the config file
|
||||
volumeMounts: # Define volume mount for proxy_config.yaml
|
||||
- name: config-volume
|
||||
mountPath: /app
|
||||
readOnly: true
|
||||
livenessProbe:
|
||||
httpGet:
|
||||
path: /health/liveliness
|
||||
port: 4000
|
||||
initialDelaySeconds: 120
|
||||
periodSeconds: 15
|
||||
successThreshold: 1
|
||||
failureThreshold: 3
|
||||
timeoutSeconds: 10
|
||||
readinessProbe:
|
||||
httpGet:
|
||||
path: /health/readiness
|
||||
port: 4000
|
||||
initialDelaySeconds: 120
|
||||
periodSeconds: 15
|
||||
successThreshold: 1
|
||||
failureThreshold: 3
|
||||
timeoutSeconds: 10
|
||||
volumes: # Define volume to mount proxy_config.yaml
|
||||
- name: config-volume
|
||||
configMap:
|
||||
name: litellm-config
|
||||
|
||||
```
|
||||
|
||||
```bash
|
||||
|
|
|
@ -401,7 +401,7 @@ litellm_settings:
|
|||
Start the LiteLLM Proxy and make a test request to verify the logs reached your callback API
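For example, a quick way to trigger a test log is to send one request through the proxy with the OpenAI SDK (a minimal sketch; the key and port are placeholders):

```python
from openai import OpenAI

# hypothetical virtual key + local proxy address
client = OpenAI(api_key="sk-1234", base_url="http://0.0.0.0:4000")

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "this is a test request"}],
)
print(response)
```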
|
||||
|
||||
## Logging Proxy Input/Output - Langfuse
|
||||
We will use the `--config` to set `litellm.success_callback = ["langfuse"]` this will log all successfull LLM calls to langfuse
|
||||
We will use the `--config` to set `litellm.success_callback = ["langfuse"]`, this will log all successful LLM calls to langfuse. Make sure to set `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY` in your environment
|
||||
|
||||
**Step 1** Install langfuse
|
||||
|
||||
|
@ -419,7 +419,13 @@ litellm_settings:
|
|||
success_callback: ["langfuse"]
|
||||
```
|
||||
|
||||
**Step 3**: Start the proxy, make a test request
|
||||
**Step 3**: Set required env variables for logging to langfuse
|
||||
```shell
|
||||
export LANGFUSE_PUBLIC_KEY="pk_kk"
|
||||
export LANGFUSE_SECRET_KEY="sk_ss"
|
||||
```
|
||||
|
||||
**Step 4**: Start the proxy, make a test request
|
||||
|
||||
Start proxy
|
||||
```shell
|
||||
|
@ -569,6 +575,22 @@ curl -X POST 'http://0.0.0.0:4000/key/generate' \
|
|||
|
||||
All requests made with these keys will log data to their team-specific logging.
|
||||
|
||||
### Redacting Messages, Response Content from Langfuse Logging
|
||||
|
||||
Set `litellm.turn_off_message_logging=True`. This will prevent the messages and responses from being logged to Langfuse, but request metadata will still be logged.
|
||||
|
||||
```yaml
|
||||
model_list:
|
||||
- model_name: gpt-3.5-turbo
|
||||
litellm_params:
|
||||
model: gpt-3.5-turbo
|
||||
litellm_settings:
|
||||
success_callback: ["langfuse"]
|
||||
turn_off_message_logging: True
|
||||
```
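For SDK users, the same redaction can be applied directly on the `litellm` module (a minimal sketch, assuming the Langfuse env vars are already set):

```python
import litellm

litellm.success_callback = ["langfuse"]
litellm.turn_off_message_logging = True  # prompts/responses are redacted; request metadata is still logged

litellm.completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "hi"}],
)
```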
|
||||
|
||||
|
||||
|
||||
## Logging Proxy Input/Output - DataDog
|
||||
We will use the `--config` to set `litellm.success_callback = ["datadog"]`, this will log all successful LLM calls to DataDog
|
||||
|
||||
|
|
|
@ -16,7 +16,7 @@ Expected Performance in Production
|
|||
| `/chat/completions` Requests/hour | `126K` |
|
||||
|
||||
|
||||
## 1. Switch of Debug Logging
|
||||
## 1. Switch off Debug Logging
|
||||
|
||||
Remove `set_verbose: True` from your config.yaml
|
||||
```yaml
|
||||
|
@ -40,7 +40,7 @@ Use this Docker `CMD`. This will start the proxy with 1 Uvicorn Async Worker
|
|||
CMD ["--port", "4000", "--config", "./proxy_server_config.yaml"]
|
||||
```
|
||||
|
||||
## 2. Batch write spend updates every 60s
|
||||
## 3. Batch write spend updates every 60s
|
||||
|
||||
The default proxy batch write is 10s. This is to make it easy to see spend when debugging locally.
|
||||
|
||||
|
@ -49,11 +49,35 @@ In production, we recommend using a longer interval period of 60s. This reduces
|
|||
```yaml
|
||||
general_settings:
|
||||
master_key: sk-1234
|
||||
proxy_batch_write_at: 5 # 👈 Frequency of batch writing logs to server (in seconds)
|
||||
proxy_batch_write_at: 60 # 👈 Frequency of batch writing logs to server (in seconds)
|
||||
```
|
||||
|
||||
## 4. Use Redis 'port', 'host', 'password'. NOT 'redis_url'
|
||||
|
||||
## 3. Move spend logs to separate server
|
||||
When connecting to Redis, use the 'port', 'host', and 'password' params, not 'redis_url'. We've seen an 80 RPS difference between these two approaches when using the async redis client.
|
||||
|
||||
This is still something we're investigating. Keep track of it [here](https://github.com/BerriAI/litellm/issues/3188)
|
||||
|
||||
Recommended to do this for prod:
|
||||
|
||||
```yaml
|
||||
router_settings:
|
||||
routing_strategy: usage-based-routing-v2
|
||||
# redis_url: "os.environ/REDIS_URL"
|
||||
redis_host: os.environ/REDIS_HOST
|
||||
redis_port: os.environ/REDIS_PORT
|
||||
redis_password: os.environ/REDIS_PASSWORD
|
||||
```
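For SDK users, a rough Router equivalent is sketched below (assuming the same Redis env vars are set; the deployment entry is illustrative):

```python
import os
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-3.5-turbo",
            "litellm_params": {"model": "gpt-3.5-turbo"},
        }
    ],
    routing_strategy="usage-based-routing-v2",
    # pass host/port/password instead of redis_url
    redis_host=os.environ["REDIS_HOST"],
    redis_port=int(os.environ["REDIS_PORT"]),
    redis_password=os.environ["REDIS_PASSWORD"],
)
```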
|
||||
|
||||
## 5. Switch off resetting budgets
|
||||
|
||||
Add this to your config.yaml. (Budgets for Keys, Users and Teams will no longer be reset on their scheduled interval.)
|
||||
```yaml
|
||||
general_settings:
|
||||
disable_reset_budget: true
|
||||
```
|
||||
|
||||
## 6. Move spend logs to separate server (BETA)
|
||||
|
||||
Writing each spend log to the db can slow down your proxy. In testing, we saw a 70% improvement in median response time by moving spend log writes to a separate server.
|
||||
|
||||
|
@ -141,24 +165,6 @@ A t2.micro should be sufficient to handle 1k logs / minute on this server.
|
|||
|
||||
This consumes at max 120MB, and <0.1 vCPU.
|
||||
|
||||
## 4. Switch off resetting budgets
|
||||
|
||||
Add this to your config.yaml. (Only spend per Key, User and Team will be tracked - spend per API Call will not be written to the LiteLLM Database)
|
||||
```yaml
|
||||
general_settings:
|
||||
disable_spend_logs: true
|
||||
disable_reset_budget: true
|
||||
```
|
||||
|
||||
## 5. Switch of `litellm.telemetry`
|
||||
|
||||
Switch of all telemetry tracking done by litellm
|
||||
|
||||
```yaml
|
||||
litellm_settings:
|
||||
telemetry: False
|
||||
```
|
||||
|
||||
## Machine Specifications to Deploy LiteLLM
|
||||
|
||||
| Service | Spec | CPUs | Memory | Architecture | Version|
|
||||
|
|
|
@ -14,6 +14,7 @@ model_list:
|
|||
model: gpt-3.5-turbo
|
||||
litellm_settings:
|
||||
success_callback: ["prometheus"]
|
||||
failure_callback: ["prometheus"]
|
||||
```
|
||||
|
||||
Start the proxy
|
||||
|
@ -48,9 +49,10 @@ http://localhost:4000/metrics
|
|||
|
||||
| Metric Name | Description |
|
||||
|----------------------|--------------------------------------|
|
||||
| `litellm_requests_metric` | Number of requests made, per `"user", "key", "model"` |
|
||||
| `litellm_spend_metric` | Total Spend, per `"user", "key", "model"` |
|
||||
| `litellm_total_tokens` | input + output tokens per `"user", "key", "model"` |
|
||||
| `litellm_requests_metric` | Number of requests made, per `"user", "key", "model", "team", "end-user"` |
|
||||
| `litellm_spend_metric` | Total Spend, per `"user", "key", "model", "team", "end-user"` |
|
||||
| `litellm_total_tokens` | input + output tokens per `"user", "key", "model", "team", "end-user"` |
|
||||
| `litellm_llm_api_failed_requests_metric` | Number of failed LLM API requests per `"user", "key", "model", "team", "end-user"` |
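As a quick sanity check, the metrics above can be scraped and filtered programmatically (a minimal sketch, assuming the proxy runs locally on port 4000):

```python
import requests

# fetch the Prometheus exposition text from the proxy
metrics_text = requests.get("http://localhost:4000/metrics").text

# print only the litellm_* series
for line in metrics_text.splitlines():
    if line.startswith("litellm_"):
        print(line)
```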
|
||||
|
||||
## Monitor System Health
|
||||
|
||||
|
@ -69,3 +71,4 @@ litellm_settings:
|
|||
|----------------------|--------------------------------------|
|
||||
| `litellm_redis_latency` | histogram latency for redis calls |
|
||||
| `litellm_redis_fails` | Number of failed redis calls |
|
||||
| `litellm_self_latency` | Histogram latency for successful litellm api call |
|
|
@ -348,6 +348,29 @@ query_result = embeddings.embed_query(text)
|
|||
|
||||
print(f"TITAN EMBEDDINGS")
|
||||
print(query_result[:5])
|
||||
```
|
||||
</TabItem>
|
||||
<TabItem value="litellm" label="LiteLLM SDK">
|
||||
|
||||
This is **not recommended**. There is duplicate logic as the proxy also uses the sdk, which might lead to unexpected errors.
|
||||
|
||||
```python
|
||||
from litellm import completion
|
||||
|
||||
response = completion(
|
||||
model="openai/gpt-3.5-turbo",
|
||||
messages = [
|
||||
{
|
||||
"role": "user",
|
||||
"content": "this is a test request, write a short poem"
|
||||
}
|
||||
],
|
||||
api_key="anything",
|
||||
base_url="http://0.0.0.0:4000"
|
||||
)
|
||||
|
||||
print(response)
|
||||
|
||||
```
|
||||
</TabItem>
|
||||
</Tabs>
|
||||
|
|
|
@ -136,6 +136,21 @@ curl --location 'http://0.0.0.0:4000/chat/completions' \
|
|||
'
|
||||
```
|
||||
|
||||
### Test it!
|
||||
|
||||
|
||||
```bash
|
||||
curl --location 'http://0.0.0.0:4000/chat/completions' \
|
||||
--header 'Content-Type: application/json' \
|
||||
--data-raw '{
|
||||
"model": "zephyr-beta", # 👈 MODEL NAME to fallback from
|
||||
"messages": [
|
||||
{"role": "user", "content": "what color is red"}
|
||||
],
|
||||
"mock_testing_fallbacks": true
|
||||
}'
|
||||
```
|
||||
|
||||
## Advanced - Context Window Fallbacks
|
||||
|
||||
**Before a call is made**, check if the call is within the model's context window with **`enable_pre_call_checks: true`**.
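A minimal Router sketch with this flag enabled (illustrative only; the deployment entry is an assumption):

```python
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-3.5-turbo",
            "litellm_params": {"model": "gpt-3.5-turbo"},
        }
    ],
    enable_pre_call_checks=True,  # skip deployments whose context window is too small for the prompt
)
```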
|
||||
|
|
|
@ -121,6 +121,9 @@ from langchain.prompts.chat import (
|
|||
SystemMessagePromptTemplate,
|
||||
)
|
||||
from langchain.schema import HumanMessage, SystemMessage
|
||||
import os
|
||||
|
||||
os.environ["OPENAI_API_KEY"] = "anything"
|
||||
|
||||
chat = ChatOpenAI(
|
||||
openai_api_base="http://0.0.0.0:4000",
|
||||
|
|
|
@ -95,7 +95,7 @@ print(response)
|
|||
- `router.image_generation()` - completion calls in OpenAI `/v1/images/generations` endpoint format
|
||||
- `router.aimage_generation()` - async image generation calls
|
||||
|
||||
### Advanced - Routing Strategies
|
||||
## Advanced - Routing Strategies
|
||||
#### Routing Strategies - Weighted Pick, Rate Limit Aware, Least Busy, Latency Based
|
||||
|
||||
Router provides 4 strategies for routing your calls across multiple deployments:
|
||||
|
@ -279,7 +279,7 @@ router_settings:
|
|||
```
|
||||
|
||||
</TabItem>
|
||||
<TabItem value="simple-shuffle" label="(Default) Weighted Pick">
|
||||
<TabItem value="simple-shuffle" label="(Default) Weighted Pick (Async)">
|
||||
|
||||
**Default** Picks a deployment based on the provided **Requests per minute (rpm) or Tokens per minute (tpm)**
|
||||
|
||||
|
@ -443,6 +443,35 @@ asyncio.run(router_acompletion())
|
|||
|
||||
## Basic Reliability
|
||||
|
||||
### Max Parallel Requests (ASYNC)
|
||||
|
||||
Used in a semaphore for async requests on the router. Limits the max concurrent calls made to a deployment. Useful in high-traffic scenarios.
|
||||
|
||||
If tpm/rpm is set, and no max parallel request limit is given, we use the RPM or calculated RPM (tpm/1000/6) as the max parallel request limit. For example, with tpm=60,000 and no rpm or explicit limit set, the deployment would allow 60,000/1000/6 = 10 concurrent requests.
|
||||
|
||||
|
||||
```python
|
||||
from litellm import Router
|
||||
|
||||
model_list = [{
|
||||
"model_name": "gpt-4",
|
||||
"litellm_params": {
|
||||
"model": "azure/gpt-4",
|
||||
...
|
||||
"max_parallel_requests": 10 # 👈 SET PER DEPLOYMENT
|
||||
}
|
||||
}]
|
||||
|
||||
### OR ###
|
||||
|
||||
router = Router(model_list=model_list, default_max_parallel_requests=20) # 👈 SET DEFAULT MAX PARALLEL REQUESTS
|
||||
|
||||
|
||||
# deployment max parallel requests > default max parallel requests
|
||||
```
|
||||
|
||||
[**See Code**](https://github.com/BerriAI/litellm/blob/a978f2d8813c04dad34802cb95e0a0e35a3324bc/litellm/utils.py#L5605)
|
||||
|
||||
### Timeouts
|
||||
|
||||
The timeout set in router is for the entire length of the call, and is passed down to the completion() call level as well.
|
||||
|
|
|
@ -5,6 +5,9 @@ LiteLLM allows you to specify the following:
|
|||
* API Base
|
||||
* API Version
|
||||
* API Type
|
||||
* Project
|
||||
* Location
|
||||
* Token
|
||||
|
||||
Useful Helper functions:
|
||||
* [`check_valid_key()`](#check_valid_key)
|
||||
|
@ -43,6 +46,24 @@ os.environ['AZURE_API_TYPE'] = "azure" # [OPTIONAL]
|
|||
os.environ['OPENAI_API_BASE'] = "https://openai-gpt-4-test2-v-12.openai.azure.com/"
|
||||
```
|
||||
|
||||
### Setting Project, Location, Token
|
||||
|
||||
For cloud providers:
|
||||
- Azure
|
||||
- Bedrock
|
||||
- GCP
|
||||
- Watson AI
|
||||
|
||||
you might need to set additional parameters. LiteLLM provides a common set of params that we map across all providers.
|
||||
|
||||
| | LiteLLM param | Watson | Vertex AI | Azure | Bedrock |
|
||||
|------|--------------|--------------|--------------|--------------|--------------|
|
||||
| Project | project | watsonx_project | vertex_project | n/a | n/a |
|
||||
| Region | region_name | watsonx_region_name | vertex_location | n/a | aws_region_name |
|
||||
| Token | token | watsonx_token or token | n/a | azure_ad_token | n/a |
|
||||
|
||||
If you want, you can call them by their provider-specific params as well.
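For example, a sketch of calling a Vertex AI model with the common params (values are placeholders; per the table above they map to `vertex_project` / `vertex_location`):

```python
from litellm import completion

response = completion(
    model="vertex_ai/gemini-pro",   # placeholder model name
    messages=[{"role": "user", "content": "hello"}],
    project="my-gcp-project",       # common param -> vertex_project
    region_name="us-central1",      # common param -> vertex_location
)
print(response)
```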
|
||||
|
||||
## litellm variables
|
||||
|
||||
### litellm.api_key
|
||||
|
|
|
@ -105,6 +105,12 @@ const config = {
|
|||
label: 'Enterprise',
|
||||
to: "docs/enterprise"
|
||||
},
|
||||
{
|
||||
sidebarId: 'tutorialSidebar',
|
||||
position: 'left',
|
||||
label: '🚀 Hosted',
|
||||
to: "docs/hosted"
|
||||
},
|
||||
{
|
||||
href: 'https://github.com/BerriAI/litellm',
|
||||
label: 'GitHub',
|
||||
|
|
BIN
docs/my-website/img/litellm_hosted_ui_add_models.png
Normal file
BIN
docs/my-website/img/litellm_hosted_ui_add_models.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 398 KiB |
BIN
docs/my-website/img/litellm_hosted_ui_create_key.png
Normal file
BIN
docs/my-website/img/litellm_hosted_ui_create_key.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 496 KiB |
BIN
docs/my-website/img/litellm_hosted_ui_router.png
Normal file
BIN
docs/my-website/img/litellm_hosted_ui_router.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 348 KiB |
BIN
docs/my-website/img/litellm_hosted_usage_dashboard.png
Normal file
BIN
docs/my-website/img/litellm_hosted_usage_dashboard.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 460 KiB |
|
@ -43,6 +43,12 @@ const sidebars = {
|
|||
"proxy/user_keys",
|
||||
"proxy/enterprise",
|
||||
"proxy/virtual_keys",
|
||||
"proxy/alerting",
|
||||
{
|
||||
type: "category",
|
||||
label: "Logging",
|
||||
items: ["proxy/logging", "proxy/streaming_logging"],
|
||||
},
|
||||
"proxy/team_based_routing",
|
||||
"proxy/ui",
|
||||
"proxy/cost_tracking",
|
||||
|
@ -58,12 +64,7 @@ const sidebars = {
|
|||
"proxy/pii_masking",
|
||||
"proxy/prompt_injection",
|
||||
"proxy/caching",
|
||||
{
|
||||
type: "category",
|
||||
label: "Logging, Alerting",
|
||||
items: ["proxy/logging", "proxy/alerting", "proxy/streaming_logging"],
|
||||
},
|
||||
"proxy/grafana_metrics",
|
||||
"proxy/prometheus",
|
||||
"proxy/call_hooks",
|
||||
"proxy/rules",
|
||||
"proxy/cli",
|
||||
|
@ -148,6 +149,7 @@ const sidebars = {
|
|||
"providers/openrouter",
|
||||
"providers/custom_openai_proxy",
|
||||
"providers/petals",
|
||||
"providers/watsonx",
|
||||
],
|
||||
},
|
||||
"proxy/custom_pricing",
|
||||
|
@ -173,6 +175,7 @@ const sidebars = {
|
|||
"observability/langsmith_integration",
|
||||
"observability/slack_integration",
|
||||
"observability/traceloop_integration",
|
||||
"observability/athina_integration",
|
||||
"observability/lunary_integration",
|
||||
"observability/athina_integration",
|
||||
"observability/helicone_integration",
|
||||
|
|
|
@ -16,7 +16,7 @@ However, we also expose 3 public helper functions to calculate token usage acros
|
|||
```python
|
||||
from litellm import token_counter
|
||||
|
||||
messages = [{"user": "role", "content": "Hey, how's it going"}]
|
||||
messages = [{"role": "user", "content": "Hey, how's it going"}]
|
||||
print(token_counter(model="gpt-3.5-turbo", messages=messages))
|
||||
```
|
||||
|
||||
|
|
8
litellm-js/spend-logs/package-lock.json
generated
8
litellm-js/spend-logs/package-lock.json
generated
|
@ -6,7 +6,7 @@
|
|||
"": {
|
||||
"dependencies": {
|
||||
"@hono/node-server": "^1.9.0",
|
||||
"hono": "^4.1.5"
|
||||
"hono": "^4.2.7"
|
||||
},
|
||||
"devDependencies": {
|
||||
"@types/node": "^20.11.17",
|
||||
|
@ -463,9 +463,9 @@
|
|||
}
|
||||
},
|
||||
"node_modules/hono": {
|
||||
"version": "4.1.5",
|
||||
"resolved": "https://registry.npmjs.org/hono/-/hono-4.1.5.tgz",
|
||||
"integrity": "sha512-3ChJiIoeCxvkt6vnkxJagplrt1YZg3NyNob7ssVeK2PUqEINp4q1F94HzFnvY9QE8asVmbW5kkTDlyWylfg2vg==",
|
||||
"version": "4.2.7",
|
||||
"resolved": "https://registry.npmjs.org/hono/-/hono-4.2.7.tgz",
|
||||
"integrity": "sha512-k1xHi86tJnRIVvqhFMBDGFKJ8r5O+bEsT4P59ZK59r0F300Xd910/r237inVfuT/VmE86RQQffX4OYNda6dLXw==",
|
||||
"engines": {
|
||||
"node": ">=16.0.0"
|
||||
}
|
||||
|
|
|
@ -4,7 +4,7 @@
|
|||
},
|
||||
"dependencies": {
|
||||
"@hono/node-server": "^1.9.0",
|
||||
"hono": "^4.1.5"
|
||||
"hono": "^4.2.7"
|
||||
},
|
||||
"devDependencies": {
|
||||
"@types/node": "^20.11.17",
|
||||
|
|
|
@ -2,7 +2,7 @@
|
|||
import threading, requests, os
|
||||
from typing import Callable, List, Optional, Dict, Union, Any, Literal
|
||||
from litellm.caching import Cache
|
||||
from litellm._logging import set_verbose, _turn_on_debug, verbose_logger
|
||||
from litellm._logging import set_verbose, _turn_on_debug, verbose_logger, json_logs
|
||||
from litellm.proxy._types import (
|
||||
KeyManagementSystem,
|
||||
KeyManagementSettings,
|
||||
|
@ -16,11 +16,24 @@ dotenv.load_dotenv()
|
|||
if set_verbose == True:
|
||||
_turn_on_debug()
|
||||
#############################################
|
||||
### Callbacks /Logging / Success / Failure Handlers ###
|
||||
input_callback: List[Union[str, Callable]] = []
|
||||
success_callback: List[Union[str, Callable]] = []
|
||||
failure_callback: List[Union[str, Callable]] = []
|
||||
service_callback: List[Union[str, Callable]] = []
|
||||
callbacks: List[Callable] = []
|
||||
_langfuse_default_tags: Optional[
|
||||
List[
|
||||
Literal[
|
||||
"user_api_key_alias",
|
||||
"user_api_key_user_id",
|
||||
"user_api_key_user_email",
|
||||
"user_api_key_team_alias",
|
||||
"semantic-similarity",
|
||||
"proxy_base_url",
|
||||
]
|
||||
]
|
||||
] = None
|
||||
_async_input_callback: List[Callable] = (
|
||||
[]
|
||||
) # internal variable - async custom callbacks are routed here.
|
||||
|
@ -32,6 +45,9 @@ _async_failure_callback: List[Callable] = (
|
|||
) # internal variable - async custom callbacks are routed here.
|
||||
pre_call_rules: List[Callable] = []
|
||||
post_call_rules: List[Callable] = []
|
||||
turn_off_message_logging: Optional[bool] = False
|
||||
## end of callbacks #############
|
||||
|
||||
email: Optional[str] = (
|
||||
None # Not used anymore, will be removed in next MAJOR release - https://github.com/BerriAI/litellm/discussions/648
|
||||
)
|
||||
|
@ -43,6 +59,7 @@ max_tokens = 256 # OpenAI Defaults
|
|||
drop_params = False
|
||||
modify_params = False
|
||||
retry = True
|
||||
### AUTH ###
|
||||
api_key: Optional[str] = None
|
||||
openai_key: Optional[str] = None
|
||||
azure_key: Optional[str] = None
|
||||
|
@ -52,6 +69,7 @@ cohere_key: Optional[str] = None
|
|||
clarifai_key: Optional[str] = None
|
||||
maritalk_key: Optional[str] = None
|
||||
ai21_key: Optional[str] = None
|
||||
ollama_key: Optional[str] = None
|
||||
openrouter_key: Optional[str] = None
|
||||
huggingface_key: Optional[str] = None
|
||||
vertex_project: Optional[str] = None
|
||||
|
@ -61,7 +79,12 @@ cloudflare_api_key: Optional[str] = None
|
|||
baseten_key: Optional[str] = None
|
||||
aleph_alpha_key: Optional[str] = None
|
||||
nlp_cloud_key: Optional[str] = None
|
||||
common_cloud_provider_auth_params: dict = {
|
||||
"params": ["project", "region_name", "token"],
|
||||
"providers": ["vertex_ai", "bedrock", "watsonx", "azure"],
|
||||
}
|
||||
use_client: bool = False
|
||||
ssl_verify: bool = True
|
||||
disable_streaming_logging: bool = False
|
||||
### GUARDRAILS ###
|
||||
llamaguard_model_name: Optional[str] = None
|
||||
|
@ -283,6 +306,7 @@ aleph_alpha_models: List = []
|
|||
bedrock_models: List = []
|
||||
deepinfra_models: List = []
|
||||
perplexity_models: List = []
|
||||
watsonx_models: List = []
|
||||
for key, value in model_cost.items():
|
||||
if value.get("litellm_provider") == "openai":
|
||||
open_ai_chat_completion_models.append(key)
|
||||
|
@ -327,6 +351,8 @@ for key, value in model_cost.items():
|
|||
deepinfra_models.append(key)
|
||||
elif value.get("litellm_provider") == "perplexity":
|
||||
perplexity_models.append(key)
|
||||
elif value.get("litellm_provider") == "watsonx":
|
||||
watsonx_models.append(key)
|
||||
|
||||
# known openai compatible endpoints - we'll eventually move this list to the model_prices_and_context_window.json dictionary
|
||||
openai_compatible_endpoints: List = [
|
||||
|
@ -530,6 +556,7 @@ model_list = (
|
|||
+ perplexity_models
|
||||
+ maritalk_models
|
||||
+ vertex_language_models
|
||||
+ watsonx_models
|
||||
)
|
||||
|
||||
provider_list: List = [
|
||||
|
@ -569,6 +596,7 @@ provider_list: List = [
|
|||
"cloudflare",
|
||||
"xinference",
|
||||
"fireworks_ai",
|
||||
"watsonx",
|
||||
"custom", # custom apis
|
||||
]
|
||||
|
||||
|
@ -590,6 +618,7 @@ models_by_provider: dict = {
|
|||
"deepinfra": deepinfra_models,
|
||||
"perplexity": perplexity_models,
|
||||
"maritalk": maritalk_models,
|
||||
"watsonx": watsonx_models,
|
||||
}
|
||||
|
||||
# mapping for those models which have larger equivalents
|
||||
|
@ -701,9 +730,11 @@ from .llms.bedrock import (
|
|||
AmazonLlamaConfig,
|
||||
AmazonStabilityConfig,
|
||||
AmazonMistralConfig,
|
||||
AmazonBedrockGlobalConfig,
|
||||
)
|
||||
from .llms.openai import OpenAIConfig, OpenAITextCompletionConfig
|
||||
from .llms.azure import AzureOpenAIConfig, AzureOpenAIError
|
||||
from .llms.watsonx import IBMWatsonXAIConfig
|
||||
from .main import * # type: ignore
|
||||
from .integrations import *
|
||||
from .exceptions import (
|
||||
|
|
|
@ -1,7 +1,7 @@
|
|||
import logging
|
||||
|
||||
set_verbose = False
|
||||
|
||||
json_logs = False
|
||||
# Create a handler for the logger (you may need to adapt this based on your needs)
|
||||
handler = logging.StreamHandler()
|
||||
handler.setLevel(logging.DEBUG)
|
||||
|
|
|
@ -32,6 +32,25 @@ def _get_redis_kwargs():
|
|||
return available_args
|
||||
|
||||
|
||||
def _get_redis_url_kwargs(client=None):
|
||||
if client is None:
|
||||
client = redis.Redis.from_url
|
||||
arg_spec = inspect.getfullargspec(client)
|
||||
|
||||
# Only allow primitive arguments
|
||||
exclude_args = {
|
||||
"self",
|
||||
"connection_pool",
|
||||
"retry",
|
||||
}
|
||||
|
||||
include_args = ["url"]
|
||||
|
||||
available_args = [x for x in arg_spec.args if x not in exclude_args] + include_args
|
||||
|
||||
return available_args
|
||||
|
||||
|
||||
def _get_redis_env_kwarg_mapping():
|
||||
PREFIX = "REDIS_"
|
||||
|
||||
|
@ -91,27 +110,39 @@ def _get_redis_client_logic(**env_overrides):
|
|||
redis_kwargs.pop("password", None)
|
||||
elif "host" not in redis_kwargs or redis_kwargs["host"] is None:
|
||||
raise ValueError("Either 'host' or 'url' must be specified for redis.")
|
||||
litellm.print_verbose(f"redis_kwargs: {redis_kwargs}")
|
||||
# litellm.print_verbose(f"redis_kwargs: {redis_kwargs}")
|
||||
return redis_kwargs
|
||||
|
||||
|
||||
def get_redis_client(**env_overrides):
|
||||
redis_kwargs = _get_redis_client_logic(**env_overrides)
|
||||
if "url" in redis_kwargs and redis_kwargs["url"] is not None:
|
||||
redis_kwargs.pop(
|
||||
"connection_pool", None
|
||||
) # redis.from_url doesn't support setting your own connection pool
|
||||
return redis.Redis.from_url(**redis_kwargs)
|
||||
args = _get_redis_url_kwargs()
|
||||
url_kwargs = {}
|
||||
for arg in redis_kwargs:
|
||||
if arg in args:
|
||||
url_kwargs[arg] = redis_kwargs[arg]
|
||||
|
||||
return redis.Redis.from_url(**url_kwargs)
|
||||
return redis.Redis(**redis_kwargs)
|
||||
|
||||
|
||||
def get_redis_async_client(**env_overrides):
|
||||
redis_kwargs = _get_redis_client_logic(**env_overrides)
|
||||
if "url" in redis_kwargs and redis_kwargs["url"] is not None:
|
||||
redis_kwargs.pop(
|
||||
"connection_pool", None
|
||||
) # redis.from_url doesn't support setting your own connection pool
|
||||
return async_redis.Redis.from_url(**redis_kwargs)
|
||||
args = _get_redis_url_kwargs(client=async_redis.Redis.from_url)
|
||||
url_kwargs = {}
|
||||
for arg in redis_kwargs:
|
||||
if arg in args:
|
||||
url_kwargs[arg] = redis_kwargs[arg]
|
||||
else:
|
||||
litellm.print_verbose(
|
||||
"REDIS: ignoring argument: {}. Not an allowed async_redis.Redis.from_url arg.".format(
|
||||
arg
|
||||
)
|
||||
)
|
||||
return async_redis.Redis.from_url(**url_kwargs)
|
||||
|
||||
return async_redis.Redis(
|
||||
socket_timeout=5,
|
||||
**redis_kwargs,
|
||||
|
@ -124,4 +155,9 @@ def get_redis_connection_pool(**env_overrides):
|
|||
return async_redis.BlockingConnectionPool.from_url(
|
||||
timeout=5, url=redis_kwargs["url"]
|
||||
)
|
||||
connection_class = async_redis.Connection
|
||||
if "ssl" in redis_kwargs and redis_kwargs["ssl"] is not None:
|
||||
connection_class = async_redis.SSLConnection
|
||||
redis_kwargs.pop("ssl", None)
|
||||
redis_kwargs["connection_class"] = connection_class
|
||||
return async_redis.BlockingConnectionPool(timeout=5, **redis_kwargs)
|
||||
|
|
|
@ -1,9 +1,13 @@
|
|||
import litellm
|
||||
import litellm, traceback
|
||||
from litellm.proxy._types import UserAPIKeyAuth
|
||||
from .types.services import ServiceTypes, ServiceLoggerPayload
|
||||
from .integrations.prometheus_services import PrometheusServicesLogger
|
||||
from .integrations.custom_logger import CustomLogger
|
||||
from datetime import timedelta
|
||||
from typing import Union
|
||||
|
||||
|
||||
class ServiceLogging:
|
||||
class ServiceLogging(CustomLogger):
|
||||
"""
|
||||
Separate class used for monitoring health of litellm-adjacent services (redis/postgres).
|
||||
"""
|
||||
|
@ -14,11 +18,12 @@ class ServiceLogging:
|
|||
self.mock_testing_async_success_hook = 0
|
||||
self.mock_testing_sync_failure_hook = 0
|
||||
self.mock_testing_async_failure_hook = 0
|
||||
|
||||
if "prometheus_system" in litellm.service_callback:
|
||||
self.prometheusServicesLogger = PrometheusServicesLogger()
|
||||
|
||||
def service_success_hook(self, service: ServiceTypes, duration: float):
|
||||
def service_success_hook(
|
||||
self, service: ServiceTypes, duration: float, call_type: str
|
||||
):
|
||||
"""
|
||||
[TODO] Not implemented for sync calls yet. V0 is focused on async monitoring (used by proxy).
|
||||
"""
|
||||
|
@ -26,7 +31,7 @@ class ServiceLogging:
|
|||
self.mock_testing_sync_success_hook += 1
|
||||
|
||||
def service_failure_hook(
|
||||
self, service: ServiceTypes, duration: float, error: Exception
|
||||
self, service: ServiceTypes, duration: float, error: Exception, call_type: str
|
||||
):
|
||||
"""
|
||||
[TODO] Not implemented for sync calls yet. V0 is focused on async monitoring (used by proxy).
|
||||
|
@ -34,7 +39,9 @@ class ServiceLogging:
|
|||
if self.mock_testing:
|
||||
self.mock_testing_sync_failure_hook += 1
|
||||
|
||||
async def async_service_success_hook(self, service: ServiceTypes, duration: float):
|
||||
async def async_service_success_hook(
|
||||
self, service: ServiceTypes, duration: float, call_type: str
|
||||
):
|
||||
"""
|
||||
- For counting if the redis, postgres call is successful
|
||||
"""
|
||||
|
@ -42,7 +49,11 @@ class ServiceLogging:
|
|||
self.mock_testing_async_success_hook += 1
|
||||
|
||||
payload = ServiceLoggerPayload(
|
||||
is_error=False, error=None, service=service, duration=duration
|
||||
is_error=False,
|
||||
error=None,
|
||||
service=service,
|
||||
duration=duration,
|
||||
call_type=call_type,
|
||||
)
|
||||
for callback in litellm.service_callback:
|
||||
if callback == "prometheus_system":
|
||||
|
@ -51,7 +62,11 @@ class ServiceLogging:
|
|||
)
|
||||
|
||||
async def async_service_failure_hook(
|
||||
self, service: ServiceTypes, duration: float, error: Exception
|
||||
self,
|
||||
service: ServiceTypes,
|
||||
duration: float,
|
||||
error: Union[str, Exception],
|
||||
call_type: str,
|
||||
):
|
||||
"""
|
||||
- For counting if the redis, postgres call is unsuccessful
|
||||
|
@ -59,8 +74,18 @@ class ServiceLogging:
|
|||
if self.mock_testing:
|
||||
self.mock_testing_async_failure_hook += 1
|
||||
|
||||
error_message = ""
|
||||
if isinstance(error, Exception):
|
||||
error_message = str(error)
|
||||
elif isinstance(error, str):
|
||||
error_message = error
|
||||
|
||||
payload = ServiceLoggerPayload(
|
||||
is_error=True, error=str(error), service=service, duration=duration
|
||||
is_error=True,
|
||||
error=error_message,
|
||||
service=service,
|
||||
duration=duration,
|
||||
call_type=call_type,
|
||||
)
|
||||
for callback in litellm.service_callback:
|
||||
if callback == "prometheus_system":
|
||||
|
@ -69,3 +94,37 @@ class ServiceLogging:
|
|||
await self.prometheusServicesLogger.async_service_failure_hook(
|
||||
payload=payload
|
||||
)
|
||||
|
||||
async def async_post_call_failure_hook(
|
||||
self, original_exception: Exception, user_api_key_dict: UserAPIKeyAuth
|
||||
):
|
||||
"""
|
||||
Hook to track failed litellm-service calls
|
||||
"""
|
||||
return await super().async_post_call_failure_hook(
|
||||
original_exception, user_api_key_dict
|
||||
)
|
||||
|
||||
async def async_log_success_event(self, kwargs, response_obj, start_time, end_time):
|
||||
"""
|
||||
Hook to track latency for litellm proxy llm api calls
|
||||
"""
|
||||
try:
|
||||
_duration = end_time - start_time
|
||||
if isinstance(_duration, timedelta):
|
||||
_duration = _duration.total_seconds()
|
||||
elif isinstance(_duration, float):
|
||||
pass
|
||||
else:
|
||||
raise Exception(
|
||||
"Duration={} is not a float or timedelta object. type={}".format(
|
||||
_duration, type(_duration)
|
||||
)
|
||||
) # invalid _duration value
|
||||
await self.async_service_success_hook(
|
||||
service=ServiceTypes.LITELLM,
|
||||
duration=_duration,
|
||||
call_type=kwargs["call_type"],
|
||||
)
|
||||
except Exception as e:
|
||||
raise e
|
||||
|
|
|
@ -13,7 +13,6 @@ import json, traceback, ast, hashlib
|
|||
from typing import Optional, Literal, List, Union, Any, BinaryIO
|
||||
from openai._models import BaseModel as OpenAIObject
|
||||
from litellm._logging import verbose_logger
|
||||
from litellm._service_logger import ServiceLogging
|
||||
from litellm.types.services import ServiceLoggerPayload, ServiceTypes
|
||||
import traceback
|
||||
|
||||
|
@ -90,6 +89,13 @@ class InMemoryCache(BaseCache):
|
|||
return_val.append(val)
|
||||
return return_val
|
||||
|
||||
def increment_cache(self, key, value: int, **kwargs) -> int:
|
||||
# get the value
|
||||
init_value = self.get_cache(key=key) or 0
|
||||
value = init_value + value
|
||||
self.set_cache(key, value, **kwargs)
|
||||
return value
|
||||
|
||||
async def async_get_cache(self, key, **kwargs):
|
||||
return self.get_cache(key=key, **kwargs)
|
||||
|
||||
|
@ -132,6 +138,7 @@ class RedisCache(BaseCache):
|
|||
**kwargs,
|
||||
):
|
||||
from ._redis import get_redis_client, get_redis_connection_pool
|
||||
from litellm._service_logger import ServiceLogging
|
||||
import redis
|
||||
|
||||
redis_kwargs = {}
|
||||
|
@ -142,18 +149,19 @@ class RedisCache(BaseCache):
|
|||
if password is not None:
|
||||
redis_kwargs["password"] = password
|
||||
|
||||
### HEALTH MONITORING OBJECT ###
|
||||
if kwargs.get("service_logger_obj", None) is not None and isinstance(
|
||||
kwargs["service_logger_obj"], ServiceLogging
|
||||
):
|
||||
self.service_logger_obj = kwargs.pop("service_logger_obj")
|
||||
else:
|
||||
self.service_logger_obj = ServiceLogging()
|
||||
|
||||
redis_kwargs.update(kwargs)
|
||||
self.redis_client = get_redis_client(**redis_kwargs)
|
||||
self.redis_kwargs = redis_kwargs
|
||||
self.async_redis_conn_pool = get_redis_connection_pool(**redis_kwargs)
|
||||
|
||||
if "url" in redis_kwargs and redis_kwargs["url"] is not None:
|
||||
parsed_kwargs = redis.connection.parse_url(redis_kwargs["url"])
|
||||
redis_kwargs.update(parsed_kwargs)
|
||||
self.redis_kwargs.update(parsed_kwargs)
|
||||
# pop url
|
||||
self.redis_kwargs.pop("url")
|
||||
|
||||
# redis namespaces
|
||||
self.namespace = namespace
|
||||
# for high traffic, we store the redis results in memory and then batch write to redis
|
||||
|
@ -165,8 +173,15 @@ class RedisCache(BaseCache):
|
|||
except Exception as e:
|
||||
pass
|
||||
|
||||
### HEALTH MONITORING OBJECT ###
|
||||
self.service_logger_obj = ServiceLogging()
|
||||
### ASYNC HEALTH PING ###
|
||||
try:
|
||||
# asyncio.get_running_loop().create_task(self.ping())
|
||||
result = asyncio.get_running_loop().create_task(self.ping())
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
### SYNC HEALTH PING ###
|
||||
self.redis_client.ping()
|
||||
|
||||
def init_async_client(self):
|
||||
from ._redis import get_redis_async_client
|
||||
|
@ -198,6 +213,42 @@ class RedisCache(BaseCache):
|
|||
f"LiteLLM Caching: set() - Got exception from REDIS : {str(e)}"
|
||||
)
|
||||
|
||||
def increment_cache(self, key, value: int, **kwargs) -> int:
|
||||
_redis_client = self.redis_client
|
||||
start_time = time.time()
|
||||
try:
|
||||
result = _redis_client.incr(name=key, amount=value)
|
||||
## LOGGING ##
|
||||
end_time = time.time()
|
||||
_duration = end_time - start_time
|
||||
asyncio.create_task(
|
||||
self.service_logger_obj.service_success_hook(
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
call_type="increment_cache",
|
||||
)
|
||||
)
|
||||
return result
|
||||
except Exception as e:
|
||||
## LOGGING ##
|
||||
end_time = time.time()
|
||||
_duration = end_time - start_time
|
||||
asyncio.create_task(
|
||||
self.service_logger_obj.async_service_failure_hook(
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
error=e,
|
||||
call_type="increment_cache",
|
||||
)
|
||||
)
|
||||
verbose_logger.error(
|
||||
"LiteLLM Redis Caching: increment_cache() - Got exception from REDIS %s, Writing value=%s",
|
||||
str(e),
|
||||
value,
|
||||
)
|
||||
traceback.print_exc()
|
||||
raise e
|
||||
|
||||
async def async_scan_iter(self, pattern: str, count: int = 100) -> list:
|
||||
start_time = time.time()
|
||||
try:
|
||||
|
@ -216,7 +267,9 @@ class RedisCache(BaseCache):
|
|||
_duration = end_time - start_time
|
||||
asyncio.create_task(
|
||||
self.service_logger_obj.async_service_success_hook(
|
||||
service=ServiceTypes.REDIS, duration=_duration
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
call_type="async_scan_iter",
|
||||
)
|
||||
) # DO NOT SLOW DOWN CALL B/C OF THIS
|
||||
return keys
|
||||
|
@ -227,7 +280,10 @@ class RedisCache(BaseCache):
|
|||
_duration = end_time - start_time
|
||||
asyncio.create_task(
|
||||
self.service_logger_obj.async_service_failure_hook(
|
||||
service=ServiceTypes.REDIS, duration=_duration, error=e
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
error=e,
|
||||
call_type="async_scan_iter",
|
||||
)
|
||||
)
|
||||
raise e
|
||||
|
@ -267,7 +323,9 @@ class RedisCache(BaseCache):
|
|||
_duration = end_time - start_time
|
||||
asyncio.create_task(
|
||||
self.service_logger_obj.async_service_success_hook(
|
||||
service=ServiceTypes.REDIS, duration=_duration
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
call_type="async_set_cache",
|
||||
)
|
||||
)
|
||||
except Exception as e:
|
||||
|
@ -275,7 +333,10 @@ class RedisCache(BaseCache):
|
|||
_duration = end_time - start_time
|
||||
asyncio.create_task(
|
||||
self.service_logger_obj.async_service_failure_hook(
|
||||
service=ServiceTypes.REDIS, duration=_duration, error=e
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
error=e,
|
||||
call_type="async_set_cache",
|
||||
)
|
||||
)
|
||||
# NON blocking - notify users Redis is throwing an exception
|
||||
|
@ -292,6 +353,10 @@ class RedisCache(BaseCache):
|
|||
"""
|
||||
_redis_client = self.init_async_client()
|
||||
start_time = time.time()
|
||||
|
||||
print_verbose(
|
||||
f"Set Async Redis Cache: key list: {cache_list}\nttl={ttl}, redis_version={self.redis_version}"
|
||||
)
|
||||
try:
|
||||
async with _redis_client as redis_client:
|
||||
async with redis_client.pipeline(transaction=True) as pipe:
|
||||
|
@ -316,7 +381,9 @@ class RedisCache(BaseCache):
|
|||
_duration = end_time - start_time
|
||||
asyncio.create_task(
|
||||
self.service_logger_obj.async_service_success_hook(
|
||||
service=ServiceTypes.REDIS, duration=_duration
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
call_type="async_set_cache_pipeline",
|
||||
)
|
||||
)
|
||||
return results
|
||||
|
@ -326,7 +393,10 @@ class RedisCache(BaseCache):
|
|||
_duration = end_time - start_time
|
||||
asyncio.create_task(
|
||||
self.service_logger_obj.async_service_failure_hook(
|
||||
service=ServiceTypes.REDIS, duration=_duration, error=e
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
error=e,
|
||||
call_type="async_set_cache_pipeline",
|
||||
)
|
||||
)
|
||||
|
||||
|
@ -359,6 +429,7 @@ class RedisCache(BaseCache):
|
|||
self.service_logger_obj.async_service_success_hook(
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
call_type="async_increment",
|
||||
)
|
||||
)
|
||||
return result
|
||||
|
@ -368,7 +439,10 @@ class RedisCache(BaseCache):
|
|||
_duration = end_time - start_time
|
||||
asyncio.create_task(
|
||||
self.service_logger_obj.async_service_failure_hook(
|
||||
service=ServiceTypes.REDIS, duration=_duration, error=e
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
error=e,
|
||||
call_type="async_increment",
|
||||
)
|
||||
)
|
||||
verbose_logger.error(
|
||||
|
@ -459,7 +533,9 @@ class RedisCache(BaseCache):
|
|||
_duration = end_time - start_time
|
||||
asyncio.create_task(
|
||||
self.service_logger_obj.async_service_success_hook(
|
||||
service=ServiceTypes.REDIS, duration=_duration
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
call_type="async_get_cache",
|
||||
)
|
||||
)
|
||||
return response
|
||||
|
@ -469,7 +545,10 @@ class RedisCache(BaseCache):
|
|||
_duration = end_time - start_time
|
||||
asyncio.create_task(
|
||||
self.service_logger_obj.async_service_failure_hook(
|
||||
service=ServiceTypes.REDIS, duration=_duration, error=e
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
error=e,
|
||||
call_type="async_get_cache",
|
||||
)
|
||||
)
|
||||
# NON blocking - notify users Redis is throwing an exception
|
||||
|
@ -497,7 +576,9 @@ class RedisCache(BaseCache):
|
|||
_duration = end_time - start_time
|
||||
asyncio.create_task(
|
||||
self.service_logger_obj.async_service_success_hook(
|
||||
service=ServiceTypes.REDIS, duration=_duration
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
call_type="async_batch_get_cache",
|
||||
)
|
||||
)
|
||||
|
||||
|
@ -519,21 +600,81 @@ class RedisCache(BaseCache):
|
|||
_duration = end_time - start_time
|
||||
asyncio.create_task(
|
||||
self.service_logger_obj.async_service_failure_hook(
|
||||
service=ServiceTypes.REDIS, duration=_duration, error=e
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
error=e,
|
||||
call_type="async_batch_get_cache",
|
||||
)
|
||||
)
|
||||
print_verbose(f"Error occurred in pipeline read - {str(e)}")
|
||||
return key_value_dict
|
||||
|
||||
async def ping(self):
|
||||
def sync_ping(self) -> bool:
|
||||
"""
|
||||
Tests if the sync redis client is correctly setup.
|
||||
"""
|
||||
print_verbose(f"Pinging Sync Redis Cache")
|
||||
start_time = time.time()
|
||||
try:
|
||||
response = self.redis_client.ping()
|
||||
print_verbose(f"Redis Cache PING: {response}")
|
||||
## LOGGING ##
|
||||
end_time = time.time()
|
||||
_duration = end_time - start_time
|
||||
self.service_logger_obj.service_success_hook(
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
call_type="sync_ping",
|
||||
)
|
||||
return response
|
||||
except Exception as e:
|
||||
# NON blocking - notify users Redis is throwing an exception
|
||||
## LOGGING ##
|
||||
end_time = time.time()
|
||||
_duration = end_time - start_time
|
||||
self.service_logger_obj.service_failure_hook(
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
error=e,
|
||||
call_type="sync_ping",
|
||||
)
|
||||
print_verbose(
|
||||
f"LiteLLM Redis Cache PING: - Got exception from REDIS : {str(e)}"
|
||||
)
|
||||
traceback.print_exc()
|
||||
raise e
|
||||
|
||||
async def ping(self) -> bool:
|
||||
_redis_client = self.init_async_client()
|
||||
start_time = time.time()
|
||||
async with _redis_client as redis_client:
|
||||
print_verbose(f"Pinging Async Redis Cache")
|
||||
try:
|
||||
response = await redis_client.ping()
|
||||
print_verbose(f"Redis Cache PING: {response}")
|
||||
## LOGGING ##
|
||||
end_time = time.time()
|
||||
_duration = end_time - start_time
|
||||
asyncio.create_task(
|
||||
self.service_logger_obj.async_service_success_hook(
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
call_type="async_ping",
|
||||
)
|
||||
)
|
||||
return response
|
||||
except Exception as e:
|
||||
# NON blocking - notify users Redis is throwing an exception
|
||||
## LOGGING ##
|
||||
end_time = time.time()
|
||||
_duration = end_time - start_time
|
||||
asyncio.create_task(
|
||||
self.service_logger_obj.async_service_failure_hook(
|
||||
service=ServiceTypes.REDIS,
|
||||
duration=_duration,
|
||||
error=e,
|
||||
call_type="async_ping",
|
||||
)
|
||||
)
|
||||
print_verbose(
|
||||
f"LiteLLM Redis Cache PING: - Got exception from REDIS : {str(e)}"
|
||||
)
|
||||
|
@ -1064,6 +1205,30 @@ class DualCache(BaseCache):
|
|||
except Exception as e:
|
||||
print_verbose(e)
|
||||
|
||||
def increment_cache(
|
||||
self, key, value: int, local_only: bool = False, **kwargs
|
||||
) -> int:
|
||||
"""
|
||||
Key - the key in cache
|
||||
|
||||
Value - int - the value you want to increment by
|
||||
|
||||
Returns - int - the incremented value
|
||||
"""
|
||||
try:
|
||||
result: int = value
|
||||
if self.in_memory_cache is not None:
|
||||
result = self.in_memory_cache.increment_cache(key, value, **kwargs)
|
||||
|
||||
if self.redis_cache is not None and local_only == False:
|
||||
result = self.redis_cache.increment_cache(key, value, **kwargs)
|
||||
|
||||
return result
|
||||
except Exception as e:
|
||||
print_verbose(f"LiteLLM Cache: Excepton async add_cache: {str(e)}")
|
||||
traceback.print_exc()
|
||||
raise e
|
||||
|
||||
def get_cache(self, key, local_only: bool = False, **kwargs):
|
||||
# Try to fetch from in-memory cache first
|
||||
try:
|
||||
|
@ -1116,7 +1281,7 @@ class DualCache(BaseCache):
|
|||
self.in_memory_cache.set_cache(key, redis_result[key], **kwargs)
|
||||
|
||||
for key, value in redis_result.items():
|
||||
result[sublist_keys.index(key)] = value
|
||||
result[keys.index(key)] = value
|
||||
|
||||
print_verbose(f"async batch get cache: cache result: {result}")
|
||||
return result
|
||||
|
@ -1166,10 +1331,8 @@ class DualCache(BaseCache):
|
|||
keys, **kwargs
|
||||
)
|
||||
|
||||
print_verbose(f"in_memory_result: {in_memory_result}")
|
||||
if in_memory_result is not None:
|
||||
result = in_memory_result
|
||||
|
||||
if None in result and self.redis_cache is not None and local_only == False:
|
||||
"""
|
||||
- for the none values in the result
|
||||
|
@ -1185,22 +1348,23 @@ class DualCache(BaseCache):
|
|||
|
||||
if redis_result is not None:
|
||||
# Update in-memory cache with the value from Redis
|
||||
for key in redis_result:
|
||||
await self.in_memory_cache.async_set_cache(
|
||||
key, redis_result[key], **kwargs
|
||||
)
|
||||
for key, value in redis_result.items():
|
||||
if value is not None:
|
||||
await self.in_memory_cache.async_set_cache(
|
||||
key, redis_result[key], **kwargs
|
||||
)
|
||||
for key, value in redis_result.items():
|
||||
index = keys.index(key)
|
||||
result[index] = value
|
||||
|
||||
sublist_dict = dict(zip(sublist_keys, redis_result))
|
||||
|
||||
for key, value in sublist_dict.items():
|
||||
result[sublist_keys.index(key)] = value
|
||||
|
||||
print_verbose(f"async batch get cache: cache result: {result}")
|
||||
return result
|
||||
except Exception as e:
|
||||
traceback.print_exc()
|
||||
|
||||
async def async_set_cache(self, key, value, local_only: bool = False, **kwargs):
|
||||
print_verbose(
|
||||
f"async set cache: cache key: {key}; local_only: {local_only}; value: {value}"
|
||||
)
|
||||
try:
|
||||
if self.in_memory_cache is not None:
|
||||
await self.in_memory_cache.async_set_cache(key, value, **kwargs)
|
||||
|
|
|
@ -82,14 +82,18 @@ class UnprocessableEntityError(UnprocessableEntityError): # type: ignore
|
|||
|
||||
class Timeout(APITimeoutError): # type: ignore
|
||||
def __init__(self, message, model, llm_provider):
|
||||
self.status_code = 408
|
||||
self.message = message
|
||||
self.model = model
|
||||
self.llm_provider = llm_provider
|
||||
request = httpx.Request(method="POST", url="https://api.openai.com/v1")
|
||||
super().__init__(
|
||||
request=request
|
||||
) # Call the base class constructor with the parameters it needs
|
||||
self.status_code = 408
|
||||
self.message = message
|
||||
self.model = model
|
||||
self.llm_provider = llm_provider
|
||||
|
||||
# custom function to convert to str
|
||||
def __str__(self):
|
||||
return str(self.message)
|
||||
|
||||
|
||||
class PermissionDeniedError(PermissionDeniedError): # type:ignore
|
||||
|
|
|
@ -6,7 +6,7 @@ import requests
|
|||
from litellm.proxy._types import UserAPIKeyAuth
|
||||
from litellm.caching import DualCache
|
||||
|
||||
from typing import Literal, Union
|
||||
from typing import Literal, Union, Optional
|
||||
|
||||
dotenv.load_dotenv() # Loading env variables using dotenv
|
||||
import traceback
|
||||
|
@ -46,6 +46,17 @@ class CustomLogger: # https://docs.litellm.ai/docs/observability/custom_callbac
|
|||
async def async_log_failure_event(self, kwargs, response_obj, start_time, end_time):
|
||||
pass
|
||||
|
||||
#### PRE-CALL CHECKS - router/proxy only ####
|
||||
"""
|
||||
Allows usage-based-routing-v2 to run pre-call rpm checks within the picked deployment's semaphore (concurrency-safe tpm/rpm checks).
|
||||
"""
|
||||
|
||||
async def async_pre_call_check(self, deployment: dict) -> Optional[dict]:
|
||||
pass
|
||||
|
||||
def pre_call_check(self, deployment: dict) -> Optional[dict]:
|
||||
pass
|
||||
|
||||
#### CALL HOOKS - proxy only ####
|
||||
"""
|
||||
Control the modify incoming / outgoung data before calling the model
|
||||
|
|
51
litellm/integrations/greenscale.py
Normal file
51
litellm/integrations/greenscale.py
Normal file
|
@ -0,0 +1,51 @@
|
|||
import requests
|
||||
import json
|
||||
import traceback
|
||||
from datetime import datetime, timezone
|
||||
|
||||
class GreenscaleLogger:
|
||||
def __init__(self):
|
||||
import os
|
||||
self.greenscale_api_key = os.getenv("GREENSCALE_API_KEY")
|
||||
self.headers = {
|
||||
"api-key": self.greenscale_api_key,
|
||||
"Content-Type": "application/json"
|
||||
}
|
||||
self.greenscale_logging_url = os.getenv("GREENSCALE_ENDPOINT")
|
||||
|
||||
def log_event(self, kwargs, response_obj, start_time, end_time, print_verbose):
|
||||
try:
|
||||
response_json = response_obj.model_dump() if response_obj else {}
|
||||
data = {
|
||||
"modelId": kwargs.get("model"),
|
||||
"inputTokenCount": response_json.get("usage", {}).get("prompt_tokens"),
|
||||
"outputTokenCount": response_json.get("usage", {}).get("completion_tokens"),
|
||||
}
|
||||
data["timestamp"] = datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')
|
||||
|
||||
if type(end_time) == datetime and type(start_time) == datetime:
|
||||
data["invocationLatency"] = int((end_time - start_time).total_seconds() * 1000)
|
||||
|
||||
|
||||
# Add additional metadata keys to tags
|
||||
tags = []
|
||||
metadata = kwargs.get("litellm_params", {}).get("metadata", {})
|
||||
for key, value in metadata.items():
|
||||
if key.startswith("greenscale"):
|
||||
if key == "greenscale_project":
|
||||
data["project"] = value
|
||||
elif key == "greenscale_application":
|
||||
data["application"] = value
|
||||
else:
|
||||
tags.append({"key": key.replace("greenscale_", ""), "value": str(value)})
|
||||
|
||||
data["tags"] = tags
|
||||
|
||||
response = requests.post(self.greenscale_logging_url, headers=self.headers, data=json.dumps(data, default=str))
|
||||
if response.status_code != 200:
|
||||
print_verbose(f"Greenscale Logger Error - {response.text}, {response.status_code}")
|
||||
else:
|
||||
print_verbose(f"Greenscale Logger Succeeded - {response.text}")
|
||||
except Exception as e:
|
||||
print_verbose(f"Greenscale Logger Error - {e}, Stack trace: {traceback.format_exc()}")
|
||||
pass
|
|
@ -12,7 +12,9 @@ import litellm
|
|||
|
||||
class LangFuseLogger:
|
||||
# Class variables or attributes
|
||||
def __init__(self, langfuse_public_key=None, langfuse_secret=None):
|
||||
def __init__(
|
||||
self, langfuse_public_key=None, langfuse_secret=None, flush_interval=1
|
||||
):
|
||||
try:
|
||||
from langfuse import Langfuse
|
||||
except Exception as e:
|
||||
|
@ -31,9 +33,17 @@ class LangFuseLogger:
|
|||
host=self.langfuse_host,
|
||||
release=self.langfuse_release,
|
||||
debug=self.langfuse_debug,
|
||||
flush_interval=1, # flush interval in seconds
|
||||
flush_interval=flush_interval, # flush interval in seconds
|
||||
)
|
||||
|
||||
# set the current langfuse project id in the environ
|
||||
# this is used by Alerting to link to the correct project
|
||||
try:
|
||||
project_id = self.Langfuse.client.projects.get().data[0].id
|
||||
os.environ["LANGFUSE_PROJECT_ID"] = project_id
|
||||
except:
|
||||
project_id = None
|
||||
|
||||
if os.getenv("UPSTREAM_LANGFUSE_SECRET_KEY") is not None:
|
||||
self.upstream_langfuse_secret_key = os.getenv(
|
||||
"UPSTREAM_LANGFUSE_SECRET_KEY"
|
||||
|
@ -76,6 +86,7 @@ class LangFuseLogger:
|
|||
print_verbose(
|
||||
f"Langfuse Logging - Enters logging function for model {kwargs}"
|
||||
)
|
||||
|
||||
litellm_params = kwargs.get("litellm_params", {})
|
||||
metadata = (
|
||||
litellm_params.get("metadata", {}) or {}
|
||||
|
@ -133,6 +144,7 @@ class LangFuseLogger:
|
|||
self._log_langfuse_v2(
|
||||
user_id,
|
||||
metadata,
|
||||
litellm_params,
|
||||
output,
|
||||
start_time,
|
||||
end_time,
|
||||
|
@ -224,6 +236,7 @@ class LangFuseLogger:
|
|||
self,
|
||||
user_id,
|
||||
metadata,
|
||||
litellm_params,
|
||||
output,
|
||||
start_time,
|
||||
end_time,
|
||||
|
@ -252,15 +265,18 @@ class LangFuseLogger:
|
|||
tags = metadata_tags
|
||||
|
||||
trace_name = metadata.get("trace_name", None)
|
||||
if trace_name is None:
|
||||
trace_id = metadata.get("trace_id", None)
|
||||
existing_trace_id = metadata.get("existing_trace_id", None)
|
||||
if trace_name is None and existing_trace_id is None:
|
||||
# just log `litellm-{call_type}` as the trace name
|
||||
## DO NOT SET TRACE_NAME if trace-id set. this can lead to overwriting of past traces.
|
||||
trace_name = f"litellm-{kwargs.get('call_type', 'completion')}"
|
||||
|
||||
trace_params = {
|
||||
"name": trace_name,
|
||||
"input": input,
|
||||
"user_id": metadata.get("trace_user_id", user_id),
|
||||
"id": metadata.get("trace_id", None),
|
||||
"id": trace_id or existing_trace_id,
|
||||
"session_id": metadata.get("session_id", None),
|
||||
}
|
||||
|
||||
|
@ -278,13 +294,13 @@ class LangFuseLogger:
|
|||
clean_metadata = {}
|
||||
if isinstance(metadata, dict):
|
||||
for key, value in metadata.items():
|
||||
# generate langfuse tags
|
||||
if key in [
|
||||
"user_api_key",
|
||||
"user_api_key_user_id",
|
||||
"user_api_key_team_id",
|
||||
"semantic-similarity",
|
||||
]:
|
||||
|
||||
# generate langfuse tags - Default Tags sent to Langfuse from LiteLLM Proxy
|
||||
if (
|
||||
litellm._langfuse_default_tags is not None
|
||||
and isinstance(litellm._langfuse_default_tags, list)
|
||||
and key in litellm._langfuse_default_tags
|
||||
):
|
||||
tags.append(f"{key}:{value}")
|
||||
|
||||
# clean litellm metadata before logging
|
||||
|
@ -298,13 +314,53 @@ class LangFuseLogger:
|
|||
else:
|
||||
clean_metadata[key] = value
|
||||
|
||||
if (
|
||||
litellm._langfuse_default_tags is not None
|
||||
and isinstance(litellm._langfuse_default_tags, list)
|
||||
and "proxy_base_url" in litellm._langfuse_default_tags
|
||||
):
|
||||
proxy_base_url = os.environ.get("PROXY_BASE_URL", None)
|
||||
if proxy_base_url is not None:
|
||||
tags.append(f"proxy_base_url:{proxy_base_url}")
|
||||
|
||||
api_base = litellm_params.get("api_base", None)
|
||||
if api_base:
|
||||
clean_metadata["api_base"] = api_base
|
||||
|
||||
vertex_location = kwargs.get("vertex_location", None)
|
||||
if vertex_location:
|
||||
clean_metadata["vertex_location"] = vertex_location
|
||||
|
||||
aws_region_name = kwargs.get("aws_region_name", None)
|
||||
if aws_region_name:
|
||||
clean_metadata["aws_region_name"] = aws_region_name
|
||||
|
||||
if supports_tags:
|
||||
if "cache_hit" in kwargs:
|
||||
if kwargs["cache_hit"] is None:
|
||||
kwargs["cache_hit"] = False
|
||||
tags.append(f"cache_hit:{kwargs['cache_hit']}")
|
||||
clean_metadata["cache_hit"] = kwargs["cache_hit"]
|
||||
trace_params.update({"tags": tags})
|
||||
|
||||
proxy_server_request = litellm_params.get("proxy_server_request", None)
|
||||
if proxy_server_request:
|
||||
method = proxy_server_request.get("method", None)
|
||||
url = proxy_server_request.get("url", None)
|
||||
headers = proxy_server_request.get("headers", None)
|
||||
clean_headers = {}
|
||||
if headers:
|
||||
for key, value in headers.items():
|
||||
# these headers can leak our API keys and/or JWT tokens
|
||||
if key.lower() not in ["authorization", "cookie", "referer"]:
|
||||
clean_headers[key] = value
|
||||
|
||||
clean_metadata["request"] = {
|
||||
"method": method,
|
||||
"url": url,
|
||||
"headers": clean_headers,
|
||||
}
|
||||
|
||||
print_verbose(f"trace_params: {trace_params}")
|
||||
|
||||
trace = self.Langfuse.trace(**trace_params)
|
||||
|
@ -323,7 +379,11 @@ class LangFuseLogger:
|
|||
# just log `litellm-{call_type}` as the generation name
|
||||
generation_name = f"litellm-{kwargs.get('call_type', 'completion')}"
|
||||
|
||||
system_fingerprint = response_obj.get("system_fingerprint", None)
|
||||
if response_obj is not None and "system_fingerprint" in response_obj:
|
||||
system_fingerprint = response_obj.get("system_fingerprint", None)
|
||||
else:
|
||||
system_fingerprint = None
|
||||
|
||||
if system_fingerprint is not None:
|
||||
optional_params["system_fingerprint"] = system_fingerprint
|
||||
|
||||
|
|
|
@ -7,6 +7,19 @@ from datetime import datetime
|
|||
|
||||
dotenv.load_dotenv() # Loading env variables using dotenv
|
||||
import traceback
|
||||
import asyncio
|
||||
import types
|
||||
from pydantic import BaseModel
|
||||
|
||||
|
||||
def is_serializable(value):
|
||||
non_serializable_types = (
|
||||
types.CoroutineType,
|
||||
types.FunctionType,
|
||||
types.GeneratorType,
|
||||
BaseModel,
|
||||
)
|
||||
return not isinstance(value, non_serializable_types)
|
||||
|
||||
|
||||
class LangsmithLogger:
|
||||
|
@ -21,7 +34,9 @@ class LangsmithLogger:
|
|||
def log_event(self, kwargs, response_obj, start_time, end_time, print_verbose):
|
||||
# Method definition
|
||||
# inspired by Langsmith http api here: https://github.com/langchain-ai/langsmith-cookbook/blob/main/tracing-examples/rest/rest.ipynb
|
||||
metadata = kwargs.get('litellm_params', {}).get("metadata", {}) or {} # if metadata is None
|
||||
metadata = (
|
||||
kwargs.get("litellm_params", {}).get("metadata", {}) or {}
|
||||
) # if metadata is None
|
||||
|
||||
# set project name and run_name for langsmith logging
|
||||
# users can pass project_name and run name to litellm.completion()
|
||||
|
@@ -51,26 +66,42 @@ class LangsmithLogger:
        new_kwargs = {}
        for key in kwargs:
            value = kwargs[key]
            if key == "start_time" or key == "end_time":
            if key == "start_time" or key == "end_time" or value is None:
                pass
            elif type(value) == datetime.datetime:
                new_kwargs[key] = value.isoformat()
            elif type(value) != dict:
            elif type(value) != dict and is_serializable(value=value):
                new_kwargs[key] = value

        requests.post(
        if isinstance(response_obj, BaseModel):
            try:
                response_obj = response_obj.model_dump()
            except:
                response_obj = response_obj.dict()  # type: ignore

        print(f"response_obj: {response_obj}")

        data = {
            "name": run_name,
            "run_type": "llm",  # this should always be llm, since litellm always logs llm calls. Langsmith allow us to log "chain"
            "inputs": new_kwargs,
            "outputs": response_obj,
            "session_name": project_name,
            "start_time": start_time,
            "end_time": end_time,
        }
        print(f"data: {data}")

        response = requests.post(
            "https://api.smith.langchain.com/runs",
            json={
                "name": run_name,
                "run_type": "llm",  # this should always be llm, since litellm always logs llm calls. Langsmith allow us to log "chain"
                "inputs": {**new_kwargs},
                "outputs": response_obj.json(),
                "session_name": project_name,
                "start_time": start_time,
                "end_time": end_time,
            },
            json=data,
            headers={"x-api-key": self.langsmith_api_key},
        )

        if response.status_code >= 300:
            print_verbose(f"Error: {response.status_code}")
        else:
            print_verbose("Run successfully created")
        print_verbose(
            f"Langsmith Layer Logging - final response object: {response_obj}"
        )
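# A rough sketch of the payload filtering above: drop non-serializable values,
# ISO-format datetimes, then POST a run to the Langsmith REST API. The endpoint and
# the "x-api-key" header mirror the diff; everything else is a simplified stand-in
# for the logger's fields.
import datetime
import types

import requests
from pydantic import BaseModel

NON_SERIALIZABLE = (types.CoroutineType, types.FunctionType, types.GeneratorType, BaseModel)

def build_langsmith_inputs(kwargs: dict) -> dict:
    inputs = {}
    for key, value in kwargs.items():
        if key in ("start_time", "end_time") or value is None:
            continue  # start/end time are sent as top-level run fields instead
        if isinstance(value, datetime.datetime):
            inputs[key] = value.isoformat()
        elif not isinstance(value, dict) and not isinstance(value, NON_SERIALIZABLE):
            inputs[key] = value
    return inputs

def post_run(api_key: str, run_name: str, project: str, kwargs: dict, outputs: dict):
    now = datetime.datetime.utcnow().isoformat()
    data = {
        "name": run_name,
        "run_type": "llm",
        "inputs": build_langsmith_inputs(kwargs),
        "outputs": outputs,
        "session_name": project,
        "start_time": now,
        "end_time": now,
    }
    return requests.post(
        "https://api.smith.langchain.com/runs",
        json=data,
        headers={"x-api-key": api_key},
    )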
@@ -19,27 +19,33 @@ class PrometheusLogger:
        **kwargs,
    ):
        try:
            verbose_logger.debug(f"in init prometheus metrics")
            print(f"in init prometheus metrics")
            from prometheus_client import Counter

            self.litellm_llm_api_failed_requests_metric = Counter(
                name="litellm_llm_api_failed_requests_metric",
                documentation="Total number of failed LLM API calls via litellm",
                labelnames=["end_user", "hashed_api_key", "model", "team", "user"],
            )

            self.litellm_requests_metric = Counter(
                name="litellm_requests_metric",
                documentation="Total number of LLM calls to litellm",
                labelnames=["end_user", "hashed_api_key", "model", "team"],
                labelnames=["end_user", "hashed_api_key", "model", "team", "user"],
            )

            # Counter for spend
            self.litellm_spend_metric = Counter(
                "litellm_spend_metric",
                "Total spend on LLM requests",
                labelnames=["end_user", "hashed_api_key", "model", "team"],
                labelnames=["end_user", "hashed_api_key", "model", "team", "user"],
            )

            # Counter for total_output_tokens
            self.litellm_tokens_metric = Counter(
                "litellm_total_tokens",
                "Total number of input + output tokens from LLM requests",
                labelnames=["end_user", "hashed_api_key", "model", "team"],
                labelnames=["end_user", "hashed_api_key", "model", "team", "user"],
            )
        except Exception as e:
            print_verbose(f"Got exception on init prometheus client {str(e)}")

@@ -61,15 +67,21 @@ class PrometheusLogger:

            # unpack kwargs
            model = kwargs.get("model", "")
            response_cost = kwargs.get("response_cost", 0.0)
            response_cost = kwargs.get("response_cost", 0.0) or 0
            litellm_params = kwargs.get("litellm_params", {}) or {}
            proxy_server_request = litellm_params.get("proxy_server_request") or {}
            end_user_id = proxy_server_request.get("body", {}).get("user", None)
            user_id = litellm_params.get("metadata", {}).get(
                "user_api_key_user_id", None
            )
            user_api_key = litellm_params.get("metadata", {}).get("user_api_key", None)
            user_api_team = litellm_params.get("metadata", {}).get(
                "user_api_key_team_id", None
            )
            tokens_used = response_obj.get("usage", {}).get("total_tokens", 0)
            if response_obj is not None:
                tokens_used = response_obj.get("usage", {}).get("total_tokens", 0)
            else:
                tokens_used = 0

            print_verbose(
                f"inside track_prometheus_metrics, model {model}, response_cost {response_cost}, tokens_used {tokens_used}, end_user_id {end_user_id}, user_api_key {user_api_key}"

@@ -85,14 +97,20 @@ class PrometheusLogger:
                user_api_key = hash_token(user_api_key)

            self.litellm_requests_metric.labels(
                end_user_id, user_api_key, model, user_api_team
                end_user_id, user_api_key, model, user_api_team, user_id
            ).inc()
            self.litellm_spend_metric.labels(
                end_user_id, user_api_key, model, user_api_team
                end_user_id, user_api_key, model, user_api_team, user_id
            ).inc(response_cost)
            self.litellm_tokens_metric.labels(
                end_user_id, user_api_key, model, user_api_team
                end_user_id, user_api_key, model, user_api_team, user_id
            ).inc(tokens_used)

            ### FAILURE INCREMENT ###
            if "exception" in kwargs:
                self.litellm_llm_api_failed_requests_metric.labels(
                    end_user_id, user_api_key, model, user_api_team, user_id
                ).inc()
        except Exception as e:
            traceback.print_exc()
            verbose_logger.debug(
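# A small sketch of the label change above: the counters now carry a fifth "user"
# label, so every .labels(...) call has to pass the user id as well. Metric and label
# names mirror the diff; the sample label values are made up.
from prometheus_client import Counter

litellm_requests_metric = Counter(
    name="litellm_requests_metric",
    documentation="Total number of LLM calls to litellm",
    labelnames=["end_user", "hashed_api_key", "model", "team", "user"],
)

# Every increment must now supply all five labels, in the same order.
litellm_requests_metric.labels(
    "end-user-123", "hashed-key-abc", "gpt-3.5-turbo", "team-a", "user-42"
).inc()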
@ -44,9 +44,18 @@ class PrometheusServicesLogger:
|
|||
) # store the prometheus histogram/counter we need to call for each field in payload
|
||||
|
||||
for service in self.services:
|
||||
histogram = self.create_histogram(service)
|
||||
counter = self.create_counter(service)
|
||||
self.payload_to_prometheus_map[service] = [histogram, counter]
|
||||
histogram = self.create_histogram(service, type_of_request="latency")
|
||||
counter_failed_request = self.create_counter(
|
||||
service, type_of_request="failed_requests"
|
||||
)
|
||||
counter_total_requests = self.create_counter(
|
||||
service, type_of_request="total_requests"
|
||||
)
|
||||
self.payload_to_prometheus_map[service] = [
|
||||
histogram,
|
||||
counter_failed_request,
|
||||
counter_total_requests,
|
||||
]
|
||||
|
||||
self.prometheus_to_amount_map: dict = (
|
||||
{}
|
||||
|
@ -74,26 +83,26 @@ class PrometheusServicesLogger:
|
|||
return metric
|
||||
return None
|
||||
|
||||
def create_histogram(self, label: str):
|
||||
metric_name = "litellm_{}_latency".format(label)
|
||||
def create_histogram(self, service: str, type_of_request: str):
|
||||
metric_name = "litellm_{}_{}".format(service, type_of_request)
|
||||
is_registered = self.is_metric_registered(metric_name)
|
||||
if is_registered:
|
||||
return self.get_metric(metric_name)
|
||||
return self.Histogram(
|
||||
metric_name,
|
||||
"Latency for {} service".format(label),
|
||||
labelnames=[label],
|
||||
"Latency for {} service".format(service),
|
||||
labelnames=[service],
|
||||
)
|
||||
|
||||
def create_counter(self, label: str):
|
||||
metric_name = "litellm_{}_failed_requests".format(label)
|
||||
def create_counter(self, service: str, type_of_request: str):
|
||||
metric_name = "litellm_{}_{}".format(service, type_of_request)
|
||||
is_registered = self.is_metric_registered(metric_name)
|
||||
if is_registered:
|
||||
return self.get_metric(metric_name)
|
||||
return self.Counter(
|
||||
metric_name,
|
||||
"Total failed requests for {} service".format(label),
|
||||
labelnames=[label],
|
||||
"Total {} for {} service".format(type_of_request, service),
|
||||
labelnames=[service],
|
||||
)
|
||||
|
||||
def observe_histogram(
|
||||
|
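# The refactor above derives one metric name per (service, request type) pair. A tiny
# sketch of that naming scheme, with made-up service names:
def service_metric_name(service: str, type_of_request: str) -> str:
    return "litellm_{}_{}".format(service, type_of_request)

for svc in ("redis", "postgres"):
    for kind in ("latency", "failed_requests", "total_requests"):
        print(service_metric_name(svc, kind))
# e.g. litellm_redis_latency, litellm_redis_failed_requests, litellm_redis_total_requests, ...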
@ -129,6 +138,12 @@ class PrometheusServicesLogger:
|
|||
labels=payload.service.value,
|
||||
amount=payload.duration,
|
||||
)
|
||||
elif isinstance(obj, self.Counter) and "total_requests" in obj._name:
|
||||
self.increment_counter(
|
||||
counter=obj,
|
||||
labels=payload.service.value,
|
||||
amount=1, # LOG TOTAL REQUESTS TO PROMETHEUS
|
||||
)
|
||||
|
||||
def service_failure_hook(self, payload: ServiceLoggerPayload):
|
||||
if self.mock_testing:
|
||||
|
@ -141,7 +156,7 @@ class PrometheusServicesLogger:
|
|||
self.increment_counter(
|
||||
counter=obj,
|
||||
labels=payload.service.value,
|
||||
amount=1, # LOG ERROR COUNT TO PROMETHEUS
|
||||
amount=1, # LOG ERROR COUNT / TOTAL REQUESTS TO PROMETHEUS
|
||||
)
|
||||
|
||||
async def async_service_success_hook(self, payload: ServiceLoggerPayload):
|
||||
|
@ -160,6 +175,12 @@ class PrometheusServicesLogger:
|
|||
labels=payload.service.value,
|
||||
amount=payload.duration,
|
||||
)
|
||||
elif isinstance(obj, self.Counter) and "total_requests" in obj._name:
|
||||
self.increment_counter(
|
||||
counter=obj,
|
||||
labels=payload.service.value,
|
||||
amount=1, # LOG TOTAL REQUESTS TO PROMETHEUS
|
||||
)
|
||||
|
||||
async def async_service_failure_hook(self, payload: ServiceLoggerPayload):
|
||||
print(f"received error payload: {payload.error}")
|
||||
|
|
litellm/integrations/slack_alerting.py (new file, 541 lines)

@@ -0,0 +1,541 @@
#### What this does ####
|
||||
# Class for sending Slack Alerts #
|
||||
import dotenv, os
|
||||
|
||||
dotenv.load_dotenv() # Loading env variables using dotenv
|
||||
import copy
|
||||
import traceback
|
||||
from litellm._logging import verbose_logger, verbose_proxy_logger
|
||||
import litellm
|
||||
from typing import List, Literal, Any, Union, Optional, Dict
|
||||
from litellm.caching import DualCache
|
||||
import asyncio
|
||||
import aiohttp
|
||||
from litellm.llms.custom_httpx.http_handler import AsyncHTTPHandler
|
||||
import datetime
|
||||
|
||||
|
||||
class SlackAlerting:
|
||||
# Class variables or attributes
|
||||
def __init__(
|
||||
self,
|
||||
alerting_threshold: float = 300,
|
||||
alerting: Optional[List] = [],
|
||||
alert_types: Optional[
|
||||
List[
|
||||
Literal[
|
||||
"llm_exceptions",
|
||||
"llm_too_slow",
|
||||
"llm_requests_hanging",
|
||||
"budget_alerts",
|
||||
"db_exceptions",
|
||||
]
|
||||
]
|
||||
] = [
|
||||
"llm_exceptions",
|
||||
"llm_too_slow",
|
||||
"llm_requests_hanging",
|
||||
"budget_alerts",
|
||||
"db_exceptions",
|
||||
],
|
||||
alert_to_webhook_url: Optional[
|
||||
Dict
|
||||
] = None, # if user wants to separate alerts to diff channels
|
||||
):
|
||||
self.alerting_threshold = alerting_threshold
|
||||
self.alerting = alerting
|
||||
self.alert_types = alert_types
|
||||
self.internal_usage_cache = DualCache()
|
||||
self.async_http_handler = AsyncHTTPHandler()
|
||||
self.alert_to_webhook_url = alert_to_webhook_url
|
||||
self.langfuse_logger = None
|
||||
|
||||
try:
|
||||
from litellm.integrations.langfuse import LangFuseLogger
|
||||
|
||||
self.langfuse_logger = LangFuseLogger(
|
||||
os.getenv("LANGFUSE_PUBLIC_KEY"),
|
||||
os.getenv("LANGFUSE_SECRET_KEY"),
|
||||
flush_interval=1,
|
||||
)
|
||||
except:
|
||||
pass
|
||||
|
||||
pass
|
||||
|
||||
def update_values(
|
||||
self,
|
||||
alerting: Optional[List] = None,
|
||||
alerting_threshold: Optional[float] = None,
|
||||
alert_types: Optional[List] = None,
|
||||
alert_to_webhook_url: Optional[Dict] = None,
|
||||
):
|
||||
if alerting is not None:
|
||||
self.alerting = alerting
|
||||
if alerting_threshold is not None:
|
||||
self.alerting_threshold = alerting_threshold
|
||||
if alert_types is not None:
|
||||
self.alert_types = alert_types
|
||||
|
||||
if alert_to_webhook_url is not None:
|
||||
# update the dict
|
||||
if self.alert_to_webhook_url is None:
|
||||
self.alert_to_webhook_url = alert_to_webhook_url
|
||||
else:
|
||||
self.alert_to_webhook_url.update(alert_to_webhook_url)
|
||||
|
||||
async def deployment_in_cooldown(self):
|
||||
pass
|
||||
|
||||
async def deployment_removed_from_cooldown(self):
|
||||
pass
|
||||
|
||||
def _all_possible_alert_types(self):
|
||||
# used by the UI to show all supported alert types
|
||||
# Note: This is not the alerts the user has configured, instead it's all possible alert types a user can select
|
||||
return [
|
||||
"llm_exceptions",
|
||||
"llm_too_slow",
|
||||
"llm_requests_hanging",
|
||||
"budget_alerts",
|
||||
"db_exceptions",
|
||||
]
|
||||
|
||||
def _add_langfuse_trace_id_to_alert(
|
||||
self,
|
||||
request_info: str,
|
||||
request_data: Optional[dict] = None,
|
||||
kwargs: Optional[dict] = None,
|
||||
type: Literal["hanging_request", "slow_response"] = "hanging_request",
|
||||
start_time: Optional[datetime.datetime] = None,
|
||||
end_time: Optional[datetime.datetime] = None,
|
||||
):
|
||||
import uuid
|
||||
|
||||
# For now: do nothing as we're debugging why this is not working as expected
|
||||
if request_data is not None:
|
||||
trace_id = request_data.get("metadata", {}).get(
|
||||
"trace_id", None
|
||||
) # get langfuse trace id
|
||||
if trace_id is None:
|
||||
trace_id = "litellm-alert-trace-" + str(uuid.uuid4())
|
||||
request_data["metadata"]["trace_id"] = trace_id
|
||||
elif kwargs is not None:
|
||||
_litellm_params = kwargs.get("litellm_params", {})
|
||||
trace_id = _litellm_params.get("metadata", {}).get(
|
||||
"trace_id", None
|
||||
) # get langfuse trace id
|
||||
if trace_id is None:
|
||||
trace_id = "litellm-alert-trace-" + str(uuid.uuid4())
|
||||
_litellm_params["metadata"]["trace_id"] = trace_id
|
||||
|
||||
# Log hanging request as an error on langfuse
|
||||
if type == "hanging_request":
|
||||
if self.langfuse_logger is not None:
|
||||
_logging_kwargs = copy.deepcopy(request_data)
|
||||
if _logging_kwargs is None:
|
||||
_logging_kwargs = {}
|
||||
_logging_kwargs["litellm_params"] = {}
|
||||
request_data = request_data or {}
|
||||
_logging_kwargs["litellm_params"]["metadata"] = request_data.get(
|
||||
"metadata", {}
|
||||
)
|
||||
# log to langfuse in a separate thread
|
||||
import threading
|
||||
|
||||
threading.Thread(
|
||||
target=self.langfuse_logger.log_event,
|
||||
args=(
|
||||
_logging_kwargs,
|
||||
None,
|
||||
start_time,
|
||||
end_time,
|
||||
None,
|
||||
print,
|
||||
"ERROR",
|
||||
"Requests is hanging",
|
||||
),
|
||||
).start()
|
||||
|
||||
_langfuse_host = os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com")
|
||||
_langfuse_project_id = os.environ.get("LANGFUSE_PROJECT_ID")
|
||||
|
||||
# langfuse urls look like: https://us.cloud.langfuse.com/project/************/traces/litellm-alert-trace-ididi9dk-09292-************
|
||||
|
||||
_langfuse_url = (
|
||||
f"{_langfuse_host}/project/{_langfuse_project_id}/traces/{trace_id}"
|
||||
)
|
||||
request_info += f"\n🪢 Langfuse Trace: {_langfuse_url}"
|
||||
return request_info
|
||||
|
||||
def _response_taking_too_long_callback(
|
||||
self,
|
||||
kwargs, # kwargs to completion
|
||||
start_time,
|
||||
end_time, # start/end time
|
||||
):
|
||||
try:
|
||||
time_difference = end_time - start_time
|
||||
# Convert the timedelta to float (in seconds)
|
||||
time_difference_float = time_difference.total_seconds()
|
||||
litellm_params = kwargs.get("litellm_params", {})
|
||||
model = kwargs.get("model", "")
|
||||
api_base = litellm.get_api_base(model=model, optional_params=litellm_params)
|
||||
messages = kwargs.get("messages", None)
|
||||
# if messages does not exist fallback to "input"
|
||||
if messages is None:
|
||||
messages = kwargs.get("input", None)
|
||||
|
||||
# only use first 100 chars for alerting
|
||||
_messages = str(messages)[:100]
|
||||
|
||||
return time_difference_float, model, api_base, _messages
|
||||
except Exception as e:
|
||||
raise e
|
||||
|
||||
def _get_deployment_latencies_to_alert(self, metadata=None):
|
||||
if metadata is None:
|
||||
return None
|
||||
|
||||
if "_latency_per_deployment" in metadata:
|
||||
# Translate model_id to -> api_base
|
||||
# _latency_per_deployment is a dictionary that looks like this:
|
||||
"""
|
||||
_latency_per_deployment: {
|
||||
api_base: 0.01336697916666667
|
||||
}
|
||||
"""
|
||||
_message_to_send = ""
|
||||
_deployment_latencies = metadata["_latency_per_deployment"]
|
||||
if len(_deployment_latencies) == 0:
|
||||
return None
|
||||
try:
|
||||
# try sorting deployments by latency
|
||||
_deployment_latencies = sorted(
|
||||
_deployment_latencies.items(), key=lambda x: x[1]
|
||||
)
|
||||
_deployment_latencies = dict(_deployment_latencies)
|
||||
except:
|
||||
pass
|
||||
for api_base, latency in _deployment_latencies.items():
|
||||
_message_to_send += f"\n{api_base}: {round(latency,2)}s"
|
||||
_message_to_send = "```" + _message_to_send + "```"
|
||||
return _message_to_send
|
||||
|
||||
async def response_taking_too_long_callback(
|
||||
self,
|
||||
kwargs, # kwargs to completion
|
||||
completion_response, # response from completion
|
||||
start_time,
|
||||
end_time, # start/end time
|
||||
):
|
||||
if self.alerting is None or self.alert_types is None:
|
||||
return
|
||||
|
||||
time_difference_float, model, api_base, messages = (
|
||||
self._response_taking_too_long_callback(
|
||||
kwargs=kwargs,
|
||||
start_time=start_time,
|
||||
end_time=end_time,
|
||||
)
|
||||
)
|
||||
request_info = f"\nRequest Model: `{model}`\nAPI Base: `{api_base}`\nMessages: `{messages}`"
|
||||
slow_message = f"`Responses are slow - {round(time_difference_float,2)}s response time > Alerting threshold: {self.alerting_threshold}s`"
|
||||
if time_difference_float > self.alerting_threshold:
|
||||
if "langfuse" in litellm.success_callback:
|
||||
request_info = self._add_langfuse_trace_id_to_alert(
|
||||
request_info=request_info, kwargs=kwargs, type="slow_response"
|
||||
)
|
||||
# add deployment latencies to alert
|
||||
if (
|
||||
kwargs is not None
|
||||
and "litellm_params" in kwargs
|
||||
and "metadata" in kwargs["litellm_params"]
|
||||
):
|
||||
_metadata = kwargs["litellm_params"]["metadata"]
|
||||
|
||||
_deployment_latency_map = self._get_deployment_latencies_to_alert(
|
||||
metadata=_metadata
|
||||
)
|
||||
if _deployment_latency_map is not None:
|
||||
request_info += (
|
||||
f"\nAvailable Deployment Latencies\n{_deployment_latency_map}"
|
||||
)
|
||||
await self.send_alert(
|
||||
message=slow_message + request_info,
|
||||
level="Low",
|
||||
alert_type="llm_too_slow",
|
||||
)
|
||||
|
||||
async def log_failure_event(self, original_exception: Exception):
|
||||
pass
|
||||
|
||||
async def response_taking_too_long(
|
||||
self,
|
||||
start_time: Optional[datetime.datetime] = None,
|
||||
end_time: Optional[datetime.datetime] = None,
|
||||
type: Literal["hanging_request", "slow_response"] = "hanging_request",
|
||||
request_data: Optional[dict] = None,
|
||||
):
|
||||
if self.alerting is None or self.alert_types is None:
|
||||
return
|
||||
if request_data is not None:
|
||||
model = request_data.get("model", "")
|
||||
messages = request_data.get("messages", None)
|
||||
if messages is None:
|
||||
# if messages does not exist fallback to "input"
|
||||
messages = request_data.get("input", None)
|
||||
|
||||
# try casting messages to str and get the first 100 characters, else mark as None
|
||||
try:
|
||||
messages = str(messages)
|
||||
messages = messages[:100]
|
||||
except:
|
||||
messages = ""
|
||||
request_info = f"\nRequest Model: `{model}`\nMessages: `{messages}`"
|
||||
else:
|
||||
request_info = ""
|
||||
|
||||
if type == "hanging_request":
|
||||
await asyncio.sleep(
|
||||
self.alerting_threshold
|
||||
) # Set it to 5 minutes - i'd imagine this might be different for streaming, non-streaming, non-completion (embedding + img) requests
|
||||
if (
|
||||
request_data is not None
|
||||
and request_data.get("litellm_status", "") != "success"
|
||||
and request_data.get("litellm_status", "") != "fail"
|
||||
):
|
||||
if request_data.get("deployment", None) is not None and isinstance(
|
||||
request_data["deployment"], dict
|
||||
):
|
||||
_api_base = litellm.get_api_base(
|
||||
model=model,
|
||||
optional_params=request_data["deployment"].get(
|
||||
"litellm_params", {}
|
||||
),
|
||||
)
|
||||
|
||||
if _api_base is None:
|
||||
_api_base = ""
|
||||
|
||||
request_info += f"\nAPI Base: {_api_base}"
|
||||
elif request_data.get("metadata", None) is not None and isinstance(
|
||||
request_data["metadata"], dict
|
||||
):
|
||||
# In hanging requests sometime it has not made it to the point where the deployment is passed to the `request_data``
|
||||
# in that case we fallback to the api base set in the request metadata
|
||||
_metadata = request_data["metadata"]
|
||||
_api_base = _metadata.get("api_base", "")
|
||||
if _api_base is None:
|
||||
_api_base = ""
|
||||
request_info += f"\nAPI Base: `{_api_base}`"
|
||||
# only alert hanging responses if they have not been marked as success
|
||||
alerting_message = (
|
||||
f"`Requests are hanging - {self.alerting_threshold}s+ request time`"
|
||||
)
|
||||
|
||||
if "langfuse" in litellm.success_callback:
|
||||
request_info = self._add_langfuse_trace_id_to_alert(
|
||||
request_info=request_info,
|
||||
request_data=request_data,
|
||||
type="hanging_request",
|
||||
start_time=start_time,
|
||||
end_time=end_time,
|
||||
)
|
||||
|
||||
# add deployment latencies to alert
|
||||
_deployment_latency_map = self._get_deployment_latencies_to_alert(
|
||||
metadata=request_data.get("metadata", {})
|
||||
)
|
||||
if _deployment_latency_map is not None:
|
||||
request_info += f"\nDeployment Latencies\n{_deployment_latency_map}"
|
||||
|
||||
await self.send_alert(
|
||||
message=alerting_message + request_info,
|
||||
level="Medium",
|
||||
alert_type="llm_requests_hanging",
|
||||
)
|
||||
|
||||
async def budget_alerts(
|
||||
self,
|
||||
type: Literal[
|
||||
"token_budget",
|
||||
"user_budget",
|
||||
"user_and_proxy_budget",
|
||||
"failed_budgets",
|
||||
"failed_tracking",
|
||||
"projected_limit_exceeded",
|
||||
],
|
||||
user_max_budget: float,
|
||||
user_current_spend: float,
|
||||
user_info=None,
|
||||
error_message="",
|
||||
):
|
||||
if self.alerting is None or self.alert_types is None:
|
||||
# do nothing if alerting is not switched on
|
||||
return
|
||||
if "budget_alerts" not in self.alert_types:
|
||||
return
|
||||
_id: str = "default_id" # used for caching
|
||||
if type == "user_and_proxy_budget":
|
||||
user_info = dict(user_info)
|
||||
user_id = user_info["user_id"]
|
||||
_id = user_id
|
||||
max_budget = user_info["max_budget"]
|
||||
spend = user_info["spend"]
|
||||
user_email = user_info["user_email"]
|
||||
user_info = f"""\nUser ID: {user_id}\nMax Budget: ${max_budget}\nSpend: ${spend}\nUser Email: {user_email}"""
|
||||
elif type == "token_budget":
|
||||
token_info = dict(user_info)
|
||||
token = token_info["token"]
|
||||
_id = token
|
||||
spend = token_info["spend"]
|
||||
max_budget = token_info["max_budget"]
|
||||
user_id = token_info["user_id"]
|
||||
user_info = f"""\nToken: {token}\nSpend: ${spend}\nMax Budget: ${max_budget}\nUser ID: {user_id}"""
|
||||
elif type == "failed_tracking":
|
||||
user_id = str(user_info)
|
||||
_id = user_id
|
||||
user_info = f"\nUser ID: {user_id}\n Error {error_message}"
|
||||
message = "Failed Tracking Cost for" + user_info
|
||||
await self.send_alert(
|
||||
message=message, level="High", alert_type="budget_alerts"
|
||||
)
|
||||
return
|
||||
elif type == "projected_limit_exceeded" and user_info is not None:
|
||||
"""
|
||||
Input variables:
|
||||
user_info = {
|
||||
"key_alias": key_alias,
|
||||
"projected_spend": projected_spend,
|
||||
"projected_exceeded_date": projected_exceeded_date,
|
||||
}
|
||||
user_max_budget=soft_limit,
|
||||
user_current_spend=new_spend
|
||||
"""
|
||||
message = f"""\n🚨 `ProjectedLimitExceededError` 💸\n\n`Key Alias:` {user_info["key_alias"]} \n`Expected Day of Error`: {user_info["projected_exceeded_date"]} \n`Current Spend`: {user_current_spend} \n`Projected Spend at end of month`: {user_info["projected_spend"]} \n`Soft Limit`: {user_max_budget}"""
|
||||
await self.send_alert(
|
||||
message=message, level="High", alert_type="budget_alerts"
|
||||
)
|
||||
return
|
||||
else:
|
||||
user_info = str(user_info)
|
||||
|
||||
# percent of max_budget left to spend
|
||||
if user_max_budget > 0:
|
||||
percent_left = (user_max_budget - user_current_spend) / user_max_budget
|
||||
else:
|
||||
percent_left = 0
|
||||
verbose_proxy_logger.debug(
|
||||
f"Budget Alerts: Percent left: {percent_left} for {user_info}"
|
||||
)
|
||||
|
||||
## PREVENTITIVE ALERTING ## - https://github.com/BerriAI/litellm/issues/2727
|
||||
# - Alert once within 28d period
|
||||
# - Cache this information
|
||||
# - Don't re-alert, if alert already sent
|
||||
_cache: DualCache = self.internal_usage_cache
|
||||
|
||||
# check if crossed budget
|
||||
if user_current_spend >= user_max_budget:
|
||||
verbose_proxy_logger.debug("Budget Crossed for %s", user_info)
|
||||
message = "Budget Crossed for" + user_info
|
||||
result = await _cache.async_get_cache(key=message)
|
||||
if result is None:
|
||||
await self.send_alert(
|
||||
message=message, level="High", alert_type="budget_alerts"
|
||||
)
|
||||
await _cache.async_set_cache(key=message, value="SENT", ttl=2419200)
|
||||
return
|
||||
|
||||
# check if 5% of max budget is left
|
||||
if percent_left <= 0.05:
|
||||
message = "5% budget left for" + user_info
|
||||
cache_key = "alerting:{}".format(_id)
|
||||
result = await _cache.async_get_cache(key=cache_key)
|
||||
if result is None:
|
||||
await self.send_alert(
|
||||
message=message, level="Medium", alert_type="budget_alerts"
|
||||
)
|
||||
|
||||
await _cache.async_set_cache(key=cache_key, value="SENT", ttl=2419200)
|
||||
|
||||
return
|
||||
|
||||
# check if 15% of max budget is left
|
||||
if percent_left <= 0.15:
|
||||
message = "15% budget left for" + user_info
|
||||
result = await _cache.async_get_cache(key=message)
|
||||
if result is None:
|
||||
await self.send_alert(
|
||||
message=message, level="Low", alert_type="budget_alerts"
|
||||
)
|
||||
await _cache.async_set_cache(key=message, value="SENT", ttl=2419200)
|
||||
return
|
||||
|
||||
return
|
||||
|
||||
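# A simplified sketch of the budget thresholds above: alert when spend crosses the
# budget, at 5% of budget remaining, and at 15% remaining; the diff additionally caches
# each sent alert for 28 days (ttl=2419200) so it is not re-sent.
def budget_alert_level(max_budget: float, current_spend: float):
    if max_budget <= 0:
        return None
    if current_spend >= max_budget:
        return ("High", "Budget Crossed")
    percent_left = (max_budget - current_spend) / max_budget
    if percent_left <= 0.05:
        return ("Medium", "5% budget left")
    if percent_left <= 0.15:
        return ("Low", "15% budget left")
    return None

print(budget_alert_level(100.0, 97.0))   # ('Medium', '5% budget left')
print(budget_alert_level(100.0, 120.0))  # ('High', 'Budget Crossed')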
async def send_alert(
|
||||
self,
|
||||
message: str,
|
||||
level: Literal["Low", "Medium", "High"],
|
||||
alert_type: Literal[
|
||||
"llm_exceptions",
|
||||
"llm_too_slow",
|
||||
"llm_requests_hanging",
|
||||
"budget_alerts",
|
||||
"db_exceptions",
|
||||
],
|
||||
):
|
||||
"""
|
||||
Alerting based on thresholds: - https://github.com/BerriAI/litellm/issues/1298
|
||||
|
||||
- Responses taking too long
|
||||
- Requests are hanging
|
||||
- Calls are failing
|
||||
- DB Read/Writes are failing
|
||||
- Proxy Close to max budget
|
||||
- Key Close to max budget
|
||||
|
||||
Parameters:
|
||||
level: str - Low|Medium|High - if calls might fail (Medium) or are failing (High); Currently, no alerts would be 'Low'.
|
||||
message: str - what is the alert about
|
||||
"""
|
||||
if self.alerting is None:
|
||||
return
|
||||
|
||||
from datetime import datetime
|
||||
import json
|
||||
|
||||
# Get the current timestamp
|
||||
current_time = datetime.now().strftime("%H:%M:%S")
|
||||
_proxy_base_url = os.getenv("PROXY_BASE_URL", None)
|
||||
formatted_message = (
|
||||
f"Level: `{level}`\nTimestamp: `{current_time}`\n\nMessage: {message}"
|
||||
)
|
||||
if _proxy_base_url is not None:
|
||||
formatted_message += f"\n\nProxy URL: `{_proxy_base_url}`"
|
||||
|
||||
# check if we find the slack webhook url in self.alert_to_webhook_url
|
||||
if (
|
||||
self.alert_to_webhook_url is not None
|
||||
and alert_type in self.alert_to_webhook_url
|
||||
):
|
||||
slack_webhook_url = self.alert_to_webhook_url[alert_type]
|
||||
else:
|
||||
slack_webhook_url = os.getenv("SLACK_WEBHOOK_URL", None)
|
||||
|
||||
if slack_webhook_url is None:
|
||||
raise Exception("Missing SLACK_WEBHOOK_URL from environment")
|
||||
payload = {"text": formatted_message}
|
||||
headers = {"Content-type": "application/json"}
|
||||
|
||||
response = await self.async_http_handler.post(
|
||||
url=slack_webhook_url,
|
||||
headers=headers,
|
||||
data=json.dumps(payload),
|
||||
)
|
||||
if response.status_code == 200:
|
||||
pass
|
||||
else:
|
||||
print("Error sending slack alert. Error=", response.text) # noqa
|
|
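# A minimal sketch of the webhook selection in send_alert above: prefer a per-alert-type
# webhook from alert_to_webhook_url, fall back to the SLACK_WEBHOOK_URL env var, and
# raise if neither is set. The payload shape ({"text": ...}) matches the diff; the alert
# type and URL below are placeholders.
import os
from typing import Dict, Optional

def resolve_slack_webhook(alert_type: str, alert_to_webhook_url: Optional[Dict[str, str]]) -> str:
    if alert_to_webhook_url and alert_type in alert_to_webhook_url:
        return alert_to_webhook_url[alert_type]
    url = os.getenv("SLACK_WEBHOOK_URL")
    if url is None:
        raise Exception("Missing SLACK_WEBHOOK_URL from environment")
    return url

# Example: route budget alerts to a dedicated channel, everything else to the default.
routes = {"budget_alerts": "https://hooks.slack.com/services/T000/B000/budget"}
print(resolve_slack_webhook("budget_alerts", routes))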
@ -298,7 +298,7 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
||||
|
||||
|
|
|
@ -258,8 +258,9 @@ class AnthropicChatCompletion(BaseLLM):
|
|||
self.async_handler = AsyncHTTPHandler(
|
||||
timeout=httpx.Timeout(timeout=600.0, connect=5.0)
|
||||
)
|
||||
data["stream"] = True
|
||||
response = await self.async_handler.post(
|
||||
api_base, headers=headers, data=json.dumps(data)
|
||||
api_base, headers=headers, data=json.dumps(data), stream=True
|
||||
)
|
||||
|
||||
if response.status_code != 200:
|
||||
|
|
|
@ -137,7 +137,8 @@ class AnthropicTextCompletion(BaseLLM):
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
|
||||
setattr(model_response, "usage", usage)
|
||||
|
||||
return model_response
|
||||
|
||||
|
|
|
@ -96,6 +96,15 @@ class AzureOpenAIConfig(OpenAIConfig):
|
|||
top_p,
|
||||
)
|
||||
|
||||
def get_mapped_special_auth_params(self) -> dict:
|
||||
return {"token": "azure_ad_token"}
|
||||
|
||||
def map_special_auth_params(self, non_default_params: dict, optional_params: dict):
|
||||
for param, value in non_default_params.items():
|
||||
if param == "token":
|
||||
optional_params["azure_ad_token"] = value
|
||||
return optional_params
|
||||
|
||||
|
||||
def select_azure_base_url_or_endpoint(azure_client_params: dict):
|
||||
# azure_client_params = {
|
||||
|
|
|
@ -55,9 +55,11 @@ def completion(
|
|||
"inputs": prompt,
|
||||
"prompt": prompt,
|
||||
"parameters": optional_params,
|
||||
"stream": True
|
||||
if "stream" in optional_params and optional_params["stream"] == True
|
||||
else False,
|
||||
"stream": (
|
||||
True
|
||||
if "stream" in optional_params and optional_params["stream"] == True
|
||||
else False
|
||||
),
|
||||
}
|
||||
|
||||
## LOGGING
|
||||
|
@ -71,9 +73,11 @@ def completion(
|
|||
completion_url_fragment_1 + model + completion_url_fragment_2,
|
||||
headers=headers,
|
||||
data=json.dumps(data),
|
||||
stream=True
|
||||
if "stream" in optional_params and optional_params["stream"] == True
|
||||
else False,
|
||||
stream=(
|
||||
True
|
||||
if "stream" in optional_params and optional_params["stream"] == True
|
||||
else False
|
||||
),
|
||||
)
|
||||
if "text/event-stream" in response.headers["Content-Type"] or (
|
||||
"stream" in optional_params and optional_params["stream"] == True
|
||||
|
@ -102,28 +106,28 @@ def completion(
|
|||
and "data" in completion_response["model_output"]
|
||||
and isinstance(completion_response["model_output"]["data"], list)
|
||||
):
|
||||
model_response["choices"][0]["message"][
|
||||
"content"
|
||||
] = completion_response["model_output"]["data"][0]
|
||||
model_response["choices"][0]["message"]["content"] = (
|
||||
completion_response["model_output"]["data"][0]
|
||||
)
|
||||
elif isinstance(completion_response["model_output"], str):
|
||||
model_response["choices"][0]["message"][
|
||||
"content"
|
||||
] = completion_response["model_output"]
|
||||
model_response["choices"][0]["message"]["content"] = (
|
||||
completion_response["model_output"]
|
||||
)
|
||||
elif "completion" in completion_response and isinstance(
|
||||
completion_response["completion"], str
|
||||
):
|
||||
model_response["choices"][0]["message"][
|
||||
"content"
|
||||
] = completion_response["completion"]
|
||||
model_response["choices"][0]["message"]["content"] = (
|
||||
completion_response["completion"]
|
||||
)
|
||||
elif isinstance(completion_response, list) and len(completion_response) > 0:
|
||||
if "generated_text" not in completion_response:
|
||||
raise BasetenError(
|
||||
message=f"Unable to parse response. Original response: {response.text}",
|
||||
status_code=response.status_code,
|
||||
)
|
||||
model_response["choices"][0]["message"][
|
||||
"content"
|
||||
] = completion_response[0]["generated_text"]
|
||||
model_response["choices"][0]["message"]["content"] = (
|
||||
completion_response[0]["generated_text"]
|
||||
)
|
||||
## GETTING LOGPROBS
|
||||
if (
|
||||
"details" in completion_response[0]
|
||||
|
@ -155,7 +159,8 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
||||
|
||||
|
|
|
@ -29,6 +29,24 @@ class BedrockError(Exception):
|
|||
) # Call the base class constructor with the parameters it needs
|
||||
|
||||
|
||||
class AmazonBedrockGlobalConfig:
|
||||
def __init__(self):
|
||||
pass
|
||||
|
||||
def get_mapped_special_auth_params(self) -> dict:
|
||||
"""
|
||||
Mapping of common auth params across bedrock/vertex/azure/watsonx
|
||||
"""
|
||||
return {"region_name": "aws_region_name"}
|
||||
|
||||
def map_special_auth_params(self, non_default_params: dict, optional_params: dict):
|
||||
mapped_params = self.get_mapped_special_auth_params()
|
||||
for param, value in non_default_params.items():
|
||||
if param in mapped_params:
|
||||
optional_params[mapped_params[param]] = value
|
||||
return optional_params
|
||||
|
||||
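# A short sketch of the common-auth-param mapping above: a provider-agnostic param
# (region_name) is translated to the provider-specific key (aws_region_name) before the
# request is built. The mapping mirrors the diff; the input values are examples.
def map_special_auth_params(non_default_params: dict, optional_params: dict) -> dict:
    mapped_params = {"region_name": "aws_region_name"}
    for param, value in non_default_params.items():
        if param in mapped_params:
            optional_params[mapped_params[param]] = value
    return optional_params

print(map_special_auth_params({"region_name": "us-west-2"}, {}))
# {'aws_region_name': 'us-west-2'}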
|
||||
class AmazonTitanConfig:
|
||||
"""
|
||||
Reference: https://us-west-2.console.aws.amazon.com/bedrock/home?region=us-west-2#/providers?model=titan-text-express-v1
|
||||
|
@ -653,6 +671,10 @@ def convert_messages_to_prompt(model, messages, provider, custom_prompt_dict):
|
|||
prompt = prompt_factory(
|
||||
model=model, messages=messages, custom_llm_provider="bedrock"
|
||||
)
|
||||
elif provider == "meta":
|
||||
prompt = prompt_factory(
|
||||
model=model, messages=messages, custom_llm_provider="bedrock"
|
||||
)
|
||||
else:
|
||||
prompt = ""
|
||||
for message in messages:
|
||||
|
@ -1028,7 +1050,7 @@ def completion(
|
|||
total_tokens=response_body["usage"]["input_tokens"]
|
||||
+ response_body["usage"]["output_tokens"],
|
||||
)
|
||||
model_response.usage = _usage
|
||||
setattr(model_response, "usage", _usage)
|
||||
else:
|
||||
outputText = response_body["completion"]
|
||||
model_response["finish_reason"] = response_body["stop_reason"]
|
||||
|
@ -1071,8 +1093,10 @@ def completion(
|
|||
status_code=response_metadata.get("HTTPStatusCode", 500),
|
||||
)
|
||||
|
||||
## CALCULATING USAGE - baseten charges on time, not tokens - have some mapping of cost here.
|
||||
if getattr(model_response.usage, "total_tokens", None) is None:
|
||||
## CALCULATING USAGE - bedrock charges on time, not tokens - have some mapping of cost here.
|
||||
if not hasattr(model_response, "usage"):
|
||||
setattr(model_response, "usage", Usage())
|
||||
if getattr(model_response.usage, "total_tokens", None) is None: # type: ignore
|
||||
prompt_tokens = response_metadata.get(
|
||||
"x-amzn-bedrock-input-token-count", len(encoding.encode(prompt))
|
||||
)
|
||||
|
@ -1089,7 +1113,7 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
|
||||
model_response["created"] = int(time.time())
|
||||
model_response["model"] = model
|
||||
|
|
|
@ -167,7 +167,7 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
||||
|
||||
|
|
|
@ -237,7 +237,7 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
||||
|
||||
|
|
|
@ -43,6 +43,7 @@ class CohereChatConfig:
|
|||
presence_penalty (float, optional): Used to reduce repetitiveness of generated tokens.
|
||||
tools (List[Dict[str, str]], optional): A list of available tools (functions) that the model may suggest invoking.
|
||||
tool_results (List[Dict[str, Any]], optional): A list of results from invoking tools.
|
||||
seed (int, optional): A seed to assist reproducibility of the model's response.
|
||||
"""
|
||||
|
||||
preamble: Optional[str] = None
|
||||
|
@ -62,6 +63,7 @@ class CohereChatConfig:
|
|||
presence_penalty: Optional[int] = None
|
||||
tools: Optional[list] = None
|
||||
tool_results: Optional[list] = None
|
||||
seed: Optional[int] = None
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
|
@ -82,6 +84,7 @@ class CohereChatConfig:
|
|||
presence_penalty: Optional[int] = None,
|
||||
tools: Optional[list] = None,
|
||||
tool_results: Optional[list] = None,
|
||||
seed: Optional[int] = None,
|
||||
) -> None:
|
||||
locals_ = locals()
|
||||
for key, value in locals_.items():
|
||||
|
@ -302,5 +305,5 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
|
|
@ -41,13 +41,12 @@ class AsyncHTTPHandler:
|
|||
data: Optional[Union[dict, str]] = None, # type: ignore
|
||||
params: Optional[dict] = None,
|
||||
headers: Optional[dict] = None,
|
||||
stream: bool = False,
|
||||
):
|
||||
response = await self.client.post(
|
||||
url,
|
||||
data=data, # type: ignore
|
||||
params=params,
|
||||
headers=headers,
|
||||
req = self.client.build_request(
|
||||
"POST", url, data=data, params=params, headers=headers # type: ignore
|
||||
)
|
||||
response = await self.client.send(req, stream=stream)
|
||||
return response
|
||||
|
||||
def __del__(self) -> None:
|
||||
|
|
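# A small sketch of the build_request/send pattern above, which lets one POST helper
# serve both buffered and streaming responses. The URL is a placeholder; build_request,
# send(stream=...), and aiter_lines are standard httpx APIs.
import asyncio
import httpx

async def post(url: str, data: str, stream: bool = False) -> httpx.Response:
    async with httpx.AsyncClient() as client:
        req = client.build_request("POST", url, content=data)
        resp = await client.send(req, stream=stream)
        if stream:
            async for line in resp.aiter_lines():
                print(line)
            await resp.aclose()
        return resp

# asyncio.run(post("https://example.com/v1/messages", data='{"stream": true}', stream=True))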
|
@ -311,7 +311,7 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
||||
|
||||
|
|
|
@ -152,9 +152,9 @@ def completion(
|
|||
else:
|
||||
try:
|
||||
if len(completion_response["answer"]) > 0:
|
||||
model_response["choices"][0]["message"][
|
||||
"content"
|
||||
] = completion_response["answer"]
|
||||
model_response["choices"][0]["message"]["content"] = (
|
||||
completion_response["answer"]
|
||||
)
|
||||
except Exception as e:
|
||||
raise MaritalkError(
|
||||
message=response.text, status_code=response.status_code
|
||||
|
@ -174,7 +174,7 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
||||
|
||||
|
|
|
@ -185,9 +185,9 @@ def completion(
|
|||
else:
|
||||
try:
|
||||
if len(completion_response["generated_text"]) > 0:
|
||||
model_response["choices"][0]["message"][
|
||||
"content"
|
||||
] = completion_response["generated_text"]
|
||||
model_response["choices"][0]["message"]["content"] = (
|
||||
completion_response["generated_text"]
|
||||
)
|
||||
except:
|
||||
raise NLPCloudError(
|
||||
message=json.dumps(completion_response),
|
||||
|
@ -205,7 +205,7 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
||||
|
||||
|
|
|
@ -228,7 +228,7 @@ def get_ollama_response(
|
|||
model_response["choices"][0]["message"]["content"] = response_json["response"]
|
||||
model_response["created"] = int(time.time())
|
||||
model_response["model"] = "ollama/" + model
|
||||
prompt_tokens = response_json.get("prompt_eval_count", len(encoding.encode(prompt))) # type: ignore
|
||||
prompt_tokens = response_json.get("prompt_eval_count", len(encoding.encode(prompt, disallowed_special=()))) # type: ignore
|
||||
completion_tokens = response_json.get("eval_count", len(response_json.get("message",dict()).get("content", "")))
|
||||
model_response["usage"] = litellm.Usage(
|
||||
prompt_tokens=prompt_tokens,
|
||||
|
@ -330,7 +330,7 @@ async def ollama_acompletion(url, data, model_response, encoding, logging_obj):
|
|||
]
|
||||
model_response["created"] = int(time.time())
|
||||
model_response["model"] = "ollama/" + data["model"]
|
||||
prompt_tokens = response_json.get("prompt_eval_count", len(encoding.encode(data["prompt"]))) # type: ignore
|
||||
prompt_tokens = response_json.get("prompt_eval_count", len(encoding.encode(data["prompt"], disallowed_special=()))) # type: ignore
|
||||
completion_tokens = response_json.get("eval_count", len(response_json.get("message",dict()).get("content", "")))
|
||||
model_response["usage"] = litellm.Usage(
|
||||
prompt_tokens=prompt_tokens,
|
||||
|
|
|
@ -148,7 +148,7 @@ class OllamaChatConfig:
|
|||
if param == "top_p":
|
||||
optional_params["top_p"] = value
|
||||
if param == "frequency_penalty":
|
||||
optional_params["repeat_penalty"] = param
|
||||
optional_params["repeat_penalty"] = value
|
||||
if param == "stop":
|
||||
optional_params["stop"] = value
|
||||
if param == "response_format" and value["type"] == "json_object":
|
||||
|
@ -184,6 +184,7 @@ class OllamaChatConfig:
|
|||
# ollama implementation
|
||||
def get_ollama_response(
|
||||
api_base="http://localhost:11434",
|
||||
api_key: Optional[str] = None,
|
||||
model="llama2",
|
||||
messages=None,
|
||||
optional_params=None,
|
||||
|
@ -236,6 +237,7 @@ def get_ollama_response(
|
|||
if stream == True:
|
||||
response = ollama_async_streaming(
|
||||
url=url,
|
||||
api_key=api_key,
|
||||
data=data,
|
||||
model_response=model_response,
|
||||
encoding=encoding,
|
||||
|
@ -244,6 +246,7 @@ def get_ollama_response(
|
|||
else:
|
||||
response = ollama_acompletion(
|
||||
url=url,
|
||||
api_key=api_key,
|
||||
data=data,
|
||||
model_response=model_response,
|
||||
encoding=encoding,
|
||||
|
@ -252,12 +255,17 @@ def get_ollama_response(
|
|||
)
|
||||
return response
|
||||
elif stream == True:
|
||||
return ollama_completion_stream(url=url, data=data, logging_obj=logging_obj)
|
||||
return ollama_completion_stream(
|
||||
url=url, api_key=api_key, data=data, logging_obj=logging_obj
|
||||
)
|
||||
|
||||
response = requests.post(
|
||||
url=f"{url}",
|
||||
json=data,
|
||||
)
|
||||
_request = {
|
||||
"url": f"{url}",
|
||||
"json": data,
|
||||
}
|
||||
if api_key is not None:
|
||||
_request["headers"] = "Bearer {}".format(api_key)
|
||||
response = requests.post(**_request) # type: ignore
|
||||
if response.status_code != 200:
|
||||
raise OllamaError(status_code=response.status_code, message=response.text)
|
||||
|
||||
|
@ -307,10 +315,16 @@ def get_ollama_response(
|
|||
return model_response
|
||||
|
||||
|
||||
def ollama_completion_stream(url, data, logging_obj):
|
||||
with httpx.stream(
|
||||
url=url, json=data, method="POST", timeout=litellm.request_timeout
|
||||
) as response:
|
||||
def ollama_completion_stream(url, api_key, data, logging_obj):
|
||||
_request = {
|
||||
"url": f"{url}",
|
||||
"json": data,
|
||||
"method": "POST",
|
||||
"timeout": litellm.request_timeout,
|
||||
}
|
||||
if api_key is not None:
|
||||
_request["headers"] = "Bearer {}".format(api_key)
|
||||
with httpx.stream(**_request) as response:
|
||||
try:
|
||||
if response.status_code != 200:
|
||||
raise OllamaError(
|
||||
|
@ -329,12 +343,20 @@ def ollama_completion_stream(url, data, logging_obj):
|
|||
raise e
|
||||
|
||||
|
||||
async def ollama_async_streaming(url, data, model_response, encoding, logging_obj):
|
||||
async def ollama_async_streaming(
|
||||
url, api_key, data, model_response, encoding, logging_obj
|
||||
):
|
||||
try:
|
||||
client = httpx.AsyncClient()
|
||||
async with client.stream(
|
||||
url=f"{url}", json=data, method="POST", timeout=litellm.request_timeout
|
||||
) as response:
|
||||
_request = {
|
||||
"url": f"{url}",
|
||||
"json": data,
|
||||
"method": "POST",
|
||||
"timeout": litellm.request_timeout,
|
||||
}
|
||||
if api_key is not None:
|
||||
_request["headers"] = "Bearer {}".format(api_key)
|
||||
async with client.stream(**_request) as response:
|
||||
if response.status_code != 200:
|
||||
raise OllamaError(
|
||||
status_code=response.status_code, message=response.text
|
||||
|
@ -353,13 +375,25 @@ async def ollama_async_streaming(url, data, model_response, encoding, logging_ob
|
|||
|
||||
|
||||
async def ollama_acompletion(
|
||||
url, data, model_response, encoding, logging_obj, function_name
|
||||
url,
|
||||
api_key: Optional[str],
|
||||
data,
|
||||
model_response,
|
||||
encoding,
|
||||
logging_obj,
|
||||
function_name,
|
||||
):
|
||||
data["stream"] = False
|
||||
try:
|
||||
timeout = aiohttp.ClientTimeout(total=litellm.request_timeout) # 10 minutes
|
||||
async with aiohttp.ClientSession(timeout=timeout) as session:
|
||||
resp = await session.post(url, json=data)
|
||||
_request = {
|
||||
"url": f"{url}",
|
||||
"json": data,
|
||||
}
|
||||
if api_key is not None:
|
||||
_request["headers"] = "Bearer {}".format(api_key)
|
||||
resp = await session.post(**_request)
|
||||
|
||||
if resp.status != 200:
|
||||
text = await resp.text()
|
||||
|
|
|
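# A rough sketch of the optional api_key handling above: only attach headers when a key
# is provided. I'm assuming the upstream server expects a standard Authorization bearer
# header; the URL and payload below are placeholders.
from typing import Optional

import requests

def post_to_ollama(url: str, data: dict, api_key: Optional[str] = None) -> requests.Response:
    request_kwargs = {"url": url, "json": data}
    if api_key is not None:
        request_kwargs["headers"] = {"Authorization": f"Bearer {api_key}"}
    return requests.post(**request_kwargs)

# post_to_ollama("http://localhost:11434/api/generate", {"model": "llama2", "prompt": "hi"})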
@ -99,9 +99,9 @@ def completion(
|
|||
)
|
||||
else:
|
||||
try:
|
||||
model_response["choices"][0]["message"][
|
||||
"content"
|
||||
] = completion_response["choices"][0]["message"]["content"]
|
||||
model_response["choices"][0]["message"]["content"] = (
|
||||
completion_response["choices"][0]["message"]["content"]
|
||||
)
|
||||
except:
|
||||
raise OobaboogaError(
|
||||
message=json.dumps(completion_response),
|
||||
|
@ -115,7 +115,7 @@ def completion(
|
|||
completion_tokens=completion_response["usage"]["completion_tokens"],
|
||||
total_tokens=completion_response["usage"]["total_tokens"],
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
||||
|
||||
|
|
|
@ -223,7 +223,7 @@ class OpenAITextCompletionConfig:
|
|||
model_response_object.choices = choice_list
|
||||
|
||||
if "usage" in response_object:
|
||||
model_response_object.usage = response_object["usage"]
|
||||
setattr(model_response_object, "usage", response_object["usage"])
|
||||
|
||||
if "id" in response_object:
|
||||
model_response_object.id = response_object["id"]
|
||||
|
@ -447,6 +447,7 @@ class OpenAIChatCompletion(BaseLLM):
|
|||
)
|
||||
else:
|
||||
openai_aclient = client
|
||||
|
||||
## LOGGING
|
||||
logging_obj.pre_call(
|
||||
input=data["messages"],
|
||||
|
|
|
@ -191,7 +191,7 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
||||
|
||||
|
|
|
@ -41,9 +41,9 @@ class PetalsConfig:
|
|||
"""
|
||||
|
||||
max_length: Optional[int] = None
|
||||
max_new_tokens: Optional[
|
||||
int
|
||||
] = litellm.max_tokens # petals requires max tokens to be set
|
||||
max_new_tokens: Optional[int] = (
|
||||
litellm.max_tokens
|
||||
) # petals requires max tokens to be set
|
||||
do_sample: Optional[bool] = None
|
||||
temperature: Optional[float] = None
|
||||
top_k: Optional[int] = None
|
||||
|
@ -203,7 +203,7 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
||||
|
||||
|
|
|
@ -3,8 +3,14 @@ import requests, traceback
|
|||
import json, re, xml.etree.ElementTree as ET
|
||||
from jinja2 import Template, exceptions, meta, BaseLoader
|
||||
from jinja2.sandbox import ImmutableSandboxedEnvironment
|
||||
from typing import Optional, Any
|
||||
from typing import List
|
||||
from typing import (
|
||||
Any,
|
||||
List,
|
||||
Mapping,
|
||||
MutableMapping,
|
||||
Optional,
|
||||
Sequence,
|
||||
)
|
||||
import litellm
|
||||
|
||||
|
||||
|
@ -145,6 +151,12 @@ def mistral_api_pt(messages):
|
|||
elif isinstance(m["content"], str):
|
||||
texts = m["content"]
|
||||
new_m = {"role": m["role"], "content": texts}
|
||||
|
||||
if new_m["role"] == "tool" and m.get("name"):
|
||||
new_m["name"] = m["name"]
|
||||
if m.get("tool_calls"):
|
||||
new_m["tool_calls"] = m["tool_calls"]
|
||||
|
||||
new_messages.append(new_m)
|
||||
return new_messages
|
||||
|
||||
|
@ -218,6 +230,26 @@ def phind_codellama_pt(messages):
|
|||
return prompt
|
||||
|
||||
|
||||
known_tokenizer_config = {
|
||||
"mistralai/Mistral-7B-Instruct-v0.1": {
|
||||
"tokenizer": {
|
||||
"chat_template": "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token + ' ' }}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}",
|
||||
"bos_token": "<s>",
|
||||
"eos_token": "</s>",
|
||||
},
|
||||
"status": "success",
|
||||
},
|
||||
"meta-llama/Meta-Llama-3-8B-Instruct": {
|
||||
"tokenizer": {
|
||||
"chat_template": "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}",
|
||||
"bos_token": "<|begin_of_text|>",
|
||||
"eos_token": "",
|
||||
},
|
||||
"status": "success",
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
def hf_chat_template(model: str, messages: list, chat_template: Optional[Any] = None):
|
||||
# Define Jinja2 environment
|
||||
env = ImmutableSandboxedEnvironment()
|
||||
|
@ -246,20 +278,23 @@ def hf_chat_template(model: str, messages: list, chat_template: Optional[Any] =
|
|||
else:
|
||||
return {"status": "failure"}
|
||||
|
||||
tokenizer_config = _get_tokenizer_config(model)
|
||||
if model in known_tokenizer_config:
|
||||
tokenizer_config = known_tokenizer_config[model]
|
||||
else:
|
||||
tokenizer_config = _get_tokenizer_config(model)
|
||||
if (
|
||||
tokenizer_config["status"] == "failure"
|
||||
or "chat_template" not in tokenizer_config["tokenizer"]
|
||||
):
|
||||
raise Exception("No chat template found")
|
||||
## read the bos token, eos token and chat template from the json
|
||||
tokenizer_config = tokenizer_config["tokenizer"]
|
||||
bos_token = tokenizer_config["bos_token"]
|
||||
eos_token = tokenizer_config["eos_token"]
|
||||
chat_template = tokenizer_config["chat_template"]
|
||||
tokenizer_config = tokenizer_config["tokenizer"] # type: ignore
|
||||
|
||||
bos_token = tokenizer_config["bos_token"] # type: ignore
|
||||
eos_token = tokenizer_config["eos_token"] # type: ignore
|
||||
chat_template = tokenizer_config["chat_template"] # type: ignore
|
||||
try:
|
||||
template = env.from_string(chat_template)
|
||||
template = env.from_string(chat_template) # type: ignore
|
||||
except Exception as e:
|
||||
raise e
|
||||
|
||||
|
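# A minimal sketch of how a cached tokenizer chat_template is rendered above: the
# template is compiled in a sandboxed Jinja2 environment and fed the message list plus
# bos/eos tokens. The template here is a simplified version of the Mistral one from
# known_tokenizer_config; the messages are made up.
from jinja2.sandbox import ImmutableSandboxedEnvironment

chat_template = (
    "{{ bos_token }}{% for message in messages %}"
    "{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}"
    "{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token + ' ' }}"
    "{% endif %}{% endfor %}"
)

env = ImmutableSandboxedEnvironment()
template = env.from_string(chat_template)
prompt = template.render(
    bos_token="<s>",
    eos_token="</s>",
    messages=[
        {"role": "user", "content": "What is LiteLLM?"},
        {"role": "assistant", "content": "A unified LLM client."},
        {"role": "user", "content": "Thanks!"},
    ],
)
print(prompt)
# <s>[INST] What is LiteLLM? [/INST]A unified LLM client.</s> [INST] Thanks! [/INST]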
@ -402,6 +437,35 @@ def format_prompt_togetherai(messages, prompt_format, chat_template):
|
|||
return prompt
|
||||
|
||||
|
||||
### IBM Granite
|
||||
|
||||
|
||||
def ibm_granite_pt(messages: list):
|
||||
"""
|
||||
IBM's Granite models uses the template:
|
||||
<|system|> {system_message} <|user|> {user_message} <|assistant|> {assistant_message}
|
||||
|
||||
See: https://www.ibm.com/docs/en/watsonx-as-a-service?topic=solutions-supported-foundation-models
|
||||
"""
|
||||
return custom_prompt(
|
||||
messages=messages,
|
||||
role_dict={
|
||||
"system": {
|
||||
"pre_message": "<|system|>\n",
|
||||
"post_message": "\n",
|
||||
},
|
||||
"user": {
|
||||
"pre_message": "<|user|>\n",
|
||||
"post_message": "\n",
|
||||
},
|
||||
"assistant": {
|
||||
"pre_message": "<|assistant|>\n",
|
||||
"post_message": "\n",
|
||||
},
|
||||
},
|
||||
).strip()
|
||||
|
||||
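# A tiny approximation of the Granite chat format described above: each role is wrapped
# in <|system|> / <|user|> / <|assistant|> markers, one turn per block. The conversation
# below is hypothetical and this stands in for the custom_prompt call in the diff.
def granite_prompt(messages: list) -> str:
    markers = {"system": "<|system|>", "user": "<|user|>", "assistant": "<|assistant|>"}
    return "\n".join(f"{markers[m['role']]}\n{m['content']}" for m in messages).strip()

print(granite_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize LiteLLM in one line."},
]))
# <|system|>
# You are a helpful assistant.
# <|user|>
# Summarize LiteLLM in one line.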
|
||||
### ANTHROPIC ###
|
||||
|
||||
|
||||
|
@ -466,10 +530,11 @@ def construct_tool_use_system_prompt(
|
|||
): # from https://github.com/anthropics/anthropic-cookbook/blob/main/function_calling/function_calling.ipynb
|
||||
tool_str_list = []
|
||||
for tool in tools:
|
||||
tool_function = get_attribute_or_key(tool, "function")
|
||||
tool_str = construct_format_tool_for_claude_prompt(
|
||||
tool["function"]["name"],
|
||||
tool["function"].get("description", ""),
|
||||
tool["function"].get("parameters", {}),
|
||||
get_attribute_or_key(tool_function, "name"),
|
||||
get_attribute_or_key(tool_function, "description", ""),
|
||||
get_attribute_or_key(tool_function, "parameters", {}),
|
||||
)
|
||||
tool_str_list.append(tool_str)
|
||||
tool_use_system_prompt = (
|
||||
|
@ -593,7 +658,8 @@ def convert_to_anthropic_tool_result_xml(message: dict) -> str:
|
|||
</function_results>
|
||||
"""
|
||||
name = message.get("name")
|
||||
content = message.get("content")
|
||||
content = message.get("content", "")
|
||||
content = content.replace("<", "<").replace(">", ">").replace("&", "&")
|
||||
|
||||
# We can't determine from openai message format whether it's a successful or
|
||||
# error call result so default to the successful result template
|
||||
|
@ -614,13 +680,15 @@ def convert_to_anthropic_tool_result_xml(message: dict) -> str:
|
|||
def convert_to_anthropic_tool_invoke_xml(tool_calls: list) -> str:
|
||||
invokes = ""
|
||||
for tool in tool_calls:
|
||||
if tool["type"] != "function":
|
||||
if get_attribute_or_key(tool, "type") != "function":
|
||||
continue
|
||||
|
||||
tool_name = tool["function"]["name"]
|
||||
tool_function = get_attribute_or_key(tool, "function")
|
||||
tool_name = get_attribute_or_key(tool_function, "name")
|
||||
tool_arguments = get_attribute_or_key(tool_function, "arguments")
|
||||
parameters = "".join(
|
||||
f"<{param}>{val}</{param}>\n"
|
||||
for param, val in json.loads(tool["function"]["arguments"]).items()
|
||||
for param, val in json.loads(tool_arguments).items()
|
||||
)
|
||||
invokes += (
|
||||
"<invoke>\n"
|
||||
|
@ -674,7 +742,7 @@ def anthropic_messages_pt_xml(messages: list):
|
|||
{
|
||||
"type": "text",
|
||||
"text": (
|
||||
convert_to_anthropic_tool_result(messages[msg_i])
|
||||
convert_to_anthropic_tool_result_xml(messages[msg_i])
|
||||
if messages[msg_i]["role"] == "tool"
|
||||
else messages[msg_i]["content"]
|
||||
),
|
||||
|
@ -695,7 +763,7 @@ def anthropic_messages_pt_xml(messages: list):
|
|||
if messages[msg_i].get(
|
||||
"tool_calls", []
|
||||
): # support assistant tool invoke convertion
|
||||
assistant_text += convert_to_anthropic_tool_invoke( # type: ignore
|
||||
assistant_text += convert_to_anthropic_tool_invoke_xml( # type: ignore
|
||||
messages[msg_i]["tool_calls"]
|
||||
)
|
||||
|
||||
|
@ -807,12 +875,18 @@ def convert_to_anthropic_tool_invoke(tool_calls: list) -> list:
|
|||
anthropic_tool_invoke = [
|
||||
{
|
||||
"type": "tool_use",
|
||||
"id": tool["id"],
|
||||
"name": tool["function"]["name"],
|
||||
"input": json.loads(tool["function"]["arguments"]),
|
||||
"id": get_attribute_or_key(tool, "id"),
|
||||
"name": get_attribute_or_key(
|
||||
get_attribute_or_key(tool, "function"), "name"
|
||||
),
|
||||
"input": json.loads(
|
||||
get_attribute_or_key(
|
||||
get_attribute_or_key(tool, "function"), "arguments"
|
||||
)
|
||||
),
|
||||
}
|
||||
for tool in tool_calls
|
||||
if tool["type"] == "function"
|
||||
if get_attribute_or_key(tool, "type") == "function"
|
||||
]
|
||||
|
||||
return anthropic_tool_invoke
|
||||
|
@ -978,6 +1052,30 @@ def get_system_prompt(messages):
|
|||
return system_prompt, messages
|
||||
|
||||
|
||||
def convert_to_documents(
|
||||
observations: Any,
|
||||
) -> List[MutableMapping]:
|
||||
"""Converts observations into a 'document' dict"""
|
||||
documents: List[MutableMapping] = []
|
||||
if isinstance(observations, str):
|
||||
# strings are turned into a key/value pair and a key of 'output' is added.
|
||||
observations = [{"output": observations}]
|
||||
elif isinstance(observations, Mapping):
|
||||
# single mappings are transformed into a list to simplify the rest of the code.
|
||||
observations = [observations]
|
||||
elif not isinstance(observations, Sequence):
|
||||
# all other types are turned into a key/value pair within a list
|
||||
observations = [{"output": observations}]
|
||||
|
||||
for doc in observations:
|
||||
if not isinstance(doc, Mapping):
|
||||
# types that aren't Mapping are turned into a key/value pair.
|
||||
doc = {"output": doc}
|
||||
documents.append(doc)
|
||||
|
||||
return documents
|
||||
|
||||
|
||||
def convert_openai_message_to_cohere_tool_result(message):
|
||||
"""
|
||||
OpenAI message with a tool result looks like:
|
||||
|
@ -1019,7 +1117,7 @@ def convert_openai_message_to_cohere_tool_result(message):
|
|||
"parameters": {"location": "San Francisco, CA"},
|
||||
"generation_id": tool_call_id,
|
||||
},
|
||||
"outputs": [content],
|
||||
"outputs": convert_to_documents(content),
|
||||
}
|
||||
return cohere_tool_result
|
||||
|
||||
|
@ -1032,8 +1130,9 @@ def cohere_message_pt(messages: list):
|
|||
if message["role"] == "tool":
|
||||
tool_result = convert_openai_message_to_cohere_tool_result(message)
|
||||
tool_results.append(tool_result)
|
||||
else:
|
||||
prompt += message["content"]
|
||||
elif message.get("content"):
|
||||
prompt += message["content"] + "\n\n"
|
||||
prompt = prompt.rstrip()
|
||||
return prompt, tool_results
|
||||
|
||||
|
||||
|
@ -1107,12 +1206,6 @@ def _gemini_vision_convert_messages(messages: list):
|
|||
Returns:
|
||||
tuple: A tuple containing the prompt (a string) and the processed images (a list of objects representing the images).
|
||||
"""
|
||||
try:
|
||||
from PIL import Image
|
||||
except:
|
||||
raise Exception(
|
||||
"gemini image conversion failed please run `pip install Pillow`"
|
||||
)
|
||||
|
||||
try:
|
||||
# given messages for gpt-4 vision, convert them for gemini
|
||||
|
@ -1139,6 +1232,12 @@ def _gemini_vision_convert_messages(messages: list):
|
|||
image = _load_image_from_url(img)
|
||||
processed_images.append(image)
|
||||
else:
|
||||
try:
|
||||
from PIL import Image
|
||||
except:
|
||||
raise Exception(
|
||||
"gemini image conversion failed please run `pip install Pillow`"
|
||||
)
|
||||
# Case 2: Image filepath (e.g. temp.jpeg) given
|
||||
image = Image.open(img)
|
||||
processed_images.append(image)
|
||||
|
@ -1306,18 +1405,62 @@ def prompt_factory(
|
|||
return anthropic_pt(messages=messages)
|
||||
elif "mistral." in model:
|
||||
return mistral_instruct_pt(messages=messages)
|
||||
elif "llama2" in model and "chat" in model:
|
||||
return llama_2_chat_pt(messages=messages)
|
||||
elif "llama3" in model and "instruct" in model:
|
||||
return hf_chat_template(
|
||||
model="meta-llama/Meta-Llama-3-8B-Instruct",
|
||||
messages=messages,
|
||||
)
|
||||
|
||||
elif custom_llm_provider == "clarifai":
|
||||
if "claude" in model:
|
||||
return anthropic_pt(messages=messages)
|
||||
|
||||
elif custom_llm_provider == "perplexity":
|
||||
for message in messages:
|
||||
message.pop("name", None)
|
||||
return messages
|
||||
elif custom_llm_provider == "azure_text":
|
||||
return azure_text_pt(messages=messages)
|
||||
    elif custom_llm_provider == "watsonx":
        if "granite" in model and "chat" in model:
            # granite-13b-chat-v1 and granite-13b-chat-v2 use a specific prompt template
            return ibm_granite_pt(messages=messages)
        elif "ibm-mistral" in model and "instruct" in model:
            # models like ibm-mistral/mixtral-8x7b-instruct-v01-q use the mistral instruct prompt template
            return mistral_instruct_pt(messages=messages)
        elif "meta-llama/llama-3" in model and "instruct" in model:
            # https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/
            return custom_prompt(
                role_dict={
                    "system": {
                        "pre_message": "<|start_header_id|>system<|end_header_id|>\n",
                        "post_message": "<|eot_id|>",
                    },
                    "user": {
                        "pre_message": "<|start_header_id|>user<|end_header_id|>\n",
                        "post_message": "<|eot_id|>",
                    },
                    "assistant": {
                        "pre_message": "<|start_header_id|>assistant<|end_header_id|>\n",
                        "post_message": "<|eot_id|>",
                    },
                },
                messages=messages,
                initial_prompt_value="<|begin_of_text|>",
                final_prompt_value="<|start_header_id|>assistant<|end_header_id|>\n",
            )
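            # Rendered prompt sketch for a short system + user exchange (illustrative example,
            # not output captured from a real call):
            #   <|begin_of_text|><|start_header_id|>system<|end_header_id|>
            #   You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
            #   Hi!<|eot_id|><|start_header_id|>assistant<|end_header_id|>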
try:
|
||||
if "meta-llama/llama-2" in model and "chat" in model:
|
||||
return llama_2_chat_pt(messages=messages)
|
||||
elif (
|
||||
"meta-llama/llama-3" in model or "meta-llama-3" in model
|
||||
) and "instruct" in model:
|
||||
return hf_chat_template(
|
||||
model="meta-llama/Meta-Llama-3-8B-Instruct",
|
||||
messages=messages,
|
||||
)
|
||||
elif (
|
||||
"tiiuae/falcon" in model
|
||||
): # Note: for the instruct models, it's best to use a User: .., Assistant:.. approach in your prompt template.
|
||||
|
@ -1358,3 +1501,9 @@ def prompt_factory(
|
|||
return default_pt(
|
||||
messages=messages
|
||||
) # default that covers Bloom, T-5, any non-chat tuned model (e.g. base Llama2)
|
||||
|
||||
|
||||
def get_attribute_or_key(tool_or_function, attribute, default=None):
    if hasattr(tool_or_function, attribute):
        return getattr(tool_or_function, attribute)
    return tool_or_function.get(attribute, default)
|
||||
|
|
|
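get_attribute_or_key is what lets the tool-call converters above accept tool calls either as plain dicts or as attribute-style SDK objects. A minimal sketch of the two paths, using an illustrative stand-in class:

    get_attribute_or_key({"type": "function"}, "type")  # dict path -> "function"

    class _ToolCall:  # illustrative stand-in for an SDK object
        type = "function"

    get_attribute_or_key(_ToolCall(), "type")  # attribute path -> "function"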
@ -112,10 +112,16 @@ def start_prediction(
|
|||
}
|
||||
|
||||
initial_prediction_data = {
|
||||
"version": version_id,
|
||||
"input": input_data,
|
||||
}
|
||||
|
||||
if ":" in version_id and len(version_id) > 64:
|
||||
model_parts = version_id.split(":")
|
||||
if (
|
||||
len(model_parts) > 1 and len(model_parts[1]) == 64
|
||||
): ## checks if the model name carries a 64-character version hash - e.g. "meta/llama-2-70b-chat:02e509c789964a7ea8736978a43525956ef40397be9033abf9fd2badfe68c9e3"
|
||||
initial_prediction_data["version"] = model_parts[1]
|
||||
|
||||
## LOGGING
|
||||
logging_obj.pre_call(
|
||||
input=input_data["prompt"],
|
||||
|
@ -307,9 +313,7 @@ def completion(
|
|||
result, logs = handle_prediction_response(
|
||||
prediction_url, api_key, print_verbose
|
||||
)
|
||||
model_response["ended"] = (
|
||||
time.time()
|
||||
) # for pricing this must remain right after calling api
|
||||
|
||||
## LOGGING
|
||||
logging_obj.post_call(
|
||||
input=prompt,
|
||||
|
@ -345,7 +349,7 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
||||
|
||||
|
|
|
@ -399,7 +399,7 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
||||
|
||||
|
@ -617,7 +617,7 @@ async def async_completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
||||
|
||||
|
|
|
@ -2,6 +2,7 @@
|
|||
Deprecated. We now do together ai calls via the openai client.
|
||||
Reference: https://docs.together.ai/docs/openai-api-compatibility
|
||||
"""
|
||||
|
||||
import os, types
|
||||
import json
|
||||
from enum import Enum
|
||||
|
@ -225,7 +226,7 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
||||
|
||||
|
|
|
@ -22,6 +22,35 @@ class VertexAIError(Exception):
|
|||
) # Call the base class constructor with the parameters it needs
|
||||
|
||||
|
||||
class ExtendedGenerationConfig(dict):
|
||||
"""Extended parameters for the generation."""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
temperature: Optional[float] = None,
|
||||
top_p: Optional[float] = None,
|
||||
top_k: Optional[int] = None,
|
||||
candidate_count: Optional[int] = None,
|
||||
max_output_tokens: Optional[int] = None,
|
||||
stop_sequences: Optional[List[str]] = None,
|
||||
response_mime_type: Optional[str] = None,
|
||||
frequency_penalty: Optional[float] = None,
|
||||
presence_penalty: Optional[float] = None,
|
||||
):
|
||||
super().__init__(
|
||||
temperature=temperature,
|
||||
top_p=top_p,
|
||||
top_k=top_k,
|
||||
candidate_count=candidate_count,
|
||||
max_output_tokens=max_output_tokens,
|
||||
stop_sequences=stop_sequences,
|
||||
response_mime_type=response_mime_type,
|
||||
frequency_penalty=frequency_penalty,
|
||||
presence_penalty=presence_penalty,
|
||||
)
|
||||
|
||||
|
||||
class VertexAIConfig:
|
||||
"""
|
||||
Reference: https://cloud.google.com/vertex-ai/docs/generative-ai/chat/test-chat-prompts
|
||||
|
@ -43,6 +72,10 @@ class VertexAIConfig:
|
|||
|
||||
- `stop_sequences` (List[str]): The set of character sequences (up to 5) that will stop output generation. If specified, the API will stop at the first appearance of a stop sequence. The stop sequence will not be included as part of the response.
|
||||
|
||||
- `frequency_penalty` (float): This parameter is used to penalize the model from repeating the same output. The default value is 0.0.
|
||||
|
||||
- `presence_penalty` (float): This parameter is used to penalize the model from generating the same output as the input. The default value is 0.0.
|
||||
|
||||
Note: Please make sure to modify the default parameters as required for your use case.
|
||||
"""
|
||||
|
||||
|
@ -53,6 +86,8 @@ class VertexAIConfig:
|
|||
response_mime_type: Optional[str] = None
|
||||
candidate_count: Optional[int] = None
|
||||
stop_sequences: Optional[list] = None
|
||||
frequency_penalty: Optional[float] = None
|
||||
presence_penalty: Optional[float] = None
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
|
@ -63,6 +98,8 @@ class VertexAIConfig:
|
|||
response_mime_type: Optional[str] = None,
|
||||
candidate_count: Optional[int] = None,
|
||||
stop_sequences: Optional[list] = None,
|
||||
frequency_penalty: Optional[float] = None,
|
||||
presence_penalty: Optional[float] = None,
|
||||
) -> None:
|
||||
locals_ = locals()
|
||||
for key, value in locals_.items():
|
||||
|
@ -106,7 +143,9 @@ class VertexAIConfig:
|
|||
optional_params["temperature"] = value
|
||||
if param == "top_p":
|
||||
optional_params["top_p"] = value
|
||||
if param == "stream":
|
||||
if (
|
||||
param == "stream" and value == True
|
||||
): # sending stream = False, can cause it to get passed unchecked and raise issues
|
||||
optional_params["stream"] = value
|
||||
if param == "n":
|
||||
optional_params["candidate_count"] = value
|
||||
|
@ -119,6 +158,10 @@ class VertexAIConfig:
|
|||
optional_params["max_output_tokens"] = value
|
||||
if param == "response_format" and value["type"] == "json_object":
|
||||
optional_params["response_mime_type"] = "application/json"
|
||||
if param == "frequency_penalty":
|
||||
optional_params["frequency_penalty"] = value
|
||||
if param == "presence_penalty":
|
||||
optional_params["presence_penalty"] = value
|
||||
if param == "tools" and isinstance(value, list):
|
||||
from vertexai.preview import generative_models
|
||||
|
||||
|
@ -141,6 +184,20 @@ class VertexAIConfig:
|
|||
pass
|
||||
return optional_params
|
||||
|
||||
def get_mapped_special_auth_params(self) -> dict:
|
||||
"""
|
||||
Common auth params across bedrock/vertex_ai/azure/watsonx
|
||||
"""
|
||||
return {"project": "vertex_project", "region_name": "vertex_location"}
|
||||
|
||||
def map_special_auth_params(self, non_default_params: dict, optional_params: dict):
|
||||
mapped_params = self.get_mapped_special_auth_params()
|
||||
|
||||
for param, value in non_default_params.items():
|
||||
if param in mapped_params:
|
||||
optional_params[mapped_params[param]] = value
|
||||
return optional_params
|
||||
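# Illustrative effect of the mapping above (argument names follow get_mapped_special_auth_params;
# the values are placeholders):
#   completion(..., project="my-gcp-project", region_name="us-central1")
#   -> optional_params["vertex_project"] = "my-gcp-project"
#   -> optional_params["vertex_location"] = "us-central1"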
|
||||
|
||||
import asyncio
|
||||
|
||||
|
@ -184,8 +241,7 @@ def _get_image_bytes_from_url(image_url: str) -> bytes:
|
|||
image_bytes = response.content
|
||||
return image_bytes
|
||||
except requests.exceptions.RequestException as e:
|
||||
# Handle any request exceptions (e.g., connection error, timeout)
|
||||
return b"" # Return an empty bytes object or handle the error as needed
|
||||
raise Exception(f"An exception occurs with this image - {str(e)}")
|
||||
|
||||
|
||||
def _load_image_from_url(image_url: str):
|
||||
|
@ -206,7 +262,8 @@ def _load_image_from_url(image_url: str):
|
|||
)
|
||||
|
||||
image_bytes = _get_image_bytes_from_url(image_url)
|
||||
return Image.from_bytes(image_bytes)
|
||||
|
||||
return Image.from_bytes(data=image_bytes)
|
||||
|
||||
|
||||
def _gemini_vision_convert_messages(messages: list):
|
||||
|
@ -363,42 +420,6 @@ def completion(
|
|||
from google.cloud.aiplatform_v1beta1.types import content as gapic_content_types # type: ignore
|
||||
import google.auth # type: ignore
|
||||
|
||||
class ExtendedGenerationConfig(GenerationConfig):
|
||||
"""Extended parameters for the generation."""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
temperature: Optional[float] = None,
|
||||
top_p: Optional[float] = None,
|
||||
top_k: Optional[int] = None,
|
||||
candidate_count: Optional[int] = None,
|
||||
max_output_tokens: Optional[int] = None,
|
||||
stop_sequences: Optional[List[str]] = None,
|
||||
response_mime_type: Optional[str] = None,
|
||||
):
|
||||
args_spec = inspect.getfullargspec(gapic_content_types.GenerationConfig)
|
||||
|
||||
if "response_mime_type" in args_spec.args:
|
||||
self._raw_generation_config = gapic_content_types.GenerationConfig(
|
||||
temperature=temperature,
|
||||
top_p=top_p,
|
||||
top_k=top_k,
|
||||
candidate_count=candidate_count,
|
||||
max_output_tokens=max_output_tokens,
|
||||
stop_sequences=stop_sequences,
|
||||
response_mime_type=response_mime_type,
|
||||
)
|
||||
else:
|
||||
self._raw_generation_config = gapic_content_types.GenerationConfig(
|
||||
temperature=temperature,
|
||||
top_p=top_p,
|
||||
top_k=top_k,
|
||||
candidate_count=candidate_count,
|
||||
max_output_tokens=max_output_tokens,
|
||||
stop_sequences=stop_sequences,
|
||||
)
|
||||
|
||||
## Load credentials with the correct quota project ref: https://github.com/googleapis/python-aiplatform/issues/2557#issuecomment-1709284744
|
||||
print_verbose(
|
||||
f"VERTEX AI: vertex_project={vertex_project}; vertex_location={vertex_location}"
|
||||
|
@ -522,6 +543,7 @@ def completion(
|
|||
"instances": instances,
|
||||
"vertex_location": vertex_location,
|
||||
"vertex_project": vertex_project,
|
||||
"safety_settings": safety_settings,
|
||||
**optional_params,
|
||||
}
|
||||
if optional_params.get("stream", False) is True:
|
||||
|
@ -536,8 +558,9 @@ def completion(
|
|||
tools = optional_params.pop("tools", None)
|
||||
prompt, images = _gemini_vision_convert_messages(messages=messages)
|
||||
content = [prompt] + images
|
||||
if "stream" in optional_params and optional_params["stream"] == True:
|
||||
stream = optional_params.pop("stream")
|
||||
stream = optional_params.pop("stream", False)
|
||||
if stream == True:
|
||||
|
||||
request_str += f"response = llm_model.generate_content({content}, generation_config=GenerationConfig(**{optional_params}), safety_settings={safety_settings}, stream={stream})\n"
|
||||
logging_obj.pre_call(
|
||||
input=prompt,
|
||||
|
@ -550,12 +573,12 @@ def completion(
|
|||
|
||||
model_response = llm_model.generate_content(
|
||||
contents=content,
|
||||
generation_config=ExtendedGenerationConfig(**optional_params),
|
||||
generation_config=optional_params,
|
||||
safety_settings=safety_settings,
|
||||
stream=True,
|
||||
tools=tools,
|
||||
)
|
||||
optional_params["stream"] = True
|
||||
|
||||
return model_response
|
||||
|
||||
request_str += f"response = llm_model.generate_content({content})\n"
|
||||
|
@ -572,7 +595,7 @@ def completion(
|
|||
## LLM Call
|
||||
response = llm_model.generate_content(
|
||||
contents=content,
|
||||
generation_config=ExtendedGenerationConfig(**optional_params),
|
||||
generation_config=optional_params,
|
||||
safety_settings=safety_settings,
|
||||
tools=tools,
|
||||
)
|
||||
|
@ -627,7 +650,7 @@ def completion(
|
|||
},
|
||||
)
|
||||
model_response = chat.send_message_streaming(prompt, **optional_params)
|
||||
optional_params["stream"] = True
|
||||
|
||||
return model_response
|
||||
|
||||
request_str += f"chat.send_message({prompt}, **{optional_params}).text\n"
|
||||
|
@ -659,7 +682,7 @@ def completion(
|
|||
},
|
||||
)
|
||||
model_response = llm_model.predict_streaming(prompt, **optional_params)
|
||||
optional_params["stream"] = True
|
||||
|
||||
return model_response
|
||||
|
||||
request_str += f"llm_model.predict({prompt}, **{optional_params}).text\n"
|
||||
|
@ -784,7 +807,7 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
except Exception as e:
|
||||
raise VertexAIError(status_code=500, message=str(e))
|
||||
|
@ -805,55 +828,18 @@ async def async_completion(
|
|||
instances=None,
|
||||
vertex_project=None,
|
||||
vertex_location=None,
|
||||
safety_settings=None,
|
||||
**optional_params,
|
||||
):
|
||||
"""
|
||||
Add support for acompletion calls for gemini-pro
|
||||
"""
|
||||
try:
|
||||
from vertexai.preview.generative_models import GenerationConfig
|
||||
from google.cloud.aiplatform_v1beta1.types import content as gapic_content_types # type: ignore
|
||||
|
||||
class ExtendedGenerationConfig(GenerationConfig):
|
||||
"""Extended parameters for the generation."""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
temperature: Optional[float] = None,
|
||||
top_p: Optional[float] = None,
|
||||
top_k: Optional[int] = None,
|
||||
candidate_count: Optional[int] = None,
|
||||
max_output_tokens: Optional[int] = None,
|
||||
stop_sequences: Optional[List[str]] = None,
|
||||
response_mime_type: Optional[str] = None,
|
||||
):
|
||||
args_spec = inspect.getfullargspec(gapic_content_types.GenerationConfig)
|
||||
|
||||
if "response_mime_type" in args_spec.args:
|
||||
self._raw_generation_config = gapic_content_types.GenerationConfig(
|
||||
temperature=temperature,
|
||||
top_p=top_p,
|
||||
top_k=top_k,
|
||||
candidate_count=candidate_count,
|
||||
max_output_tokens=max_output_tokens,
|
||||
stop_sequences=stop_sequences,
|
||||
response_mime_type=response_mime_type,
|
||||
)
|
||||
else:
|
||||
self._raw_generation_config = gapic_content_types.GenerationConfig(
|
||||
temperature=temperature,
|
||||
top_p=top_p,
|
||||
top_k=top_k,
|
||||
candidate_count=candidate_count,
|
||||
max_output_tokens=max_output_tokens,
|
||||
stop_sequences=stop_sequences,
|
||||
)
|
||||
|
||||
if mode == "vision":
|
||||
print_verbose("\nMaking VertexAI Gemini Pro Vision Call")
|
||||
print_verbose("\nMaking VertexAI Gemini Pro/Vision Call")
|
||||
print_verbose(f"\nProcessing input messages = {messages}")
|
||||
tools = optional_params.pop("tools", None)
|
||||
stream = optional_params.pop("stream", False)
|
||||
|
||||
prompt, images = _gemini_vision_convert_messages(messages=messages)
|
||||
content = [prompt] + images
|
||||
|
@ -870,9 +856,11 @@ async def async_completion(
|
|||
)
|
||||
|
||||
## LLM Call
|
||||
# print(f"final content: {content}")
|
||||
response = await llm_model._generate_content_async(
|
||||
contents=content,
|
||||
generation_config=ExtendedGenerationConfig(**optional_params),
|
||||
generation_config=optional_params,
|
||||
safety_settings=safety_settings,
|
||||
tools=tools,
|
||||
)
|
||||
|
||||
|
@ -1030,7 +1018,7 @@ async def async_completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
except Exception as e:
|
||||
raise VertexAIError(status_code=500, message=str(e))
|
||||
|
@ -1051,50 +1039,12 @@ async def async_streaming(
|
|||
instances=None,
|
||||
vertex_project=None,
|
||||
vertex_location=None,
|
||||
safety_settings=None,
|
||||
**optional_params,
|
||||
):
|
||||
"""
|
||||
Add support for async streaming calls for gemini-pro
|
||||
"""
|
||||
from vertexai.preview.generative_models import GenerationConfig
|
||||
from google.cloud.aiplatform_v1beta1.types import content as gapic_content_types # type: ignore
|
||||
|
||||
class ExtendedGenerationConfig(GenerationConfig):
|
||||
"""Extended parameters for the generation."""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
temperature: Optional[float] = None,
|
||||
top_p: Optional[float] = None,
|
||||
top_k: Optional[int] = None,
|
||||
candidate_count: Optional[int] = None,
|
||||
max_output_tokens: Optional[int] = None,
|
||||
stop_sequences: Optional[List[str]] = None,
|
||||
response_mime_type: Optional[str] = None,
|
||||
):
|
||||
args_spec = inspect.getfullargspec(gapic_content_types.GenerationConfig)
|
||||
|
||||
if "response_mime_type" in args_spec.args:
|
||||
self._raw_generation_config = gapic_content_types.GenerationConfig(
|
||||
temperature=temperature,
|
||||
top_p=top_p,
|
||||
top_k=top_k,
|
||||
candidate_count=candidate_count,
|
||||
max_output_tokens=max_output_tokens,
|
||||
stop_sequences=stop_sequences,
|
||||
response_mime_type=response_mime_type,
|
||||
)
|
||||
else:
|
||||
self._raw_generation_config = gapic_content_types.GenerationConfig(
|
||||
temperature=temperature,
|
||||
top_p=top_p,
|
||||
top_k=top_k,
|
||||
candidate_count=candidate_count,
|
||||
max_output_tokens=max_output_tokens,
|
||||
stop_sequences=stop_sequences,
|
||||
)
|
||||
|
||||
if mode == "vision":
|
||||
stream = optional_params.pop("stream")
|
||||
tools = optional_params.pop("tools", None)
|
||||
|
@ -1115,11 +1065,11 @@ async def async_streaming(
|
|||
|
||||
response = await llm_model._generate_content_streaming_async(
|
||||
contents=content,
|
||||
generation_config=ExtendedGenerationConfig(**optional_params),
|
||||
generation_config=optional_params,
|
||||
safety_settings=safety_settings,
|
||||
tools=tools,
|
||||
)
|
||||
optional_params["stream"] = True
|
||||
optional_params["tools"] = tools
|
||||
|
||||
elif mode == "chat":
|
||||
chat = llm_model.start_chat()
|
||||
optional_params.pop(
|
||||
|
@ -1138,7 +1088,7 @@ async def async_streaming(
|
|||
},
|
||||
)
|
||||
response = chat.send_message_streaming_async(prompt, **optional_params)
|
||||
optional_params["stream"] = True
|
||||
|
||||
elif mode == "text":
|
||||
optional_params.pop(
|
||||
"stream", None
|
||||
|
|
|
@ -349,7 +349,7 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
except Exception as e:
|
||||
raise VertexAIError(status_code=500, message=str(e))
|
||||
|
@ -422,7 +422,7 @@ async def async_completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
||||
|
||||
|
|
|
@ -104,7 +104,7 @@ def completion(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
return model_response
|
||||
|
||||
|
||||
|
@ -186,7 +186,7 @@ def batch_completions(
|
|||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
)
|
||||
model_response.usage = usage
|
||||
setattr(model_response, "usage", usage)
|
||||
final_outputs.append(model_response)
|
||||
return final_outputs
|
||||
|
||||
|
|
609 litellm/llms/watsonx.py Normal file
|
@ -0,0 +1,609 @@
|
|||
from enum import Enum
|
||||
import json, types, time # noqa: E401
|
||||
from contextlib import contextmanager
|
||||
from typing import Callable, Dict, Optional, Any, Union, List
|
||||
|
||||
import httpx
|
||||
import requests
|
||||
import litellm
|
||||
from litellm.utils import ModelResponse, get_secret, Usage
|
||||
|
||||
from .base import BaseLLM
|
||||
from .prompt_templates import factory as ptf
|
||||
|
||||
|
||||
class WatsonXAIError(Exception):
    def __init__(self, status_code, message, url: Optional[str] = None):
        self.status_code = status_code
        self.message = message
        url = url or "https://us-south.ml.cloud.ibm.com"
        self.request = httpx.Request(method="POST", url=url)
        self.response = httpx.Response(status_code=status_code, request=self.request)
        super().__init__(
            self.message
        )  # Call the base class constructor with the parameters it needs
|
||||
|
||||
|
||||
class IBMWatsonXAIConfig:
    """
    Reference: https://cloud.ibm.com/apidocs/watsonx-ai#text-generation
    (See ibm_watsonx_ai.metanames.GenTextParamsMetaNames for a list of all available params)

    Supported params for all available watsonx.ai foundation models.

    - `decoding_method` (str): One of "greedy" or "sample"

    - `temperature` (float): Sets the model temperature for sampling - not available when decoding_method='greedy'.

    - `max_new_tokens` (integer): Maximum number of tokens to generate.

    - `min_new_tokens` (integer): Minimum number of tokens to generate.

    - `length_penalty` (dict): A dictionary with keys "decay_factor" and "start_index".

    - `stop_sequences` (string[]): list of strings to use as stop sequences.

    - `top_k` (integer): top k for sampling - not available when decoding_method='greedy'.

    - `top_p` (float): top p for sampling - not available when decoding_method='greedy'.

    - `repetition_penalty` (float): token repetition penalty during text generation.

    - `truncate_input_tokens` (integer): Truncate input tokens to this length.

    - `include_stop_sequences` (bool): If True, the stop sequence will be included at the end of the generated text in the case of a match.

    - `return_options` (dict): A dictionary of options to return. Options include "input_text", "generated_tokens", "input_tokens", "token_ranks". Values are boolean.

    - `random_seed` (integer): Random seed for text generation.

    - `moderations` (dict): Dictionary of properties that control the moderations, for use cases such as hate and profanity (HAP) and PII filtering.

    - `stream` (bool): If True, the model will return a stream of responses.
    """
|
||||
|
||||
decoding_method: Optional[str] = "sample"
|
||||
temperature: Optional[float] = None
|
||||
max_new_tokens: Optional[int] = None # litellm.max_tokens
|
||||
min_new_tokens: Optional[int] = None
|
||||
length_penalty: Optional[dict] = None # e.g {"decay_factor": 2.5, "start_index": 5}
|
||||
stop_sequences: Optional[List[str]] = None # e.g ["}", ")", "."]
|
||||
top_k: Optional[int] = None
|
||||
top_p: Optional[float] = None
|
||||
repetition_penalty: Optional[float] = None
|
||||
truncate_input_tokens: Optional[int] = None
|
||||
include_stop_sequences: Optional[bool] = False
|
||||
return_options: Optional[Dict[str, bool]] = None
|
||||
random_seed: Optional[int] = None # e.g 42
|
||||
moderations: Optional[dict] = None
|
||||
stream: Optional[bool] = False
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
decoding_method: Optional[str] = None,
|
||||
temperature: Optional[float] = None,
|
||||
max_new_tokens: Optional[int] = None,
|
||||
min_new_tokens: Optional[int] = None,
|
||||
length_penalty: Optional[dict] = None,
|
||||
stop_sequences: Optional[List[str]] = None,
|
||||
top_k: Optional[int] = None,
|
||||
top_p: Optional[float] = None,
|
||||
repetition_penalty: Optional[float] = None,
|
||||
truncate_input_tokens: Optional[int] = None,
|
||||
include_stop_sequences: Optional[bool] = None,
|
||||
return_options: Optional[dict] = None,
|
||||
random_seed: Optional[int] = None,
|
||||
moderations: Optional[dict] = None,
|
||||
stream: Optional[bool] = None,
|
||||
**kwargs,
|
||||
) -> None:
|
||||
locals_ = locals()
|
||||
for key, value in locals_.items():
|
||||
if key != "self" and value is not None:
|
||||
setattr(self.__class__, key, value)
|
||||
|
||||
@classmethod
|
||||
def get_config(cls):
|
||||
return {
|
||||
k: v
|
||||
for k, v in cls.__dict__.items()
|
||||
if not k.startswith("__")
|
||||
and not isinstance(
|
||||
v,
|
||||
(
|
||||
types.FunctionType,
|
||||
types.BuiltinFunctionType,
|
||||
classmethod,
|
||||
staticmethod,
|
||||
),
|
||||
)
|
||||
and v is not None
|
||||
}
|
||||
|
||||
def get_supported_openai_params(self):
|
||||
return [
|
||||
"temperature", # equivalent to temperature
|
||||
"max_tokens", # equivalent to max_new_tokens
|
||||
"top_p", # equivalent to top_p
|
||||
"frequency_penalty", # equivalent to repetition_penalty
|
||||
"stop", # equivalent to stop_sequences
|
||||
"seed", # equivalent to random_seed
|
||||
"stream", # equivalent to stream
|
||||
]
|
||||
|
||||
def get_mapped_special_auth_params(self) -> dict:
|
||||
"""
|
||||
Common auth params across bedrock/vertex_ai/azure/watsonx
|
||||
"""
|
||||
return {
|
||||
"project": "watsonx_project",
|
||||
"region_name": "watsonx_region_name",
|
||||
"token": "watsonx_token",
|
||||
}
|
||||
|
||||
def map_special_auth_params(self, non_default_params: dict, optional_params: dict):
|
||||
mapped_params = self.get_mapped_special_auth_params()
|
||||
|
||||
for param, value in non_default_params.items():
|
||||
if param in mapped_params:
|
||||
optional_params[mapped_params[param]] = value
|
||||
return optional_params
|
||||
|
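# Illustrative call using the OpenAI-style params listed above (the model id and values are
# placeholders); they are translated to the watsonx.ai names noted in the inline comments:
#   litellm.completion(
#       model="watsonx/ibm/granite-13b-chat-v2",
#       messages=[{"role": "user", "content": "hi"}],
#       max_tokens=200,   # -> max_new_tokens
#       stop=["\n\n"],    # -> stop_sequences
#       seed=42,          # -> random_seed
#   )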
||||
|
||||
def convert_messages_to_prompt(model, messages, provider, custom_prompt_dict):
|
||||
# handle anthropic prompts and amazon titan prompts
|
||||
if model in custom_prompt_dict:
|
||||
# check if the model has a registered custom prompt
|
||||
model_prompt_dict = custom_prompt_dict[model]
|
||||
prompt = ptf.custom_prompt(
|
||||
messages=messages,
|
||||
role_dict=model_prompt_dict.get(
|
||||
"role_dict", model_prompt_dict.get("roles")
|
||||
),
|
||||
initial_prompt_value=model_prompt_dict.get("initial_prompt_value", ""),
|
||||
final_prompt_value=model_prompt_dict.get("final_prompt_value", ""),
|
||||
bos_token=model_prompt_dict.get("bos_token", ""),
|
||||
eos_token=model_prompt_dict.get("eos_token", ""),
|
||||
)
|
||||
return prompt
|
||||
elif provider == "ibm":
|
||||
prompt = ptf.prompt_factory(
|
||||
model=model, messages=messages, custom_llm_provider="watsonx"
|
||||
)
|
||||
elif provider == "ibm-mistralai":
|
||||
prompt = ptf.mistral_instruct_pt(messages=messages)
|
||||
else:
|
||||
prompt = ptf.prompt_factory(
|
||||
model=model, messages=messages, custom_llm_provider="watsonx"
|
||||
)
|
||||
return prompt
|
||||
|
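# Routing sketch for the helper above (model ids are illustrative; provider is the first path
# segment of the model id, as computed in IBMWatsonXAI.completion):
#   "ibm/granite-13b-chat-v2"          -> prompt_factory(..., custom_llm_provider="watsonx")
#   "ibm-mistralai/<model>"            -> mistral_instruct_pt(...)
#   model registered in custom_prompt_dict -> custom_prompt(...) with the registered template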
||||
|
||||
class WatsonXAIEndpoint(str, Enum):
    TEXT_GENERATION = "/ml/v1/text/generation"
    TEXT_GENERATION_STREAM = "/ml/v1/text/generation_stream"
    DEPLOYMENT_TEXT_GENERATION = "/ml/v1/deployments/{deployment_id}/text/generation"
    DEPLOYMENT_TEXT_GENERATION_STREAM = (
        "/ml/v1/deployments/{deployment_id}/text/generation_stream"
    )
    EMBEDDINGS = "/ml/v1/text/embeddings"
    PROMPTS = "/ml/v1/prompts"
|
||||
|
||||
|
||||
class IBMWatsonXAI(BaseLLM):
|
||||
"""
|
||||
Class to interface with IBM Watsonx.ai API for text generation and embeddings.
|
||||
|
||||
Reference: https://cloud.ibm.com/apidocs/watsonx-ai
|
||||
"""
|
||||
|
||||
api_version = "2024-03-13"
|
||||
|
||||
def __init__(self) -> None:
|
||||
super().__init__()
|
||||
|
||||
def _prepare_text_generation_req(
|
||||
self,
|
||||
model_id: str,
|
||||
prompt: str,
|
||||
stream: bool,
|
||||
optional_params: dict,
|
||||
print_verbose: Optional[Callable] = None,
|
||||
) -> dict:
|
||||
"""
|
||||
Get the request parameters for text generation.
|
||||
"""
|
||||
api_params = self._get_api_params(optional_params, print_verbose=print_verbose)
|
||||
# build auth headers
|
||||
api_token = api_params.get("token")
|
||||
|
||||
headers = {
|
||||
"Authorization": f"Bearer {api_token}",
|
||||
"Content-Type": "application/json",
|
||||
"Accept": "application/json",
|
||||
}
|
||||
extra_body_params = optional_params.pop("extra_body", {})
|
||||
optional_params.update(extra_body_params)
|
||||
# init the payload to the text generation call
|
||||
payload = {
|
||||
"input": prompt,
|
||||
"moderations": optional_params.pop("moderations", {}),
|
||||
"parameters": optional_params,
|
||||
}
|
||||
request_params = dict(version=api_params["api_version"])
|
||||
# text generation endpoint deployment or model / stream or not
|
||||
if model_id.startswith("deployment/"):
|
||||
# deployment models are passed in as 'deployment/<deployment_id>'
|
||||
if api_params.get("space_id") is None:
|
||||
raise WatsonXAIError(
|
||||
status_code=401,
|
||||
url=api_params["url"],
|
||||
message="Error: space_id is required for models called using the 'deployment/' endpoint. Pass in the space_id as a parameter or set it in the WX_SPACE_ID environment variable.",
|
||||
)
|
||||
deployment_id = "/".join(model_id.split("/")[1:])
|
||||
endpoint = (
|
||||
WatsonXAIEndpoint.DEPLOYMENT_TEXT_GENERATION_STREAM.value
|
||||
if stream
|
||||
else WatsonXAIEndpoint.DEPLOYMENT_TEXT_GENERATION.value
|
||||
)
|
||||
endpoint = endpoint.format(deployment_id=deployment_id)
|
||||
else:
|
||||
payload["model_id"] = model_id
|
||||
payload["project_id"] = api_params["project_id"]
|
||||
endpoint = (
|
||||
WatsonXAIEndpoint.TEXT_GENERATION_STREAM
|
||||
if stream
|
||||
else WatsonXAIEndpoint.TEXT_GENERATION
|
||||
)
|
||||
url = api_params["url"].rstrip("/") + endpoint
|
||||
return dict(
|
||||
method="POST", url=url, headers=headers, json=payload, params=request_params
|
||||
)
|
||||
|
||||
def _get_api_params(
|
||||
self, params: dict, print_verbose: Optional[Callable] = None
|
||||
) -> dict:
|
||||
"""
|
||||
Find watsonx.ai credentials in the params or environment variables and return them (url, api key/token, project id, etc.) for use in API requests.
|
||||
"""
|
||||
# Load auth variables from params
|
||||
url = params.pop("url", params.pop("api_base", params.pop("base_url", None)))
|
||||
api_key = params.pop("apikey", None)
|
||||
token = params.pop("token", None)
|
||||
project_id = params.pop(
|
||||
"project_id", params.pop("watsonx_project", None)
|
||||
) # watsonx.ai project_id - allow 'watsonx_project' to be consistent with how vertex project implementation works -> reduce provider-specific params
|
||||
space_id = params.pop("space_id", None) # watsonx.ai deployment space_id
|
||||
region_name = params.pop("region_name", params.pop("region", None))
|
||||
if region_name is None:
|
||||
region_name = params.pop(
|
||||
"watsonx_region_name", params.pop("watsonx_region", None)
|
||||
) # consistent with how vertex ai + aws regions are accepted
|
||||
wx_credentials = params.pop(
|
||||
"wx_credentials",
|
||||
params.pop(
|
||||
"watsonx_credentials", None
|
||||
), # follow {provider}_credentials, same as vertex ai
|
||||
)
|
||||
api_version = params.pop("api_version", IBMWatsonXAI.api_version)
|
||||
# Load auth variables from environment variables
|
||||
if url is None:
|
||||
url = (
|
||||
get_secret("WATSONX_API_BASE") # consistent with 'AZURE_API_BASE'
|
||||
or get_secret("WATSONX_URL")
|
||||
or get_secret("WX_URL")
|
||||
or get_secret("WML_URL")
|
||||
)
|
||||
if api_key is None:
|
||||
api_key = (
|
||||
get_secret("WATSONX_APIKEY")
|
||||
or get_secret("WATSONX_API_KEY")
|
||||
or get_secret("WX_API_KEY")
|
||||
)
|
||||
if token is None:
|
||||
token = get_secret("WATSONX_TOKEN") or get_secret("WX_TOKEN")
|
||||
if project_id is None:
|
||||
project_id = (
|
||||
get_secret("WATSONX_PROJECT_ID")
|
||||
or get_secret("WX_PROJECT_ID")
|
||||
or get_secret("PROJECT_ID")
|
||||
)
|
||||
if region_name is None:
|
||||
region_name = (
|
||||
get_secret("WATSONX_REGION")
|
||||
or get_secret("WX_REGION")
|
||||
or get_secret("REGION")
|
||||
)
|
||||
if space_id is None:
|
||||
space_id = (
|
||||
get_secret("WATSONX_DEPLOYMENT_SPACE_ID")
|
||||
or get_secret("WATSONX_SPACE_ID")
|
||||
or get_secret("WX_SPACE_ID")
|
||||
or get_secret("SPACE_ID")
|
||||
)
|
||||
|
||||
# credentials parsing
|
||||
if wx_credentials is not None:
|
||||
url = wx_credentials.get("url", url)
|
||||
api_key = wx_credentials.get(
|
||||
"apikey", wx_credentials.get("api_key", api_key)
|
||||
)
|
||||
token = wx_credentials.get(
|
||||
"token",
|
||||
wx_credentials.get(
|
||||
"watsonx_token", token
|
||||
), # follow format of {provider}_token, same as azure - e.g. 'azure_ad_token=..'
|
||||
)
|
||||
|
||||
# verify that all required credentials are present
|
||||
if url is None:
|
||||
raise WatsonXAIError(
|
||||
status_code=401,
|
||||
message="Error: Watsonx URL not set. Set WX_URL in environment variables or pass in as a parameter.",
|
||||
)
|
||||
if token is None and api_key is not None:
|
||||
# generate the auth token
|
||||
if print_verbose:
|
||||
print_verbose("Generating IAM token for Watsonx.ai")
|
||||
token = self.generate_iam_token(api_key)
|
||||
elif token is None and api_key is None:
|
||||
raise WatsonXAIError(
|
||||
status_code=401,
|
||||
url=url,
|
||||
message="Error: API key or token not found. Set WX_API_KEY or WX_TOKEN in environment variables or pass in as a parameter.",
|
||||
)
|
||||
if project_id is None:
|
||||
raise WatsonXAIError(
|
||||
status_code=401,
|
||||
url=url,
|
||||
message="Error: Watsonx project_id not set. Set WX_PROJECT_ID in environment variables or pass in as a parameter.",
|
||||
)
|
||||
|
||||
return {
|
||||
"url": url,
|
||||
"api_key": api_key,
|
||||
"token": token,
|
||||
"project_id": project_id,
|
||||
"space_id": space_id,
|
||||
"region_name": region_name,
|
||||
"api_version": api_version,
|
||||
}
|
||||
|
||||
def completion(
|
||||
self,
|
||||
model: str,
|
||||
messages: list,
|
||||
custom_prompt_dict: dict,
|
||||
model_response: ModelResponse,
|
||||
print_verbose: Callable,
|
||||
encoding,
|
||||
logging_obj,
|
||||
optional_params: dict,
|
||||
litellm_params: Optional[dict] = None,
|
||||
logger_fn=None,
|
||||
timeout: Optional[float] = None,
|
||||
):
|
||||
"""
|
||||
Send a text generation request to the IBM Watsonx.ai API.
|
||||
Reference: https://cloud.ibm.com/apidocs/watsonx-ai#text-generation
|
||||
"""
|
||||
stream = optional_params.pop("stream", False)
|
||||
|
||||
# Load default configs
|
||||
config = IBMWatsonXAIConfig.get_config()
|
||||
for k, v in config.items():
|
||||
if k not in optional_params:
|
||||
optional_params[k] = v
|
||||
|
||||
# Make prompt to send to model
|
||||
provider = model.split("/")[0]
|
||||
# model_name = "/".join(model.split("/")[1:])
|
||||
prompt = convert_messages_to_prompt(
|
||||
model, messages, provider, custom_prompt_dict
|
||||
)
|
||||
|
||||
def process_text_request(request_params: dict) -> ModelResponse:
|
||||
with self._manage_response(
|
||||
request_params, logging_obj=logging_obj, input=prompt, timeout=timeout
|
||||
) as resp:
|
||||
json_resp = resp.json()
|
||||
|
||||
generated_text = json_resp["results"][0]["generated_text"]
|
||||
prompt_tokens = json_resp["results"][0]["input_token_count"]
|
||||
completion_tokens = json_resp["results"][0]["generated_token_count"]
|
||||
model_response["choices"][0]["message"]["content"] = generated_text
|
||||
model_response["finish_reason"] = json_resp["results"][0]["stop_reason"]
|
||||
model_response["created"] = int(time.time())
|
||||
model_response["model"] = model
|
||||
setattr(
|
||||
model_response,
|
||||
"usage",
|
||||
Usage(
|
||||
prompt_tokens=prompt_tokens,
|
||||
completion_tokens=completion_tokens,
|
||||
total_tokens=prompt_tokens + completion_tokens,
|
||||
),
|
||||
)
|
||||
return model_response
|
||||
|
||||
def process_stream_request(
|
||||
request_params: dict,
|
||||
) -> litellm.CustomStreamWrapper:
|
||||
# stream the response - generated chunks will be handled
|
||||
# by litellm.utils.CustomStreamWrapper.handle_watsonx_stream
|
||||
with self._manage_response(
|
||||
request_params,
|
||||
logging_obj=logging_obj,
|
||||
stream=True,
|
||||
input=prompt,
|
||||
timeout=timeout,
|
||||
) as resp:
|
||||
response = litellm.CustomStreamWrapper(
|
||||
resp.iter_lines(),
|
||||
model=model,
|
||||
custom_llm_provider="watsonx",
|
||||
logging_obj=logging_obj,
|
||||
)
|
||||
return response
|
||||
|
||||
try:
|
||||
## Get the response from the model
|
||||
req_params = self._prepare_text_generation_req(
|
||||
model_id=model,
|
||||
prompt=prompt,
|
||||
stream=stream,
|
||||
optional_params=optional_params,
|
||||
print_verbose=print_verbose,
|
||||
)
|
||||
if stream:
|
||||
return process_stream_request(req_params)
|
||||
else:
|
||||
return process_text_request(req_params)
|
||||
except WatsonXAIError as e:
|
||||
raise e
|
||||
except Exception as e:
|
||||
raise WatsonXAIError(status_code=500, message=str(e))
|
||||
|
||||
def embedding(
|
||||
self,
|
||||
model: str,
|
||||
input: Union[list, str],
|
||||
api_key: Optional[str] = None,
|
||||
logging_obj=None,
|
||||
model_response=None,
|
||||
optional_params=None,
|
||||
encoding=None,
|
||||
):
|
||||
"""
|
||||
Send a text embedding request to the IBM Watsonx.ai API.
|
||||
"""
|
||||
if optional_params is None:
|
||||
optional_params = {}
|
||||
# Load default configs
|
||||
config = IBMWatsonXAIConfig.get_config()
|
||||
for k, v in config.items():
|
||||
if k not in optional_params:
|
||||
optional_params[k] = v
|
||||
|
||||
# Load auth variables from environment variables
|
||||
if isinstance(input, str):
|
||||
input = [input]
|
||||
if api_key is not None:
|
||||
optional_params["api_key"] = api_key
|
||||
api_params = self._get_api_params(optional_params)
|
||||
# build auth headers
|
||||
api_token = api_params.get("token")
|
||||
headers = {
|
||||
"Authorization": f"Bearer {api_token}",
|
||||
"Content-Type": "application/json",
|
||||
"Accept": "application/json",
|
||||
}
|
||||
# init the payload to the text generation call
|
||||
payload = {
|
||||
"inputs": input,
|
||||
"model_id": model,
|
||||
"project_id": api_params["project_id"],
|
||||
"parameters": optional_params,
|
||||
}
|
||||
request_params = dict(version=api_params["api_version"])
|
||||
url = api_params["url"].rstrip("/") + WatsonXAIEndpoint.EMBEDDINGS
|
||||
# request = httpx.Request(
|
||||
# "POST", url, headers=headers, json=payload, params=request_params
|
||||
# )
|
||||
req_params = {
|
||||
"method": "POST",
|
||||
"url": url,
|
||||
"headers": headers,
|
||||
"json": payload,
|
||||
"params": request_params,
|
||||
}
|
||||
with self._manage_response(
|
||||
req_params, logging_obj=logging_obj, input=input
|
||||
) as resp:
|
||||
json_resp = resp.json()
|
||||
|
||||
results = json_resp.get("results", [])
|
||||
embedding_response = []
|
||||
for idx, result in enumerate(results):
|
||||
embedding_response.append(
|
||||
{"object": "embedding", "index": idx, "embedding": result["embedding"]}
|
||||
)
|
||||
model_response["object"] = "list"
|
||||
model_response["data"] = embedding_response
|
||||
model_response["model"] = model
|
||||
input_tokens = json_resp.get("input_token_count", 0)
|
||||
model_response.usage = Usage(
|
||||
prompt_tokens=input_tokens, completion_tokens=0, total_tokens=input_tokens
|
||||
)
|
||||
return model_response
|
||||
|
||||
    def generate_iam_token(self, api_key=None, **params):
        headers = {}
        headers["Content-Type"] = "application/x-www-form-urlencoded"
        if api_key is None:
            api_key = get_secret("WX_API_KEY") or get_secret("WATSONX_API_KEY")
        if api_key is None:
            raise ValueError("API key is required")
        headers["Accept"] = "application/json"
        data = {
            "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
            "apikey": api_key,
        }
        response = httpx.post(
            "https://iam.cloud.ibm.com/identity/token", data=data, headers=headers
        )
        response.raise_for_status()
        json_data = response.json()
        iam_access_token = json_data["access_token"]
        self.token = iam_access_token
        return iam_access_token
|
||||
|
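    # Equivalent token exchange with curl (illustrative; $IBM_CLOUD_API_KEY is a placeholder):
    #   curl -X POST "https://iam.cloud.ibm.com/identity/token" \
    #     -H "Content-Type: application/x-www-form-urlencoded" \
    #     -d "grant_type=urn:ibm:params:oauth:grant-type:apikey&apikey=$IBM_CLOUD_API_KEY"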
||||
@contextmanager
|
||||
def _manage_response(
|
||||
self,
|
||||
request_params: dict,
|
||||
logging_obj: Any,
|
||||
stream: bool = False,
|
||||
input: Optional[Any] = None,
|
||||
timeout: Optional[float] = None,
|
||||
):
|
||||
request_str = (
|
||||
f"response = {request_params['method']}(\n"
|
||||
f"\turl={request_params['url']},\n"
|
||||
f"\tjson={request_params['json']},\n"
|
||||
f")"
|
||||
)
|
||||
logging_obj.pre_call(
|
||||
input=input,
|
||||
api_key=request_params["headers"].get("Authorization"),
|
||||
additional_args={
|
||||
"complete_input_dict": request_params["json"],
|
||||
"request_str": request_str,
|
||||
},
|
||||
)
|
||||
if timeout:
|
||||
request_params["timeout"] = timeout
|
||||
try:
|
||||
if stream:
|
||||
resp = requests.request(
|
||||
**request_params,
|
||||
stream=True,
|
||||
)
|
||||
resp.raise_for_status()
|
||||
yield resp
|
||||
else:
|
||||
resp = requests.request(**request_params)
|
||||
resp.raise_for_status()
|
||||
yield resp
|
||||
except Exception as e:
|
||||
raise WatsonXAIError(status_code=500, message=str(e))
|
||||
if not stream:
|
||||
logging_obj.post_call(
|
||||
input=input,
|
||||
api_key=request_params["headers"].get("Authorization"),
|
||||
original_response=json.dumps(resp.json()),
|
||||
additional_args={
|
||||
"status_code": resp.status_code,
|
||||
"complete_input_dict": request_params["json"],
|
||||
},
|
||||
)
|
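To tie the new provider together end to end, a minimal usage sketch (the model id, key, and project id values are placeholders; the environment variable names are the ones read by _get_api_params above):

    import os
    import litellm

    os.environ["WATSONX_API_BASE"] = "https://us-south.ml.cloud.ibm.com"
    os.environ["WATSONX_APIKEY"] = "my-ibm-cloud-api-key"
    os.environ["WATSONX_PROJECT_ID"] = "my-project-id"

    response = litellm.completion(
        model="watsonx/ibm/granite-13b-chat-v2",
        messages=[{"role": "user", "content": "Hello from watsonx"}],
    )

    # Models deployed in a deployment space are addressed as 'deployment/<deployment_id>' and
    # additionally require WATSONX_DEPLOYMENT_SPACE_ID (see _prepare_text_generation_req above).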
|
@ -12,7 +12,6 @@ from typing import Any, Literal, Union, BinaryIO
|
|||
from functools import partial
|
||||
import dotenv, traceback, random, asyncio, time, contextvars
|
||||
from copy import deepcopy
|
||||
|
||||
import httpx
|
||||
import litellm
|
||||
from ._logging import verbose_logger
|
||||
|
@ -64,6 +63,7 @@ from .llms import (
|
|||
vertex_ai,
|
||||
vertex_ai_anthropic,
|
||||
maritalk,
|
||||
watsonx,
|
||||
)
|
||||
from .llms.openai import OpenAIChatCompletion, OpenAITextCompletion
|
||||
from .llms.azure import AzureChatCompletion
|
||||
|
@ -343,6 +343,7 @@ async def acompletion(
|
|||
custom_llm_provider=custom_llm_provider,
|
||||
original_exception=e,
|
||||
completion_kwargs=completion_kwargs,
|
||||
extra_kwargs=kwargs,
|
||||
)
|
||||
|
||||
|
||||
|
@ -408,8 +409,10 @@ def mock_completion(
|
|||
model_response["created"] = int(time.time())
|
||||
model_response["model"] = model
|
||||
|
||||
model_response.usage = Usage(
|
||||
prompt_tokens=10, completion_tokens=20, total_tokens=30
|
||||
setattr(
|
||||
model_response,
|
||||
"usage",
|
||||
Usage(prompt_tokens=10, completion_tokens=20, total_tokens=30),
|
||||
)
|
||||
|
||||
try:
|
||||
|
@ -609,6 +612,7 @@ def completion(
|
|||
"client",
|
||||
"rpm",
|
||||
"tpm",
|
||||
"max_parallel_requests",
|
||||
"input_cost_per_token",
|
||||
"output_cost_per_token",
|
||||
"input_cost_per_second",
|
||||
|
@ -652,6 +656,7 @@ def completion(
|
|||
model
|
||||
] # update the model to the actual value if an alias has been passed in
|
||||
model_response = ModelResponse()
|
||||
setattr(model_response, "usage", litellm.Usage())
|
||||
if (
|
||||
kwargs.get("azure", False) == True
|
||||
): # don't remove flag check, to remain backwards compatible for repos like Codium
|
||||
|
@ -1732,13 +1737,14 @@ def completion(
|
|||
or optional_params.pop("vertex_ai_credentials", None)
|
||||
or get_secret("VERTEXAI_CREDENTIALS")
|
||||
)
|
||||
new_params = deepcopy(optional_params)
|
||||
if "claude-3" in model:
|
||||
model_response = vertex_ai_anthropic.completion(
|
||||
model=model,
|
||||
messages=messages,
|
||||
model_response=model_response,
|
||||
print_verbose=print_verbose,
|
||||
optional_params=optional_params,
|
||||
optional_params=new_params,
|
||||
litellm_params=litellm_params,
|
||||
logger_fn=logger_fn,
|
||||
encoding=encoding,
|
||||
|
@ -1754,7 +1760,7 @@ def completion(
|
|||
messages=messages,
|
||||
model_response=model_response,
|
||||
print_verbose=print_verbose,
|
||||
optional_params=optional_params,
|
||||
optional_params=new_params,
|
||||
litellm_params=litellm_params,
|
||||
logger_fn=logger_fn,
|
||||
encoding=encoding,
|
||||
|
@ -1907,6 +1913,43 @@ def completion(
|
|||
|
||||
## RESPONSE OBJECT
|
||||
response = response
|
||||
elif custom_llm_provider == "watsonx":
|
||||
custom_prompt_dict = custom_prompt_dict or litellm.custom_prompt_dict
|
||||
response = watsonx.IBMWatsonXAI().completion(
|
||||
model=model,
|
||||
messages=messages,
|
||||
custom_prompt_dict=custom_prompt_dict,
|
||||
model_response=model_response,
|
||||
print_verbose=print_verbose,
|
||||
optional_params=optional_params,
|
||||
litellm_params=litellm_params, # type: ignore
|
||||
logger_fn=logger_fn,
|
||||
encoding=encoding,
|
||||
logging_obj=logging,
|
||||
timeout=timeout,
|
||||
)
|
||||
if (
|
||||
"stream" in optional_params
|
||||
and optional_params["stream"] == True
|
||||
and not isinstance(response, CustomStreamWrapper)
|
||||
):
|
||||
# don't try to access stream object,
|
||||
response = CustomStreamWrapper(
|
||||
iter(response),
|
||||
model,
|
||||
custom_llm_provider="watsonx",
|
||||
logging_obj=logging,
|
||||
)
|
||||
|
||||
if optional_params.get("stream", False):
|
||||
## LOGGING
|
||||
logging.post_call(
|
||||
input=messages,
|
||||
api_key=None,
|
||||
original_response=response,
|
||||
)
|
||||
## RESPONSE OBJECT
|
||||
response = response
|
||||
elif custom_llm_provider == "vllm":
|
||||
custom_prompt_dict = custom_prompt_dict or litellm.custom_prompt_dict
|
||||
model_response = vllm.completion(
|
||||
|
@ -1990,9 +2033,16 @@ def completion(
|
|||
or "http://localhost:11434"
|
||||
)
|
||||
|
||||
api_key = (
|
||||
api_key
|
||||
or litellm.ollama_key
|
||||
or os.environ.get("OLLAMA_API_KEY")
|
||||
or litellm.api_key
|
||||
)
|
||||
## LOGGING
|
||||
generator = ollama_chat.get_ollama_response(
|
||||
api_base,
|
||||
api_key,
|
||||
model,
|
||||
messages,
|
||||
optional_params,
|
||||
|
@ -2188,6 +2238,7 @@ def completion(
|
|||
custom_llm_provider=custom_llm_provider,
|
||||
original_exception=e,
|
||||
completion_kwargs=args,
|
||||
extra_kwargs=kwargs,
|
||||
)
|
||||
|
||||
|
||||
|
@ -2549,6 +2600,7 @@ async def aembedding(*args, **kwargs):
|
|||
custom_llm_provider=custom_llm_provider,
|
||||
original_exception=e,
|
||||
completion_kwargs=args,
|
||||
extra_kwargs=kwargs,
|
||||
)
|
||||
|
||||
|
||||
|
@ -2600,6 +2652,7 @@ def embedding(
|
|||
client = kwargs.pop("client", None)
|
||||
rpm = kwargs.pop("rpm", None)
|
||||
tpm = kwargs.pop("tpm", None)
|
||||
max_parallel_requests = kwargs.pop("max_parallel_requests", None)
|
||||
model_info = kwargs.get("model_info", None)
|
||||
metadata = kwargs.get("metadata", None)
|
||||
encoding_format = kwargs.get("encoding_format", None)
|
||||
|
@ -2657,6 +2710,7 @@ def embedding(
|
|||
"client",
|
||||
"rpm",
|
||||
"tpm",
|
||||
"max_parallel_requests",
|
||||
"input_cost_per_token",
|
||||
"output_cost_per_token",
|
||||
"input_cost_per_second",
|
||||
|
@ -2975,6 +3029,15 @@ def embedding(
|
|||
client=client,
|
||||
aembedding=aembedding,
|
||||
)
|
||||
elif custom_llm_provider == "watsonx":
|
||||
response = watsonx.IBMWatsonXAI().embedding(
|
||||
model=model,
|
||||
input=input,
|
||||
encoding=encoding,
|
||||
logging_obj=logging,
|
||||
optional_params=optional_params,
|
||||
model_response=EmbeddingResponse(),
|
||||
)
|
||||
else:
|
||||
args = locals()
|
||||
raise ValueError(f"No valid embedding model args passed in - {args}")
|
||||
|
@ -2990,7 +3053,10 @@ def embedding(
|
|||
)
|
||||
## Map to OpenAI Exception
|
||||
raise exception_type(
|
||||
model=model, original_exception=e, custom_llm_provider=custom_llm_provider
|
||||
model=model,
|
||||
original_exception=e,
|
||||
custom_llm_provider=custom_llm_provider,
|
||||
extra_kwargs=kwargs,
|
||||
)
|
||||
|
||||
|
||||
|
@ -3084,6 +3150,7 @@ async def atext_completion(*args, **kwargs):
|
|||
custom_llm_provider=custom_llm_provider,
|
||||
original_exception=e,
|
||||
completion_kwargs=args,
|
||||
extra_kwargs=kwargs,
|
||||
)
|
||||
|
||||
|
||||
|
@ -3421,6 +3488,7 @@ async def aimage_generation(*args, **kwargs):
|
|||
custom_llm_provider=custom_llm_provider,
|
||||
original_exception=e,
|
||||
completion_kwargs=args,
|
||||
extra_kwargs=kwargs,
|
||||
)
|
||||
|
||||
|
||||
|
@ -3511,6 +3579,7 @@ def image_generation(
|
|||
"client",
|
||||
"rpm",
|
||||
"tpm",
|
||||
"max_parallel_requests",
|
||||
"input_cost_per_token",
|
||||
"output_cost_per_token",
|
||||
"hf_model_name",
|
||||
|
@ -3620,6 +3689,7 @@ def image_generation(
|
|||
custom_llm_provider=custom_llm_provider,
|
||||
original_exception=e,
|
||||
completion_kwargs=locals(),
|
||||
extra_kwargs=kwargs,
|
||||
)
|
||||
|
||||
|
||||
|
@ -3669,6 +3739,7 @@ async def atranscription(*args, **kwargs):
|
|||
custom_llm_provider=custom_llm_provider,
|
||||
original_exception=e,
|
||||
completion_kwargs=args,
|
||||
extra_kwargs=kwargs,
|
||||
)
|
||||
|
||||
|
||||
|
|
|
@ -650,6 +650,7 @@
|
|||
"input_cost_per_token": 0.000002,
|
||||
"output_cost_per_token": 0.000006,
|
||||
"litellm_provider": "mistral",
|
||||
"supports_function_calling": true,
|
||||
"mode": "chat"
|
||||
},
|
||||
"mistral/mistral-small-latest": {
|
||||
|
@ -659,6 +660,7 @@
|
|||
"input_cost_per_token": 0.000002,
|
||||
"output_cost_per_token": 0.000006,
|
||||
"litellm_provider": "mistral",
|
||||
"supports_function_calling": true,
|
||||
"mode": "chat"
|
||||
},
|
||||
"mistral/mistral-medium": {
|
||||
|
@ -735,6 +737,26 @@
|
|||
"mode": "chat",
|
||||
"supports_function_calling": true
|
||||
},
|
||||
"groq/llama3-8b-8192": {
|
||||
"max_tokens": 8192,
|
||||
"max_input_tokens": 8192,
|
||||
"max_output_tokens": 8192,
|
||||
"input_cost_per_token": 0.00000010,
|
||||
"output_cost_per_token": 0.00000010,
|
||||
"litellm_provider": "groq",
|
||||
"mode": "chat",
|
||||
"supports_function_calling": true
|
||||
},
|
||||
"groq/llama3-70b-8192": {
|
||||
"max_tokens": 8192,
|
||||
"max_input_tokens": 8192,
|
||||
"max_output_tokens": 8192,
|
||||
"input_cost_per_token": 0.00000064,
|
||||
"output_cost_per_token": 0.00000080,
|
||||
"litellm_provider": "groq",
|
||||
"mode": "chat",
|
||||
"supports_function_calling": true
|
||||
},
|
||||
"groq/mixtral-8x7b-32768": {
|
||||
"max_tokens": 32768,
|
||||
"max_input_tokens": 32768,
|
||||
|
@ -789,7 +811,9 @@
|
|||
"input_cost_per_token": 0.00000025,
|
||||
"output_cost_per_token": 0.00000125,
|
||||
"litellm_provider": "anthropic",
|
||||
"mode": "chat"
|
||||
"mode": "chat",
|
||||
"supports_function_calling": true,
|
||||
"tool_use_system_prompt_tokens": 264
|
||||
},
|
||||
"claude-3-opus-20240229": {
|
||||
"max_tokens": 4096,
|
||||
|
@ -798,7 +822,9 @@
|
|||
"input_cost_per_token": 0.000015,
|
||||
"output_cost_per_token": 0.000075,
|
||||
"litellm_provider": "anthropic",
|
||||
"mode": "chat"
|
||||
"mode": "chat",
|
||||
"supports_function_calling": true,
|
||||
"tool_use_system_prompt_tokens": 395
|
||||
},
|
||||
"claude-3-sonnet-20240229": {
|
||||
"max_tokens": 4096,
|
||||
|
@ -807,7 +833,9 @@
|
|||
"input_cost_per_token": 0.000003,
|
||||
"output_cost_per_token": 0.000015,
|
||||
"litellm_provider": "anthropic",
|
||||
"mode": "chat"
|
||||
"mode": "chat",
|
||||
"supports_function_calling": true,
|
||||
"tool_use_system_prompt_tokens": 159
|
||||
},
|
||||
"text-bison": {
|
||||
"max_tokens": 1024,
|
||||
|
@ -1113,7 +1141,8 @@
|
|||
"input_cost_per_token": 0.000003,
|
||||
"output_cost_per_token": 0.000015,
|
||||
"litellm_provider": "vertex_ai-anthropic_models",
|
||||
"mode": "chat"
|
||||
"mode": "chat",
|
||||
"supports_function_calling": true
|
||||
},
|
||||
"vertex_ai/claude-3-haiku@20240307": {
|
||||
"max_tokens": 4096,
|
||||
|
@ -1122,7 +1151,8 @@
|
|||
"input_cost_per_token": 0.00000025,
|
||||
"output_cost_per_token": 0.00000125,
|
||||
"litellm_provider": "vertex_ai-anthropic_models",
|
||||
"mode": "chat"
|
||||
"mode": "chat",
|
||||
"supports_function_calling": true
|
||||
},
|
||||
"vertex_ai/claude-3-opus@20240229": {
|
||||
"max_tokens": 4096,
|
||||
|
@ -1131,7 +1161,8 @@
|
|||
"input_cost_per_token": 0.0000015,
|
||||
"output_cost_per_token": 0.0000075,
|
||||
"litellm_provider": "vertex_ai-anthropic_models",
|
||||
"mode": "chat"
|
||||
"mode": "chat",
|
||||
"supports_function_calling": true
|
||||
},
|
||||
"textembedding-gecko": {
|
||||
"max_tokens": 3072,
|
||||
|
@ -1387,6 +1418,123 @@
|
|||
"litellm_provider": "replicate",
|
||||
"mode": "chat"
|
||||
},
|
||||
"replicate/meta/llama-2-13b": {
|
||||
"max_tokens": 4096,
|
||||
"max_input_tokens": 4096,
|
||||
"max_output_tokens": 4096,
|
||||
"input_cost_per_token": 0.0000001,
|
||||
"output_cost_per_token": 0.0000005,
|
||||
"litellm_provider": "replicate",
|
||||
"mode": "chat"
|
||||
},
|
||||
"replicate/meta/llama-2-13b-chat": {
|
||||
"max_tokens": 4096,
|
||||
"max_input_tokens": 4096,
|
||||
"max_output_tokens": 4096,
|
||||
"input_cost_per_token": 0.0000001,
|
||||
"output_cost_per_token": 0.0000005,
|
||||
"litellm_provider": "replicate",
|
||||
"mode": "chat"
|
||||
},
|
||||
"replicate/meta/llama-2-70b": {
|
||||
"max_tokens": 4096,
|
||||
"max_input_tokens": 4096,
|
||||
"max_output_tokens": 4096,
|
||||
"input_cost_per_token": 0.00000065,
|
||||
"output_cost_per_token": 0.00000275,
|
||||
"litellm_provider": "replicate",
|
||||
"mode": "chat"
|
||||
},
|
||||
"replicate/meta/llama-2-70b-chat": {
|
||||
"max_tokens": 4096,
|
||||
"max_input_tokens": 4096,
|
||||
"max_output_tokens": 4096,
|
||||
"input_cost_per_token": 0.00000065,
|
||||
"output_cost_per_token": 0.00000275,
|
||||
"litellm_provider": "replicate",
|
||||
"mode": "chat"
|
||||
},
|
||||
"replicate/meta/llama-2-7b": {
|
||||
"max_tokens": 4096,
|
||||
"max_input_tokens": 4096,
|
||||
"max_output_tokens": 4096,
|
||||
"input_cost_per_token": 0.00000005,
|
||||
"output_cost_per_token": 0.00000025,
|
||||
"litellm_provider": "replicate",
|
||||
"mode": "chat"
|
||||
},
|
||||
"replicate/meta/llama-2-7b-chat": {
|
||||
"max_tokens": 4096,
|
||||
"max_input_tokens": 4096,
|
||||
"max_output_tokens": 4096,
|
||||
"input_cost_per_token": 0.00000005,
|
||||
"output_cost_per_token": 0.00000025,
|
||||
"litellm_provider": "replicate",
|
||||
"mode": "chat"
|
||||
},
|
||||
"replicate/meta/llama-3-70b": {
|
||||
"max_tokens": 4096,
|
||||
"max_input_tokens": 4096,
|
||||
"max_output_tokens": 4096,
|
||||
"input_cost_per_token": 0.00000065,
|
||||
"output_cost_per_token": 0.00000275,
|
||||
"litellm_provider": "replicate",
|
||||
"mode": "chat"
|
||||
},
|
||||
"replicate/meta/llama-3-70b-instruct": {
|
||||
"max_tokens": 4096,
|
||||
"max_input_tokens": 4096,
|
||||
"max_output_tokens": 4096,
|
||||
"input_cost_per_token": 0.00000065,
|
||||
"output_cost_per_token": 0.00000275,
|
||||
"litellm_provider": "replicate",
|
||||
"mode": "chat"
|
||||
},
|
||||
"replicate/meta/llama-3-8b": {
|
||||
"max_tokens": 4096,
|
||||
"max_input_tokens": 4096,
|
||||
"max_output_tokens": 4096,
|
||||
"input_cost_per_token": 0.00000005,
|
||||
"output_cost_per_token": 0.00000025,
|
||||
"litellm_provider": "replicate",
|
||||
"mode": "chat"
|
||||
},
|
||||
"replicate/meta/llama-3-8b-instruct": {
|
||||
"max_tokens": 4096,
|
||||
"max_input_tokens": 4096,
|
||||
"max_output_tokens": 4096,
|
||||
"input_cost_per_token": 0.00000005,
|
||||
"output_cost_per_token": 0.00000025,
|
||||
"litellm_provider": "replicate",
|
||||
"mode": "chat"
|
||||
},
|
||||
"replicate/mistralai/mistral-7b-v0.1": {
|
||||
"max_tokens": 4096,
|
||||
"max_input_tokens": 4096,
|
||||
"max_output_tokens": 4096,
|
||||
"input_cost_per_token": 0.00000005,
|
||||
"output_cost_per_token": 0.00000025,
|
||||
"litellm_provider": "replicate",
|
||||
"mode": "chat"
|
||||
},
|
||||
"replicate/mistralai/mistral-7b-instruct-v0.2": {
|
||||
"max_tokens": 4096,
|
||||
"max_input_tokens": 4096,
|
||||
"max_output_tokens": 4096,
|
||||
"input_cost_per_token": 0.00000005,
|
||||
"output_cost_per_token": 0.00000025,
|
||||
"litellm_provider": "replicate",
|
||||
"mode": "chat"
|
||||
},
|
||||
"replicate/mistralai/mixtral-8x7b-instruct-v0.1": {
|
||||
"max_tokens": 4096,
|
||||
"max_input_tokens": 4096,
|
||||
"max_output_tokens": 4096,
|
||||
"input_cost_per_token": 0.0000003,
|
||||
"output_cost_per_token": 0.000001,
|
||||
"litellm_provider": "replicate",
|
||||
"mode": "chat"
|
||||
},
|
||||
"openrouter/openai/gpt-3.5-turbo": {
|
||||
"max_tokens": 4095,
|
||||
"input_cost_per_token": 0.0000015,
|
||||
|
@ -1515,6 +1663,13 @@
|
|||
"litellm_provider": "openrouter",
|
||||
"mode": "chat"
|
||||
},
|
||||
"openrouter/meta-llama/llama-3-70b-instruct": {
|
||||
"max_tokens": 8192,
|
||||
"input_cost_per_token": 0.0000008,
|
||||
"output_cost_per_token": 0.0000008,
|
||||
"litellm_provider": "openrouter",
|
||||
"mode": "chat"
|
||||
},
|
||||
"j2-ultra": {
|
||||
"max_tokens": 8192,
|
||||
"max_input_tokens": 8192,
|
||||
|
@ -1762,7 +1917,8 @@
|
|||
"input_cost_per_token": 0.000003,
|
||||
"output_cost_per_token": 0.000015,
|
||||
"litellm_provider": "bedrock",
|
||||
"mode": "chat"
|
||||
"mode": "chat",
|
||||
"supports_function_calling": true
|
||||
},
|
||||
"anthropic.claude-3-haiku-20240307-v1:0": {
|
||||
"max_tokens": 4096,
|
||||
|
@ -1771,7 +1927,8 @@
|
|||
"input_cost_per_token": 0.00000025,
|
||||
"output_cost_per_token": 0.00000125,
|
||||
"litellm_provider": "bedrock",
|
||||
"mode": "chat"
|
||||
"mode": "chat",
|
||||
"supports_function_calling": true
|
||||
},
|
||||
"anthropic.claude-3-opus-20240229-v1:0": {
|
||||
"max_tokens": 4096,
|
||||
|
@ -1780,7 +1937,8 @@
|
|||
"input_cost_per_token": 0.000015,
|
||||
"output_cost_per_token": 0.000075,
|
||||
"litellm_provider": "bedrock",
|
||||
"mode": "chat"
|
||||
"mode": "chat",
|
||||
"supports_function_calling": true
|
||||
},
|
||||
"anthropic.claude-v1": {
|
||||
"max_tokens": 8191,
|
||||
|
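The pricing map above is what litellm loads at import time, so once a release ships with these entries the new Groq, Replicate, and OpenRouter models pick up cost tracking and capability checks automatically. A minimal sketch of how a client reads them, assuming the installed `litellm` version includes the updated `model_prices_and_context_window.json` (the helpers shown are the commonly used public utilities; treat their exact signatures as an assumption, not part of this diff):

```python
import litellm

# The JSON above is loaded into litellm.model_cost, keyed by model name
# (assumption: the new groq entry is present in the installed version).
info = litellm.model_cost.get("groq/llama3-8b-8192", {})
print(info.get("input_cost_per_token"))   # expected 1e-07 per the entry above
print(info.get("output_cost_per_token"))  # expected 1e-07 per the entry above

# Context-window and capability helpers read the same entries.
print(litellm.get_max_tokens("groq/llama3-8b-8192"))                   # expected 8192
print(litellm.supports_function_calling(model="groq/llama3-8b-8192"))  # expected True
```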
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long

@@ -1 +1 @@
[Single-line minified Next.js webpack runtime chunk for the proxy UI; regenerated build output. The only difference between the removed and added line is the referenced stylesheet hash: static/css/dc347b0d22ffde5d.css -> static/css/5e699db73bf6f8c2.css.]

File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long

@@ -1 +1 @@
[Single-line prerendered "LiteLLM Dashboard" HTML page; regenerated build output. The change swaps the referenced build assets: webpack-75b5d58291566cf9.js -> webpack-ccae12a25017afa5.js, fd9d1056-bcf69420342937de.js -> fd9d1056-dafd44dfa2da140c.js, 69-442a9c01c3fd20f9.js -> 69-e49705773ae41779.js, static/css/dc347b0d22ffde5d.css -> static/css/5e699db73bf6f8c2.css, page chunk 289-04be6cb9636840d2.js / page-cb85da9a307105a0.js -> 447-9f8d32190ff7d16d.js / page-508c39694bd40fe9.js, and buildId y7Wf8hfvd5KooOO87je1n -> kbGdRQFfI6W3bEwfzmJDI.]

@@ -1,7 +1,7 @@
 2:I[77831,[],""]
-3:I[21225,["289","static/chunks/289-04be6cb9636840d2.js","931","static/chunks/app/page-cb85da9a307105a0.js"],""]
+3:I[27125,["447","static/chunks/447-9f8d32190ff7d16d.js","931","static/chunks/app/page-508c39694bd40fe9.js"],""]
 4:I[5613,[],""]
 5:I[31778,[],""]
[The "0:" line is the single-line Next.js RSC payload for the same dashboard page; the removed and added versions differ only in the buildId (y7Wf8hfvd5KooOO87je1n -> kbGdRQFfI6W3bEwfzmJDI) and the stylesheet hash (dc347b0d22ffde5d.css -> 5e699db73bf6f8c2.css).]
 6:[["$","meta","0",{"name":"viewport","content":"width=device-width, initial-scale=1"}],["$","meta","1",{"charSet":"utf-8"}],["$","title","2",{"children":"LiteLLM Dashboard"}],["$","meta","3",{"name":"description","content":"LiteLLM Proxy Admin UI"}],["$","link","4",{"rel":"icon","href":"/ui/favicon.ico","type":"image/x-icon","sizes":"16x16"}],["$","meta","5",{"name":"next-size-adjust"}]]
 1:null
@@ -1,52 +1,61 @@
model_list:
- model_name: fake-openai-endpoint
litellm_params:
model: openai/my-fake-model
api_key: my-fake-key
# api_base: https://openai-function-calling-workers.tasslexyz.workers.dev/
api_base: http://0.0.0.0:8080
stream_timeout: 0.001
rpm: 10
- litellm_params:
model: azure/chatgpt-v-2
api_base: os.environ/AZURE_API_BASE
api_key: os.environ/AZURE_API_KEY
api_version: "2023-07-01-preview"
stream_timeout: 0.001
model_name: azure-gpt-3.5
# - model_name: text-embedding-ada-002
# litellm_params:
# model: text-embedding-ada-002
# api_key: os.environ/OPENAI_API_KEY
- model_name: gpt-instruct
litellm_params:
model: text-completion-openai/gpt-3.5-turbo-instruct
# api_key: my-fake-key
# api_base: https://exampleopenaiendpoint-production.up.railway.app/

environment_variables:
LANGFUSE_PUBLIC_KEY: Q6K8MQN6L7sPYSJiFKM9eNrETOx6V/FxVPup4FqdKsZK1hyR4gyanlQ2KHLg5D5afng99uIt0JCEQ2jiKF9UxFvtnb4BbJ4qpeceH+iK8v/bdg==
LANGFUSE_SECRET_KEY: 5xQ7KMa6YMLsm+H/Pf1VmlqWq1NON5IoCxABhkUBeSck7ftsj2CmpkL2ZwrxwrktgiTUBH+3gJYBX+XBk7lqOOUpvmiLjol/E5lCqq0M1CqLWA==
SLACK_WEBHOOK_URL: RJjhS0Hhz0/s07sCIf1OTXmTGodpK9L2K9p953Z+fOX0l2SkPFT6mB9+yIrLufmlwEaku5NNEBKy//+AG01yOd+7wV1GhK65vfj3B/gTN8t5cuVnR4vFxKY5Rx4eSGLtzyAs+aIBTp4GoNXDIjroCqfCjPkItEZWCg==
general_settings:
alerting:
- slack
alerting_threshold: 300
database_connection_pool_limit: 100
database_connection_timeout: 60
disable_master_key_return: true
health_check_interval: 300
proxy_batch_write_at: 60
ui_access_mode: all
# master_key: sk-1234
litellm_settings:
success_callback: ["prometheus"]
service_callback: ["prometheus_system"]
upperbound_key_generate_params:
max_budget: os.environ/LITELLM_UPPERBOUND_KEYS_MAX_BUDGET

allowed_fails: 3
failure_callback:
- prometheus
num_retries: 3
service_callback:
- prometheus_system
success_callback:
- langfuse
- prometheus
- langsmith
model_list:
- litellm_params:
model: gpt-3.5-turbo
model_name: gpt-3.5-turbo
- litellm_params:
api_base: https://openai-function-calling-workers.tasslexyz.workers.dev/
api_key: my-fake-key
model: openai/my-fake-model
stream_timeout: 0.001
model_name: fake-openai-endpoint
- litellm_params:
api_base: https://openai-function-calling-workers.tasslexyz.workers.dev/
api_key: my-fake-key
model: openai/my-fake-model-2
stream_timeout: 0.001
model_name: fake-openai-endpoint
- litellm_params:
api_base: os.environ/AZURE_API_BASE
api_key: os.environ/AZURE_API_KEY
api_version: 2023-07-01-preview
model: azure/chatgpt-v-2
stream_timeout: 0.001
model_name: azure-gpt-3.5
- litellm_params:
api_key: os.environ/OPENAI_API_KEY
model: text-embedding-ada-002
model_name: text-embedding-ada-002
- litellm_params:
model: text-completion-openai/gpt-3.5-turbo-instruct
model_name: gpt-instruct
router_settings:
routing_strategy: usage-based-routing-v2
enable_pre_call_checks: true
redis_host: os.environ/REDIS_HOST
redis_password: os.environ/REDIS_PASSWORD
redis_port: os.environ/REDIS_PORT
enable_pre_call_checks: True

general_settings:
master_key: sk-1234
allow_user_auth: true
alerting: ["slack"]
store_model_in_db: True // set via environment variable - os.environ["STORE_MODEL_IN_DB"] = "True"
proxy_batch_write_at: 5 # 👈 Frequency of batch writing logs to server (in seconds)
enable_jwt_auth: True
alerting: ["slack"]
litellm_jwtauth:
admin_jwt_scope: "litellm_proxy_admin"
public_key_ttl: os.environ/LITELLM_PUBLIC_KEY_TTL
user_id_jwt_field: "sub"
org_id_jwt_field: "azp"
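The `router_settings` block in the rewritten config above is what the proxy hands to the litellm Router at startup. For readers who want the same behaviour outside the proxy, here is a minimal sketch in plain Python, assuming a reachable Redis instance and reusing the placeholder model entry from the config; the keyword arguments are the commonly documented `Router` parameters and should be treated as assumptions rather than as part of this diff:

```python
import os
from litellm import Router

# Mirrors router_settings above: usage-based routing v2 backed by Redis,
# with pre-call checks (context-window / rate-limit filtering) enabled.
router = Router(
    model_list=[
        {
            "model_name": "fake-openai-endpoint",   # alias callers use
            "litellm_params": {
                "model": "openai/my-fake-model",    # placeholder backend
                "api_key": "my-fake-key",
                "api_base": "http://0.0.0.0:8080",
            },
        }
    ],
    routing_strategy="usage-based-routing-v2",
    enable_pre_call_checks=True,
    redis_host=os.environ.get("REDIS_HOST", "localhost"),
    redis_password=os.environ.get("REDIS_PASSWORD"),
    redis_port=int(os.environ.get("REDIS_PORT", "6379")),
    num_retries=3,
    allowed_fails=3,
)

response = router.completion(
    model="fake-openai-endpoint",
    messages=[{"role": "user", "content": "ping"}],
)
print(response.choices[0].message.content)
```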
19 litellm/proxy/_super_secret_config.yaml Normal file

@@ -0,0 +1,19 @@
+model_list:
+- litellm_params:
+    api_base: http://0.0.0.0:8080
+    api_key: my-fake-key
+    model: openai/my-fake-model
+    rpm: 100
+  model_name: fake-openai-endpoint
+- litellm_params:
+    api_base: http://0.0.0.0:8081
+    api_key: my-fake-key
+    model: openai/my-fake-model-2
+    rpm: 100
+  model_name: fake-openai-endpoint
+router_settings:
+  num_retries: 0
+  enable_pre_call_checks: true
+  redis_host: os.environ/REDIS_HOST
+  redis_password: os.environ/REDIS_PASSWORD
+  redis_port: os.environ/REDIS_PORT
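The new `_super_secret_config.yaml` reads like a load-test fixture: two fake OpenAI-compatible backends on ports 8080 and 8081, both registered under the single `fake-openai-endpoint` alias, with retries disabled and Redis-backed routing. A hedged sketch of how such a config is typically exercised once the proxy is running against it; the proxy port, the `sk-1234` key, and the locally running fake backends are assumptions, not part of the diff:

```python
# Assumed startup command (not part of the diff):
#   litellm --config litellm/proxy/_super_secret_config.yaml --port 4000
from openai import OpenAI

# The proxy speaks the OpenAI API, so any OpenAI client can call it; the key
# only matters if a master key / virtual keys are configured on the proxy.
client = OpenAI(api_key="sk-1234", base_url="http://0.0.0.0:4000")

resp = client.chat.completions.create(
    model="fake-openai-endpoint",  # alias defined in the config above
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```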
Some files were not shown because too many files have changed in this diff.