# Custom API Server (Custom Format)

Call your custom torch-serve / internal LLM APIs via LiteLLM

:::info

- For calling an openai-compatible endpoint, [go here](./openai_compatible.md)
- For modifying incoming/outgoing calls on proxy, [go here](../proxy/call_hooks.md)

:::

## Quick Start

```python
import litellm
from litellm import CustomLLM, completion, get_llm_provider


class MyCustomLLM(CustomLLM):
    def completion(self, *args, **kwargs) -> litellm.ModelResponse:
        return litellm.completion(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": "Hello world"}],
            mock_response="Hi!",
        )  # type: ignore


my_custom_llm = MyCustomLLM()

litellm.custom_provider_map = [  # 👈 KEY STEP - REGISTER HANDLER
    {"provider": "my-custom-llm", "custom_handler": my_custom_llm}
]

resp = completion(
    model="my-custom-llm/my-fake-model",
    messages=[{"role": "user", "content": "Hello world!"}],
)

assert resp.choices[0].message.content == "Hi!"
```

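In a real handler, `completion` would read the incoming request off `kwargs` and forward it to your own backend instead of returning a mock. A minimal sketch, assuming LiteLLM passes `messages` to the handler as a keyword argument (the `optional_params` example later on this page follows the same pattern) and assuming a hypothetical internal HTTP endpoint and response shape:

```python
import requests

import litellm
from litellm import CustomLLM


class MyBackendLLM(CustomLLM):
    def completion(self, *args, **kwargs) -> litellm.ModelResponse:
        messages = kwargs.get("messages", [])
        # Hypothetical internal endpoint + response field - replace with your own API.
        r = requests.post(
            "http://localhost:8080/generate",
            json={"messages": messages},
            timeout=30,
        )
        text = r.json()["text"]
        # Wrap your backend's text in an OpenAI-style ModelResponse via mock_response.
        return litellm.completion(
            model="gpt-3.5-turbo",
            messages=messages,
            mock_response=text,
        )  # type: ignore
```
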
## OpenAI Proxy Usage

1. Setup your `custom_handler.py` file

```python
import litellm
from litellm import CustomLLM, completion, get_llm_provider


class MyCustomLLM(CustomLLM):
    def completion(self, *args, **kwargs) -> litellm.ModelResponse:
        return litellm.completion(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": "Hello world"}],
            mock_response="Hi!",
        )  # type: ignore

    async def acompletion(self, *args, **kwargs) -> litellm.ModelResponse:
        return litellm.completion(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": "Hello world"}],
            mock_response="Hi!",
        )  # type: ignore


my_custom_llm = MyCustomLLM()
```

2. Add to `config.yaml`

In the config below, we pass:

- `python_filename`: `custom_handler.py`
- `custom_handler_instance_name`: `my_custom_llm`. This is the instance defined in Step 1.
- `custom_handler`: `custom_handler.my_custom_llm`

```yaml
model_list:
  - model_name: "test-model"
    litellm_params:
      model: "openai/text-embedding-ada-002"
  - model_name: "my-custom-model"
    litellm_params:
      model: "my-custom-llm/my-model"

litellm_settings:
  custom_provider_map:
  - {"provider": "my-custom-llm", "custom_handler": custom_handler.my_custom_llm}
```

```bash
litellm --config /path/to/config.yaml
```

3. Test it!

```bash
curl -X POST 'http://0.0.0.0:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
    "model": "my-custom-model",
    "messages": [{"role": "user", "content": "Say \"this is a test\" in JSON!"}]
}'
```

Expected Response

```
{
    "id": "chatcmpl-06f1b9cd-08bc-43f7-9814-a69173921216",
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "content": "Hi!",
                "role": "assistant",
                "tool_calls": null,
                "function_call": null
            }
        }
    ],
    "created": 1721955063,
    "model": "gpt-3.5-turbo",
    "object": "chat.completion",
    "system_fingerprint": null,
    "usage": {
        "prompt_tokens": 10,
        "completion_tokens": 20,
        "total_tokens": 30
    }
}
```

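Because the proxy exposes an OpenAI-compatible endpoint, any OpenAI client can call the custom model as well. A minimal sketch with the OpenAI Python SDK, pointed at the same proxy URL and key as the curl example above:

```python
from openai import OpenAI

# Same base URL and key as the curl example above.
client = OpenAI(api_key="sk-1234", base_url="http://0.0.0.0:4000")

resp = client.chat.completions.create(
    model="my-custom-model",
    messages=[{"role": "user", "content": "Hello world!"}],
)
print(resp.choices[0].message.content)  # "Hi!" from the mock handler
```
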
## Add Streaming Support

Here's a simple example of returning unix epoch seconds for both completion + streaming use-cases.

s/o [@Eloy Lafuente](https://github.com/stronk7) for this code example.

```python
import time
from typing import Iterator, AsyncIterator
from litellm.types.utils import GenericStreamingChunk, ModelResponse
from litellm import CustomLLM, completion, acompletion


class UnixTimeLLM(CustomLLM):
    def completion(self, *args, **kwargs) -> ModelResponse:
        return completion(
            model="test/unixtime",
            mock_response=str(int(time.time())),
        )  # type: ignore

    async def acompletion(self, *args, **kwargs) -> ModelResponse:
        return await acompletion(
            model="test/unixtime",
            mock_response=str(int(time.time())),
        )  # type: ignore

    def streaming(self, *args, **kwargs) -> Iterator[GenericStreamingChunk]:
        generic_streaming_chunk: GenericStreamingChunk = {
            "finish_reason": "stop",
            "index": 0,
            "is_finished": True,
            "text": str(int(time.time())),
            "tool_use": None,
            "usage": {"completion_tokens": 0, "prompt_tokens": 0, "total_tokens": 0},
        }
        return generic_streaming_chunk  # type: ignore

    async def astreaming(self, *args, **kwargs) -> AsyncIterator[GenericStreamingChunk]:
        generic_streaming_chunk: GenericStreamingChunk = {
            "finish_reason": "stop",
            "index": 0,
            "is_finished": True,
            "text": str(int(time.time())),
            "tool_use": None,
            "usage": {"completion_tokens": 0, "prompt_tokens": 0, "total_tokens": 0},
        }
        yield generic_streaming_chunk  # type: ignore


unixtime = UnixTimeLLM()
```

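To exercise this handler from the SDK, register it in `custom_provider_map` and pass `stream=True`, which dispatches to the `streaming`/`astreaming` routes. A minimal sketch, assuming the handler above; the provider name `unixtime` is an arbitrary choice:

```python
import litellm
from litellm import completion

# Register the UnixTimeLLM instance defined above under an arbitrary provider name.
litellm.custom_provider_map = [
    {"provider": "unixtime", "custom_handler": unixtime}
]

# Non-streaming call -> UnixTimeLLM.completion()
resp = completion(model="unixtime/epoch", messages=[{"role": "user", "content": "time?"}])
print(resp.choices[0].message.content)  # unix epoch seconds as a string

# Streaming call -> UnixTimeLLM.streaming(); chunks mirror the OpenAI delta format.
for chunk in completion(
    model="unixtime/epoch",
    messages=[{"role": "user", "content": "time?"}],
    stream=True,
):
    print(chunk.choices[0].delta.content)
```
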
## Image Generation

1. Setup your `custom_handler.py` file

```python
import time
from typing import Any, Optional, Union

import httpx
import litellm
from litellm import CustomLLM
from litellm.llms.custom_httpx.http_handler import AsyncHTTPHandler
from litellm.types.utils import ImageResponse, ImageObject


class MyCustomLLM(CustomLLM):
    async def aimage_generation(self, model: str, prompt: str, model_response: ImageResponse, optional_params: dict, logging_obj: Any, timeout: Optional[Union[float, httpx.Timeout]] = None, client: Optional[AsyncHTTPHandler] = None,) -> ImageResponse:
        return ImageResponse(
            created=int(time.time()),
            data=[ImageObject(url="https://example.com/image.png")],
        )


my_custom_llm = MyCustomLLM()
```

2. Add to `config.yaml`

In the config below, we pass:

- `python_filename`: `custom_handler.py`
- `custom_handler_instance_name`: `my_custom_llm`. This is the instance defined in Step 1.
- `custom_handler`: `custom_handler.my_custom_llm`

```yaml
model_list:
  - model_name: "test-model"
    litellm_params:
      model: "openai/text-embedding-ada-002"
  - model_name: "my-custom-model"
    litellm_params:
      model: "my-custom-llm/my-model"

litellm_settings:
  custom_provider_map:
  - {"provider": "my-custom-llm", "custom_handler": custom_handler.my_custom_llm}
```

```bash
litellm --config /path/to/config.yaml
```

3. Test it!

```bash
curl -X POST 'http://0.0.0.0:4000/v1/images/generations' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
    "model": "my-custom-model",
    "prompt": "A cute baby sea otter"
}'
```

Expected Response

```
{
    "created": 1721955063,
    "data": [{"url": "https://example.com/image.png"}]
}
```

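The same handler can also be called directly from the SDK once it is registered in `custom_provider_map`. A minimal sketch using the async entrypoint (only `aimage_generation` is implemented in Step 1, so the call goes through `litellm.aimage_generation`):

```python
import asyncio

import litellm

# Register the handler instance from Step 1.
litellm.custom_provider_map = [
    {"provider": "my-custom-llm", "custom_handler": my_custom_llm}
]

async def main():
    resp = await litellm.aimage_generation(
        model="my-custom-llm/my-model",
        prompt="A cute baby sea otter",
    )
    print(resp.data[0].url)  # "https://example.com/image.png" from the handler

asyncio.run(main())
```
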
## Additional Parameters

Additional parameters are passed to your handler inside the `optional_params` key of the `completion` or `image_generation` call.

Here's how to set this:

<Tabs>
<TabItem value="sdk" label="SDK">

```python
import litellm
from litellm import CustomLLM, completion, get_llm_provider


class MyCustomLLM(CustomLLM):
    def completion(self, *args, **kwargs) -> litellm.ModelResponse:
        assert kwargs["optional_params"] == {"my_custom_param": "my-custom-param"} # 👈 CHECK HERE
        return litellm.completion(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": "Hello world"}],
            mock_response="Hi!",
        )  # type: ignore


my_custom_llm = MyCustomLLM()

litellm.custom_provider_map = [  # 👈 KEY STEP - REGISTER HANDLER
    {"provider": "my-custom-llm", "custom_handler": my_custom_llm}
]

resp = completion(model="my-custom-llm/my-model", my_custom_param="my-custom-param")
```

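Inside the handler you would normally read the value off `kwargs["optional_params"]` rather than assert on it. A small sketch, assuming the same `my_custom_param` as above (the fallback default is illustrative):

```python
class MyCustomLLM(CustomLLM):
    def completion(self, *args, **kwargs) -> litellm.ModelResponse:
        # Pull the custom param out of optional_params; the default is just illustrative.
        my_custom_param = kwargs.get("optional_params", {}).get("my_custom_param", "default-value")
        return litellm.completion(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": "Hello world"}],
            mock_response=f"Got {my_custom_param}!",
        )  # type: ignore
```
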
</TabItem>
<TabItem value="proxy" label="Proxy">

1. Setup your `custom_handler.py` file

```python
import time
from typing import Any, Optional, Union

import httpx
import litellm
from litellm import CustomLLM
from litellm.llms.custom_httpx.http_handler import AsyncHTTPHandler
from litellm.types.utils import ImageResponse, ImageObject


class MyCustomLLM(CustomLLM):
    async def aimage_generation(self, model: str, prompt: str, model_response: ImageResponse, optional_params: dict, logging_obj: Any, timeout: Optional[Union[float, httpx.Timeout]] = None, client: Optional[AsyncHTTPHandler] = None,) -> ImageResponse:
        assert optional_params == {"my_custom_param": "my-custom-param"} # 👈 CHECK HERE
        return ImageResponse(
            created=int(time.time()),
            data=[ImageObject(url="https://example.com/image.png")],
        )


my_custom_llm = MyCustomLLM()
```

2. Add to `config.yaml`

In the config below, we pass:

- `python_filename`: `custom_handler.py`
- `custom_handler_instance_name`: `my_custom_llm`. This is the instance defined in Step 1.
- `custom_handler`: `custom_handler.my_custom_llm`

```yaml
model_list:
  - model_name: "test-model"
    litellm_params:
      model: "openai/text-embedding-ada-002"
  - model_name: "my-custom-model"
    litellm_params:
      model: "my-custom-llm/my-model"
      my_custom_param: "my-custom-param" # 👈 CUSTOM PARAM

litellm_settings:
  custom_provider_map:
  - {"provider": "my-custom-llm", "custom_handler": custom_handler.my_custom_llm}
```

```bash
litellm --config /path/to/config.yaml
```

3. Test it!

```bash
curl -X POST 'http://0.0.0.0:4000/v1/images/generations' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
    "model": "my-custom-model",
    "prompt": "A cute baby sea otter"
}'
```

</TabItem>
</Tabs>

## Custom Handler Spec

```python
import httpx
from litellm.types.utils import GenericStreamingChunk, ModelResponse, ImageResponse
from typing import Iterator, AsyncIterator, Any, Optional, Union
from litellm.llms.base import BaseLLM
from litellm.llms.custom_httpx.http_handler import AsyncHTTPHandler, HTTPHandler


class CustomLLMError(Exception):  # use this for all your exceptions
    def __init__(
        self,
        status_code,
        message,
    ):
        self.status_code = status_code
        self.message = message
        super().__init__(
            self.message
        )  # Call the base class constructor with the parameters it needs


class CustomLLM(BaseLLM):
    def __init__(self) -> None:
        super().__init__()

    def completion(self, *args, **kwargs) -> ModelResponse:
        raise CustomLLMError(status_code=500, message="Not implemented yet!")

    def streaming(self, *args, **kwargs) -> Iterator[GenericStreamingChunk]:
        raise CustomLLMError(status_code=500, message="Not implemented yet!")

    async def acompletion(self, *args, **kwargs) -> ModelResponse:
        raise CustomLLMError(status_code=500, message="Not implemented yet!")

    async def astreaming(self, *args, **kwargs) -> AsyncIterator[GenericStreamingChunk]:
        raise CustomLLMError(status_code=500, message="Not implemented yet!")

    def image_generation(
        self,
        model: str,
        prompt: str,
        model_response: ImageResponse,
        optional_params: dict,
        logging_obj: Any,
        timeout: Optional[Union[float, httpx.Timeout]] = None,
        client: Optional[HTTPHandler] = None,
    ) -> ImageResponse:
        raise CustomLLMError(status_code=500, message="Not implemented yet!")

    async def aimage_generation(
        self,
        model: str,
        prompt: str,
        model_response: ImageResponse,
        optional_params: dict,
        logging_obj: Any,
        timeout: Optional[Union[float, httpx.Timeout]] = None,
        client: Optional[AsyncHTTPHandler] = None,
    ) -> ImageResponse:
        raise CustomLLMError(status_code=500, message="Not implemented yet!")
```

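Every route defaults to raising `CustomLLMError`, so a handler only needs to override the routes it actually supports; anything else surfaces as an error to the caller. A minimal sketch (the provider name `stub` is arbitrary, and the exact exception type LiteLLM re-raises to the caller is an assumption):

```python
import litellm
from litellm import CustomLLM, completion


class StubLLM(CustomLLM):
    pass  # overrides nothing; every route raises CustomLLMError(500, "Not implemented yet!")


litellm.custom_provider_map = [{"provider": "stub", "custom_handler": StubLLM()}]

try:
    completion(model="stub/anything", messages=[{"role": "user", "content": "hi"}])
except Exception as e:  # LiteLLM may wrap the CustomLLMError in its own exception class
    print(f"{type(e).__name__}: {e}")
```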