import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Prompt Caching

For OpenAI + Anthropic + Deepseek, LiteLLM follows the OpenAI prompt caching usage object format:

```bash
"usage": {
  "prompt_tokens": 2006,
  "completion_tokens": 300,
  "total_tokens": 2306,
  "prompt_tokens_details": {
    "cached_tokens": 1920
  },
  "completion_tokens_details": {
    "reasoning_tokens": 0
  }
  # ANTHROPIC_ONLY #
  "cache_creation_input_tokens": 0
}
```

- `prompt_tokens`: The non-cached prompt tokens (same as Anthropic, equivalent to Deepseek `prompt_cache_miss_tokens`).
- `completion_tokens`: The output tokens generated by the model.
- `total_tokens`: Sum of `prompt_tokens` + `completion_tokens`.
- `prompt_tokens_details`: Object containing `cached_tokens`.
  - `cached_tokens`: Tokens that were a cache hit for that call.
- `completion_tokens_details`: Object containing `reasoning_tokens`.
- **ANTHROPIC_ONLY**: `cache_creation_input_tokens` is the number of tokens written to the cache. (Anthropic charges for this.)

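To read these fields programmatically, here is a minimal sketch (the helper name is hypothetical; the `getattr` guards cover providers that omit a field):

```python
# Hypothetical helper for illustration - reads the usage object documented above.
def describe_cache_usage(usage) -> None:
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) or 0
    print("prompt tokens        :", usage.prompt_tokens)
    print("cache-hit tokens     :", cached)
    # Anthropic only: tokens written to the cache on this call (Anthropic bills these).
    print("cache-creation tokens:", getattr(usage, "cache_creation_input_tokens", None))
```

Call it with `response.usage` from any of the completion responses below.
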
## Quick Start

Note: OpenAI caching is only available for prompts containing 1024 tokens or more.

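Since caching only kicks in above that threshold, it can help to sanity-check your prompt size first. A minimal sketch using LiteLLM's token counter (the exact count is model-dependent):

```python
from litellm import token_counter

# Roughly the same long system prompt used in the examples below.
messages = [{"role": "system", "content": "Here is the full text of a complex legal agreement" * 400}]

# Counts tokens the way LiteLLM does for this model; OpenAI caching requires >= 1024 prompt tokens.
n_tokens = token_counter(model="gpt-4o", messages=messages)
print(f"prompt tokens: {n_tokens} (cacheable: {n_tokens >= 1024})")
```
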
<Tabs>
<TabItem value="sdk" label="SDK">

```python
from litellm import completion
import os

os.environ["OPENAI_API_KEY"] = ""

for _ in range(2):
    response = completion(
        model="gpt-4o",
        messages=[
            # System Message
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": "Here is the full text of a complex legal agreement"
                        * 400,
                    }
                ],
            },
            # OpenAI caches the shared prompt prefix automatically - no cache_control parameter is needed.
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What are the key terms and conditions in this agreement?",
                    }
                ],
            },
            {
                "role": "assistant",
                "content": "Certainly! the key terms and conditions are the following: the contract is 1 year long for $10/mo",
            },
            # Re-sending the same prefix in follow-up turns lets them read from the cache.
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What are the key terms and conditions in this agreement?",
                    }
                ],
            },
        ],
        temperature=0.2,
        max_tokens=10,
    )

print("response=", response)
print("response.usage=", response.usage)

assert "prompt_tokens_details" in response.usage
assert response.usage.prompt_tokens_details.cached_tokens > 0
```

</TabItem>
<TabItem value="proxy" label="PROXY">

1. Setup config.yaml

```yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
```

2. Start proxy

```bash
litellm --config /path/to/config.yaml
```

3. Test it!

```python
from openai import OpenAI
import os

client = OpenAI(
    api_key="LITELLM_PROXY_KEY", # sk-1234
    base_url="LITELLM_PROXY_BASE" # http://0.0.0.0:4000
)

for _ in range(2):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            # System Message
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": "Here is the full text of a complex legal agreement"
                        * 400,
                    }
                ],
            },
            # OpenAI caches the shared prompt prefix automatically - no cache_control parameter is needed.
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What are the key terms and conditions in this agreement?",
                    }
                ],
            },
            {
                "role": "assistant",
                "content": "Certainly! the key terms and conditions are the following: the contract is 1 year long for $10/mo",
            },
            # Re-sending the same prefix in follow-up turns lets them read from the cache.
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What are the key terms and conditions in this agreement?",
                    }
                ],
            },
        ],
        temperature=0.2,
        max_tokens=10,
    )

print("response=", response)
print("response.usage=", response.usage)

assert "prompt_tokens_details" in response.usage
assert response.usage.prompt_tokens_details.cached_tokens > 0
```

</TabItem>
</Tabs>

### Anthropic Example

Anthropic charges for cache writes.

Specify the content to cache with `"cache_control": {"type": "ephemeral"}`.

If you pass that in for any other LLM provider, it will be ignored.

<Tabs>
<TabItem value="sdk" label="SDK">

```python
from litellm import completion
import litellm
import os

litellm.set_verbose = True # 👈 SEE RAW REQUEST
os.environ["ANTHROPIC_API_KEY"] = ""

response = completion(
    model="anthropic/claude-3-5-sonnet-20240620",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an AI assistant tasked with analyzing legal documents.",
                },
                {
                    "type": "text",
                    "text": "Here is the full text of a complex legal agreement" * 400,
                    "cache_control": {"type": "ephemeral"},
                },
            ],
        },
        {
            "role": "user",
            "content": "what are the key terms and conditions in this agreement?",
        },
    ]
)

print(response.usage)
```

</TabItem>
<TabItem value="proxy" label="PROXY">

1. Setup config.yaml

```yaml
model_list:
  - model_name: claude-3-5-sonnet-20240620
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20240620
      api_key: os.environ/ANTHROPIC_API_KEY
```

2. Start proxy

```bash
litellm --config /path/to/config.yaml
```

3. Test it!

```python
from openai import OpenAI
import os

client = OpenAI(
    api_key="LITELLM_PROXY_KEY", # sk-1234
    base_url="LITELLM_PROXY_BASE" # http://0.0.0.0:4000
)

response = client.chat.completions.create(
    model="claude-3-5-sonnet-20240620",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an AI assistant tasked with analyzing legal documents.",
                },
                {
                    "type": "text",
                    "text": "Here is the full text of a complex legal agreement" * 400,
                    "cache_control": {"type": "ephemeral"},
                },
            ],
        },
        {
            "role": "user",
            "content": "what are the key terms and conditions in this agreement?",
        },
    ]
)

print(response.usage)
```

</TabItem>
</Tabs>

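On the first call Anthropic writes the block marked with `cache_control` to the cache; repeating the identical request within the cache TTL should then read from it. A rough sketch of how that shows up in the usage object (assuming `response` is the SDK response from the example above; the `getattr` guards cover fields that may be absent):

```python
# Illustrative only - compare a cache-writing call with a cache-reading follow-up.
usage = response.usage

print("tokens written to cache:", getattr(usage, "cache_creation_input_tokens", 0))  # billed by Anthropic
details = getattr(usage, "prompt_tokens_details", None)
print("tokens read from cache :", getattr(details, "cached_tokens", 0) or 0)
```
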
### Deepseek Example

Works the same as OpenAI.

```python
from litellm import completion
import litellm
import os

os.environ["DEEPSEEK_API_KEY"] = ""

litellm.set_verbose = True # 👈 SEE RAW REQUEST

model_name = "deepseek/deepseek-chat"
messages_1 = [
    {
        "role": "system",
        "content": "You are a history expert. The user will provide a series of questions, and your answers should be concise and start with `Answer:`",
    },
    {
        "role": "user",
        "content": "In what year did Qin Shi Huang unify the six states?",
    },
    {"role": "assistant", "content": "Answer: 221 BC"},
    {"role": "user", "content": "Who was the founder of the Han Dynasty?"},
    {"role": "assistant", "content": "Answer: Liu Bang"},
    {"role": "user", "content": "Who was the last emperor of the Tang Dynasty?"},
    {"role": "assistant", "content": "Answer: Li Zhu"},
    {
        "role": "user",
        "content": "Who was the founding emperor of the Ming Dynasty?",
    },
    {"role": "assistant", "content": "Answer: Zhu Yuanzhang"},
    {
        "role": "user",
        "content": "Who was the founding emperor of the Qing Dynasty?",
    },
]

messages_2 = [
    {
        "role": "system",
        "content": "You are a history expert. The user will provide a series of questions, and your answers should be concise and start with `Answer:`",
    },
    {
        "role": "user",
        "content": "In what year did Qin Shi Huang unify the six states?",
    },
    {"role": "assistant", "content": "Answer: 221 BC"},
    {"role": "user", "content": "Who was the founder of the Han Dynasty?"},
    {"role": "assistant", "content": "Answer: Liu Bang"},
    {"role": "user", "content": "Who was the last emperor of the Tang Dynasty?"},
    {"role": "assistant", "content": "Answer: Li Zhu"},
    {
        "role": "user",
        "content": "Who was the founding emperor of the Ming Dynasty?",
    },
    {"role": "assistant", "content": "Answer: Zhu Yuanzhang"},
    {"role": "user", "content": "When did the Shang Dynasty fall?"},
]

response_1 = litellm.completion(model=model_name, messages=messages_1)
response_2 = litellm.completion(model=model_name, messages=messages_2)

# Add any assertions here to check the response
print(response_2.usage)
```

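If the second call reuses the shared prefix, the cache hit shows up in the same OpenAI-style usage field (Deepseek reports it natively as `prompt_cache_hit_tokens`). A quick check you could add after the snippet above, assuming the prefix was long enough to be cached:

```python
# Deepseek cache hits are surfaced in the OpenAI-style usage object.
cached = response_2.usage.prompt_tokens_details.cached_tokens
print("cached tokens on second call:", cached)
```
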
## Calculate Cost

The cost of cache-hit prompt tokens can differ from the cost of cache-miss prompt tokens.

Use the `completion_cost()` function for calculating cost ([handles prompt caching cost calculation](https://github.com/BerriAI/litellm/blob/f7ce1173f3315cc6cae06cf9bcf12e54a2a19705/litellm/llms/anthropic/cost_calculation.py#L12) as well). [**See more helper functions**](./token_usage.md)

```python
cost = completion_cost(completion_response=response, model=model)
```

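To see why a cache hit is cheaper, you can inspect the per-token rates LiteLLM has for a model. A rough sketch reading LiteLLM's public model cost map (key names follow `model_prices_and_context_window.json`; not every model defines the cache-specific rates):

```python
import litellm

# Per-token rates from LiteLLM's model cost map; cache-specific keys are optional per model.
rates = litellm.model_cost["claude-3-5-sonnet-20240620"]
print("regular input rate :", rates["input_cost_per_token"])
print("cache write rate   :", rates.get("cache_creation_input_token_cost"))
print("cache read rate    :", rates.get("cache_read_input_token_cost"))
```
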
### Usage

<Tabs>
<TabItem value="sdk" label="SDK">

```python
from litellm import completion, completion_cost
import litellm
import os

litellm.set_verbose = True # 👈 SEE RAW REQUEST
os.environ["ANTHROPIC_API_KEY"] = ""
model = "anthropic/claude-3-5-sonnet-20240620"
response = completion(
    model=model,
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an AI assistant tasked with analyzing legal documents.",
                },
                {
                    "type": "text",
                    "text": "Here is the full text of a complex legal agreement" * 400,
                    "cache_control": {"type": "ephemeral"},
                },
            ],
        },
        {
            "role": "user",
            "content": "what are the key terms and conditions in this agreement?",
        },
    ]
)

print(response.usage)

cost = completion_cost(completion_response=response, model=model)

formatted_string = f"${float(cost):.10f}"
print(formatted_string)
```

</TabItem>
<TabItem value="proxy" label="PROXY">

LiteLLM returns the calculated cost in the `x-litellm-response-cost` response header.

```python
from openai import OpenAI

client = OpenAI(
    api_key="LITELLM_PROXY_KEY", # sk-1234..
    base_url="LITELLM_PROXY_BASE" # http://0.0.0.0:4000
)
response = client.chat.completions.with_raw_response.create(
    messages=[{
        "role": "user",
        "content": "Say this is a test",
    }],
    model="gpt-3.5-turbo",
)
print(response.headers.get('x-litellm-response-cost'))

completion = response.parse() # get the object that `chat.completions.create()` would have returned
print(completion)
```

</TabItem>
</Tabs>

## Check Model Support

Check if a model supports prompt caching with `supports_prompt_caching()`.

<Tabs>
<TabItem value="sdk" label="SDK">

```python
from litellm.utils import supports_prompt_caching

supports_pc: bool = supports_prompt_caching(model="anthropic/claude-3-5-sonnet-20240620")

assert supports_pc
```

</TabItem>
<TabItem value="proxy" label="PROXY">

Use the `/model/info` endpoint to check if a model on the proxy supports prompt caching.

1. Setup config.yaml

```yaml
model_list:
  - model_name: claude-3-5-sonnet-20240620
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20240620
      api_key: os.environ/ANTHROPIC_API_KEY
```

2. Start proxy

```bash
litellm --config /path/to/config.yaml
```

3. Test it!

```bash
curl -L -X GET 'http://0.0.0.0:4000/v1/model/info' \
-H 'Authorization: Bearer sk-1234'
```

**Expected Response**

```bash
{
    "data": [
        {
            "model_name": "claude-3-5-sonnet-20240620",
            "litellm_params": {
                "model": "anthropic/claude-3-5-sonnet-20240620"
            },
            "model_info": {
                "key": "claude-3-5-sonnet-20240620",
                ...
                "supports_prompt_caching": true # 👈 LOOK FOR THIS!
            }
        }
    ]
}
```

</TabItem>
</Tabs>

This checks our maintained [model info/cost map](https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json).
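
In the SDK, the same information can be read from LiteLLM's model-info helper. A minimal sketch, assuming `get_model_info()` surfaces the `supports_prompt_caching` flag from that cost map:

```python
from litellm import get_model_info

# Looks up the maintained model info/cost map entry for this model.
info = get_model_info(model="anthropic/claude-3-5-sonnet-20240620")
print(info.get("supports_prompt_caching"))  # expected: True for this model
```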