# LiteLLM - Local Caching
## Caching `completion()` and `embedding()` calls when switched on
LiteLLM implements exact-match caching and supports the following cache backends (a Redis configuration sketch follows the list):
* In-Memory Caching [Default]
* Local Redis Caching
* Hosted Redis Caching
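
To use Redis instead of the default in-memory store, pass the connection details when constructing the `Cache`. A minimal sketch, assuming the `type="redis"` option with `host`/`port`/`password` parameters on `litellm.caching.Cache`; the environment variable names are placeholders:

```python
import os
import litellm
from litellm.caching import Cache

# Hedged sketch: point the cache at a local or hosted Redis instance.
# REDIS_HOST / REDIS_PORT / REDIS_PASSWORD are placeholder env vars.
litellm.cache = Cache(
    type="redis",
    host=os.environ["REDIS_HOST"],
    port=os.environ["REDIS_PORT"],
    password=os.environ["REDIS_PASSWORD"],
)
```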
## Quick Start Usage - Completion
The cache key is generated from the call's input parameters (for example, `model` and `messages`), so the two identical calls below result in a cache hit:
```python
import litellm
from litellm import completion
from litellm.caching import Cache

# Enable caching by assigning a Cache instance (in-memory by default)
litellm.cache = Cache()

# First call - hits the API and stores the response in the cache
response1 = completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Tell me a joke."}],
    caching=True
)

# Second identical call - served from the cache
response2 = completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Tell me a joke."}],
    caching=True
)

# response1 == response2; response2 is returned from the cache
```
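
Note: the cache key is built from the request parameters, so changing anything in the call (for example the `messages` content or the `model`) should result in a cache miss and a fresh API call.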
## Custom Key-Value Pairs
Add custom key-value pairs to your cache.
```python
from litellm.caching import Cache

cache = Cache()

# Store a custom value under an explicit key, then read it back
cache.add_cache(cache_key="test-key", result="1234")
cache.get_cache(cache_key="test-key")
```
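
With the default in-memory backend, `get_cache` returns the value previously stored under the key; for a key that was never added it should return `None` (a cache miss).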
## Caching with Streaming
LiteLLM can cache your streamed responses for you
### Usage
```python
import litellm
from litellm import completion
from litellm.caching import Cache

litellm.cache = Cache()

# First streamed call - the response is stored in the cache
response1 = completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Tell me a joke."}],
    stream=True,
    caching=True)
for chunk in response1:
    print(chunk)

# Second identical streamed call - replayed from the cache
response2 = completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Tell me a joke."}],
    stream=True,
    caching=True)
for chunk in response2:
    print(chunk)
```
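
To compare the two streamed responses directly, accumulate the chunk text instead of printing it. A small sketch, assuming the OpenAI-style chunk shape (`choices[0].delta.content`) that LiteLLM's streaming responses follow; `collect_stream` is a hypothetical helper, not part of LiteLLM:

```python
def collect_stream(stream):
    # Concatenate the text content of every streamed chunk.
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta
        if getattr(delta, "content", None):
            parts.append(delta.content)
    return "".join(parts)
```

Use it in place of the `print` loops above, since a stream can only be consumed once.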
## Usage - Embedding()
The cache key is generated from the call's input parameters (for example, `model` and `input`), so the second identical call below results in a cache hit:
```python
import time
import litellm
from litellm import embedding
from litellm.caching import Cache

litellm.cache = Cache()

# First call - computes the embedding and stores it in the cache
start_time = time.time()
embedding1 = embedding(model="text-embedding-ada-002", input=["hello from litellm"*5], caching=True)
end_time = time.time()
print(f"Embedding 1 response time: {end_time - start_time} seconds")

# Second identical call - served from the cache, so it should be noticeably faster
start_time = time.time()
embedding2 = embedding(model="text-embedding-ada-002", input=["hello from litellm"*5], caching=True)
end_time = time.time()
print(f"Embedding 2 response time: {end_time - start_time} seconds")
```
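
As a quick sanity check, the cached call should return the same vector and report a much lower response time. A sketch, assuming the OpenAI-style embedding response shape (`data[0]["embedding"]`):

```python
# The second, cached call should return the same embedding vector.
vec1 = embedding1["data"][0]["embedding"]
vec2 = embedding2["data"][0]["embedding"]
assert vec1 == vec2
```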