# Caching
liteLLM implements exact-match caching. It can be enabled by setting either of the following flags (a conceptual sketch follows the list):

- `litellm.caching`: when set to `True`, enables caching for all responses. Keys are the input `messages`, and the values stored in the cache are the corresponding `response`s.
- `litellm.caching_with_models`: when set to `True`, enables caching on a per-model basis. Keys are the input `messages + model`, and the values stored in the cache are the corresponding `response`s.
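Below is a minimal sketch of the exact-match idea, not litellm's actual implementation; `cached_completion` and `call_llm_api` are hypothetical names used only for illustration. The cache is a dictionary whose key is the serialized `messages`, with the `model` name added to the key when per-model caching is on.

```python
import json

cache = {}

def call_llm_api(model, messages):
    # Stand-in for the real network call; returns a fake response for illustration.
    return {"model": model, "choices": [{"message": {"content": "..."}}]}

def cached_completion(model, messages, caching_with_models=False):
    # Hypothetical helper; litellm applies this logic inside completion().
    key = json.dumps(messages, sort_keys=True)
    if caching_with_models:
        key = model + key        # per-model caching: model is part of the key
    if key in cache:
        return cache[key]        # exact match -> serve the stored response
    response = call_llm_api(model, messages)
    cache[key] = response        # store the response under the exact-match key
    return response
```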
## Usage
### Caching - `cache`

Keys in the cache are the input `messages`, so the following example leads to a cache hit:
```python
import litellm
from litellm import completion

litellm.caching = True

# Make completion calls
response1 = completion(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Tell me a joke."}])
response2 = completion(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Tell me a joke."}])
# response1 == response2; the first response was cached and the second is served from the cache

# With a different model
response3 = completion(model="command-nightly", messages=[{"role": "user", "content": "Tell me a joke."}])
# response3 == response1 == response2, since the cache key is only the messages, not the model
```
### Caching with Models - `caching_with_models`

Keys in the cache are the input `messages + model`, so sending the same messages to a different model will not lead to a cache hit:
```python
import litellm
from litellm import completion

litellm.caching_with_models = True

# Make completion calls
response1 = completion(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Tell me a joke."}])
response2 = completion(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Tell me a joke."}])
# response1 == response2; the first response was cached and the second is served from the cache

# With a different model, this calls the API since messages + model is not yet cached
response3 = completion(model="command-nightly", messages=[{"role": "user", "content": "Tell me a joke."}])
# response3 != response1, since the cache key is messages + model
```
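A quick way to sanity-check that caching is active is to time a repeated call. This is only a sketch: it assumes an API key for the provider is already configured and that a cached response returns almost instantly.

```python
import time
import litellm
from litellm import completion

litellm.caching_with_models = True
messages = [{"role": "user", "content": "Tell me a joke."}]

start = time.time()
completion(model="gpt-3.5-turbo", messages=messages)  # first call goes to the API
first_call = time.time() - start

start = time.time()
completion(model="gpt-3.5-turbo", messages=messages)  # exact match: served from the cache
cached_call = time.time() - start

print(f"API call: {first_call:.2f}s, cached call: {cached_call:.4f}s")
```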