Mirror of https://github.com/BerriAI/litellm.git (synced 2025-04-25 10:44:24 +00:00)
LiteLLM Minor Fixes & Improvements (11/05/2024) (#6590)
* fix(pattern_matching_router.py): update model name using correct function
* fix(langfuse.py): metadata deepcopy can cause unhandled error (#6563)
Co-authored-by: seva <seva@inita.com>
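The deepcopy fix above points at a defensive pattern like the following sketch; the helper name and the shallow-copy fallback are assumptions for illustration, not the actual Langfuse integration code:

```python
import copy

def safe_metadata_copy(metadata: dict) -> dict:
    # deepcopy can raise on non-picklable values (locks, clients, etc.);
    # fall back to a shallow copy so logging never breaks the request path.
    try:
        return copy.deepcopy(metadata)
    except Exception:
        return copy.copy(metadata)
```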
* fix(stream_chunk_builder_utils.py): correctly set prompt tokens + log correct streaming usage
Closes https://github.com/BerriAI/litellm/issues/6488
* build(deps): bump cookie and express in /docs/my-website (#6566)
Bumps [cookie](https://github.com/jshttp/cookie) and [express](https://github.com/expressjs/express). These dependencies needed to be updated together.
Updates `cookie` from 0.6.0 to 0.7.1
- [Release notes](https://github.com/jshttp/cookie/releases)
- [Commits](https://github.com/jshttp/cookie/compare/v0.6.0...v0.7.1)
Updates `express` from 4.20.0 to 4.21.1
- [Release notes](https://github.com/expressjs/express/releases)
- [Changelog](https://github.com/expressjs/express/blob/4.21.1/History.md)
- [Commits](https://github.com/expressjs/express/compare/4.20.0...4.21.1)
---
updated-dependencies:
- dependency-name: cookie
dependency-type: indirect
- dependency-name: express
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* docs(virtual_keys.md): update Dockerfile reference (#6554)
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
* (proxy fix) - call connect on prisma client when running setup (#6534)
* critical fix - call connect on prisma client when running setup
* fix test_proxy_server_prisma_setup
* fix test_proxy_server_prisma_setup
* Add 3.5 haiku (#6588)
* feat: add claude-3-5-haiku-20241022 entries
* feat: add claude-3-5-haiku-20241022 and vertex_ai/claude-3-5-haiku@20241022 models
* add missing entries, remove vision
* remove image token costs
* Litellm perf improvements 3 (#6573)
* perf: move writing key to cache, to background task
* perf(litellm_pre_call_utils.py): add otel tracing for pre-call utils
adds 200ms on calls with pgdb connected
* fix(litellm_pre_call_utils.py): rename call_type to actual call used
* perf(proxy_server.py): remove db logic from _get_config_from_file
was causing db calls to occur on every llm request, if team_id was set on key
* fix(auth_checks.py): add check for reducing db calls if user/team id does not exist in db
reduces latency/call by ~100ms
* fix(proxy_server.py): minor fix on existing_settings not including alerting
* fix(exception_mapping_utils.py): map databricks exception string
* fix(auth_checks.py): fix auth check logic
* test: correctly mark flaky test
* fix(utils.py): handle auth token error for tokenizers.from_pretrained
* build: fix map
* build: fix map
* build: fix json for model map
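The tokenizers.from_pretrained bullet above implies a guard along these lines; the helper name and the tiktoken fallback are assumptions for illustration, not the exact LiteLLM code:

```python
from tokenizers import Tokenizer
import tiktoken

def load_tokenizer(model_name: str):
    # Gated or private Hugging Face repos raise when no auth token is available;
    # fall back to a generic encoding instead of failing the request.
    # Note: the two return types expose different interfaces; a caller in a
    # real integration would need to handle both.
    try:
        return Tokenizer.from_pretrained(model_name)
    except Exception:
        return tiktoken.get_encoding("cl100k_base")
```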
* fix ImageObject conversion (#6584)
* (fix) litellm.text_completion raises a non-blocking error on simple usage (#6546)
* unit test test_huggingface_text_completion_logprobs
* fix return TextCompletionHandler convert_chat_to_text_completion
* fix hf rest api
* fix test_huggingface_text_completion_logprobs
* fix linting errors
* fix import of LiteLLMResponseObjectHandler
* fix test for LiteLLMResponseObjectHandler
* fix test text completion
* fix: allow up to 15 seconds for the premium license check
* testing fix bedrock deprecated cohere.command-text-v14
* (feat) add `Predicted Outputs` for OpenAI (#6594)
* bump openai to openai==1.54.0
* add 'prediction' param
* testing fix bedrock deprecated cohere.command-text-v14
* test test_openai_prediction_param.py
* test_openai_prediction_param_with_caching
* doc Predicted Outputs
* doc Predicted Output
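A hedged sketch of passing the new `prediction` parameter through litellm.completion; the model name is a placeholder and the parameter shape follows OpenAI's Predicted Outputs API:

```python
import litellm

code = "def add(a, b):\n    return a + b\n"

response = litellm.completion(
    model="gpt-4o-mini",  # placeholder; any OpenAI model that supports predictions
    messages=[{"role": "user", "content": "Add type hints to this function:\n" + code}],
    prediction={"type": "content", "content": code},  # predicted output to speed up generation
)
print(response.choices[0].message.content)
```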
* (fix) Vertex Improve Performance when using `image_url` (#6593)
* fix transformation vertex
* test test_process_gemini_image
* test_image_completion_request
* testing fix - bedrock has deprecated cohere.command-text-v14
* fix vertex pdf
* bump: version 1.51.5 → 1.52.0
* fix(lowest_tpm_rpm_routing.py): fix parallel rate limit check (#6577)
* fix(lowest_tpm_rpm_routing.py): fix parallel rate limit check
* fix(lowest_tpm_rpm_v2.py): return headers in correct format
* test: update test
* test: remove eol model
* fix(proxy_server.py): fix db config loading logic
* fix(proxy_server.py): fix order of config / db updates, to ensure fields not overwritten
* test: skip test if required env var is missing
* test: fix test
---------
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: Ishaan Jaff <ishaanjaffer0324@gmail.com>
Co-authored-by: paul-gauthier <69695708+paul-gauthier@users.noreply.github.com>
* test: mark flaky test
* test: handle anthropic api instability
* test(test_proxy_utils.py): add testing for db config update logic
* Update setuptools in docker and fastapi to latest version, in order to upgrade starlette version (#6597)
* Update setuptools in docker and fastapi to latest version, in order to upgrade starlette version
---------
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: Ishaan Jaff <ishaanjaffer0324@gmail.com>
Co-authored-by: paul-gauthier <69695708+paul-gauthier@users.noreply.github.com>
Co-authored-by: Krish Dholakia <krrishdholakia@gmail.com>
Co-authored-by: Jacob Hagstedt <wcgs@novonordisk.com>
* fix(langfuse.py): fix linting errors
* fix: fix linting errors
* fix: fix casting error
* fix: fix typing error
* fix: add more tests
* fix(utils.py): fix return_processed_chunk_logic
* Revert "Update setuptools in docker and fastapi to latest verison, in order t…" (#6615)
This reverts commit 1a7f7bdfb7
.
* docs: clarify team_id for team-based logging
* docs: fix team-based logging with Langfuse
* fix flake8 checks
* test: bump sleep time
* refactor: replace claude-instant-1.2 with haiku in testing
* fix(proxy_server.py): move to using sl payload in track_cost_callback
* fix(proxy_server.py): fix linting errors
* fix(proxy_server.py): fallback to kwargs(response_cost) if given
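A sketch of reading the precomputed cost in a custom success callback, assuming LiteLLM's documented callback signature; the fallback to completion_cost is illustrative, not the proxy's exact track_cost_callback logic:

```python
import litellm

def track_cost(kwargs, completion_response, start_time, end_time):
    # Prefer the cost LiteLLM already attached to kwargs; fall back to
    # recomputing it from the response if it is missing.
    cost = kwargs.get("response_cost")
    if cost is None:
        cost = litellm.completion_cost(completion_response=completion_response)
    if cost is not None:
        print(f"request cost: ${cost:.6f}")

# Any subsequent litellm.completion(...) call will invoke track_cost.
litellm.success_callback = [track_cost]
```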
* test: remove claude-instant-1 from tests
* test: fix claude test
* build: remove lint.yml
---------
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: Vsevolod Karvetskiy <56288164+karvetskiy@users.noreply.github.com>
Co-authored-by: seva <seva@inita.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: Ishaan Jaff <ishaanjaffer0324@gmail.com>
Co-authored-by: paul-gauthier <69695708+paul-gauthier@users.noreply.github.com>
Co-authored-by: Jacob Hagstedt P Suorra <Jacobh2@users.noreply.github.com>
Co-authored-by: Jacob Hagstedt <wcgs@novonordisk.com>
parent 66c1ee09cf
commit 136693cac4
32 changed files with 634 additions and 533 deletions
litellm/utils.py (419 lines changed)
@@ -114,6 +114,7 @@ from litellm.types.utils import (
    Message,
    ModelInfo,
    ModelResponse,
    ModelResponseStream,
    ProviderField,
    StreamingChoices,
    TextChoices,
@@ -5642,6 +5643,9 @@ class CustomStreamWrapper:
        )
        self.messages = getattr(logging_obj, "messages", None)
        self.sent_stream_usage = False
        self.send_stream_usage = (
            True if self.check_send_stream_usage(self.stream_options) else False
        )
        self.tool_call = False
        self.chunks: List = (
            []
@@ -5654,6 +5658,12 @@ class CustomStreamWrapper:
    def __aiter__(self):
        return self

    def check_send_stream_usage(self, stream_options: Optional[dict]):
        return (
            stream_options is not None
            and stream_options.get("include_usage", False) is True
        )

    def check_is_function_call(self, logging_obj) -> bool:
        if hasattr(logging_obj, "optional_params") and isinstance(
            logging_obj.optional_params, dict
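check_send_stream_usage only opts a stream into emitting a final usage chunk when the caller asks for it; a minimal consumer-side sketch (placeholder model name, requires the matching API key):

```python
import litellm

stream = litellm.completion(
    model="gpt-4o-mini",  # placeholder
    messages=[{"role": "user", "content": "hi"}],
    stream=True,
    stream_options={"include_usage": True},  # what check_send_stream_usage looks for
)
for chunk in stream:
    usage = getattr(chunk, "usage", None)
    if usage is not None:
        print(usage)  # final chunk carries prompt/completion token counts
```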
@@ -6506,9 +6516,148 @@ class CustomStreamWrapper:
                is_empty = False
        return is_empty

    def return_processed_chunk_logic(  # noqa
        self,
        completion_obj: dict,
        model_response: ModelResponseStream,
        response_obj: dict,
    ):

        print_verbose(
            f"completion_obj: {completion_obj}, model_response.choices[0]: {model_response.choices[0]}, response_obj: {response_obj}"
        )
        if (
            "content" in completion_obj
            and (
                isinstance(completion_obj["content"], str)
                and len(completion_obj["content"]) > 0
            )
            or (
                "tool_calls" in completion_obj
                and completion_obj["tool_calls"] is not None
                and len(completion_obj["tool_calls"]) > 0
            )
            or (
                "function_call" in completion_obj
                and completion_obj["function_call"] is not None
            )
        ):  # cannot set content of an OpenAI Object to be an empty string
            self.safety_checker()
            hold, model_response_str = self.check_special_tokens(
                chunk=completion_obj["content"],
                finish_reason=model_response.choices[0].finish_reason,
            )  # filter out bos/eos tokens from openai-compatible hf endpoints
            print_verbose(f"hold - {hold}, model_response_str - {model_response_str}")
            if hold is False:
                ## check if openai/azure chunk
                original_chunk = response_obj.get("original_chunk", None)
                if original_chunk:
                    model_response.id = original_chunk.id
                    self.response_id = original_chunk.id
                    if len(original_chunk.choices) > 0:
                        choices = []
                        for choice in original_chunk.choices:
                            try:
                                if isinstance(choice, BaseModel):
                                    choice_json = choice.model_dump()
                                    choice_json.pop(
                                        "finish_reason", None
                                    )  # for mistral etc. which return a value in their last chunk (not-openai compatible).
                                    print_verbose(f"choice_json: {choice_json}")
                                    choices.append(StreamingChoices(**choice_json))
                            except Exception:
                                choices.append(StreamingChoices())
                        print_verbose(f"choices in streaming: {choices}")
                        setattr(model_response, "choices", choices)
                    else:
                        return
                    model_response.system_fingerprint = (
                        original_chunk.system_fingerprint
                    )
                    setattr(
                        model_response,
                        "citations",
                        getattr(original_chunk, "citations", None),
                    )
                    print_verbose(f"self.sent_first_chunk: {self.sent_first_chunk}")
                    if self.sent_first_chunk is False:
                        model_response.choices[0].delta["role"] = "assistant"
                        self.sent_first_chunk = True
                    elif self.sent_first_chunk is True and hasattr(
                        model_response.choices[0].delta, "role"
                    ):
                        _initial_delta = model_response.choices[0].delta.model_dump()
                        _initial_delta.pop("role", None)
                        model_response.choices[0].delta = Delta(**_initial_delta)
                    print_verbose(
                        f"model_response.choices[0].delta: {model_response.choices[0].delta}"
                    )
                else:
                    ## else
                    completion_obj["content"] = model_response_str
                    if self.sent_first_chunk is False:
                        completion_obj["role"] = "assistant"
                        self.sent_first_chunk = True

                    model_response.choices[0].delta = Delta(**completion_obj)
                    _index: Optional[int] = completion_obj.get("index")
                    if _index is not None:
                        model_response.choices[0].index = _index
                print_verbose(f"returning model_response: {model_response}")
                return model_response
            else:
                return
        elif self.received_finish_reason is not None:
            if self.sent_last_chunk is True:
                # Bedrock returns the guardrail trace in the last chunk - we want to return this here
                if self.custom_llm_provider == "bedrock" and "trace" in model_response:
                    return model_response

                # Default - return StopIteration
                raise StopIteration
            # flush any remaining holding chunk
            if len(self.holding_chunk) > 0:
                if model_response.choices[0].delta.content is None:
                    model_response.choices[0].delta.content = self.holding_chunk
                else:
                    model_response.choices[0].delta.content = (
                        self.holding_chunk + model_response.choices[0].delta.content
                    )
                self.holding_chunk = ""
            # if delta is None
            _is_delta_empty = self.is_delta_empty(delta=model_response.choices[0].delta)

            if _is_delta_empty:
                # get any function call arguments
                model_response.choices[0].finish_reason = map_finish_reason(
                    finish_reason=self.received_finish_reason
                )  # ensure consistent output to openai

                self.sent_last_chunk = True

            return model_response
        elif (
            model_response.choices[0].delta.tool_calls is not None
            or model_response.choices[0].delta.function_call is not None
        ):
            if self.sent_first_chunk is False:
                model_response.choices[0].delta["role"] = "assistant"
                self.sent_first_chunk = True
            return model_response
        elif (
            len(model_response.choices) > 0
            and hasattr(model_response.choices[0].delta, "audio")
            and model_response.choices[0].delta.audio is not None
        ):
            return model_response
        else:
            if hasattr(model_response, "usage"):
                self.chunks.append(model_response)
            return

    def chunk_creator(self, chunk):  # type: ignore  # noqa: PLR0915
        model_response = self.model_response_creator()
        response_obj = {}
        response_obj: dict = {}
        try:
            # return this for all models
            completion_obj = {"content": ""}
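return_processed_chunk_logic decides which processed chunks reach the caller; from the consumer side each yielded chunk exposes an OpenAI-style delta, as in this sketch (placeholder model name, requires the matching API key):

```python
import litellm

text = ""
for chunk in litellm.completion(
    model="gpt-4o-mini",  # placeholder
    messages=[{"role": "user", "content": "Write one sentence about streams."}],
    stream=True,
):
    # content may be None on role, tool-call, or finish-reason chunks
    if chunk.choices and chunk.choices[0].delta.content:
        text += chunk.choices[0].delta.content
print(text)
```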
@@ -6559,6 +6708,7 @@ class CustomStreamWrapper:
                        "provider_specific_fields"
                    ].items():
                        setattr(model_response, key, value)

                response_obj = anthropic_response_obj
            elif (
                self.custom_llm_provider

@@ -6626,7 +6776,7 @@ class CustomStreamWrapper:
                    if self.sent_first_chunk is False:
                        raise Exception("An unknown error occurred with the stream")
                    self.received_finish_reason = "stop"
            elif self.custom_llm_provider and (self.custom_llm_provider == "vertex_ai"):
            elif self.custom_llm_provider == "vertex_ai":
                import proto  # type: ignore

                if self.model.startswith("claude-3"):
@@ -7009,145 +7159,12 @@ class CustomStreamWrapper:
                    self.tool_call = True

            ## RETURN ARG
            if (
                "content" in completion_obj
                and (
                    isinstance(completion_obj["content"], str)
                    and len(completion_obj["content"]) > 0
                )
                or (
                    "tool_calls" in completion_obj
                    and completion_obj["tool_calls"] is not None
                    and len(completion_obj["tool_calls"]) > 0
                )
                or (
                    "function_call" in completion_obj
                    and completion_obj["function_call"] is not None
                )
            ):  # cannot set content of an OpenAI Object to be an empty string
                self.safety_checker()
                hold, model_response_str = self.check_special_tokens(
                    chunk=completion_obj["content"],
                    finish_reason=model_response.choices[0].finish_reason,
                )  # filter out bos/eos tokens from openai-compatible hf endpoints
                print_verbose(
                    f"hold - {hold}, model_response_str - {model_response_str}"
                )
                if hold is False:
                    ## check if openai/azure chunk
                    original_chunk = response_obj.get("original_chunk", None)
                    if original_chunk:
                        model_response.id = original_chunk.id
                        self.response_id = original_chunk.id
                        if len(original_chunk.choices) > 0:
                            choices = []
                            for idx, choice in enumerate(original_chunk.choices):
                                try:
                                    if isinstance(choice, BaseModel):
                                        try:
                                            choice_json = choice.model_dump()
                                        except Exception:
                                            choice_json = choice.dict()
                                        choice_json.pop(
                                            "finish_reason", None
                                        )  # for mistral etc. which return a value in their last chunk (not-openai compatible).
                                        print_verbose(f"choice_json: {choice_json}")
                                        choices.append(StreamingChoices(**choice_json))
                                except Exception:
                                    choices.append(StreamingChoices())
                            print_verbose(f"choices in streaming: {choices}")
                            model_response.choices = choices
                        else:
                            return
                        model_response.system_fingerprint = (
                            original_chunk.system_fingerprint
                        )
                        model_response.citations = getattr(
                            original_chunk, "citations", None
                        )
                        print_verbose(f"self.sent_first_chunk: {self.sent_first_chunk}")
                        if self.sent_first_chunk is False:
                            model_response.choices[0].delta["role"] = "assistant"
                            self.sent_first_chunk = True
                        elif self.sent_first_chunk is True and hasattr(
                            model_response.choices[0].delta, "role"
                        ):
                            _initial_delta = model_response.choices[
                                0
                            ].delta.model_dump()
                            _initial_delta.pop("role", None)
                            model_response.choices[0].delta = Delta(**_initial_delta)
                        print_verbose(
                            f"model_response.choices[0].delta: {model_response.choices[0].delta}"
                        )
                    else:
                        ## else
                        completion_obj["content"] = model_response_str
                        if self.sent_first_chunk is False:
                            completion_obj["role"] = "assistant"
                            self.sent_first_chunk = True
            return self.return_processed_chunk_logic(
                completion_obj=completion_obj,
                model_response=model_response,  # type: ignore
                response_obj=response_obj,
            )

                        model_response.choices[0].delta = Delta(**completion_obj)
                        if completion_obj.get("index") is not None:
                            model_response.choices[0].index = completion_obj.get(
                                "index"
                            )
                    print_verbose(f"returning model_response: {model_response}")
                    return model_response
                else:
                    return
            elif self.received_finish_reason is not None:
                if self.sent_last_chunk is True:
                    # Bedrock returns the guardrail trace in the last chunk - we want to return this here
                    if (
                        self.custom_llm_provider == "bedrock"
                        and "trace" in model_response
                    ):
                        return model_response

                    # Default - return StopIteration
                    raise StopIteration
                # flush any remaining holding chunk
                if len(self.holding_chunk) > 0:
                    if model_response.choices[0].delta.content is None:
                        model_response.choices[0].delta.content = self.holding_chunk
                    else:
                        model_response.choices[0].delta.content = (
                            self.holding_chunk + model_response.choices[0].delta.content
                        )
                    self.holding_chunk = ""
                # if delta is None
                _is_delta_empty = self.is_delta_empty(
                    delta=model_response.choices[0].delta
                )

                if _is_delta_empty:
                    # get any function call arguments
                    model_response.choices[0].finish_reason = map_finish_reason(
                        finish_reason=self.received_finish_reason
                    )  # ensure consistent output to openai

                    self.sent_last_chunk = True

                return model_response
            elif (
                model_response.choices[0].delta.tool_calls is not None
                or model_response.choices[0].delta.function_call is not None
            ):
                if self.sent_first_chunk is False:
                    model_response.choices[0].delta["role"] = "assistant"
                    self.sent_first_chunk = True
                return model_response
            elif (
                len(model_response.choices) > 0
                and hasattr(model_response.choices[0].delta, "audio")
                and model_response.choices[0].delta.audio is not None
            ):
                return model_response
            else:
                if hasattr(model_response, "usage"):
                    self.chunks.append(model_response)
                return
        except StopIteration:
            raise StopIteration
        except Exception as e:

@@ -7293,27 +7310,24 @@ class CustomStreamWrapper:
        except StopIteration:
            if self.sent_last_chunk is True:
                if (
                    self.sent_stream_usage is False
                    and self.stream_options is not None
                    and self.stream_options.get("include_usage", False) is True
                ):
                    # send the final chunk with stream options
                    complete_streaming_response = litellm.stream_chunk_builder(
                        chunks=self.chunks, messages=self.messages
                complete_streaming_response = litellm.stream_chunk_builder(
                    chunks=self.chunks, messages=self.messages
                )
                    response = self.model_response_creator()
                    if complete_streaming_response is not None:
                        setattr(
                            response,
                            "usage",
                            getattr(complete_streaming_response, "usage"),
                        )
                response = self.model_response_creator()
                if complete_streaming_response is not None:
                    setattr(
                        response,
                        "usage",
                        getattr(complete_streaming_response, "usage"),
                    )
                    ## LOGGING
                    threading.Thread(
                        target=self.logging_obj.success_handler,
                        args=(response, None, None, cache_hit),
                    ).start()  # log response

                ## LOGGING
                threading.Thread(
                    target=self.logging_obj.success_handler,
                    args=(response, None, None, cache_hit),
                ).start()  # log response

                if self.sent_stream_usage is False and self.send_stream_usage is True:
                    self.sent_stream_usage = True
                    return response
                raise  # Re-raise StopIteration

@@ -7401,7 +7415,6 @@ class CustomStreamWrapper:
                or self.custom_llm_provider in litellm._custom_providers
            ):
                async for chunk in self.completion_stream:
                    print_verbose(f"value of async chunk: {chunk}")
                    if chunk == "None" or chunk is None:
                        raise Exception
                    elif (

@@ -7431,10 +7444,7 @@ class CustomStreamWrapper:
                                end_time=None,
                                cache_hit=cache_hit,
                            )
                            # threading.Thread(
                            #     target=self.logging_obj.success_handler,
                            #     args=(processed_chunk, None, None, cache_hit),
                            # ).start()  # log response

                            asyncio.create_task(
                                self.logging_obj.async_success_handler(
                                    processed_chunk, cache_hit=cache_hit

@@ -7515,82 +7525,33 @@ class CustomStreamWrapper:
                # RETURN RESULT
                self.chunks.append(processed_chunk)
                return processed_chunk
        except StopAsyncIteration:
        except (StopAsyncIteration, StopIteration):
            if self.sent_last_chunk is True:
                if (
                    self.sent_stream_usage is False
                    and self.stream_options is not None
                    and self.stream_options.get("include_usage", False) is True
                ):
                    # send the final chunk with stream options
                    complete_streaming_response = litellm.stream_chunk_builder(
                        chunks=self.chunks, messages=self.messages
                # log the final chunk with accurate streaming values
                complete_streaming_response = litellm.stream_chunk_builder(
                    chunks=self.chunks, messages=self.messages
                )
                    response = self.model_response_creator()
                    if complete_streaming_response is not None:
                        setattr(
                            response,
                            "usage",
                            getattr(complete_streaming_response, "usage"),
                        )
                response = self.model_response_creator()
                if complete_streaming_response is not None:
                    setattr(
                        response,
                        "usage",
                        getattr(complete_streaming_response, "usage"),
                    )
                    ## LOGGING
                    threading.Thread(
                        target=self.logging_obj.success_handler,
                        args=(response, None, None, cache_hit),
                    ).start()  # log response
                    asyncio.create_task(
                        self.logging_obj.async_success_handler(
                            response, cache_hit=cache_hit
                        )
                    )
                    self.sent_stream_usage = True
                    return response
                raise  # Re-raise StopIteration
            else:
                self.sent_last_chunk = True
                processed_chunk = self.finish_reason_handler()
                ## LOGGING
                threading.Thread(
                    target=self.logging_obj.success_handler,
                    args=(processed_chunk, None, None, cache_hit),
                    args=(response, None, None, cache_hit),
                ).start()  # log response
                asyncio.create_task(
                    self.logging_obj.async_success_handler(
                        processed_chunk, cache_hit=cache_hit
                        response, cache_hit=cache_hit
                    )
                )
                return processed_chunk
        except StopIteration:
            if self.sent_last_chunk is True:
                if (
                    self.sent_stream_usage is False
                    and self.stream_options is not None
                    and self.stream_options.get("include_usage", False) is True
                ):
                    # send the final chunk with stream options
                    complete_streaming_response = litellm.stream_chunk_builder(
                        chunks=self.chunks, messages=self.messages
                    )
                    response = self.model_response_creator()
                    if complete_streaming_response is not None:
                        setattr(
                            response,
                            "usage",
                            getattr(complete_streaming_response, "usage"),
                        )
                    ## LOGGING
                    threading.Thread(
                        target=self.logging_obj.success_handler,
                        args=(response, None, None, cache_hit),
                    ).start()  # log response
                    asyncio.create_task(
                        self.logging_obj.async_success_handler(
                            response, cache_hit=cache_hit
                        )
                    )
                if self.sent_stream_usage is False and self.send_stream_usage is True:
                    self.sent_stream_usage = True
                    return response
                raise StopAsyncIteration
                raise StopAsyncIteration  # Re-raise StopIteration
            else:
                self.sent_last_chunk = True
                processed_chunk = self.finish_reason_handler()
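The StopIteration/StopAsyncIteration handlers above rebuild a complete response from the buffered chunks before logging usage; the same helper is callable directly, as in this sketch (placeholder model name, requires the matching API key):

```python
import litellm

messages = [{"role": "user", "content": "Say hi"}]
chunks = list(
    litellm.completion(
        model="gpt-4o-mini",  # placeholder
        messages=messages,
        stream=True,
    )
)
rebuilt = litellm.stream_chunk_builder(chunks=chunks, messages=messages)
print(rebuilt.usage)  # usage reconstructed from the streamed chunks
```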