mirror of
https://github.com/meta-llama/llama-stack.git
synced 2025-08-15 14:08:00 +00:00
# What does this PR do? I found a few issues while adding new metrics for various APIs: currently metrics are only propagated in `chat_completion` and `completion` since most providers use the `openai_..` routes as the default in `llama-stack-client inference chat-completion`, metrics are currently not working as expected. in order to get them working the following had to be done: 1. get the completion as usual 2. use new `openai_` versions of the metric gathering functions which use `.usage` from the `OpenAI..` response types to gather the metrics which are already populated. 3. define a `stream_generator` which counts the tokens and computes the metrics (only for stream=True) 5. add metrics to response NOTE: I could not add metrics to `openai_completion` where stream=True because that ONLY returns an `OpenAICompletion` not an AsyncGenerator that we can manipulate. acquire the lock, and add event to the span as the other `_log_...` methods do some new output: `llama-stack-client inference chat-completion --message hi` <img width="2416" height="425" alt="Screenshot 2025-07-16 at 8 28 20 AM" src="https://github.com/user-attachments/assets/ccdf1643-a184-4ddd-9641-d426c4d51326" /> and in the client: <img width="763" height="319" alt="Screenshot 2025-07-16 at 8 28 32 AM" src="https://github.com/user-attachments/assets/6bceb811-5201-47e9-9e16-8130f0d60007" /> these were not previously being recorded nor were they being printed to the server due to the improper console sink handling --------- Signed-off-by: Charlie Doern <cdoern@redhat.com>
56 lines
1.9 KiB
JSON
56 lines
1.9 KiB
JSON
{
|
|
"request": {
|
|
"method": "POST",
|
|
"url": "http://localhost:11434/v1/v1/completions",
|
|
"headers": {},
|
|
"body": {
|
|
"model": "llama3.2:3b-instruct-fp16",
|
|
"messages": [
|
|
{
|
|
"role": "user",
|
|
"content": "Test trace openai 1"
|
|
}
|
|
],
|
|
"stream": false
|
|
},
|
|
"endpoint": "/v1/completions",
|
|
"model": "llama3.2:3b-instruct-fp16"
|
|
},
|
|
"response": {
|
|
"body": {
|
|
"__type__": "openai.types.chat.chat_completion.ChatCompletion",
|
|
"__data__": {
|
|
"id": "chatcmpl-771",
|
|
"choices": [
|
|
{
|
|
"finish_reason": "stop",
|
|
"index": 0,
|
|
"logprobs": null,
|
|
"message": {
|
|
"content": "I'd be happy to test out the ChatGPT model with you, but I need to clarify that I can only simulate a conversation up to a certain extent. The Conversational AI (Chatbots) developed by OpenAI is an advanced version of my programming language model.\n\nAssume I have been trained on a massive dataset and have been fine-tuned for conversational interactions.\n\nWhat would you like to talk about? Would you like me to respond as if we were having a conversation in person, or should I try to engage you in a more abstract discussion?\n\nGo ahead and start the conversation.",
|
|
"refusal": null,
|
|
"role": "assistant",
|
|
"annotations": null,
|
|
"audio": null,
|
|
"function_call": null,
|
|
"tool_calls": null
|
|
}
|
|
}
|
|
],
|
|
"created": 1754051827,
|
|
"model": "llama3.2:3b-instruct-fp16",
|
|
"object": "chat.completion",
|
|
"service_tier": null,
|
|
"system_fingerprint": "fp_ollama",
|
|
"usage": {
|
|
"completion_tokens": 121,
|
|
"prompt_tokens": 31,
|
|
"total_tokens": 152,
|
|
"completion_tokens_details": null,
|
|
"prompt_tokens_details": null
|
|
}
|
|
}
|
|
},
|
|
"is_streaming": false
|
|
}
|
|
}
|