mirror of
https://github.com/meta-llama/llama-stack.git
synced 2025-10-24 08:47:26 +00:00
# What does this PR do? I found a few issues while adding new metrics for various APIs: currently metrics are only propagated in `chat_completion` and `completion` since most providers use the `openai_..` routes as the default in `llama-stack-client inference chat-completion`, metrics are currently not working as expected. in order to get them working the following had to be done: 1. get the completion as usual 2. use new `openai_` versions of the metric gathering functions which use `.usage` from the `OpenAI..` response types to gather the metrics which are already populated. 3. define a `stream_generator` which counts the tokens and computes the metrics (only for stream=True) 5. add metrics to response NOTE: I could not add metrics to `openai_completion` where stream=True because that ONLY returns an `OpenAICompletion` not an AsyncGenerator that we can manipulate. acquire the lock, and add event to the span as the other `_log_...` methods do some new output: `llama-stack-client inference chat-completion --message hi` <img width="2416" height="425" alt="Screenshot 2025-07-16 at 8 28 20 AM" src="https://github.com/user-attachments/assets/ccdf1643-a184-4ddd-9641-d426c4d51326" /> and in the client: <img width="763" height="319" alt="Screenshot 2025-07-16 at 8 28 32 AM" src="https://github.com/user-attachments/assets/6bceb811-5201-47e9-9e16-8130f0d60007" /> these were not previously being recorded nor were they being printed to the server due to the improper console sink handling --------- Signed-off-by: Charlie Doern <cdoern@redhat.com>
56 lines
3.9 KiB
JSON
56 lines
3.9 KiB
JSON
{
|
|
"request": {
|
|
"method": "POST",
|
|
"url": "http://localhost:11434/v1/v1/chat/completions",
|
|
"headers": {},
|
|
"body": {
|
|
"model": "llama3.2:3b-instruct-fp16",
|
|
"messages": [
|
|
{
|
|
"role": "user",
|
|
"content": "Test trace openai 0"
|
|
}
|
|
],
|
|
"stream": false
|
|
},
|
|
"endpoint": "/v1/chat/completions",
|
|
"model": "llama3.2:3b-instruct-fp16"
|
|
},
|
|
"response": {
|
|
"body": {
|
|
"__type__": "openai.types.chat.chat_completion.ChatCompletion",
|
|
"__data__": {
|
|
"id": "chatcmpl-876",
|
|
"choices": [
|
|
{
|
|
"finish_reason": "stop",
|
|
"index": 0,
|
|
"logprobs": null,
|
|
"message": {
|
|
"content": "I'm afraid I don't have a built-in ability to directly interface with or \"test\" OpenAI models, including the original GPT-1 model. However, I can explain how you might approach this task:\n\nThe OpenAI GPT-1 is a large transformer-based language model that was trained on a massive dataset of text and achieved state-of-the-art results in various natural language processing tasks.\n\nTo test or evaluate the performance of a model like GPT-1, you would typically follow these steps:\n\n1. **Get access to the OpenAI API**: The OpenAI API provides a way for developers to interact with the GPT-1 model programmatically. You can sign up for an API key on the OpenAI website.\n2. **Choose a testing platform or environment**: You'll need a compute platform that supports the necessary algorithms and data structures to run inference on the GPT-1 model. Some popular options include AWS, Google Cloud, or Azure Compute Virtual Machines.\n3. **Prepare your test input data**: This will involve creating text inputs in the format expected by the OpenAI API (i.e., a JSON object containing the text to be processed).\n4. **Use the OpenAI Python library or SDK**: The OpenAI Python library provides an easy-to-use interface for interacting with the GPT-1 model through the API.\n\nHere's some example code that demonstrates how you might use the OpenAI Flask API to test a single input:\n\n```python\nfrom flask import Flask, request, jsonify\nimport json\n\napp = Flask(__name__)\n\n@ app . route ( '/ /gpt-en ', ' Text ', methods = ['POST'])\ndef gpt_en () -> Json :\n data = request . get_json ()\n if not data or \"message\" in ( data ):\n return None , 400 , { ' error' : \"Input must be a text string.\" }\n response = []\n while True:\n message = \"\"\n for token in data [\"input\"]:\n response_text = f\"{data['prompt']} {token}\"\n data[\"input\"] = [response_text]\n new_response = gpt_en()(data)\n if all([not item or not isinstance(item, dict) for item in new_response]):\n break\n\n message = json . dumps ({}\"text\": response_text})\n response.append(message)\n\n return jsonify ({\"output\": response}), 200 , {}\n\nif __name__ == \"__main__\":\n app.run(debug=True)\n```\n\n5. **Evaluate the output**: Once you have processed your test input data using the GPT-1 model, you can evaluate the accuracy of the generated responses.\n\nKeep in mind that this is just a basic example to illustrate how you might approach testing the OpenAI GPT-1 model.",
|
|
"refusal": null,
|
|
"role": "assistant",
|
|
"annotations": null,
|
|
"audio": null,
|
|
"function_call": null,
|
|
"tool_calls": null
|
|
}
|
|
}
|
|
],
|
|
"created": 1754510050,
|
|
"model": "llama3.2:3b-instruct-fp16",
|
|
"object": "chat.completion",
|
|
"service_tier": null,
|
|
"system_fingerprint": "fp_ollama",
|
|
"usage": {
|
|
"completion_tokens": 567,
|
|
"prompt_tokens": 31,
|
|
"total_tokens": 598,
|
|
"completion_tokens_details": null,
|
|
"prompt_tokens_details": null
|
|
}
|
|
}
|
|
},
|
|
"is_streaming": false
|
|
}
|
|
}
|