fix: telemetry fixes (inference and core telemetry) (#2733)

# What does this PR do?

I found a few issues while adding new metrics for various APIs.

Currently, metrics are only propagated in `chat_completion` and `completion`. Since most providers use the `openai_..` routes as the default in `llama-stack-client inference chat-completion`, metrics are not working as expected.

To get them working, the following had to be done:

1. Get the completion as usual.
2. Use the new `openai_` versions of the metric-gathering functions, which read `.usage` from the `OpenAI..` response types to gather the metrics that are already populated.
3. Define a `stream_generator` which counts the tokens and computes the metrics (only for stream=True).
4. Add the metrics to the response.

NOTE: I could not add metrics to `openai_completion` where stream=True because that ONLY returns an `OpenAICompletion`, not an AsyncGenerator that we can manipulate.

Acquire the lock and add the event to the span, as the other `_log_...` methods do.

Some new output:

`llama-stack-client inference chat-completion --message hi`

<img width="2416" height="425" alt="Screenshot 2025-07-16 at 8 28 20 AM" src="https://github.com/user-attachments/assets/ccdf1643-a184-4ddd-9641-d426c4d51326" />

and in the client:

<img width="763" height="319" alt="Screenshot 2025-07-16 at 8 28 32 AM" src="https://github.com/user-attachments/assets/6bceb811-5201-47e9-9e16-8130f0d60007" />

These were not previously being recorded, nor were they being printed to the server, due to improper console sink handling.

---------

Signed-off-by: Charlie Doern <cdoern@redhat.com>
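
The streaming step in the description (wrap the stream, count tokens, attach metrics) can be sketched roughly as below. This is an illustration only, not the code in this PR: `Chunk`, `count_tokens`, `fake_stream`, and `collect` are hypothetical names, the whitespace tokenizer is a placeholder for a real one, and emitting the metrics on an extra trailing chunk is an assumption made to keep the example short.

```python
import asyncio
from dataclasses import dataclass, field
from typing import AsyncIterator


@dataclass
class Chunk:
    """Hypothetical stand-in for one streamed chat-completion chunk."""
    text: str
    metrics: dict = field(default_factory=dict)


def count_tokens(text: str) -> int:
    """Placeholder tokenizer (whitespace split); a real one would be model-specific."""
    return len(text.split())


async def stream_generator(
    stream: AsyncIterator[Chunk], prompt_tokens: int
) -> AsyncIterator[Chunk]:
    """Pass chunks through unchanged while counting completion tokens."""
    completion_tokens = 0
    async for chunk in stream:
        completion_tokens += count_tokens(chunk.text)
        yield chunk
    # Assumption: metrics ride on one extra, empty trailing chunk.
    yield Chunk(
        text="",
        metrics={
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    )


async def fake_stream() -> AsyncIterator[Chunk]:
    """Hypothetical provider stream used only to exercise the wrapper."""
    for piece in ["hello there", "general", "kenobi !"]:
        yield Chunk(text=piece)


async def collect(prompt_tokens: int) -> list:
    """Drain the wrapped stream into a list for inspection."""
    return [c async for c in stream_generator(fake_stream(), prompt_tokens)]


if __name__ == "__main__":
    for c in asyncio.run(collect(prompt_tokens=3)):
        print(c)
```

Wrapping the generator keeps the provider call untouched, which is also why the same approach cannot be applied where the call returns a single response object instead of an async generator.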
2025-12-10 03:30:58 +00:00 · 2025-08-06 16:37:40 -04:00 · 2025-08-06 16:37:40 -04:00 · 0caef40e0d
commit 0caef40e0d
parent c252dfa3ef
26 changed files with 1595 additions and 246 deletions
--- a/tests/integration/recordings/responses/d0ac68cbde69.json
+++ b/tests/integration/recordings/responses/d0ac68cbde69.json
@@ -13,12 +13,12 @@
      "__data__": {
        "models": [
          {
-            "model": "llama3.2:3b-instruct-fp16",
-            "name": "llama3.2:3b-instruct-fp16",
-            "digest": "195a8c01d91ec3cb1e0aad4624a51f2602c51fa7d96110f8ab5a20c84081804d",
-            "expires_at": "2025-08-05T14:12:18.480323-07:00",
-            "size": 7919570944,
-            "size_vram": 7919570944,
+            "model": "llama3.2:3b",
+            "name": "llama3.2:3b",
+            "digest": "a80c4f17acd55265feec403c7aef86be0c25983ab279d83f3bcd3abbcb5b8b72",
+            "expires_at": "2025-08-06T15:57:21.573326-04:00",
+            "size": 4030033920,
+            "size_vram": 4030033920,
            "details": {
              "parent_model": "",
              "format": "gguf",
@@ -27,25 +27,7 @@
                "llama"
              ],
              "parameter_size": "3.2B",
-              "quantization_level": "F16"
-            }
-          },
-          {
-            "model": "all-minilm:l6-v2",
-            "name": "all-minilm:l6-v2",
-            "digest": "1b226e2802dbb772b5fc32a58f103ca1804ef7501331012de126ab22f67475ef",
-            "expires_at": "2025-08-05T14:10:20.883978-07:00",
-            "size": 590204928,
-            "size_vram": 590204928,
-            "details": {
-              "parent_model": "",
-              "format": "gguf",
-              "family": "bert",
-              "families": [
-                "bert"
-              ],
-              "parameter_size": "23M",
-              "quantization_level": "F16"
+              "quantization_level": "Q4_K_M"
            }
          }
        ]