fix(inference): enable routing of models with provider_data alone (backport #3928) (#4142)

This PR enables routing of fully qualified model IDs of the form `provider_id/model_id` even when the models are not registered with the Stack. Here's the situation: assume a remote inference provider which works only when users provide their own API keys via `X-LlamaStack-Provider-Data` header. By definition, we cannot list models and hence update our routing registry. But because we _require_ a provider ID in the models now, we can identify which provider to route to and let that provider decide. Note that we still try to look up our registry since it may have a pre-registered alias. Just that we don't outright fail when we are not able to look it up. Also, updated inference router so that the responses have the _exact_ model that the request had. ## Test Plan Added an integration test Closes #3929<hr>This is an automatic backport of pull request #3928 done by [Mergify](https://mergify.com). --------- Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com> Co-authored-by: ehhuang <ehhuang@users.noreply.github.com>
2025-12-03 18:00:36 +00:00 · 2025-11-12 13:41:27 -08:00 · 2025-11-12 13:41:27 -08:00 · 641d5144be
commit 641d5144be
parent a6c3a9cadf
6 changed files with 214 additions and 55 deletions
--- a/tests/integration/telemetry/test_completions.py
+++ b/tests/integration/telemetry/test_completions.py
@ -64,10 +64,11 @@ def test_telemetry_format_completeness(mock_otlp_collector, llama_stack_client,

    # Verify spans
    spans = mock_otlp_collector.get_spans()
-    assert len(spans) == 5
+    # Expected spans: 1 root span + 3 autotraced method calls from routing/inference
+    assert len(spans) == 4, f"Expected 4 spans, got {len(spans)}"

-    # we only need this captured one time
-    logged_model_id = None
+    # Collect all model_ids found in spans
+    logged_model_ids = []

    for span in spans:
        attrs = span.attributes
@ -87,10 +88,10 @@ def test_telemetry_format_completeness(mock_otlp_collector, llama_stack_client,

            args = json.loads(attrs["__args__"])
            if "model_id" in args:
-                logged_model_id = args["model_id"]
+                logged_model_ids.append(args["model_id"])

-    assert logged_model_id is not None
-    assert logged_model_id == text_model_id
+    # At least one span should capture the fully qualified model ID
+    assert text_model_id in logged_model_ids, f"Expected to find {text_model_id} in spans, but got {logged_model_ids}"

    # TODO: re-enable this once metrics get fixed
    """