fix(inference): enable routing of models with provider_data alone (backport #3928) (#4142)

This PR enables routing of fully qualified model IDs of the form
`provider_id/model_id` even when the models are not registered with the
Stack.

Here's the situation: assume a remote inference provider which works
only when users provide their own API keys via
`X-LlamaStack-Provider-Data` header. By definition, we cannot list
models and hence update our routing registry. But because we _require_ a
provider ID in the models now, we can identify which provider to route
to and let that provider decide.

Note that we still try to look up our registry since it may have a
pre-registered alias. Just that we don't outright fail when we are not
able to look it up.

Also, updated inference router so that the responses have the _exact_
model that the request had.

## Test Plan

Added an integration test

Closes #3929<hr>This is an automatic backport of pull request #3928 done
by [Mergify](https://mergify.com).

---------

Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
Co-authored-by: ehhuang <ehhuang@users.noreply.github.com>
This commit is contained in:
mergify[bot] 2025-11-12 13:41:27 -08:00 committed by GitHub
parent a6c3a9cadf
commit 641d5144be
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
6 changed files with 214 additions and 55 deletions

View file

@ -64,10 +64,11 @@ def test_telemetry_format_completeness(mock_otlp_collector, llama_stack_client,
# Verify spans
spans = mock_otlp_collector.get_spans()
assert len(spans) == 5
# Expected spans: 1 root span + 3 autotraced method calls from routing/inference
assert len(spans) == 4, f"Expected 4 spans, got {len(spans)}"
# we only need this captured one time
logged_model_id = None
# Collect all model_ids found in spans
logged_model_ids = []
for span in spans:
attrs = span.attributes
@ -87,10 +88,10 @@ def test_telemetry_format_completeness(mock_otlp_collector, llama_stack_client,
args = json.loads(attrs["__args__"])
if "model_id" in args:
logged_model_id = args["model_id"]
logged_model_ids.append(args["model_id"])
assert logged_model_id is not None
assert logged_model_id == text_model_id
# At least one span should capture the fully qualified model ID
assert text_model_id in logged_model_ids, f"Expected to find {text_model_id} in spans, but got {logged_model_ids}"
# TODO: re-enable this once metrics get fixed
"""