fix(inference): enable routing of models with provider_data alone (#3928)

This PR enables routing of fully qualified model IDs of the form
`provider_id/model_id` even when the models are not registered with the
Stack.

Here's the situation: assume a remote inference provider which works
only when users provide their own API keys via
`X-LlamaStack-Provider-Data` header. By definition, we cannot list
models and hence update our routing registry. But because we _require_ a
provider ID in the models now, we can identify which provider to route
to and let that provider decide.

Note that we still try to look up our registry since it may have a
pre-registered alias. Just that we don't outright fail when we are not
able to look it up.

Also, updated inference router so that the responses have the _exact_
model that the request had.

## Test Plan

Added an integration test

Closes #3929

---------

Co-authored-by: ehhuang <ehhuang@users.noreply.github.com>
This commit is contained in:
Ashwin Bharambe 2025-10-28 11:16:37 -07:00 committed by GitHub
parent 94b0592240
commit f88416ef87
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
6 changed files with 216 additions and 63 deletions

View file

@ -64,10 +64,11 @@ def test_telemetry_format_completeness(mock_otlp_collector, llama_stack_client,
# Verify spans
spans = mock_otlp_collector.get_spans()
assert len(spans) == 5
# Expected spans: 1 root span + 3 autotraced method calls from routing/inference
assert len(spans) == 4, f"Expected 4 spans, got {len(spans)}"
# we only need this captured one time
logged_model_id = None
# Collect all model_ids found in spans
logged_model_ids = []
for span in spans:
attrs = span.attributes
@ -87,10 +88,10 @@ def test_telemetry_format_completeness(mock_otlp_collector, llama_stack_client,
args = json.loads(attrs["__args__"])
if "model_id" in args:
logged_model_id = args["model_id"]
logged_model_ids.append(args["model_id"])
assert logged_model_id is not None
assert logged_model_id == text_model_id
# At least one span should capture the fully qualified model ID
assert text_model_id in logged_model_ids, f"Expected to find {text_model_id} in spans, but got {logged_model_ids}"
# TODO: re-enable this once metrics get fixed
"""