llama-stack-mirror

mirror of https://github.com/meta-llama/llama-stack.git synced 2025-06-27 18:50:41 +00:00

Author	SHA1	Message	Date
Ashwin Bharambe	8bbd52bb9f	chore: remove dependency on llama_models completely (#1344 )	2025-03-01 12:48:08 -08:00
Yuan Tang	18ab1985da	fix: Make remote::vllm compatible with vLLM <= v0.6.3 (#1325 ) # What does this PR do? This is to be consistent with OpenAI API and support vLLM <= v0.6.3 References: * https://platform.openai.com/docs/api-reference/chat/create#chat-create-tool_choice * https://github.com/vllm-project/vllm/pull/10000 This fixes the error when running older versions of vLLM: ``` 00:50:19.834 [START] /v1/inference/chat-completion INFO 2025-02-28 00:50:20,203 httpx:1025: HTTP Request: POST https://api-xeai-granite-3-1-8b-instruct.apps.int.stc.ai.preprod.us-east-1.aws.paas.redhat.com/v1/chat/completions "HTTP/1.1 400 Bad Request" Traceback (most recent call last): File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 235, in endpoint return await maybe_await(value) File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 201, in maybe_await return await value File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py", line 89, in async_wrapper result = await method(self, args, kwargs) File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/routers/routers.py", line 193, in chat_completion return await provider.chat_completion(params) File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py", line 89, in async_wrapper result = await method(self, args, kwargs) File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/remote/inference/vllm/vllm.py", line 286, in chat_completion return await self._nonstream_chat_completion(request, self.client) File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/remote/inference/vllm/vllm.py", line 292, in _nonstream_chat_completion r = client.chat.completions.create(params) File "/usr/local/lib/python3.10/site-packages/openai/_utils/_utils.py", line 279, in wrapper return func(args, *kwargs) File "/usr/local/lib/python3.10/site-packages/openai/resources/chat/completions/completions.py", line 879, in create return self._post( File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1290, in post return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls)) File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 967, in request return self._request( File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1071, in _request raise self._make_status_error_from_response(err.response) from None openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': "[{'type': 'value_error', 'loc': ('body',), 'msg': 'Value error, When using `tool_choice`, `tools` must be set.', 'input': {'messages': [{'role': 'user', 'content': [{'type': 'text', 'text': 'What model are you?'}]}], 'model': 'granite-3-1-8b-instruct', 'max_tokens': 4096, 'stream': False, 'temperature': 0.0, 'tools': None, 'tool_choice': 'auto'}, 'ctx': {'error': ValueError('When using `tool_choice`, `tools` must be set.')}}]", 'type': 'BadRequestError', 'param': None, 'code': 400} INFO: 2600:1700:9d20:ac0::49:59736 - "POST /v1/inference/chat-completion HTTP/1.1" 500 Internal Server Error 00:50:20.266 [END] /v1/inference/chat-completion [StatusCode.OK] (431.99ms) ``` ## Test Plan All existing tests pass. --------- Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>	2025-02-28 12:48:49 -05:00
Ben Browning	c64f0d5888	fix: Get builtin tool calling working in remote-vllm (#1236 ) # What does this PR do? This PR makes a couple of changes required to get the test `tests/client-sdk/agents/test_agents.py::test_builtin_tool_web_search` passing on the remote-vllm provider. First, we adjust agent_instance to also pass in the description and parameters of builtin tools. We need these parameters so we can pass the tool's expected parameters into vLLM. The meta-reference implementations may not have needed these for builtin tools, as they are able to take advantage of the Llama-model specific support for certain builtin tools. However, with vLLM, our server-side chat templates for tool calling treat all tools the same and don't separate out Llama builtin vs custom tools. So, we need to pass the full set of parameter definitions and list of required parameters for builtin tools as well. Next, we adjust the vllm streaming chat completion code to fix up some edge cases where it was returning an extra ChatCompletionResponseEvent with an empty ToolCall with empty string call_id, tool_name, and arguments properties. This is a bug discovered after the above fix, where after a successful tool invocation we were sending extra chunks back to the client with these empty ToolCalls. ## Test Plan With these changes, the following test that previously failed now passes: ``` VLLM_URL="http://localhost:8000/v1" \ INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \ LLAMA_STACK_CONFIG=remote-vllm \ python -m pytest -v \ tests/client-sdk/agents/test_agents.py::test_builtin_tool_web_search \ --inference-model "meta-llama/Llama-3.2-3B-Instruct" ``` Additionally, I ran the remote-vllm client-sdk and provider inference tests as below to ensure they all still passed with this change: ``` VLLM_URL="http://localhost:8000/v1" \ INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \ LLAMA_STACK_CONFIG=remote-vllm \ python -m pytest -v \ tests/client-sdk/inference/test_text_inference.py \ --inference-model "meta-llama/Llama-3.2-3B-Instruct" ``` ``` VLLM_URL="http://localhost:8000/v1" \ python -m pytest -s -v \ llama_stack/providers/tests/inference/test_text_inference.py \ --providers "inference=vllm_remote" ``` [//]: # (## Documentation) Signed-off-by: Ben Browning <bbrownin@redhat.com>	2025-02-26 15:25:47 -05:00
Ashwin Bharambe	81ce39a607	feat(api): Add options for supporting various embedding models (#1192 ) We need to support: - asymmetric embedding models (#934) - truncation policies (#933) - varying dimensional output (#932) ## Test Plan ```bash $ cd llama_stack/providers/tests/inference $ pytest -s -v -k fireworks test_embeddings.py \ --inference-model nomic-ai/nomic-embed-text-v1.5 --env EMBEDDING_DIMENSION=784 $ pytest -s -v -k together test_embeddings.py \ --inference-model togethercomputer/m2-bert-80M-8k-retrieval --env EMBEDDING_DIMENSION=784 $ pytest -s -v -k ollama test_embeddings.py \ --inference-model all-minilm:latest --env EMBEDDING_DIMENSION=784 ```	2025-02-20 22:27:12 -08:00
Ashwin Bharambe	6f9d622340	fix(api): update embeddings signature so inputs and outputs list align (#1161 ) See Issue #922 The change is slightly backwards incompatible but no callsite (in our client codebases or stack-apps) every passes a depth-2 `List[List[InterleavedContentItem]]` (which is now disallowed.) ## Test Plan ```bash $ cd llama_stack/providers/tests/inference $ pytest -s -v -k fireworks test_embeddings.py \ --inference-model nomic-ai/nomic-embed-text-v1.5 --env EMBEDDING_DIMENSION=784 $ pytest -s -v -k together test_embeddings.py \ --inference-model togethercomputer/m2-bert-80M-8k-retrieval --env EMBEDDING_DIMENSION=784 $ pytest -s -v -k ollama test_embeddings.py \ --inference-model all-minilm:latest --env EMBEDDING_DIMENSION=784 ``` Also ran `tests/client-sdk/inference/test_embeddings.py`	2025-02-20 21:43:13 -08:00
Ben Browning	6820718b71	fix: BuiltinTool JSON serialization in remote vLLM provider (#1183 ) # What does this PR do? The `tool_name` attribute of `ToolDefinition` instances can either be a str or a BuiltinTool enum type. This fixes the remote vLLM provider to use the value of those BuiltinTool enums when serializing to JSON instead of attempting to serialize the actual enum to JSON. Reference of how this is handled in some other areas, since I followed that same pattern for the remote vLLM provider here: - [remote nvidia provider](https://github.com/meta-llama/llama-stack/blob/v0.1.3/llama_stack/providers/remote/inference/nvidia/openai_utils.py#L137-L140) - [meta reference provider](https://github.com/meta-llama/llama-stack/blob/v0.1.3/llama_stack/providers/inline/agents/meta_reference/agent_instance.py#L635-L636) There is opportunity to potentially reconcile the remove nvidia and remote vllm bits where they are both translating Llama Stack Inference APIs to OpenAI client requests, but that's a can of worms I didn't want to open for this bug fix. This explicitly fixes this error when using the remote vLLM provider and the agent tests: ``` TypeError: Object of type BuiltinTool is not JSON serializable ``` So, this is related to #1144 and addresses the immediate issue raised there. With this fix, `tests/client-sdk/agents/test_agents.py::test_builtin_tool_web_search` now gets past the JSON serialization error when using the remote vLLM provider and actually attempts to call the web search tool. I don't have any API keys setup for the actual web search providers yet, so I cannot verify everything works after that point. ## Test Plan I ran the `test_builtin_tool_web_search` locally with the remote vLLM provider like: ``` VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" LLAMA_STACK_CONFIG=remote-vllm python -m pytest -v tests/client-sdk/agents/test_agents.py::test_builtin_tool_web_search --inference-model "meta-llama/Llama-3.2-3B-Instruct" ``` Before my change, that reproduced the `TypeError: Object of type BuiltinTool is not JSON serializable` error. After my change, that error is gone and the test actually attempts the web search. That failed for me locally, due to lack of API key, but it gets past the JSON serialization error. Signed-off-by: Ben Browning <bbrownin@redhat.com>	2025-02-20 21:18:37 -08:00
Ashwin Bharambe	2608b6074f	Update embedding dimension singular	2025-02-20 16:14:46 -08:00
Ashwin Bharambe	07ccf908f7	ModelAlias -> ProviderModelEntry	2025-02-20 14:02:36 -08:00
Ashwin Bharambe	eddef0b2ae	chore: slight renaming of model alias stuff (#1181 ) Quick test by running: ``` LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk ```	2025-02-20 11:48:46 -08:00
Yuan Tang	5966079770	fix: More robust handling of the arguments in tool call response in remote::vllm (#1169 ) # What does this PR do? This fixes the following issue on the server side when the tool call response contains empty args. This happens when running `examples.agents.e2e_loop_with_client_tools` but `get_ticker_data` returns `[]`: ``` Traceback (most recent call last): File "/home/yutang/repos/llama-stack/llama_stack/distribution/server/server.py", line 208, in sse_generator async for item in event_gen: File "/home/yutang/repos/llama-stack/llama_stack/providers/inline/agents/meta_reference/agents.py", line 169, in _create_agent_turn_streaming async for event in agent.create_and_execute_turn(request): File "/home/yutang/repos/llama-stack/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 189, in create_and_execute_turn async for chunk in self.run( File "/home/yutang/repos/llama-stack/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 258, in run async for res in self._run( File "/home/yutang/repos/llama-stack/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 499, in _run async for chunk in await self.inference_api.chat_completion( File "/home/yutang/repos/llama-stack/llama_stack/distribution/routers/routers.py", line 182, in <genexpr> return (chunk async for chunk in await provider.chat_completion(**params)) File "/home/yutang/repos/llama-stack/llama_stack/providers/remote/inference/vllm/vllm.py", line 296, in _stream_chat_completion async for chunk in res: File "/home/yutang/repos/llama-stack/llama_stack/providers/remote/inference/vllm/vllm.py", line 162, in _process_vllm_chat_completion_stream_response arguments=json.loads(tool_call_buf.arguments), File "/home/yutang/.conda/envs/distribution-myenv/lib/python3.10/json/__init__.py", line 346, in loads return _default_decoder.decode(s) File "/home/yutang/.conda/envs/distribution-myenv/lib/python3.10/json/decoder.py", line 337, in decode obj, end = self.raw_decode(s, idx=_w(s, 0).end()) File "/home/yutang/.conda/envs/distribution-myenv/lib/python3.10/json/decoder.py", line 355, in raw_decode raise JSONDecodeError("Expecting value", s, err.value) from None json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0) ``` ## Test Plan All existing tests in `tests/client-sdk/inference/test_text_inference.py` passed. [//]: # (## Documentation) --------- Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>	2025-02-19 22:27:02 -08:00
Ashwin Bharambe	cdcbeb005b	chore: remove llama_models.llama3.api imports from providers (#1107 ) There should be a choke-point for llama3.api imports -- this is the prompt adapter. Creating a ChatFormat() object on demand is inexpensive. The underlying Tokenizer is a singleton anyway.	2025-02-19 19:01:29 -08:00
Ashwin Bharambe	314ee09ae3	chore: move all Llama Stack types from llama-models to llama-stack (#1098 ) llama-models should have extremely minimal cruft. Its sole purpose should be didactic -- show the simplest implementation of the llama models and document the prompt formats, etc. This PR is the complement to https://github.com/meta-llama/llama-models/pull/279 ## Test Plan Ensure all `llama` CLI `model` sub-commands work: ```bash llama model list llama model download --model-id ... llama model prompt-format -m ... ``` Ran tests: ```bash cd tests/client-sdk LLAMA_STACK_CONFIG=fireworks pytest -s -v inference/ LLAMA_STACK_CONFIG=fireworks pytest -s -v vector_io/ LLAMA_STACK_CONFIG=fireworks pytest -s -v agents/ ``` Create a fresh venv `uv venv && source .venv/bin/activate` and run `llama stack build --template fireworks --image-type venv` followed by `llama stack run together --image-type venv` <-- the server runs Also checked that the OpenAPI generator can run and there is no change in the generated files as a result. ```bash cd docs/openapi_generator sh run_openapi_generator.sh ```	2025-02-14 09:10:59 -08:00
Sébastien Han	e4a1579e63	build: format codebase imports using ruff linter (#1028 ) # What does this PR do? - Configured ruff linter to automatically fix import sorting issues. - Set --exit-non-zero-on-fix to ensure non-zero exit code when fixes are applied. - Enabled the 'I' selection to focus on import-related linting rules. - Ran the linter, and formatted all codebase imports accordingly. - Removed the black dep from the "dev" group since we use ruff Signed-off-by: Sébastien Han <seb@redhat.com> [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan [Describe the tests you ran to verify your changes with result summaries. Provide clear instructions so the plan can be easily re-executed.] [//]: # (## Documentation) [//]: # (- [ ] Added a Changelog entry if the change is significant) Signed-off-by: Sébastien Han <seb@redhat.com>	2025-02-13 10:06:21 -08:00
Ben Browning	dd1a366347	fix: logprobs support in remote-vllm provider (#1074 ) # What does this PR do? The remote-vllm provider was not passing logprobs options from CompletionRequest or ChatCompletionRequests through to the OpenAI client parameters. I manually verified this, as well as observed this provider failing `TestInference::test_completion_logprobs`. This was filed as issue #1073. This fixes that by passing the `logprobs.top_k` value through to the parameters we pass into the OpenAI client. Additionally, this fixes a bug in `test_text_inference.py` where it mistakenly assumed chunk.delta were of type `ContentDelta` for completion requests. The deltas are of type `ContentDelta` for chat completion requests, but for basic completion requests the deltas are of type string. This test was likely failing for other providers that did properly support logprobs because of this latter issue in the test, which was hit while fixing the above issue with the remote-vllm provider. (Closes #1073) ## Test Plan First, you need a vllm running. I ran one locally like this: ``` vllm serve meta-llama/Llama-3.2-3B-Instruct --port 8001 --enable-auto-tool-choice --tool-call-parser llama3_json ``` Next, run test_text_inference.py against this vllm using the remote vllm provider like this: ``` VLLM_URL="http://localhost:8001/v1" python -m pytest -s -v llama_stack/providers/tests/inference/test_text_inference.py --providers "inference=vllm_remote" ``` Before my change, the test failed with this error: ``` llama_stack/providers/tests/inference/test_text_inference.py:155: in test_completion_logprobs assert 1 <= len(response.logprobs) <= 5 E TypeError: object of type 'NoneType' has no len() ``` After my change, the test passes. [//]: # (## Documentation) Signed-off-by: Ben Browning <bbrownin@redhat.com>	2025-02-13 11:00:00 -05:00
Yuan Tang	5e97dd9919	feat: Support tool calling for streaming chat completion in remote vLLM provider (#1063 ) # What does this PR do? [Provide a short summary of what this PR does and why. Link to relevant issues if applicable.] Closes https://github.com/meta-llama/llama-stack/issues/1046. ## Test Plan [Describe the tests you ran to verify your changes with result summaries. Provide clear instructions so the plan can be easily re-executed.] ``` LLAMA_STACK_BASE_URL=http://localhost:5002 pytest -v tests/client-sdk/inference/test_text_inference.py ================================================================= test session starts ================================================================= platform linux -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0 -- /home/yutang/.conda/envs/distribution-myenv/bin/python3.10 cachedir: .pytest_cache rootdir: /home/yutang/repos/llama-stack configfile: pyproject.toml plugins: anyio-4.8.0 collected 14 items tests/client-sdk/inference/test_text_inference.py::test_text_completion_non_streaming[meta-llama/Llama-3.1-8B-Instruct] PASSED [ 7%] tests/client-sdk/inference/test_text_inference.py::test_text_completion_streaming[meta-llama/Llama-3.1-8B-Instruct] PASSED [ 14%] tests/client-sdk/inference/test_text_inference.py::test_completion_log_probs_non_streaming[meta-llama/Llama-3.1-8B-Instruct] XFAIL (remote:...) [ 21%] tests/client-sdk/inference/test_text_inference.py::test_completion_log_probs_streaming[meta-llama/Llama-3.1-8B-Instruct] XFAIL (remote::vll...) [ 28%] tests/client-sdk/inference/test_text_inference.py::test_text_completion_structured_output[meta-llama/Llama-3.1-8B-Instruct] PASSED [ 35%] tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_non_streaming[meta-llama/Llama-3.1-8B-Instruct-Which planet do humans live on?-Earth] PASSED [ 42%] tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_non_streaming[meta-llama/Llama-3.1-8B-Instruct-Which planet has rings around it with a name starting with letter S?-Saturn] PASSED [ 50%] tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_streaming[meta-llama/Llama-3.1-8B-Instruct-What's the name of the Sun in latin?-Sol] PASSED [ 57%] tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_streaming[meta-llama/Llama-3.1-8B-Instruct-What is the name of the US captial?-Washington] PASSED [ 64%] tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_with_tool_calling_and_non_streaming[meta-llama/Llama-3.1-8B-Instruct] PASSED [ 71%] tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_with_tool_calling_and_streaming[meta-llama/Llama-3.1-8B-Instruct] PASSED [ 78%] tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_structured_output[meta-llama/Llama-3.1-8B-Instruct] PASSED [ 85%] tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[meta-llama/Llama-3.1-8B-Instruct-True] PASSED [ 92%] tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[meta-llama/Llama-3.1-8B-Instruct-False] PASSED [100%] =============================================== 12 passed, 2 xfailed, 1 warning in 366.56s (0:06:06) ================================================ ``` --------- Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>	2025-02-12 06:17:21 -08:00
Xi Yan	66d7e15c93	perf: ensure ToolCall in ChatCompletionResponse is subset of ChatCompletionRequest.tools (#1041 ) # What does this PR do? Problem - Using script: https://gist.github.com/thoraxe/6163b2145ce7b1c24c6026b64cf90085 - This hits an issue on server with `code_interpreter` not found, as we do not pass "builtin::code_interpreter" in AgentConfig's `toolgroups`. This is a general issue where model always tries to output `code_interpreter` in `ToolCall` even when we do not have `code_interpreter` available for execution. Reproduce Deeper Problem in chat-completion - Use script: https://gist.github.com/yanxi0830/163a9ad7b5db10556043fbfc7ecd7603 1. We currently always populate `code_interpreter` in `ToolCall` in ChatCompletionResponse if the model's response begins with `<\|python_tag\|>`. See `c5f5958498/models/llama3/api/chat_format.py (L200-L213)` <img width="913" alt="image" src="https://github.com/user-attachments/assets/328d313d-0a0b-495c-8715-61cca9ccc4a6" /> 2. This happens even if we do not pass the `code_interpreter` as a `tools` in ChatCompletionRequest. This PR Explicitly make sure that the tools returned in `ChatCompletionResponse.tool_calls` is always a tool requested by `ChatCompletionRequest.tools`. [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan Before <img width="913" alt="image" src="https://github.com/user-attachments/assets/328d313d-0a0b-495c-8715-61cca9ccc4a6" /> <img width="997" alt="image" src="https://github.com/user-attachments/assets/d3e82b62-b142-4939-954c-62843bec7110" /> After <img width="856" alt="image" src="https://github.com/user-attachments/assets/2c70ce55-c8d0-45ea-b10f-f70adc50d3d9" /> <img width="1000" alt="image" src="https://github.com/user-attachments/assets/b5e81826-c35b-4052-bf81-7afff93ce2ef" /> Unit Test ``` LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request --inference-model "meta-llama/Llama-3.3-70B-Instruct" ``` ``` LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v tests/client-sdk/agents/ ``` <img width="1002" alt="image" src="https://github.com/user-attachments/assets/04808517-eded-4122-97f5-7e5142de9779" /> Streaming - Chat Completion <img width="902" alt="image" src="https://github.com/user-attachments/assets/f477bc86-bd38-4729-b49e-a0a6ed3f835a" /> - Agent <img width="916" alt="image" src="https://github.com/user-attachments/assets/f4cc3417-23cd-46b1-953d-3a2271e79bbb" /> [//]: # (## Documentation) [//]: # (- [ ] Added a Changelog entry if the change is significant)	2025-02-11 18:31:35 -08:00
Yuan Tang	dd37e58868	feat: Support tool calling for non-streaming chat completion in remote vLLM provider (#1034 ) # What does this PR do? This PR adds support for tool calling for non-streaming chat completion. Prior to this, tool calls were not passed to chat completion requests and the tools object needs to be restructured properly to be compatible with vLLM provider. ## Test Plan ``` LLAMA_STACK_BASE_URL=http://localhost:5002 pytest -v tests/client-sdk/inference/test_text_inference.py ================================================================= test session starts ================================================================= platform linux -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0 -- /home/yutang/.conda/envs/distribution-myenv/bin/python3.10 cachedir: .pytest_cache rootdir: /home/yutang/repos/llama-stack configfile: pyproject.toml plugins: anyio-4.8.0 collected 12 items tests/client-sdk/inference/test_text_inference.py::test_text_completion_non_streaming[meta-llama/Llama-3.1-8B-Instruct] PASSED [ 8%] tests/client-sdk/inference/test_text_inference.py::test_text_completion_streaming[meta-llama/Llama-3.1-8B-Instruct] PASSED [ 16%] tests/client-sdk/inference/test_text_inference.py::test_completion_log_probs_non_streaming[meta-llama/Llama-3.1-8B-Instruct] XFAIL (remote:...) [ 25%] tests/client-sdk/inference/test_text_inference.py::test_completion_log_probs_streaming[meta-llama/Llama-3.1-8B-Instruct] XFAIL (remote::vll...) [ 33%] tests/client-sdk/inference/test_text_inference.py::test_text_completion_structured_output[meta-llama/Llama-3.1-8B-Instruct] PASSED [ 41%] tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_non_streaming[meta-llama/Llama-3.1-8B-Instruct-Which planet do humans live on?-Earth] PASSED [ 50%] tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_non_streaming[meta-llama/Llama-3.1-8B-Instruct-Which planet has rings around it with a name starting with letter S?-Saturn] PASSED [ 58%] tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_streaming[meta-llama/Llama-3.1-8B-Instruct-What's the name of the Sun in latin?-Sol] PASSED [ 66%] tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_streaming[meta-llama/Llama-3.1-8B-Instruct-What is the name of the US captial?-Washington] PASSED [ 75%] tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_with_tool_calling_and_non_streaming[meta-llama/Llama-3.1-8B-Instruct] PASSED [ 83%] tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_with_tool_calling_and_streaming[meta-llama/Llama-3.1-8B-Instruct] FAILED [ 91%] tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_structured_output[meta-llama/Llama-3.1-8B-Instruct] PASSED [100%] ``` --------- Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>	2025-02-11 21:08:29 -05:00
Yuan Tang	0a0ee5ca96	Fix incorrect handling of chat completion endpoint in remote::vLLM (#951 ) # What does this PR do? Fixes https://github.com/meta-llama/llama-stack/issues/949. ## Test Plan Verified that the correct chat completion endpoint is called after the change. Llama Stack server: ``` INFO: ::1:32838 - "POST /v1/inference/chat-completion HTTP/1.1" 200 OK 18:36:28.187 [END] /v1/inference/chat-completion [StatusCode.OK] (1276.12ms) ``` vLLM server: ``` INFO: ::1:36866 - "POST /v1/chat/completions HTTP/1.1" 200 OK ``` ```bash LLAMA_STACK_BASE_URL=http://localhost:5002 pytest -s -v tests/client-sdk/inference/test_inference.py -k "test_image_chat_completion_base64 or test_image_chat_completion_non_streaming or test_image_chat_completion_streaming" ================================================================== test session starts =================================================================== platform linux -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0 -- /home/yutang/.conda/envs/distribution-myenv/bin/python3.10 cachedir: .pytest_cache rootdir: /home/yutang/repos/llama-stack configfile: pyproject.toml plugins: anyio-4.8.0 collected 16 items / 12 deselected / 4 selected tests/client-sdk/inference/test_inference.py::test_image_chat_completion_non_streaming[meta-llama/Llama-3.2-11B-Vision-Instruct] PASSED tests/client-sdk/inference/test_inference.py::test_image_chat_completion_streaming[meta-llama/Llama-3.2-11B-Vision-Instruct] PASSED tests/client-sdk/inference/test_inference.py::test_image_chat_completion_base64[meta-llama/Llama-3.2-11B-Vision-Instruct-url] PASSED tests/client-sdk/inference/test_inference.py::test_image_chat_completion_base64[meta-llama/Llama-3.2-11B-Vision-Instruct-data] PASSED ``` Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>	2025-02-06 10:45:19 -08:00
ehhuang	c9ab72fa82	Support sys_prompt behavior in inference (#937 ) # What does this PR do? The current default system prompt for llama3.2 tends to overindex on tool calling and doesn't work well when the prompt does not require tool calling. This PR adds an option to override the default system prompt, and organizes tool-related configs into a new config object. - [ ] Addresses issue (#issue) ## Test Plan python -m unittest llama_stack.providers.tests.inference.test_prompt_adapter ## Sources Please link relevant resources if necessary. ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Ran pre-commit to handle lint / formatting issues. - [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests. --- [//]: # (BEGIN SAPLING FOOTER) Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/meta-llama/llama-stack/pull/937). * #938 * __->__ #937	2025-02-03 23:35:16 -08:00
Yuan Tang	34ab7a3b6c	Fix precommit check after moving to ruff (#927 ) Lint check in main branch is failing. This fixes the lint check after we moved to ruff in https://github.com/meta-llama/llama-stack/pull/921. We need to move to a `ruff.toml` file as well as fixing and ignoring some additional checks. Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>	2025-02-02 06:46:45 -08:00
Aidan Do	910717c1fd	Add vLLM raw completions API (#823 ) # What does this PR do? Adds raw completions API to vLLM ## Test Plan <details> <summary>Setup</summary> ```bash # Run vllm server conda create -n vllm python=3.12 -y conda activate vllm pip install vllm # Run llamastack conda create --name llamastack-vllm python=3.10 conda activate llamastack-vllm export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct && \ pip install -e . && \ pip install --no-cache --index-url https://pypi.org/simple/ --extra-index-url https://test.pypi.org/simple/ llama-stack==0.1.0rc7 && \ llama stack build --template remote-vllm --image-type conda && \ llama stack run ./distributions/remote-vllm/run.yaml \ --port 5000 \ --env INFERENCE_MODEL=$INFERENCE_MODEL \ --env VLLM_URL=http://localhost:8000/v1 \| tee -a llama-stack.log ``` </details> <details> <summary>Integration</summary> ```bash # Run conda activate llamastack-vllm export VLLM_URL=http://localhost:8000/v1 pip install pytest pytest_html pytest_asyncio aiosqlite pytest llama_stack/providers/tests/inference/test_text_inference.py -v -k vllm # Results llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_model_list[-vllm_remote] PASSED [ 11%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[-vllm_remote] PASSED [ 22%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_logprobs[-vllm_remote] SKIPPED [ 33%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[-vllm_remote] SKIPPED [ 44%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[-vllm_remote] PASSED [ 55%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[-vllm_remote] PASSED [ 66%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[-vllm_remote] PASSED [ 77%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling[-vllm_remote] PASSED [ 88%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling_streaming[-vllm_remote] PASSED [100%] ====================================== 7 passed, 2 skipped, 99 deselected, 1 warning in 9.80s ====================================== ``` </details> <details> <summary>Manual</summary> ```bash # Install pip install --no-cache --index-url https://pypi.org/simple/ --extra-index-url https://test.pypi.org/simple/ llama-stack==0.1.0rc7 ``` Apply this diff ```diff diff --git a/llama_stack/distribution/server/server.py b/llama_stack/distribution/server/server.py index 8dbb193..95173e2 100644 --- a/llama_stack/distribution/server/server.py +++ b/llama_stack/distribution/server/server.py @@ -250,7 +250,7 @@ class ClientVersionMiddleware: server_version_parts = tuple( map(int, self.server_version.split(".")[:2]) ) - if client_version_parts != server_version_parts: + if False and client_version_parts != server_version_parts: async def send_version_error(send): await send( diff --git a/llama_stack/templates/remote-vllm/run.yaml b/llama_stack/templates/remote-vllm/run.yaml index 4eac4da..32eb50e 100644 --- a/llama_stack/templates/remote-vllm/run.yaml +++ b/llama_stack/templates/remote-vllm/run.yaml @@ -94,7 +94,8 @@ metadata_store: type: sqlite db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/remote-vllm}/registry.db models: -- metadata: {} +- metadata: + llama_model: meta-llama/Llama-3.2-3B-Instruct model_id: ${env.INFERENCE_MODEL} provider_id: vllm-inference model_type: llm ``` Test 1: ```python from llama_stack_client import LlamaStackClient client = LlamaStackClient( base_url="http://localhost:5000", ) response = client.inference.completion( model_id="meta-llama/Llama-3.2-3B-Instruct", content="Hello, world client!", ) print(response) ``` Test 2 ``` from llama_stack_client import LlamaStackClient client = LlamaStackClient( base_url="http://localhost:5000", ) response = client.inference.completion( model_id="meta-llama/Llama-3.2-3B-Instruct", content="Hello, world client!", stream=True, ) for chunk in response: print(chunk.delta, end="", flush=True) ``` ``` I'm excited to introduce you to our latest project, a comprehensive guide to the best coffee shops in [City]. As a coffee connoisseur, you're in luck because we've scoured the city to bring you the top picks for the perfect cup of joe. In this guide, we'll take you on a journey through the city's most iconic coffee shops, highlighting their unique features, must-try drinks, and insider tips from the baristas themselves. From cozy cafes to trendy cafes, we've got you covered. Top 5 Coffee Shops in [City] 1. The Daily Grind: This beloved institution has been serving up expertly crafted pour-overs and lattes for over 10 years. Their expert baristas are always happy to guide you through their menu, which features a rotating selection of single-origin beans from around the world... ``` </details> ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Ran pre-commit to handle lint / formatting issues. - [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests.	2025-01-22 22:58:27 -08:00
Xi Yan	3a9468ce9b	fix again vllm for non base64 (#818 ) # What does this PR do? - previous fix introduced regression for non base64 image - add back download, and base64 check ## Test Plan <img width="835" alt="image" src="https://github.com/user-attachments/assets/b70bf725-035a-4b42-b492-53daaf71458a" /> ## Sources Please link relevant resources if necessary. ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Ran pre-commit to handle lint / formatting issues. - [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests.	2025-01-17 18:33:40 -08:00
Xi Yan	3e7496e835	fix vllm base64 image inference (#815 ) # What does this PR do? - fix base64 based image url for vllm - add a test case for base64 based image_url - fixes issue: https://github.com/meta-llama/llama-stack/issues/571 ## Test Plan ``` LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v ./tests/client-sdk/inference/test_inference.py::test_image_chat_completion_base64_url ``` <img width="991" alt="image" src="https://github.com/user-attachments/assets/d56381ba-6777-4d23-9da9-81f73ce93566" /> ## Sources Please link relevant resources if necessary. ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Ran pre-commit to handle lint / formatting issues. - [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests.	2025-01-17 17:07:28 -08:00
Dinesh Yeduguru	8af6951106	remove conflicting default for tool prompt format in chat completion (#742 ) # What does this PR do? We are setting a default value of json for tool prompt format, which conflicts with llama 3.2/3.3 models since they use python list. This PR changes the defaults to None and in the code, we infer default based on the model. Addresses: #695 Tests: ❯ LLAMA_STACK_BASE_URL=http://localhost:5000 pytest -v tests/client-sdk/inference/test_inference.py -k "test_text_chat_completion" pytest llama_stack/providers/tests/inference/test_prompt_adapter.py	2025-01-10 10:41:53 -08:00
Yuan Tang	04d5b9814f	Fix assert message and call to completion_request_to_prompt in remote:vllm (#709 ) The current message is incorrect and model arg is not needed in `completion_request_to_prompt`. Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>	2025-01-03 13:44:49 -08:00
Xi Yan	3c72c034e6	[remove import ] clean up import 's (#689 ) # What does this PR do? - as title, cleaning up `import `'s - upgrade tests to make them more robust to bad model outputs - remove import 's in llama_stack/apis/* (skip __init__ modules) <img width="465" alt="image" src="https://github.com/user-attachments/assets/d8339c13-3b40-4ba5-9c53-0d2329726ee2" /> - run `sh run_openapi_generator.sh`, no types gets affected ## Test Plan ### Providers Tests agents ``` pytest -v -s llama_stack/providers/tests/agents/test_agents.py -m "together" --safety-shield meta-llama/Llama-Guard-3-8B --inference-model meta-llama/Llama-3.1-405B-Instruct-FP8 ``` inference ```bash # meta-reference torchrun $CONDA_PREFIX/bin/pytest -v -s -k "meta_reference" --inference-model="meta-llama/Llama-3.1-8B-Instruct" ./llama_stack/providers/tests/inference/test_text_inference.py torchrun $CONDA_PREFIX/bin/pytest -v -s -k "meta_reference" --inference-model="meta-llama/Llama-3.2-11B-Vision-Instruct" ./llama_stack/providers/tests/inference/test_vision_inference.py # together pytest -v -s -k "together" --inference-model="meta-llama/Llama-3.1-8B-Instruct" ./llama_stack/providers/tests/inference/test_text_inference.py pytest -v -s -k "together" --inference-model="meta-llama/Llama-3.2-11B-Vision-Instruct" ./llama_stack/providers/tests/inference/test_vision_inference.py pytest ./llama_stack/providers/tests/inference/test_prompt_adapter.py ``` safety ``` pytest -v -s llama_stack/providers/tests/safety/test_safety.py -m together --safety-shield meta-llama/Llama-Guard-3-8B ``` memory ``` pytest -v -s llama_stack/providers/tests/memory/test_memory.py -m "sentence_transformers" --env EMBEDDING_DIMENSION=384 ``` scoring ``` pytest -v -s -m llm_as_judge_scoring_together_inference llama_stack/providers/tests/scoring/test_scoring.py --judge-model meta-llama/Llama-3.2-3B-Instruct pytest -v -s -m basic_scoring_together_inference llama_stack/providers/tests/scoring/test_scoring.py pytest -v -s -m braintrust_scoring_together_inference llama_stack/providers/tests/scoring/test_scoring.py ``` datasetio ``` pytest -v -s -m localfs llama_stack/providers/tests/datasetio/test_datasetio.py pytest -v -s -m huggingface llama_stack/providers/tests/datasetio/test_datasetio.py ``` eval ``` pytest -v -s -m meta_reference_eval_together_inference llama_stack/providers/tests/eval/test_eval.py pytest -v -s -m meta_reference_eval_together_inference_huggingface_datasetio llama_stack/providers/tests/eval/test_eval.py ``` ### Client-SDK Tests ``` LLAMA_STACK_BASE_URL=http://localhost:5000 pytest -v ./tests/client-sdk ``` ### llama-stack-apps ``` PORT=5000 LOCALHOST=localhost python -m examples.agents.hello $LOCALHOST $PORT python -m examples.agents.inflation $LOCALHOST $PORT python -m examples.agents.podcast_transcript $LOCALHOST $PORT python -m examples.agents.rag_as_attachments $LOCALHOST $PORT python -m examples.agents.rag_with_memory_bank $LOCALHOST $PORT python -m examples.safety.llama_guard_demo_mm $LOCALHOST $PORT python -m examples.agents.e2e_loop_with_custom_tools $LOCALHOST $PORT # Vision model python -m examples.interior_design_assistant.app python -m examples.agent_store.app $LOCALHOST $PORT ``` ### CLI ``` which llama llama model prompt-format -m Llama3.2-11B-Vision-Instruct llama model list llama stack list-apis llama stack list-providers inference llama stack build --template ollama --image-type conda ``` ### Distributions Tests ollama ``` llama stack build --template ollama --image-type conda ollama run llama3.2:1b-instruct-fp16 llama stack run ./llama_stack/templates/ollama/run.yaml --env INFERENCE_MODEL=meta-llama/Llama-3.2-1B-Instruct ``` fireworks ``` llama stack build --template fireworks --image-type conda llama stack run ./llama_stack/templates/fireworks/run.yaml ``` together ``` llama stack build --template together --image-type conda llama stack run ./llama_stack/templates/together/run.yaml ``` tgi ``` llama stack run ./llama_stack/templates/tgi/run.yaml --env TGI_URL=http://0.0.0.0:5009 --env INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct ``` ## Sources Please link relevant resources if necessary. ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Ran pre-commit to handle lint / formatting issues. - [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests.	2024-12-27 15:45:44 -08:00
Ashwin Bharambe	b7a7caa9a8	Fix conversion to RawMessage everywhere	2024-12-17 14:00:43 -08:00
Ashwin Bharambe	8de8eb03c8	Update the "InterleavedTextMedia" type (#635 ) ## What does this PR do? This is a long-pending change and particularly important to get done now. Specifically: - we cannot "localize" (aka download) any URLs from media attachments anywhere near our modeling code. it must be done within llama-stack. - `PIL.Image` is infesting all our APIs via `ImageMedia -> InterleavedTextMedia` and that cannot be right at all. Anything in the API surface must be "naturally serializable". We need a standard `{ type: "image", image_url: "<...>" }` which is more extensible - `UserMessage`, `SystemMessage`, etc. are moved completely to llama-stack from the llama-models repository. See https://github.com/meta-llama/llama-models/pull/244 for the corresponding PR in llama-models. ## Test Plan ```bash cd llama_stack/providers/tests pytest -s -v -k "fireworks or ollama or together" inference/test_vision_inference.py pytest -s -v -k "(fireworks or ollama or together) and llama_3b" inference/test_text_inference.py pytest -s -v -k chroma memory/test_memory.py \ --env EMBEDDING_DIMENSION=384 --env CHROMA_DB_PATH=/tmp/foobar pytest -s -v -k fireworks agents/test_agents.py \ --safety-shield=meta-llama/Llama-Guard-3-8B \ --inference-model=meta-llama/Llama-3.1-8B-Instruct ``` Updated the client sdk (see PR ...), installed the SDK in the same environment and then ran the SDK tests: ```bash cd tests/client-sdk LLAMA_STACK_CONFIG=together pytest -s -v agents/test_agents.py LLAMA_STACK_CONFIG=ollama pytest -s -v memory/test_memory.py # this one needed a bit of hacking in the run.yaml to ensure I could register the vision model correctly INFERENCE_MODEL=llama3.2-vision:latest LLAMA_STACK_CONFIG=ollama pytest -s -v inference/test_inference.py ```	2024-12-17 11:18:31 -08:00
Dinesh Yeduguru	516e1a3e59	add embedding model by default to distribution templates (#617 ) # What does this PR do? Adds the sentence transformer provider and the `all-MiniLM-L6-v2` embedding model to the default models to register in the run.yaml for all providers. ## Test Plan llama stack build --template together --image-type conda llama stack run ~/.llama/distributions/llamastack-together/together-run.yaml	2024-12-13 12:48:00 -08:00
Dinesh Yeduguru	96e158eaac	Make embedding generation go through inference (#606 ) This PR does the following: 1) adds the ability to generate embeddings in all supported inference providers. 2) Moves all the memory providers to use the inference API and improved the memory tests to setup the inference stack correctly and use the embedding models This is a merge from #589 and #598	2024-12-12 11:47:50 -08:00
Aidan Do	095125e463	[#391 ] Add support for json structured output for vLLM (#528 ) # What does this PR do? Addresses issue (#391) - Adds json structured output for vLLM - Enables structured output tests for vLLM > Give me a recipe for Spaghetti Bolognaise: ```json { "recipe_name": "Spaghetti Bolognaise", "preamble": "Ah, spaghetti bolognaise - the quintessential Italian dish that fills my kitchen with the aromas of childhood nostalgia. As a child, I would watch my nonna cook up a big pot of spaghetti bolognaise every Sunday, filling our small Italian household with the savory scent of simmering meat and tomatoes. The way the sauce would thicken and the spaghetti would al dente - it was love at first bite. And now, as a chef, I want to share that same love with you, so you can recreate these warm, comforting memories at home.", "ingredients": [ "500g minced beef", "1 medium onion, finely chopped", "2 cloves garlic, minced", "1 carrot, finely chopped", " celery, finely chopped", "1 (28 oz) can whole peeled tomatoes", "1 tbsp tomato paste", "1 tsp dried basil", "1 tsp dried oregano", "1 tsp salt", "1/2 tsp black pepper", "1/2 tsp sugar", "1 lb spaghetti", "Grated Parmesan cheese, for serving", "Extra virgin olive oil, for serving" ], "steps": [ "Heat a large pot over medium heat and add a generous drizzle of extra virgin olive oil.", "Add the chopped onion, garlic, carrot, and celery and cook until the vegetables are soft and translucent, about 5-7 minutes.", "Add the minced beef and cook until browned, breaking it up with a spoon as it cooks.", "Add the tomato paste and cook for 1-2 minutes, stirring constantly.", "Add the canned tomatoes, dried basil, dried oregano, salt, black pepper, and sugar. Stir well to combine.", "Bring the sauce to a simmer and let it cook for 20-30 minutes, stirring occasionally, until the sauce has thickened and the flavors have melded together.", "While the sauce cooks, bring a large pot of salted water to a boil and cook the spaghetti according to the package instructions until al dente. Reserve 1 cup of pasta water before draining the spaghetti.", "Add the reserved pasta water to the sauce and stir to combine.", "Combine the cooked spaghetti and sauce, tossing to coat the pasta evenly.", "Serve hot, topped with grated Parmesan cheese and a drizzle of extra virgin olive oil.", "Enjoy!" ] } ``` Generated with Llama-3.2-3B-Instruct model - pretty good for a 3B parameter model 👍 ## Test Plan `pytest -v -s llama_stack/providers/tests/inference/test_text_inference.py -k llama_3b-vllm_remote` With the following setup: ```bash # Environment export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct export INFERENCE_PORT=8000 export VLLM_URL=http://localhost:8000/v1 # vLLM server sudo docker run --gpus all \ -v $STORAGE_DIR/.cache/huggingface:/root/.cache/huggingface \ --env "HUGGING_FACE_HUB_TOKEN=$(cat ~/.cache/huggingface/token)" \ -p 8000:$INFERENCE_PORT \ --ipc=host \ --net=host \ vllm/vllm-openai:v0.6.3.post1 \ --model $INFERENCE_MODEL # llama-stack server llama stack build --template remote-vllm --image-type conda && llama stack run distributions/remote-vllm/run.yaml \ --port 5001 \ --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct ``` Results: ``` llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_model_list[llama_3b-vllm_remote] PASSED llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[llama_3b-vllm_remote] SKIPPED llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completions_structured_output[llama_3b-vllm_remote] SKIPPED llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[llama_3b-vllm_remote] PASSED llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[llama_3b-vllm_remote] PASSED llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[llama_3b-vllm_remote] PASSED llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling[llama_3b-vllm_remote] PASSED llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling_streaming[llama_3b-vllm_remote] PASSED ================================ 6 passed, 2 skipped, 120 deselected, 2 warnings in 13.26s ================================ ``` ## Sources - https://github.com/vllm-project/vllm/discussions/8300 - By default, vLLM uses https://github.com/dottxt-ai/outlines for structured outputs [[1](`32e7db2536/vllm/engine/arg_utils.py (L279-L280)`)] ## Before submitting [N/A] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case) - [x] Ran pre-commit to handle lint / formatting issues. - [x] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? [N/A?] Updated relevant documentation. Couldn't find any relevant documentation. Lmk if I've missed anything. - [x] Wrote necessary unit or integration tests.	2024-12-08 15:02:51 -08:00
Dinesh Yeduguru	6395dadc2b	use logging instead of prints (#499 ) # What does this PR do? This PR moves all print statements to use logging. Things changed: - Had to add `await start_trace("sse_generator")` to server.py to actually get tracing working. else was not seeing any logs - If no telemetry provider is provided in the run.yaml, we will write to stdout - by default, the logs are going to be in JSON, but we expose an option to configure to output in a human readable way.	2024-11-21 11:32:53 -08:00
Ashwin Bharambe	7bfcfe80b5	Add logs (prints :/) to dump out what URL vllm / tgi is connecting to	2024-11-19 15:50:26 -08:00
Dinesh Yeduguru	0850ad656a	unregister for memory banks and remove update API (#458 ) The semantics of an Update on resources is very tricky to reason about especially for memory banks and models. The best way to go forward here is for the user to unregister and register a new resource. We don't have a compelling reason to support update APIs. Tests: pytest -v -s llama_stack/providers/tests/memory/test_memory.py -m "chroma" --env CHROMA_HOST=localhost --env CHROMA_PORT=8000 pytest -v -s llama_stack/providers/tests/memory/test_memory.py -m "pgvector" --env PGVECTOR_DB=postgres --env PGVECTOR_USER=postgres --env PGVECTOR_PASSWORD=mysecretpassword --env PGVECTOR_HOST=0.0.0.0 $CONDA_PREFIX/bin/pytest -v -s -m "ollama" llama_stack/providers/tests/inference/test_model_registration.py --------- Co-authored-by: Dinesh Yeduguru <dineshyv@fb.com>	2024-11-14 17:12:11 -08:00
Dinesh Yeduguru	787e2034b7	model registration in ollama and vllm check against the available models in the provider (#446 ) tests: pytest -v -s -m "ollama" llama_stack/providers/tests/inference/test_text_inference.py pytest -v -s -m vllm_remote llama_stack/providers/tests/inference/test_text_inference.py --env VLLM_URL="http://localhost:9798/v1" ---------	2024-11-13 13:04:06 -08:00
Dinesh Yeduguru	fdff24e77a	Inference to use provider resource id to register and validate (#428 ) This PR changes the way model id gets translated to the final model name that gets passed through the provider. Major changes include: 1) Providers are responsible for registering an object and as part of the registration returning the object with the correct provider specific name of the model provider_resource_id 2) To help with the common look ups different names a new ModelLookup class is created. Tested all inference providers including together, fireworks, vllm, ollama, meta reference and bedrock	2024-11-12 20:02:00 -08:00
Ashwin Bharambe	afe4a53ae8	Check vLLM registration	2024-11-12 13:14:36 -08:00
Dinesh Yeduguru	ec644d3418	migrate model to Resource and new registration signature (#410 ) * resource oriented object design for models * add back llama_model field * working tests * register singature fix * address feedback --------- Co-authored-by: Dinesh Yeduguru <dineshyv@fb.com>	2024-11-08 16:12:57 -08:00
Ashwin Bharambe	3b54ce3499	remote::vllm now works with vision models	2024-11-06 16:07:17 -08:00
Ashwin Bharambe	994732e2e0	`impls` -> `inline`, `adapters` -> `remote` (#381 )	2024-11-06 14:54:05 -08:00

40 commits