llama-stack-mirror

mirror of https://github.com/meta-llama/llama-stack.git synced 2025-12-03 09:53:45 +00:00

Author	SHA1	Message	Date
Rashmi Pawar	ace82836c1	feat: NVIDIA allow non-llama model registration (#1859 ) # What does this PR do? Adds custom model registration functionality to NVIDIAInferenceAdapter, which lets inference happen on: - post-training models - non-llama models in the API Catalogue (behind https://integrate.api.nvidia.com and endpoints compatible with AsyncOpenAI) ## Example Usage: ```python from llama_stack.apis.models import Model, ModelType from llama_stack.distribution.library_client import LlamaStackAsLibraryClient client = LlamaStackAsLibraryClient("nvidia") _ = client.initialize() client.models.register( model_id=model_name, model_type=ModelType.llm, provider_id="nvidia" ) response = client.inference.chat_completion( model_id=model_name, messages=[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"Write a limerick about the wonders of GPU computing."}], ) ``` ## Test Plan ```bash pytest tests/unit/providers/nvidia/test_supervised_fine_tuning.py ========================================================== test session starts =========================================================== platform linux -- Python 3.10.0, pytest-8.3.5, pluggy-1.5.0 rootdir: /home/ubuntu/llama-stack configfile: pyproject.toml plugins: anyio-4.9.0 collected 6 items tests/unit/providers/nvidia/test_supervised_fine_tuning.py ...... [100%] ============================================================ warnings summary ============================================================ ../miniconda/envs/nvidia-1/lib/python3.10/site-packages/pydantic/fields.py:1076 /home/ubuntu/miniconda/envs/nvidia-1/lib/python3.10/site-packages/pydantic/fields.py:1076: PydanticDeprecatedSince20: Using extra keyword arguments on `Field` is deprecated and will be removed. Use `json_schema_extra` instead. (Extra keys: 'contentEncoding'). Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/ warn( -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ====================================================== 6 passed, 1 warning in 1.51s ====================================================== ``` [//]: # (## Documentation) Updated Readme.md cc: @dglogo, @sumitb, @mattf	2025-04-24 17:13:33 -07:00
Jash Gulabrai	cc77f79f55	feat: Add NVIDIA Eval integration (#1890 ) # What does this PR do? This PR adds support for NVIDIA's NeMo Evaluator API to the Llama Stack eval module. The integration enables users to evaluate models via the Llama Stack interface. ## Test Plan 1. Added unit tests and successfully ran from root of project: `./scripts/unit-tests.sh tests/unit/providers/nvidia/test_eval.py` ``` tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_job_cancel PASSED tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_job_result PASSED tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_job_status PASSED tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_register_benchmark PASSED tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_run_eval PASSED ``` 2. Verified I could build the Llama Stack image: `LLAMA_STACK_DIR=$(pwd) llama stack build --template nvidia --image-type venv` Documentation added to `llama_stack/providers/remote/eval/nvidia/README.md` --------- Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>	2025-04-24 17:12:42 -07:00
Ben Browning	0b6cd45950	fix: Additional streaming error handling (#2007 ) # What does this PR do? This expands the `test_sse` test suite and fixes some edge cases with bugs in our SSE error handling to ensure streaming clients always get a proper error response. First, we handle the case where a client disconnects before we actually start streaming the response back. Previously we only handled the case where a client disconnected as we were streaming the response, but there was an edge case where a client disconnecting before we streamed any response back did not trigger our logic to cleanly handle that disconnect. Second, we handle the case where an error is thrown from the server before the actual async generator gets created from the provider. This happens in scenarios like the newly merged OpenAI API input validation, where we eagerly raise validation errors before returning the async generator object that streams the responses back. ## Test Plan Tested via: ``` python -m pytest -s -v tests/unit/server/test_sse.py ``` Both test cases failed before, and passed afterwards. The test cases were written based on me experimenting with actual clients that would do bad things like randomly disconnect or send invalid input in streaming mode and I hit these two cases, where things were misbehaving in our error handling. Signed-off-by: Ben Browning <bbrownin@redhat.com>	2025-04-24 17:01:45 -07:00
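The second case in the #2007 message above (an error raised before the async generator exists) can be sketched as follows. This is a minimal illustration, not the code from `server.py`: the names `sse_events`, `failing_provider`, and `collect`, and the shape of the error payload, are assumptions made for this example.

```python
import asyncio
import json
from typing import Any, AsyncIterator, Awaitable, Callable


async def sse_events(
    make_stream: Callable[[], Awaitable[AsyncIterator[Any]]],
) -> AsyncIterator[str]:
    """Yield SSE-formatted strings, turning a failure into a final error event."""
    try:
        # The provider call may raise eagerly (for example on input validation),
        # before any async generator exists, so it sits inside the try block.
        stream = await make_stream()
        async for item in stream:
            yield f"data: {json.dumps(item)}\n\n"
    except Exception as exc:
        # Hypothetical error payload shape, for illustration only.
        payload = json.dumps({"error": {"message": str(exc)}})
        yield f"data: {payload}\n\n"


async def failing_provider() -> AsyncIterator[Any]:
    # Stand-in for a provider that rejects the request up front.
    raise ValueError("messages must not be empty")


async def collect(make_stream) -> list[str]:
    return [event async for event in sse_events(make_stream)]


events = asyncio.run(collect(failing_provider))
```

Because the `await` happens inside the generator's `try` block, a streaming client still receives a well-formed event describing the failure instead of a dropped connection.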
Derek Higgins	c8797f1125	fix: Including tool call in chat (#1931 ) Include the tool call details with the chat when doing RAG with remote vLLM. Fixes: #1929 With this PR the tool call is included in the chat returned to vLLM, and the model (meta-llama/Llama-3.1-8B-Instruct) then returns the answer as expected. Signed-off-by: Derek Higgins <derekh@redhat.com>	2025-04-24 16:59:10 -07:00
ehhuang	7ed137e963	fix: meta ref inference (#2022 ) MAX_BATCH_SIZE=10 LLAMA_MODELS_DEBUG=1 LLAMA_STACK_PORT=5002 LLAMA_STACK_LOGGING='all=info' llama stack run meta-reference-gpu --env INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct --env INFERENCE_CHECKPOINT_DIR=... LLAMA_STACK_CONFIG=http://localhost:5002/ pytest -s -v tests/integration/inference --safety-shield meta-llama/Llama-Guard-3-8B --vision-model meta-llama/Llama-4-Scout-17B-16E-Instruct --text-model meta-llama/Llama-4-Scout-17B-16E-Instruct Co-authored-by: Eric Huang <erichuang@fb.com>	2025-04-24 13:03:35 -07:00
Ashwin Bharambe	a5d6ab16b2	fix: meta-reference parallel utils bug, use isinstance not equality	2025-04-24 11:27:49 -07:00
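The commit above does not show the affected code, so the following is only a generic illustration of why an `isinstance` check is usually preferred over comparing types for equality. The class names `Request` and `BatchRequest` are hypothetical.

```python
class Request:
    """Hypothetical base message type."""


class BatchRequest(Request):
    """Hypothetical subclass that should be handled like any other Request."""


def matches_by_equality(obj: object) -> bool:
    # Exact type comparison: silently rejects subclasses.
    return type(obj) == Request


def matches_by_isinstance(obj: object) -> bool:
    # isinstance follows the inheritance chain, so subclasses match too.
    return isinstance(obj, Request)


batch = BatchRequest()
equality_result = matches_by_equality(batch)
isinstance_result = matches_by_isinstance(batch)
```

Here `equality_result` is `False` and `isinstance_result` is `True` for the same object, which is the kind of mismatch that an equality-based check can introduce.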
Ilya Kolchinsky	e664ba91d8	fix: prevent the knowledge search tool from confusing the model with long content (#1908 ) # What does this PR do? This PR addresses the content dominance problem that frequently arises with multiple models when executing queries with the RAG tool. When the retrieved content is too large, it disproportionately influences the generation process, causing the model to ignore the original question and to provide meaningless comments on the retrieved information instead. This situation is especially common with agentic RAG, which is the standard way of doing RAG in Llama Stack, since directly manipulating the prompt combining the query with the retrieved content is not possible. This PR appends a grounding message to the results returned by the knowledge search tool, reminding the model about the original query and the purpose of the inference call. This makes the problem significantly less likely to occur. ## Test Plan Running the following script before the fix demonstrates the content dominance problem, where the model insists on commenting on the retrieved content and refuses to address the question. Running the script after the fix results in getting the correct answer. ``` import os import uuid from llama_stack_client import Agent, AgentEventLogger, RAGDocument, LlamaStackClient # the server endpoint LLAMA_STACK_SERVER_URL = "http://localhost:8321" # inference settings MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct" SYSTEM_PROMPT = "You are a helpful assistant. " # RAG settings VECTOR_DB_EMBEDDING_MODEL = "all-MiniLM-L6-v2" VECTOR_DB_EMBEDDING_DIMENSION = 384 VECTOR_DB_CHUNK_SIZE = 512 # initialize the server connection client = LlamaStackClient(base_url=os.environ.get("LLAMA_STACK_ENDPOINT", LLAMA_STACK_SERVER_URL)) # init the RAG retrieval parameters vector_db_id = f"test_vector_db_{uuid.uuid4()}" vector_providers = [ provider for provider in client.providers.list() if provider.api == "vector_io" ] vector_provider_to_use = vector_providers[0] # define and register the document collection to be used client.vector_dbs.register( vector_db_id=vector_db_id, embedding_model=VECTOR_DB_EMBEDDING_MODEL, embedding_dimension=VECTOR_DB_EMBEDDING_DIMENSION, provider_id=vector_provider_to_use.provider_id, ) # ingest the documents into the newly created document collection urls = [ ("https://www.openshift.guide/openshift-guide-screen.pdf", "application/pdf"), ] documents = [ RAGDocument( document_id=f"num-{i}", content=url, mime_type=url_type, metadata={}, ) for i, (url, url_type) in enumerate(urls) ] client.tool_runtime.rag_tool.insert( documents=documents, vector_db_id=vector_db_id, chunk_size_in_tokens=VECTOR_DB_CHUNK_SIZE, ) queries = [ "How to install OpenShift?", ] # initializing the agent agent = Agent( client, model=MODEL_ID, instructions=SYSTEM_PROMPT, # we make our agent aware of the RAG tool by including builtin::rag/knowledge_search in the list of tools tools=[ dict( name="builtin::rag/knowledge_search", args={ "vector_db_ids": [vector_db_id], # list of IDs of document collections to consider during retrieval }, ) ], ) for prompt in queries: print(f"User> {prompt}") # create a new turn with a new session ID for each prompt response = agent.create_turn( messages=[ { "role": "user", "content": prompt, } ], session_id=agent.create_session(f"rag-session_{uuid.uuid4()}") ) # print the response, including tool calls output for log in AgentEventLogger().log(response): print(log.content, end='') ```	2025-04-24 16:38:38 +02:00
Sébastien Han	14e60e3c02	feat: include run.yaml in the container image (#2005 ) As part of the build process, we now include the generated run.yaml (based on the provided build configuration file) into the container. We updated the entrypoint to use this run configuration as well. Given this simple distribution configuration: ``` # build.yaml version: '2' distribution_spec: description: Use (an external) Ollama server for running LLM inference providers: inference: - remote::ollama vector_io: - inline::faiss safety: - inline::llama-guard agents: - inline::meta-reference telemetry: - inline::meta-reference eval: - inline::meta-reference datasetio: - remote::huggingface - inline::localfs scoring: - inline::basic - inline::llm-as-judge - inline::braintrust tool_runtime: - remote::brave-search - remote::tavily-search - inline::code-interpreter - inline::rag-runtime - remote::model-context-protocol - remote::wolfram-alpha container_image: "registry.access.redhat.com/ubi9" image_type: container image_name: test ``` Build it: ``` llama stack build --config build.yaml ``` Run it: ``` podman run --rm \ -p 8321:8321 \ -e OLLAMA_URL=http://host.containers.internal:11434 \ --name llama-stack-server \ localhost/leseb-test:0.2.2 ``` Signed-off-by: Sébastien Han <seb@redhat.com>	2025-04-24 11:29:53 +02:00
Ben Browning	fa5dfee07b	fix: Return HTTP 400 for OpenAI API validation errors (#2002 ) # What does this PR do? When clients called the OpenAI API with invalid input that wasn't caught by our own Pydantic API validation but instead only caught by the backend inference provider, that backend inference provider was returning an HTTP 400 error. However, we were wrapping that into an HTTP 500 error, obfuscating the actual issue from calling clients and triggering OpenAI client retry logic. This change adjusts our existing `translate_exception` method in `server.py` to wrap `openai.BadRequestError` as HTTP 400 errors, passing through the string representation of the error message to the calling user so they can see the actual input validation error and correct it. I tried changing this in a few other places, but ultimately `translate_exception` was the only real place to handle this for both streaming and non-streaming requests across all inference providers that use the OpenAI server APIs. This also tightens up our validation a bit for the OpenAI chat completions API, to catch empty `messages` parameters, invalid `tool_choice` parameters, invalid `tools` items, or passing `tool_choice` when `tools` isn't given. Lastly, this extends our OpenAI API chat completions verifications to also check for consistent input validation across providers. Providers behind Llama Stack should automatically pass all the new tests due to the input validation added here, but some of the providers fail this test when not run behind Llama Stack due to differences in how they handle input validation and errors. (Closes #1951) ## Test Plan To test this, start an OpenAI API verification stack: ``` llama stack run --image-type venv tests/verifications/openai-api-verification-run.yaml ``` Then, run the new verification tests with your provider(s) of choice: ``` python -m pytest -s -v \ tests/verifications/openai_api/test_chat_completion.py \ --provider openai-llama-stack python -m pytest -s -v \ tests/verifications/openai_api/test_chat_completion.py \ --provider together-llama-stack ``` Signed-off-by: Ben Browning <bbrownin@redhat.com>	2025-04-23 17:48:32 +02:00
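The status-code mapping described in the #2002 message above can be sketched without any dependencies. This is an assumption-laden illustration rather than the real `translate_exception`: `BadRequestError` here is a local stand-in for `openai.BadRequestError`, and `HTTPError` is a hypothetical stand-in for the framework's HTTP exception type.

```python
class BadRequestError(Exception):
    """Local stand-in for openai.BadRequestError (assumption for this sketch)."""


class HTTPError(Exception):
    """Hypothetical stand-in for a web framework's HTTP exception."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def translate_exception(exc: Exception) -> HTTPError:
    # Client-side input problems keep their 400 status and original message,
    # so callers can see what to fix and do not retry a request that cannot succeed.
    if isinstance(exc, BadRequestError):
        return HTTPError(400, str(exc))
    # Everything else is treated as a server-side failure.
    return HTTPError(500, "Internal server error: An unexpected error occurred.")


bad_input = translate_exception(BadRequestError("tool_choice requires tools"))
server_fault = translate_exception(RuntimeError("backend unavailable"))
```

Keeping the provider's message in the 400 response is what lets a caller distinguish invalid input from a genuine server fault.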
Michael Clifford	64f747fe09	feat: add tool name to chat output in playground (#1996 ) # What does this PR do? This PR adds the name of the tool that is used by the agent on the "tools" page of the playground. See image below for an example. ![Screenshot 2025-04-18 at 3 14 18 PM](https://github.com/user-attachments/assets/04e97783-4003-4121-9446-9e0ad7209256) ## Test Plan Run the playground and navigate to the tools page. There users can see that this additional text is present when tools are invoked and absent when they are not. ``` streamlit run llama_stack/distribution/ui/app.py ``` Signed-off-by: Michael Clifford <mcliffor@redhat.com>	2025-04-23 15:57:54 +02:00
Ben Browning	dc46725f56	fix: properly handle streaming client disconnects (#2000 ) # What does this PR do? Previously, when a streaming client would disconnect before we were finished streaming the entire response, an error like the below would get raised from the `sse_generator` function in `llama_stack/distribution/server/server.py`: ``` AttributeError: 'coroutine' object has no attribute 'aclose'. Did you mean: 'close'? ``` This was because we were calling `aclose` on a coroutine instead of the awaited value from that coroutine. This change fixes that, so that we save off the awaited value and then can call `aclose` on it if we encounter an `asyncio.CancelledError`, like we see when a client disconnects before we're finished streaming. The other changes in here are to add a simple set of tests for the happy path of our SSE streaming and this client disconnect path. That unfortunately requires adding one more dependency into our unit test section of pyproject.toml since `server.py` requires loading some of the telemetry code for me to test this functionality. ## Test Plan I wrote the tests in `tests/unit/server/test_sse.py` first, verified the client disconnected test failed before my change, and that it passed afterwards. ``` python -m pytest -s -v tests/unit/server/test_sse.py ``` Signed-off-by: Ben Browning <bbrownin@redhat.com>	2025-04-23 15:44:28 +02:00
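The `aclose` problem described in the #2000 message above can be reproduced in a few lines. This sketch is not taken from `sse_generator`; `make_stream` and `consume_first_chunk` are hypothetical names, and an early `break` stands in for a client disconnect.

```python
import asyncio

events = []


async def make_stream():
    """Stand-in for a provider call: a coroutine that returns an async generator."""

    async def gen():
        try:
            for i in range(3):
                yield i
        finally:
            # Runs when the generator is closed, even if it was not exhausted.
            events.append("closed")

    return gen()


async def consume_first_chunk() -> None:
    coro = make_stream()
    # A coroutine object has close() but no aclose(); calling aclose() on it
    # is what produced the AttributeError quoted in the commit message.
    assert not hasattr(coro, "aclose")

    stream = await coro  # save the awaited value: the actual async generator
    try:
        async for chunk in stream:
            events.append(chunk)
            break  # simulate stopping early, as on a client disconnect
    finally:
        await stream.aclose()  # valid: async generators do have aclose()


asyncio.run(consume_first_chunk())
```

Holding on to the awaited value is the key step: cleanup then targets the generator itself, and its `finally` block runs even though only one chunk was consumed.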
Ilya Kolchinsky	deee355952	fix: Added lazy initialization of the remote vLLM client to avoid issues with expired asyncio event loop (#1969 ) # What does this PR do? Closes #1968. The asynchronous client in `VLLMInferenceAdapter` is now initialized directly before first use and not in `VLLMInferenceAdapter.initialize`. This prevents issues arising due to accessing an expired event loop from a completed `asyncio.run`. ## Test Plan Ran unit tests, including `test_remote_vllm.py`. Ran the code snippet mentioned in #1968. --------- Co-authored-by: Sébastien Han <seb@redhat.com>	2025-04-23 15:33:19 +02:00
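The event-loop issue behind #1969 can be illustrated with a small self-contained sketch. None of these classes come from `VLLMInferenceAdapter`: `FakeAsyncClient`, `EagerAdapter`, and `LazyAdapter` are hypothetical, and caching the client after first use is an assumption of this example.

```python
import asyncio


class FakeAsyncClient:
    """Hypothetical client that remembers the event loop it was created on."""

    def __init__(self) -> None:
        self.loop = asyncio.get_running_loop()


class EagerAdapter:
    """Creates the client in initialize(), binding it to that call's loop."""

    async def initialize(self) -> None:
        self._client = FakeAsyncClient()

    def get_client(self) -> FakeAsyncClient:
        return self._client


class LazyAdapter:
    """Creates the client on first use, inside whichever loop is then running."""

    def __init__(self) -> None:
        self._client = None

    async def initialize(self) -> None:
        pass  # intentionally does not create the client

    def get_client(self) -> FakeAsyncClient:
        if self._client is None:
            self._client = FakeAsyncClient()
        return self._client


async def uses_live_loop(adapter) -> bool:
    return adapter.get_client().loop is asyncio.get_running_loop()


eager, lazy = EagerAdapter(), LazyAdapter()
asyncio.run(eager.initialize())  # this loop is closed when asyncio.run returns
asyncio.run(lazy.initialize())
eager_ok = asyncio.run(uses_live_loop(eager))
lazy_ok = asyncio.run(uses_live_loop(lazy))
```

Each `asyncio.run` call creates and then closes its own loop, so the eagerly created client ends up tied to a loop that no longer runs, while the lazily created one is bound to the loop that actually serves the request.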
Ilya Kolchinsky	d39462d073	feat: Hide tool output under an expander in Playground UI (#2003 ) # What does this PR do? Now, tool outputs and retrieved chunks from the vector DB (i.e., everything except for the actual model reply) are hidden under an expander form when presented to the user. # Test Plan Navigate to the RAG page in the Playground UI.	2025-04-23 15:32:12 +02:00
Ben Browning	825ce39879	fix: Together provider shutdown and default to non-streaming (#2001 ) # What does this PR do? The together inference provider was throwing a stack trace every time it shut down, as it was trying to call a non-existent `close` method on the AsyncTogether client. While fixing that, I also adjusted its shutdown logic to close the OpenAI client if we've created one of those, as that client does have a `close` method. In testing that, I also realized we were defaulting to treating all requests as streaming requests instead of defaulting to non-streaming. So, this flips that default to non-streaming to match how the other providers work. ## Test Plan I tested this by ensuring the together inference provider no longer spits out a long stack trace when shutting it down and by running the OpenAI API chat completion verification suite to ensure the change in default streaming logic didn't mess anything else up. Signed-off-by: Ben Browning <bbrownin@redhat.com>	2025-04-22 17:47:53 +02:00
Michael Clifford	e4d001c4e4	feat: cleanup sidebar formatting on tools playground (#1998 ) # What does this PR do? This PR cleans up the sidebar on the tools page of the playground in the following ways: * created a clearer hierarchy of configuration options and tool selections. * Removed the `mcp::` or `builtin::` prefixes from the tool selection buttons. ## Test Plan Run the playground and see the updated sidebar does not cause any new errors. ``` streamlit run llama_stack/distribution/ui/app.py ``` Signed-off-by: Michael Clifford <mcliffor@redhat.com>	2025-04-22 10:40:37 +02:00
Kevin Postlethwait	3110ad1e7c	fix: update ref to raw_errors due to new version of pydantic (#1995 ) `37da47ef8e (diff-4d7c51b1efe9043e44439a949dfd92e5827321b34082903477fd04876edb7552)` Pydantic was updated from v1 to v2 in this commit which caused this breaking change # What does this PR do? Part of #1857 This won't fix the Validation error with the example, but it will correctly supply user with a proper error rather than a 5xx code. Signed-off-by: Kevin <kpostlet@redhat.com>	2025-04-21 11:50:12 -07:00
Ben Browning	602e949a46	fix: OpenAI Completions API and Fireworks (#1997 ) # What does this PR do? We were passing a dict into the compat mixin for OpenAI Completions when using Llama models with Fireworks, and that was breaking some strong typing code that was added in openai_compat.py. We shouldn't have been converting these params to a dict in that case anyway, so this adjusts things to pass the params in as their actual original types when calling the OpenAIChatCompletionToLlamaStackMixin. ## Test Plan All of the fireworks provider verification tests were failing due to some OpenAI compatibility cleanup in #1962. The changes in that PR were good to make, and this just cleans up the fireworks provider code to stop passing in untyped dicts to some of those `openai_compat.py` methods since we have the original strongly-typed parameters we can pass in. ``` llama stack run --image-type venv tests/verifications/openai-api-verification-run.yaml ``` ``` python -m pytest -s -v tests/verifications/openai_api/test_chat_completion.py --provider=fireworks-llama-stack ``` Before this PR, all of the fireworks OpenAI verification tests were failing. Now, most of them are passing. Signed-off-by: Ben Browning <bbrownin@redhat.com>	2025-04-21 11:49:12 -07:00
Jash Gulabrai	0d06c654d0	feat: Update NVIDIA to GA docs; remove notebook reference until ready (#1999 ) # What does this PR do? - Update NVIDIA documentation links to GA docs - Remove reference to notebooks until merged Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>	2025-04-18 19:13:18 -04:00
Sébastien Han	94f83382eb	feat: allow building distro with external providers (#1967 ) # What does this PR do? We can now build a distribution that includes external providers. Closes: https://github.com/meta-llama/llama-stack/issues/1948 ## Test Plan Build a distro with an external provider following the doc instructions. [//]: # (## Documentation) Added. Rendered: ![Screenshot 2025-04-18 at 11 26 39](https://github.com/user-attachments/assets/afcf3d50-8d30-48c3-8d24-06a4b3662881) Signed-off-by: Sébastien Han <seb@redhat.com>	2025-04-18 17:18:28 +02:00
Yuan Tang	c4570bcb48	docs: Add tips for debugging remote vLLM provider (#1992 ) # What does this PR do? This is helpful when debugging issues with vLLM + Llama Stack after this PR https://github.com/vllm-project/vllm/pull/15593 --------- Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>	2025-04-18 14:47:47 +02:00
Matthew Farrellee	9845631d51	feat: update nvidia inference provider to use model_store (#1988 ) # What does this PR do? NVIDIA Inference provider was using the ModelRegistryHelper to map input model ids to provider model ids. this updates it to use the model_store. ## Test Plan `LLAMA_STACK_CONFIG=http://localhost:8321 uv run pytest -v tests/integration/inference/{test_embedding.py,test_text_inference.py,test_openai_completion.py} --embedding-model nvidia/llama-3.2-nv-embedqa-1b-v2 --text-model=meta-llama/Llama-3.1-70B-Instruct`	2025-04-18 10:16:43 +02:00
Alexey Rybak	e72b1076ca	fix(build): add UBI 9 compiler tool‑chain (#1983 ) # What does this PR do? Fixes the UBI 9 container build failure ( `error: command 'gcc' failed` when installing `polyleven`, `faiss`, etc.) by installing the missing compiler tool‑chain: - `python3.11-devel`, `gcc`, and `make` added to the UBI 9 `dnf install` line. ### Closes #1970 ## Test Plan - Build a distro with a UBI image	2025-04-18 09:49:10 +02:00
ehhuang	2976b5d992	fix: OAI compat endpoint for meta reference inference provider (#1962 ) Test plan: python tests/verifications/generate_report.py --providers fireworks,together,llama_meta_ref,openai Co-authored-by: Eric Huang <erichuang@fb.com>	2025-04-17 11:16:04 -07:00
Alexey Rybak	326cbba579	feat(agents): add agent naming functionality (#1922 ) # What does this PR do? Allow users to name an agent and use the name in telemetry instead of relying on randomly generated agent_ids. This improves the developer experience by making it easier to find specific agents in telemetry logs. Closes #1832 ## Test Plan - Added tests to verify the agent name is properly stored and retrieved - Ran `uv run -- pytest -v tests/integration/telemetry/test_telemetry.py::test_agent_name_filtering` from the root of the project and made sure the tests pass - Ran `uv run -- pytest -v tests/integration/telemetry/test_telemetry.py::test_agent_query_spans` to verify existing code without agent names still works correctly ## Use Example ``` agent = Agent( llama_stack_client, model=text_model_id, name="CustomerSupportAgent", # New parameter instructions="You are a helpful customer support assistant" ) session_id = agent.create_session(f"test-session-{uuid4()}") ``` ## Implementation Notes - Agent names are optional string parameters with no additional validation - Names are not required to be unique - multiple agents can have the same name - The agent_id remains the unique identifier for an agent --------- Co-authored-by: raghotham <raghotham@gmail.com>	2025-04-17 07:02:47 -07:00
Ben Browning	5b8e75b392	fix: OpenAI spec cleanup for assistant requests (#1963 ) # What does this PR do? Some of our multi-turn verification tests were failing because I had accidentally marked content as a required field in the OpenAI chat completion request assistant messages, but it's actually optional. It is required for messages from other roles, but assistant is explicitly allowed to be optional. Similarly, the assistant message tool_calls field should default to None instead of an empty list. These two changes get the openai-llama-stack verification test back to 100% passing, just like it passes 100% when not behind Llama Stack. They also increase the pass rate of some of the other providers in the verification test, but don't get them to 100%. ## Test Plan I started a Llama Stack server setup to run all the verification tests (requires OPENAI_API_KEY env variable) ``` llama stack run --image-type venv tests/verifications/openai-api-verification-run.yaml ``` Then, I manually ran the verification tests to see which were failing, fix them, and ran them again after these changes to ensure they were all passing. ``` python -m pytest -s -v tests/verifications/openai_api/test_chat_completion.py --provider=openai-llama-stack ``` Signed-off-by: Ben Browning <bbrownin@redhat.com>	2025-04-17 06:56:10 -07:00
Matthew Farrellee	4205376653	chore: add meta/llama-3.3-70b-instruct as supported nvidia inference provider model (#1985 ) see https://build.nvidia.com/meta/llama-3_3-70b-instruct	2025-04-17 06:50:40 -07:00
Jash Gulabrai	2ae1d7f4e6	docs: Add NVIDIA platform distro docs (#1971 ) # What does this PR do? Add NVIDIA platform docs that serve as a starting point for Llama Stack users and explain all supported microservices. --------- Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>	2025-04-17 05:54:30 -07:00
Jash Gulabrai	45e08ff417	fix: Handle case when Customizer Job status is unknown (#1965 ) # What does this PR do? This PR handles the case where a Customization Job's status is `unknown`. Since we don't map `unknown` to a valid `JobStatus`, the PostTraining provider throws an exception when fetching/listing a job. ## Test Plan `./scripts/unit-tests.sh tests/unit/providers/nvidia/test_supervised_fine_tuning.py` succeeds Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>	2025-04-17 10:27:07 +02:00
Alexey Rybak	8f57b08f2c	fix(build): always pass path when no template/config provided (#1982 ) # What does this PR do? Fixes a crash that occurred when building a stack as a container image via the interactive wizard without supplying --template or --config. - Root cause: template_or_config was None; only the container path relies on that parameter, which later reaches subprocess.run() and triggers `TypeError: expected str, bytes or os.PathLike object, not NoneType.` - Change: in `_run_stack_build_command_from_build_config` we now fall back to the freshly‑written build‑spec file whenever both optional sources are missing. Also adds a spy‑based unit test that asserts a valid string path is passed to build_image() for container builds. ### Closes #1976 ## Test Plan - New unit test: test_build_path.py. Monkey‑patches build_image, captures the fourth argument, and verifies it is a real path - Manual smoke test: ``` llama stack build --image-type container # answer wizard prompts ``` Build proceeds into Docker without raising the previous TypeError. ## Future Work Harmonise `build_image` arguments so every image type receives the same inputs, eliminating this asymmetric special‑case.	2025-04-17 10:20:43 +02:00
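The fallback described in the #1982 message above amounts to never passing `None` where a path is expected. A minimal sketch, assuming a hypothetical helper `resolve_build_path` (the real logic lives in `_run_stack_build_command_from_build_config`, whose signature is not shown in the commit message):

```python
from pathlib import Path
from typing import Optional


def resolve_build_path(template_or_config: Optional[str], build_spec_file: Path) -> str:
    """Return a usable path string even when no template or config was supplied."""
    if template_or_config is not None:
        return template_or_config
    # Fall back to the freshly written build-spec file so downstream code
    # (such as a subprocess call) never receives None in place of a path.
    return str(build_spec_file)


spec = Path("distributions") / "my-stack-build.yaml"  # hypothetical location
from_wizard = resolve_build_path(None, spec)
from_flag = resolve_build_path("build.yaml", spec)
```

With the fallback in place, the interactive path and the `--config` path both hand a string to the image builder.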
Sébastien Han	6ed92e03bc	fix: print traceback on build failure (#1966 ) # What does this PR do? Build failures are hard to read, sometimes we get errors like: ``` Error building stack: 'key' ``` Which are difficult to debug without a proper trace. ## Test Plan If `llama stack build` fails you get a traceback now. Signed-off-by: Sébastien Han <seb@redhat.com>	2025-04-17 09:45:21 +02:00
Michael Clifford	f12011794b	fix: Updated tools playground to allow vdb selection (#1960 ) # What does this PR do? This PR lets users select an existing vdb to use with their agent on the tools page of the playground. The drop down menu that lets users select a vdb only appears when the rag tool is selected. Without this change, there is no way for a user to specify which vdb they want their rag tool to use on the tools page. I have intentionally left the RAG options sparse here since the full RAG options are exposed on the RAG page. ## Test Plan Without these changes the RAG tool will throw the following error: `name: knowledge_search) does not have any content ` With these changes the RAG tool works as expected. Signed-off-by: Michael Clifford <mcliffor@redhat.com>	2025-04-17 09:29:40 +02:00
Jash Gulabrai	30fc66923b	fix: Add llama-3.2-1b-instruct to NVIDIA fine-tuned model list (#1975 ) # What does this PR do? Adds `meta/llama-3.2-1b-instruct` to list of models that NeMo Customizer can fine-tune. This is the model our example notebooks typically use for fine-tuning. Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>	2025-04-16 15:02:08 -07:00
Daniel Alvarez Sanchez	b5a9ef4c6d	fix: Do not send an empty 'tools' list to remote vllm (#1957 ) Fixes: #1955 Since 0.2.0, the vLLM gets an empty list (vs ``None`` in 0.1.9 and before) when there are no tools configured, which causes the issue described in #1955. This patch avoids sending the 'tools' param to the vLLM altogether instead of an empty list. It also adds a small unit test to avoid regressions. The OpenAI [specification](https://platform.openai.com/docs/api-reference/chat/create) does not explicitly state that the list cannot be empty but I found this out through experimentation and it might depend on the actual remote vllm. In any case, as this parameter is Optional, it is best to skip it altogether if there are no tools configured. Signed-off-by: Daniel Alvarez <dalvarez@redhat.com>	2025-04-15 20:31:12 -04:00
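The approach in the #1957 message above, omitting the key instead of sending an empty list, can be sketched as a small request builder. `build_chat_params` is a hypothetical helper written for this example, not a function from the vLLM adapter.

```python
from typing import Any, Optional


def build_chat_params(
    model: str,
    messages: list[dict[str, Any]],
    tools: Optional[list[dict[str, Any]]] = None,
) -> dict[str, Any]:
    """Build request parameters, leaving out 'tools' when none are configured."""
    params: dict[str, Any] = {"model": model, "messages": messages}
    # Both None and [] are falsy, so the key is only added for a non-empty list.
    if tools:
        params["tools"] = tools
    return params


messages = [{"role": "user", "content": "Hello"}]
without_tools = build_chat_params("meta-llama/Llama-3.1-8B-Instruct", messages, tools=[])
with_tools = build_chat_params(
    "meta-llama/Llama-3.1-8B-Instruct",
    messages,
    tools=[{"type": "function", "function": {"name": "get_weather"}}],
)
```

Treating `None` and `[]` the same way keeps the request valid regardless of which one the caller passes.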
Michael Clifford	093881071a	fix: add max_tokens slider to playground tools page (#1958 ) # What does this PR do? This PR adds a `max_tokens` slider to playground tools page. I have found that in some instances the llama stack server throws a 500 error if the max_tokens value is not explicitly set in the agent's `sampling_params`. This PR uses the same implementation of the `max_tokens` slider from the chat page, and includes it on the tools page. ## Test Plan 1. Attempting to call a tool without these changes results in a `500: Internal server error: An unexpected error occurred`. 2. Attempting to call a tool with these changes results in the expected output. Signed-off-by: Michael Clifford <mcliffor@redhat.com>	2025-04-15 09:11:08 -07:00
Dmitry Rogozhkin	71ed47ea76	docs: add example for intel gpu in vllm remote (#1952 ) # What does this PR do? PR adds instructions to set up a vLLM remote endpoint for the vllm-remote llama stack distribution. ## Test Plan * Verified with manual tests of the configured vllm-remote against a vllm endpoint running on a system with an Intel GPU * Also verified with ci pytests (see cmdline below). Test passes in the same capacity as it does on the A10 Nvidia setup (some tests do fail, which seem to be known issues with the vllm remote llama stack distribution) ``` pytest -s -v tests/integration/inference/test_text_inference.py \ --stack-config=http://localhost:5001 \ --text-model=meta-llama/Llama-3.2-3B-Instruct ``` CC: @ashwinb Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>	2025-04-15 07:56:23 -07:00
Charlie Doern	83b5523e2d	feat: add `--providers` to llama stack build (#1718 ) # What does this PR do? allow users to specify only the providers they want in the llama stack build command. If a user wants a non-interactive build, but doesn't want to use a template, `--providers` allows someone to specify something like `--providers inference=remote::ollama` for a distro with JUST ollama ## Test Plan `llama stack build --providers inference=remote::ollama --image-type venv` <img width="1084" alt="Screenshot 2025-03-20 at 9 34 14 AM" src="https://github.com/user-attachments/assets/502b5fa2-edab-4267-a595-4f987204a6a9" /> `llama stack run --image-type venv /Users/charliedoern/projects/Documents/llama-stack/venv-run.yaml` <img width="1149" alt="Screenshot 2025-03-20 at 9 35 19 AM" src="https://github.com/user-attachments/assets/433765f3-6b7f-4383-9241-dad085b69228" /> --------- Signed-off-by: Charlie Doern <cdoern@redhat.com> Signed-off-by: Sébastien Han <seb@redhat.com> Co-authored-by: Sébastien Han <seb@redhat.com>	2025-04-15 14:17:03 +02:00
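A value such as `inference=remote::ollama` from 83b5523e2d might be parsed along these lines. This is a sketch under assumptions: `parse_providers` is a hypothetical name, and accepting several comma-separated pairs is an assumption not shown in the commit message.

```python
def parse_providers(spec: str) -> dict[str, list[str]]:
    """Hypothetical parser for 'api=provider_type' pairs.

    Assumes pairs are comma-separated; the source only shows a single pair.
    """
    providers: dict[str, list[str]] = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        if "=" not in item:
            raise ValueError(f"expected 'api=provider', got {item!r}")
        # Split on the first '=' only, so 'remote::ollama' stays intact.
        api, provider = item.split("=", 1)
        providers.setdefault(api.strip(), []).append(provider.strip())
    return providers
```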
Peter Double	86c6f1f112	fix: FastAPI built-in paths bypass custom routing (Docs) and update r… (#1841 ) ## What does this PR do? This PR improves the server's request routing logic by ensuring built-in FastAPI paths such as `/docs`, `/redoc`, `/openapi.json`, `/favicon.ico`, and `/static` bypass the custom `TracingMiddleware`. This prevents unnecessary tracing logic for documentation and static file requests, ensuring better performance and cleaner logs. Additionally, it adds proper metadata (`title`, `description`, and `version`) to the FastAPI application initialization and updates the requirements document accordingly. [//]: # (Closes #1822 ) --- ## Test Plan - Ran the server locally with `uvicorn` using the provided `run.yaml` config - Verified that: - FastAPI docs (`/docs`, `/redoc`) load correctly without triggering the custom tracing middleware - All other routes still go through the middleware and trace logic - Application metadata appears as expected in the OpenAPI docs To reproduce: 1. Start the server with `python server.py --template <template-name>` 2. Navigate to `/docs` and `/redoc` 3. Confirm that no extra trace headers are added for those routes 4. Confirm other API endpoints behave as expected and include `x-trace-id` in the response headers [//]: # (## Documentation) --- Froze the requirements file to include many of the other libraries that have been added in the past few releases to make install easier. --------- Co-authored-by: Sébastien Han <seb@redhat.com>	2025-04-14 13:28:25 -04:00
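The bypass described in 86c6f1f112 can be illustrated with a plain ASGI middleware. This is a minimal sketch, not the project's `TracingMiddleware`: the class name, the prefix tuple, and the `traced_paths` list (a stand-in for real tracing) are all hypothetical.

```python
# Paths named in the commit message; matched here as simple prefixes.
BYPASS_PREFIXES = ("/docs", "/redoc", "/openapi.json", "/favicon.ico", "/static")


class TracingMiddlewareSketch:
    """Hypothetical ASGI middleware that skips tracing for built-in paths."""

    def __init__(self, app):
        self.app = app
        self.traced_paths: list[str] = []  # stand-in for real trace spans

    async def __call__(self, scope, receive, send):
        path = scope.get("path", "")
        # Non-HTTP scopes and built-in FastAPI paths go straight through.
        if scope.get("type") != "http" or path.startswith(BYPASS_PREFIXES):
            await self.app(scope, receive, send)
            return
        # Everything else is "traced" before being handed to the app.
        self.traced_paths.append(path)
        await self.app(scope, receive, send)
```

Prefix matching is deliberately coarse here (it would also match a path like `/docsomething`); a stricter check may be preferable in practice.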
Nathan Weinberg	cf158f2cb9	feat: allow ollama to use 'latest' if available but not specified (#1903 ) # What does this PR do? ollama's CLI supports running models via commands such as 'ollama run llama3.2' this syntax does not work with the INFERENCE_MODEL llamastack var as currently specifying a tag such as 'latest' is required this commit will check to see if the 'latest' model is available and use that model if a user passes a model name without a tag but the 'latest' is available in ollama ## Test Plan Behavior pre-code change ```bash $ INFERENCE_MODEL=llama3.2 llama stack build --template ollama --image-type venv --run ... INFO 2025-04-08 13:42:42,842 llama_stack.providers.remote.inference.ollama.ollama:80 inference: checking connectivity to Ollama at `http://beanlab1.bss.redhat.com:11434`... Traceback (most recent call last): File "<frozen runpy>", line 198, in _run_module_as_main File "<frozen runpy>", line 88, in _run_code File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/server/server.py", line 502, in <module> main() File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/server/server.py", line 401, in main impls = asyncio.run(construct_stack(config)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib64/python3.12/asyncio/runners.py", line 195, in run return runner.run(main) ^^^^^^^^^^^^^^^^ File "/usr/lib64/python3.12/asyncio/runners.py", line 118, in run return self._loop.run_until_complete(task) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib64/python3.12/asyncio/base_events.py", line 691, in run_until_complete return future.result() ^^^^^^^^^^^^^^^ File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/stack.py", line 222, in construct_stack await register_resources(run_config, impls) File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/stack.py", line 99, in register_resources await method(*obj.model_dump()) File 
"/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/providers/utils/telemetry/trace_protocol.py", line 102, in async_wrapper result = await method(self, args, *kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/routers/routing_tables.py", line 294, in register_model registered_model = await self.register_object(model) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/routers/routing_tables.py", line 228, in register_object registered_obj = await register_object_with_provider(obj, p) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/routers/routing_tables.py", line 77, in register_object_with_provider return await p.register_model(obj) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/providers/utils/telemetry/trace_protocol.py", line 102, in async_wrapper result = await method(self, args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/providers/remote/inference/ollama/ollama.py", line 315, in register_model raise ValueError( ValueError: Model 'llama3.2' is not available in Ollama. Available models: llama3.2:latest ++ error_handler 108 ++ echo 'Error occurred in script at line: 108' Error occurred in script at line: 108 ++ exit 1 ``` Behavior post-code change ```bash $ INFERENCE_MODEL=llama3.2 llama stack build --template ollama --image-type venv --run ... INFO 2025-04-08 13:58:17,365 llama_stack.providers.remote.inference.ollama.ollama:80 inference: checking connectivity to Ollama at `http://beanlab1.bss.redhat.com:11434`... 
WARNING 2025-04-08 13:58:18,190 llama_stack.providers.remote.inference.ollama.ollama:317 inference: Imprecise provider resource id was used but 'latest' is available in Ollama - using 'llama3.2:latest' INFO 2025-04-08 13:58:18,191 llama_stack.providers.remote.inference.ollama.ollama:308 inference: Pulling embedding model `all-minilm:latest` if necessary... INFO 2025-04-08 13:58:18,799 __main__:478 server: Listening on ['::', '0.0.0.0']:8321 INFO: Started server process [28378] INFO: Waiting for application startup. INFO 2025-04-08 13:58:18,803 __main__:148 server: Starting up INFO: Application startup complete. INFO: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit) ... ``` ## Documentation Did not document this anywhere but happy to do so if there is an appropriate place Signed-off-by: Nathan Weinberg <nweinber@redhat.com>	2025-04-14 09:03:54 -07:00
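The fallback behaviour shown in the logs for cf158f2cb9 could look roughly like this. It is only a sketch: `resolve_model_name` is a hypothetical function, and the real provider logs a warning instead of resolving silently.

```python
def resolve_model_name(requested: str, available: list[str]) -> str:
    """Hypothetical resolver: fall back to '<name>:latest' for untagged names."""
    if requested in available:
        return requested
    # Only untagged names get the ':latest' fallback.
    if ":" not in requested:
        candidate = f"{requested}:latest"
        if candidate in available:
            return candidate
    raise ValueError(
        f"Model '{requested}' is not available in Ollama. "
        f"Available models: {', '.join(available)}"
    )
```

An explicitly tagged name that is missing still raises, so the fallback never substitutes a different tag than the one the user asked for.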
Ihar Hrachyshka	3ed4316ed5	feat: Implement async job execution for torchtune training (#1437 ) # What does this PR do? Now a separate thread is started to execute training jobs. Training requests now return job ID before the job completes. (Which fixes API timeouts for any jobs that take longer than a minute.) Note: the scheduler code is meant to be spun out in the future into a common provider service that can be reused for different APIs and providers. It is also expected to back the /jobs API proposed here: https://github.com/meta-llama/llama-stack/discussions/1238 Hence its somewhat generalized form which is expected to simplify its adoption elsewhere in the future. Note: this patch doesn't attempt to implement missing APIs (e.g. cancel or job removal). This work will belong to follow-up PRs. [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan [Describe the tests you ran to verify your changes with result summaries. Provide clear instructions so the plan can be easily re-executed.] Added unit tests for the scheduler module. For the API coverage, did manual testing and was able to run a training cycle on GPU. The initial call returned job ID before the training completed, as (now) expected. Artifacts are returned as expected. ``` JobArtifactsResponse(checkpoints=[{'identifier': 'meta-llama/Llama-3.2-3B-Instruct-sft-0', 'created_at': '2025-03-07T22:45:19.892714', 'epoch': 0, 'post_training_job_id': 'test-job2ee77104-2fd3-4a4e-84cf-f83f8b8f1f50', 'path': '/home/ec2-user/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0', 'training_metrics': None}], job_uuid='test-job2ee77104-2fd3-4a4e-84cf-f83f8b8f1f50') ``` The integration test is currently disabled for the provider. I will look into how it can be enabled in a different PR / issue context. [//]: # (## Documentation) Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>	2025-04-14 08:59:11 -07:00
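The pattern described in 3ed4316ed5 (start a thread per job and return the job ID before the job completes) can be sketched as below. This is not the scheduler from the PR: the class, its method names, and the status strings are hypothetical.

```python
import threading
import uuid
from typing import Callable, Optional


class JobSchedulerSketch:
    """Hypothetical scheduler: runs each job in its own thread."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._status: dict[str, str] = {}
        self._threads: dict[str, threading.Thread] = {}

    def schedule(self, fn: Callable[[], None]) -> str:
        job_id = str(uuid.uuid4())
        self._set_status(job_id, "scheduled")
        thread = threading.Thread(target=self._run, args=(job_id, fn), daemon=True)
        self._threads[job_id] = thread
        thread.start()
        # Return immediately; the caller polls status() with this ID.
        return job_id

    def _run(self, job_id: str, fn: Callable[[], None]) -> None:
        self._set_status(job_id, "running")
        try:
            fn()
            self._set_status(job_id, "completed")
        except Exception:
            self._set_status(job_id, "failed")

    def _set_status(self, job_id: str, status: str) -> None:
        with self._lock:
            self._status[job_id] = status

    def status(self, job_id: str) -> str:
        with self._lock:
            return self._status[job_id]

    def wait(self, job_id: str, timeout: Optional[float] = None) -> None:
        self._threads[job_id].join(timeout)
```

Because `schedule` returns as soon as the thread starts, an API handler built on it would not be held open for the duration of a long training run.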
Ben Browning	7641a5cd0b	fix: 100% OpenAI API verification for together and fireworks (#1946 ) # What does this PR do? TLDR: Changes needed to get 100% passing tests for OpenAI API verification tests when run against Llama Stack with the `together`, `fireworks`, and `openai` providers. And `groq` is better than before, at 88% passing. This cleans up the OpenAI API support for image message types (specifically `image_url` types) and handling of the `response_format` chat completion parameter. Both of these required a few more Pydantic model definitions in our Inference API, just to move from the not-quite-right stubs I had in place to something fleshed out to match the actual OpenAI API specs. As part of testing this, I also found and fixed a bug in the litellm implementation of openai_completion and openai_chat_completion, so the providers based on those should actually be working now. The method `prepare_openai_completion_params` in `llama_stack/providers/utils/inference/openai_compat.py` was improved to actually recursively clean up input parameters, including handling of lists, dicts, and dumping of Pydantic models to dicts. These changes were required to get to 100% passing tests on the OpenAI API verification against the `openai` provider. With the above, the together.ai provider was passing as well as it is without Llama Stack. But, since we have Llama Stack in the middle, I took the opportunity to clean up the together.ai provider so that it now also passes the OpenAI API spec tests we have at 100%. That means together.ai is now passing our verification test better when using an OpenAI client talking to Llama Stack than it is when hitting together.ai directly, without Llama Stack in the middle. And, another round of work for Fireworks to improve translation of incoming OpenAI chat completion requests to Llama Stack chat completion requests gets the fireworks provider passing at 100%. 
The server-side fireworks.ai tool calling support with OpenAI chat completions and Llama 4 models isn't great yet, but by pointing the OpenAI clients at Llama Stack's API we can clean things up and get everything working as expected for Llama 4 models. ## Test Plan ### OpenAI API Verification Tests I ran the OpenAI API verification tests as below and 100% of the tests passed. First, start a Llama Stack server that runs the `openai` provider with the `gpt-4o` and `gpt-4o-mini` models deployed. There's not a template setup to do this out of the box, so I added a `tests/verifications/openai-api-verification-run.yaml` to do this. First, ensure you have the necessary API key environment variables set: ``` export TOGETHER_API_KEY="..." export FIREWORKS_API_KEY="..." export OPENAI_API_KEY="..." ``` Then, run a Llama Stack server that serves up all these providers: ``` llama stack run \ --image-type venv \ tests/verifications/openai-api-verification-run.yaml ``` Finally, generate a new verification report against all these providers, both with and without the Llama Stack server in the middle. ``` python tests/verifications/generate_report.py \ --run-tests \ --provider \ together \ fireworks \ groq \ openai \ together-llama-stack \ fireworks-llama-stack \ groq-llama-stack \ openai-llama-stack ``` You'll see that most of the configurations with Llama Stack in the middle now pass at 100%, even though some of them do not pass at 100% when hitting the backend provider's API directly with an OpenAI client. ### OpenAI Completion Integration Tests with vLLM: I also ran the smaller `test_openai_completion.py` test suite (that's not yet merged with the verification tests) on multiple of the providers, since I had to adjust the method signature of openai_chat_completion a bit and thus had to touch lots of these providers to match. 
Here's the tests I ran there, all passing: ``` VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" llama stack build --template remote-vllm --image-type venv --run ``` in another terminal ``` LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct" ``` ### OpenAI Completion Integration Tests with ollama ``` INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" llama stack build --template ollama --image-type venv --run ``` in another terminal ``` LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-q8_0" ``` ### OpenAI Completion Integration Tests with together.ai ``` INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct-Turbo" llama stack build --template together --image-type venv --run ``` in another terminal ``` LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct-Turbo" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct-Turbo" ``` ### OpenAI Completion Integration Tests with fireworks.ai ``` INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct" llama stack build --template fireworks --image-type venv --run ``` in another terminal ``` LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.1-8B-Instruct" --------- Signed-off-by: Ben Browning <bbrownin@redhat.com>	2025-04-14 08:56:29 -07:00
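The recursive clean-up that 7641a5cd0b describes for `prepare_openai_completion_params` (lists, dicts, and Pydantic models dumped to dicts) might be sketched as follows. `clean_params` is a hypothetical stand-in, models are detected by duck-typing on `model_dump`, and dropping `None` values from dicts is an assumption rather than something the commit message states.

```python
from typing import Any


def clean_params(value: Any) -> Any:
    """Hypothetical recursive cleaner for request parameters."""
    # Pydantic v2 models expose model_dump(); duck-typing avoids a hard
    # dependency on pydantic in this sketch.
    if hasattr(value, "model_dump"):
        return clean_params(value.model_dump())
    if isinstance(value, dict):
        # Assumption: keys whose value is None are dropped.
        return {k: clean_params(v) for k, v in value.items() if v is not None}
    if isinstance(value, (list, tuple)):
        return [clean_params(v) for v in value]
    return value
```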
Sébastien Han	69554158fa	feat: add health to all providers through providers endpoint (#1418 ) The `/v1/providers` now reports the health status of each provider when implemented. ``` curl -L http://127.0.0.1:8321/v1/providers\|jq % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 4072 100 4072 0 0 246k 0 --:--:-- --:--:-- --:--:-- 248k { "data": [ { "api": "inference", "provider_id": "ollama", "provider_type": "remote::ollama", "config": { "url": "http://localhost:11434" }, "health": { "status": "OK" } }, { "api": "vector_io", "provider_id": "faiss", "provider_type": "inline::faiss", "config": { "kvstore": { "type": "sqlite", "namespace": null, "db_path": "/Users/leseb/.llama/distributions/ollama/faiss_store.db" } }, "health": { "status": "Not Implemented", "message": "Provider does not implement health check" } }, { "api": "safety", "provider_id": "llama-guard", "provider_type": "inline::llama-guard", "config": { "excluded_categories": [] }, "health": { "status": "Not Implemented", "message": "Provider does not implement health check" } }, { "api": "agents", "provider_id": "meta-reference", "provider_type": "inline::meta-reference", "config": { "persistence_store": { "type": "sqlite", "namespace": null, "db_path": "/Users/leseb/.llama/distributions/ollama/agents_store.db" } }, "health": { "status": "Not Implemented", "message": "Provider does not implement health check" } }, { "api": "telemetry", "provider_id": "meta-reference", "provider_type": "inline::meta-reference", "config": { "service_name": "llama-stack", "sinks": "console,sqlite", "sqlite_db_path": "/Users/leseb/.llama/distributions/ollama/trace_store.db" }, "health": { "status": "Not Implemented", "message": "Provider does not implement health check" } }, { "api": "eval", "provider_id": "meta-reference", "provider_type": "inline::meta-reference", "config": { "kvstore": { "type": "sqlite", "namespace": null, "db_path": 
"/Users/leseb/.llama/distributions/ollama/meta_reference_eval.db" } }, "health": { "status": "Not Implemented", "message": "Provider does not implement health check" } }, { "api": "datasetio", "provider_id": "huggingface", "provider_type": "remote::huggingface", "config": { "kvstore": { "type": "sqlite", "namespace": null, "db_path": "/Users/leseb/.llama/distributions/ollama/huggingface_datasetio.db" } }, "health": { "status": "Not Implemented", "message": "Provider does not implement health check" } }, { "api": "datasetio", "provider_id": "localfs", "provider_type": "inline::localfs", "config": { "kvstore": { "type": "sqlite", "namespace": null, "db_path": "/Users/leseb/.llama/distributions/ollama/localfs_datasetio.db" } }, "health": { "status": "Not Implemented", "message": "Provider does not implement health check" } }, { "api": "scoring", "provider_id": "basic", "provider_type": "inline::basic", "config": {}, "health": { "status": "Not Implemented", "message": "Provider does not implement health check" } }, { "api": "scoring", "provider_id": "llm-as-judge", "provider_type": "inline::llm-as-judge", "config": {}, "health": { "status": "Not Implemented", "message": "Provider does not implement health check" } }, { "api": "scoring", "provider_id": "braintrust", "provider_type": "inline::braintrust", "config": { "openai_api_key": "******" }, "health": { "status": "Not Implemented", "message": "Provider does not implement health check" } }, { "api": "tool_runtime", "provider_id": "brave-search", "provider_type": "remote::brave-search", "config": { "api_key": "****", "max_results": 3 }, "health": { "status": "Not Implemented", "message": "Provider does not implement health check" } }, { "api": "tool_runtime", "provider_id": "tavily-search", "provider_type": "remote::tavily-search", "config": { "api_key": "****", "max_results": 3 }, "health": { "status": "Not Implemented", "message": "Provider does not implement health check" } }, { "api": "tool_runtime", 
"provider_id": "code-interpreter", "provider_type": "inline::code-interpreter", "config": {}, "health": { "status": "Not Implemented", "message": "Provider does not implement health check" } }, { "api": "tool_runtime", "provider_id": "rag-runtime", "provider_type": "inline::rag-runtime", "config": {}, "health": { "status": "Not Implemented", "message": "Provider does not implement health check" } }, { "api": "tool_runtime", "provider_id": "model-context-protocol", "provider_type": "remote::model-context-protocol", "config": {}, "health": { "status": "Not Implemented", "message": "Provider does not implement health check" } }, { "api": "tool_runtime", "provider_id": "wolfram-alpha", "provider_type": "remote::wolfram-alpha", "config": { "api_key": "******" }, "health": { "status": "Not Implemented", "message": "Provider does not implement health check" } } ] } ``` Per providers too: ``` curl -L http://127.0.0.1:8321/v1/providers/ollama {"api":"inference","provider_id":"ollama","provider_type":"remote::ollama","config":{"url":"http://localhost:11434"},"health":{"status":"OK"}} ``` Signed-off-by: Sébastien Han <seb@redhat.com>	2025-04-14 11:59:36 +02:00
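A response shaped like the `/v1/providers` output in 69554158fa can be summarised by health status with a few lines of Python. This sketch assumes the JSON has already been decoded into a dict; `group_by_health` is a hypothetical helper, and the `"unknown"` bucket for entries without a `health` field is an assumption.

```python
from typing import Any


def group_by_health(response: dict[str, Any]) -> dict[str, list[str]]:
    """Hypothetical helper: map each health status to its provider IDs."""
    grouped: dict[str, list[str]] = {}
    for provider in response.get("data", []):
        # Assumption: a provider without a health entry is bucketed as "unknown".
        status = provider.get("health", {}).get("status", "unknown")
        grouped.setdefault(status, []).append(provider["provider_id"])
    return grouped
```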
Ashwin Bharambe	429f6de7d7	fix: misc fixes for tests kill horrible warnings	2025-04-12 17:12:11 -07:00
Ashwin Bharambe	8b4158169f	fix: dont check protocol compliance for experimental methods	2025-04-12 16:26:32 -07:00
ehhuang	ad86a68a32	feat: support '-' in tool names (#1807 ) # What does this PR do? titled ## Test Plan added new unit tests pytest -s -v tests/unit/models/llama/llama3/test_tool_utils.py	2025-04-12 14:23:03 -07:00
ehhuang	1e5bf6c19d	feat: update default tool use prompt (#1803 ) # What does this PR do? A user reports in https://github.com/meta-llama/llama-stack/issues/1769#issuecomment-2755564632 that the Agent uses a tool even on the prompt 'Hello'. Updated the default prompt. Also moved the instruction part out of `function_description` so that the user can override it if desired. ## Test Plan <img width="1344" alt="image" src="https://github.com/user-attachments/assets/c606d65d-071f-4211-a719-b4742676acda" /> Also, performance on 100 hotpotqa questions is similar to the current prompt.	2025-04-12 11:54:22 -07:00
Ashwin Bharambe	f34f22f8c7	feat: add batch inference API to llama stack inference (#1945 ) # What does this PR do? This PR adds two methods to the Inference API: - `batch_completion` - `batch_chat_completion` The motivation is for evaluations targeting a local inference engine (like meta-reference or vllm) where batch APIs provide for a substantial amount of acceleration. Why did I not add this to `Api.batch_inference` though? That just resulted in a _lot_ more book-keeping given the structure of Llama Stack. Had I done that, I would have needed to create a notion of a "batch model" resource, setup routing based on that, etc. This does not sound ideal. So what's the future of the batch inference API? I am not sure. Maybe we can keep it for true _asynchronous_ execution. So you can submit requests, and it can return a Job instance, etc. ## Test Plan Run meta-reference-gpu using: ```bash export INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct export INFERENCE_CHECKPOINT_DIR=../checkpoints/Llama-4-Scout-17B-16E-Instruct-20250331210000 export MODEL_PARALLEL_SIZE=4 export MAX_BATCH_SIZE=32 export MAX_SEQ_LEN=6144 LLAMA_MODELS_DEBUG=1 llama stack run meta-reference-gpu ``` Then run the batch inference test case.	2025-04-12 11:41:12 -07:00
Nathan Weinberg	854c2ad264	fix: misleading help text for 'llama stack build' and 'llama stack run' (#1910 ) # What does this PR do? The current text for 'llama stack build' and 'llama stack run' says that if no argument is passed to '--image-name', the active Conda environment will be used. In reality, the active environment is used whether it is from conda, virtualenv, etc. ## Test Plan N/A ## Documentation N/A Signed-off-by: Nathan Weinberg <nweinber@redhat.com>	2025-04-12 01:19:11 -07:00
Charlie Doern	0751a960a5	feat: make training config fields optional (#1861 ) # What does this PR do? Today, supervised_fine_tune itself and the `TrainingConfig` class have a bunch of required fields that a provider implementation might not need. For example, if a provider wants to handle hyperparameters in its configuration as well as any type of dataset retrieval, optimizer or LoRA config, a user will still need to pass in a virtually empty `DataConfig`, `OptimizerConfig` and `AlgorithmConfig` in some cases. Many of these fields are intended to work specifically with llama models and knobs intended for customizing inline. Adding remote post_training providers will require loosening these arguments, or forcing users to pass in empty objects to satisfy the pydantic models. Signed-off-by: Charlie Doern <cdoern@redhat.com>	2025-04-12 01:13:45 -07:00
Ashwin Bharambe	70a7e4d51e	fix: unhide python_start, python_end	2025-04-11 20:30:44 -07:00
Aidan Reilly	51492bd9b6	docs: Update docs and fix warning in start-stack.sh (#1937 ) Small docs update and an update for `start-stack.sh` with missing color and if statement logic. # What does this PR do? 1. Makes a small change to start-stack.sh to resolve this error: ```cmd /home/aireilly/.local/lib/python3.13/site-packages/llama_stack/distribution/start_stack.sh: line 76: [: missing ]' ``` 2. Adds a missing $GREEN colour to start-stack.sh 3. Updated `docs/source/getting_started/detailed_tutorial.md` with some small changes and corrections. ## Test Plan Procedures described in `docs/source/getting_started/detailed_tutorial.md` were verified on Linux Fedora 41.	2025-04-11 16:26:17 -07:00

1069 commits