llama-stack

forked from phoenix-oss/llama-stack-mirror

Author	SHA1	Message	Date
snova-edwardm	aa65610e75	Sambanova - LlamaGuard (#886 ) # What does this PR do? - Fix loading SambaNovaImpl issue - Add LlamaGuard model support for inference ## Test Plan Run the following unit test scripts and results ### Embedding ``` pytest -s -v --providers inference=sambanova llama_stack/providers/tests/inference/test_embeddings.py --inference-model meta-llama/Llama-3.2-11B-Vision-Instruct --env SAMBANOVA_API_KEY={SAMBANOVA_API_KEY} ``` ``` llama_stack/providers/tests/inference/test_embeddings.py::TestEmbeddings::test_embeddings[-sambanova] SKIPPED (This test is only applicable for embedding models) llama_stack/providers/tests/inference/test_embeddings.py::TestEmbeddings::test_batch_embeddings[-sambanova] SKIPPED (This test is only applicable for embedding models) =================================================================================================================== 2 skipped, 1 warning in 0.32s =================================================================================================================== ``` ### Vision ``` pytest -s -v --providers inference=sambanova llama_stack/providers/tests/inference/test_vision_inference.py --inference-model meta-llama/Llama-3.2-11B-Vision-Instruct --env SAMBANOVA_API_KEY={SAMBANOVA_API_KEY} ``` ``` llama_stack/providers/tests/inference/test_vision_inference.py::TestVisionModelInference::test_vision_chat_completion_non_streaming[-sambanova-image0-expected_strings0] PASSED llama_stack/providers/tests/inference/test_vision_inference.py::TestVisionModelInference::test_vision_chat_completion_non_streaming[-sambanova-image1-expected_strings1] PASSED llama_stack/providers/tests/inference/test_vision_inference.py::TestVisionModelInference::test_vision_chat_completion_streaming[-sambanova] PASSED =================================================================================================================== 3 passed, 1 warning in 2.68s ==================================================================================================================== ``` ### Text ``` pytest -s -v --providers inference=sambanova llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming --env SAMBANOVA_API_KEY={SAMBANOVA_API_KEY} ``` ``` llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[-sambanova] PASSED =================================================================================================================== 1 passed, 1 warning in 0.46s ==================================================================================================================== ``` ``` pytest -s -v --providers inference=sambanova llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming --env SAMBANOVA_API_KEY={SAMBANOVA_API_KEY} ``` ``` llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[-sambanova] PASSED =================================================================================================================== 1 passed, 1 warning in 0.48s ==================================================================================================================== ``` ## Before submitting - [] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [Y] Ran pre-commit to handle lint / formatting issues. - [Y] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [Y] Updated relevant documentation. - [Y] Wrote necessary unit or integration tests.	2025-01-27 15:46:30 -08:00
Hardik Shah	2cebb24d3a	Update doc templates for running safety on self-hosted templates (#874 )	2025-01-24 11:28:20 -08:00
snova-edwardm	22dc684da6	Sambanova inference provider (#555 ) # What does this PR do? This PR adds SambaNova as one of the Provider - Add SambaNova as a provider ## Test Plan Test the functional command ``` pytest -s -v --providers inference=sambanova llama_stack/providers/tests/inference/test_embeddings.py llama_stack/providers/tests/inference/test_prompt_adapter.py llama_stack/providers/tests/inference/test_text_inference.py llama_stack/providers/tests/inference/test_vision_inference.py --env SAMBANOVA_API_KEY=<sambanova-api-key> ``` Test the distribution template: ``` # Docker LLAMA_STACK_PORT=5001 docker run -it -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \ llamastack/distribution-sambanova \ --port $LLAMA_STACK_PORT \ --env SAMBANOVA_API_KEY=$SAMBANOVA_API_KEY # Conda llama stack build --template sambanova --image-type conda llama stack run ./run.yaml \ --port $LLAMA_STACK_PORT \ --env SAMBANOVA_API_KEY=$SAMBANOVA_API_KEY ``` ## Source [SambaNova API Documentation](https://cloud.sambanova.ai/apis) ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [Y] Ran pre-commit to handle lint / formatting issues. - [Y] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [Y] Updated relevant documentation. - [Y ] Wrote necessary unit or integration tests. --------- Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>	2025-01-23 12:20:28 -08:00
Marut Pandya	e2b5456e48	Add Runpod Provider + Distribution (#362 ) Add Runpod as a inference provider for openAI compatible managed endpoints. Testing - Configured llama stack from scratch, set `remote::runpod` as a inference provider. - Added Runpod Endpoint URL and API key. - Started llama-stack server - llama stack run my-local-stack --port 3000 ``` curl http://localhost:5000/inference/chat_completion \ -H "Content-Type: application/json" \ -d '{ "model": "Llama3.1-8B-Instruct", "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Write me a 2 sentence poem about the moon"} ], "sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512} }' ``` --------- Signed-off-by: pandyamarut <pandyamarut@gmail.com>	2025-01-23 12:19:02 -08:00
Xi Yan	74f6af8bbe	[CICD] add simple test step for docker build workflow, fix prefix bug (#821 ) # What does this PR do? Main Thing - Add a simple test step before publishing docker image in workflow Side Fix - Docker push action fails recently due to extra prefix introduced. E.g. see: https://github.com/meta-llama/llama-stack/pull/802#issuecomment-2599507062 cc @terrytangyuan ## Test Plan 1. Release a TestPyPi version on this code: 0.0.63.dev51206766 `3581203331` ``` # 1. build docker image TEST_PYPI_VERSION=0.0.63.dev51206766 llama stack build --template fireworks # 2. test the docker image cd distributions/fireworks && docker compose up ``` 4. Test the full build + test docker flow using TestPyPi from (1): `1284218494` <img width="1049" alt="image" src="https://github.com/user-attachments/assets/c025893d-5ce2-48ff-aa90-de00e105ee09" /> ## Sources Please link relevant resources if necessary. ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Ran pre-commit to handle lint / formatting issues. - [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests.	2025-01-18 15:16:05 -08:00
Dinesh Yeduguru	3d4c53dfec	add mcp runtime as default to all providers (#816 ) # What does this PR do? This is needed to have the notebook work with MCP	2025-01-17 16:40:58 -08:00
Yuan Tang	6da3053c0e	More generic image type for OCI-compliant container technologies (#802 ) It's a more generic term and applicable to alternatives of Docker, such as Podman or other OCI-compliant technologies. --------- Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>	2025-01-17 16:37:42 -08:00
Xi Yan	9d005154d7	fix vllm template (#813 ) # What does this PR do? - Fix vLLM template to resolve https://github.com/meta-llama/llama-stack/issues/805 - Fix agents test with shields ## Test Plan ``` vllm serve meta-llama/Llama-3.1-8B-Instruct VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct" llama stack run ./llama_stack/templates/remote-vllm/run.yaml ``` ``` LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v ./tests/client-sdk/ ``` <img width="1245" alt="image" src="https://github.com/user-attachments/assets/9af27684-5a9c-4187-b338-cbfc5211bd99" /> - custom tool flaky due to model outputs - /completions API not implemented Vision Model - 11B-Vision-Instruct <img width="1240" alt="image" src="https://github.com/user-attachments/assets/1d3b3b17-fa09-43a7-b56c-3f77263825c5" /> ## Sources Please link relevant resources if necessary. ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Ran pre-commit to handle lint / formatting issues. - [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests.	2025-01-17 15:34:29 -08:00
Xi Yan	d1f3b032c9	cerebras template update for memory (#792 ) # What does this PR do? - we no longer have meta-reference as memory provider, update cerebras template ## Test Plan ``` python llama_stack/scripts/distro_codegen.py ``` ## Sources Please link relevant resources if necessary. ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Ran pre-commit to handle lint / formatting issues. - [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests.	2025-01-16 16:07:53 -08:00
Ashwin Bharambe	03ac84a829	Update default port from 5000 -> 8321	2025-01-16 15:26:48 -08:00
Xi Yan	a6b9f2cec7	fix cerebras template (#790 ) # What does this PR do? - fix cerebras template ## Test Plan ``` llama stack build --template cerebras --image-type conda llama stack run cerebras LLAMA_STACK_BASE_URL="http://localhost:5000" pytest -v tests/client-sdk/ --html=report.html --self-contained-html ``` ## Sources Please link relevant resources if necessary. ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Ran pre-commit to handle lint / formatting issues. - [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests.	2025-01-16 13:53:06 -08:00
Xi Yan	b76bef169c	fix nvidia inference provider (#781 ) # What does this PR do? - fixes to nvidia inference provider to account for strategy update - update nvidia templates ## Test Plan ``` llama stack run ./llama_stack/templates/nvidia/run.yaml --port 5000 LLAMA_STACK_BASE_URL="http://localhost:5000" pytest -v tests/client-sdk/inference/test_inference.py --html=report.html --self-contained-html ``` <img width="1288" alt="image" src="https://github.com/user-attachments/assets/d20f9aea-525e-47de-a5be-586e022e0d55" /> NOTE - vision inference broken - tool calling broken - /completion broken cc @mattf @cdgamarose-nv for improving NVIDIA inference adapter ## Sources Please link relevant resources if necessary. ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Ran pre-commit to handle lint / formatting issues. - [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests.	2025-01-15 18:49:36 -08:00
Xi Yan	27e07b44b5	remove inline-nvidia templates	2025-01-15 14:15:56 -08:00
cdgamarose-nv	b3202bcf77	add nvidia distribution (#565 ) # What does this PR do? adds nvidia template for creating a distribution using inference adapter for NVIDIA NIMs. ## Test Plan Please describe: Build llama stack distribution for nvidia using the template, docker and conda. ```bash (.venv) local-cdgamarose@a4u8g-0006:~/llama-stack$ llama-stack-client configure --endpoint http://localhost:5000 Done! You can now use the Llama Stack Client CLI with endpoint http://localhost:5000 (.venv) local-cdgamarose@a4u8g-0006:~/llama-stack$ llama-stack-client models list ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓ ┃ identifier ┃ provider_id ┃ provider_resource_id ┃ metadata ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩ │ Llama3.1-8B-Instruct │ nvidia │ meta/llama-3.1-8b-instruct │ {} │ │ meta-llama/Llama-3.2-3B-Instruct │ nvidia │ meta/llama-3.2-3b-instruct │ {} │ └──────────────────────────────────┴─────────────┴────────────────────────────┴──────────┘ (.venv) local-cdgamarose@a4u8g-0006:~/llama-stack$ llama-stack-client inference chat-completion --message "hello, write me a 2 sentence poem" ChatCompletionResponse( completion_message=CompletionMessage( content='Here is a 2 sentence poem:\n\nThe sun sets slow and paints the sky, \nA gentle hue of pink that makes me sigh.', role='assistant', stop_reason='end_of_turn', tool_calls=[] ), logprobs=None ) ``` ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [x] Ran pre-commit to handle lint / formatting issues. - [x] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [x] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests. --------- Co-authored-by: Matthew Farrellee <matt@cs.wisc.edu>	2025-01-15 14:04:43 -08:00
Yuan Tang	300e6e2702	Fix issue when generating distros (#755 ) Addressed comment https://github.com/meta-llama/llama-stack/pull/723#issuecomment-2581902075. cc @yanxi0830 I am not 100% sure if the diff is correct though but this is the result of running `python llama_stack/scripts/distro_codegen.py`. --------- Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>	2025-01-15 05:34:08 -08:00
raghotham	ff182ff6de	rename LLAMASTACK_PORT to LLAMA_STACK_PORT for consistency with other env vars (#744 ) # What does this PR do? Rename environment var for consistency ## Test Plan No regressions ## Sources ## Before submitting - [X] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [X] Ran pre-commit to handle lint / formatting issues. - [X] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [X] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests. --------- Signed-off-by: Yuan Tang <terrytangyuan@gmail.com> Co-authored-by: Yuan Tang <terrytangyuan@gmail.com>	2025-01-10 11:09:49 -08:00
Dinesh Yeduguru	a5c57cd381	agents to use tools api (#673 ) # What does this PR do? PR #639 introduced the notion of Tools API and ability to invoke tools through API just as any resource. This PR changes the Agents to start using the Tools API to invoke tools. Major changes include: 1) Ability to specify tool groups with AgentConfig 2) Agent gets the corresponding tool definitions for the specified tools and pass along to the model 3) Attachements are now named as Documents and their behavior is mostly unchanged from user perspective 4) You can specify args that can be injected to a tool call through Agent config. This is especially useful in case of memory tool, where you want the tool to operate on a specific memory bank. 5) You can also register tool groups with args, which lets the agent inject these as well into the tool call. 6) All tests have been migrated to use new tools API and fixtures including client SDK tests 7) Telemetry just works with tools API because of our trace protocol decorator ## Test Plan ``` pytest -s -v -k fireworks llama_stack/providers/tests/agents/test_agents.py \ --safety-shield=meta-llama/Llama-Guard-3-8B \ --inference-model=meta-llama/Llama-3.1-8B-Instruct pytest -s -v -k together llama_stack/providers/tests/tools/test_tools.py \ --safety-shield=meta-llama/Llama-Guard-3-8B \ --inference-model=meta-llama/Llama-3.1-8B-Instruct LLAMA_STACK_CONFIG="/Users/dineshyv/.llama/distributions/llamastack-together/together-run.yaml" pytest -v tests/client-sdk/agents/test_agents.py ``` run.yaml: https://gist.github.com/dineshyv/0365845ad325e1c2cab755788ccc5994 Notebook: https://colab.research.google.com/drive/1ck7hXQxRl6UvT-ijNRZ-gMZxH1G3cN2d?usp=sharing	2025-01-08 19:01:00 -08:00
Xi Yan	7a4383e4c1	add 3.3 to together inference provider (#729 ) # What does this PR do? - add llama3.3 model for together - fix fireworks distro_codegen ``` python llama_stack/scripts/distro_codegen.py ``` ## Test Plan <img width="1132" alt="image" src="https://github.com/user-attachments/assets/bf94b933-9200-4e73-878e-d1a95d450a88" /> Tests ``` pytest -v -s -k "together" --inference-model="meta-llama/Llama-3.3-70B-Instruct" ./llama_stack/providers/tests/inference/test_text_inference.py ``` <img width="1139" alt="image" src="https://github.com/user-attachments/assets/407dc98b-8de3-4841-8cb1-75e4b5128544" /> ## Sources Please link relevant resources if necessary. ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Ran pre-commit to handle lint / formatting issues. - [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests.	2025-01-06 15:39:41 -08:00
Botao Chen	36b4fe02cc	[4/n][torchtune integration] support lazy load model during inference (#620 ) ## What does this PR do? In this PR, we refactor the meta reference inference logic to support - load the model during registering model instead of during spinning up server - support inference finetuned model checkpoint on top of native llama model ## Why need these changes To solve the existing pain points that - user cannot lazy load the model and hot switch the inference checkpoint after spinning up the server - this blocks us doing inference and eval on the same sever for a finetuned checkpoint after post training - user cannot do inference on a finetuned checkpoint on top of native llama models ## Expect user experience change - The inference model won't be loaded when spinning up server. Instead, it will be loaded during register model. If user add the model as models resource in run.yaml, it will be registered and loaded automatically when starting server. There is an optional flag 'skip_initialize' in model metadata to skip model loading during registration. - There is an optional flag 'llama_model' in model metadata to identify the base model of the Model class for validation and initialize model arch. model identifier no longer needs to be a native llama model - the default inference model name updates from 'meta-llama/Llama-3.2-3B-Instruct' to 'Llama3.2-3B-Instruct' - It aligns with the checkpoint folder name after running 'llama model download' - It aligns with the descriptor name defined in llama-models SKU list `bf5b0c4fe7/models/datatypes.py (L95)` ## test run python llama_stack/scripts/distro_codegen.py run unit test - torchrun $CONDA_PREFIX/bin/pytest -v -s -k "meta_reference" --inference-model="Llama3.1-8B-Instruct" ./llama_stack/providers/tests/inference/test_text_inference.py - torchrun $CONDA_PREFIX/bin/pytest -v -s -k "meta_reference" --inference-model="Llama3.1-8B-Instruct" ./llama_stack/providers/tests/inference/test_model_registration.py test post training experience on server side run: llama stack run llama_stack/templates/experimental-post-training/run.yaml server is spinning up without model loaded <img width="812" alt="Screenshot 2024-12-17 at 1 24 50 PM" src="https://github.com/user-attachments/assets/ce1f606b-3b6f-452f-b48e-b3761ffd90f3" /> on client side, run: llama-stack-client --endpoint http://devgpu018.nha2.facebook.com:5000 models register Llama3.2-3B-Instruct register model successfully and the model is loaded <img width="1111" alt="Screenshot 2024-12-17 at 1 26 30 PM" src="https://github.com/user-attachments/assets/56e02131-cf7d-4de5-8f63-fbdcb8c55c26" /> <img width="1541" alt="Screenshot 2024-12-17 at 1 26 09 PM" src="https://github.com/user-attachments/assets/a83255a1-20f5-40a2-af51-55641410a115" /> if add "skip_initialize" in metadata, model is registered but isn't loaded on client side, run: llama-stack-client --endpoint http://devgpu018.nha2.facebook.com:5000 inference chat-completion --message "hello, what model are you?" Inference the model succesfully <img width="1121" alt="Screenshot 2024-12-17 at 1 27 33 PM" src="https://github.com/user-attachments/assets/8e708545-3fe7-4a73-8754-1470fa5f1e75" /> test inference experience run: llama stack run llama_stack/templates/meta-reference-gpu/run.yaml model is loaded since the model is in resouce list in run.yaml <img width="1537" alt="Screenshot 2024-12-17 at 1 30 19 PM" src="https://github.com/user-attachments/assets/5c8af817-66eb-43f8-bf4c-f5e24b0a12c6" /> on client side, run: llama-stack-client --endpoint http://devgpu018.nha2.facebook.com:5000 inference chat-completion --message "hello, what model are you?" inference successfully <img width="1123" alt="Screenshot 2024-12-17 at 1 31 08 PM" src="https://github.com/user-attachments/assets/471809aa-c65e-46dc-a37e-7094fb857f97" /> ## inference on a finetuned model register a finetuned model that finetuned by post training api (torchtune) - the model is registered and loaded successfully - the model is shown up in the model list <img width="974" alt="Screenshot 2024-12-18 at 3 56 33 PM" src="https://github.com/user-attachments/assets/2994b4f5-4fa9-40c6-acc6-4b971479f3e2" /> run inference <img width="977" alt="Screenshot 2024-12-18 at 3 57 59 PM" src="https://github.com/user-attachments/assets/d117abbc-b2a0-41d8-a028-1a13128787b2" />	2024-12-18 16:30:53 -08:00
Dinesh Yeduguru	516e1a3e59	add embedding model by default to distribution templates (#617 ) # What does this PR do? Adds the sentence transformer provider and the `all-MiniLM-L6-v2` embedding model to the default models to register in the run.yaml for all providers. ## Test Plan llama stack build --template together --image-type conda llama stack run ~/.llama/distributions/llamastack-together/together-run.yaml	2024-12-13 12:48:00 -08:00
Ashwin Bharambe	e951852848	Miscellaneous fixes around telemetry, library client and run yaml autogen Also add a `venv` image-type for llama stack build	2024-12-08 20:40:22 -08:00
Xi Yan	7301403ce3	Add eval/scoring/datasetio API providers to distribution templates & UI developer guide (#564 ) # What does this PR do? - add /eval, /scoring, /datasetio API providers to distribution templates - regenerate build.yaml / run.yaml files - fix `template.py` to take in list of providers instead of only first one - override memory provider as faiss default for all distro (as only 1 memory provider is needed to start basic flow, chromadb/pgvector need additional setup step). ``` python llama_stack/scripts/distro_codegen.py ``` - updated README to start UI via conda builds. ## Test Plan ``` python llama_stack/scripts/distro_codegen.py ``` - Use newly generated `run.yaml` to start server ``` llama stack run ./llama_stack/templates/together/run.yaml ``` <img width="1191" alt="image" src="https://github.com/user-attachments/assets/62f7d179-0cd0-427c-b6e8-e087d4648f09"> #### Registration ``` ❯ llama-stack-client datasets register \ --dataset-id "mmlu" \ --provider-id "huggingface" \ --url "https://huggingface.co/datasets/llamastack/evals" \ --metadata '{"path": "llamastack/evals", "name": "evals__mmlu__details", "split": "train"}' \ --schema '{"input_query": {"type": "string"}, "expected_answer": {"type": "string", "chat_completion_input": {"type": "string"}}}' ❯ llama-stack-client datasets list ┏━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓ ┃ identifier ┃ provider_id ┃ metadata ┃ type ┃ ┡━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩ │ mmlu │ huggingface │ {'path': 'llamastack/evals', 'name': │ dataset │ │ │ │ 'evals__mmlu__details', 'split': │ │ │ │ │ 'train'} │ │ └────────────┴─────────────┴─────────────────────────────────────────┴─────────┘ ``` ``` ❯ llama-stack-client datasets register \ --dataset-id "simpleqa" \ --provider-id "huggingface" \ --url "https://huggingface.co/datasets/llamastack/evals" \ --metadata '{"path": "llamastack/evals", "name": "evals__simpleqa", "split": "train"}' \ --schema '{"input_query": {"type": "string"}, "expected_answer": {"type": "string", "chat_completion_input": {"type": "string"}}}' ❯ llama-stack-client datasets list ┏━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓ ┃ identifier ┃ provider_id ┃ metadata ┃ type ┃ ┡━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩ │ mmlu │ huggingface │ {'path': 'llamastack/evals', 'name': 'evals__mmlu__details', │ dataset │ │ │ │ 'split': 'train'} │ │ │ simpleqa │ huggingface │ {'path': 'llamastack/evals', 'name': 'evals__simpleqa', │ dataset │ │ │ │ 'split': 'train'} │ │ └────────────┴─────────────┴───────────────────────────────────────────────────────────────┴─────────┘ ``` ``` ❯ llama-stack-client eval_tasks register \ > --eval-task-id meta-reference-mmlu \ > --provider-id meta-reference \ > --dataset-id mmlu \ > --scoring-functions basic::regex_parser_multiple_choice_answer ❯ llama-stack-client eval_tasks register \ --eval-task-id meta-reference-simpleqa \ --provider-id meta-reference \ --dataset-id simpleqa \ --scoring-functions llm-as-judge::405b-simpleqa ❯ llama-stack-client eval_tasks list ┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓ ┃ dataset_id ┃ identifier ┃ metadata ┃ provider_id ┃ provider_resour… ┃ scoring_functio… ┃ type ┃ ┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩ │ mmlu │ meta-reference-… │ {} │ meta-reference │ meta-reference-… │ ['basic::regex_… │ eval_task │ │ simpleqa │ meta-reference-… │ {} │ meta-reference │ meta-reference-… │ ['llm-as-judge:… │ eval_task │ └────────────┴──────────────────┴──────────┴────────────────┴──────────────────┴──────────────────┴───────────┘ ``` #### Test with UI ``` streamlit run app.py ``` ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Ran pre-commit to handle lint / formatting issues. - [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests.	2024-12-05 16:29:32 -08:00
Henry Tu	64c6df8392	Cerebras Inference Integration (#265 ) Adding Cerebras Inference as an API provider. ## Testing ### Conda ``` $ llama stack build --template cerebras --image-type conda $ llama stack run ~/.llama/distributions/llamastack-cerebras/cerebras-run.yaml ... Listening on ['::', '0.0.0.0']:5000 INFO: Started server process [12443] INFO: Waiting for application startup. INFO: Application startup complete. INFO: Uvicorn running on http://['::', '0.0.0.0']:5000 (Press CTRL+C to quit) ``` ### Chat Completion ``` $ curl --location 'http://localhost:5000/alpha/inference/chat-completion' --header 'Content-Type: application/json' --data '{ "model_id": "meta-llama/Llama-3.1-8B-Instruct", "messages": [ { "role": "user", "content": "What is the temperature in Seattle right now?" } ], "stream": false, "sampling_params": { "strategy": "top_p", "temperature": 0.5, "max_tokens": 100 }, "tool_choice": "auto", "tool_prompt_format": "json", "tools": [ { "tool_name": "getTemperature", "description": "Gets the current temperature of a location.", "parameters": { "location": { "param_type": "string", "description": "The name of the place to get the temperature from in degress celsius.", "required": true } } } ] }' ``` #### Non-Streaming Response ``` { "completion_message": { "role": "assistant", "content": "", "stop_reason": "end_of_message", "tool_calls": [ { "call_id": "6f42fdcc-6cbb-46ad-a17b-5d20ac64b678", "tool_name": "getTemperature", "arguments": { "location": "Seattle" } } ] }, "logprobs": null } ``` #### Streaming Response ``` data: {"event":{"event_type":"start","delta":"","logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"","parse_status":"started"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"{\"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"type","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"\":","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":" \"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"function","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"\",","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":" \"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"name","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"\":","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":" \"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"get","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"Temperature","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"\",","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":" \"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"parameters","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"\":","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":" {\"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"location","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"\":","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":" \"","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"Seattle","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":"\"}}","parse_status":"in_progress"},"logprobs":null,"stop_reason":null}} data: {"event":{"event_type":"progress","delta":{"content":{"call_id":"e742df1f-0ae9-40ad-a49e-18e5c905484f","tool_name":"getTemperature","arguments":{"location":"Seattle"}},"parse_status":"success"},"logprobs":null,"stop_reason":"end_of_message"}} data: {"event":{"event_type":"complete","delta":"","logprobs":null,"stop_reason":"end_of_message"}} ``` ### Completion ``` $ curl --location 'http://localhost:5000/alpha/inference/completion' --header 'Content-Type: application/json' --data '{ "model_id": "meta-llama/Llama-3.1-8B-Instruct", "content": "1,2,3,", "stream": true, "sampling_params": { "strategy": "top_p", "temperature": 0.5, "max_tokens": 10 }, "tool_choice": "auto", "tool_prompt_format": "json", "tools": [ { "tool_name": "getTemperature", "description": "Gets the current temperature of a location.", "parameters": { "location": { "param_type": "string", "description": "The name of the place to get the temperature from in degress celsius.", "required": true } } } ] }' ``` #### Non-Streaming Response ``` { "content": "4,5,6,7,8,", "stop_reason": "out_of_tokens", "logprobs": null } ``` #### Streaming Response ``` data: {"delta":"4","stop_reason":null,"logprobs":null} data: {"delta":",","stop_reason":null,"logprobs":null} data: {"delta":"5","stop_reason":null,"logprobs":null} data: {"delta":",","stop_reason":null,"logprobs":null} data: {"delta":"6","stop_reason":null,"logprobs":null} data: {"delta":",","stop_reason":null,"logprobs":null} data: {"delta":"7","stop_reason":null,"logprobs":null} data: {"delta":",","stop_reason":null,"logprobs":null} data: {"delta":"8","stop_reason":null,"logprobs":null} data: {"delta":",","stop_reason":null,"logprobs":null} data: {"delta":"","stop_reason":null,"logprobs":null} data: {"delta":"","stop_reason":"out_of_tokens","logprobs":null} ``` ### Pre-Commit Checks ``` trim trailing whitespace.................................................Passed check python ast.........................................................Passed check for merge conflicts................................................Passed check for added large files..............................................Passed fix end of files.........................................................Passed Insert license in comments...............................................Passed flake8...................................................................Passed Format files with µfmt...................................................Passed ``` ### Testing with `test_inference.py` ``` $ export CEREBRAS_API_KEY=<insert API key here> $ pytest -v -s llama_stack/providers/tests/inference/test_text_inference.py -m "cerebras and llama_8b" /net/henryt-dev/srv/nfs/henryt-data/ws/llama-stack/.venv/lib/python3.12/site-packages/pytest_asyncio/plugin.py:208: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset. The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session" warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET)) =================================================== test session starts =================================================== platform linux -- Python 3.12.3, pytest-8.3.3, pluggy-1.5.0 -- /net/henryt-dev/srv/nfs/henryt-data/ws/llama-stack/.venv/bin/python3.12 cachedir: .pytest_cache rootdir: /net/henryt-dev/srv/nfs/henryt-data/ws/llama-stack configfile: pyproject.toml plugins: anyio-4.6.2.post1, asyncio-0.24.0 asyncio: mode=Mode.STRICT, default_loop_scope=None collected 128 items / 120 deselected / 8 selected llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_model_list[llama_8b-cerebras] Resolved 4 providers inner-inference => cerebras models => __routing_table__ inference => __autorouted__ inspect => __builtin__ Models: meta-llama/Llama-3.1-8B-Instruct served by cerebras PASSED llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[llama_8b-cerebras] PASSED llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completions_structured_output[llama_8b-cerebras] SKIPPED llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[llama_8b-cerebras] PASSED llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[llama_8b-cerebras] SKIPPED llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[llama_8b-cerebras] PASSED llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling[llama_8b-cerebras] PASSED llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling_streaming[llama_8b-cerebras] PASSED ================================ 6 passed, 2 skipped, 120 deselected, 6 warnings in 3.95s ================================= ``` I ran `python llama_stack/scripts/distro_codegen.py` to run codegen.	2024-12-03 21:15:32 -08:00
Ashwin Bharambe	cd6ccb664c	Integrate distro docs into the restructured docs	2024-11-20 23:20:05 -08:00
Ashwin Bharambe	2411a44833	Update more distribution docs to be simpler and partially codegen'ed	2024-11-20 22:03:44 -08:00
Ashwin Bharambe	89f5093dfc	Fix tgi doc	2024-11-19 21:06:11 -08:00
Xi Yan	2da93c8835	fix 3.2-1b fireworks	2024-11-19 14:20:07 -08:00
Xi Yan	189df6358a	codegen docs	2024-11-19 14:16:00 -08:00
Ashwin Bharambe	1619d37cc6	codegen per-distro dependencies; not hooked into setup.py yet	2024-11-19 09:54:30 -08:00
Ashwin Bharambe	2a31163178	Auto-generate distro yamls + docs (#468 ) # What does this PR do? Automatically generates - build.yaml - run.yaml - run-with-safety.yaml - parts of markdown docs for the distributions. ## Test Plan At this point, this only updates the YAMLs and the docs. Some testing (especially with ollama and vllm) has been performed but needs to be much more tested.	2024-11-18 14:57:06 -08:00
Ashwin Bharambe	96e7ef646f	add support for ${env.FOO_BAR} placeholders in run.yaml files (#439 ) # What does this PR do? We'd like our docker steps to require _ZERO EDITS_ to a YAML file in order to get going. This is often not possible because depending on the provider, we do need some configuration input from the user. Environment variables are the best way to obtain this information. This PR allows our run.yaml to contain `${env.FOO_BAR}` placeholders which can be replaced using `docker run -e FOO_BAR=baz` (and similar `docker compose` equivalent). ## Test Plan For remote-vllm, example `run.yaml` snippet looks like this: ```yaml providers: inference: # serves main inference model - provider_id: vllm-0 provider_type: remote::vllm config: # NOTE: replace with "localhost" if you are running in "host" network mode url: ${env.LLAMA_INFERENCE_VLLM_URL:http://host.docker.internal:5100/v1} max_tokens: ${env.MAX_TOKENS:4096} api_token: fake # serves safety llama_guard model - provider_id: vllm-1 provider_type: remote::vllm config: # NOTE: replace with "localhost" if you are running in "host" network mode url: ${env.LLAMA_SAFETY_VLLM_URL:http://host.docker.internal:5101/v1} max_tokens: ${env.MAX_TOKENS:4096} api_token: fake ``` `compose.yaml` snippet looks like this: ```yaml llamastack: depends_on: - vllm-0 - vllm-1 # image: llamastack/distribution-remote-vllm image: llamastack/distribution-remote-vllm:test-0.0.52rc3 volumes: - ~/.llama:/root/.llama - ~/local/llama-stack/distributions/remote-vllm/run.yaml:/root/llamastack-run-remote-vllm.yaml # network_mode: "host" environment: - LLAMA_INFERENCE_VLLM_URL=${LLAMA_INFERENCE_VLLM_URL:-http://host.docker.internal:5100/v1} - LLAMA_INFERENCE_MODEL=${LLAMA_INFERENCE_MODEL:-Llama3.1-8B-Instruct} - MAX_TOKENS=${MAX_TOKENS:-4096} - SQLITE_STORE_DIR=${SQLITE_STORE_DIR:-$HOME/.llama/distributions/remote-vllm} - LLAMA_SAFETY_VLLM_URL=${LLAMA_SAFETY_VLLM_URL:-http://host.docker.internal:5101/v1} - LLAMA_SAFETY_MODEL=${LLAMA_SAFETY_MODEL:-Llama-Guard-3-1B} ```	2024-11-13 11:25:58 -08:00
Ashwin Bharambe	e51107e019	Fix compose.yaml	2024-11-12 15:43:30 -08:00
Ashwin Bharambe	e4f14eafe2	Use GPUs 0 and 1	2024-11-12 14:21:22 -08:00
Ashwin Bharambe	1245a625ce	Update vllm compose and run YAMLs	2024-11-12 14:17:38 -08:00
Ashwin Bharambe	2c294346ae	Update provider types and prefix with inline::	2024-11-12 12:54:44 -08:00
Ashwin Bharambe	3d7561e55c	Rename all inline providers with an inline:: prefix (#423 )	2024-11-11 22:19:16 -08:00
Suraj Subramanian	b78ee3a0a5	fix duplicate `deploy` in compose.yaml (#417 )	2024-11-11 10:51:14 -08:00
Ashwin Bharambe	c1f7ba3aed	Split safety into (llama-guard, prompt-guard, code-scanner) (#400 ) Splits the meta-reference safety implementation into three distinct providers: - inline::llama-guard - inline::prompt-guard - inline::code-scanner Note that this PR is a backward incompatible change to the llama stack server. I have added deprecation_error field to ProviderSpec -- the server reads it and immediately barfs. This is used to direct the user with a specific message on what action to perform. An automagical "config upgrade" is a bit too much work to implement right now :/ (Note that we will be gradually prefixing all inline providers with inline:: -- I am only doing this for this set of new providers because otherwise existing configuration files will break even more badly.)	2024-11-11 09:29:18 -08:00
Ashwin Bharambe	4986e46188	Distributions updates (slight updates to ollama, add inline-vllm and remote-vllm) (#408 ) * remote vllm distro * add inline-vllm details, fix things * Write some docs	2024-11-08 18:09:39 -08:00
Xi Yan	b1d7376730	kill tgi/cpu	2024-11-08 10:33:45 -08:00
Dinesh Yeduguru	093c9f1987	add bedrock distribution code (#358 ) * add bedrock distribution code * fix linter error * add bedrock shields support * linter fixes * working bedrock safety * change to return only one violation * remove env var reading * refereshable boto credentials * remove env vars * address raghu's feedback * fix session_ttl passing --------- Co-authored-by: Dinesh Yeduguru <dineshyv@fb.com>	2024-11-06 14:39:11 -08:00
Ashwin Bharambe	fb2678b134	Fix shield_type and routing table breakage	2024-11-04 19:57:15 -08:00
Xi Yan	c810a4184d	[docs] update documentations (#356 ) * move docs -> source * Add files via upload * mv image * Add files via upload * colocate iOS setup doc * delete image * Add files via upload * fix * delete image * Add files via upload * Update developer_cookbook.md * toctree * wip subfolder * docs update * subfolder * updates * name * updates * index * updates * refactor structure * depth * docs * content * docs * getting started * distributions * fireworks * fireworks * update * theme * theme * theme * pdj theme * pytorch theme * css * theme * agents example * format * index * headers * copy button * test tabs * test tabs * fix * tabs * tab * tabs * sphinx_design * quick start commands * size * width * css * css * download models * asthetic fix * tab format * update * css * width * css * docs * tab based * tab * tabs * docs * style * image * css * color * typo * update docs * missing links * list templates * links * links update * troubleshooting * fix * distributions * docs * fix table * kill llamastack-local-gpu/cpu * Update index.md * Update index.md * mv ios_setup.md * Update ios_setup.md * Add remote_or_local.gif * Update ios_setup.md * release notes * typos * Add ios_setup to index * nav bar * hide torctree * ios image * links update * rename * rename * docs * rename * links * distributions * distributions * distributions * distributions * remove release * remote --------- Co-authored-by: dltn <6599399+dltn@users.noreply.github.com> Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>	2024-11-04 16:52:38 -08:00
Xi Yan	ae671eaf7a	distro readmes with model serving instructions (#339 ) * readme updates * quantied compose * dell tgi * config update * readme * update model serving readmes * update * update * config	2024-10-28 17:47:14 -07:00
Xi Yan	a70a4706fc	update distributions compose/readme (#338 ) * readme updates * quantied compose * dell tgi * config update	2024-10-28 16:34:43 -07:00
Xi Yan	985ff4d6ce	update distributions/readmes	2024-10-28 15:10:40 -07:00
Xi Yan	04a4784287	Update README.md	2024-10-28 13:25:44 -07:00
Xi Yan	3fa1eaf37d	Update README.md	2024-10-28 13:18:55 -07:00
Xi Yan	0d4215e125	Update README.md	2024-10-28 13:18:34 -07:00
Xi Yan	8f5a850de9	Update README.md	2024-10-28 13:16:23 -07:00

1 2

60 commits