# What does this PR do?

Adds raw completions API to vLLM

## Test Plan

<details>
<summary>Setup</summary>

```bash
# Run vllm server
conda create -n vllm python=3.12 -y
conda activate vllm
pip install vllm

# Run llamastack
conda create --name llamastack-vllm python=3.10
conda activate llamastack-vllm

export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct && \
pip install -e . && \
pip install --no-cache --index-url https://pypi.org/simple/ --extra-index-url https://test.pypi.org/simple/ llama-stack==0.1.0rc7 && \
llama stack build --template remote-vllm --image-type conda && \
llama stack run ./distributions/remote-vllm/run.yaml \
  --port 5000 \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env VLLM_URL=http://localhost:8000/v1 | tee -a llama-stack.log
```

</details>

<details>
<summary>Integration</summary>

```bash
# Run
conda activate llamastack-vllm
export VLLM_URL=http://localhost:8000/v1
pip install pytest pytest_html pytest_asyncio aiosqlite
pytest llama_stack/providers/tests/inference/test_text_inference.py -v -k vllm

# Results
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_model_list[-vllm_remote] PASSED [ 11%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[-vllm_remote] PASSED [ 22%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_logprobs[-vllm_remote] SKIPPED [ 33%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[-vllm_remote] SKIPPED [ 44%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[-vllm_remote] PASSED [ 55%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[-vllm_remote] PASSED [ 66%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[-vllm_remote] PASSED [ 77%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling[-vllm_remote] PASSED [ 88%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling_streaming[-vllm_remote] PASSED [100%]

====================================== 7 passed, 2 skipped, 99 deselected, 1 warning in 9.80s ======================================
```

</details>

<details>
<summary>Manual</summary>

```bash
# Install
pip install --no-cache --index-url https://pypi.org/simple/ --extra-index-url https://test.pypi.org/simple/ llama-stack==0.1.0rc7
```

Apply this diff:

```diff
diff --git a/llama_stack/distribution/server/server.py b/llama_stack/distribution/server/server.py
index 8dbb193..95173e2 100644
--- a/llama_stack/distribution/server/server.py
+++ b/llama_stack/distribution/server/server.py
@@ -250,7 +250,7 @@ class ClientVersionMiddleware:
                     server_version_parts = tuple(
                         map(int, self.server_version.split(".")[:2])
                     )
-                    if client_version_parts != server_version_parts:
+                    if False and client_version_parts != server_version_parts:
 
                         async def send_version_error(send):
                             await send(
diff --git a/llama_stack/templates/remote-vllm/run.yaml b/llama_stack/templates/remote-vllm/run.yaml
index 4eac4da..32eb50e 100644
--- a/llama_stack/templates/remote-vllm/run.yaml
+++ b/llama_stack/templates/remote-vllm/run.yaml
@@ -94,7 +94,8 @@ metadata_store:
   type: sqlite
   db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/remote-vllm}/registry.db
 models:
-- metadata: {}
+- metadata:
+    llama_model: meta-llama/Llama-3.2-3B-Instruct
   model_id: ${env.INFERENCE_MODEL}
   provider_id: vllm-inference
   model_type: llm
```

Test 1:

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(
    base_url="http://localhost:5000",
)

response = client.inference.completion(
    model_id="meta-llama/Llama-3.2-3B-Instruct",
    content="Hello, world client!",
)

print(response)
```

Test 2:

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(
    base_url="http://localhost:5000",
)

response = client.inference.completion(
    model_id="meta-llama/Llama-3.2-3B-Instruct",
    content="Hello, world client!",
    stream=True,
)

for chunk in response:
    print(chunk.delta, end="", flush=True)
```

```
I'm excited to introduce you to our latest project, a comprehensive guide to the best coffee shops in [City]. As a coffee connoisseur, you're in luck because we've scoured the city to bring you the top picks for the perfect cup of joe. In this guide, we'll take you on a journey through the city's most iconic coffee shops, highlighting their unique features, must-try drinks, and insider tips from the baristas themselves. From cozy cafes to trendy cafes, we've got you covered.

**Top 5 Coffee Shops in [City]**

1. **The Daily Grind**: This beloved institution has been serving up expertly crafted pour-overs and lattes for over 10 years. Their expert baristas are always happy to guide you through their menu, which features a rotating selection of single-origin beans from around the world...
```

</details>

## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
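As a footnote to the Test Plan above: a minimal way to sanity-check the raw completions path without Llama Stack in the loop is to call the vLLM server's OpenAI-compatible `/v1/completions` endpoint directly. The sketch below is not part of this PR; it assumes the vLLM server from the Setup section is still running on `localhost:8000` with the same model, and the `openai` client usage and `max_tokens` value are illustrative choices.

```python
# Hypothetical sanity check: hit vLLM's OpenAI-compatible completions
# endpoint directly, bypassing Llama Stack. Assumes the vLLM server from
# the Setup section is running on localhost:8000 with the same model.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    prompt="Hello, world client!",
    max_tokens=64,
)
print(resp.choices[0].text)
```

If this direct call succeeds but the Llama Stack `client.inference.completion` call does not, the issue is likely in the provider wiring rather than in vLLM itself.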
# Llama Stack
Quick Start | Documentation | Colab Notebook
Llama Stack defines and standardizes the core building blocks needed to bring generative AI applications to market. It provides a unified set of APIs with implementations from leading service providers, enabling seamless transitions between development and production environments.
We focus on making it easy to build production applications with the Llama model family - from the latest Llama 3.3 to specialized models like Llama Guard for safety.
## Key Features
- Unified API Layer (see the sketch after this list) for:
  - Inference: Run LLM models efficiently
  - Safety: Apply content filtering and safety policies
  - DatasetIO: Store and retrieve knowledge for RAG
  - Agents: Build multi-step agentic workflows
  - Evaluation: Test and improve model and agent quality
  - Telemetry: Collect and analyze usage data and complex agentic traces
  - Post Training (coming soon): Fine-tune models for specific use cases
- Rich Provider Ecosystem
  - Local Development: Meta's Reference, Ollama, vLLM, TGI
  - Self-hosted: Chroma, pgvector, Nvidia NIM
  - Cloud: Fireworks, Together, Nvidia, AWS Bedrock, Groq, Cerebras
  - On-device: iOS and Android support
- Built for Production
  - Pre-packaged distributions for common deployment scenarios
  - Comprehensive evaluation capabilities
  - Full observability and monitoring
  - Provider federation and fallback
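To make the unified API layer concrete, here is a minimal sketch of an inference call through the Python client. It assumes a Llama Stack server is already running on `localhost:5000` with `meta-llama/Llama-3.2-3B-Instruct` registered (as in the Test Plan above); response field names follow the 0.1.x client and may differ in other versions.

```python
# Minimal sketch, assuming a local Llama Stack server on localhost:5000
# with meta-llama/Llama-3.2-3B-Instruct registered.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:5000")

# Inference API: a one-shot chat completion
response = client.inference.chat_completion(
    model_id="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Name three uses of RAG."}],
)
print(response.completion_message.content)
```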
## Supported Llama Stack Implementations
### API Providers
| API Provider Builder | Environments | Agents | Inference | Memory | Safety | Telemetry |
|---|---|---|---|---|---|---|
| Meta Reference | Single Node | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| Cerebras | Hosted | | ✔️ | | | |
| Fireworks | Hosted | ✔️ | ✔️ | ✔️ | | |
| AWS Bedrock | Hosted | | ✔️ | | ✔️ | |
| Together | Hosted | ✔️ | ✔️ | | ✔️ | |
| Groq | Hosted | | ✔️ | | | |
| Ollama | Single Node | | ✔️ | | | |
| TGI | Hosted and Single Node | | ✔️ | | | |
| NVIDIA NIM | Hosted and Single Node | | ✔️ | | | |
| Chroma | Single Node | | | ✔️ | | |
| PG Vector | Single Node | | | ✔️ | | |
| PyTorch ExecuTorch | On-device iOS | ✔️ | ✔️ | | | |
| vLLM | Hosted and Single Node | | ✔️ | | | |
### Distributions
A Llama Stack Distribution (or "distro") is a pre-configured bundle of provider implementations for each API component. Distributions make it easy to get started with a specific deployment scenario - you can begin with a local development setup (e.g. Ollama) and seamlessly transition to production (e.g. Fireworks) without changing your application code, as sketched in the example after the table below. Here are some of the distributions we support:
| Distribution | Llama Stack Docker | Start This Distribution |
|---|---|---|
| Meta Reference | llamastack/distribution-meta-reference-gpu | Guide |
| Meta Reference Quantized | llamastack/distribution-meta-reference-quantized-gpu | Guide |
| Cerebras | llamastack/distribution-cerebras | Guide |
| Ollama | llamastack/distribution-ollama | Guide |
| TGI | llamastack/distribution-tgi | Guide |
| Together | llamastack/distribution-together | Guide |
| Fireworks | llamastack/distribution-fireworks | Guide |
| vLLM | llamastack/distribution-remote-vllm | Guide |
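A sketch of what "without changing your application code" means in practice: the client below only reads the server endpoint from an environment variable, so pointing it at an Ollama-based local distro or a Fireworks-based hosted one is purely a deployment change. The `LLAMA_STACK_ENDPOINT` variable name is an illustrative choice here, not a required convention; `INFERENCE_MODEL` matches the variable used by the run configurations above.

```python
import os

from llama_stack_client import LlamaStackClient

# The same application code works against any distribution; only the
# endpoint (and the models registered behind it) changes per deployment.
client = LlamaStackClient(
    base_url=os.environ.get("LLAMA_STACK_ENDPOINT", "http://localhost:5000"),
)

response = client.inference.completion(
    model_id=os.environ["INFERENCE_MODEL"],
    content="Hello from a distribution-agnostic client!",
)
print(response)
```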
## Installation
You have two ways to install this repository:
- Install as a package: You can install the repository directly from PyPI by running the following command:

  ```bash
  pip install llama-stack
  ```

- Install from source: If you prefer to install from the source code, make sure you have conda installed. Then, follow these steps:

  ```bash
  mkdir -p ~/local
  cd ~/local
  git clone git@github.com:meta-llama/llama-stack.git
  conda create -n stack python=3.10
  conda activate stack
  cd llama-stack
  pip install -e .
  ```
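Whichever route you choose, a quick optional way to confirm the package is importable is shown below; this check is a convenience and not part of the official instructions.

```python
# Optional post-install sanity check (not part of the official steps).
from importlib.metadata import version

import llama_stack  # noqa: F401  # fails loudly if the install is broken

print("llama-stack", version("llama-stack"))
```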
## Documentation
Please check out our Documentation page for more details.
- CLI reference
  - Guide using the `llama` CLI to work with Llama models (download, study prompts), and building/starting a Llama Stack distribution.
- Getting Started
  - Quick guide to start a Llama Stack server.
  - Jupyter notebook to walk through how to use simple text and vision inference llama_stack_client APIs.
  - The complete Llama Stack lesson Colab notebook of the new Llama 3.2 course on Deeplearning.ai.
  - A Zero-to-Hero Guide that guides you through all the key components of Llama Stack with code samples.
- Contributing
  - Adding a new API Provider: a walk-through of how to add a new API provider.
## Llama Stack Client SDKs
| Language | Client SDK | Package |
|---|---|---|
| Python | llama-stack-client-python | |
| Swift | llama-stack-client-swift | |
| Node | llama-stack-client-node | |
| Kotlin | llama-stack-client-kotlin | |
Check out our client SDKs for connecting to a Llama Stack server in your preferred language: you can choose from Python, Node, Swift, and Kotlin to quickly build your applications.
You can find more example scripts with client SDKs to talk with the Llama Stack server in our llama-stack-apps repo.
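As a minimal illustration, a first Python SDK session against a locally running server might look like the sketch below. It assumes the server from the examples above is on `localhost:5000` with at least one model registered; field names such as `identifier` follow the 0.1.x client and may differ in other versions.

```python
from llama_stack_client import LlamaStackClient

# Connect to a locally running Llama Stack server.
client = LlamaStackClient(base_url="http://localhost:5000")

# List the models the server knows about, then run a simple completion.
for model in client.models.list():
    print(model.identifier)

response = client.inference.completion(
    model_id="meta-llama/Llama-3.2-3B-Instruct",
    content="Write a one-line haiku about llamas.",
)
print(response)
```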