Llama Stack: from Zero to Hero
Llama-Stack lets you configure your distribution from various providers, so you can focus on going from zero to production quickly.
This guide will walk you through how to build a local distribution, using Ollama as an inference provider.
We also have a set of notebooks walking you through how to use Llama-Stack APIs:
- Inference
- Prompt Engineering
- Chatting with Images
- Tool Calling
- Memory API for RAG
- Safety API
- Agentic API
Below, we will learn how to get started with Ollama as an inference provider. Note that the steps for configuring your provider will vary a little depending on the service; however, the user experience remains the same. This is the power of Llama-Stack.
Prototype locally using Ollama, then deploy to the cloud with your favorite provider or your own deployment. Use any API from any provider while focusing on development.
Ollama Quickstart Guide
This guide will walk you through setting up an end-to-end workflow with Llama Stack and Ollama, enabling you to perform text generation using the Llama3.2-3B-Instruct model. Follow these steps to get started quickly.
If you're looking for more specific topics like tool calling or agent setup, we have a Zero to Hero Guide that covers everything from Tool Calling to Agents in detail. Feel free to skip to the end to explore the advanced topics you're interested in.
If you'd prefer not to set up a local server, explore our notebook on tool calling with the Together API. This guide will show you how to leverage Together.ai's Llama Stack Server API, allowing you to get started with Llama Stack without the need for a locally built and running server.
Table of Contents
- Setup ollama
- Install Dependencies and Set Up Environment
- Build, Configure, and Run Llama Stack
- Run Ollama Model
- Next Steps
Setup ollama
- Download Ollama App:
  - Go to https://ollama.com/download.
  - Download and unzip Ollama-darwin.zip.
  - Run the Ollama application.
- Download the Ollama CLI:
  - Ensure you have the ollama command line tool by downloading and installing it from the same website.
- Start ollama server:
  - Open the terminal and run:
    ollama serve
- Run the model:
  - Open the terminal and run:
    ollama run llama3.2:3b-instruct-fp16
  Note: The supported models for llama stack for now are listed here.
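To confirm that the Ollama server started above is reachable and that the model has been pulled, something like the following sketch may help. It assumes Ollama's default address `http://localhost:11434` and its `/api/tags` endpoint, which lists locally available models; the helper names `model_names` and `fetch_local_models` are hypothetical and not part of Llama Stack.

```python
import json
import urllib.error
import urllib.request
from typing import List, Optional

# Assumption: Ollama listens on its default address, the same one used later in this guide.
OLLAMA_URL = "http://localhost:11434"


def model_names(tags_payload: dict) -> List[str]:
    """Extract model names from the JSON body returned by Ollama's /api/tags."""
    return [model["name"] for model in tags_payload.get("models", [])]


def fetch_local_models(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> Optional[List[str]]:
    """Return the names of locally pulled models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model_names(json.load(resp))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a body that is not valid JSON.
        return None


if __name__ == "__main__":
    models = fetch_local_models()
    if models is None:
        print("Ollama is not reachable; is `ollama serve` running?")
    else:
        print("Local models:", models)
```

If the list does not include `llama3.2:3b-instruct-fp16`, the `ollama run` step has probably not completed yet.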
Install Dependencies and Set Up Environment
- Create a Conda Environment:
  - Create a new Conda environment with Python 3.10:
    conda create -n ollama python=3.10
  - Activate the environment:
    conda activate ollama
- Install ChromaDB:
  - Install chromadb using pip:
    pip install chromadb
- Run ChromaDB:
  - Start the ChromaDB server:
    chroma run --host localhost --port 8000 --path ./my_chroma_data
- Install Llama Stack:
  - Open a new terminal, activate the environment created above, and install llama-stack:
    conda activate ollama
    pip install llama-stack==0.0.53
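Before building the stack, it can be useful to verify that the packages from the steps above landed in the active environment. The sketch below uses only the standard library's `importlib.metadata`; the package list and the helper names `installed_version` and `report` are illustrative assumptions, not part of Llama Stack.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional

# Distribution names taken from the pip commands in this guide.
# "llama-stack-client" is listed because the guide states it is installed with the server.
PACKAGES = ("llama-stack", "llama-stack-client", "chromadb")


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(packages: Iterable[str] = PACKAGES) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version (None when not installed)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    for name, version in report().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Run it inside the activated Conda environment; a `NOT INSTALLED` line usually means pip ran in a different environment.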
Build, Configure, and Run Llama Stack
- Build the Llama Stack:
  - Build the Llama Stack using the ollama template:
    llama stack build --template ollama --image-type conda
After this step, you will see the console output:
Build Successful! Next steps:
   1. Set the environment variables: LLAMASTACK_PORT, OLLAMA_URL, INFERENCE_MODEL, SAFETY_MODEL
   2. `llama stack run /Users/username/.llama/distributions/llamastack-ollama/ollama-run.yaml`
- Set the ENV variables by exporting them in the terminal:
export OLLAMA_URL="http://localhost:11434"
export LLAMA_STACK_PORT=5051
export INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct"
export SAFETY_MODEL="meta-llama/Llama-Guard-3-1B"
- Run the Llama Stack:
  - Run the stack, passing the variables you just exported:
    llama stack run ollama \
      --port $LLAMA_STACK_PORT \
      --env INFERENCE_MODEL=$INFERENCE_MODEL \
      --env SAFETY_MODEL=$SAFETY_MODEL \
      --env OLLAMA_URL=http://localhost:11434
Note: Every time you run a new model with ollama run, you will need to restart the Llama Stack server. Otherwise it won't see the new model.
The server will start and listen on http://localhost:5051.
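A quick preflight check can catch the two most common problems at this point: an unset environment variable, or a server that is not listening yet. The following is a minimal sketch using only the standard library; the ports 11434 and 5051 are the ones this guide uses, and the helper names `missing_vars` and `is_port_open` are hypothetical.

```python
import os
import socket
from typing import Iterable, List, Mapping

# Variables this guide exports before running the stack.
REQUIRED_VARS = ("OLLAMA_URL", "INFERENCE_MODEL", "SAFETY_MODEL")


def missing_vars(names: Iterable[str], environ: Mapping[str, str]) -> List[str]:
    """Return the names that are unset or empty in the given environment."""
    return [name for name in names if not environ.get(name)]


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    missing = missing_vars(REQUIRED_VARS, os.environ)
    if missing:
        print("Unset variables:", ", ".join(missing))
    # Assumption: Ollama on 11434 and Llama Stack on 5051, as stated in this guide.
    for label, port in (("Ollama", 11434), ("Llama Stack", 5051)):
        state = "listening" if is_port_open("localhost", port) else "not reachable"
        print(f"{label} on port {port}: {state}")
```

An open port only shows that something accepts connections there; the curl test in the next section confirms that it is actually the Llama Stack server.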
Testing with curl
After setting up the server, open a new terminal window and verify it's working by sending a POST request using curl:
curl http://localhost:5051/inference/chat_completion \
-H "Content-Type: application/json" \
-d '{
    "model": "Llama3.2-3B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write me a 2-sentence poem about the moon"}
    ],
    "sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512}
}'
You can check the available models with the command llama-stack-client models list.
Expected Output:
{
  "completion_message": {
    "role": "assistant",
    "content": "The moon glows softly in the midnight sky,\nA beacon of wonder, as it catches the eye.",
    "stop_reason": "out_of_tokens",
    "tool_calls": []
  },
  "logprobs": null
}
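If you want to consume this response from a script rather than read it in the terminal, the fields can be pulled out with the standard `json` module. The sketch below parses the sample response shown above; the function name `summarize` is hypothetical, and the field names are taken from that sample only, so other server versions may return a different shape.

```python
import json

# The sample response from this guide, embedded as a string for illustration.
SAMPLE_RESPONSE = """
{
  "completion_message": {
    "role": "assistant",
    "content": "The moon glows softly in the midnight sky,\\nA beacon of wonder, as it catches the eye.",
    "stop_reason": "out_of_tokens",
    "tool_calls": []
  },
  "logprobs": null
}
"""


def summarize(response_text: str) -> dict:
    """Extract the generated text, stop reason, and tool calls from a response body."""
    message = json.loads(response_text)["completion_message"]
    return {
        "text": message["content"],
        "stop_reason": message["stop_reason"],
        "tool_calls": message.get("tool_calls", []),
    }


if __name__ == "__main__":
    summary = summarize(SAMPLE_RESPONSE)
    print(summary["text"])
    print("stop_reason:", summary["stop_reason"])
```

To use it with the live server, pipe the curl output into a file and pass the file contents to `summarize`.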
Testing with Python
You can also interact with the Llama Stack server using a simple Python script. Below is an example:
1. Activate Conda Environment and Install Required Python Packages
The llama-stack-client library offers robust and efficient Python methods for interacting with the Llama Stack server.
conda activate your-llama-stack-conda-env
Note: the client library is installed by default when you install the server library.
2. Create Python Script (test_llama_stack.py)
touch test_llama_stack.py
3. Create a Chat Completion Request in Python
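import os

from llama_stack_client import LlamaStackClient

# Model to query. Assumption: the identifier matches the INFERENCE_MODEL
# value exported earlier in this guide; adjust it if your server lists another name.
MODEL_NAME = os.environ.get("INFERENCE_MODEL", "meta-llama/Llama-3.2-3B-Instruct")

# Initialize the client
client = LlamaStackClient(base_url="http://localhost:5051")

# Create a chat completion request
response = client.inference.chat_completion(
    messages=[
        {"role": "system", "content": "You are a friendly assistant."},
        {"role": "user", "content": "Write a two-sentence poem about llama."}
    ],
    model_id=MODEL_NAME,
)

# Print the response
print(response.completion_message.content)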
4. Run the Python Script
python test_llama_stack.py
Expected Output (illustrative; the exact text depends on the prompt and the model):
The moon glows softly in the midnight sky,
A beacon of wonder, as it catches the eye.
The script above initializes a client that interacts with your local Llama Stack instance.
With these steps, you should have a functional Llama Stack setup capable of generating text using the specified model. For more detailed information and advanced configurations, refer to the documentation listed below.
Next Steps
Explore Other Guides: Dive deeper into specific topics by following these guides:
- Understanding Distribution
- Inference 101
- Local and Cloud Model Toggling 101
- Prompt Engineering
- Chat with Image - LlamaStack Vision API
- Tool Calling: How to and Details
- Memory API: Show Simple In-Memory Retrieval
- Using Safety API in Conversation
- Agents API: Explain Components
Explore Client SDKs: Utilize our client SDKs for various languages to integrate Llama Stack into your applications.
Advanced Configuration: Learn how to customize your Llama Stack distribution by referring to the Building a Llama Stack Distribution guide.
Explore Example Apps: Check out llama-stack-apps for example applications built using Llama Stack.