[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)

# Llama Stack - Building AI Applications

<img src="https://llamastack.github.io/latest/_images/llama-stack.png" alt="drawing" width="500"/>

Get started with Llama Stack in minutes!

[Llama Stack](https://github.com/meta-llama/llama-stack) is a stateful service with REST APIs to support the seamless transition of AI applications across different environments. You can build and test using a local server first and deploy to a hosted endpoint for production.

In this guide, we'll walk through how to build a retrieval-augmented generation (RAG) application locally using Llama Stack with [Ollama](https://ollama.com/)
as the inference [provider](docs/source/providers/index.md#inference) for a Llama model.


## Step 1: Install and set up

### 1.1. Install uv and Ollama

We'll install [uv](https://docs.astral.sh/uv/) to set up the Python virtual environment, along with [colab-xterm](https://github.com/InfuseAI/colab-xterm) for running command-line tools, and [Ollama](https://ollama.com/download) as the inference provider.

In [None]:
%pip install uv llama_stack llama-stack-client

## If running on Colab:
# !pip install colab-xterm
# %load_ext colabxterm

!curl -fsSL https://ollama.com/install.sh | sh

### 1.2. Test inference with Ollama

We'll now launch a terminal and run inference on a Llama model with Ollama to verify that the model is working correctly.

In [None]:
## If running on Colab:
# %xterm

## To be run in the terminal:
# ollama serve &
# ollama run llama3.2:3b --keepalive 60m

If successful, you should see the model respond to a prompt.

...
```
>>> hi
Hello! How can I assist you today?
```
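The same check can be scripted rather than run interactively. The sketch below assumes Ollama's default local address `http://localhost:11434` and its `/api/tags` endpoint, which lists locally pulled models. The helper names `model_is_available` and `check_ollama` are illustrative and not part of any library.

```python
import json
import urllib.request


def model_is_available(tags_payload: dict, model_name: str) -> bool:
    """Return True if model_name appears in an Ollama /api/tags payload."""
    models = tags_payload.get("models", [])
    return any(entry.get("name") == model_name for entry in models)


def check_ollama(model_name: str, base_url: str = "http://localhost:11434") -> bool:
    """Query the local Ollama server and report whether the model is pulled.

    Assumes the default Ollama port; adjust base_url if yours differs.
    """
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_is_available(json.load(resp), model_name)


# Offline demonstration with a hand-written payload in the same shape
sample = {"models": [{"name": "llama3.2:3b"}, {"name": "all-minilm:latest"}]}
print(model_is_available(sample, "llama3.2:3b"))
```

With `ollama serve` running, `check_ollama("llama3.2:3b")` performs the live lookup.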

## Step 2: Run the Llama Stack server

In this step, we start a Llama Stack server locally.

### 2.1. Set up the Llama Stack Server

In [1]:
import os
import subprocess

if "UV_SYSTEM_PYTHON" in os.environ:
  del os.environ["UV_SYSTEM_PYTHON"]

# this command installs all the dependencies needed for the llama stack server with the ollama inference provider
!uv run --with llama-stack llama stack list-deps starter | xargs -L1 uv pip install

def run_llama_stack_server_background():
    log_file = open("llama_stack_server.log", "w")
    process = subprocess.Popen(
        "OLLAMA_URL=http://localhost:11434 uv run --with llama-stack llama stack run starter",
        shell=True,
        stdout=log_file,
        stderr=log_file,
        text=True
    )

    print(f"Starting Llama Stack server with PID: {process.pid}")
    return process

def wait_for_server_to_start():
    import requests
    from requests.exceptions import ConnectionError
    import time

    url = "http://0.0.0.0:8321/v1/health"
    max_retries = 30
    retry_interval = 1

    print("Waiting for server to start", end="")
    for _ in range(max_retries):
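        try:
            response = requests.get(url)
            if response.status_code == 200:
                print("\nServer is ready!")
                return True
        except ConnectionError:
            pass  # server is not accepting connections yet
        # Pause after every unsuccessful attempt, including non-200 responses
        print(".", end="", flush=True)
        time.sleep(retry_interval)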

    print("\nServer failed to start after", max_retries * retry_interval, "seconds")
    return False


# use this helper if needed to kill the server
def kill_llama_stack_server():
    # Kill any existing llama stack server processes
    os.system("ps aux | grep -v grep | grep llama_stack.core.server.server | awk '{print $2}' | xargs kill -9")


Using Python 3.12.12 environment at: /opt/homebrew/Caskroom/miniconda/base/envs/test
Audited 52 packages in 1.56s
Using Python 3.12.12 environment at: /opt/homebrew/Caskroom/miniconda/base/envs/test
Audited 3 packages in 122ms
Using Python 3.12.12 environment at: /opt/homebrew/Caskroom/miniconda/base/envs/test
Audited 3 packages in 197ms
Using Python 3.12.12 environment at: /opt/homebrew/Caskroom/miniconda/base/envs/test
Audited 1 package in 11ms


### 2.2. Start the Llama Stack Server

In [2]:
server_process = run_llama_stack_server_background()
assert wait_for_server_to_start()

Starting Llama Stack server with PID: 20778
Waiting for server to start........
Server is ready!
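
The readiness loop in 2.1 retries at a fixed interval. A variant with a pluggable probe and a hard deadline can be easier to reuse and to test. The sketch below is one way to write it; `wait_until` and its parameters are hypothetical names, not Llama Stack APIs.

```python
import time
from typing import Callable


def wait_until(
    probe: Callable[[], bool],
    timeout: float = 30.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call probe() until it returns True or timeout seconds have elapsed.

    Exceptions raised by probe() are treated as "not ready yet".
    """
    deadline = clock() + timeout
    while True:
        try:
            if probe():
                return True
        except Exception:
            pass  # e.g. connection refused while the server is still booting
        if clock() >= deadline:
            return False
        sleep(interval)


# Offline demonstration: a probe that fails twice, then succeeds
attempts = {"count": 0}


def flaky_probe() -> bool:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("not up yet")
    return True


ready = wait_until(flaky_probe, timeout=5.0, interval=0.0)
print(ready, attempts["count"])
```

With the server from this guide, the probe could be `lambda: requests.get("http://0.0.0.0:8321/v1/health").status_code == 200`.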


## Step 3: RAG Demos - Three Approaches

We'll demonstrate three different approaches to building RAG applications with Llama Stack:
1. **Agent API** - High-level agent with session management
2. **Responses API** - Direct OpenAI-compatible responses
3. **Chat Completions API** - Manual retrieval with explicit control
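
The second approach in this list sends a single request to the OpenAI-compatible `/v1/responses` endpoint. As a rough sketch, the request can reuse the `file_search` tool definition that the agent in Approach 1 is configured with. The helper `build_responses_request` is a hypothetical name, and the final `client.responses.create(**request)` call is an assumption about the client's method shape that is not verified here.

```python
from typing import Any


def build_responses_request(
    model: str, question: str, vector_store_ids: list[str]
) -> dict[str, Any]:
    """Assemble keyword arguments for a Responses-style RAG request."""
    if not vector_store_ids:
        raise ValueError("at least one vector store ID is required")
    return {
        "model": model,
        "input": question,
        "tools": [
            {"type": "file_search", "vector_store_ids": list(vector_store_ids)}
        ],
    }


# "vs_example" is a placeholder; use the ID returned when the store is created
request = build_responses_request(
    "ollama/llama3.2:3b", "How do you do great work?", ["vs_example"]
)
print(request["tools"][0]["type"])

# Assumed call shape (requires a running server and a real vector store ID):
# response = client.responses.create(**request)
```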

### Approach 1: Agent Class (High-level)

In [2]:

# Make sure your llama-stack-client version matches the Llama Stack server version you are running.
from llama_stack_client import Agent, AgentEventLogger, LlamaStackClient
import requests
from io import BytesIO



vector_store_id = "my_demo_vector_db"
client = LlamaStackClient(base_url="http://0.0.0.0:8321")

# Get model - find any Ollama Llama model
models = list(client.models.list())
print(f"Available models: {[m.id for m in models]}")

# Candidate model IDs; the first match in server listing order is used
model_id = None
priority_models = ["ollama/llama3.3:70b","ollama/llama3.2:3b","ollama/llama3.1:8b"]
for m in models:
    if hasattr(m, "custom_metadata") and m.custom_metadata:
        provider_id = m.custom_metadata.get("provider_id")
        model_type = m.custom_metadata.get("model_type")

        # Use the first listed Ollama LLM whose ID appears in priority_models
        if provider_id == "ollama" and model_type == "llm" and m.id.lower() in priority_models:
            model_id = m.id
            print(f"✓ Using model: {model_id}")
            break

if not model_id:
    raise ValueError("No Ollama Llama model found")

# Create vector store
print("\n✓ Downloading and indexing Paul Graham's essay...")
source = "https://www.paulgraham.com/greatwork.html"
response = requests.get(source)
response.raise_for_status()  # fail early if the download did not succeed

# Create a file-like object from the HTML content
file_buffer = BytesIO(response.content)
file_buffer.name = "greatwork.html"

file = client.files.create(
    file=file_buffer,
    purpose='assistants'
)
print(f"✓ File created with ID: {file.id}")

vector_store = client.vector_stores.create(
    name=vector_store_id,
    file_ids=[file.id],
)
print(f"✓ Vector store created with ID: {vector_store.id}")

# Create agent
agent = Agent(
    client,
    model=model_id,
    instructions="You are a helpful assistant",
    tools=[
        {
            "type": "file_search",
            "vector_store_ids": [vector_store.id],  # Use the actual ID, not the name
        }
    ],
)
print("✓ Agent created")

prompt = "How do you do great work?"
print("\nprompt>", prompt)

response = agent.create_turn(
    messages=[{"role": "user", "content": prompt}],
    session_id=agent.create_session("rag_session"),
    stream=True,
)

for log in AgentEventLogger().log(response):
    print(log, end="")

INFO:httpx:HTTP Request: GET http://0.0.0.0:8321/v1/models "HTTP/1.1 200 OK"


Available models: ['bedrock/meta.llama3-1-405b-instruct-v1:0', 'bedrock/meta.llama3-1-70b-instruct-v1:0', 'bedrock/meta.llama3-1-8b-instruct-v1:0', 'ollama/chevalblanc/gpt-4o-mini:latest', 'ollama/nomic-embed-text:latest', 'ollama/llama3.3:70b', 'ollama/llama3.2:3b', 'ollama/all-minilm:l6-v2', 'ollama/llama3.1:8b', 'ollama/llama-guard3:latest', 'ollama/llama-guard3:8b', 'ollama/shieldgemma:27b', 'ollama/shieldgemma:latest', 'ollama/llama3.1:8b-instruct-fp16', 'ollama/all-minilm:latest', 'ollama/llama3.2:3b-instruct-fp16', 'sentence-transformers/nomic-ai/nomic-embed-text-v1.5']
✓ Using model: ollama/llama3.3:70b

✓ Downloading and indexing Paul Graham's essay...


INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/files "HTTP/1.1 200 OK"


✓ File created with ID: file-e1290f8be28245e681bdfa5c40a7e7c4


INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/vector_stores "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/conversations "HTTP/1.1 200 OK"


✓ Vector store created with ID: vs_67efaaf4-ba0d-4037-b816-f73d588e9e4d
✓ Agent created

prompt> How do you do great work?


INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/responses "HTTP/1.1 200 OK"


🤔

🔧 Executing file_search (server-side)...
🤔 To do great work it's essential to decide what to work on and choose something you have a natural aptitude for that you are deeply interested in and offers scope to do great work <|file-e1290f8be28245e681bdfa5c40a7e7c4|>. Develop a habit of working on your own projects and don't let "work" mean something other people tell you to do <|file-e1290f8be28245e681bdfa5c40a7e7c4|>. Seek out the best colleagues as they can encourage you and help bounce ideas off each other and it's better to have one or two great ones than a building full of pretty good ones <|file-e1290f8be28245e681bdfa5c40a7e7c4|>. Husband your morale as it's crucial for doing great work and try to learn about other kinds of work by taking ideas from distant fields if you let them be metaphors <|file-e1290f8be28245e681bdfa5c40a7e7c4|>. Negative examples can also be inspiring so try to learn from things done badly as sometimes it becomes clear what's needed when it's miss
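
The selection loop in the cell above returns the first model in server listing order whose ID is in `priority_models`, so the order of that list does not influence the choice. If the list is meant as a ranking, the lookup can walk the priorities instead. The sketch below shows that alternative; `ModelInfo` is a stand-in for the objects returned by `client.models.list()`, and `pick_model` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelInfo:
    """Stand-in for a model entry with an id and custom_metadata."""
    id: str
    custom_metadata: dict = field(default_factory=dict)


def pick_model(models, priority) -> Optional[str]:
    """Return the highest-priority Ollama LLM ID present in models, else None."""
    # Map lowercased IDs to original IDs for eligible models only
    eligible = {
        m.id.lower(): m.id
        for m in models
        if (m.custom_metadata or {}).get("provider_id") == "ollama"
        and (m.custom_metadata or {}).get("model_type") == "llm"
    }
    for candidate in priority:
        if candidate.lower() in eligible:
            return eligible[candidate.lower()]
    return None


llm = {"provider_id": "ollama", "model_type": "llm"}
available = [
    ModelInfo("ollama/llama3.1:8b", llm),
    ModelInfo("ollama/llama3.2:3b", llm),
    ModelInfo("ollama/all-minilm:latest", {"provider_id": "ollama", "model_type": "embedding"}),
]
priority = ["ollama/llama3.3:70b", "ollama/llama3.2:3b", "ollama/llama3.1:8b"]
chosen = pick_model(available, priority)
print(chosen)
```

Here the 3B model is chosen even though the 8B model is listed first by the server, because it ranks higher in `priority`.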

#### Multi-turn RAG Conversation with Session Management

In [5]:
# Create a new session for multi-turn RAG conversation
session_id = agent.create_session("multi_turn_rag_session")

print("\n" + "="*80)
print("Multi-turn RAG Conversation Demo")
print("="*80)
print("Demonstrating: Session maintains context while agent searches document")
print("="*80)

# Turn 1: Initial question - Agent searches document for relevant information
print("\n[Turn 1] User: What does the document say about curiosity and great work?")
print("(Agent will search the document...)")
response1 = agent.create_turn(
    messages=[{"role": "user", "content": "What does the document say about curiosity and great work?"}],
    session_id=session_id,
    stream=True,  # Use streaming for reliability
)
# Collect the response
response1_text = ""
for log in AgentEventLogger().log(response1):
    response1_text += log
print("\nAssistant:", response1_text[:250] + "...\n")

# Turn 2: Follow-up question using pronouns - Agent remembers the context from Turn 1
print("[Turn 2] User: Why is that important?")
print("(Agent remembers 'that' refers to curiosity from Turn 1 - no need to search again)")
response2 = agent.create_turn(
    messages=[{"role": "user", "content": "Why is that important?"}],
    session_id=session_id,
    stream=True,  # Use streaming for reliability
)
response2_text = ""
for log in AgentEventLogger().log(response2):
    response2_text += log
print("\nAssistant:", response2_text[:250] + "...\n")

# Turn 3: New question on different topic - Agent performs new document search
print("[Turn 3] User: What about the role of ambition?")
print("(New topic - agent will search document again for 'ambition')")
response3 = agent.create_turn(
    messages=[{"role": "user", "content": "What about the role of ambition?"}],
    session_id=session_id,
    stream=True,  # Use streaming for reliability
)
response3_text = ""
for log in AgentEventLogger().log(response3):
    response3_text += log
print("\nAssistant:", response3_text[:250] + "...\n")

# Turn 4: Compare previous topics - Agent uses session memory
print("[Turn 4] User: How do curiosity and ambition work together?")
print("(Agent combines information from Turn 1 and Turn 3 using session context)")
response4 = agent.create_turn(
    messages=[{"role": "user", "content": "How do curiosity and ambition work together?"}],
    session_id=session_id,
    stream=True,  # Use streaming for reliability
)
response4_text = ""
for log in AgentEventLogger().log(response4):
    response4_text += log
print("\nAssistant:", response4_text[:250] + "...\n")

INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/conversations "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/responses "HTTP/1.1 200 OK"



Multi-turn RAG Conversation Demo
Demonstrating: Session maintains context while agent searches document

[Turn 1] User: What does the document say about curiosity and great work?
(Agent will search the document...)


INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/responses "HTTP/1.1 200 OK"



Assistant: 🤔

🔧 Executing file_search (server-side)...
🤔 Curiosity is a key factor in doing great work <|file-e1290f8be28245e681bdfa5c40a7e7c4|>. It drives people to learn and explore new ideas, which can lead to innovative solutions and discoveries <|file-e12...

[Turn 2] User: Why is that important?
(Agent remembers 'that' refers to curiosity from Turn 1 - no need to search again)


INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/responses "HTTP/1.1 200 OK"



Assistant: 🤔

🔧 Executing file_search (server-side)...
🤔 Curiosity plays a crucial role in driving individuals to do great work and make meaningful discoveries <|file-e1290f8be28245e681bdfa5c40a7e7c4|>. It is the key to all four steps in doing great work: choo...

[Turn 3] User: What about the role of ambition?
(New topic - agent will search document again for 'ambition')


INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/responses "HTTP/1.1 200 OK"



Assistant: 🤔

🔧 Executing file_search (server-side)...
🤔 To do great work it's an advantage to be optimistic even though that means you'll risk looking like a fool sometimes <|file-e1290f8be28245e681bdfa5c40a7e7c4|>. One way to avoid intellectual dishonesty is...

[Turn 4] User: How do curiosity and ambition work together?
(Agent combines information from Turn 1 and Turn 3 using session context)

Assistant: 🤔

🔧 Executing file_search (server-side)...
🤔 Curiosity and ambition are closely related as they both drive individuals to achieve great work <|file-e1290f8be28245e681bdfa5c40a7e7c4|>. Developing curiosity is essential for doing great work, and it c...
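
The agent responses above embed citation markers of the form `<|file-...|>` after each supported claim. For display, it can help to strip the markers and list the cited file IDs separately. The sketch below does that with regular expressions. The pattern is inferred from the output shown here, not from a documented format, and both helper names are hypothetical.

```python
import re

# Matches markers such as <|file-e1290f8be28245e681bdfa5c40a7e7c4|>,
# plus any whitespace immediately before them.
CITATION_MARKER = re.compile(r"\s*<\|file-[0-9a-fA-F]+\|>")
CITATION_ID = re.compile(r"<\|(file-[0-9a-fA-F]+)\|>")


def strip_citations(text: str) -> str:
    """Remove citation markers so the prose reads cleanly."""
    return CITATION_MARKER.sub("", text)


def cited_file_ids(text: str) -> list[str]:
    """Return the distinct cited file IDs in order of first appearance."""
    seen: list[str] = []
    for file_id in CITATION_ID.findall(text):
        if file_id not in seen:
            seen.append(file_id)
    return seen


sample = (
    "Curiosity is a key factor in doing great work "
    "<|file-e1290f8be28245e681bdfa5c40a7e7c4|>. It drives people to learn "
    "<|file-e1290f8be28245e681bdfa5c40a7e7c4|>."
)
print(strip_citations(sample))
print(cited_file_ids(sample))
```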



### Approach 3: Chat Completions API

In [6]:
# Step 1: Search vector store explicitly
prompt = "What does paul graham say about curiosity and great work?"
print(f"User Query: {prompt}")
print("\nSearching vector store...")
print(f"Using vector store ID: {vector_store.id}")
search_results = client.vector_stores.search(
    vector_store_id=vector_store.id,  # Use the actual ID, not the name
    query=prompt,
    max_num_results=3,
    rewrite_query=False
)

# Step 2: Extract context from search results
print("Extracting context from search results...")
context_chunks = []
for result in search_results.data:
    if hasattr(result, "content") and result.content:
        for content_item in result.content:
            if hasattr(content_item, "text") and content_item.text:
                context_chunks.append(content_item.text)

context = "\n\n".join(context_chunks)
print(f"Found {len(context_chunks)} relevant chunks\n")

# Step 3: Use Chat Completions with retrieved context
print("Response (Chat Completions API):")
print("="*80)

completion = client.chat.completions.create(
    model=model_id,
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant. Use the provided context to answer the user's question.",
        },
        {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {prompt}\n\nPlease provide a comprehensive answer based on the context above.",
        },
    ],
    temperature=0.7,
)

print(completion.choices[0].message.content)
print("="*80)

INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/vector_stores/vs_67efaaf4-ba0d-4037-b816-f73d588e9e4d/search "HTTP/1.1 200 OK"


User Query: What does paul graham say about curiosity and great work?

Searching vector store...
Using vector store ID: vs_67efaaf4-ba0d-4037-b816-f73d588e9e4d
Extracting context from search results...
Found 3 relevant chunks

Response (Chat Completions API):


INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/chat/completions "HTTP/1.1 200 OK"


According to Paul Graham, curiosity is a crucial factor in doing great work. He emphasizes that curiosity is the best guide for finding something worth working on, and it plays a significant role in all four steps of doing great work: choosing a field, getting to the frontier, noticing gaps, and exploring them.

Graham notes that curiosity is not something that can be commanded, but it can be nurtured and allowed to drive one's efforts. He suggests that curious people are more likely to find the right thing to work on in the first place, as they cast a wide net and are more likely to stumble upon something important.

Graham also highlights the importance of curiosity in overcoming obstacles and staying motivated. He argues that when working on something that sparks genuine curiosity, the work will feel less burdensome, even if it's challenging. This is because curious people are driven by a desire to learn and understand, rather than just seeking external validation or rewards.

Furth
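
Because this approach assembles the prompt by hand, the retrieved context can be capped before it is sent to the model. The sketch below builds the same two-message structure used above while enforcing a character budget. `build_rag_messages` and `max_context_chars` are hypothetical names, and a character count is only a rough proxy for the model's token limit.

```python
def build_rag_messages(chunks, question, max_context_chars=4000):
    """Build chat messages from retrieved chunks, capping total context size.

    Chunks are added in order (assumed most relevant first) until the next
    one would exceed max_context_chars; that chunk and the rest are dropped.
    """
    selected = []
    used = 0
    for chunk in chunks:
        cost = len(chunk) + (2 if selected else 0)  # account for the "\n\n" joiner
        if used + cost > max_context_chars:
            break
        selected.append(chunk)
        used += cost
    context = "\n\n".join(selected)
    return [
        {
            "role": "system",
            "content": "You are a helpful assistant. Use the provided context to answer the user's question.",
        },
        {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}",
        },
    ]


# Three 10-character chunks with a 25-character budget: only two fit
messages = build_rag_messages(["a" * 10, "b" * 10, "c" * 10], "Why?", max_context_chars=25)
print(messages[1]["content"])
```

The returned list can be passed as the `messages` argument to `client.chat.completions.create`.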

## Next Steps

Now you're ready to dive deeper into Llama Stack!
- Explore the [Detailed Tutorial](./detailed_tutorial.md).
- Try the [Getting Started Notebook](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb).
- Browse more [Notebooks on GitHub](https://github.com/meta-llama/llama-stack/tree/main/docs/notebooks).
- Learn about Llama Stack [Concepts](../concepts/index.md).
- Discover how to [Build Llama Stacks](../distributions/index.md).
- Refer to our [References](../references/index.md) for details on the Llama CLI and Python SDK.
- Check out the [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repository for example applications and tutorials.