[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)

# Llama Stack - Building AI Applications

<img src="https://llamastack.github.io/latest/_images/llama-stack.png" alt="drawing" width="500"/>

Get started with Llama Stack in minutes!

[Llama Stack](https://github.com/meta-llama/llama-stack) is a stateful service with REST APIs to support the seamless transition of AI applications across different environments. You can build and test using a local server first and deploy to a hosted endpoint for production.

In this guide, we'll walk through how to build a RAG application locally using Llama Stack with [Ollama](https://ollama.com/)
as the inference [provider](docs/source/providers/index.md#inference) for a Llama model.


## Step 1: Install and set up

### 1.1. Install uv and Ollama

We'll install [uv](https://docs.astral.sh/uv/) to set up the Python virtual environment, along with [colab-xterm](https://github.com/InfuseAI/colab-xterm) for running command-line tools, and [Ollama](https://ollama.com/download) as the inference provider.

In [None]:
%pip install uv llama_stack llama-stack-client

## If running on Colab:
# !pip install colab-xterm
# %load_ext colabxterm

!curl -fsSL https://ollama.com/install.sh | sh

### 1.2. Test inference with Ollama

We'll now launch a terminal and run inference on a Llama model with Ollama to verify that the model is working correctly.

In [None]:
## If running on Colab:
# %xterm

## To be run in the terminal:
# ollama serve &
# ollama run llama3.2:3b --keepalive 60m

If successful, you should see the model respond to a prompt.

...
```
>>> hi
Hello! How can I assist you today?
```
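
The same check can be scripted instead of typed into a terminal. The sketch below is not part of the original walkthrough: it assumes Ollama is listening on its default port 11434 and queries its `/api/tags` endpoint to list the models that have been pulled locally. The helper names `parse_model_names` and `list_ollama_models` are hypothetical.

```python
import json
import urllib.request


def parse_model_names(tags_payload: dict) -> list[str]:
    """Extract model names from the JSON body returned by Ollama's /api/tags."""
    return [entry["name"] for entry in tags_payload.get("models", [])]


def list_ollama_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Return the names of the models known to the local Ollama server.

    Assumption: Ollama is reachable at base_url (its default address).
    Raises OSError (for example urllib.error.URLError) if it is not.
    """
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return parse_model_names(json.load(resp))
```

Once `ollama run llama3.2:3b` has completed, calling `list_ollama_models()` should return a list that includes `llama3.2:3b`. A connection error usually means `ollama serve` is not running.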

## Step 2: Run the Llama Stack server

In this section, we will start a Llama Stack server that runs locally.

### 2.1. Set up the Llama Stack Server

In [1]:
import os
import subprocess

# Unset UV_SYSTEM_PYTHON so that uv does not install into the system Python
os.environ.pop("UV_SYSTEM_PYTHON", None)

# this command installs all the dependencies needed for the llama stack server with the ollama inference provider
!uv run --with llama-stack llama stack list-deps starter | xargs -L1 uv pip install

def run_llama_stack_server_background():
    # Send server output to a log file so it can be inspected if startup fails
    log_file = open("llama_stack_server.log", "w")
    process = subprocess.Popen(
        "OLLAMA_URL=http://localhost:11434 uv run --with llama-stack llama stack run starter",
        shell=True,
        stdout=log_file,
        stderr=log_file,
        text=True
    )

    print(f"Starting Llama Stack server with PID: {process.pid}")
    return process

def wait_for_server_to_start():
    import requests
    from requests.exceptions import ConnectionError
    import time

    url = "http://0.0.0.0:8321/v1/health"
    max_retries = 30
    retry_interval = 1

    print("Waiting for server to start", end="")
    for _ in range(max_retries):
        try:
            response = requests.get(url)
            if response.status_code == 200:
                print("\nServer is ready!")
                return True
        except ConnectionError:
            pass
        # Pause before every retry, whether the connection was refused
        # or the server answered with a non-200 status
        print(".", end="", flush=True)
        time.sleep(retry_interval)

    print("\nServer failed to start after", max_retries * retry_interval, "seconds")
    return False


# use this helper if needed to kill the server
def kill_llama_stack_server():
    # Kill any existing llama stack server processes
    os.system("ps aux | grep -v grep | grep llama_stack.core.server.server | awk '{print $2}' | xargs kill -9")


Using Python 3.12.12 environment at: /opt/homebrew/Caskroom/miniconda/base/envs/test
Audited 52 packages in 1.56s
Using Python 3.12.12 environment at: /opt/homebrew/Caskroom/miniconda/base/envs/test
Audited 3 packages in 122ms
Using Python 3.12.12 environment at: /opt/homebrew/Caskroom/miniconda/base/envs/test
Audited 3 packages in 197ms
Using Python 3.12.12 environment at: /opt/homebrew/Caskroom/miniconda/base/envs/test
Audited 1 package in 11ms


### 2.2. Start the Llama Stack Server

In [2]:
server_process = run_llama_stack_server_background()
assert wait_for_server_to_start()

Starting Llama Stack server with PID: 20778
Waiting for server to start........
Server is ready!
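
If the assertion fails instead, the log file written by `run_llama_stack_server_background()` is the first place to look. Here is a minimal sketch, assuming the server was started from the current working directory so that `llama_stack_server.log` lives there. `tail_log` is a hypothetical helper and not part of Llama Stack.

```python
from collections import deque
from pathlib import Path


def tail_log(path: str = "llama_stack_server.log", lines: int = 20) -> list[str]:
    """Return the last `lines` lines of the server log, or [] if the file is missing."""
    log_path = Path(path)
    if not log_path.exists():
        return []
    with log_path.open("r", errors="replace") as handle:
        # A deque with maxlen keeps only the most recent lines while streaming the file
        return [line.rstrip("\n") for line in deque(handle, maxlen=lines)]


# Print the most recent server output (prints nothing if no log exists yet)
for line in tail_log():
    print(line)
```

Reading only the tail keeps the output short even when the server has been running for a while. Increase `lines` if the relevant error scrolled out of view.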


## Step 3: Run the demo

In [3]:
from llama_stack_client import Agent, AgentEventLogger, LlamaStackClient
import requests

vector_store_id = "my_demo_vector_db"
client = LlamaStackClient(base_url="http://0.0.0.0:8321")

models = client.models.list()

# Select the first LLM served by the Ollama provider
model_id = next(m for m in models if m.model_type == "llm" and m.provider_id == "ollama").identifier


source = "https://www.paulgraham.com/greatwork.html"
response = requests.get(source)
file = client.files.create(
    file=response.content,
    purpose='assistants'
)
vector_store = client.vector_stores.create(
    name=vector_store_id,
    file_ids=[file.id],
)

agent = Agent(
    client,
    model=model_id,
    instructions="You are a helpful assistant",
    tools=[
        {
            "type": "file_search",
            # Reference the store by the ID returned at creation, not by its name
            "vector_store_ids": [vector_store.id],
        }
    ],
)

prompt = "How do you do great work?"
print("prompt>", prompt)

response = agent.create_turn(
    messages=[{"role": "user", "content": prompt}],
    session_id=agent.create_session("rag_session"),
    stream=True,
)

for log in AgentEventLogger().log(response):
    print(log, end="")

INFO:httpx:HTTP Request: GET http://0.0.0.0:8321/v1/models "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/files "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/vector_stores "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/conversations "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/responses "HTTP/1.1 200 OK"


prompt> How do you do great work?
🤔 Doing great work involves a combination of skills, habits, and mindsets. Here are some key principles:

1. **Set Clear Goals**: Start with a clear vision of what you want to achieve. Define specific, measurable, achievable, relevant, and time-bound (SMART) goals.

2. **Plan and Prioritize**: Break your goals into smaller, manageable tasks. Prioritize these tasks based on their importance and urgency.

3. **Focus on Quality**: Aim for high-quality outcomes rather than just finishing tasks. Pay attention to detail, and ensure your work meets or exceeds standards.

4. **Stay Organized**: Keep your workspace, both physical and digital, organized to help you stay focused and efficient.

5. **Manage Your Time**: Use time management techniques such as the Pomodoro Technique, time blocking, or the Eisenhower Box to maximize productivity.

6. **Seek Feedback and Learn**: Regularly seek feedback from peers, mentors, or supervisors. Use constructive criticis
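
The `next(...)` lookup in the demo cell raises a bare `StopIteration` when no matching model is registered, for example if Ollama was not running when the server started. A hedged sketch of a more explicit lookup follows. `pick_model` is a hypothetical helper, the model identifiers in the example are illustrative, and the code assumes only the `model_type`, `provider_id`, and `identifier` attributes already used in the demo.

```python
from types import SimpleNamespace


def pick_model(models, provider_id: str = "ollama", model_type: str = "llm") -> str:
    """Return the identifier of the first model matching the provider and type."""
    models = list(models)  # allow any iterable and make it safe to scan twice
    for model in models:
        if model.model_type == model_type and model.provider_id == provider_id:
            return model.identifier
    available = sorted({f"{m.provider_id}/{m.model_type}" for m in models})
    raise LookupError(
        f"No {model_type!r} model from provider {provider_id!r}; found: {available}"
    )


# Stand-in objects that mimic the attributes used in the demo (illustrative values);
# with a running server you would pass client.models.list() instead.
fake_models = [
    SimpleNamespace(identifier="ollama/all-minilm:l6-v2", model_type="embedding", provider_id="ollama"),
    SimpleNamespace(identifier="ollama/llama3.2:3b", model_type="llm", provider_id="ollama"),
]
print(pick_model(fake_models))
```

In the demo, `model_id = pick_model(client.models.list())` would replace the `next(...)` expression, and the error message lists which provider and model-type pairs the server actually exposes.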

Congratulations! You've successfully built your first RAG application using Llama Stack! 🎉🥳

## Next Steps

Now you're ready to dive deeper into Llama Stack!
- Explore the [Detailed Tutorial](./detailed_tutorial.md).
- Try the [Getting Started Notebook](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb).
- Browse more [Notebooks on GitHub](https://github.com/meta-llama/llama-stack/tree/main/docs/notebooks).
- Learn about Llama Stack [Concepts](../concepts/index.md).
- Discover how to [Build Llama Stacks](../distributions/index.md).
- Refer to our [References](../references/index.md) for details on the Llama CLI and Python SDK.
- Check out the [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repository for example applications and tutorials.