Simplify getting started doc

Focus on just venv
2025-08-05 10:13:05 +00:00 · 2025-03-29 11:40:45 -07:00 · 2025-03-29 11:40:45 -07:00 · 77cac70dd3
commit 77cac70dd3
parent d8a8a734b5
1 changed files with 240 additions and 179 deletions
--- a/docs/source/getting_started/index.md
+++ b/docs/source/getting_started/index.md
@@ -22,115 +22,101 @@ If you do not have ollama, you can install it from [here](https://ollama.com/dow
 ```


-### 2. Pick a client environment
+### 2. Use `uv` to install and run Llama Stack

-Llama Stack has a service-oriented architecture, so every interaction with the Stack happens through an REST interface. You can interact with the Stack in two ways:
-
-* Install the `llama-stack-client` PyPI package and point `LlamaStackClient` to a local or remote Llama Stack server.
-* Or, install the `llama-stack` PyPI package and use the Stack as a library using `LlamaStackAsLibraryClient`.
-
-```{admonition} Note
-:class: tip
-
-The API is **exactly identical** for both clients.
-```
-
-:::{dropdown} Starting up the Llama Stack server
-The Llama Stack server can be configured flexibly so you can mix-and-match various providers for its individual API components -- beyond Inference, these include Vector IO, Agents, Telemetry, Evals, Post Training, etc.
-
-To get started quickly, we provide various container images for the server component that work with different inference providers out of the box. For this guide, we will use `llamastack/distribution-ollama` as the container image. If you'd like to build your own image or customize the configurations, please check out [this guide](../references/index.md).
-
-Lets setup some environment variables that we will use in the rest of the guide.
+Install uv
 ```bash
-export INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct"
-export LLAMA_STACK_PORT=8321
+curl -LsSf https://astral.sh/uv/install.sh | sh
 ```

-Next you can create a local directory to mount into the container’s file system.
+Set up venv
 ```bash
-mkdir -p ~/.llama
+uv venv --python 3.10
+source .venv/bin/activate
 ```
-
-Then you can start the server using the container tool of your choice.  For example, if you are running Docker you can use the following command:
+Install llama stack
 ```bash
-docker run -it \
-  --pull always \
-  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-  -v ~/.llama:/root/.llama \
-  llamastack/distribution-ollama \
-  --port $LLAMA_STACK_PORT \
-  --env INFERENCE_MODEL=$INFERENCE_MODEL \
-  --env OLLAMA_URL=http://host.docker.internal:11434
+uv pip install llama-stack
 ```

-As another example, to start the container with Podman, you can do the same but replace `docker` at the start of the command with `podman`. If you are using `podman` older than `4.7.0`, please also replace `host.docker.internal` in the `OLLAMA_URL` with `host.containers.internal`.
-
-Configuration for this is available at `distributions/ollama/run.yaml`.
-
-```{admonition} Note
-:class: note
-
-Docker containers run in their own isolated network namespaces on Linux. To allow the container to communicate with services running on the host via `localhost`, you need `--network=host`. This makes the container use the host’s network directly so it can connect to Ollama running on `localhost:11434`.
-
-Linux users having issues running the above command should instead try the following:
+Build llama stack for ollama
 ```bash
-docker run -it \
-  --pull always \
-  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-  -v ~/.llama:/root/.llama \
-  --network=host \
-  llamastack/distribution-ollama \
-  --port $LLAMA_STACK_PORT \
-  --env INFERENCE_MODEL=$INFERENCE_MODEL \
-  --env OLLAMA_URL=http://localhost:11434
+llama stack build --template ollama --image-type venv
 ```

-:::
+Run llama stack
+```bash
+# Use the model from ollama. Run `ollama ps` to see if it's still running
+INFERENCE_MODEL=llama3.2:3b-instruct-fp16 \
+    llama stack run ollama --image-type venv
+```

+You will see output like the following:
+```
+...
+INFO:     Application startup complete.
+INFO:     Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
+```
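Before switching to the client, it can help to confirm that the server port is actually accepting connections. The following is a minimal sketch using only the Python standard library; `port_is_open` is a hypothetical helper name, and the port `8321` is taken from the startup log above. It only checks TCP reachability and says nothing about whether the Llama Stack APIs are healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # 8321 is the port shown in the startup log above.
    print("server reachable:", port_is_open("localhost", 8321))
```

If this prints `False`, the server is most likely not running or is listening on a different port.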
+
+Now you can use the llama stack client to run inference and build agents!

 :::{dropdown} Installing the Llama Stack client CLI and SDK

-You can interact with the Llama Stack server using various client SDKs.  Note that you must be using Python 3.10 or newer. We will use the Python SDK which you can install via `conda` or `virtualenv`.
+Open a new terminal and navigate to the same directory you started the server from.

-For `conda`:
+Activate the venv (llama-stack already includes the client package)
 ```bash
-yes | conda create -n stack-client python=3.10
-conda activate stack-client
-pip install llama-stack-client
+source .venv/bin/activate
 ```
-
-For `virtualenv`:
-```bash
-python -m venv stack-client
-source stack-client/bin/activate
-pip install llama-stack-client
-```
-
 Let's use the `llama-stack-client` CLI to check the connectivity to the server.

 ```bash
-$ llama-stack-client configure --endpoint http://localhost:$LLAMA_STACK_PORT
-> Enter the API key (leave empty if no key is needed):
+llama-stack-client configure --endpoint http://localhost:8321 --api-key none
+```
+You will see the following:
+```
 Done! You can now use the Llama Stack Client CLI with endpoint http://localhost:8321
+```

-$ llama-stack-client models list
+List the models
+```
+llama-stack-client models list
+```

+```
 Available Models

-┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┓
-┃ model_type   ┃ identifier                           ┃ provider_resource_id         ┃ metadata  ┃ provider_id ┃
-┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━┩
-│ llm          │ meta-llama/Llama-3.2-3B-Instruct     │ llama3.2:3b-instruct-fp16    │           │ ollama      │
-└──────────────┴──────────────────────────────────────┴──────────────────────────────┴───────────┴─────────────┘
+┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
+┃ model_type      ┃ identifier                          ┃ provider_resource_id                ┃ metadata                                  ┃ provider_id     ┃
+┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
+│ embedding       │ all-MiniLM-L6-v2                    │ all-minilm:latest                   │ {'embedding_dimension': 384.0}            │ ollama          │
+├─────────────────┼─────────────────────────────────────┼─────────────────────────────────────┼───────────────────────────────────────────┼─────────────────┤
+│ llm             │ llama3.2:3b-instruct-fp16           │ llama3.2:3b-instruct-fp16           │                                           │ ollama          │
+└─────────────────┴─────────────────────────────────────┴─────────────────────────────────────┴───────────────────────────────────────────┴─────────────────┘
+
+Total models: 2

-Total models: 1
 ```

 You can test basic Llama inference completion using the CLI too.
 ```bash
-llama-stack-client \
-  inference chat-completion \
-  --message "hello, what model are you?"
+llama-stack-client inference chat-completion --message "tell me a joke"
+```
+```
+ChatCompletionResponse(
+    completion_message=CompletionMessage(
+        content="Here's one:\n\nWhat do you call a fake noodle?\n\nAn impasta!",
+        role='assistant',
+        stop_reason='end_of_turn',
+        tool_calls=[]
+    ),
+    logprobs=None,
+    metrics=[
+        Metric(metric='prompt_tokens', value=14.0, unit=None),
+        Metric(metric='completion_tokens', value=27.0, unit=None),
+        Metric(metric='total_tokens', value=41.0, unit=None)
+    ]
+)
 ```
 :::

@@ -140,41 +126,22 @@ llama-stack-client \

 Here is a simple example to perform chat completions using the SDK.
 ```python
-import os
-import sys
+## lstest.py
+from llama_stack_client import LlamaStackClient

-
-def create_http_client():
-    from llama_stack_client import LlamaStackClient
-
-    return LlamaStackClient(
-        base_url=f"http://localhost:{os.environ['LLAMA_STACK_PORT']}"
-    )
-
-
-def create_library_client(template="ollama"):
-    from llama_stack import LlamaStackAsLibraryClient
-
-    client = LlamaStackAsLibraryClient(template)
-    if not client.initialize():
-        print("llama stack not built properly")
-        sys.exit(1)
-    return client
-
-
-client = (
-    create_library_client()
-)  # or create_http_client() depending on the environment you picked
+client = LlamaStackClient(base_url="http://localhost:8321")

 # List available models
 models = client.models.list()
-print("--- Available models: ---")
-for m in models:
-    print(f"- {m.identifier}")
-print()
+
+# Find the first LLM
+llm = next(m for m in models if m.model_type == 'llm')
+model_id = llm.identifier
+
+print("Model:", model_id)

 response = client.inference.chat_completion(
-    model_id=os.environ["INFERENCE_MODEL"],
+    model_id=model_id,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about coding"},
@@ -183,49 +150,142 @@ response = client.inference.chat_completion(
 print(response.completion_message.content)
 ```

-To run the above example, put the code in a file called `inference.py`, ensure your `conda` or `virtualenv` environment is active, and run the following:
 ```bash
-pip install llama_stack
-llama stack build --template ollama --image-type <conda|venv>
-python inference.py
+python lstest.py
 ```

-### 4. Your first RAG agent
+```
+Model: llama3.2:3b-instruct-fp16
+Here is a haiku about coding:

-Here is an example of a simple RAG (Retrieval Augmented Generation) chatbot agent which can answer questions about TorchTune documentation.
+Lines of code unfold
+Logic flows through digital night
+Beauty in the bits
+```
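Note that `next(...)` without a default raises `StopIteration` when no model of the requested type is registered. A minimal sketch of a more forgiving lookup is shown here; `ModelInfo` is a hypothetical stand-in for the objects returned by `client.models.list()` (only the `identifier` and `model_type` attributes used in this guide are modeled), and `first_model_id` is an illustrative helper, not part of the SDK.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ModelInfo:
    """Hypothetical stand-in for the entries returned by client.models.list()."""

    identifier: str
    model_type: str


def first_model_id(models: Iterable[ModelInfo], model_type: str) -> Optional[str]:
    """Return the identifier of the first model of the given type, or None."""
    match = next((m for m in models if m.model_type == model_type), None)
    return match.identifier if match else None


# Sample data mirroring the `models list` table earlier in this guide.
models = [
    ModelInfo("all-MiniLM-L6-v2", "embedding"),
    ModelInfo("llama3.2:3b-instruct-fp16", "llm"),
]

print(first_model_id(models, "llm"))     # identifier of the first LLM
print(first_model_id(models, "vision"))  # None, since no such model is listed
```

Returning `None` lets the caller print a clear message (for example, a hint to check `ollama ps`) instead of failing with a bare `StopIteration`.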
+
+### 4. Your first agent

 ```python
-import os
+## lsagent.py
+
+from llama_stack_client import LlamaStackClient
+from llama_stack_client import Agent, AgentEventLogger
 import uuid
-from termcolor import cprint

-from llama_stack_client import Agent, AgentEventLogger, RAGDocument
+client = LlamaStackClient(base_url="http://localhost:8321")
+
+models = client.models.list()
+llm = next(m for m in models if m.model_type == 'llm')
+model_id = llm.identifier
+
+agent = Agent(client,
+    model=model_id,
+    instructions="You are a helpful assistant that can answer questions about the Torchtune project."
+)
+
+s_id = agent.create_session(session_name=f"s{uuid.uuid4()}")
+
+# Non-streaming example
+print("Non-streaming ...")
+response = agent.create_turn(
+    messages=[ {
+        "role": "user",
+        "content": "Who are you?"
+    }],
+    session_id=s_id,
+    stream=False
+)
+print("agent>", response.output_message.content)
+
+# Streaming with print helper
+print("Streaming with print helper...")
+stream = agent.create_turn(
+    messages=[ {
+        "role": "user",
+        "content": "Who are you?"
+    }],
+    session_id=s_id,
+    stream=True
+)
+for event in AgentEventLogger().log(stream):
+    event.print()


-def create_http_client():
-    from llama_stack_client import LlamaStackClient
+# Streaming example
+print("Streaming ...")
+stream = agent.create_turn(
+    messages=[ {
+        "role": "user",
+        "content": "Who are you?"
+    }],
+    session_id=s_id,
+    stream=True
+)
+for event in stream:
+    print(event)
+```

-    return LlamaStackClient(
-        base_url=f"http://localhost:{os.environ['LLAMA_STACK_PORT']}"
-    )
+**Run the agent**

+```bash
+python lsagent.py
+```
+Sample output:
+```
+Non-streaming ...
+agent> I'm an AI assistant, and I'll be happy to help with any questions or information you have about the Torchtune project.

-def create_library_client(template="ollama"):
-    from llama_stack import LlamaStackAsLibraryClient
+For those who may not know, Torchtune is a popular open-source music composition tool that allows users to create and share musical compositions using a unique visual interface. It's designed to make music creation more accessible and fun for everyone, regardless of their musical background or experience level.

-    client = LlamaStackAsLibraryClient(template)
-    client.initialize()
-    return client
+What would you like to know about Torchtune? Are you looking for information on how to use the software, tutorials, or perhaps something else?
+Streaming with print helper...
+inference> I am an AI assistant specifically designed to provide information and support related to the Torchtune project. I don't have a personal identity in the classical sense, but I'm here to help answer your questions, provide guidance, and offer assistance with any topics related to Torchtune.

+I've been trained on a vast amount of text data, including documentation, tutorials, and community discussions about Torchtune, which enables me to provide accurate and up-to-date information. My goal is to be helpful and informative, so feel free to ask me anything you'd like to know about Torchtune!
+Streaming ...
+AgentTurnResponseStreamChunk(event=TurnResponseEvent(payload=AgentTurnResponseStepStartPayload(event_type='step_start', step_id='7d40b848-3ba9-419b-86d9-942fd65698e2', step_type='inference', metadata={})))
+AgentTurnResponseStreamChunk(event=TurnResponseEvent(payload=AgentTurnResponseStepProgressPayload(delta=TextDelta(text='I', type='text'), event_type='step_progress', step_id='7d40b848-3ba9-419b-86d9-942fd65698e2', step_type='inference')))
+AgentTurnResponseStreamChunk(event=TurnResponseEvent(payload=AgentTurnResponseStepProgressPayload(delta=TextDelta(text=' am', type='text'), event_type='step_progress', step_id='7d40b848-3ba9-419b-86d9-942fd65698e2', step_type='inference')))
+...
+AgentTurnResponseStreamChunk(event=TurnResponseEvent(payload=AgentTurnResponseStepProgressPayload(delta=TextDelta(text='!', type='text'), event_type='step_progress', step_id='7d40b848-3ba9-419b-86d9-942fd65698e2', step_type='inference')))
+AgentTurnResponseStreamChunk(event=TurnResponseEvent(payload=AgentTurnResponseStepCompletePayload(event_type='step_complete', step_details=InferenceStep(api_model_response=CompletionMessage(content="I am an artificial intelligence language model designed to assist with a wide range of topics, including the Torchtune project. I'm a computer program created through a process called deep learning, which allows me to understand and generate human-like text.\n\nMy primary function is to provide information, answer questions, and engage in conversation to the best of my abilities based on my training data. I don't have personal experiences, emotions, or consciousness like humans do, but I'm designed to be helpful and informative.\n\nIn the context of Torchtune, I can help with topics such as:\n\n* Providing tutorials and guides\n* Answering questions about the software's features and functionality\n* Offering tips and tricks for using Torchtune effectively\n* Discussing music theory and composition concepts related to Torchtune\n\nFeel free to ask me anything about Torchtune or any other topic, and I'll do my best to help!", role='assistant', stop_reason='end_of_turn', tool_calls=[]), step_id='7d40b848-3ba9-419b-86d9-942fd65698e2', step_type='inference', turn_id='2f0921b0-ece7-4d63-bfde-87f0b08a206a', completed_at=datetime.datetime(2025, 3, 29, 18, 32, 12, 976952, tzinfo=TzInfo(UTC)), started_at=datetime.datetime(2025, 3, 29, 18, 32, 4, 840716, tzinfo=TzInfo(UTC))), step_id='7d40b848-3ba9-419b-86d9-942fd65698e2', step_type='inference')))
+AgentTurnResponseStreamChunk(event=TurnResponseEvent(payload=AgentTurnResponseTurnCompletePayload(event_type='turn_complete', turn=Turn(input_messages=[UserMessage(content='Who are you?', role='user', context=None)], output_message=CompletionMessage(content="I am an artificial intelligence language model designed to assist with a wide range of topics, including the Torchtune project. I'm a computer program created through a process called deep learning, which allows me to understand and generate human-like text.\n\nMy primary function is to provide information, answer questions, and engage in conversation to the best of my abilities based on my training data. I don't have personal experiences, emotions, or consciousness like humans do, but I'm designed to be helpful and informative.\n\nIn the context of Torchtune, I can help with topics such as:\n\n* Providing tutorials and guides\n* Answering questions about the software's features and functionality\n* Offering tips and tricks for using Torchtune effectively\n* Discussing music theory and composition concepts related to Torchtune\n\nFeel free to ask me anything about Torchtune or any other topic, and I'll do my best to help!", role='assistant', stop_reason='end_of_turn', tool_calls=[]), session_id='a705b5a1-b9a6-4cf5-a99a-7917cc093755', started_at=datetime.datetime(2025, 3, 29, 18, 32, 4, 840680, tzinfo=TzInfo(UTC)), steps=[InferenceStep(api_model_response=CompletionMessage(content="I am an artificial intelligence language model designed to assist with a wide range of topics, including the Torchtune project. I'm a computer program created through a process called deep learning, which allows me to understand and generate human-like text.\n\nMy primary function is to provide information, answer questions, and engage in conversation to the best of my abilities based on my training data. 
I don't have personal experiences, emotions, or consciousness like humans do, but I'm designed to be helpful and informative.\n\nIn the context of Torchtune, I can help with topics such as:\n\n* Providing tutorials and guides\n* Answering questions about the software's features and functionality\n* Offering tips and tricks for using Torchtune effectively\n* Discussing music theory and composition concepts related to Torchtune\n\nFeel free to ask me anything about Torchtune or any other topic, and I'll do my best to help!", role='assistant', stop_reason='end_of_turn', tool_calls=[]), step_id='7d40b848-3ba9-419b-86d9-942fd65698e2', step_type='inference', turn_id='2f0921b0-ece7-4d63-bfde-87f0b08a206a', completed_at=datetime.datetime(2025, 3, 29, 18, 32, 12, 976952, tzinfo=TzInfo(UTC)), started_at=datetime.datetime(2025, 3, 29, 18, 32, 4, 840716, tzinfo=TzInfo(UTC)))], turn_id='2f0921b0-ece7-4d63-bfde-87f0b08a206a', completed_at=datetime.datetime(2025, 3, 29, 18, 32, 12, 987353, tzinfo=TzInfo(UTC)), output_attachments=[]))))
+```
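The raw streaming output above is verbose, but the text itself arrives in the `step_progress` events as small `delta.text` fragments. The sketch below shows one way to reassemble those fragments into a single string. It runs against fake chunks built with `SimpleNamespace` that only mimic the attribute layout visible in the sample output (`chunk.event.payload.event_type` and `payload.delta.text`); `collect_text` and `fake_chunk` are hypothetical helpers, and the assumption that non-text deltas lack a `text` attribute is mine.

```python
from types import SimpleNamespace as NS


def collect_text(stream) -> str:
    """Concatenate the text deltas from step_progress events in a turn stream."""
    parts = []
    for chunk in stream:
        payload = chunk.event.payload
        if payload.event_type == "step_progress":
            # Assumption: a delta that carries no text simply has no `text` attribute.
            text = getattr(payload.delta, "text", None)
            if text:
                parts.append(text)
    return "".join(parts)


def fake_chunk(event_type, text=None):
    """Build a stand-in chunk with the same attribute layout as the sample output."""
    payload = NS(event_type=event_type)
    if text is not None:
        payload.delta = NS(text=text, type="text")
    return NS(event=NS(payload=payload))


fake_stream = [
    fake_chunk("step_start"),
    fake_chunk("step_progress", "I"),
    fake_chunk("step_progress", " am"),
    fake_chunk("step_progress", "!"),
    fake_chunk("step_complete"),
]

print(collect_text(fake_stream))
```

With a real server, the same function could be passed the object returned by `agent.create_turn(..., stream=True)`, provided the event layout matches the sample output.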

-client = (
-    create_library_client()
-)  # or create_http_client() depending on the environment you picked
+### 5. RAG agent

-# Documents to be used for RAG
-urls = ["chat.rst", "llama3.rst", "memory_optimizations.rst", "lora_finetune.rst"]
+```python
+## rag_agent.py
+
+from llama_stack_client import LlamaStackClient
+from llama_stack_client import Agent, AgentEventLogger
+from llama_stack_client.types import Document
+import uuid
+
+client = LlamaStackClient(base_url="http://localhost:8321")
+
+# Create a vector database instance
+embedlm = next(m for m in client.models.list() if m.model_type == 'embedding')
+embedding_model = embedlm.identifier
+vdb = next(p for p in client.providers.list() if p.api == "vector_io")
+vector_db_id = f"v{uuid.uuid4()}"
+client.vector_dbs.register(
+    provider_id=vdb.provider_id,
+    vector_db_id=vector_db_id,
+    embedding_model=embedding_model,
+)
+
+# Create Documents
+urls = [
+    "memory_optimizations.rst",
+    "chat.rst",
+    "llama3.rst",
+    "datasets.rst",
+    "qat_finetune.rst",
+    "lora_finetune.rst",
+]
 documents = [
-    RAGDocument(
+    Document(
        document_id=f"num-{i}",
        content=f"https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/{url}",
        mime_type="text/plain",
@@ -234,67 +294,68 @@ documents = [
    for i, url in enumerate(urls)
 ]

-vector_providers = [
-    provider for provider in client.providers.list() if provider.api == "vector_io"
-]
-provider_id = vector_providers[0].provider_id  # Use the first available vector provider
-
-# Register a vector database
-vector_db_id = f"test-vector-db-{uuid.uuid4().hex}"
-client.vector_dbs.register(
-    vector_db_id=vector_db_id,
-    provider_id=provider_id,
-    embedding_model="all-MiniLM-L6-v2",
-    embedding_dimension=384,
-)
-
-# Insert the documents into the vector database
+# Insert documents
 client.tool_runtime.rag_tool.insert(
    documents=documents,
    vector_db_id=vector_db_id,
    chunk_size_in_tokens=512,
 )

-rag_agent = Agent(
-    client,
-    model=os.environ["INFERENCE_MODEL"],
-    # Define instructions for the agent ( aka system prompt)
-    instructions="You are a helpful assistant",
-    enable_session_persistence=False,
-    # Define tools available to the agent
-    tools=[
-        {
-            "name": "builtin::rag/knowledge_search",
-            "args": {
-                "vector_db_ids": [vector_db_id],
-            },
-        }
-    ],
-)
-session_id = rag_agent.create_session("test-session")
+# Get the model being served
+llm = next(m for m in client.models.list() if m.model_type == 'llm')
+model = llm.identifier

-user_prompts = [
-    "How to optimize memory usage in torchtune? use the knowledge_search tool to get information.",
+# Create RAG agent
+ragagent = Agent(client,
+    model=model,
+    instructions="You are a helpful assistant that can answer questions about the Torchtune project. Use the RAG tool to answer questions as needed.",
+    tools=[{
+        "name": "builtin::rag",
+        "args": {"vector_db_ids": [vector_db_id]},
+    }],
+)
+
+s_id = ragagent.create_session(
+    session_name=f"s{uuid.uuid4()}"
+)
+
+turns = [
+    "what is torchtune",
+    "tell me about dora"
 ]

-# Run the agent loop by calling the `create_turn` method
-for prompt in user_prompts:
-    cprint(f"User> {prompt}", "green")
-    response = rag_agent.create_turn(
-        messages=[{"role": "user", "content": prompt}],
-        session_id=session_id,
+for t in turns:
+    print("user>", t)
+    stream = ragagent.create_turn(
+        messages=[{
+            "role": "user",
+            "content": t
+        }],
+        session_id=s_id,
+        stream=True
    )
-    for log in AgentEventLogger().log(response):
-        log.print()
+    for chunk in stream:
+        event_type = chunk.event.payload.event_type
+        if event_type == 'step_progress':
+            print(chunk.event.payload.delta.text, end='', flush=True)
 ```
-
-To run the above example, put the code in a file called `rag.py`, ensure your `conda` or `virtualenv` environment is active, and run the following:
-```bash
-pip install llama_stack
-llama stack build --template ollama --image-type <conda|venv>
-python rag.py
 ```
+python rag_agent.py
+```
+Sample output:
+```
+user> what is torchtune
+inference> [knowledge_search(query='TorchTune')]
+tool_execution> Tool:knowledge_search Args:{'query': 'TorchTune'}
+tool_execution> Tool:knowledge_search Response:[TextContentItem(text='knowledge_search tool found 5 chunks:\nBEGIN of knowledge_search tool results.\n', type='text'), TextContentItem(text='Result 1:\nDocument_id:num-1\nContent:  conversational data, :func:`~torchtune.datasets.chat_dataset` seems to be a good fit. ..., type='text'), TextContentItem(text='END of knowledge_search tool results.\n', type='text')]
+inference> Here is a high-level overview of the text:

+**LoRA Finetuning with PyTorch Tune**
+
+PyTorch Tune provides a recipe for LoRA (Low-Rank Adaptation) finetuning, which is a technique to adapt pre-trained models to new tasks. The recipe uses the `lora_finetune_distributed` command.
+...
+Overall, DORA is a powerful reinforcement learning algorithm that can learn complex tasks from human demonstrations. However, it requires careful consideration of the challenges and limitations to achieve optimal results.
+```
 ## Next Steps

 - Learn more about Llama Stack [Concepts](../concepts/index.md)