Simplify getting started doc

Focus on just venv
This commit is contained in:
raghotham 2025-03-29 11:40:45 -07:00 committed by GitHub
parent d8a8a734b5
commit 77cac70dd3
No known key found for this signature in database
GPG key ID: B5690EEEBB952194

View file

@ -22,115 +22,101 @@ If you do not have ollama, you can install it from [here](https://ollama.com/dow
```
### 2. Pick a client environment
### 2. Use `uv` to install and run Llama Stack
Llama Stack has a service-oriented architecture, so every interaction with the Stack happens through an REST interface. You can interact with the Stack in two ways:
* Install the `llama-stack-client` PyPI package and point `LlamaStackClient` to a local or remote Llama Stack server.
* Or, install the `llama-stack` PyPI package and use the Stack as a library using `LlamaStackAsLibraryClient`.
```{admonition} Note
:class: tip
The API is **exactly identical** for both clients.
```
:::{dropdown} Starting up the Llama Stack server
The Llama Stack server can be configured flexibly so you can mix-and-match various providers for its individual API components -- beyond Inference, these include Vector IO, Agents, Telemetry, Evals, Post Training, etc.
To get started quickly, we provide various container images for the server component that work with different inference providers out of the box. For this guide, we will use `llamastack/distribution-ollama` as the container image. If you'd like to build your own image or customize the configurations, please check out [this guide](../references/index.md).
Lets setup some environment variables that we will use in the rest of the guide.
Install uv
```bash
export INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct"
export LLAMA_STACK_PORT=8321
curl -LsSf https://astral.sh/uv/install.sh | sh
```
Next you can create a local directory to mount into the containers file system.
Setup venv
```bash
mkdir -p ~/.llama
uv venv --python 3.10
source .venv/bin/activate
```
Then you can start the server using the container tool of your choice. For example, if you are running Docker you can use the following command:
Install llama stack
```bash
docker run -it \
--pull always \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ~/.llama:/root/.llama \
llamastack/distribution-ollama \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env OLLAMA_URL=http://host.docker.internal:11434
uv pip install llama-stack
```
As another example, to start the container with Podman, you can do the same but replace `docker` at the start of the command with `podman`. If you are using `podman` older than `4.7.0`, please also replace `host.docker.internal` in the `OLLAMA_URL` with `host.containers.internal`.
Configuration for this is available at `distributions/ollama/run.yaml`.
```{admonition} Note
:class: note
Docker containers run in their own isolated network namespaces on Linux. To allow the container to communicate with services running on the host via `localhost`, you need `--network=host`. This makes the container use the hosts network directly so it can connect to Ollama running on `localhost:11434`.
Linux users having issues running the above command should instead try the following:
Build llama stack for ollama
```bash
docker run -it \
--pull always \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ~/.llama:/root/.llama \
--network=host \
llamastack/distribution-ollama \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env OLLAMA_URL=http://localhost:11434
llama stack build --template ollama --image-type venv
```
:::
Run llama stack
```bash
# Use the model from ollama. Run `ollama ps` to see if its still running
INFERENCE_MODEL=llama3.2:3b-instruct-fp16 \
llama stack run ollama --image-type venv
```
You will see the output like below:
```
...
INFO: Application startup complete.
INFO: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
```
Now you can use the llama stack client to run inference and build agents!
:::{dropdown} Installing the Llama Stack client CLI and SDK
You can interact with the Llama Stack server using various client SDKs. Note that you must be using Python 3.10 or newer. We will use the Python SDK which you can install via `conda` or `virtualenv`.
Open a new terminal and navigate to the same directory you started the server from.
For `conda`:
Setup venv (llama-stack already includes the client package)
```bash
yes | conda create -n stack-client python=3.10
conda activate stack-client
pip install llama-stack-client
source .venv/bin/activate
```
For `virtualenv`:
```bash
python -m venv stack-client
source stack-client/bin/activate
pip install llama-stack-client
```
Let's use the `llama-stack-client` CLI to check the connectivity to the server.
```bash
$ llama-stack-client configure --endpoint http://localhost:$LLAMA_STACK_PORT
> Enter the API key (leave empty if no key is needed):
llama-stack-client configure --endpoint http://localhost:$LLAMA_STACK_PORT --api-key none
```
You will see the below:
```
Done! You can now use the Llama Stack Client CLI with endpoint http://localhost:8321
```
$ llama-stack-client models list
List the models
```
llama-stack-client models list
```
```
Available Models
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ model_type ┃ identifier ┃ provider_resource_id ┃ metadata ┃ provider_id ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ llm │ meta-llama/Llama-3.2-3B-Instruct │ llama3.2:3b-instruct-fp16 │ │ ollama │
└──────────────┴──────────────────────────────────────┴──────────────────────────────┴───────────┴─────────────┘
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ model_type ┃ identifier ┃ provider_resource_id ┃ metadata ┃ provider_id ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ embedding │ all-MiniLM-L6-v2 │ all-minilm:latest │ {'embedding_dimension': 384.0} │ ollama │
├─────────────────┼─────────────────────────────────────┼─────────────────────────────────────┼───────────────────────────────────────────┼─────────────────┤
│ llm │ llama3.2:3b-instruct-fp16 │ llama3.2:3b-instruct-fp16 │ │ ollama │
└─────────────────┴─────────────────────────────────────┴─────────────────────────────────────┴───────────────────────────────────────────┴─────────────────┘
Total models: 2
Total models: 1
```
You can test basic Llama inference completion using the CLI too.
```bash
llama-stack-client \
inference chat-completion \
--message "hello, what model are you?"
llama-stack-client inference chat-completion --message "tell me a joke"
```
```
ChatCompletionResponse(
completion_message=CompletionMessage(
content="Here's one:\n\nWhat do you call a fake noodle?\n\nAn impasta!",
role='assistant',
stop_reason='end_of_turn',
tool_calls=[]
),
logprobs=None,
metrics=[
Metric(metric='prompt_tokens', value=14.0, unit=None),
Metric(metric='completion_tokens', value=27.0, unit=None),
Metric(metric='total_tokens', value=41.0, unit=None)
]
)
```
:::
@ -140,41 +126,22 @@ llama-stack-client \
Here is a simple example to perform chat completions using the SDK.
```python
import os
import sys
## lstest.py
from llama_stack_client import LlamaStackClient
def create_http_client():
from llama_stack_client import LlamaStackClient
return LlamaStackClient(
base_url=f"http://localhost:{os.environ['LLAMA_STACK_PORT']}"
)
def create_library_client(template="ollama"):
from llama_stack import LlamaStackAsLibraryClient
client = LlamaStackAsLibraryClient(template)
if not client.initialize():
print("llama stack not built properly")
sys.exit(1)
return client
client = (
create_library_client()
) # or create_http_client() depending on the environment you picked
client = LlamaStackClient(base_url=f"http://localhost:8321")
# List available models
models = client.models.list()
print("--- Available models: ---")
for m in models:
print(f"- {m.identifier}")
print()
# Find the first LLM
llm = next(m for m in models if m.model_type == 'llm')
model_id = llm.identifier
print("Model:", model_id)
response = client.inference.chat_completion(
model_id=os.environ["INFERENCE_MODEL"],
model_id=model_id,
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write a haiku about coding"},
@ -183,49 +150,142 @@ response = client.inference.chat_completion(
print(response.completion_message.content)
```
To run the above example, put the code in a file called `inference.py`, ensure your `conda` or `virtualenv` environment is active, and run the following:
```bash
pip install llama_stack
llama stack build --template ollama --image-type <conda|venv>
python inference.py
python lstest.py
```
### 4. Your first RAG agent
```
Model: llama3.2:3b-instruct-fp16
Here is a haiku about coding:
Here is an example of a simple RAG (Retrieval Augmented Generation) chatbot agent which can answer questions about TorchTune documentation.
Lines of code unfold
Logic flows through digital night
Beauty in the bits
```
### 4. Your first agent
```python
import os
## lsagent.py
from llama_stack_client import LlamaStackClient
from llama_stack_client import Agent, AgentEventLogger
import uuid
from termcolor import cprint
from llama_stack_client import Agent, AgentEventLogger, RAGDocument
client = LlamaStackClient(base_url=f"http://localhost:8321")
models = client.models.list()
llm = next(m for m in models if m.model_type == 'llm')
model_id = llm.identifier
agent = Agent(client,
model=model_id,
instructions="You are a helpful assistant that can answer questions about the Torchtune project."
)
s_id = agent.create_session(session_name=f"s{uuid.uuid4()}")
# Non-streaming example
print("Non-streaming ...")
response = agent.create_turn(
messages=[ {
"role": "user",
"content": "Who are you?"
}],
session_id=s_id,
stream=False
)
print("agent>", response.output_message.content)
# Streamining with print helper
print("Streaming with print helper...")
stream = agent.create_turn(
messages=[ {
"role": "user",
"content": "Who are you?"
}],
session_id=s_id,
stream=True
)
for event in AgentEventLogger().log(stream):
event.print()
def create_http_client():
from llama_stack_client import LlamaStackClient
# Streaming example
print("Streaming ...")
stream = agent.create_turn(
messages=[ {
"role": "user",
"content": "Who are you?"
}],
session_id=s_id,
stream=True
)
for event in stream:
print(event)
```
return LlamaStackClient(
base_url=f"http://localhost:{os.environ['LLAMA_STACK_PORT']}"
)
**Run the agent**
```bash
python lsagent.py
```
Sample output
```
Non-streaming ...
agent> I'm an AI assistant, and I'll be happy to help with any questions or information you have about the Torchtune project.
def create_library_client(template="ollama"):
from llama_stack import LlamaStackAsLibraryClient
For those who may not know, Torchtune is a popular open-source music composition tool that allows users to create and share musical compositions using a unique visual interface. It's designed to make music creation more accessible and fun for everyone, regardless of their musical background or experience level.
client = LlamaStackAsLibraryClient(template)
client.initialize()
return client
What would you like to know about Torchtune? Are you looking for information on how to use the software, tutorials, or perhaps something else?
Streaming with print helper...
inference> I am an AI assistant specifically designed to provide information and support related to the Torchtune project. I don't have a personal identity in the classical sense, but I'm here to help answer your questions, provide guidance, and offer assistance with any topics related to Torchtune.
I've been trained on a vast amount of text data, including documentation, tutorials, and community discussions about Torchtune, which enables me to provide accurate and up-to-date information. My goal is to be helpful and informative, so feel free to ask me anything you'd like to know about Torchtune!
Streaming ...
AgentTurnResponseStreamChunk(event=TurnResponseEvent(payload=AgentTurnResponseStepStartPayload(event_type='step_start', step_id='7d40b848-3ba9-419b-86d9-942fd65698e2', step_type='inference', metadata={})))
AgentTurnResponseStreamChunk(event=TurnResponseEvent(payload=AgentTurnResponseStepProgressPayload(delta=TextDelta(text='I', type='text'), event_type='step_progress', step_id='7d40b848-3ba9-419b-86d9-942fd65698e2', step_type='inference')))
AgentTurnResponseStreamChunk(event=TurnResponseEvent(payload=AgentTurnResponseStepProgressPayload(delta=TextDelta(text=' am', type='text'), event_type='step_progress', step_id='7d40b848-3ba9-419b-86d9-942fd65698e2', step_type='inference')))
...
AgentTurnResponseStreamChunk(event=TurnResponseEvent(payload=AgentTurnResponseStepProgressPayload(delta=TextDelta(text='!', type='text'), event_type='step_progress', step_id='7d40b848-3ba9-419b-86d9-942fd65698e2', step_type='inference')))
AgentTurnResponseStreamChunk(event=TurnResponseEvent(payload=AgentTurnResponseStepCompletePayload(event_type='step_complete', step_details=InferenceStep(api_model_response=CompletionMessage(content="I am an artificial intelligence language model designed to assist with a wide range of topics, including the Torchtune project. I'm a computer program created through a process called deep learning, which allows me to understand and generate human-like text.\n\nMy primary function is to provide information, answer questions, and engage in conversation to the best of my abilities based on my training data. I don't have personal experiences, emotions, or consciousness like humans do, but I'm designed to be helpful and informative.\n\nIn the context of Torchtune, I can help with topics such as:\n\n* Providing tutorials and guides\n* Answering questions about the software's features and functionality\n* Offering tips and tricks for using Torchtune effectively\n* Discussing music theory and composition concepts related to Torchtune\n\nFeel free to ask me anything about Torchtune or any other topic, and I'll do my best to help!", role='assistant', stop_reason='end_of_turn', tool_calls=[]), step_id='7d40b848-3ba9-419b-86d9-942fd65698e2', step_type='inference', turn_id='2f0921b0-ece7-4d63-bfde-87f0b08a206a', completed_at=datetime.datetime(2025, 3, 29, 18, 32, 12, 976952, tzinfo=TzInfo(UTC)), started_at=datetime.datetime(2025, 3, 29, 18, 32, 4, 840716, tzinfo=TzInfo(UTC))), step_id='7d40b848-3ba9-419b-86d9-942fd65698e2', step_type='inference')))
AgentTurnResponseStreamChunk(event=TurnResponseEvent(payload=AgentTurnResponseTurnCompletePayload(event_type='turn_complete', turn=Turn(input_messages=[UserMessage(content='Who are you?', role='user', context=None)], output_message=CompletionMessage(content="I am an artificial intelligence language model designed to assist with a wide range of topics, including the Torchtune project. I'm a computer program created through a process called deep learning, which allows me to understand and generate human-like text.\n\nMy primary function is to provide information, answer questions, and engage in conversation to the best of my abilities based on my training data. I don't have personal experiences, emotions, or consciousness like humans do, but I'm designed to be helpful and informative.\n\nIn the context of Torchtune, I can help with topics such as:\n\n* Providing tutorials and guides\n* Answering questions about the software's features and functionality\n* Offering tips and tricks for using Torchtune effectively\n* Discussing music theory and composition concepts related to Torchtune\n\nFeel free to ask me anything about Torchtune or any other topic, and I'll do my best to help!", role='assistant', stop_reason='end_of_turn', tool_calls=[]), session_id='a705b5a1-b9a6-4cf5-a99a-7917cc093755', started_at=datetime.datetime(2025, 3, 29, 18, 32, 4, 840680, tzinfo=TzInfo(UTC)), steps=[InferenceStep(api_model_response=CompletionMessage(content="I am an artificial intelligence language model designed to assist with a wide range of topics, including the Torchtune project. I'm a computer program created through a process called deep learning, which allows me to understand and generate human-like text.\n\nMy primary function is to provide information, answer questions, and engage in conversation to the best of my abilities based on my training data. I don't have personal experiences, emotions, or consciousness like humans do, but I'm designed to be helpful and informative.\n\nIn the context of Torchtune, I can help with topics such as:\n\n* Providing tutorials and guides\n* Answering questions about the software's features and functionality\n* Offering tips and tricks for using Torchtune effectively\n* Discussing music theory and composition concepts related to Torchtune\n\nFeel free to ask me anything about Torchtune or any other topic, and I'll do my best to help!", role='assistant', stop_reason='end_of_turn', tool_calls=[]), step_id='7d40b848-3ba9-419b-86d9-942fd65698e2', step_type='inference', turn_id='2f0921b0-ece7-4d63-bfde-87f0b08a206a', completed_at=datetime.datetime(2025, 3, 29, 18, 32, 12, 976952, tzinfo=TzInfo(UTC)), started_at=datetime.datetime(2025, 3, 29, 18, 32, 4, 840716, tzinfo=TzInfo(UTC)))], turn_id='2f0921b0-ece7-4d63-bfde-87f0b08a206a', completed_at=datetime.datetime(2025, 3, 29, 18, 32, 12, 987353, tzinfo=TzInfo(UTC)), output_attachments=[]))))
```
client = (
create_library_client()
) # or create_http_client() depending on the environment you picked
### 5. RAG agent
# Documents to be used for RAG
urls = ["chat.rst", "llama3.rst", "memory_optimizations.rst", "lora_finetune.rst"]
```python
## rag_agent.py
from llama_stack_client import LlamaStackClient
from llama_stack_client import Agent, AgentEventLogger
from llama_stack_client.types import Document
import uuid
client = LlamaStackClient(base_url=f"http://localhost:8321")
# Create a vector database instance
embedlm = next(m for m in client.models.list() if m.model_type == 'embedding')
embedding_model = embedlm.identifier
vdb = next(p for p in client.providers.list() if p.api == "vector_io")
vector_db_id = f"v{uuid.uuid4()}"
client.vector_dbs.register(
provider_id=vdb.provider_id,
vector_db_id=vector_db_id,
embedding_model=embedding_model,
)
# Create Documents
urls = [
"memory_optimizations.rst",
"chat.rst",
"llama3.rst",
"datasets.rst",
"qat_finetune.rst",
"lora_finetune.rst",
]
documents = [
RAGDocument(
Document(
document_id=f"num-{i}",
content=f"https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/{url}",
mime_type="text/plain",
@ -234,67 +294,68 @@ documents = [
for i, url in enumerate(urls)
]
vector_providers = [
provider for provider in client.providers.list() if provider.api == "vector_io"
]
provider_id = vector_providers[0].provider_id # Use the first available vector provider
# Register a vector database
vector_db_id = f"test-vector-db-{uuid.uuid4().hex}"
client.vector_dbs.register(
vector_db_id=vector_db_id,
provider_id=provider_id,
embedding_model="all-MiniLM-L6-v2",
embedding_dimension=384,
)
# Insert the documents into the vector database
# Insert documents
client.tool_runtime.rag_tool.insert(
documents=documents,
vector_db_id=vector_db_id,
chunk_size_in_tokens=512,
)
rag_agent = Agent(
client,
model=os.environ["INFERENCE_MODEL"],
# Define instructions for the agent ( aka system prompt)
instructions="You are a helpful assistant",
enable_session_persistence=False,
# Define tools available to the agent
tools=[
{
"name": "builtin::rag/knowledge_search",
"args": {
"vector_db_ids": [vector_db_id],
},
}
],
)
session_id = rag_agent.create_session("test-session")
# Get the model being served
llm = next(m for m in client.models.list() if m.model_type == 'llm')
model = llm.identifier
user_prompts = [
"How to optimize memory usage in torchtune? use the knowledge_search tool to get information.",
# Create RAG agent
ragagent = Agent(client,
model=model,
instructions="You are a helpful assistant that can answer questions about the Torchtune project. Use the RAG tool to answer questions as needed.",
tools=[{
"name": "builtin::rag",
"args": {"vector_db_ids": [vector_db_id]},
}],
)
s_id = ragagent.create_session(
session_name=f"s{uuid.uuid4()}"
)
turns = [
"what is torchtune",
"tell me about dora"
]
# Run the agent loop by calling the `create_turn` method
for prompt in user_prompts:
cprint(f"User> {prompt}", "green")
response = rag_agent.create_turn(
messages=[{"role": "user", "content": prompt}],
session_id=session_id,
for t in turns:
print("user>", t)
stream = ragagent.create_turn(
messages=[{
"role": "user",
"content": t
}],
session_id=s_id,
stream=True
)
for log in AgentEventLogger().log(response):
log.print()
for chunk in stream:
event_type = chunk.event.payload.event_type
if event_type == 'step_progress':
print(chunk.event.payload.delta.text, end='', flush=True)
```
To run the above example, put the code in a file called `rag.py`, ensure your `conda` or `virtualenv` environment is active, and run the following:
```bash
pip install llama_stack
llama stack build --template ollama --image-type <conda|venv>
python rag.py
```
python lsragagent.py
```
Sample output:
```
user> what is torchtune
inference> [knowledge_search(query='TorchTune')]
tool_execution> Tool:knowledge_search Args:{'query': 'TorchTune'}
tool_execution> Tool:knowledge_search Response:[TextContentItem(text='knowledge_search tool found 5 chunks:\nBEGIN of knowledge_search tool results.\n', type='text'), TextContentItem(text='Result 1:\nDocument_id:num-1\nContent: conversational data, :func:`~torchtune.datasets.chat_dataset` seems to be a good fit. ..., type='text'), TextContentItem(text='END of knowledge_search tool results.\n', type='text')]
inference> Here is a high-level overview of the text:
**LoRA Finetuning with PyTorch Tune**
PyTorch Tune provides a recipe for LoRA (Low-Rank Adaptation) finetuning, which is a technique to adapt pre-trained models to new tasks. The recipe uses the `lora_finetune_distributed` command.
...
Overall, DORA is a powerful reinforcement learning algorithm that can learn complex tasks from human demonstrations. However, it requires careful consideration of the challenges and limitations to achieve optimal results.
```
## Next Steps
- Learn more about Llama Stack [Concepts](../concepts/index.md)