docs: Updated docs to show minimal RAG example and some other minor changes (#1935)

# What does this PR do?
Incorporating some feedback into the docs.

- **`docs/source/getting_started/index.md`:**
    - Demo actually does RAG now
    - Simplified the installation command for dependencies.
    - Updated demo script examples to align with the latest API changes.
    - Replaced manual document manipulation with `RAGDocument` for clarity and maintainability.
    - Introduced new logic for model and embedding selection using the Llama Stack Client SDK (condensed into a short sketch after this list).
    - Enhanced examples to showcase proper agent initialization and logging.
- **`docs/source/getting_started/detailed_tutorial.md`:**
    - Updated the section for listing models to use proper code formatting with `bash`.
    - Removed and reorganized the "Run the Demos" section for clarity.
    - Adjusted tab-item structures and added new instructions for the demo scripts.
- **`docs/_static/css/my_theme.css`:**
    - Updated heading styles to include `h2`, `h3`, and `h4` for consistent font weight.
    - Added a new style for `pre` tags to wrap text and break long words; this is particularly useful for rendering long output from generation.
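
For reference, the model and embedding selection mentioned above boils down to the following snippet, condensed from the demo script in the diff below:

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")
models = client.models.list()

# Use the first available LLM for generation and the first embedding model for the vector DB
model_id = next(m for m in models if m.model_type == "llm").identifier
embedding_model = next(m for m in models if m.model_type == "embedding")
embedding_model_id = embedding_model.identifier
embedding_dimension = embedding_model.metadata["embedding_dimension"]
```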

    
## Test Plan
Tested locally. Screenshot for reference:

<img width="1250" alt="Screenshot 2025-04-10 at 10 12 12 PM"
src="https://github.com/user-attachments/assets/ce1c8986-e072-4c6f-a697-ed0d8fb75b34"
/>

---------

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Francisco Arceo 2025-04-11 12:50:36 -06:00 committed by GitHub
parent c1cb6aad11
commit 24d70cedca
3 changed files with 62 additions and 71 deletions

docs/_static/css/my_theme.css

@@ -17,9 +17,13 @@
     display: none;
 }
-h3 {
+h2, h3, h4 {
     font-weight: normal;
 }
 
 html[data-theme="dark"] .rst-content div[class^="highlight"] {
     background-color: #0b0b0b;
 }
+pre {
+    white-space: pre-wrap !important;
+    word-break: break-all;
+}

docs/source/getting_started/detailed_tutorial.md

@@ -173,9 +173,8 @@ You will see the below:
 Done! You can now use the Llama Stack Client CLI with endpoint http://localhost:8321
 ```
 
 #### iii. List Available Models
-List the models
-```
+```bash
 llama-stack-client models list
 Available Models
@@ -190,15 +189,6 @@ Available Models
 Total models: 2
 ```
 
-## Step 4: Run the Demos
-Note that these demos show the [Python Client SDK](../references/python_sdk_reference/index.md).
-Other SDKs are also available, please refer to the [Client SDK](../index.md#client-sdks) list for the complete options.
-
-::::{tab-set}
-
-:::{tab-item} Basic Inference with the CLI
-You can test basic Llama inference completion using the CLI.
-
 ```bash
@@ -221,10 +211,16 @@ ChatCompletionResponse(
     ],
 )
 ```
-:::
-:::{tab-item} Basic Inference with a Script
-Alternatively, you can run inference using the Llama Stack client SDK.
+## Step 4: Run the Demos
+Note that these demos show the [Python Client SDK](../references/python_sdk_reference/index.md).
+Other SDKs are also available, please refer to the [Client SDK](../index.md#client-sdks) list for the complete options.
+
+::::{tab-set}
+
+:::{tab-item} Basic Inference
+Now you can run inference using the Llama Stack client SDK.
+
 ### i. Create the Script
 Create a file `inference.py` and add the following code:
@@ -269,7 +265,7 @@ Beauty in the bits
 :::
 
 :::{tab-item} Build a Simple Agent
-Now we can move beyond simple inference and build an agent that can perform tasks using the Llama Stack server.
+Next we can move beyond simple inference and build an agent that can perform tasks using the Llama Stack server.
 
 ### i. Create the Script
 Create a file `agent.py` and add the following code:
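
Neither `inference.py` nor `agent.py` appears in these hunks. For context, a minimal `inference.py` along the lines these docs use might look like the sketch below; treat it as an illustration rather than the exact file from the repo (the model-selection line and the prompt are assumptions):

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Assumption: at least one LLM is registered with the stack; pick the first one
model_id = next(m for m in client.models.list() if m.model_type == "llm").identifier

response = client.inference.chat_completion(
    model_id=model_id,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about coding."},
    ],
)
print(response.completion_message.content)
```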

docs/source/getting_started/index.md

@@ -12,9 +12,8 @@ as the inference [provider](../providers/index.md#inference) for a Llama Model.
 Install [uv](https://docs.astral.sh/uv/), setup your virtual environment, and run inference on a Llama model with
 [Ollama](https://ollama.com/download).
 
 ```bash
-uv pip install llama-stack aiosqlite faiss-cpu ollama openai datasets opentelemetry-exporter-otlp-proto-http mcp autoevals
+uv pip install llama-stack
 source .venv/bin/activate
 export INFERENCE_MODEL="llama3.2:3b"
 ollama run llama3.2:3b --keepalive 60m
 ```
 
 ## Step 2: Run the Llama Stack Server
 
@@ -24,70 +23,62 @@ INFERENCE_MODEL=llama3.2:3b llama stack build --template ollama --image-type venv
 ## Step 3: Run the Demo
 Now open up a new terminal using the same virtual environment and you can run this demo as a script using `uv run demo_script.py` or in an interactive shell.
 
 ```python
-from termcolor import cprint
-from llama_stack_client.types import Document
-from llama_stack_client import LlamaStackClient
+from llama_stack_client import Agent, AgentEventLogger, RAGDocument, LlamaStackClient
 
+vector_db_id = "my_demo_vector_db"
+client = LlamaStackClient(base_url="http://localhost:8321")
 
-vector_db = "faiss"
-vector_db_id = "test-vector-db"
-model_id = "llama3.2:3b-instruct-fp16"
-query = "Can you give me the arxiv link for Lora Fine Tuning in Pytorch?"
-documents = [
-    Document(
+models = client.models.list()
+
+# Select the first LLM and first embedding models
+model_id = next(m for m in models if m.model_type == "llm").identifier
+embedding_model_id = (
+    em := next(m for m in models if m.model_type == "embedding")
+).identifier
+embedding_dimension = em.metadata["embedding_dimension"]
+
+_ = client.vector_dbs.register(
+    vector_db_id=vector_db_id,
+    embedding_model=embedding_model_id,
+    embedding_dimension=embedding_dimension,
+    provider_id="faiss",
+)
+document = RAGDocument(
     document_id="document_1",
-    content=f"https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/lora_finetune.rst",
-    mime_type="text/plain",
+    content="https://www.paulgraham.com/greatwork.html",
+    mime_type="text/html",
     metadata={},
-    )
-]
-client = LlamaStackClient(base_url="http://localhost:8321")
-client.vector_dbs.register(
-    provider_id=vector_db,
-    vector_db_id=vector_db_id,
-    embedding_model="all-MiniLM-L6-v2",
-    embedding_dimension=384,
 )
 client.tool_runtime.rag_tool.insert(
-    documents=documents,
+    documents=[document],
     vector_db_id=vector_db_id,
     chunk_size_in_tokens=50,
 )
-response = client.tool_runtime.rag_tool.query(
-    vector_db_ids=[vector_db_id],
-    content=query,
+agent = Agent(
+    client,
+    model=model_id,
+    instructions="You are a helpful assistant",
+    tools=[
+        {
+            "name": "builtin::rag/knowledge_search",
+            "args": {"vector_db_ids": [vector_db_id]},
+        }
+    ],
 )
-cprint("" + "-" * 50, "yellow")
-cprint(f"Query> {query}", "red")
-cprint("" + "-" * 50, "yellow")
-for chunk in response.content:
-    cprint(f"Chunk ID> {chunk.text}", "green")
-    cprint("" + "-" * 50, "yellow")
+response = agent.create_turn(
+    messages=[{"role": "user", "content": "How do you do great work?"}],
+    session_id=agent.create_session("rag_session"),
+)
+for log in AgentEventLogger().log(response):
+    log.print()
 ```
 And you should see output like below.
-```
---------------------------------------------------
-Query> Can you give me the arxiv link for Lora Fine Tuning in Pytorch?
---------------------------------------------------
-Chunk ID> knowledge_search tool found 5 chunks:
-BEGIN of knowledge_search tool results.
---------------------------------------------------
-Chunk ID> Result 1:
-Document_id:docum
-Content: .. _lora_finetune_label:
-============================
-Fine-Tuning Llama2 with LoRA
-============================
-This guide will teach you about `LoRA <https://arxiv.org/abs/2106.09685>`_, a
---------------------------------------------------
+```bash
+inference> [knowledge_search(query="What does it mean to do great work")]
+tool_execution> Tool:knowledge_search Args:{'query': 'What does it mean to do great work'}
+tool_execution> Tool:knowledge_search Response:[TextContentItem(text='knowledge_search tool found 5 chunks:\nBEGIN of knowledge_search tool results.\n', type='text'), TextContentItem(text="Result 1:\nDocument_id:docum\nContent: work. Doing great work means doing something important\nso well that you expand people's ideas of what's possible. But\nthere's no threshold for importance. It's a matter of degree, and\noften hard to judge at the time anyway.\n", type='text'), TextContentItem(text='Result 2:\nDocument_id:docum\nContent: [<a name="f1n"><font color=#000000>1</font></a>]\nI don\'t think you could give a precise definition of what\ncounts as great work. Doing great work means doing something important\nso well\n', type='text'), TextContentItem(text="Result 3:\nDocument_id:docum\nContent: . And if so\nyou're already further along than you might realize, because the\nset of people willing to want to is small.<br /><br />The factors in doing great work are factors in the literal,\nmathematical sense, and\n", type='text'), TextContentItem(text="Result 4:\nDocument_id:docum\nContent: \nincreases your morale and helps you do even better work. But this\ncycle also operates in the other direction: if you're not doing\ngood work, that can demoralize you and make it even harder to. Since\nit matters\n", type='text'), TextContentItem(text="Result 5:\nDocument_id:docum\nContent: to try to do\ngreat work. But that's what's going on subconsciously; they shy\naway from the question.<br /><br />So I'm going to pull a sneaky trick on you. Do you want to do great\n", type='text'), TextContentItem(text='END of knowledge_search tool results.\n', type='text')]
 ```
Congratulations! You've successfully built your first RAG application using Llama Stack! 🎉🥳
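
If you want to inspect the retrieved chunks directly instead of streaming the agent's answer, the lower-level RAG tool query that the previous version of this demo used still illustrates the idea. A sketch based on the code removed in this PR, assuming the vector DB registered above:

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Query the vector DB directly, mirroring the rag_tool.query call removed in this PR
response = client.tool_runtime.rag_tool.query(
    vector_db_ids=["my_demo_vector_db"],
    content="How do you do great work?",
)
for chunk in response.content:
    print(chunk.text)
```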