Add documentations for building applications and with some content for agentic loop

2025-06-28 02:53:30 +00:00 · 2024-12-08 14:56:03 -08:00 · 2024-12-08 14:56:03 -08:00 · 1274fa4c0d
commit 1274fa4c0d
parent a29013112f
5 changed files with 424 additions and 16 deletions
--- a/docs/requirements.txt
+++ b/docs/requirements.txt
@ -9,3 +9,4 @@ sphinx-tabs
 sphinx-design
 sphinxcontrib-openapi
 sphinxcontrib-redoc
 sphinxcontrib-mermaid
--- a/docs/source/building_applications/index.md
+++ b/docs/source/building_applications/index.md
@ -1,17 +1,413 @@
-# Building Applications
+# Building AI Applications
-```{admonition} Work in Progress
+Llama Stack provides all the building blocks needed to create sophisticated AI applications. This guide will walk you through how to use these components effectively.
 :class: warning
-## What can you do with the Stack?
+## Basic Inference
- Agents
+The foundation of any AI application is the ability to interact with LLM models. Llama Stack provides a simple interface for both completion and chat-based inference:
-  - what is a turn? session?
+
-  - inference
+```python
-  - memory / RAG; pre-ingesting content or attaching content in a turn
+from llama_stack_client import LlamaStackClient
-  - how does tool calling work
+
-  - can you do evaluation?
+client = LlamaStackClient(base_url="http://localhost:5001")
 # List available models
 models = client.models.list()
 # Simple chat completion
 response = client.inference.chat_completion(
    model_id="Llama3.2-3B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about coding"}
    ]
 )
 print(response.completion_message.content)
 ```
 ## Adding Memory & RAG
 Memory enables your applications to reference and recall information from previous interactions or external documents. Llama Stack's memory system is built around the concept of Memory Banks:
 1. **Vector Memory Banks**: For semantic search and retrieval
 2. **Key-Value Memory Banks**: For structured data storage
 3. **Keyword Memory Banks**: For basic text search
 4. **Graph Memory Banks**: For relationship-based retrieval
 Here's how to set up a vector memory bank for RAG:
 ```python
 # Register a memory bank
 bank_id = "my_documents"
 response = client.memory_banks.register(
    memory_bank_id=bank_id,
    params={
        "memory_bank_type": "vector",
        "embedding_model": "all-MiniLM-L6-v2",
        "chunk_size_in_tokens": 512
    }
 )
 # Insert documents
 documents = [
    {
        "document_id": "doc1",
        "content": "Your document text here",
        "mime_type": "text/plain"
    }
 ]
 client.memory.insert(bank_id, documents)
 # Query documents
 results = client.memory.query(
    bank_id=bank_id,
    query="What do you know about...",
 )
 ```
 ## Implementing Safety Guardrails
 Safety is a critical component of any AI application. Llama Stack provides a Shield system that can be applied at multiple touchpoints:
 ```python
 # Register a safety shield
 shield_id = "content_safety"
 client.shields.register(
    shield_id=shield_id,
    provider_shield_id="llama-guard-basic"
 )
 # Run content through shield
 response = client.safety.run_shield(
    shield_id=shield_id,
    messages=[{"role": "user", "content": "User message here"}]
 )
 if response.violation:
    print(f"Safety violation detected: {response.violation.user_message}")
 ```
 ## Building Agents
 Agents are the heart of complex AI applications. They combine inference, memory, safety, and tool usage into coherent workflows. At its core, an agent follows a sophisticated execution loop that enables multi-step reasoning, tool usage, and safety checks.
 ### The Agent Execution Loop
 Each agent turn follows these key steps:
 1. **Initial Safety Check**: The user's input is first screened through configured safety shields
 2. **Context Retrieval**:
   - If RAG is enabled, the agent queries relevant documents from memory banks
   - For new documents, they are first inserted into the memory bank
   - Retrieved context is augmented to the user's prompt
 3. **Inference Loop**: The agent enters its main execution loop:
   - The LLM receives the augmented prompt (with context and/or previous tool outputs)
   - The LLM generates a response, potentially with tool calls
   - If tool calls are present:
     - Tool inputs are safety-checked
     - Tools are executed (e.g., web search, code execution)
     - Tool responses are fed back to the LLM for synthesis
   - The loop continues until:
     - The LLM provides a final response without tool calls
     - Maximum iterations are reached
     - Token limit is exceeded
 4. **Final Safety Check**: The agent's final response is screened through safety shields
 ```{mermaid}
 sequenceDiagram
    participant U as User
    participant E as Executor
    participant M as Memory Bank
    participant L as LLM
    participant T as Tools
    participant S as Safety Shield
    Note over U,S: Agent Turn Start
    U->>S: 1. Submit Prompt
    activate S
    S->>E: Input Safety Check
    deactivate S
    E->>M: 2.1 Query Context
    M-->>E: 2.2 Retrieved Documents
    loop Inference Loop
        E->>L: 3.1 Augment with Context
        L-->>E: 3.2 Response (with/without tool calls)
        alt Has Tool Calls
            E->>S: Check Tool Input
            S->>T: 4.1 Execute Tool
            T-->>E: 4.2 Tool Response
            E->>L: 5.1 Tool Response
            L-->>E: 5.2 Synthesized Response
        end
        opt Stop Conditions
            Note over E: Break if:
            Note over E: - No tool calls
            Note over E: - Max iterations reached
            Note over E: - Token limit exceeded
        end
    end
    E->>S: Output Safety Check
    S->>U: 6. Final Response
 ```
 Each step in this process can be monitored and controlled through configurations. Here's an example that demonstrates monitoring the agent's execution:
 ```python
 from llama_stack_client.lib.agents.event_logger import EventLogger
 agent_config = AgentConfig(
    model="Llama3.2-3B-Instruct",
    instructions="You are a helpful assistant",
    # Enable both RAG and tool usage
    tools=[
        {
            "type": "memory",
            "memory_bank_configs": [{
                "type": "vector",
                "bank_id": "my_docs"
            }],
            "max_tokens_in_context": 4096
        },
        {
            "type": "code_interpreter",
            "enable_inline_code_execution": True
        }
    ],
    # Configure safety
    input_shields=["content_safety"],
    output_shields=["content_safety"],
    # Control the inference loop
    max_infer_iters=5,
    sampling_params={
        "temperature": 0.7,
        "max_tokens": 2048
    }
 )
 agent = Agent(client, agent_config)
 session_id = agent.create_session("monitored_session")
 # Stream the agent's execution steps
 response = agent.create_turn(
    messages=[{"role": "user", "content": "Analyze this code and run it"}],
    attachments=[{
        "content": "https://raw.githubusercontent.com/example/code.py",
        "mime_type": "text/plain"
    }],
    session_id=session_id
 )
 # Monitor each step of execution
 for log in EventLogger().log(response):
    if log.event.step_type == "memory_retrieval":
        print("Retrieved context:", log.event.retrieved_context)
    elif log.event.step_type == "inference":
        print("LLM output:", log.event.model_response)
    elif log.event.step_type == "tool_execution":
        print("Tool call:", log.event.tool_call)
        print("Tool response:", log.event.tool_response)
    elif log.event.step_type == "shield_call":
        if log.event.violation:
            print("Safety violation:", log.event.violation)
 ```
 This example shows how an agent can: Llama Stack provides a high-level agent framework:
 ```python
 from llama_stack_client.lib.agents.agent import Agent
 from llama_stack_client.types.agent_create_params import AgentConfig
 # Configure an agent
 agent_config = AgentConfig(
    model="Llama3.2-3B-Instruct",
    instructions="You are a helpful assistant",
    tools=[
        {
            "type": "memory",
            "memory_bank_configs": [],
            "query_generator_config": {
                "type": "default",
                "sep": " "
            }
        }
    ],
    input_shields=["content_safety"],
    output_shields=["content_safety"],
    enable_session_persistence=True
 )
 # Create an agent
 agent = Agent(client, agent_config)
 session_id = agent.create_session("my_session")
 # Run agent turns
 response = agent.create_turn(
    messages=[{"role": "user", "content": "Your question here"}],
    session_id=session_id
 )
 ```
 ### Adding Tools to Agents
 Agents can be enhanced with various tools:
 1. **Search**: Web search capabilities through providers like Brave
 2. **Code Interpreter**: Execute code snippets
 3. **RAG**: Memory and document retrieval
 4. **Function Calling**: Custom function execution
 5. **WolframAlpha**: Mathematical computations
 6. **Photogen**: Image generation
 Example of configuring an agent with tools:
 ```python
 agent_config = AgentConfig(
    model="Llama3.2-3B-Instruct",
    tools=[
        {
            "type": "brave_search",
            "api_key": "YOUR_API_KEY",
            "engine": "brave"
        },
        {
            "type": "code_interpreter",
            "enable_inline_code_execution": True
        }
    ],
    tool_choice="auto",
    tool_prompt_format="json"
 )
 ```
 ## Building RAG-Enhanced Agents
 One of the most powerful patterns is combining agents with RAG capabilities. Here's a complete example:
 ```python
 from llama_stack_client.types import Attachment
 # Create attachments from documents
 attachments = [
    Attachment(
        content="https://raw.githubusercontent.com/example/doc.rst",
        mime_type="text/plain"
    )
 ]
 # Configure agent with memory
 agent_config = AgentConfig(
    model="Llama3.2-3B-Instruct",
    instructions="You are a helpful assistant",
    tools=[{
        "type": "memory",
        "memory_bank_configs": [],
        "query_generator_config": {"type": "default", "sep": " "},
        "max_tokens_in_context": 4096,
        "max_chunks": 10
    }],
    enable_session_persistence=True
 )
 agent = Agent(client, agent_config)
 session_id = agent.create_session("rag_session")
 # Initial document ingestion
 response = agent.create_turn(
    messages=[{
        "role": "user",
        "content": "I am providing some documents for reference."
    }],
    attachments=attachments,
    session_id=session_id
 )
 # Query with RAG
 response = agent.create_turn(
    messages=[{
        "role": "user",
        "content": "What are the key topics in the documents?"
    }],
    session_id=session_id
 )
 ```
 ## Testing & Evaluation
 Llama Stack provides built-in tools for evaluating your applications:
 1. **Benchmarking**: Test against standard datasets
 2. **Application Evaluation**: Score your application's outputs
 3. **Custom Metrics**: Define your own evaluation criteria
 Here's how to set up basic evaluation:
 ```python
 # Create an evaluation task
 response = client.eval_tasks.register(
    eval_task_id="my_eval",
    dataset_id="my_dataset",
    scoring_functions=["accuracy", "relevance"]
 )
 # Run evaluation
 job = client.eval.run_eval(
    task_id="my_eval",
    task_config={
        "type": "app",
        "eval_candidate": {
            "type": "agent",
            "config": agent_config
        }
    }
 )
 # Get results
 result = client.eval.job_result(
    task_id="my_eval",
    job_id=job.job_id
 )
 ```
 ## Debugging & Monitoring
 Llama Stack includes comprehensive telemetry for debugging and monitoring your applications:
 1. **Tracing**: Track request flows across components
 2. **Metrics**: Measure performance and usage
 3. **Logging**: Debug issues and track behavior
 The telemetry system supports multiple output formats:
 - OpenTelemetry for visualization in tools like Jaeger
 - SQLite for local storage and querying
 - Console output for development
 Example of querying traces:
 ```python
 # Query traces for a session
 traces = client.telemetry.query_traces(
    attribute_filters=[{
        "key": "session_id",
        "op": "eq",
        "value": session_id
    }]
 )
 # Get detailed span information
 span_tree = client.telemetry.get_span_tree(
    span_id=traces[0].root_span_id
 )
 ```
 For details on how to use the telemetry system to debug your applications, export traces to a dataset, and run evaluations, see the [Telemetry](telemetry) section.
 ```{toctree}
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@ -28,6 +28,7 @@ extensions = [
    "sphinx_tabs.tabs",
    "sphinx_design",
    "sphinxcontrib.redoc",
    "sphinxcontrib.mermaid",
 ]
 myst_enable_extensions = ["colon_fence"]
@ -47,6 +48,7 @@ exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
 myst_enable_extensions = [
    "amsmath",
    "attrs_inline",
    "attrs_block",
    "colon_fence",
    "deflist",
    "dollarmath",
@ -65,6 +67,7 @@ myst_substitutions = {
    "docker_hub": "https://hub.docker.com/repository/docker/llamastack",
 }
 # Copy button settings
 copybutton_prompt_text = "$ "  # for bash prompts
 copybutton_prompt_is_regexp = True
--- a/docs/source/distributions/configuration.md
+++ b/docs/source/distributions/configuration.md
@ -81,6 +81,8 @@ A few things to note:
 - The configuration dictionary is provider-specific. Notice that configuration can reference environment variables (with default values), which are expanded at runtime. When you run a stack server (via docker or via `llama stack run`), you can specify `--env OLLAMA_URL=http://my-server:11434` to override the default value.
 ## Resources
 ```
 Finally, let's look at the `models` section:
 ```yaml
 models:
--- a/docs/source/getting_started/index.md
+++ b/docs/source/getting_started/index.md
@ -19,16 +19,17 @@ export LLAMA_STACK_PORT=5001
 ollama run $OLLAMA_INFERENCE_MODEL --keepalive 60m
 ```
-By default, Ollama keeps the model loaded in memory for 5 minutes which can be too short. We set the `--keepalive` flag to 60 minutes to enspagents/agenure the model remains loaded for sometime.
+By default, Ollama keeps the model loaded in memory for 5 minutes which can be too short. We set the `--keepalive` flag to 60 minutes to ensure the model remains loaded for sometime.
 ### 2. Start the Llama Stack server
 Llama Stack is based on a client-server architecture. It consists of a server which can be configured very flexibly so you can mix-and-match various providers for its individual API components -- beyond Inference, these include Memory, Agents, Telemetry, Evals and so forth.
 To get started quickly, we provide various Docker images for the server component that work with different inference providers out of the box. For this guide, we will use `llamastack/distribution-ollama` as the Docker image.
 ```bash
-docker run \
+docker run -it \
  -it \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ~/.llama:/root/.llama \
  llamastack/distribution-ollama \
@ -42,8 +43,7 @@ Configuration for this is available at `distributions/ollama/run.yaml`.
 ### 3. Use the Llama Stack client SDK
-You can interact with the Llama Stack server using the `llama-stack-client` CLI or via the Python SDK.
+You can interact with the Llama Stack server using various client SDKs. We will use the Python SDK which you can install using:
 ```bash
 pip install llama-stack-client
 ```
@ -123,7 +123,6 @@ async def run_main():
    agent = Agent(client, agent_config)
    session_id = agent.create_session("test-session")
    print(f"Created session_id={session_id} for Agent({agent.agent_id})")
    user_prompts = [
        (
            "I am attaching documentation for Torchtune. Help me answer questions I will ask next.",
@ -154,3 +153,10 @@ if __name__ == "__main__":
 - Learn how to [Build Llama Stacks](../distributions/index.md)
 - See [References](../references/index.md) for more details about the llama CLI and Python SDK
 - For example applications and more detailed tutorials, visit our [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repository.
 ## Thinking out aloud here in terms of what to write in the docs
 - how to get a llama stack server running
 - what are all the different client sdks
 - what are the components of building agents