diff --git a/docs/requirements.txt b/docs/requirements.txt
index c182f41c4..d455cf6b5 100644
--- a/docs/requirements.txt
+++ b/docs/requirements.txt
@@ -9,3 +9,4 @@ sphinx-tabs
 sphinx-design
 sphinxcontrib-openapi
 sphinxcontrib-redoc
+sphinxcontrib-mermaid
diff --git a/docs/source/building_applications/index.md b/docs/source/building_applications/index.md
index 1c333c4a7..6e2062204 100644
--- a/docs/source/building_applications/index.md
+++ b/docs/source/building_applications/index.md
@@ -1,17 +1,413 @@
-# Building Applications
+# Building AI Applications
 
-```{admonition} Work in Progress
-:class: warning
+Llama Stack provides all the building blocks needed to create sophisticated AI applications. This guide will walk you through how to use these components effectively.
 
-## What can you do with the Stack?
+## Basic Inference
 
-- Agents
-  - what is a turn? session?
-  - inference
-  - memory / RAG; pre-ingesting content or attaching content in a turn
-  - how does tool calling work
-  - can you do evaluation?
+The foundation of any AI application is the ability to interact with LLMs. Llama Stack provides a simple interface for both completion and chat-based inference:
+
+```python
+from llama_stack_client import LlamaStackClient
+
+client = LlamaStackClient(base_url="http://localhost:5001")
+
+# List available models
+models = client.models.list()
+
+# Simple chat completion
+response = client.inference.chat_completion(
+    model_id="Llama3.2-3B-Instruct",
+    messages=[
+        {"role": "system", "content": "You are a helpful assistant."},
+        {"role": "user", "content": "Write a haiku about coding"}
+    ]
+)
+print(response.completion_message.content)
 ```
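+
+The same client also covers the plain-completion half of "both completion and chat-based inference". A minimal sketch, assuming the `completion` endpoint takes `model_id` and `content` the way `chat_completion` takes `model_id` and `messages` (check your installed SDK version for the exact signature):
+
+```python
+# Simple text completion (sketch -- parameter names may differ across SDK versions)
+response = client.inference.completion(
+    model_id="Llama3.2-3B-Instruct",
+    content="The capital of France is",
+)
+print(response.content)
+```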
+
+## Adding Memory & RAG
+
+Memory enables your applications to reference and recall information from previous interactions or external documents. Llama Stack's memory system is built around the concept of Memory Banks:
+
+1. **Vector Memory Banks**: For semantic search and retrieval
+2. **Key-Value Memory Banks**: For structured data storage
+3. **Keyword Memory Banks**: For basic text search
+4. **Graph Memory Banks**: For relationship-based retrieval
+
+Here's how to set up a vector memory bank for RAG:
+
+```python
+# Register a memory bank
+bank_id = "my_documents"
+response = client.memory_banks.register(
+    memory_bank_id=bank_id,
+    params={
+        "memory_bank_type": "vector",
+        "embedding_model": "all-MiniLM-L6-v2",
+        "chunk_size_in_tokens": 512
+    }
+)
+
+# Insert documents
+documents = [
+    {
+        "document_id": "doc1",
+        "content": "Your document text here",
+        "mime_type": "text/plain"
+    }
+]
+client.memory.insert(bank_id, documents)
+
+# Query documents
+results = client.memory.query(
+    bank_id=bank_id,
+    query="What do you know about...",
+)
+```
+
+## Implementing Safety Guardrails
+
+Safety is a critical component of any AI application. Llama Stack provides a Shield system that can be applied at multiple touchpoints:
+
+```python
+# Register a safety shield
+shield_id = "content_safety"
+client.shields.register(
+    shield_id=shield_id,
+    provider_shield_id="llama-guard-basic"
+)
+
+# Run content through shield
+response = client.safety.run_shield(
+    shield_id=shield_id,
+    messages=[{"role": "user", "content": "User message here"}]
+)
+
+if response.violation:
+    print(f"Safety violation detected: {response.violation.user_message}")
+```
+
+## Building Agents
+
+Agents are the heart of complex AI applications. They combine inference, memory, safety, and tool usage into coherent workflows. At its core, an agent follows a sophisticated execution loop that enables multi-step reasoning, tool usage, and safety checks.
+
+### The Agent Execution Loop
+
+Each agent turn follows these key steps:
+
+1. **Initial Safety Check**: The user's input is first screened through configured safety shields
+
+2. **Context Retrieval**:
+   - If RAG is enabled, the agent queries relevant documents from memory banks
+   - New documents are first inserted into the memory bank
+   - Retrieved context is added to the user's prompt
+
+3. **Inference Loop**: The agent enters its main execution loop:
+   - The LLM receives the augmented prompt (with context and/or previous tool outputs)
+   - The LLM generates a response, potentially with tool calls
+   - If tool calls are present:
+     - Tool inputs are safety-checked
+     - Tools are executed (e.g., web search, code execution)
+     - Tool responses are fed back to the LLM for synthesis
+   - The loop continues until:
+     - The LLM provides a final response without tool calls
+     - Maximum iterations are reached
+     - The token limit is exceeded
+
+4. **Final Safety Check**: The agent's final response is screened through safety shields
+
+```{mermaid}
+sequenceDiagram
+    participant U as User
+    participant E as Executor
+    participant M as Memory Bank
+    participant L as LLM
+    participant T as Tools
+    participant S as Safety Shield
+
+    Note over U,S: Agent Turn Start
+    U->>S: 1. Submit Prompt
+    activate S
+    S->>E: Input Safety Check
+    deactivate S
+
+    E->>M: 2.1 Query Context
+    M-->>E: 2.2 Retrieved Documents
+
+    loop Inference Loop
+        E->>L: 3.1 Augment with Context
+        L-->>E: 3.2 Response (with/without tool calls)
+
+        alt Has Tool Calls
+            E->>S: Check Tool Input
+            S->>T: 4.1 Execute Tool
+            T-->>E: 4.2 Tool Response
+            E->>L: 5.1 Tool Response
+            L-->>E: 5.2 Synthesized Response
+        end
+
+        opt Stop Conditions
+            Note over E: Break if:
+            Note over E: - No tool calls
+            Note over E: - Max iterations reached
+            Note over E: - Token limit exceeded
+        end
+    end
+
+    E->>S: Output Safety Check
+    S->>U: 6. Final Response
+```
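+
+In code, the same control flow looks roughly like the sketch below. This is conceptual pseudocode of the loop described above, not the actual llama-stack internals: the `llm`, `tools`, `shields`, and `memory` objects are illustrative stand-ins, and the token-limit stop condition is omitted for brevity.
+
+```python
+# Conceptual sketch of a single agent turn (illustrative stand-ins, not real APIs)
+def run_turn(user_message, config, llm, tools, shields, memory):
+    shields.check(config.input_shields, user_message)        # 1. input safety check
+    context = memory.query(user_message)                     # 2. context retrieval
+    prompt = f"{context}\n{user_message}"
+    response = llm.generate(prompt)                          # 3. inference loop
+    for _ in range(config.max_infer_iters):
+        if not response.tool_calls:                          # stop: final answer, no tool calls
+            break
+        shields.check(config.input_shields, response.tool_calls)
+        outputs = tools.execute(response.tool_calls)         # e.g., web search, code execution
+        response = llm.generate(f"{prompt}\n{outputs}")      # feed results back for synthesis
+    shields.check(config.output_shields, response.content)   # 4. final safety check
+    return response
+```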
+
+Each step in this process can be monitored and controlled through the agent configuration. Here's an example that demonstrates monitoring the agent's execution:
+
+```python
+from llama_stack_client.lib.agents.event_logger import EventLogger
+
+agent_config = AgentConfig(
+    model="Llama3.2-3B-Instruct",
+    instructions="You are a helpful assistant",
+    # Enable both RAG and tool usage
+    tools=[
+        {
+            "type": "memory",
+            "memory_bank_configs": [{
+                "type": "vector",
+                "bank_id": "my_docs"
+            }],
+            "max_tokens_in_context": 4096
+        },
+        {
+            "type": "code_interpreter",
+            "enable_inline_code_execution": True
+        }
+    ],
+    # Configure safety
+    input_shields=["content_safety"],
+    output_shields=["content_safety"],
+    # Control the inference loop
+    max_infer_iters=5,
+    sampling_params={
+        "temperature": 0.7,
+        "max_tokens": 2048
+    }
+)
+
+agent = Agent(client, agent_config)
+session_id = agent.create_session("monitored_session")
+
+# Stream the agent's execution steps
+response = agent.create_turn(
+    messages=[{"role": "user", "content": "Analyze this code and run it"}],
+    attachments=[{
+        "content": "https://raw.githubusercontent.com/example/code.py",
+        "mime_type": "text/plain"
+    }],
+    session_id=session_id
+)
+
+# Monitor each step of execution
+for log in EventLogger().log(response):
+    if log.event.step_type == "memory_retrieval":
+        print("Retrieved context:", log.event.retrieved_context)
+    elif log.event.step_type == "inference":
+        print("LLM output:", log.event.model_response)
+    elif log.event.step_type == "tool_execution":
+        print("Tool call:", log.event.tool_call)
+        print("Tool response:", log.event.tool_response)
+    elif log.event.step_type == "shield_call":
+        if log.event.violation:
+            print("Safety violation:", log.event.violation)
+```
+
+This example shows how a single agent turn can combine RAG, tool execution, and safety shields while exposing each step for inspection. For the common case, Llama Stack provides a high-level agent framework:
+
+```python
+from llama_stack_client.lib.agents.agent import Agent
+from llama_stack_client.types.agent_create_params import AgentConfig
+
+# Configure an agent
+agent_config = AgentConfig(
+    model="Llama3.2-3B-Instruct",
+    instructions="You are a helpful assistant",
+    tools=[
+        {
+            "type": "memory",
+            "memory_bank_configs": [],
+            "query_generator_config": {
+                "type": "default",
+                "sep": " "
+            }
+        }
+    ],
+    input_shields=["content_safety"],
+    output_shields=["content_safety"],
+    enable_session_persistence=True
+)
+
+# Create an agent
+agent = Agent(client, agent_config)
+session_id = agent.create_session("my_session")
+
+# Run agent turns
+response = agent.create_turn(
+    messages=[{"role": "user", "content": "Your question here"}],
+    session_id=session_id
+)
+```
+
+### Adding Tools to Agents
+
+Agents can be enhanced with various tools:
+
+1. **Search**: Web search capabilities through providers like Brave
+2. **Code Interpreter**: Execute code snippets
+3. **RAG**: Memory and document retrieval
+4. **Function Calling**: Custom function execution
+5. **WolframAlpha**: Mathematical computations
+6. **Photogen**: Image generation
+
+Example of configuring an agent with built-in tools:
+
+```python
+agent_config = AgentConfig(
+    model="Llama3.2-3B-Instruct",
+    tools=[
+        {
+            "type": "brave_search",
+            "api_key": "YOUR_API_KEY",
+            "engine": "brave"
+        },
+        {
+            "type": "code_interpreter",
+            "enable_inline_code_execution": True
+        }
+    ],
+    tool_choice="auto",
+    tool_prompt_format="json"
+)
+```
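+
+Custom function calling (item 4 above) follows the same configuration pattern. A rough sketch, assuming a `function_call` tool type whose `parameters` use a `param_type`/`description`/`required` shape -- the function name and schema below are hypothetical, so verify the exact field names against the AgentConfig reference for your version:
+
+```python
+# Declare a custom function the LLM may call (hypothetical function and schema)
+agent_config = AgentConfig(
+    model="Llama3.2-3B-Instruct",
+    tools=[
+        {
+            "type": "function_call",
+            "function_name": "get_ticker_price",  # hypothetical example function
+            "description": "Look up the latest price for a stock ticker",
+            "parameters": {
+                "ticker": {
+                    "param_type": "str",
+                    "description": "Stock ticker symbol, e.g. META",
+                    "required": True
+                }
+            }
+        }
+    ],
+    tool_choice="auto"
+)
+```
+
+The declaration only tells the model what it may call; executing the function client-side requires wiring a matching tool implementation into the `Agent` (see the llama-stack-apps repository for complete custom-tool examples).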
+
+## Building RAG-Enhanced Agents
+
+One of the most powerful patterns is combining agents with RAG capabilities. Here's a complete example:
+
+```python
+from llama_stack_client.types import Attachment
+
+# Create attachments from documents
+attachments = [
+    Attachment(
+        content="https://raw.githubusercontent.com/example/doc.rst",
+        mime_type="text/plain"
+    )
+]
+
+# Configure agent with memory
+agent_config = AgentConfig(
+    model="Llama3.2-3B-Instruct",
+    instructions="You are a helpful assistant",
+    tools=[{
+        "type": "memory",
+        "memory_bank_configs": [],
+        "query_generator_config": {"type": "default", "sep": " "},
+        "max_tokens_in_context": 4096,
+        "max_chunks": 10
+    }],
+    enable_session_persistence=True
+)
+
+agent = Agent(client, agent_config)
+session_id = agent.create_session("rag_session")
+
+# Initial document ingestion
+response = agent.create_turn(
+    messages=[{
+        "role": "user",
+        "content": "I am providing some documents for reference."
+    }],
+    attachments=attachments,
+    session_id=session_id
+)
+
+# Query with RAG
+response = agent.create_turn(
+    messages=[{
+        "role": "user",
+        "content": "What are the key topics in the documents?"
+    }],
+    session_id=session_id
+)
+```
+
+## Testing & Evaluation
+
+Llama Stack provides built-in tools for evaluating your applications:
+
+1. **Benchmarking**: Test against standard datasets
+2. **Application Evaluation**: Score your application's outputs
+3. **Custom Metrics**: Define your own evaluation criteria
+
+Here's how to set up basic evaluation:
+
+```python
+# Create an evaluation task
+response = client.eval_tasks.register(
+    eval_task_id="my_eval",
+    dataset_id="my_dataset",
+    scoring_functions=["accuracy", "relevance"]
+)
+
+# Run evaluation
+job = client.eval.run_eval(
+    task_id="my_eval",
+    task_config={
+        "type": "app",
+        "eval_candidate": {
+            "type": "agent",
+            "config": agent_config
+        }
+    }
+)
+
+# Get results
+result = client.eval.job_result(
+    task_id="my_eval",
+    job_id=job.job_id
+)
+```
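+
+For quick, one-off scoring of outputs you already have (the "Application Evaluation" case above), the scoring API can be called directly. A minimal sketch -- the row column names and the exact `score` signature are assumptions here, so verify them against the scoring reference for your version:
+
+```python
+# Score pre-generated rows directly (sketch; column names are illustrative)
+response = client.scoring.score(
+    input_rows=[{
+        "input_query": "What is the capital of France?",
+        "generated_answer": "Paris",
+        "expected_answer": "Paris"
+    }],
+    scoring_functions=["accuracy"]
+)
+```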
+
+## Debugging & Monitoring
+
+Llama Stack includes comprehensive telemetry for debugging and monitoring your applications:
+
+1. **Tracing**: Track request flows across components
+2. **Metrics**: Measure performance and usage
+3. **Logging**: Debug issues and track behavior
+
+The telemetry system supports multiple output formats:
+
+- OpenTelemetry for visualization in tools like Jaeger
+- SQLite for local storage and querying
+- Console output for development
+
+Example of querying traces:
+
+```python
+# Query traces for a session
+traces = client.telemetry.query_traces(
+    attribute_filters=[{
+        "key": "session_id",
+        "op": "eq",
+        "value": session_id
+    }]
+)
+
+# Get detailed span information
+span_tree = client.telemetry.get_span_tree(
+    span_id=traces[0].root_span_id
+)
+```
+
+For details on how to use the telemetry system to debug your applications, export traces to a dataset, and run evaluations, see the [Telemetry](telemetry) section.
+
+```{toctree}
diff --git a/docs/source/conf.py b/docs/source/conf.py
index b657cddff..2a9e3d17c 100644
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -28,6 +28,7 @@ extensions = [
     "sphinx_tabs.tabs",
     "sphinx_design",
     "sphinxcontrib.redoc",
+    "sphinxcontrib.mermaid",
 ]
 
 myst_enable_extensions = ["colon_fence"]
@@ -47,6 +48,7 @@ exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
 myst_enable_extensions = [
     "amsmath",
     "attrs_inline",
+    "attrs_block",
     "colon_fence",
     "deflist",
     "dollarmath",
@@ -65,6 +67,7 @@ myst_substitutions = {
     "docker_hub": "https://hub.docker.com/repository/docker/llamastack",
 }
 
+
 # Copy button settings
 copybutton_prompt_text = "$ "  # for bash prompts
 copybutton_prompt_is_regexp = True
diff --git a/docs/source/distributions/configuration.md b/docs/source/distributions/configuration.md
index abf7d16ed..6fee67936 100644
--- a/docs/source/distributions/configuration.md
+++ b/docs/source/distributions/configuration.md
@@ -81,6 +81,8 @@ A few things to note:
 - The configuration dictionary is provider-specific. Notice that configuration can reference environment variables (with default values), which are expanded at runtime. When you run a stack server (via docker or via `llama stack run`), you can specify `--env OLLAMA_URL=http://my-server:11434` to override the default value.
 ## Resources
+```
+
 Finally, let's look at the `models` section:
 
 ```yaml
 models:
diff --git a/docs/source/getting_started/index.md b/docs/source/getting_started/index.md
index bae31e8c4..c6227db99 100644
--- a/docs/source/getting_started/index.md
+++ b/docs/source/getting_started/index.md
@@ -19,16 +19,17 @@ export LLAMA_STACK_PORT=5001
 ollama run $OLLAMA_INFERENCE_MODEL --keepalive 60m
 ```
 
-By default, Ollama keeps the model loaded in memory for 5 minutes which can be too short. We set the `--keepalive` flag to 60 minutes to enspagents/agenure the model remains loaded for sometime.
+By default, Ollama keeps the model loaded in memory for 5 minutes, which can be too short. We set the `--keepalive` flag to 60 minutes to ensure the model remains loaded for some time.
 
 ### 2. Start the Llama Stack server
 
 Llama Stack is based on a client-server architecture. It consists of a server which can be configured very flexibly so you can mix-and-match various providers for its individual API components -- beyond Inference, these include Memory, Agents, Telemetry, Evals and so forth.
 
+To get started quickly, we provide various Docker images for the server component that work with different inference providers out of the box. For this guide, we will use `llamastack/distribution-ollama` as the Docker image.
+
 ```bash
-docker run \
-  -it \
+docker run -it \
   -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
   -v ~/.llama:/root/.llama \
   llamastack/distribution-ollama \
@@ -42,8 +43,7 @@ Configuration for this is available at `distributions/ollama/run.yaml`.
 
 ### 3. Use the Llama Stack client SDK
 
-You can interact with the Llama Stack server using the `llama-stack-client` CLI or via the Python SDK.
-
+You can interact with the Llama Stack server using various client SDKs. We will use the Python SDK, which you can install using:
 ```bash
 pip install llama-stack-client
 ```
@@ -123,7 +123,6 @@ async def run_main():
     agent = Agent(client, agent_config)
     session_id = agent.create_session("test-session")
-    print(f"Created session_id={session_id} for Agent({agent.agent_id})")
 
     user_prompts = [
         (
            "I am attaching documentation for Torchtune. Help me answer questions I will ask next.",
@@ -154,3 +153,10 @@ if __name__ == "__main__":
 - Learn how to [Build Llama Stacks](../distributions/index.md)
 - See [References](../references/index.md) for more details about the llama CLI and Python SDK
 - For example applications and more detailed tutorials, visit our [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repository.
+
+
+## Thinking out loud here in terms of what to write in the docs
+
+- how to get a llama stack server running
+- what are all the different client sdks
+- what are the components of building agents