doc enhancements, converted md into jupyter, reorganize files

This commit is contained in:
Justin Lee 2024-11-05 13:12:30 -08:00
parent 0f08f77565
commit ecad16b904
13 changed files with 450 additions and 113 deletions

@@ -1,111 +0,0 @@
# Getting Started with Llama Stack
This guide will walk you through the steps to set up an end-to-end workflow with Llama Stack. It focuses on building a Llama Stack distribution and starting up a Llama Stack server. See our [documentation](../README.md) for more on Llama Stack's capabilities, or visit [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main) for example apps.
## Installation
The `llama` CLI tool helps you manage the Llama toolchain & agentic systems. After installing the `llama-stack` package, the `llama` command should be available in your path.
You can install this repository in two ways:
1. **Install as a package**:
Install directly from [PyPI](https://pypi.org/project/llama-stack/) with:
```bash
pip install llama-stack
```
2. **Install from source**:
Follow these steps to install from the source code:
```bash
mkdir -p ~/local
cd ~/local
git clone git@github.com:meta-llama/llama-stack.git
conda create -n stack python=3.10
conda activate stack
cd llama-stack
$CONDA_PREFIX/bin/pip install -e .
```
Refer to the [CLI Reference](./cli_reference.md) for details on Llama CLI commands.
## Starting Up Llama Stack Server
There are two ways to start the Llama Stack server:
1. **Using Docker**:
We provide a pre-built Docker image of Llama Stack, available in the [distributions](../distributions/) folder.
> **Note:** For GPU inference, set an environment variable pointing to the local directory that holds your model checkpoints.
```bash
export LLAMA_CHECKPOINT_DIR=~/.llama
```
Download Llama models with:
```
llama download --model-id Llama3.1-8B-Instruct
```
Start a Docker container with:
```bash
cd llama-stack/distributions/meta-reference-gpu
docker run -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./run.yaml:/root/my-run.yaml --gpus=all distribution-meta-reference-gpu --yaml_config /root/my-run.yaml
```
**Tip:** For remote providers, use `docker compose up` with scripts in the [distributions folder](../distributions/).
2. **Build->Configure->Run via Conda**:
For development, build a Llama Stack distribution from scratch.
**`llama stack build`**
Enter build information interactively:
```bash
llama stack build
```
**`llama stack configure`**
Run `llama stack configure <name>` using the name from the build step.
```bash
llama stack configure my-local-stack
```
**`llama stack run`**
Start the server with:
```bash
llama stack run my-local-stack
```
## Testing with Client
After setup, test the server with a client:
```bash
cd /path/to/llama-stack
conda activate <env>
python -m llama_stack.apis.inference.client localhost 5000
```
You can also send a POST request:
```bash
curl http://localhost:5000/inference/chat_completion \
-H "Content-Type: application/json" \
-d '{
"model": "Llama3.1-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write me a 2-sentence poem about the moon"}
],
"sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512}
}'
```
For testing safety, run:
```bash
python -m llama_stack.apis.safety.client localhost 5000
```
Check our client SDKs for various languages: [Python](https://github.com/meta-llama/llama-stack-client-python), [Node](https://github.com/meta-llama/llama-stack-client-node), [Swift](https://github.com/meta-llama/llama-stack-client-swift), and [Kotlin](https://github.com/meta-llama/llama-stack-client-kotlin).
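As a minimal sketch of the Python SDK (assuming the `llama-stack-client` package is installed and a distribution is running on `localhost:5000`), a chat completion request looks like this:
```python
from llama_stack_client import LlamaStackClient
from llama_stack_client.types import UserMessage

# Connect to the running Llama Stack distribution
client = LlamaStackClient(base_url='http://localhost:5000')

response = client.inference.chat_completion(
    messages=[UserMessage(content='Write me a 2-sentence poem about the moon', role='user')],
    model='Llama3.1-8B-Instruct',
)
print(response.completion_message.content)
```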
## Advanced Guides
For more on custom Llama Stack distributions, refer to our [Building a Llama Stack Distribution](./building_distro.md) guide.

@@ -0,0 +1,247 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "c1e7571c",
"metadata": {},
"source": [
"# Llama Stack Inference Guide\n",
"\n",
"This document provides instructions on how to use Llama Stack's `chat_completion` function for generating text using the `Llama3.2-11B-Vision-Instruct` model. Before you begin, please ensure Llama Stack is installed and set up by following the [Getting Started Guide](https://llama-stack.readthedocs.io/en/latest/).\n",
"\n",
"### Table of Contents\n",
"1. [Quickstart](#quickstart)\n",
"2. [Building Effective Prompts](#building-effective-prompts)\n",
"3. [Conversation Loop](#conversation-loop)\n",
"4. [Conversation History](#conversation-history)\n",
"5. [Streaming Responses](#streaming-responses)\n"
]
},
{
"cell_type": "markdown",
"id": "414301dc",
"metadata": {},
"source": [
"## Quickstart\n",
"\n",
"This section walks through each step to set up and make a simple text generation request.\n",
"\n",
"### 1. Set Up the Client\n",
"\n",
"Begin by importing the necessary components from Llama Stacks client library:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7a573752",
"metadata": {},
"outputs": [],
"source": [
"from llama_stack_client import LlamaStackClient\n",
"from llama_stack_client.types import SystemMessage, UserMessage\n",
"\n",
"client = LlamaStackClient(base_url='http://localhost:5000')"
]
},
{
"cell_type": "markdown",
"id": "86366383",
"metadata": {},
"source": [
"### 2. Create a Chat Completion Request\n",
"\n",
"Use the `chat_completion` function to define the conversation context. Each message you include should have a specific role and content:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "77c29dba",
"metadata": {},
"outputs": [],
"source": [
"response = client.inference.chat_completion(\n",
" messages=[\n",
" SystemMessage(content='You are a friendly assistant.', role='system'),\n",
" UserMessage(content='Write a two-sentence poem about llama.', role='user')\n",
" ],\n",
" model='Llama3.2-11B-Vision-Instruct',\n",
")\n",
"\n",
"print(response.completion_message.content)"
]
},
{
"cell_type": "markdown",
"id": "e5f16949",
"metadata": {},
"source": [
"## Building Effective Prompts\n",
"\n",
"Effective prompt creation (often called 'prompt engineering') is essential for quality responses. Here are best practices for structuring your prompts to get the most out of the Llama Stack model:\n",
"\n",
"1. **System Messages**: Use `SystemMessage` to set the model's behavior. This is similar to providing top-level instructions for tone, format, or specific behavior.\n",
" - **Example**: `SystemMessage(content='You are a friendly assistant that explains complex topics simply.')`\n",
"2. **User Messages**: Define the task or question you want to ask the model with a `UserMessage`. The clearer and more direct you are, the better the response.\n",
" - **Example**: `UserMessage(content='Explain recursion in programming in simple terms.')`\n",
"\n",
"### Sample Prompt"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5c6812da",
"metadata": {},
"outputs": [],
"source": [
"response = client.inference.chat_completion(\n",
" messages=[\n",
" SystemMessage(content='You are shakespeare.', role='system'),\n",
" UserMessage(content='Write a two-sentence poem about llama.', role='user')\n",
" ],\n",
" model='Llama3.2-11B-Vision-Instruct',\n",
")\n",
"\n",
"print(response.completion_message.content)"
]
},
{
"cell_type": "markdown",
"id": "c8690ef0",
"metadata": {},
"source": [
"## Conversation Loop\n",
"\n",
"To create a continuous conversation loop, where users can input multiple messages in a session, use the following structure. This example runs an asynchronous loop, ending when the user types 'exit,' 'quit,' or 'bye.'"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "02211625",
"metadata": {},
"outputs": [],
"source": [
"import asyncio\n",
"from llama_stack_client import LlamaStackClient\n",
"from llama_stack_client.types import UserMessage\n",
"from termcolor import cprint\n",
"\n",
"client = LlamaStackClient(base_url='http://localhost:5000')\n",
"\n",
"async def chat_loop():\n",
" while True:\n",
" user_input = input('User> ')\n",
" if user_input.lower() in ['exit', 'quit', 'bye']:\n",
" cprint('Ending conversation. Goodbye!', 'yellow')\n",
" break\n",
"\n",
" message = UserMessage(content=user_input, role='user')\n",
" response = client.inference.chat_completion(\n",
" messages=[message],\n",
" model='Llama3.2-11B-Vision-Instruct',\n",
" )\n",
" cprint(f'> Response: {response.completion_message.content}', 'cyan')\n",
"\n",
"asyncio.run(chat_loop())"
]
},
{
"cell_type": "markdown",
"id": "8cf0d555",
"metadata": {},
"source": [
"## Conversation History\n",
"\n",
"Maintaining a conversation history allows the model to retain context from previous interactions. Use a list to accumulate messages, enabling continuity throughout the chat session."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9496f75c",
"metadata": {},
"outputs": [],
"source": [
"async def chat_loop():\n",
" conversation_history = []\n",
" while True:\n",
" user_input = input('User> ')\n",
" if user_input.lower() in ['exit', 'quit', 'bye']:\n",
" cprint('Ending conversation. Goodbye!', 'yellow')\n",
" break\n",
"\n",
" user_message = UserMessage(content=user_input, role='user')\n",
" conversation_history.append(user_message)\n",
"\n",
" response = client.inference.chat_completion(\n",
" messages=conversation_history,\n",
" model='Llama3.2-11B-Vision-Instruct',\n",
" )\n",
" cprint(f'> Response: {response.completion_message.content}', 'cyan')\n",
"\n",
" assistant_message = UserMessage(content=response.completion_message.content, role='user')\n",
" conversation_history.append(assistant_message)\n",
"\n",
"asyncio.run(chat_loop())"
]
},
{
"cell_type": "markdown",
"id": "03fcf5e0",
"metadata": {},
"source": [
"## Streaming Responses\n",
"\n",
"Llama Stack offers a `stream` parameter in the `chat_completion` function, which allows partial responses to be returned progressively as they are generated. This can enhance user experience by providing immediate feedback without waiting for the entire response to be processed.\n",
"\n",
"### Example: Streaming Responses"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d119026e",
"metadata": {},
"outputs": [],
"source": [
"import asyncio\n",
"from llama_stack_client import LlamaStackClient\n",
"from llama_stack_client.lib.inference.event_logger import EventLogger\n",
"from llama_stack_client.types import UserMessage\n",
"from termcolor import cprint\n",
"\n",
"async def run_main(stream: bool = True):\n",
" client = LlamaStackClient(base_url='http://localhost:5000')\n",
"\n",
" message = UserMessage(\n",
" content='hello world, write me a 2 sentence poem about the moon', role='user'\n",
" )\n",
" print(f'User>{message.content}', 'green')\n",
"\n",
" response = client.inference.chat_completion(\n",
" messages=[message],\n",
" model='Llama3.2-11B-Vision-Instruct',\n",
" stream=stream,\n",
" )\n",
"\n",
" if not stream:\n",
" cprint(f'> Response: {response}', 'cyan')\n",
" else:\n",
" async for log in EventLogger().log(response):\n",
" log.print()\n",
"\n",
" models_response = client.models.list()\n",
" print(models_response)\n",
"\n",
"if __name__ == '__main__':\n",
" asyncio.run(run_main())"
]
}
],
"metadata": {},
"nbformat": 4,
"nbformat_minor": 5
}

@@ -0,0 +1,201 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "a0ed972d",
"metadata": {},
"source": [
"# Switching between Local and Cloud Model with Llama Stack\n",
"\n",
"This guide provides a streamlined setup to switch between local and cloud clients for text generation with Llama Stacks `chat_completion` API. This setup enables automatic fallback to a cloud instance if the local client is unavailable.\n",
"\n",
"### Pre-requisite\n",
"Before you begin, please ensure Llama Stack is installed and the distribution is set up by following the [Getting Started Guide](https://llama-stack.readthedocs.io/en/latest/). You will need to run two distributions, a local and a cloud distribution, for this demo to work.\n",
"\n",
"### Implementation"
]
},
{
"cell_type": "markdown",
"id": "df89cff7",
"metadata": {},
"source": [
"#### 1. Set Up Local and Cloud Clients\n",
"\n",
"Initialize both clients, specifying the `base_url` for each instance. In this case, we have the local distribution running on `http://localhost:5000` and the cloud distribution running on `http://localhost:5001`.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f868dfe",
"metadata": {},
"outputs": [],
"source": [
"from llama_stack_client import LlamaStackClient\n",
"\n",
"# Configure local and cloud clients\n",
"local_client = LlamaStackClient(base_url='http://localhost:5000')\n",
"cloud_client = LlamaStackClient(base_url='http://localhost:5001')"
]
},
{
"cell_type": "markdown",
"id": "894689c1",
"metadata": {},
"source": [
"#### 2. Client Selection with Fallback\n",
"\n",
"The `select_client` function checks if the local client is available using a lightweight `/health` check. If the local client is unavailable, it automatically switches to the cloud client.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ff0c8277",
"metadata": {},
"outputs": [],
"source": [
"import httpx\n",
"from termcolor import cprint\n",
"\n",
"async def select_client() -> LlamaStackClient:\n",
" \"\"\"Use local client if available; otherwise, switch to cloud client.\"\"\"\n",
" try:\n",
" async with httpx.AsyncClient() as http_client:\n",
" response = await http_client.get(f'{local_client.base_url}/health')\n",
" if response.status_code == 200:\n",
" cprint('Using local client.', 'yellow')\n",
" return local_client\n",
" except httpx.RequestError:\n",
" pass\n",
" cprint('Local client unavailable. Switching to cloud client.', 'yellow')\n",
" return cloud_client"
]
},
{
"cell_type": "markdown",
"id": "9ccfe66f",
"metadata": {},
"source": [
"#### 3. Generate a Response\n",
"\n",
"After selecting the client, you can generate text using `chat_completion`. This example sends a sample prompt to the model and prints the response.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5e19cc20",
"metadata": {},
"outputs": [],
"source": [
"from llama_stack_client.types import UserMessage\n",
"\n",
"async def get_llama_response(stream: bool = True):\n",
" client = await select_client() # Selects the available client\n",
" message = UserMessage(content='hello world, write me a 2 sentence poem about the moon', role='user')\n",
" cprint(f'User> {message.content}', 'green')\n",
"\n",
" response = client.inference.chat_completion(\n",
" messages=[message],\n",
" model='Llama3.2-11B-Vision-Instruct',\n",
" stream=stream,\n",
" )\n",
"\n",
" if not stream:\n",
" cprint(f'> Response: {response}', 'cyan')\n",
" else:\n",
" # Stream tokens progressively\n",
" async for log in EventLogger().log(response):\n",
" log.print()"
]
},
{
"cell_type": "markdown",
"id": "6edf5e57",
"metadata": {},
"source": [
"#### 4. Run the Asynchronous Response Generation\n",
"\n",
"Use `asyncio.run()` to execute `get_llama_response` in an asynchronous event loop.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c10f487e",
"metadata": {},
"outputs": [],
"source": [
"import asyncio\n",
"\n",
"# Initiate the response generation process\n",
"asyncio.run(get_llama_response())"
]
},
{
"cell_type": "markdown",
"id": "56aa9a09",
"metadata": {},
"source": [
"### Complete code\n",
"Summing it up, here's the complete code for local-cloud model implementation with Llama Stack:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d9fd74ff",
"metadata": {},
"outputs": [],
"source": [
"import asyncio\n",
"import httpx\n",
"from llama_stack_client import LlamaStackClient\n",
"from llama_stack_client.lib.inference.event_logger import EventLogger\n",
"from llama_stack_client.types import UserMessage\n",
"from termcolor import cprint\n",
"\n",
"local_client = LlamaStackClient(base_url='http://localhost:5000')\n",
"cloud_client = LlamaStackClient(base_url='http://localhost:5001')\n",
"\n",
"async def select_client() -> LlamaStackClient:\n",
" try:\n",
" async with httpx.AsyncClient() as http_client:\n",
" response = await http_client.get(f'{local_client.base_url}/health')\n",
" if response.status_code == 200:\n",
" cprint('Using local client.', 'yellow')\n",
" return local_client\n",
" except httpx.RequestError:\n",
" pass\n",
" cprint('Local client unavailable. Switching to cloud client.', 'yellow')\n",
" return cloud_client\n",
"\n",
"async def get_llama_response(stream: bool = True):\n",
" client = await select_client()\n",
" message = UserMessage(\n",
" content='hello world, write me a 2 sentence poem about the moon', role='user'\n",
" )\n",
" cprint(f'User> {message.content}', 'green')\n",
"\n",
" response = client.inference.chat_completion(\n",
" messages=[message],\n",
" model='Llama3.2-11B-Vision-Instruct',\n",
" stream=stream,\n",
" )\n",
"\n",
" if not stream:\n",
" cprint(f'> Response: {response}', 'cyan')\n",
" else:\n",
" async for log in EventLogger().log(response):\n",
" log.print()\n",
"\n",
"asyncio.run(get_llama_response())"
]
}
],
"metadata": {},
"nbformat": 4,
"nbformat_minor": 5
}

@@ -1,7 +1,7 @@
# Llama Stack Text Generation Guide
# Llama Stack Inference Guide
This document provides instructions on how to use Llama Stack's `chat_completion` function for generating text using the `Llama3.2-11B-Vision-Instruct` model. Before you begin, please ensure Llama Stack is installed and set up by following the [Getting Started Guide](https://llama-stack.readthedocs.io/en/latest/).
### Table of Contents
1. [Quickstart](#quickstart)