diff --git a/docs/source/comprehensive-start.md b/docs/source/comprehensive-start.md deleted file mode 100644 index 604c87568..000000000 --- a/docs/source/comprehensive-start.md +++ /dev/null @@ -1,111 +0,0 @@ - -# Getting Started with Llama Stack - -This guide will walk you through the steps to set up an end-to-end workflow with Llama Stack. It focuses on building a Llama Stack distribution and starting up a Llama Stack server. See our [documentation](../README.md) for more on Llama Stack's capabilities, or visit [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main) for example apps. - -## Installation - -The `llama` CLI tool helps you manage the Llama toolchain & agentic systems. After installing the `llama-stack` package, the `llama` command should be available in your path. - -You can install this repository in two ways: - -1. **Install as a package**: - Install directly from [PyPI](https://pypi.org/project/llama-stack/) with: - ```bash - pip install llama-stack - ``` - -2. **Install from source**: - Follow these steps to install from the source code: - ```bash - mkdir -p ~/local - cd ~/local - git clone git@github.com:meta-llama/llama-stack.git - - conda create -n stack python=3.10 - conda activate stack - - cd llama-stack - $CONDA_PREFIX/bin/pip install -e . - ``` - -Refer to the [CLI Reference](./cli_reference.md) for details on Llama CLI commands. - -## Starting Up Llama Stack Server - -There are two ways to start the Llama Stack server: - -1. **Using Docker**: - We provide a pre-built Docker image of Llama Stack, available in the [distributions](../distributions/) folder. - - > **Note:** For GPU inference, set environment variables to specify the local directory with your model checkpoints and enable GPU inference. - ```bash - export LLAMA_CHECKPOINT_DIR=~/.llama - ``` - Download Llama models with: - ``` - llama download --model-id Llama3.1-8B-Instruct - ``` - Start a Docker container with: - ```bash - cd llama-stack/distributions/meta-reference-gpu - docker run -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./run.yaml:/root/my-run.yaml --gpus=all distribution-meta-reference-gpu --yaml_config /root/my-run.yaml - ``` - - **Tip:** For remote providers, use `docker compose up` with scripts in the [distributions folder](../distributions/). - -2. **Build->Configure->Run via Conda**: - For development, build a LlamaStack distribution from scratch. - - **`llama stack build`** - Enter build information interactively: - ```bash - llama stack build - ``` - - **`llama stack configure`** - Run `llama stack configure ` using the name from the build step. 
- ```bash - llama stack configure my-local-stack - ``` - - **`llama stack run`** - Start the server with: - ```bash - llama stack run my-local-stack - ``` - -## Testing with Client - -After setup, test the server with a client: -```bash -cd /path/to/llama-stack -conda activate - -python -m llama_stack.apis.inference.client localhost 5000 -``` - -You can also send a POST request: -```bash -curl http://localhost:5000/inference/chat_completion \ --H "Content-Type: application/json" \ --d '{ - "model": "Llama3.1-8B-Instruct", - "messages": [ - {"role": "system", "content": "You are a helpful assistant."}, - {"role": "user", "content": "Write me a 2-sentence poem about the moon"} - ], - "sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512} -}' -``` - -For testing safety, run: -```bash -python -m llama_stack.apis.safety.client localhost 5000 -``` - -Check our client SDKs for various languages: [Python](https://github.com/meta-llama/llama-stack-client-python), [Node](https://github.com/meta-llama/llama-stack-client-node), [Swift](https://github.com/meta-llama/llama-stack-client-swift), and [Kotlin](https://github.com/meta-llama/llama-stack-client-kotlin). - -## Advanced Guides - -For more on custom Llama Stack distributions, refer to our [Building a Llama Stack Distribution](./building_distro.md) guide. diff --git a/docs/zero_to_hero_guide/00_Inference101.ipynb b/docs/zero_to_hero_guide/00_Inference101.ipynb new file mode 100644 index 000000000..c5efa600d --- /dev/null +++ b/docs/zero_to_hero_guide/00_Inference101.ipynb @@ -0,0 +1,247 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "c1e7571c", + "metadata": {}, + "source": [ + "# Llama Stack Inference Guide\n", + "\n", + "This document provides instructions on how to use Llama Stack's `chat_completion` function for generating text using the `Llama3.2-11B-Vision-Instruct` model. Before you begin, please ensure Llama Stack is installed and set up by following the [Getting Started Guide](https://llama-stack.readthedocs.io/en/latest/).\n", + "\n", + "### Table of Contents\n", + "1. [Quickstart](#quickstart)\n", + "2. [Building Effective Prompts](#building-effective-prompts)\n", + "3. [Conversation Loop](#conversation-loop)\n", + "4. [Conversation History](#conversation-history)\n", + "5. [Streaming Responses](#streaming-responses)\n" + ] + }, + { + "cell_type": "markdown", + "id": "414301dc", + "metadata": {}, + "source": [ + "## Quickstart\n", + "\n", + "This section walks through each step to set up and make a simple text generation request.\n", + "\n", + "### 1. Set Up the Client\n", + "\n", + "Begin by importing the necessary components from Llama Stack’s client library:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7a573752", + "metadata": {}, + "outputs": [], + "source": [ + "from llama_stack_client import LlamaStackClient\n", + "from llama_stack_client.types import SystemMessage, UserMessage\n", + "\n", + "client = LlamaStackClient(base_url='http://localhost:5000')" + ] + }, + { + "cell_type": "markdown", + "id": "86366383", + "metadata": {}, + "source": [ + "### 2. Create a Chat Completion Request\n", + "\n", + "Use the `chat_completion` function to define the conversation context. 
Each message you include should have a specific role and content:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "77c29dba", + "metadata": {}, + "outputs": [], + "source": [ + "response = client.inference.chat_completion(\n", + " messages=[\n", + " SystemMessage(content='You are a friendly assistant.', role='system'),\n", + " UserMessage(content='Write a two-sentence poem about llama.', role='user')\n", + " ],\n", + " model='Llama3.2-11B-Vision-Instruct',\n", + ")\n", + "\n", + "print(response.completion_message.content)" + ] + }, + { + "cell_type": "markdown", + "id": "e5f16949", + "metadata": {}, + "source": [ + "## Building Effective Prompts\n", + "\n", + "Effective prompt creation (often called 'prompt engineering') is essential for quality responses. Here are best practices for structuring your prompts to get the most out of the Llama Stack model:\n", + "\n", + "1. **System Messages**: Use `SystemMessage` to set the model's behavior. This is similar to providing top-level instructions for tone, format, or specific behavior.\n", + " - **Example**: `SystemMessage(content='You are a friendly assistant that explains complex topics simply.')`\n", + "2. **User Messages**: Define the task or question you want to ask the model with a `UserMessage`. The clearer and more direct you are, the better the response.\n", + " - **Example**: `UserMessage(content='Explain recursion in programming in simple terms.')`\n", + "\n", + "### Sample Prompt" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5c6812da", + "metadata": {}, + "outputs": [], + "source": [ + "response = client.inference.chat_completion(\n", + " messages=[\n", + " SystemMessage(content='You are shakespeare.', role='system'),\n", + " UserMessage(content='Write a two-sentence poem about llama.', role='user')\n", + " ],\n", + " model='Llama3.2-11B-Vision-Instruct',\n", + ")\n", + "\n", + "print(response.completion_message.content)" + ] + }, + { + "cell_type": "markdown", + "id": "c8690ef0", + "metadata": {}, + "source": [ + "## Conversation Loop\n", + "\n", + "To create a continuous conversation loop, where users can input multiple messages in a session, use the following structure. This example runs an asynchronous loop, ending when the user types 'exit,' 'quit,' or 'bye.'" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "02211625", + "metadata": {}, + "outputs": [], + "source": [ + "import asyncio\n", + "from llama_stack_client import LlamaStackClient\n", + "from llama_stack_client.types import UserMessage\n", + "from termcolor import cprint\n", + "\n", + "client = LlamaStackClient(base_url='http://localhost:5000')\n", + "\n", + "async def chat_loop():\n", + " while True:\n", + " user_input = input('User> ')\n", + " if user_input.lower() in ['exit', 'quit', 'bye']:\n", + " cprint('Ending conversation. Goodbye!', 'yellow')\n", + " break\n", + "\n", + " message = UserMessage(content=user_input, role='user')\n", + " response = client.inference.chat_completion(\n", + " messages=[message],\n", + " model='Llama3.2-11B-Vision-Instruct',\n", + " )\n", + " cprint(f'> Response: {response.completion_message.content}', 'cyan')\n", + "\n", + "asyncio.run(chat_loop())" + ] + }, + { + "cell_type": "markdown", + "id": "8cf0d555", + "metadata": {}, + "source": [ + "## Conversation History\n", + "\n", + "Maintaining a conversation history allows the model to retain context from previous interactions. Use a list to accumulate messages, enabling continuity throughout the chat session." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9496f75c", + "metadata": {}, + "outputs": [], + "source": [ + "async def chat_loop():\n", + " conversation_history = []\n", + " while True:\n", + " user_input = input('User> ')\n", + " if user_input.lower() in ['exit', 'quit', 'bye']:\n", + " cprint('Ending conversation. Goodbye!', 'yellow')\n", + " break\n", + "\n", + " user_message = UserMessage(content=user_input, role='user')\n", + " conversation_history.append(user_message)\n", + "\n", + " response = client.inference.chat_completion(\n", + " messages=conversation_history,\n", + " model='Llama3.2-11B-Vision-Instruct',\n", + " )\n", + " cprint(f'> Response: {response.completion_message.content}', 'cyan')\n", + "\n", + " assistant_message = response.completion_message # the model's reply, appended so later turns keep the full context\n", + " conversation_history.append(assistant_message)\n", + "\n", + "asyncio.run(chat_loop())" + ] + }, + { + "cell_type": "markdown", + "id": "03fcf5e0", + "metadata": {}, + "source": [ + "## Streaming Responses\n", + "\n", + "Llama Stack offers a `stream` parameter in the `chat_completion` function, which allows partial responses to be returned progressively as they are generated. This can enhance user experience by providing immediate feedback without waiting for the entire response to be processed.\n", + "\n", + "### Example: Streaming Responses" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d119026e", + "metadata": {}, + "outputs": [], + "source": [ + "import asyncio\n", + "from llama_stack_client import LlamaStackClient\n", + "from llama_stack_client.lib.inference.event_logger import EventLogger\n", + "from llama_stack_client.types import UserMessage\n", + "from termcolor import cprint\n", + "\n", + "async def run_main(stream: bool = True):\n", + " client = LlamaStackClient(base_url='http://localhost:5000')\n", + "\n", + " message = UserMessage(\n", + " content='hello world, write me a 2 sentence poem about the moon', role='user'\n", + " )\n", + " cprint(f'User> {message.content}', 'green')\n", + "\n", + " response = client.inference.chat_completion(\n", + " messages=[message],\n", + " model='Llama3.2-11B-Vision-Instruct',\n", + " stream=stream,\n", + " )\n", + "\n", + " if not stream:\n", + " cprint(f'> Response: {response}', 'cyan')\n", + " else:\n", + " async for log in EventLogger().log(response):\n", + " log.print()\n", + "\n", + " models_response = client.models.list()\n", + " print(models_response)\n", + "\n", + "if __name__ == '__main__':\n", + " asyncio.run(run_main())" + ] + } + ], + "metadata": {}, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/zero_to_hero_guide/00_Local_Cloud_Inference101.ipynb b/docs/zero_to_hero_guide/00_Local_Cloud_Inference101.ipynb new file mode 100644 index 000000000..8b80c2731 --- /dev/null +++ b/docs/zero_to_hero_guide/00_Local_Cloud_Inference101.ipynb @@ -0,0 +1,202 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "a0ed972d", + "metadata": {}, + "source": [ + "# Switching between Local and Cloud Models with Llama Stack\n", + "\n", + "This guide provides a streamlined setup to switch between local and cloud clients for text generation with Llama Stack’s `chat_completion` API. 
This setup enables automatic fallback to a cloud instance if the local client is unavailable.\n", + "\n", + "### Prerequisites\n", + "Before you begin, please ensure Llama Stack is installed and the distribution is set up by following the [Getting Started Guide](https://llama-stack.readthedocs.io/en/latest/). You will need to run two distributions, one local and one cloud, for this demo to work.\n", + "\n", + "### Implementation" + ] + }, + { + "cell_type": "markdown", + "id": "df89cff7", + "metadata": {}, + "source": [ + "#### 1. Set Up Local and Cloud Clients\n", + "\n", + "Initialize both clients, specifying the `base_url` for each instance. In this case, we have the local distribution running on `http://localhost:5000` and the cloud distribution running on `http://localhost:5001`.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7f868dfe", + "metadata": {}, + "outputs": [], + "source": [ + "from llama_stack_client import LlamaStackClient\n", + "\n", + "# Configure local and cloud clients\n", + "local_client = LlamaStackClient(base_url='http://localhost:5000')\n", + "cloud_client = LlamaStackClient(base_url='http://localhost:5001')" + ] + }, + { + "cell_type": "markdown", + "id": "894689c1", + "metadata": {}, + "source": [ + "#### 2. Client Selection with Fallback\n", + "\n", + "The `select_client` function checks if the local client is available using a lightweight `/health` check. If the local client is unavailable, it automatically switches to the cloud client.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ff0c8277", + "metadata": {}, + "outputs": [], + "source": [ + "import httpx\n", + "from termcolor import cprint\n", + "\n", + "async def select_client() -> LlamaStackClient:\n", + " \"\"\"Use local client if available; otherwise, switch to cloud client.\"\"\"\n", + " try:\n", + " async with httpx.AsyncClient() as http_client:\n", + " response = await http_client.get(f'{local_client.base_url}/health')\n", + " if response.status_code == 200:\n", + " cprint('Using local client.', 'yellow')\n", + " return local_client\n", + " except httpx.RequestError:\n", + " pass\n", + " cprint('Local client unavailable. Switching to cloud client.', 'yellow')\n", + " return cloud_client" + ] + }, + { + "cell_type": "markdown", + "id": "9ccfe66f", + "metadata": {}, + "source": [ + "#### 3. Generate a Response\n", + "\n", + "After selecting the client, you can generate text using `chat_completion`. This example sends a sample prompt to the model and prints the response.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5e19cc20", + "metadata": {}, + "outputs": [], + "source": [ + "from llama_stack_client.lib.inference.event_logger import EventLogger\n", + "from llama_stack_client.types import UserMessage\n", + "\n", + "async def get_llama_response(stream: bool = True):\n", + " client = await select_client() # Selects the available client\n", + " message = UserMessage(content='hello world, write me a 2 sentence poem about the moon', role='user')\n", + " cprint(f'User> {message.content}', 'green')\n", + "\n", + " response = client.inference.chat_completion(\n", + " messages=[message],\n", + " model='Llama3.2-11B-Vision-Instruct',\n", + " stream=stream,\n", + " )\n", + "\n", + " if not stream:\n", + " cprint(f'> Response: {response}', 'cyan')\n", + " else:\n", + " # Stream tokens progressively\n", + " async for log in EventLogger().log(response):\n", + " log.print()" + ] + }, + { + "cell_type": "markdown", + "id": "6edf5e57", + "metadata": {}, + "source": [ + "#### 4. 
Run the Asynchronous Response Generation\n", + "\n", + "Use `asyncio.run()` to execute `get_llama_response` in an asynchronous event loop.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c10f487e", + "metadata": {}, + "outputs": [], + "source": [ + "import asyncio\n", + "\n", + "# Initiate the response generation process\n", + "asyncio.run(get_llama_response())" + ] + }, + { + "cell_type": "markdown", + "id": "56aa9a09", + "metadata": {}, + "source": [ + "### Complete code\n", + "Summing it up, here's the complete code for local-cloud model implementation with Llama Stack:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d9fd74ff", + "metadata": {}, + "outputs": [], + "source": [ + "import asyncio\n", + "import httpx\n", + "from llama_stack_client import LlamaStackClient\n", + "from llama_stack_client.lib.inference.event_logger import EventLogger\n", + "from llama_stack_client.types import UserMessage\n", + "from termcolor import cprint\n", + "\n", + "local_client = LlamaStackClient(base_url='http://localhost:5000')\n", + "cloud_client = LlamaStackClient(base_url='http://localhost:5001')\n", + "\n", + "async def select_client() -> LlamaStackClient:\n", + " try:\n", + " async with httpx.AsyncClient() as http_client:\n", + " response = await http_client.get(f'{local_client.base_url}/health')\n", + " if response.status_code == 200:\n", + " cprint('Using local client.', 'yellow')\n", + " return local_client\n", + " except httpx.RequestError:\n", + " pass\n", + " cprint('Local client unavailable. Switching to cloud client.', 'yellow')\n", + " return cloud_client\n", + "\n", + "async def get_llama_response(stream: bool = True):\n", + " client = await select_client()\n", + " message = UserMessage(\n", + " content='hello world, write me a 2 sentence poem about the moon', role='user'\n", + " )\n", + " cprint(f'User> {message.content}', 'green')\n", + "\n", + " response = client.inference.chat_completion(\n", + " messages=[message],\n", + " model='Llama3.2-11B-Vision-Instruct',\n", + " stream=stream,\n", + " )\n", + "\n", + " if not stream:\n", + " cprint(f'> Response: {response}', 'cyan')\n", + " else:\n", + " async for log in EventLogger().log(response):\n", + " log.print()\n", + "\n", + "asyncio.run(get_llama_response())" + ] + } + ], + "metadata": {}, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/Prompt_Engineering_with_Llama_3.ipynb b/docs/zero_to_hero_guide/01_Prompt_Engineering101.ipynb similarity index 100% rename from docs/Prompt_Engineering_with_Llama_3.ipynb rename to docs/zero_to_hero_guide/01_Prompt_Engineering101.ipynb diff --git a/docs/zero_to_hero_guide/01_Image_Chat101.ipynb b/docs/zero_to_hero_guide/02_Image_Chat101.ipynb similarity index 100% rename from docs/zero_to_hero_guide/01_Image_Chat101.ipynb rename to docs/zero_to_hero_guide/02_Image_Chat101.ipynb diff --git a/docs/zero_to_hero_guide/02_Tool_Calling101.ipynb b/docs/zero_to_hero_guide/03_Tool_Calling101.ipynb similarity index 100% rename from docs/zero_to_hero_guide/02_Tool_Calling101.ipynb rename to docs/zero_to_hero_guide/03_Tool_Calling101.ipynb diff --git a/docs/zero_to_hero_guide/03_Memory101.ipynb b/docs/zero_to_hero_guide/04_Memory101.ipynb similarity index 100% rename from docs/zero_to_hero_guide/03_Memory101.ipynb rename to docs/zero_to_hero_guide/04_Memory101.ipynb diff --git a/docs/zero_to_hero_guide/04_Safety101.ipynb b/docs/zero_to_hero_guide/05_Safety101.ipynb similarity index 100% rename from docs/zero_to_hero_guide/04_Safety101.ipynb rename 
to docs/zero_to_hero_guide/05_Safety101.ipynb diff --git a/docs/zero_to_hero_guide/05_Agents101.ipynb b/docs/zero_to_hero_guide/06_Agents101.ipynb similarity index 100% rename from docs/zero_to_hero_guide/05_Agents101.ipynb rename to docs/zero_to_hero_guide/06_Agents101.ipynb diff --git a/docs/source/chat_completion_guide.md b/docs/zero_to_hero_guide/chat_completion_guide.md similarity index 98% rename from docs/source/chat_completion_guide.md rename to docs/zero_to_hero_guide/chat_completion_guide.md index 9ec6edfab..3fcdbfc1d 100644 --- a/docs/source/chat_completion_guide.md +++ b/docs/zero_to_hero_guide/chat_completion_guide.md @@ -1,7 +1,7 @@ -# Llama Stack Text Generation Guide +# Llama Stack Inference Guide -This document provides instructions on how to use Llama Stack's `chat_completion` function for generating text using the `Llama3.2-11B-Vision-Instruct` model. Before you begin, please ensure Llama Stack is installed and set up by following the [Getting Started Guide](https://llama-stack.readthedocs.io/en/latest/). +This document provides instructions on how to use Llama Stack's `chat_completion` function for generating text using the `Llama3.2-11B-Vision-Instruct` model. Before you begin, please ensure Llama Stack is installed and set up by following the [Getting Started Guide](https://llama-stack.readthedocs.io/en/latest/). ### Table of Contents 1. [Quickstart](#quickstart) diff --git a/docs/source/chat_few_shot_guide.md b/docs/zero_to_hero_guide/chat_few_shot_guide.md similarity index 100% rename from docs/source/chat_few_shot_guide.md rename to docs/zero_to_hero_guide/chat_few_shot_guide.md diff --git a/docs/source/chat_local_cloud_guide.md b/docs/zero_to_hero_guide/chat_local_cloud_guide.md similarity index 100% rename from docs/source/chat_local_cloud_guide.md rename to docs/zero_to_hero_guide/chat_local_cloud_guide.md diff --git a/docs/source/quickstart.md b/docs/zero_to_hero_guide/quickstart.md similarity index 100% rename from docs/source/quickstart.md rename to docs/zero_to_hero_guide/quickstart.md