doc enhancements, converted md into jupyter, reorganize files

This commit is contained in:
Justin Lee 2024-11-05 13:12:30 -08:00
parent 0f08f77565
commit ecad16b904
13 changed files with 450 additions and 113 deletions

@@ -1,111 +0,0 @@
# Getting Started with Llama Stack
This guide will walk you through the steps to set up an end-to-end workflow with Llama Stack. It focuses on building a Llama Stack distribution and starting up a Llama Stack server. See our [documentation](../README.md) for more on Llama Stack's capabilities, or visit [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main) for example apps.
## Installation
The `llama` CLI tool helps you manage the Llama toolchain & agentic systems. After installing the `llama-stack` package, the `llama` command should be available in your path.
You can install this repository in two ways:
1. **Install as a package**:
Install directly from [PyPI](https://pypi.org/project/llama-stack/) with:
```bash
pip install llama-stack
```
2. **Install from source**:
Follow these steps to install from the source code:
```bash
mkdir -p ~/local
cd ~/local
git clone git@github.com:meta-llama/llama-stack.git
conda create -n stack python=3.10
conda activate stack
cd llama-stack
$CONDA_PREFIX/bin/pip install -e .
```
Refer to the [CLI Reference](./cli_reference.md) for details on Llama CLI commands.
## Starting Up Llama Stack Server
There are two ways to start the Llama Stack server:
1. **Using Docker**:
We provide a pre-built Docker image of Llama Stack, available in the [distributions](../distributions/) folder.
> **Note:** For GPU inference, set an environment variable pointing to the local directory that holds your model checkpoints.
```bash
export LLAMA_CHECKPOINT_DIR=~/.llama
```
Download Llama models with:
```
llama download --model-id Llama3.1-8B-Instruct
```
Start a Docker container with:
```bash
cd llama-stack/distributions/meta-reference-gpu
docker run -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./run.yaml:/root/my-run.yaml --gpus=all distribution-meta-reference-gpu --yaml_config /root/my-run.yaml
```
**Tip:** For remote providers, use `docker compose up` with scripts in the [distributions folder](../distributions/).
2. **Build->Configure->Run via Conda**:
For development, build a Llama Stack distribution from scratch.
**`llama stack build`**
Enter build information interactively:
```bash
llama stack build
```
**`llama stack configure`**
Run `llama stack configure <name>` using the name from the build step.
```bash
llama stack configure my-local-stack
```
**`llama stack run`**
Start the server with:
```bash
llama stack run my-local-stack
```
## Testing with Client
After setup, test the server with a client:
```bash
cd /path/to/llama-stack
conda activate <env>
python -m llama_stack.apis.inference.client localhost 5000
```
You can also send a POST request:
```bash
curl http://localhost:5000/inference/chat_completion \
-H "Content-Type: application/json" \
-d '{
"model": "Llama3.1-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write me a 2-sentence poem about the moon"}
],
"sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512}
}'
```
For testing safety, run:
```bash
python -m llama_stack.apis.safety.client localhost 5000
```
Check our client SDKs for various languages: [Python](https://github.com/meta-llama/llama-stack-client-python), [Node](https://github.com/meta-llama/llama-stack-client-node), [Swift](https://github.com/meta-llama/llama-stack-client-swift), and [Kotlin](https://github.com/meta-llama/llama-stack-client-kotlin).
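As a minimal sketch of the Python SDK (assuming the `llama-stack-client` package is installed and a distribution is running on `localhost:5000`), a chat completion request looks like this:
```python
from llama_stack_client import LlamaStackClient
from llama_stack_client.types import UserMessage

# Connect to the running Llama Stack distribution
client = LlamaStackClient(base_url='http://localhost:5000')

response = client.inference.chat_completion(
    messages=[UserMessage(content='Write me a 2-sentence poem about the moon', role='user')],
    model='Llama3.1-8B-Instruct',
)
print(response.completion_message.content)
```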
## Advanced Guides
For more on custom Llama Stack distributions, refer to our [Building a Llama Stack Distribution](./building_distro.md) guide.

@@ -0,0 +1,247 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "c1e7571c",
"metadata": {},
"source": [
"# Llama Stack Inference Guide\n",
"\n",
"This document provides instructions on how to use Llama Stack's `chat_completion` function for generating text using the `Llama3.2-11B-Vision-Instruct` model. Before you begin, please ensure Llama Stack is installed and set up by following the [Getting Started Guide](https://llama-stack.readthedocs.io/en/latest/).\n",
"\n",
"### Table of Contents\n",
"1. [Quickstart](#quickstart)\n",
"2. [Building Effective Prompts](#building-effective-prompts)\n",
"3. [Conversation Loop](#conversation-loop)\n",
"4. [Conversation History](#conversation-history)\n",
"5. [Streaming Responses](#streaming-responses)\n"
]
},
{
"cell_type": "markdown",
"id": "414301dc",
"metadata": {},
"source": [
"## Quickstart\n",
"\n",
"This section walks through each step to set up and make a simple text generation request.\n",
"\n",
"### 1. Set Up the Client\n",
"\n",
"Begin by importing the necessary components from Llama Stacks client library:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7a573752",
"metadata": {},
"outputs": [],
"source": [
"from llama_stack_client import LlamaStackClient\n",
"from llama_stack_client.types import SystemMessage, UserMessage\n",
"\n",
"client = LlamaStackClient(base_url='http://localhost:5000')"
]
},
{
"cell_type": "markdown",
"id": "86366383",
"metadata": {},
"source": [
"### 2. Create a Chat Completion Request\n",
"\n",
"Use the `chat_completion` function to define the conversation context. Each message you include should have a specific role and content:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "77c29dba",
"metadata": {},
"outputs": [],
"source": [
"response = client.inference.chat_completion(\n",
" messages=[\n",
" SystemMessage(content='You are a friendly assistant.', role='system'),\n",
" UserMessage(content='Write a two-sentence poem about llama.', role='user')\n",
" ],\n",
" model='Llama3.2-11B-Vision-Instruct',\n",
")\n",
"\n",
"print(response.completion_message.content)"
]
},
{
"cell_type": "markdown",
"id": "e5f16949",
"metadata": {},
"source": [
"## Building Effective Prompts\n",
"\n",
"Effective prompt creation (often called 'prompt engineering') is essential for quality responses. Here are best practices for structuring your prompts to get the most out of the Llama Stack model:\n",
"\n",
"1. **System Messages**: Use `SystemMessage` to set the model's behavior. This is similar to providing top-level instructions for tone, format, or specific behavior.\n",
" - **Example**: `SystemMessage(content='You are a friendly assistant that explains complex topics simply.')`\n",
"2. **User Messages**: Define the task or question you want to ask the model with a `UserMessage`. The clearer and more direct you are, the better the response.\n",
" - **Example**: `UserMessage(content='Explain recursion in programming in simple terms.')`\n",
"\n",
"### Sample Prompt"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5c6812da",
"metadata": {},
"outputs": [],
"source": [
"response = client.inference.chat_completion(\n",
" messages=[\n",
" SystemMessage(content='You are shakespeare.', role='system'),\n",
" UserMessage(content='Write a two-sentence poem about llama.', role='user')\n",
" ],\n",
" model='Llama3.2-11B-Vision-Instruct',\n",
")\n",
"\n",
"print(response.completion_message.content)"
]
},
{
"cell_type": "markdown",
"id": "c8690ef0",
"metadata": {},
"source": [
"## Conversation Loop\n",
"\n",
"To create a continuous conversation loop, where users can input multiple messages in a session, use the following structure. This example runs an asynchronous loop, ending when the user types 'exit,' 'quit,' or 'bye.'"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "02211625",
"metadata": {},
"outputs": [],
"source": [
"import asyncio\n",
"from llama_stack_client import LlamaStackClient\n",
"from llama_stack_client.types import UserMessage\n",
"from termcolor import cprint\n",
"\n",
"client = LlamaStackClient(base_url='http://localhost:5000')\n",
"\n",
"async def chat_loop():\n",
" while True:\n",
" user_input = input('User> ')\n",
" if user_input.lower() in ['exit', 'quit', 'bye']:\n",
" cprint('Ending conversation. Goodbye!', 'yellow')\n",
" break\n",
"\n",
" message = UserMessage(content=user_input, role='user')\n",
" response = client.inference.chat_completion(\n",
" messages=[message],\n",
" model='Llama3.2-11B-Vision-Instruct',\n",
" )\n",
" cprint(f'> Response: {response.completion_message.content}', 'cyan')\n",
"\n",
"asyncio.run(chat_loop())"
]
},
{
"cell_type": "markdown",
"id": "8cf0d555",
"metadata": {},
"source": [
"## Conversation History\n",
"\n",
"Maintaining a conversation history allows the model to retain context from previous interactions. Use a list to accumulate messages, enabling continuity throughout the chat session."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9496f75c",
"metadata": {},
"outputs": [],
"source": [
"async def chat_loop():\n",
" conversation_history = []\n",
" while True:\n",
" user_input = input('User> ')\n",
" if user_input.lower() in ['exit', 'quit', 'bye']:\n",
" cprint('Ending conversation. Goodbye!', 'yellow')\n",
" break\n",
"\n",
" user_message = UserMessage(content=user_input, role='user')\n",
" conversation_history.append(user_message)\n",
"\n",
" response = client.inference.chat_completion(\n",
" messages=conversation_history,\n",
" model='Llama3.2-11B-Vision-Instruct',\n",
" )\n",
" cprint(f'> Response: {response.completion_message.content}', 'cyan')\n",
"\n",
" assistant_message = UserMessage(content=response.completion_message.content, role='user')\n",
" conversation_history.append(assistant_message)\n",
"\n",
"asyncio.run(chat_loop())"
]
},
{
"cell_type": "markdown",
"id": "03fcf5e0",
"metadata": {},
"source": [
"## Streaming Responses\n",
"\n",
"Llama Stack offers a `stream` parameter in the `chat_completion` function, which allows partial responses to be returned progressively as they are generated. This can enhance user experience by providing immediate feedback without waiting for the entire response to be processed.\n",
"\n",
"### Example: Streaming Responses"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d119026e",
"metadata": {},
"outputs": [],
"source": [
"import asyncio\n",
"from llama_stack_client import LlamaStackClient\n",
"from llama_stack_client.lib.inference.event_logger import EventLogger\n",
"from llama_stack_client.types import UserMessage\n",
"from termcolor import cprint\n",
"\n",
"async def run_main(stream: bool = True):\n",
" client = LlamaStackClient(base_url='http://localhost:5000')\n",
"\n",
" message = UserMessage(\n",
" content='hello world, write me a 2 sentence poem about the moon', role='user'\n",
" )\n",
" print(f'User>{message.content}', 'green')\n",
"\n",
" response = client.inference.chat_completion(\n",
" messages=[message],\n",
" model='Llama3.2-11B-Vision-Instruct',\n",
" stream=stream,\n",
" )\n",
"\n",
" if not stream:\n",
" cprint(f'> Response: {response}', 'cyan')\n",
" else:\n",
" async for log in EventLogger().log(response):\n",
" log.print()\n",
"\n",
" models_response = client.models.list()\n",
" print(models_response)\n",
"\n",
"if __name__ == '__main__':\n",
" asyncio.run(run_main())"
]
}
],
"metadata": {},
"nbformat": 4,
"nbformat_minor": 5
}

@@ -0,0 +1,201 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "a0ed972d",
"metadata": {},
"source": [
"# Switching between Local and Cloud Model with Llama Stack\n",
"\n",
"This guide provides a streamlined setup to switch between local and cloud clients for text generation with Llama Stacks `chat_completion` API. This setup enables automatic fallback to a cloud instance if the local client is unavailable.\n",
"\n",
"### Pre-requisite\n",
"Before you begin, please ensure Llama Stack is installed and the distribution is set up by following the [Getting Started Guide](https://llama-stack.readthedocs.io/en/latest/). You will need to run two distributions, a local and a cloud distribution, for this demo to work.\n",
"\n",
"### Implementation"
]
},
{
"cell_type": "markdown",
"id": "df89cff7",
"metadata": {},
"source": [
"#### 1. Set Up Local and Cloud Clients\n",
"\n",
"Initialize both clients, specifying the `base_url` for each instance. In this case, we have the local distribution running on `http://localhost:5000` and the cloud distribution running on `http://localhost:5001`.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f868dfe",
"metadata": {},
"outputs": [],
"source": [
"from llama_stack_client import LlamaStackClient\n",
"\n",
"# Configure local and cloud clients\n",
"local_client = LlamaStackClient(base_url='http://localhost:5000')\n",
"cloud_client = LlamaStackClient(base_url='http://localhost:5001')"
]
},
{
"cell_type": "markdown",
"id": "894689c1",
"metadata": {},
"source": [
"#### 2. Client Selection with Fallback\n",
"\n",
"The `select_client` function checks if the local client is available using a lightweight `/health` check. If the local client is unavailable, it automatically switches to the cloud client.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ff0c8277",
"metadata": {},
"outputs": [],
"source": [
"import httpx\n",
"from termcolor import cprint\n",
"\n",
"async def select_client() -> LlamaStackClient:\n",
" \"\"\"Use local client if available; otherwise, switch to cloud client.\"\"\"\n",
" try:\n",
" async with httpx.AsyncClient() as http_client:\n",
" response = await http_client.get(f'{local_client.base_url}/health')\n",
" if response.status_code == 200:\n",
" cprint('Using local client.', 'yellow')\n",
" return local_client\n",
" except httpx.RequestError:\n",
" pass\n",
" cprint('Local client unavailable. Switching to cloud client.', 'yellow')\n",
" return cloud_client"
]
},
{
"cell_type": "markdown",
"id": "9ccfe66f",
"metadata": {},
"source": [
"#### 3. Generate a Response\n",
"\n",
"After selecting the client, you can generate text using `chat_completion`. This example sends a sample prompt to the model and prints the response.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5e19cc20",
"metadata": {},
"outputs": [],
"source": [
"from llama_stack_client.types import UserMessage\n",
"\n",
"async def get_llama_response(stream: bool = True):\n",
" client = await select_client() # Selects the available client\n",
" message = UserMessage(content='hello world, write me a 2 sentence poem about the moon', role='user')\n",
" cprint(f'User> {message.content}', 'green')\n",
"\n",
" response = client.inference.chat_completion(\n",
" messages=[message],\n",
" model='Llama3.2-11B-Vision-Instruct',\n",
" stream=stream,\n",
" )\n",
"\n",
" if not stream:\n",
" cprint(f'> Response: {response}', 'cyan')\n",
" else:\n",
" # Stream tokens progressively\n",
" async for log in EventLogger().log(response):\n",
" log.print()"
]
},
{
"cell_type": "markdown",
"id": "6edf5e57",
"metadata": {},
"source": [
"#### 4. Run the Asynchronous Response Generation\n",
"\n",
"Use `asyncio.run()` to execute `get_llama_response` in an asynchronous event loop.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c10f487e",
"metadata": {},
"outputs": [],
"source": [
"import asyncio\n",
"\n",
"# Initiate the response generation process\n",
"asyncio.run(get_llama_response())"
]
},
{
"cell_type": "markdown",
"id": "56aa9a09",
"metadata": {},
"source": [
"### Complete code\n",
"Summing it up, here's the complete code for local-cloud model implementation with Llama Stack:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d9fd74ff",
"metadata": {},
"outputs": [],
"source": [
"import asyncio\n",
"import httpx\n",
"from llama_stack_client import LlamaStackClient\n",
"from llama_stack_client.lib.inference.event_logger import EventLogger\n",
"from llama_stack_client.types import UserMessage\n",
"from termcolor import cprint\n",
"\n",
"local_client = LlamaStackClient(base_url='http://localhost:5000')\n",
"cloud_client = LlamaStackClient(base_url='http://localhost:5001')\n",
"\n",
"async def select_client() -> LlamaStackClient:\n",
" try:\n",
" async with httpx.AsyncClient() as http_client:\n",
" response = await http_client.get(f'{local_client.base_url}/health')\n",
" if response.status_code == 200:\n",
" cprint('Using local client.', 'yellow')\n",
" return local_client\n",
" except httpx.RequestError:\n",
" pass\n",
" cprint('Local client unavailable. Switching to cloud client.', 'yellow')\n",
" return cloud_client\n",
"\n",
"async def get_llama_response(stream: bool = True):\n",
" client = await select_client()\n",
" message = UserMessage(\n",
" content='hello world, write me a 2 sentence poem about the moon', role='user'\n",
" )\n",
" cprint(f'User> {message.content}', 'green')\n",
"\n",
" response = client.inference.chat_completion(\n",
" messages=[message],\n",
" model='Llama3.2-11B-Vision-Instruct',\n",
" stream=stream,\n",
" )\n",
"\n",
" if not stream:\n",
" cprint(f'> Response: {response}', 'cyan')\n",
" else:\n",
" async for log in EventLogger().log(response):\n",
" log.print()\n",
"\n",
"asyncio.run(get_llama_response())"
]
}
],
"metadata": {},
"nbformat": 4,
"nbformat_minor": 5
}

@@ -1,7 +1,7 @@
# Llama Stack Text Generation Guide
# Llama Stack Inference Guide
This document provides instructions on how to use Llama Stack's `chat_completion` function for generating text using the `Llama3.2-11B-Vision-Instruct` model. Before you begin, please ensure Llama Stack is installed and set up by following the [Getting Started Guide](https://llama-stack.readthedocs.io/en/latest/).
### Table of Contents
1. [Quickstart](#quickstart)