doc enhancements, converted md into jupyter, reorganize files

This commit is contained in:
Justin Lee 2024-11-05 13:12:30 -08:00
parent 0f08f77565
commit ecad16b904
13 changed files with 450 additions and 113 deletions

@ -1,192 +0,0 @@
# Llama Stack Text Generation Guide
This document provides instructions on how to use Llama Stack's `chat_completion` function for generating text using the `Llama3.2-11B-Vision-Instruct` model. Before you begin, please ensure Llama Stack is installed and set up by following the [Getting Started Guide](https://llama-stack.readthedocs.io/en/latest/).
### Table of Contents
1. [Quickstart](#quickstart)
2. [Building Effective Prompts](#building-effective-prompts)
3. [Conversation Loop](#conversation-loop)
4. [Conversation History](#conversation-history)
5. [Streaming Responses](#streaming-responses)
## Quickstart
This section walks through each step to set up and make a simple text generation request.
### 1. Set Up the Client
Begin by importing the necessary components from Llama Stack's client library:
```python
from llama_stack_client import LlamaStackClient
from llama_stack_client.types import SystemMessage, UserMessage
client = LlamaStackClient(base_url="http://localhost:5000")
```
### 2. Create a Chat Completion Request
Use the `chat_completion` function to define the conversation context. Each message you include should have a specific role and content:
```python
response = client.inference.chat_completion(
    messages=[
        SystemMessage(content="You are a friendly assistant.", role="system"),
        UserMessage(content="Write a two-sentence poem about llama.", role="user"),
    ],
    model="Llama3.2-11B-Vision-Instruct",
)
print(response.completion_message.content)
```
---
## Building Effective Prompts
Effective prompt creation (often called "prompt engineering") is essential for quality responses. Here are best practices for structuring your prompts to get the most out of the Llama Stack model:
1. **System Messages**: Use `SystemMessage` to set the model's behavior. This is similar to providing top-level instructions for tone, format, or specific behavior.
- **Example**: `SystemMessage(content="You are a friendly assistant that explains complex topics simply.")`
2. **User Messages**: Define the task or question you want to ask the model with a `UserMessage`. The clearer and more direct you are, the better the response.
- **Example**: `UserMessage(content="Explain recursion in programming in simple terms.")`
### Sample Prompt
Here's a prompt that defines the model's role and a user question:
```python
from llama_stack_client import LlamaStackClient
from llama_stack_client.types import SystemMessage, UserMessage
client = LlamaStackClient(base_url="http://localhost:5000")
response = client.inference.chat_completion(
    messages=[
        SystemMessage(content="You are Shakespeare.", role="system"),
        UserMessage(content="Write a two-sentence poem about llama.", role="user"),
    ],
    model="Llama3.2-11B-Vision-Instruct",
)
print(response.completion_message.content)
```
---
## Conversation Loop
To create a continuous conversation loop, where users can input multiple messages in a session, use the following structure. This example runs an asynchronous loop, ending when the user types "exit," "quit," or "bye."
```python
import asyncio

from llama_stack_client import LlamaStackClient
from llama_stack_client.types import UserMessage
from termcolor import cprint

client = LlamaStackClient(base_url="http://localhost:5000")


async def chat_loop():
    while True:
        user_input = input("User> ")
        if user_input.lower() in ["exit", "quit", "bye"]:
            cprint("Ending conversation. Goodbye!", "yellow")
            break

        message = UserMessage(content=user_input, role="user")
        response = client.inference.chat_completion(
            messages=[message],
            model="Llama3.2-11B-Vision-Instruct",
        )
        cprint(f"> Response: {response.completion_message.content}", "cyan")


asyncio.run(chat_loop())
```
---
## Conversation History
Maintaining a conversation history allows the model to retain context from previous interactions. Use a list to accumulate messages, enabling continuity throughout the chat session.
```python
import asyncio

from llama_stack_client import LlamaStackClient
from llama_stack_client.types import CompletionMessage, UserMessage
from termcolor import cprint

client = LlamaStackClient(base_url="http://localhost:5000")


async def chat_loop():
    conversation_history = []
    while True:
        user_input = input("User> ")
        if user_input.lower() in ["exit", "quit", "bye"]:
            cprint("Ending conversation. Goodbye!", "yellow")
            break

        user_message = UserMessage(content=user_input, role="user")
        conversation_history.append(user_message)

        response = client.inference.chat_completion(
            messages=conversation_history,
            model="Llama3.2-11B-Vision-Instruct",
        )
        cprint(f"> Response: {response.completion_message.content}", "cyan")

        # Append the assistant's reply so the model retains the full context
        assistant_message = CompletionMessage(
            content=response.completion_message.content,
            role="assistant",
            stop_reason=response.completion_message.stop_reason,
            tool_calls=response.completion_message.tool_calls,
        )
        conversation_history.append(assistant_message)


asyncio.run(chat_loop())
```
## Streaming Responses
Llama Stack offers a `stream` parameter in the `chat_completion` function, which allows partial responses to be returned progressively as they are generated. This can enhance user experience by providing immediate feedback without waiting for the entire response to be processed.
### Example: Streaming Responses
The following code demonstrates how to use the `stream` parameter to enable response streaming. When `stream=True`, the `chat_completion` function will yield tokens as they are generated. To display these tokens, this example leverages asynchronous streaming with `EventLogger`.
```python
import asyncio

from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.inference.event_logger import EventLogger
from llama_stack_client.types import UserMessage
from termcolor import cprint


async def run_main(stream: bool = True):
    client = LlamaStackClient(
        base_url="http://localhost:5000",
    )

    message = UserMessage(
        content="hello world, write me a 2 sentence poem about the moon", role="user"
    )
    cprint(f"User> {message.content}", "green")

    response = client.inference.chat_completion(
        messages=[message],
        model="Llama3.2-11B-Vision-Instruct",
        stream=stream,
    )
    if not stream:
        cprint(f"> Response: {response}", "cyan")
    else:
        # Print tokens as they arrive instead of waiting for the full response
        async for log in EventLogger().log(response):
            log.print()

    # List the models registered with the server
    models_response = client.models.list()
    print(models_response)


if __name__ == "__main__":
    asyncio.run(run_main())
```
---
With these fundamentals, you should be well on your way to leveraging Llama Stack's text generation capabilities! For more advanced features, refer to the [Llama Stack Documentation](https://llama-stack.readthedocs.io/en/latest/).

@ -1,144 +0,0 @@
# Few-Shot Inference for LLMs
This guide provides instructions on how to use Llama Stack's `chat_completion` API with a few-shot learning approach to enhance text generation. Few-shot examples enable the model to recognize patterns by providing labeled prompts, allowing it to complete tasks based on minimal prior examples.
### Overview
Few-shot learning provides the model with multiple examples of input-output pairs. This is particularly useful for guiding the model's behavior in specific tasks, helping it understand the desired completion format and content based on a few sample interactions.
### Implementation
1. **Initialize the Client**
Begin by setting up the `LlamaStackClient` to connect to the inference endpoint.
```python
from llama_stack_client import LlamaStackClient
client = LlamaStackClient(base_url="http://localhost:5000")
```
2. **Define Few-Shot Examples**
Construct a series of labeled `UserMessage` and `CompletionMessage` instances to demonstrate the task to the model. Each `UserMessage` represents an input prompt, and each `CompletionMessage` is the desired output. The model uses these examples to infer the appropriate response patterns.
```python
from llama_stack_client.types import CompletionMessage, UserMessage

few_shot_examples = [
    UserMessage(content="Have shorter, spear-shaped ears.", role="user"),
    CompletionMessage(
        content="That's Alpaca!",
        role="assistant",
        stop_reason="end_of_message",
        tool_calls=[],
    ),
    UserMessage(
        content="Known for their calm nature and used as pack animals in mountainous regions.",
        role="user",
    ),
    CompletionMessage(
        content="That's Llama!",
        role="assistant",
        stop_reason="end_of_message",
        tool_calls=[],
    ),
    UserMessage(
        content="Has a straight, slender neck and is smaller in size compared to its relative.",
        role="user",
    ),
    CompletionMessage(
        content="That's Alpaca!",
        role="assistant",
        stop_reason="end_of_message",
        tool_calls=[],
    ),
    UserMessage(
        content="Generally taller and more robust, commonly seen as guard animals.",
        role="user",
    ),
]
```
### Note
- **Few-Shot Examples**: These examples show the model the correct responses for specific prompts.
- **CompletionMessage**: This defines the model's expected completion for each prompt.
3. **Invoke `chat_completion` with Few-Shot Examples**
Use the few-shot examples as the message input for `chat_completion`. The model will use the examples to generate contextually appropriate responses, allowing it to infer and complete new queries in a similar format.
```python
response = client.inference.chat_completion(
    messages=few_shot_examples, model="Llama3.2-11B-Vision-Instruct"
)
```
4. **Display the Model's Response**
The `completion_message` contains the assistant's generated content based on the few-shot examples provided. Output this content to see the model's response directly in the console.
```python
from termcolor import cprint
cprint(f"> Response: {response.completion_message.content}", "cyan")
```
Few-shot learning with Llama Stack's `chat_completion` allows the model to recognize patterns with minimal training data, helping it generate contextually accurate responses based on prior examples. This approach is highly effective for guiding the model in tasks that benefit from clear input-output examples without extensive fine-tuning.
### Complete code
Summing it up, here's the code for few-shot implementation with llama-stack:
```python
from llama_stack_client import LlamaStackClient
from llama_stack_client.types import CompletionMessage, UserMessage
from termcolor import cprint

client = LlamaStackClient(base_url="http://localhost:5000")

response = client.inference.chat_completion(
    messages=[
        UserMessage(content="Have shorter, spear-shaped ears.", role="user"),
        CompletionMessage(
            content="That's Alpaca!",
            role="assistant",
            stop_reason="end_of_message",
            tool_calls=[],
        ),
        UserMessage(
            content="Known for their calm nature and used as pack animals in mountainous regions.",
            role="user",
        ),
        CompletionMessage(
            content="That's Llama!",
            role="assistant",
            stop_reason="end_of_message",
            tool_calls=[],
        ),
        UserMessage(
            content="Has a straight, slender neck and is smaller in size compared to its relative.",
            role="user",
        ),
        CompletionMessage(
            content="That's Alpaca!",
            role="assistant",
            stop_reason="end_of_message",
            tool_calls=[],
        ),
        UserMessage(
            content="Generally taller and more robust, commonly seen as guard animals.",
            role="user",
        ),
    ],
    model="Llama3.2-11B-Vision-Instruct",
)

cprint(f"> Response: {response.completion_message.content}", "cyan")
```
---
With these fundamentals, you should be well on your way to leveraging Llama Stack's text generation capabilities! For more advanced features, refer to the [Llama Stack Documentation](https://llama-stack.readthedocs.io/en/latest/).

@ -1,140 +0,0 @@
# Switching between Local and Cloud Model with Llama Stack
This guide provides a streamlined setup to switch between local and cloud clients for text generation with Llama Stack's `chat_completion` API. This setup enables automatic fallback to a cloud instance if the local client is unavailable.
### Pre-requisite
Before you begin, please ensure Llama Stack is installed and the distributions are set up by following the [Getting Started Guide](https://llama-stack.readthedocs.io/en/latest/). You will need to run two distributions, a local one and a cloud one, for this demo to work.
<!--- [TODO: show how to create two distributions] --->
### Implementation
1. **Set Up Local and Cloud Clients**
Initialize both clients, specifying the `base_url` where you initialized each instance. In this case, we have the local distribution running on `http://localhost:5000` and the cloud distribution running on `http://localhost:5001`.
```python
from llama_stack_client import LlamaStackClient
# Configure local and cloud clients
local_client = LlamaStackClient(base_url="http://localhost:5000")
cloud_client = LlamaStackClient(base_url="http://localhost:5001")
```
2. **Client Selection with Fallback**
The `select_client` function checks if the local client is available using a lightweight `/health` check. If the local client is unavailable, it automatically switches to the cloud client.
```python
import httpx
from termcolor import cprint
async def select_client() -> LlamaStackClient:
    """Use local client if available; otherwise, switch to cloud client."""
    try:
        async with httpx.AsyncClient() as http_client:
            response = await http_client.get(f"{local_client.base_url}/health")
            if response.status_code == 200:
                cprint("Using local client.", "yellow")
                return local_client
    except httpx.RequestError:
        pass
    cprint("Local client unavailable. Switching to cloud client.", "yellow")
    return cloud_client
```
3. **Generate a Response**
After selecting the client, you can generate text using `chat_completion`. This example sends a sample prompt to the model and prints the response.
```python
from llama_stack_client.lib.inference.event_logger import EventLogger
from llama_stack_client.types import UserMessage


async def get_llama_response(stream: bool = True):
    client = await select_client()  # Selects the available client
    message = UserMessage(
        content="hello world, write me a 2 sentence poem about the moon", role="user"
    )
    cprint(f"User> {message.content}", "green")

    response = client.inference.chat_completion(
        messages=[message],
        model="Llama3.2-11B-Vision-Instruct",
        stream=stream,
    )
    if not stream:
        cprint(f"> Response: {response}", "cyan")
    else:
        # Stream tokens progressively
        async for log in EventLogger().log(response):
            log.print()
```
4. **Run the Asynchronous Response Generation**
Use `asyncio.run()` to execute `get_llama_response` in an asynchronous event loop.
```python
import asyncio
# Initiate the response generation process
asyncio.run(get_llama_response())
```
### Complete code
Summing it up, here's the complete code for the local-cloud fallback implementation with llama-stack:
```python
import asyncio

import httpx
from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.inference.event_logger import EventLogger
from llama_stack_client.types import UserMessage
from termcolor import cprint

local_client = LlamaStackClient(base_url="http://localhost:5000")
cloud_client = LlamaStackClient(base_url="http://localhost:5001")


async def select_client() -> LlamaStackClient:
    try:
        async with httpx.AsyncClient() as http_client:
            response = await http_client.get(f"{local_client.base_url}/health")
            if response.status_code == 200:
                cprint("Using local client.", "yellow")
                return local_client
    except httpx.RequestError:
        pass
    cprint("Local client unavailable. Switching to cloud client.", "yellow")
    return cloud_client


async def get_llama_response(stream: bool = True):
    client = await select_client()
    message = UserMessage(
        content="hello world, write me a 2 sentence poem about the moon", role="user"
    )
    cprint(f"User> {message.content}", "green")

    response = client.inference.chat_completion(
        messages=[message],
        model="Llama3.2-11B-Vision-Instruct",
        stream=stream,
    )
    if not stream:
        cprint(f"> Response: {response}", "cyan")
    else:
        async for log in EventLogger().log(response):
            log.print()


asyncio.run(get_llama_response())
```
---
With these fundamentals, you should be well on your way to leveraging Llama Stack's text generation capabilities! For more advanced features, refer to the [Llama Stack Documentation](https://llama-stack.readthedocs.io/en/latest/).

@ -1,111 +0,0 @@
# Getting Started with Llama Stack
This guide will walk you through the steps to set up an end-to-end workflow with Llama Stack. It focuses on building a Llama Stack distribution and starting up a Llama Stack server. See our [documentation](../README.md) for more on Llama Stack's capabilities, or visit [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main) for example apps.
## Installation
The `llama` CLI tool helps you manage the Llama toolchain & agentic systems. After installing the `llama-stack` package, the `llama` command should be available in your path.
You can install this repository in two ways:
1. **Install as a package**:
Install directly from [PyPI](https://pypi.org/project/llama-stack/) with:
```bash
pip install llama-stack
```
2. **Install from source**:
Follow these steps to install from the source code:
```bash
mkdir -p ~/local
cd ~/local
git clone git@github.com:meta-llama/llama-stack.git
conda create -n stack python=3.10
conda activate stack
cd llama-stack
$CONDA_PREFIX/bin/pip install -e .
```
Refer to the [CLI Reference](./cli_reference.md) for details on Llama CLI commands.
## Starting Up Llama Stack Server
There are two ways to start the Llama Stack server:
1. **Using Docker**:
We provide a pre-built Docker image of Llama Stack, available in the [distributions](../distributions/) folder.
> **Note:** For GPU inference, set environment variables to specify the local directory with your model checkpoints and enable GPU inference.
```bash
export LLAMA_CHECKPOINT_DIR=~/.llama
```
Download Llama models with:
```bash
llama download --model-id Llama3.1-8B-Instruct
```
Start a Docker container with:
```bash
cd llama-stack/distributions/meta-reference-gpu
docker run -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./run.yaml:/root/my-run.yaml --gpus=all distribution-meta-reference-gpu --yaml_config /root/my-run.yaml
```
**Tip:** For remote providers, use `docker compose up` with scripts in the [distributions folder](../distributions/).
2. **Build->Configure->Run via Conda**:
For development, build a LlamaStack distribution from scratch.
**`llama stack build`**
Enter build information interactively:
```bash
llama stack build
```
**`llama stack configure`**
Run `llama stack configure <name>` using the name from the build step.
```bash
llama stack configure my-local-stack
```
**`llama stack run`**
Start the server with:
```bash
llama stack run my-local-stack
```
## Testing with Client
After setup, test the server with a client:
```bash
cd /path/to/llama-stack
conda activate <env>
python -m llama_stack.apis.inference.client localhost 5000
```
You can also send a POST request:
```bash
curl http://localhost:5000/inference/chat_completion \
-H "Content-Type: application/json" \
-d '{
"model": "Llama3.1-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write me a 2-sentence poem about the moon"}
],
"sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512}
}'
```
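If you prefer to stay in Python for this quick check, the sketch below sends the same request with `httpx`. This is an assumption on tooling only (`pip install httpx`); the endpoint and JSON body simply mirror the `curl` call above:
```python
import httpx

# Mirrors the curl request above: same endpoint and JSON body
payload = {
    "model": "Llama3.1-8B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write me a 2-sentence poem about the moon"},
    ],
    "sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512},
}

response = httpx.post(
    "http://localhost:5000/inference/chat_completion",
    json=payload,
    timeout=60.0,
)
print(response.json())
```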
For testing safety, run:
```bash
python -m llama_stack.apis.safety.client localhost 5000
```
Check our client SDKs for various languages: [Python](https://github.com/meta-llama/llama-stack-client-python), [Node](https://github.com/meta-llama/llama-stack-client-node), [Swift](https://github.com/meta-llama/llama-stack-client-swift), and [Kotlin](https://github.com/meta-llama/llama-stack-client-kotlin).
## Advanced Guides
For more on custom Llama Stack distributions, refer to our [Building a Llama Stack Distribution](./building_distro.md) guide.

@ -1,184 +0,0 @@
# Llama Stack Quickstart Guide
This guide will walk you through setting up an end-to-end workflow with Llama Stack, enabling you to perform text generation using the `Llama3.2-11B-Vision-Instruct` model. Follow these steps to get started quickly.
## Table of Contents
1. [Prerequisite](#prerequisite)
2. [Installation](#installation)
3. [Download Llama Models](#download-llama-models)
4. [Build, Configure, and Run Llama Stack](#build-configure-and-run-llama-stack)
5. [Testing with `curl`](#testing-with-curl)
6. [Testing with Python](#testing-with-python)
7. [Next Steps](#next-steps)
---
## Prerequisite
Ensure you have the following installed on your system:
- **Conda**: A package, dependency, and environment management tool.
---
## Installation
The `llama` CLI tool helps you manage the Llama Stack toolchain and agent systems.
**Install via PyPI:**
```bash
pip install llama-stack
```
*After installation, the `llama` command should be available in your PATH.*
---
## Download Llama Models
Download the necessary Llama model checkpoints using the `llama` CLI:
```bash
llama download --model-id Llama3.2-11B-Vision-Instruct
```
*Follow the CLI prompts to complete the download. You may need to accept a license agreement. Obtain an instant license [here](https://www.llama.com/llama-downloads/).*
---
## Build, Configure, and Run Llama Stack
### 1. Build the Llama Stack Distribution
We will default to building the `meta-reference-gpu` distribution; you can read more about the different distributions [here](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/index.html).
```bash
llama stack build --template meta-reference-gpu --image-type conda
```
### 2. Run the Llama Stack Distribution
> Launching a distribution initializes and configures the necessary APIs and Providers, enabling seamless interaction with the underlying model.
Start the server with the configured stack:
```bash
cd llama-stack/distributions/meta-reference-gpu
llama stack run ./run.yaml
```
*The server will start and listen on `http://localhost:5000` by default.*
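Before moving on, you can optionally confirm the server is reachable from Python. This is a minimal sketch, assuming the server exposes the same `/health` endpoint used in the local-vs-cloud guide and that `httpx` is installed (`pip install httpx`):
```python
import httpx

# Minimal liveness check against the running Llama Stack server
try:
    resp = httpx.get("http://localhost:5000/health", timeout=5.0)
    print("Server is up:", resp.status_code == 200)
except httpx.RequestError as exc:
    print(f"Server not reachable: {exc}")
```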
---
## Testing with `curl`
After setting up the server, verify it's working by sending a `POST` request using `curl`:
```bash
curl http://localhost:5000/inference/chat_completion \
-H "Content-Type: application/json" \
-d '{
"model": "Llama3.1-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write me a 2-sentence poem about the moon"}
],
"sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512}
}'
```
**Expected Output:**
```json
{
    "completion_message": {
        "role": "assistant",
        "content": "The moon glows softly in the midnight sky,\nA beacon of wonder, as it catches the eye.",
        "stop_reason": "out_of_tokens",
        "tool_calls": []
    },
    "logprobs": null
}
```
---
## Testing with Python
You can also interact with the Llama Stack server using a simple Python script. Below is an example:
### 1. Install Required Python Packages
The `llama-stack-client` library offers robust and efficient Python methods for interacting with the Llama Stack server.
```bash
pip install llama-stack-client
```
### 2. Create a Python Script (`test_llama_stack.py`)
```python
from llama_stack_client import LlamaStackClient
from llama_stack_client.types import SystemMessage, UserMessage
# Initialize the client
client = LlamaStackClient(base_url="http://localhost:5000")
# Create a chat completion request
response = client.inference.chat_completion(
    messages=[
        SystemMessage(content="You are a helpful assistant.", role="system"),
        UserMessage(content="Write me a 2-sentence poem about the moon", role="user"),
    ],
    model="Llama3.2-11B-Vision-Instruct",
)
# Print the response
print(response.completion_message.content)
```
### 3. Run the Python Script
```bash
python test_llama_stack.py
```
**Expected Output:**
```
The moon glows softly in the midnight sky,
A beacon of wonder, as it catches the eye.
```
With these steps, you should have a functional Llama Stack setup capable of generating text using the specified model. For more detailed information and advanced configurations, refer to some of our documentation below.
---
## Next Steps
- **Explore Other Guides**: Dive deeper into specific topics by following these guides:
- [Understanding Distributions](#)
- [Configure your Distro](#)
- [Doing Inference API Call and Fetching a Response from Endpoints](#)
- [Creating a Conversation Loop](#)
- [Sending Image to the Model](#)
- [Tool Calling: How to and Details](#)
- [Memory API: Show Simple In-Memory Retrieval](#)
- [Agents API: Explain Components](#)
- [Using Safety API in Conversation](#)
- [Prompt Engineering Guide](#)
- **Explore Client SDKs**: Utilize our client SDKs for various languages to integrate Llama Stack into your applications:
- [Python SDK](https://github.com/meta-llama/llama-stack-client-python)
- [Node SDK](https://github.com/meta-llama/llama-stack-client-node)
- [Swift SDK](https://github.com/meta-llama/llama-stack-client-swift)
- [Kotlin SDK](https://github.com/meta-llama/llama-stack-client-kotlin)
- **Advanced Configuration**: Learn how to customize your Llama Stack distribution by referring to the [Building a Llama Stack Distribution](./building_distro.md) guide.
- **Explore Example Apps**: Check out [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) for example applications built using Llama Stack.
---