{
"cells": [
{
"cell_type": "markdown",
"id": "c1e7571c",
"metadata": {},
"source": [
"# Llama Stack Inference Guide\n",
"\n",
"This guide shows how to use Llama Stack's `chat_completion` function to generate text with the `Llama3.2-11B-Vision-Instruct` model.\n",
"\n",
"Before you begin, please ensure Llama Stack is installed and set up by following the [Getting Started Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/index.html).\n",
"\n",
"\n",
"### Table of Contents\n",
"1. [Quickstart](#quickstart)\n",
"2. [Building Effective Prompts](#building-effective-prompts)\n",
"3. [Conversation Loop](#conversation-loop)\n",
"4. [Conversation History](#conversation-history)\n",
"5. [Streaming Responses](#streaming-responses)\n"
]
},
{
"cell_type": "markdown",
"id": "414301dc",
"metadata": {},
"source": [
"## Quickstart\n",
"\n",
"This section walks through the steps to set up the client and make a simple text generation request.\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "25b97dfe",
"metadata": {},
"source": [
"### 0. Configuration\n",
"Set up your connection parameters:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "38a39e44",
"metadata": {},
"outputs": [],
"source": [
"HOST = \"localhost\"  # Replace with your host\n",
"PORT = 5000  # Replace with your port"
]
},
{
"cell_type": "markdown",
"id": "7dacaa2d-94e9-42e9-82a0-73522dfc7010",
"metadata": {},
"source": [
"### 1. Set Up the Client\n",
"\n",
"Begin by importing the necessary components from Llama Stack’s client library:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "7a573752",
"metadata": {},
"outputs": [],
"source": [
"from llama_stack_client import LlamaStackClient\n",
"from llama_stack_client.types import SystemMessage, UserMessage\n",
"\n",
"client = LlamaStackClient(base_url=f'http://{HOST}:{PORT}')"
]
},
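{
"cell_type": "markdown",
"id": "3f4d9c21",
"metadata": {},
"source": [
"Optionally, you can confirm the client can reach the server before sending any prompts. The sketch below simply lists the models the server exposes using `client.models.list()` (the same call used later in the streaming example); the exact shape of the returned objects may vary by Llama Stack version, so it is printed as-is."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1b2c3d4e",
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check (sketch): list the models registered with the server.\n",
"# Assumes the Llama Stack server at HOST:PORT is up and reachable.\n",
"models_response = client.models.list()\n",
"print(models_response)"
]
},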
{
"cell_type": "markdown",
"id": "86366383",
"metadata": {},
"source": [
"### 2. Create a Chat Completion Request\n",
"\n",
"Use the `chat_completion` function to define the conversation context. Each message you include should have a specific role and content:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "77c29dba",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"A gentle llama roams the land,\n",
"With soft fur and a gentle hand.\n"
]
}
],
"source": [
"response = client.inference.chat_completion(\n",
"    messages=[\n",
"        SystemMessage(content='You are a friendly assistant.', role='system'),\n",
"        UserMessage(content='Write a two-sentence poem about llama.', role='user')\n",
"    ],\n",
"    model='Llama3.2-11B-Vision-Instruct',\n",
")\n",
"\n",
"print(response.completion_message.content)"
]
},
{
"cell_type": "markdown",
"id": "e5f16949",
"metadata": {},
"source": [
"## Building Effective Prompts\n",
"\n",
"Effective prompt creation (often called 'prompt engineering') is essential for quality responses. Here are best practices for structuring your prompts to get the most out of models served through Llama Stack:\n",
"\n",
"1. **System Messages**: Use `SystemMessage` to set the model's behavior. This is similar to providing top-level instructions for tone, format, or specific behavior.\n",
"   - **Example**: `SystemMessage(content='You are a friendly assistant that explains complex topics simply.', role='system')`\n",
"2. **User Messages**: Define the task or question you want to ask the model with a `UserMessage`. The clearer and more direct you are, the better the response.\n",
"   - **Example**: `UserMessage(content='Explain recursion in programming in simple terms.', role='user')`\n",
"\n",
"### Sample Prompt"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "5c6812da",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"O, fairest llama, with thy softest fleece,\n",
"Thy gentle eyes, like sapphires, in serenity do cease.\n"
]
}
],
"source": [
"response = client.inference.chat_completion(\n",
"    messages=[\n",
"        SystemMessage(content='You are Shakespeare.', role='system'),\n",
"        UserMessage(content='Write a two-sentence poem about llama.', role='user')\n",
"    ],\n",
"    model='Llama3.2-11B-Vision-Instruct',\n",
")\n",
"\n",
"print(response.completion_message.content)"
]
},
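{
"cell_type": "markdown",
"id": "7e8f9a0b",
"metadata": {},
"source": [
"As a further illustration, the cell below simply runs the two example messages from the best-practice list above (the 'explains complex topics simply' system message and the recursion question). It is a sketch that reuses the same `chat_completion` call as the Quickstart; the output will vary from run to run, so none is recorded here."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8d1e2f3a",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: combine the example system and user messages from the list above.\n",
"response = client.inference.chat_completion(\n",
"    messages=[\n",
"        SystemMessage(content='You are a friendly assistant that explains complex topics simply.', role='system'),\n",
"        UserMessage(content='Explain recursion in programming in simple terms.', role='user')\n",
"    ],\n",
"    model='Llama3.2-11B-Vision-Instruct',\n",
")\n",
"\n",
"print(response.completion_message.content)"
]
},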
{
"cell_type": "markdown",
"id": "c8690ef0",
"metadata": {},
"source": [
"## Conversation Loop\n",
"\n",
"To create a continuous conversation loop, where users can input multiple messages in a session, use the following structure. This example runs an asynchronous loop, ending when the user types 'exit,' 'quit,' or 'bye.'"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "02211625",
"metadata": {},
"outputs": [
{
"name": "stdin",
"output_type": "stream",
"text": [
"User> Write me a 3 sentence poem about alpaca\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[36m> Response: Softly grazing, gentle soul,\n",
"Alpaca's fleece, a treasure whole,\n",
"In Andean fields, they softly roll.\u001b[0m\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
"User> exit\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[33mEnding conversation. Goodbye!\u001b[0m\n"
]
}
],
"source": [
"import asyncio\n",
"from llama_stack_client import LlamaStackClient\n",
"from llama_stack_client.types import UserMessage\n",
"from termcolor import cprint\n",
"\n",
"client = LlamaStackClient(base_url=f'http://{HOST}:{PORT}')\n",
"\n",
"async def chat_loop():\n",
"    while True:\n",
"        user_input = input('User> ')\n",
"        if user_input.lower() in ['exit', 'quit', 'bye']:\n",
"            cprint('Ending conversation. Goodbye!', 'yellow')\n",
"            break\n",
"\n",
"        message = UserMessage(content=user_input, role='user')\n",
"        response = client.inference.chat_completion(\n",
"            messages=[message],\n",
"            model='Llama3.2-11B-Vision-Instruct',\n",
"        )\n",
"        cprint(f'> Response: {response.completion_message.content}', 'cyan')\n",
"\n",
"# Run the chat loop in a Jupyter Notebook cell using `await`\n",
"await chat_loop()\n",
"# To run it in a Python script, use this line instead:\n",
"# asyncio.run(chat_loop())"
]
},
{
"cell_type": "markdown",
"id": "8cf0d555",
"metadata": {},
"source": [
"## Conversation History\n",
"\n",
"Maintaining a conversation history allows the model to retain context from previous interactions. Use a list to accumulate messages, enabling continuity throughout the chat session."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "9496f75c",
"metadata": {},
"outputs": [
{
"name": "stdin",
"output_type": "stream",
"text": [
"User> what is 1+1\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[36m> Response: 1 + 1 = 2\u001b[0m\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
"User> what is llama + alpaca\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[36m> Response: That's a creative and imaginative question. However, since llamas and alpacas are animals, not numbers, we can't perform a mathematical operation on them.\n",
"\n",
"But if we were to interpret this as a creative or humorous question, we could say that the result of \"llama + alpaca\" is a fun and fuzzy bundle of South American camelid cuteness!\u001b[0m\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
"User> what was the first question\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[36m> Response: The first question was \"what is 1+1\"\u001b[0m\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
"User> exit\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[33mEnding conversation. Goodbye!\u001b[0m\n"
]
}
],
"source": [
"async def chat_loop():\n",
"    conversation_history = []\n",
"    while True:\n",
"        user_input = input('User> ')\n",
"        if user_input.lower() in ['exit', 'quit', 'bye']:\n",
"            cprint('Ending conversation. Goodbye!', 'yellow')\n",
"            break\n",
"\n",
"        user_message = UserMessage(content=user_input, role='user')\n",
"        conversation_history.append(user_message)\n",
"\n",
"        response = client.inference.chat_completion(\n",
"            messages=conversation_history,\n",
"            model='Llama3.2-11B-Vision-Instruct',\n",
"        )\n",
"        cprint(f'> Response: {response.completion_message.content}', 'cyan')\n",
"\n",
"        # Append the assistant's reply to the history so the model sees it on later turns.\n",
"        # (For simplicity it is stored here as another UserMessage, reusing the type imported above.)\n",
"        assistant_message = UserMessage(content=response.completion_message.content, role='user')\n",
"        conversation_history.append(assistant_message)\n",
"\n",
"# Use `await` in the Jupyter Notebook cell to call the function\n",
"await chat_loop()\n",
"# To run it in a Python script, use this line instead:\n",
"# asyncio.run(chat_loop())"
]
},
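{
"cell_type": "markdown",
"id": "b4c5d6e7",
"metadata": {},
"source": [
"A small variation, shown as a sketch: seed the history with a `SystemMessage` before the first turn so the assistant keeps one persona for the whole session. The cell below illustrates the idea with a single turn, using only names already defined in this notebook; output is omitted because it will vary."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c7d8e9f0",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: start the history with a system message so the assistant\n",
"# keeps a consistent persona across turns.\n",
"history = [SystemMessage(content='You are a friendly assistant.', role='system')]\n",
"history.append(UserMessage(content='Write a two-sentence poem about llama.', role='user'))\n",
"\n",
"response = client.inference.chat_completion(\n",
"    messages=history,\n",
"    model='Llama3.2-11B-Vision-Instruct',\n",
")\n",
"print(response.completion_message.content)"
]
},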
{
"cell_type": "markdown",
"id": "03fcf5e0",
"metadata": {},
"source": [
"## Streaming Responses\n",
"\n",
"Llama Stack offers a `stream` parameter in the `chat_completion` function, which allows partial responses to be returned progressively as they are generated. This can enhance user experience by providing immediate feedback without waiting for the entire response to be processed.\n",
"\n",
"### Example: Streaming Responses"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "d119026e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[32mUser> Write me a 3 sentence poem about llama\u001b[0m\n",
"\u001b[36mAssistant> \u001b[0m\u001b[33mSoft\u001b[0m\u001b[33mly\u001b[0m\u001b[33m padded\u001b[0m\u001b[33m feet\u001b[0m\u001b[33m on\u001b[0m\u001b[33m the\u001b[0m\u001b[33m ground\u001b[0m\u001b[33m,\n",
"\u001b[0m\u001b[33mA\u001b[0m\u001b[33m gentle\u001b[0m\u001b[33m llama\u001b[0m\u001b[33m's\u001b[0m\u001b[33m peaceful\u001b[0m\u001b[33m sound\u001b[0m\u001b[33m,\n",
"\u001b[0m\u001b[33mF\u001b[0m\u001b[33murry\u001b[0m\u001b[33m coat\u001b[0m\u001b[33m and\u001b[0m\u001b[33m calm\u001b[0m\u001b[33m,\u001b[0m\u001b[33m serene\u001b[0m\u001b[33m eyes\u001b[0m\u001b[33m all\u001b[0m\u001b[33m around\u001b[0m\u001b[33m.\u001b[0m\u001b[97m\u001b[0m\n"
]
}
],
"source": [
"import asyncio\n",
"from llama_stack_client import LlamaStackClient\n",
"from llama_stack_client.lib.inference.event_logger import EventLogger\n",
"from llama_stack_client.types import UserMessage\n",
"from termcolor import cprint\n",
"\n",
"async def run_main(stream: bool = True):\n",
"    client = LlamaStackClient(base_url=f'http://{HOST}:{PORT}')\n",
"\n",
"    message = UserMessage(\n",
"        content='Write me a 3 sentence poem about llama', role='user'\n",
"    )\n",
"    cprint(f'User> {message.content}', 'green')\n",
"\n",
"    response = client.inference.chat_completion(\n",
"        messages=[message],\n",
"        model='Llama3.2-11B-Vision-Instruct',\n",
"        stream=stream,\n",
"    )\n",
"\n",
"    if not stream:\n",
"        cprint(f'> Response: {response.completion_message.content}', 'cyan')\n",
"    else:\n",
"        async for log in EventLogger().log(response):\n",
"            log.print()\n",
"\n",
"    # Optionally, list the models registered with the server.\n",
"    models_response = client.models.list()\n",
"\n",
"# In a Jupyter Notebook cell, use `await` to call the function\n",
"await run_main()\n",
"# To run it in a Python script, use this line instead:\n",
"# asyncio.run(run_main())"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.15"
}
},
"nbformat": 4,
"nbformat_minor": 5
}