{
"cells": [
{
"cell_type": "markdown",
"id": "c1e7571c",
"metadata": {},
"source": [
"# Llama Stack Inference Guide\n",
"\n",
"This guide shows how to use Llama Stack's `chat_completion` function to generate text with the `Llama3.2-11B-Vision-Instruct` model.\n",
"\n",
"Before you begin, please ensure Llama Stack is installed and set up by following the [Getting Started Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/index.html).\n",
"\n",
"\n",
"### Table of Contents\n",
"1. [Quickstart](#quickstart)\n",
"2. [Building Effective Prompts](#building-effective-prompts)\n",
"3. [Conversation Loop](#conversation-loop)\n",
"4. [Conversation History](#conversation-history)\n",
"5. [Streaming Responses](#streaming-responses)\n"
]
},
{
"cell_type": "markdown",
"id": "414301dc",
"metadata": {},
"source": [
"## Quickstart\n",
"\n",
"This section walks through the steps to set up the client and make a simple text generation request.\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "25b97dfe",
"metadata": {},
"source": [
"### 0. Configuration\n",
"Set up your connection parameters:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "38a39e44",
"metadata": {},
"outputs": [],
"source": [
"HOST = \"localhost\"  # Replace with your host\n",
"PORT = 5000  # Replace with your port"
]
},
{
"cell_type": "markdown",
"id": "7dacaa2d-94e9-42e9-82a0-73522dfc7010",
"metadata": {},
"source": [
"### 1. Set Up the Client\n",
"\n",
"Begin by importing the necessary components from Llama Stack’s client library:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "7a573752",
"metadata": {},
"outputs": [],
"source": [
"from llama_stack_client import LlamaStackClient\n",
"from llama_stack_client.types import SystemMessage, UserMessage\n",
"\n",
"client = LlamaStackClient(base_url=f'http://{HOST}:{PORT}')"
]
},
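{
"cell_type": "markdown",
"id": "3f4d9c21",
"metadata": {},
"source": [
"Optionally, you can confirm the client can reach the server before sending any prompts. The sketch below simply lists the models the server exposes using `client.models.list()` (the same call used later in the streaming example); the exact shape of the returned objects may vary by Llama Stack version, so it is printed as-is."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1b2c3d4e",
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check (sketch): list the models registered with the server.\n",
"# Assumes the Llama Stack server at HOST:PORT is up and reachable.\n",
"models_response = client.models.list()\n",
"print(models_response)"
]
},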
{
"cell_type": "markdown",
"id": "86366383",
"metadata": {},
"source": [
"### 2. Create a Chat Completion Request\n",
"\n",
"Use the `chat_completion` function to define the conversation context. Each message you include should have a specific role and content:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "77c29dba",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"A gentle llama roams the land,\n",
"With soft fur and a gentle hand.\n"
]
}
],
"source": [
"response = client.inference.chat_completion(\n",
"    messages=[\n",
"        SystemMessage(content='You are a friendly assistant.', role='system'),\n",
"        UserMessage(content='Write a two-sentence poem about llama.', role='user')\n",
"    ],\n",
"    model='Llama3.2-11B-Vision-Instruct',\n",
")\n",
"\n",
"print(response.completion_message.content)"
]
},
{
"cell_type": "markdown",
"id": "e5f16949",
"metadata": {},
"source": [
"## Building Effective Prompts\n",
"\n",
"Effective prompt creation (often called 'prompt engineering') is essential for quality responses. Here are best practices for structuring your prompts to get the most out of models served through Llama Stack:\n",
"\n",
"1. **System Messages**: Use `SystemMessage` to set the model's behavior. This is similar to providing top-level instructions for tone, format, or specific behavior.\n",
"   - **Example**: `SystemMessage(content='You are a friendly assistant that explains complex topics simply.', role='system')`\n",
"2. **User Messages**: Define the task or question you want to ask the model with a `UserMessage`. The clearer and more direct you are, the better the response.\n",
"   - **Example**: `UserMessage(content='Explain recursion in programming in simple terms.', role='user')`\n",
"\n",
"### Sample Prompt"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "5c6812da",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"O, fairest llama, with thy softest fleece,\n",
"Thy gentle eyes, like sapphires, in serenity do cease.\n"
]
}
],
"source": [
"response = client.inference.chat_completion(\n",
"    messages=[\n",
"        SystemMessage(content='You are Shakespeare.', role='system'),\n",
"        UserMessage(content='Write a two-sentence poem about llama.', role='user')\n",
"    ],\n",
"    model='Llama3.2-11B-Vision-Instruct',\n",
")\n",
"\n",
"print(response.completion_message.content)"
]
},
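{
"cell_type": "markdown",
"id": "7e8f9a0b",
"metadata": {},
"source": [
"As a further illustration, the cell below simply runs the two example messages from the best-practice list above (the 'explains complex topics simply' system message and the recursion question). It is a sketch that reuses the same `chat_completion` call as the Quickstart; the output will vary from run to run, so none is recorded here."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8d1e2f3a",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: combine the example system and user messages from the list above.\n",
"response = client.inference.chat_completion(\n",
"    messages=[\n",
"        SystemMessage(content='You are a friendly assistant that explains complex topics simply.', role='system'),\n",
"        UserMessage(content='Explain recursion in programming in simple terms.', role='user')\n",
"    ],\n",
"    model='Llama3.2-11B-Vision-Instruct',\n",
")\n",
"\n",
"print(response.completion_message.content)"
]
},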
{
"cell_type": "markdown",
"id": "c8690ef0",
"metadata": {},
"source": [
"## Conversation Loop\n",
"\n",
"To create a continuous conversation loop, where users can input multiple messages in a session, use the following structure. This example runs an asynchronous loop, ending when the user types 'exit,' 'quit,' or 'bye.'"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "02211625",
"metadata": {},
"outputs": [
{
"name": "stdin",
"output_type": "stream",
"text": [
"User> Write me a 3 sentence poem about alpaca\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[36m> Response: Softly grazing, gentle soul,\n",
"Alpaca's fleece, a treasure whole,\n",
"In Andean fields, they softly roll.\u001b[0m\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
"User> exit\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[33mEnding conversation. Goodbye!\u001b[0m\n"
]
}
],
"source": [
"import asyncio\n",
"from llama_stack_client import LlamaStackClient\n",
"from llama_stack_client.types import UserMessage\n",
"from termcolor import cprint\n",
"\n",
"client = LlamaStackClient(base_url=f'http://{HOST}:{PORT}')\n",
"\n",
"async def chat_loop():\n",
"    while True:\n",
"        user_input = input('User> ')\n",
"        if user_input.lower() in ['exit', 'quit', 'bye']:\n",
"            cprint('Ending conversation. Goodbye!', 'yellow')\n",
"            break\n",
"\n",
"        message = UserMessage(content=user_input, role='user')\n",
"        response = client.inference.chat_completion(\n",
"            messages=[message],\n",
"            model='Llama3.2-11B-Vision-Instruct',\n",
"        )\n",
"        cprint(f'> Response: {response.completion_message.content}', 'cyan')\n",
"\n",
"# Run the chat loop in a Jupyter Notebook cell using `await`\n",
"await chat_loop()\n",
"# To run it in a Python script, use this line instead:\n",
"# asyncio.run(chat_loop())"
]
},
{
"cell_type": "markdown",
"id": "8cf0d555",
"metadata": {},
"source": [
"## Conversation History\n",
"\n",
"Maintaining a conversation history allows the model to retain context from previous interactions. Use a list to accumulate messages, enabling continuity throughout the chat session."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "9496f75c",
"metadata": {},
"outputs": [
{
"name": "stdin",
"output_type": "stream",
"text": [
"User> what is 1+1\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[36m> Response: 1 + 1 = 2\u001b[0m\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
"User> what is llama + alpaca\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[36m> Response: That's a creative and imaginative question. However, since llamas and alpacas are animals, not numbers, we can't perform a mathematical operation on them.\n",
"\n",
"But if we were to interpret this as a creative or humorous question, we could say that the result of \"llama + alpaca\" is a fun and fuzzy bundle of South American camelid cuteness!\u001b[0m\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
"User> what was the first question\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[36m> Response: The first question was \"what is 1+1\"\u001b[0m\n"
]
},
{
"name": "stdin",
"output_type": "stream",
"text": [
"User> exit\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[33mEnding conversation. Goodbye!\u001b[0m\n"
]
}
],
"source": [
"async def chat_loop():\n",
"    conversation_history = []\n",
"    while True:\n",
"        user_input = input('User> ')\n",
"        if user_input.lower() in ['exit', 'quit', 'bye']:\n",
"            cprint('Ending conversation. Goodbye!', 'yellow')\n",
"            break\n",
"\n",
"        user_message = UserMessage(content=user_input, role='user')\n",
"        conversation_history.append(user_message)\n",
"\n",
"        response = client.inference.chat_completion(\n",
"            messages=conversation_history,\n",
"            model='Llama3.2-11B-Vision-Instruct',\n",
"        )\n",
"        cprint(f'> Response: {response.completion_message.content}', 'cyan')\n",
"\n",
"        # Append the assistant's reply to the history so the model sees it on later turns.\n",
"        # (For simplicity it is stored here as another UserMessage, reusing the type imported above.)\n",
"        assistant_message = UserMessage(content=response.completion_message.content, role='user')\n",
"        conversation_history.append(assistant_message)\n",
"\n",
"# Use `await` in the Jupyter Notebook cell to call the function\n",
"await chat_loop()\n",
"# To run it in a Python script, use this line instead:\n",
"# asyncio.run(chat_loop())"
]
},
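{
"cell_type": "markdown",
"id": "b4c5d6e7",
"metadata": {},
"source": [
"A small variation, shown as a sketch: seed the history with a `SystemMessage` before the first turn so the assistant keeps one persona for the whole session. The cell below illustrates the idea with a single turn, using only names already defined in this notebook; output is omitted because it will vary."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c7d8e9f0",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: start the history with a system message so the assistant\n",
"# keeps a consistent persona across turns.\n",
"history = [SystemMessage(content='You are a friendly assistant.', role='system')]\n",
"history.append(UserMessage(content='Write a two-sentence poem about llama.', role='user'))\n",
"\n",
"response = client.inference.chat_completion(\n",
"    messages=history,\n",
"    model='Llama3.2-11B-Vision-Instruct',\n",
")\n",
"print(response.completion_message.content)"
]
},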
{
"cell_type": "markdown",
"id": "03fcf5e0",
"metadata": {},
"source": [
"## Streaming Responses\n",
"\n",
"Llama Stack offers a `stream` parameter in the `chat_completion` function, which allows partial responses to be returned progressively as they are generated. This can enhance user experience by providing immediate feedback without waiting for the entire response to be processed.\n",
"\n",
"### Example: Streaming Responses"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "d119026e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[32mUser> Write me a 3 sentence poem about llama\u001b[0m\n",
"\u001b[36mAssistant> \u001b[0m\u001b[33mSoft\u001b[0m\u001b[33mly\u001b[0m\u001b[33m padded\u001b[0m\u001b[33m feet\u001b[0m\u001b[33m on\u001b[0m\u001b[33m the\u001b[0m\u001b[33m ground\u001b[0m\u001b[33m,\n",
"\u001b[0m\u001b[33mA\u001b[0m\u001b[33m gentle\u001b[0m\u001b[33m llama\u001b[0m\u001b[33m's\u001b[0m\u001b[33m peaceful\u001b[0m\u001b[33m sound\u001b[0m\u001b[33m,\n",
"\u001b[0m\u001b[33mF\u001b[0m\u001b[33murry\u001b[0m\u001b[33m coat\u001b[0m\u001b[33m and\u001b[0m\u001b[33m calm\u001b[0m\u001b[33m,\u001b[0m\u001b[33m serene\u001b[0m\u001b[33m eyes\u001b[0m\u001b[33m all\u001b[0m\u001b[33m around\u001b[0m\u001b[33m.\u001b[0m\u001b[97m\u001b[0m\n"
]
}
],
"source": [
"import asyncio\n",
"from llama_stack_client import LlamaStackClient\n",
"from llama_stack_client.lib.inference.event_logger import EventLogger\n",
"from llama_stack_client.types import UserMessage\n",
"from termcolor import cprint\n",
"\n",
"async def run_main(stream: bool = True):\n",
"    client = LlamaStackClient(base_url=f'http://{HOST}:{PORT}')\n",
"\n",
"    message = UserMessage(\n",
"        content='Write me a 3 sentence poem about llama', role='user'\n",
"    )\n",
"    cprint(f'User> {message.content}', 'green')\n",
"\n",
"    response = client.inference.chat_completion(\n",
"        messages=[message],\n",
"        model='Llama3.2-11B-Vision-Instruct',\n",
"        stream=stream,\n",
"    )\n",
"\n",
"    if not stream:\n",
"        cprint(f'> Response: {response.completion_message.content}', 'cyan')\n",
"    else:\n",
"        async for log in EventLogger().log(response):\n",
"            log.print()\n",
"\n",
"    # Optionally, list the models registered with the server.\n",
"    models_response = client.models.list()\n",
"\n",
"# In a Jupyter Notebook cell, use `await` to call the function\n",
"await run_main()\n",
"# To run it in a Python script, use this line instead:\n",
"# asyncio.run(run_main())"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.15"
}
},
"nbformat": 4,
"nbformat_minor": 5
}