llama-stack-mirror/docs/zero_to_hero_guide/00_Inference101.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "5af4f44e",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/zero_to_hero_guide/00_Inference101.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c1e7571c",
   "metadata": {},
   "source": [
    "# Llama Stack Inference Guide\n",
    "\n",
    "This document provides instructions on how to use Llama Stack's `chat_completion` function for generating text using the `Llama3.1-8B-Instruct` model. \n",
    "\n",
    "Before you begin, please ensure Llama Stack is installed and set up by following the [Getting Started Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/index.html).\n",
    "\n",
    "\n",
    "### Table of Contents\n",
    "1. [Quickstart](#quickstart)\n",
    "2. [Building Effective Prompts](#building-effective-prompts)\n",
    "3. [Conversation Loop](#conversation-loop)\n",
    "4. [Conversation History](#conversation-history)\n",
    "5. [Streaming Responses](#streaming-responses)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "414301dc",
   "metadata": {},
   "source": [
    "## Quickstart\n",
    "\n",
    "This section walks through each step to set up and make a simple text generation request.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "25b97dfe",
   "metadata": {},
   "source": [
    "### 0. Configuration\n",
    "Set up your connection parameters:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "38a39e44",
   "metadata": {},
   "outputs": [],
   "source": [
    "HOST = \"localhost\"  # Replace with your host\n",
    "PORT = 5000       # Replace with your port"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7dacaa2d-94e9-42e9-82a0-73522dfc7010",
   "metadata": {},
   "source": [
    "### 1. Set Up the Client\n",
    "\n",
    "Begin by importing the necessary components from Llama Stack’s client library:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "7a573752",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_stack_client import LlamaStackClient\n",
    "\n",
    "client = LlamaStackClient(base_url=f'http://{HOST}:{PORT}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "86366383",
   "metadata": {},
   "source": [
    "### 2. Create a Chat Completion Request\n",
    "\n",
    "Use the `chat_completion` function to define the conversation context. Each message you include should have a specific role and content:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "77c29dba",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "With soft fur and gentle eyes,\n",
      "The llama roams, a peaceful surprise.\n"
     ]
    }
   ],
   "source": [
    "response = client.inference.chat_completion(\n",
    "    messages=[\n",
    "        {\"role\": \"system\", \"content\": \"You are a friendly assistant.\"},\n",
    "        {\"role\": \"user\", \"content\": \"Write a two-sentence poem about llama.\"}\n",
    "    ],\n",
    "    model='Llama3.2-11B-Vision-Instruct',\n",
    ")\n",
    "\n",
    "print(response.completion_message.content)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e5f16949",
   "metadata": {},
   "source": [
    "## Building Effective Prompts\n",
    "\n",
    "Effective prompt creation (often called 'prompt engineering') is essential for quality responses. Here are best practices for structuring your prompts to get the most out of the Llama Stack model:\n",
    "\n",
    "### Sample Prompt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "5c6812da",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "O, fairest llama, with thy softest fleece,\n",
      "Thy gentle eyes, like sapphires, in serenity do cease.\n"
     ]
    }
   ],
   "source": [
    "response = client.inference.chat_completion(\n",
    "    messages=[\n",
    "        {\"role\": \"system\", \"content\": \"You are shakespeare.\"},\n",
    "        {\"role\": \"user\", \"content\": \"Write a two-sentence poem about llama.\"}\n",
    "    ],\n",
    "    model='Llama3.2-11B-Vision-Instruct',\n",
    ")\n",
    "\n",
    "print(response.completion_message.content)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c8690ef0",
   "metadata": {},
   "source": [
    "## Conversation Loop\n",
    "\n",
    "To create a continuous conversation loop, where users can input multiple messages in a session, use the following structure. This example runs an asynchronous loop, ending when the user types 'exit,' 'quit,' or 'bye.'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "02211625",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "User>  1+1\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[36m> Response: 2\u001b[0m\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "User>  what is llama\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[36m> Response: A llama is a domesticated mammal native to South America, specifically the Andean region. It belongs to the camelid family, which also includes camels, alpacas, guanacos, and vicuñas.\n",
      "\n",
      "Here are some interesting facts about llamas:\n",
      "\n",
      "1. **Physical Characteristics**: Llamas are large, even-toed ungulates with a distinctive appearance. They have a long neck, a small head, and a soft, woolly coat that can be various colors, including white, brown, gray, and black.\n",
      "2. **Size**: Llamas typically grow to be between 5 and 6 feet (1.5 to 1.8 meters) tall at the shoulder and weigh between 280 and 450 pounds (127 to 204 kilograms).\n",
      "3. **Habitat**: Llamas are native to the Andean highlands, where they live in herds and roam freely. They are well adapted to the harsh, high-altitude climate of the Andes.\n",
      "4. **Diet**: Llamas are herbivores and feed on a variety of plants, including grasses, leaves, and shrubs. They are known for their ability to digest plant material that other animals cannot.\n",
      "5. **Behavior**: Llamas are social animals and live in herds. They are known for their intelligence, curiosity, and strong sense of self-preservation.\n",
      "6. **Purpose**: Llamas have been domesticated for thousands of years and have been used for a variety of purposes, including:\n",
      "\t* **Pack animals**: Llamas are often used as pack animals, carrying goods and supplies over long distances.\n",
      "\t* **Fiber production**: Llama wool is highly valued for its softness, warmth, and durability.\n",
      "\t* **Meat**: Llama meat is consumed in some parts of the world, particularly in South America.\n",
      "\t* **Companionship**: Llamas are often kept as pets or companions, due to their gentle nature and intelligence.\n",
      "\n",
      "Overall, llamas are fascinating animals that have been an integral part of Andean culture for thousands of years.\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "import asyncio\n",
    "from llama_stack_client import LlamaStackClient\n",
    "from termcolor import cprint\n",
    "\n",
    "client = LlamaStackClient(base_url=f'http://{HOST}:{PORT}')\n",
    "\n",
    "async def chat_loop():\n",
    "    while True:\n",
    "        user_input = input('User> ')\n",
    "        if user_input.lower() in ['exit', 'quit', 'bye']:\n",
    "            cprint('Ending conversation. Goodbye!', 'yellow')\n",
    "            break\n",
    "\n",
    "        message = {\"role\": \"user\", \"content\": user_input}\n",
    "        response = client.inference.chat_completion(\n",
    "            messages=[message],\n",
    "            model='Llama3.2-11B-Vision-Instruct',\n",
    "        )\n",
    "        cprint(f'> Response: {response.completion_message.content}', 'cyan')\n",
    "\n",
    "# Run the chat loop in a Jupyter Notebook cell using await\n",
    "await chat_loop()\n",
    "# To run it in a python file, use this line instead\n",
    "# asyncio.run(chat_loop())\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8cf0d555",
   "metadata": {},
   "source": [
    "## Conversation History\n",
    "\n",
    "Maintaining a conversation history allows the model to retain context from previous interactions. Use a list to accumulate messages, enabling continuity throughout the chat session."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9496f75c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "User>  1+1\n"
     ]
    }
   ],
   "source": [
    "async def chat_loop():\n",
    "    conversation_history = []\n",
    "    while True:\n",
    "        user_input = input('User> ')\n",
    "        if user_input.lower() in ['exit', 'quit', 'bye']:\n",
    "            cprint('Ending conversation. Goodbye!', 'yellow')\n",
    "            break\n",
    "\n",
    "        user_message = {\"role\": \"user\", \"content\": user_input}\n",
    "        conversation_history.append(user_message)\n",
    "\n",
    "        response = client.inference.chat_completion(\n",
    "            messages=conversation_history,\n",
    "            model='Llama3.2-11B-Vision-Instruct',\n",
    "        )\n",
    "        cprint(f'> Response: {response.completion_message.content}', 'cyan')\n",
    "\n",
    "        # Append the assistant message with all required fields\n",
    "        assistant_message = {\n",
    "            \"role\": \"user\",\n",
    "            \"content\": response.completion_message.content,\n",
    "            # Add any additional required fields here if necessary\n",
    "        }\n",
    "        conversation_history.append(assistant_message)\n",
    "\n",
    "# Use `await` in the Jupyter Notebook cell to call the function\n",
    "await chat_loop()\n",
    "# To run it in a python file, use this line instead\n",
    "# asyncio.run(chat_loop())\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "03fcf5e0",
   "metadata": {},
   "source": [
    "## Streaming Responses\n",
    "\n",
    "Llama Stack offers a `stream` parameter in the `chat_completion` function, which allows partial responses to be returned progressively as they are generated. This can enhance user experience by providing immediate feedback without waiting for the entire response to be processed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d119026e",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_stack_client.lib.inference.event_logger import EventLogger\n",
    "\n",
    "async def run_main(stream: bool = True):\n",
    "    client = LlamaStackClient(base_url=f'http://{HOST}:{PORT}')\n",
    "\n",
    "    message = {\n",
    "        \"role\": \"user\",\n",
    "        \"content\": 'Write me a 3 sentence poem about llama'\n",
    "    }\n",
    "    cprint(f'User> {message[\"content\"]}', 'green')\n",
    "\n",
    "    response = client.inference.chat_completion(\n",
    "        messages=[message],\n",
    "        model='Llama3.2-11B-Vision-Instruct',\n",
    "        stream=stream,\n",
    "    )\n",
    "\n",
    "    if not stream:\n",
    "        cprint(f'> Response: {response.completion_message.content}', 'cyan')\n",
    "    else:\n",
    "        async for log in EventLogger().log(response):\n",
    "            log.print()\n",
    "\n",
    "# In a Jupyter Notebook cell, use `await` to call the function\n",
    "await run_main()\n",
    "# To run it in a python file, use this line instead\n",
    "# asyncio.run(run_main())\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}