doc enhancements, converted md into jupyter, reorganize files

2025-12-08 11:07:22 +00:00 · 2024-11-05 13:12:30 -08:00 · 2024-11-05 13:12:30 -08:00 · ecad16b904
commit ecad16b904
parent 0f08f77565
13 changed files with 450 additions and 113 deletions
--- a/docs/zero_to_hero_guide/00_Inference101.ipynb
+++ b/docs/zero_to_hero_guide/00_Inference101.ipynb
@ -0,0 +1,247 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "c1e7571c",
+   "metadata": {},
+   "source": [
+    "# Llama Stack Inference Guide\n",
+    "\n",
+    "This document provides instructions on how to use Llama Stack's `chat_completion` function for generating text using the `Llama3.2-11B-Vision-Instruct` model. Before you begin, please ensure Llama Stack is installed and set up by following the [Getting Started Guide](https://llama-stack.readthedocs.io/en/latest/).\n",
+    "\n",
+    "### Table of Contents\n",
+    "1. [Quickstart](#quickstart)\n",
+    "2. [Building Effective Prompts](#building-effective-prompts)\n",
+    "3. [Conversation Loop](#conversation-loop)\n",
+    "4. [Conversation History](#conversation-history)\n",
+    "5. [Streaming Responses](#streaming-responses)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "414301dc",
+   "metadata": {},
+   "source": [
+    "## Quickstart\n",
+    "\n",
+    "This section walks through each step to set up and make a simple text generation request.\n",
+    "\n",
+    "### 1. Set Up the Client\n",
+    "\n",
+    "Begin by importing the necessary components from Llama Stack’s client library:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7a573752",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from llama_stack_client import LlamaStackClient\n",
+    "from llama_stack_client.types import SystemMessage, UserMessage\n",
+    "\n",
+    "client = LlamaStackClient(base_url='http://localhost:5000')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "86366383",
+   "metadata": {},
+   "source": [
+    "### 2. Create a Chat Completion Request\n",
+    "\n",
+    "Use the `chat_completion` function to define the conversation context. Each message you include should have a specific role and content:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "77c29dba",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "response = client.inference.chat_completion(\n",
+    "    messages=[\n",
+    "        SystemMessage(content='You are a friendly assistant.', role='system'),\n",
+    "        UserMessage(content='Write a two-sentence poem about llama.', role='user')\n",
+    "    ],\n",
+    "    model='Llama3.2-11B-Vision-Instruct',\n",
+    ")\n",
+    "\n",
+    "print(response.completion_message.content)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e5f16949",
+   "metadata": {},
+   "source": [
+    "## Building Effective Prompts\n",
+    "\n",
+    "Effective prompt creation (often called 'prompt engineering') is essential for quality responses. Here are best practices for structuring your prompts to get the most out of the Llama Stack model:\n",
+    "\n",
+    "1. **System Messages**: Use `SystemMessage` to set the model's behavior. This is similar to providing top-level instructions for tone, format, or specific behavior.\n",
+    "   - **Example**: `SystemMessage(content='You are a friendly assistant that explains complex topics simply.')`\n",
+    "2. **User Messages**: Define the task or question you want to ask the model with a `UserMessage`. The clearer and more direct you are, the better the response.\n",
+    "   - **Example**: `UserMessage(content='Explain recursion in programming in simple terms.')`\n",
+    "\n",
+    "### Sample Prompt"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5c6812da",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "response = client.inference.chat_completion(\n",
+    "    messages=[\n",
+    "        SystemMessage(content='You are Shakespeare.', role='system'),\n",
+    "        UserMessage(content='Write a two-sentence poem about llama.', role='user')\n",
+    "    ],\n",
+    "    model='Llama3.2-11B-Vision-Instruct',\n",
+    ")\n",
+    "\n",
+    "print(response.completion_message.content)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c8690ef0",
+   "metadata": {},
+   "source": [
+    "## Conversation Loop\n",
+    "\n",
+    "To create a continuous conversation loop, where users can input multiple messages in a session, use the following structure. This example runs an asynchronous loop, ending when the user types 'exit,' 'quit,' or 'bye.'"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "02211625",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import asyncio\n",
+    "from llama_stack_client import LlamaStackClient\n",
+    "from llama_stack_client.types import UserMessage\n",
+    "from termcolor import cprint\n",
+    "\n",
+    "client = LlamaStackClient(base_url='http://localhost:5000')\n",
+    "\n",
+    "async def chat_loop():\n",
+    "    while True:\n",
+    "        user_input = input('User> ')\n",
+    "        if user_input.lower() in ['exit', 'quit', 'bye']:\n",
+    "            cprint('Ending conversation. Goodbye!', 'yellow')\n",
+    "            break\n",
+    "\n",
+    "        message = UserMessage(content=user_input, role='user')\n",
+    "        response = client.inference.chat_completion(\n",
+    "            messages=[message],\n",
+    "            model='Llama3.2-11B-Vision-Instruct',\n",
+    "        )\n",
+    "        cprint(f'> Response: {response.completion_message.content}', 'cyan')\n",
+    "\n",
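+    "# In a notebook, await the coroutine directly; Jupyter already runs an event loop, so asyncio.run() would raise a RuntimeError.\n",
+    "await chat_loop()"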
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8cf0d555",
+   "metadata": {},
+   "source": [
+    "## Conversation History\n",
+    "\n",
+    "Maintaining a conversation history allows the model to retain context from previous interactions. Use a list to accumulate messages, enabling continuity throughout the chat session."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9496f75c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "async def chat_loop():\n",
+    "    conversation_history = []\n",
+    "    while True:\n",
+    "        user_input = input('User> ')\n",
+    "        if user_input.lower() in ['exit', 'quit', 'bye']:\n",
+    "            cprint('Ending conversation. Goodbye!', 'yellow')\n",
+    "            break\n",
+    "\n",
+    "        user_message = UserMessage(content=user_input, role='user')\n",
+    "        conversation_history.append(user_message)\n",
+    "\n",
+    "        response = client.inference.chat_completion(\n",
+    "            messages=conversation_history,\n",
+    "            model='Llama3.2-11B-Vision-Instruct',\n",
+    "        )\n",
+    "        cprint(f'> Response: {response.completion_message.content}', 'cyan')\n",
+    "\n",
+    "        assistant_message = UserMessage(content=response.completion_message.content, role='user')\n",
+    "        conversation_history.append(assistant_message)\n",
+    "\n",
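+    "# In a notebook, await the coroutine directly; Jupyter already runs an event loop, so asyncio.run() would raise a RuntimeError.\n",
+    "await chat_loop()"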
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "03fcf5e0",
+   "metadata": {},
+   "source": [
+    "## Streaming Responses\n",
+    "\n",
+    "Llama Stack offers a `stream` parameter in the `chat_completion` function, which allows partial responses to be returned progressively as they are generated. This can enhance user experience by providing immediate feedback without waiting for the entire response to be processed.\n",
+    "\n",
+    "### Example: Streaming Responses"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d119026e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import asyncio\n",
+    "from llama_stack_client import LlamaStackClient\n",
+    "from llama_stack_client.lib.inference.event_logger import EventLogger\n",
+    "from llama_stack_client.types import UserMessage\n",
+    "from termcolor import cprint\n",
+    "\n",
+    "async def run_main(stream: bool = True):\n",
+    "    client = LlamaStackClient(base_url='http://localhost:5000')\n",
+    "\n",
+    "    message = UserMessage(\n",
+    "        content='hello world, write me a 2 sentence poem about the moon', role='user'\n",
+    "    )\n",
+    "    cprint(f'User> {message.content}', 'green')\n",
+    "\n",
+    "    response = client.inference.chat_completion(\n",
+    "        messages=[message],\n",
+    "        model='Llama3.2-11B-Vision-Instruct',\n",
+    "        stream=stream,\n",
+    "    )\n",
+    "\n",
+    "    if not stream:\n",
+    "        cprint(f'> Response: {response}', 'cyan')\n",
+    "    else:\n",
+    "        async for log in EventLogger().log(response):\n",
+    "            log.print()\n",
+    "\n",
+    "    models_response = client.models.list()\n",
+    "    print(models_response)\n",
+    "\n",
+    "# In a notebook, await the coroutine directly; Jupyter already runs an event loop, so asyncio.run() would raise a RuntimeError.\n",
+    "await run_main()"
+   ]
+  }
+ ],
+ "metadata": {},
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
--- a/docs/zero_to_hero_guide/00_Local_Cloud_Inference101.ipynb
+++ b/docs/zero_to_hero_guide/00_Local_Cloud_Inference101.ipynb
@ -0,0 +1,201 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "a0ed972d",
+   "metadata": {},
+   "source": [
+    "# Switching Between Local and Cloud Models with Llama Stack\n",
+    "\n",
+    "This guide provides a streamlined setup to switch between local and cloud clients for text generation with Llama Stack’s `chat_completion` API. This setup enables automatic fallback to a cloud instance if the local client is unavailable.\n",
+    "\n",
+    "### Prerequisites\n",
+    "Before you begin, please ensure Llama Stack is installed and the distribution is set up by following the [Getting Started Guide](https://llama-stack.readthedocs.io/en/latest/). You will need to run two distributions, a local and a cloud distribution, for this demo to work.\n",
+    "\n",
+    "### Implementation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "df89cff7",
+   "metadata": {},
+   "source": [
+    "#### 1. Set Up Local and Cloud Clients\n",
+    "\n",
+    "Initialize both clients, specifying the `base_url` for each instance. In this case, we have the local distribution running on `http://localhost:5000` and the cloud distribution running on `http://localhost:5001`.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7f868dfe",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from llama_stack_client import LlamaStackClient\n",
+    "\n",
+    "# Configure local and cloud clients\n",
+    "local_client = LlamaStackClient(base_url='http://localhost:5000')\n",
+    "cloud_client = LlamaStackClient(base_url='http://localhost:5001')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "894689c1",
+   "metadata": {},
+   "source": [
+    "#### 2. Client Selection with Fallback\n",
+    "\n",
+    "The `select_client` function checks if the local client is available using a lightweight `/health` check. If the local client is unavailable, it automatically switches to the cloud client.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ff0c8277",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import httpx\n",
+    "from termcolor import cprint\n",
+    "\n",
+    "async def select_client() -> LlamaStackClient:\n",
+    "    \"\"\"Use local client if available; otherwise, switch to cloud client.\"\"\"\n",
+    "    try:\n",
+    "        async with httpx.AsyncClient() as http_client:\n",
+    "            response = await http_client.get(f'{local_client.base_url}/health')\n",
+    "            if response.status_code == 200:\n",
+    "                cprint('Using local client.', 'yellow')\n",
+    "                return local_client\n",
+    "    except httpx.RequestError:\n",
+    "        pass\n",
+    "    cprint('Local client unavailable. Switching to cloud client.', 'yellow')\n",
+    "    return cloud_client"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9ccfe66f",
+   "metadata": {},
+   "source": [
+    "#### 3. Generate a Response\n",
+    "\n",
+    "After selecting the client, you can generate text using `chat_completion`. This example sends a sample prompt to the model and prints the response.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5e19cc20",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from llama_stack_client.lib.inference.event_logger import EventLogger\n",
+    "from llama_stack_client.types import UserMessage\n",
+    "\n",
+    "async def get_llama_response(stream: bool = True):\n",
+    "    client = await select_client()  # Selects the available client\n",
+    "    message = UserMessage(content='hello world, write me a 2 sentence poem about the moon', role='user')\n",
+    "    cprint(f'User> {message.content}', 'green')\n",
+    "\n",
+    "    response = client.inference.chat_completion(\n",
+    "        messages=[message],\n",
+    "        model='Llama3.2-11B-Vision-Instruct',\n",
+    "        stream=stream,\n",
+    "    )\n",
+    "\n",
+    "    if not stream:\n",
+    "        cprint(f'> Response: {response}', 'cyan')\n",
+    "    else:\n",
+    "        # Stream tokens progressively\n",
+    "        async for log in EventLogger().log(response):\n",
+    "            log.print()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6edf5e57",
+   "metadata": {},
+   "source": [
+    "#### 4. Run the Asynchronous Response Generation\n",
+    "\n",
+    "In a notebook, `await` the coroutine directly, because Jupyter already runs an event loop and `asyncio.run()` raises a `RuntimeError` there. In a standalone script, call `asyncio.run(get_llama_response())` instead.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c10f487e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Initiate the response generation process\n",
+    "await get_llama_response()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "56aa9a09",
+   "metadata": {},
+   "source": [
+    "### Complete code\n",
+    "Summing it up, here's the complete code for local-cloud model implementation with Llama Stack:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d9fd74ff",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import asyncio\n",
+    "import httpx\n",
+    "from llama_stack_client import LlamaStackClient\n",
+    "from llama_stack_client.lib.inference.event_logger import EventLogger\n",
+    "from llama_stack_client.types import UserMessage\n",
+    "from termcolor import cprint\n",
+    "\n",
+    "local_client = LlamaStackClient(base_url='http://localhost:5000')\n",
+    "cloud_client = LlamaStackClient(base_url='http://localhost:5001')\n",
+    "\n",
+    "async def select_client() -> LlamaStackClient:\n",
+    "    try:\n",
+    "        async with httpx.AsyncClient() as http_client:\n",
+    "            response = await http_client.get(f'{local_client.base_url}/health')\n",
+    "            if response.status_code == 200:\n",
+    "                cprint('Using local client.', 'yellow')\n",
+    "                return local_client\n",
+    "    except httpx.RequestError:\n",
+    "        pass\n",
+    "    cprint('Local client unavailable. Switching to cloud client.', 'yellow')\n",
+    "    return cloud_client\n",
+    "\n",
+    "async def get_llama_response(stream: bool = True):\n",
+    "    client = await select_client()\n",
+    "    message = UserMessage(\n",
+    "        content='hello world, write me a 2 sentence poem about the moon', role='user'\n",
+    "    )\n",
+    "    cprint(f'User> {message.content}', 'green')\n",
+    "\n",
+    "    response = client.inference.chat_completion(\n",
+    "        messages=[message],\n",
+    "        model='Llama3.2-11B-Vision-Instruct',\n",
+    "        stream=stream,\n",
+    "    )\n",
+    "\n",
+    "    if not stream:\n",
+    "        cprint(f'> Response: {response}', 'cyan')\n",
+    "    else:\n",
+    "        async for log in EventLogger().log(response):\n",
+    "            log.print()\n",
+    "\n",
+    "asyncio.run(get_llama_response())"
+   ]
+  }
+ ],
+ "metadata": {},
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
--- a/docs/zero_to_hero_guide/01_Prompt_Engineering101.ipynb
+++ b/docs/zero_to_hero_guide/01_Prompt_Engineering101.ipynb
@ -0,0 +1,312 @@
+{
+ "cells": [
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<a href=\"https://colab.research.google.com/github/meta-llama/llama-recipes/blob/main/recipes/quickstart/Prompt_Engineering_with_Llama_3.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
+    "\n",
+    "# Prompt Engineering with Llama 3.1\n",
+    "\n",
+    "Prompt engineering is using natural language to produce a desired response from a large language model (LLM).\n",
+    "\n",
+    "This interactive guide covers prompt engineering & best practices with Llama 3.1."
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Introduction"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Why now?\n",
+    "\n",
+    "[Vaswani et al. (2017)](https://arxiv.org/abs/1706.03762) introduced the world to transformer neural networks (originally for machine translation). Transformers ushered in an era of generative AI with diffusion models for image creation and large language models (`LLMs`) as **programmable deep learning networks**.\n",
+    "\n",
+    "Programming foundational LLMs is done with natural language – it doesn't require training/tuning like ML models of the past. This has opened the door to a massive amount of innovation and a paradigm shift in how technology can be deployed. The science/art of using natural language to program language models to accomplish a task is referred to as **Prompt Engineering**."
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Prompting Techniques"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Explicit Instructions\n",
+    "\n",
+    "Detailed, explicit instructions produce better results than open-ended prompts:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "complete_and_print(prompt=\"Describe quantum physics in one short sentence of no more than 12 words\")\n",
+    "# Returns a succinct explanation of quantum physics that mentions particles and states existing simultaneously."
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You can think of explicit instructions as rules and restrictions on how Llama 3 responds to your prompt.\n",
+    "\n",
+    "- Stylization\n",
+    "    - `Explain this to me like a topic on a children's educational network show teaching elementary students.`\n",
+    "    - `I'm a software engineer using large language models for summarization. Summarize the following text in under 250 words:`\n",
+    "    - `Give your answer like an old timey private investigator hunting down a case step by step.`\n",
+    "- Formatting\n",
+    "    - `Use bullet points.`\n",
+    "    - `Return as a JSON object.`\n",
+    "    - `Use less technical terms and help me apply it in my work in communications.`\n",
+    "- Restrictions\n",
+    "    - `Only use academic papers.`\n",
+    "    - `Never give sources older than 2020.`\n",
+    "    - `If you don't know the answer, say that you don't know.`\n",
+    "\n",
+    "Here's an example of using explicit instructions to get more specific results by limiting the response to recent sources."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "complete_and_print(\"Explain the latest advances in large language models to me.\")\n",
+    "# More likely to cite sources from 2017\n",
+    "\n",
+    "complete_and_print(\"Explain the latest advances in large language models to me. Always cite your sources. Never cite sources older than 2020.\")\n",
+    "# Gives more specific advances and only cites sources from 2020 or later"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Example Prompting using Zero- and Few-Shot Learning\n",
+    "\n",
+    "A shot is an example or demonstration of what type of prompt and response you expect from a large language model. This term originates from training computer vision models on photographs, where one shot was one example or instance that the model used to classify an image ([Fei-Fei et al. (2006)](http://vision.stanford.edu/documents/Fei-FeiFergusPerona2006.pdf)).\n",
+    "\n",
+    "#### Zero-Shot Prompting\n",
+    "\n",
+    "Large language models like Llama 3 are unique because they are capable of following instructions and producing responses without having previously seen an example of a task. Prompting without examples is called \"zero-shot prompting\".\n",
+    "\n",
+    "Let's try using Llama 3 as a sentiment detector. You may notice that output format varies - we can improve this with better prompting."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "complete_and_print(\"Text: This was the best movie I've ever seen! \\n The sentiment of the text is: \")\n",
+    "# Returns positive sentiment\n",
+    "\n",
+    "complete_and_print(\"Text: The director was trying too hard. \\n The sentiment of the text is: \")\n",
+    "# Returns negative sentiment"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
+    "#### Few-Shot Prompting\n",
+    "\n",
+    "Adding specific examples of your desired output generally results in more accurate, consistent output. This technique is called \"few-shot prompting\".\n",
+    "\n",
+    "In this example, the generated response follows our desired format: a more nuanced sentiment classifier that reports confidence percentages for positive, neutral, and negative sentiment.\n",
+    "\n",
+    "See also: [Zhao et al. (2021)](https://arxiv.org/abs/2102.09690), [Liu et al. (2021)](https://arxiv.org/abs/2101.06804), [Su et al. (2022)](https://arxiv.org/abs/2209.01975), [Rubin et al. (2022)](https://arxiv.org/abs/2112.08633).\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def sentiment(text):\n",
+    "    response = chat_completion(messages=[\n",
+    "        user(\"You are a sentiment classifier. For each message, give the percentage of positive/neutral/negative.\"),\n",
+    "        user(\"I liked it\"),\n",
+    "        assistant(\"70% positive 30% neutral 0% negative\"),\n",
+    "        user(\"It could be better\"),\n",
+    "        assistant(\"0% positive 50% neutral 50% negative\"),\n",
+    "        user(\"It's fine\"),\n",
+    "        assistant(\"25% positive 50% neutral 25% negative\"),\n",
+    "        user(text),\n",
+    "    ])\n",
+    "    return response\n",
+    "\n",
+    "def print_sentiment(text):\n",
+    "    print(f'INPUT: {text}')\n",
+    "    print(sentiment(text))\n",
+    "\n",
+    "print_sentiment(\"I thought it was okay\")\n",
+    "# More likely to return a balanced mix of positive, neutral, and negative\n",
+    "print_sentiment(\"I loved it!\")\n",
+    "# More likely to return 100% positive\n",
+    "print_sentiment(\"Terrible service 0/10\")\n",
+    "# More likely to return 100% negative"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Role Prompting\n",
+    "\n",
+    "Llama will often give more consistent responses when given a role ([Kong et al. (2023)](https://browse.arxiv.org/pdf/2308.07702.pdf)). Roles give context to the LLM on what type of answers are desired.\n",
+    "\n",
+    "Let's use Llama 3 to create a more focused, technical response for a question around the pros and cons of using PyTorch."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "complete_and_print(\"Explain the pros and cons of using PyTorch.\")\n",
+    "# More likely to give a general overview of PyTorch's pros and cons, covering areas like documentation and the PyTorch community, and to mention a steep learning curve\n",
+    "\n",
+    "complete_and_print(\"Your role is a machine learning expert who gives highly technical advice to senior engineers who work with complicated datasets. Explain the pros and cons of using PyTorch.\")\n",
+    "# Often results in more technical benefits and drawbacks that provide more technical details on how model layers"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Chain-of-Thought\n",
+    "\n",
+    "Simply adding a phrase encouraging step-by-step thinking \"significantly improves the ability of large language models to perform complex reasoning\" ([Wei et al. (2022)](https://arxiv.org/abs/2201.11903)). This technique is called \"CoT\" or \"Chain-of-Thought\" prompting.\n",
+    "\n",
+    "Llama 3.1 now reasons step-by-step naturally without the addition of the phrase. This section remains for completeness."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "prompt = \"Who lived longer, Mozart or Elvis?\"\n",
+    "\n",
+    "complete_and_print(prompt)\n",
+    "# Llama 2 would often give the incorrect answer of \"Mozart\"\n",
+    "\n",
+    "complete_and_print(f\"{prompt} Let's think through this carefully, step by step.\")\n",
+    "# Gives the correct answer \"Elvis\""
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Self-Consistency\n",
+    "\n",
+    "LLMs are probabilistic, so even with Chain-of-Thought, a single generation might produce incorrect results. Self-Consistency ([Wang et al. (2022)](https://arxiv.org/abs/2203.11171)) improves accuracy by selecting the most frequent answer from multiple generations (at the cost of higher compute):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import re\n",
+    "from statistics import mode\n",
+    "\n",
+    "def gen_answer():\n",
+    "    response = completion(\n",
+    "        \"John found that the average of 15 numbers is 40. \"\n",
+    "        \"If 10 is added to each number then the mean of the numbers is? \"\n",
+    "        \"Report the answer surrounded by backticks (example: `123`)\",\n",
+    "    )\n",
+    "    match = re.search(r'`(\\d+)`', response)\n",
+    "    if match is None:\n",
+    "        return None\n",
+    "    return match.group(1)\n",
+    "\n",
+    "answers = [gen_answer() for i in range(5)]\n",
+    "\n",
+    "print(\n",
+    "    f\"Answers: {answers}\\n\",\n",
+    "    f\"Final answer: {mode(answers)}\",\n",
+    "    )\n",
+    "\n",
+    "# Sample runs of Llama-3-70B (all correct):\n",
+    "# ['60', '50', '50', '50', '50'] -> 50\n",
+    "# ['50', '50', '50', '60', '50'] -> 50\n",
+    "# ['50', '50', '60', '50', '50'] -> 50"
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Author & Contact\n",
+    "\n",
+    "Edited by [Dalton Flanagan](https://www.linkedin.com/in/daltonflanagan/) (dalton@meta.com) with contributions from Mohsen Agsen, Bryce Bortree, Ricardo Juan Palma Duran, Kaolin Fire, Thomas Scialom."
+   ]
+  }
+ ],
+ "metadata": {
+  "captumWidgetMessage": [],
+  "dataExplorerConfig": [],
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.14"
+  },
+  "last_base_url": "https://bento.edge.x2p.facebook.net/",
+  "last_kernel_id": "161e2a7b-2d2b-4995-87f3-d1539860ecac",
+  "last_msg_id": "4eab1242-d815b886ebe4f5b1966da982_543",
+  "last_server_session_id": "4a7b41c5-ed66-4dcb-a376-22673aebb469",
+  "operator_data": [],
+  "outputWidgetContext": []
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
--- a/docs/zero_to_hero_guide/02_Image_Chat101.ipynb
+++ b/docs/zero_to_hero_guide/02_Image_Chat101.ipynb
--- a/docs/zero_to_hero_guide/03_Tool_Calling101.ipynb
+++ b/docs/zero_to_hero_guide/03_Tool_Calling101.ipynb
--- a/docs/zero_to_hero_guide/04_Memory101.ipynb
+++ b/docs/zero_to_hero_guide/04_Memory101.ipynb
--- a/docs/zero_to_hero_guide/05_Safety101.ipynb
+++ b/docs/zero_to_hero_guide/05_Safety101.ipynb
--- a/docs/zero_to_hero_guide/06_Agents101.ipynb
+++ b/docs/zero_to_hero_guide/06_Agents101.ipynb
--- a/docs/zero_to_hero_guide/chat_completion_guide.md
+++ b/docs/zero_to_hero_guide/chat_completion_guide.md
@ -0,0 +1,192 @@
+
+# Llama Stack Inference Guide
+
+This document provides instructions on how to use Llama Stack's `chat_completion` function for generating text using the `Llama3.2-11B-Vision-Instruct` model. Before you begin, please ensure Llama Stack is installed and set up by following the [Getting Started Guide](https://llama-stack.readthedocs.io/en/latest/).
+
+### Table of Contents
+1. [Quickstart](#quickstart)
+2. [Building Effective Prompts](#building-effective-prompts)
+3. [Conversation Loop](#conversation-loop)
+4. [Conversation History](#conversation-history)
+5. [Streaming Responses](#streaming-responses)
+
+
+## Quickstart
+
+This section walks through each step to set up and make a simple text generation request.
+
+### 1. Set Up the Client
+
+Begin by importing the necessary components from Llama Stack’s client library:
+
+```python
+from llama_stack_client import LlamaStackClient
+from llama_stack_client.types import SystemMessage, UserMessage
+
+client = LlamaStackClient(base_url="http://localhost:5000")
+```
+
+### 2. Create a Chat Completion Request
+
+Use the `chat_completion` function to define the conversation context. Each message you include should have a specific role and content:
+
+```python
+response = client.inference.chat_completion(
+    messages=[
+        SystemMessage(content="You are a friendly assistant.", role="system"),
+        UserMessage(content="Write a two-sentence poem about llama.", role="user")
+    ],
+    model="Llama3.2-11B-Vision-Instruct",
+)
+
+print(response.completion_message.content)
+```
+
+---
+
+## Building Effective Prompts
+
+Effective prompt creation (often called "prompt engineering") is essential for quality responses. Here are best practices for structuring your prompts to get the most out of the Llama Stack model:
+
+1. **System Messages**: Use `SystemMessage` to set the model's behavior. This is similar to providing top-level instructions for tone, format, or specific behavior.
+   - **Example**: `SystemMessage(content="You are a friendly assistant that explains complex topics simply.")`
+2. **User Messages**: Define the task or question you want to ask the model with a `UserMessage`. The clearer and more direct you are, the better the response.
+   - **Example**: `UserMessage(content="Explain recursion in programming in simple terms.")`
+
+### Sample Prompt
+
+Here’s a prompt that defines the model's role and a user question:
+
+```python
+from llama_stack_client import LlamaStackClient
+from llama_stack_client.types import SystemMessage, UserMessage
+client = LlamaStackClient(base_url="http://localhost:5000")
+
+response = client.inference.chat_completion(
+    messages=[
+        SystemMessage(content="You are Shakespeare.", role="system"),
+        UserMessage(content="Write a two-sentence poem about llama.", role="user")
+    ],
+    model="Llama3.2-11B-Vision-Instruct",
+)
+
+print(response.completion_message.content)
+```
+
+---
+
+
+## Conversation Loop
+
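+To keep a session going across multiple inputs, wrap the request in a loop. The example below reads input until the user types `exit`, `quit`, or `bye`. Each turn is sent on its own, so the model does not see earlier messages. The loop is declared `async` and started with `asyncio.run`, although `input` and the `chat_completion` call here block like ordinary synchronous calls.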
+
+```python
+import asyncio
+from llama_stack_client import LlamaStackClient
+from llama_stack_client.types import UserMessage
+from termcolor import cprint
+
+client = LlamaStackClient(base_url="http://localhost:5000")
+
+async def chat_loop():
+    while True:
+        user_input = input("User> ")
+        if user_input.lower() in ["exit", "quit", "bye"]:
+            cprint("Ending conversation. Goodbye!", "yellow")
+            break
+
+        message = UserMessage(content=user_input, role="user")
+        response = client.inference.chat_completion(
+            messages=[message],
+            model="Llama3.2-11B-Vision-Instruct",
+        )
+        cprint(f"> Response: {response.completion_message.content}", "cyan")
+
+asyncio.run(chat_loop())
+```
+
+---
+
+## Conversation History
+
+Maintaining a conversation history allows the model to retain context from previous interactions. Use a list to accumulate messages, enabling continuity throughout the chat session.
+
+```python
+import asyncio
+from llama_stack_client import LlamaStackClient
+from llama_stack_client.types import CompletionMessage, UserMessage
+from termcolor import cprint
+
+client = LlamaStackClient(base_url="http://localhost:5000")
+
+async def chat_loop():
+    conversation_history = []
+    while True:
+        user_input = input("User> ")
+        if user_input.lower() in ["exit", "quit", "bye"]:
+            cprint("Ending conversation. Goodbye!", "yellow")
+            break
+
+        user_message = UserMessage(content=user_input, role="user")
+        conversation_history.append(user_message)
+
+        response = client.inference.chat_completion(
+            messages=conversation_history,
+            model="Llama3.2-11B-Vision-Instruct",
+        )
+        cprint(f"> Response: {response.completion_message.content}", "cyan")
+
+        # Record the reply under the assistant role so later turns see who said what
+        assistant_message = CompletionMessage(
+            content=response.completion_message.content,
+            role="assistant",
+            stop_reason=response.completion_message.stop_reason,
+            tool_calls=[],
+        )
+        conversation_history.append(assistant_message)
+
+asyncio.run(chat_loop())
+```
+
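One practical concern with an accumulating history is that it grows with every turn. The following is a minimal sketch of bounding it, assuming messages are plain role/content dicts (a stand-in for the client's message types). `trim_history` is a hypothetical helper for illustration and is not part of Llama Stack.

```python
from typing import Dict, List

Message = Dict[str, str]  # stand-in for UserMessage / CompletionMessage


def trim_history(history: List[Message], max_messages: int) -> List[Message]:
    """Keep a leading system message (if any) plus the most recent messages.

    Hypothetical helper for illustration; not a Llama Stack API.
    """
    if max_messages < 1:
        raise ValueError("max_messages must be at least 1")
    if len(history) <= max_messages:
        return list(history)
    if history[0]["role"] == "system":
        # Reserve one slot for the system message, fill the rest from the tail.
        tail_size = max_messages - 1
        tail = history[-tail_size:] if tail_size > 0 else []
        return [history[0]] + tail
    return history[-max_messages:]


history = [
    {"role": "system", "content": "You are a friendly assistant."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "Tell me about llamas."},
]
trimmed = trim_history(history, max_messages=3)
print([m["role"] for m in trimmed])
```

Trimming by message count is the simplest policy; a token-based budget would track the model's context window more closely but needs a tokenizer.
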
+## Streaming Responses
+
+Llama Stack offers a `stream` parameter in the `chat_completion` function, which allows partial responses to be returned progressively as they are generated. This can enhance user experience by providing immediate feedback without waiting for the entire response to be processed.
+
+### Example: Streaming Responses
+
+The following code demonstrates how to use the `stream` parameter to enable response streaming. When `stream=True`, the `chat_completion` function will yield tokens as they are generated. To display these tokens, this example leverages asynchronous streaming with `EventLogger`.
+
+```python
+import asyncio
+from llama_stack_client import LlamaStackClient
+from llama_stack_client.lib.inference.event_logger import EventLogger
+from llama_stack_client.types import UserMessage
+from termcolor import cprint
+
+async def run_main(stream: bool = True):
+    client = LlamaStackClient(
+        base_url="http://localhost:5000",
+    )
+
+    message = UserMessage(
+        content="hello world, write me a 2 sentence poem about the moon", role="user"
+    )
+    cprint(f"User> {message.content}", "green")
+
+    response = client.inference.chat_completion(
+        messages=[message],
+        model="Llama3.2-11B-Vision-Instruct",
+        stream=stream,
+    )
+
+    if not stream:
+        cprint(f"> Response: {response}", "cyan")
+    else:
+        async for log in EventLogger().log(response):
+            log.print()
+
+    models_response = client.models.list()
+    print(models_response)
+
+if __name__ == "__main__":
+    asyncio.run(run_main())
+```
+
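The consumption pattern behind streaming can be tried without a running server. The sketch below is an assumption-laden illustration: `fake_stream` is a hypothetical stand-in that yields plain text pieces, whereas real streamed chunks from Llama Stack are structured objects that `EventLogger` formats for you.

```python
import sys
from typing import Iterable, Iterator, List


def fake_stream(text: str) -> Iterator[str]:
    """Hypothetical stand-in for a streamed response: yields one word at a time."""
    words = text.split(" ")
    for i, word in enumerate(words):
        # Re-attach the separating space to every word except the last.
        yield word if i == len(words) - 1 else word + " "


def print_stream(chunks: Iterable[str]) -> str:
    """Print each chunk as soon as it arrives and return the assembled text."""
    parts: List[str] = []
    for chunk in chunks:
        sys.stdout.write(chunk)
        sys.stdout.flush()  # show partial output immediately
        parts.append(chunk)
    sys.stdout.write("\n")
    return "".join(parts)


full_text = print_stream(fake_stream("The moon glows softly tonight."))
```

Collecting the pieces while printing them keeps the complete reply available afterwards, for example to append it to a conversation history.
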
+
+---
+
+With these fundamentals, you should be well on your way to leveraging Llama Stack’s text generation capabilities! For more advanced features, refer to the [Llama Stack Documentation](https://llama-stack.readthedocs.io/en/latest/).
--- a/docs/zero_to_hero_guide/chat_few_shot_guide.md
+++ b/docs/zero_to_hero_guide/chat_few_shot_guide.md
@ -0,0 +1,144 @@
+
+# Few-Shot Inference for LLMs
+
+This guide provides instructions on how to use Llama Stack’s `chat_completion` API with a few-shot learning approach to enhance text generation. Few-shot examples enable the model to recognize patterns by providing labeled prompts, allowing it to complete tasks based on minimal prior examples.
+
+### Overview
+
+Few-shot learning provides the model with multiple examples of input-output pairs. This is particularly useful for guiding the model's behavior in specific tasks, helping it understand the desired completion format and content based on a few sample interactions.
+
+### Implementation
+
+1. **Initialize the Client**
+
+   Begin by setting up the `LlamaStackClient` to connect to the inference endpoint.
+
+   ```python
+   from llama_stack_client import LlamaStackClient
+
+   client = LlamaStackClient(base_url="http://localhost:5000")
+   ```
+
+2. **Define Few-Shot Examples**
+
+   Construct a series of labeled `UserMessage` and `CompletionMessage` instances to demonstrate the task to the model. Each `UserMessage` represents an input prompt, and each `CompletionMessage` is the desired output. The model uses these examples to infer the appropriate response patterns.
+
+   ```python
+   from llama_stack_client.types import CompletionMessage, UserMessage
+
+   few_shot_examples = [
+        UserMessage(content="Have shorter, spear-shaped ears.", role="user"),
+        CompletionMessage(
+            content="That's Alpaca!",
+            role="assistant",
+            stop_reason="end_of_message",
+            tool_calls=[],
+        ),
+        UserMessage(
+            content="Known for their calm nature and used as pack animals in mountainous regions.",
+            role="user",
+        ),
+        CompletionMessage(
+            content="That's Llama!",
+            role="assistant",
+            stop_reason="end_of_message",
+            tool_calls=[],
+        ),
+        UserMessage(
+            content="Has a straight, slender neck and is smaller in size compared to its relative.",
+            role="user",
+        ),
+        CompletionMessage(
+            content="That's Alpaca!",
+            role="assistant",
+            stop_reason="end_of_message",
+            tool_calls=[],
+        ),
+        UserMessage(
+            content="Generally taller and more robust, commonly seen as guard animals.",
+            role="user",
+        ),
+    ]
+   ```
+
+   ### Note
+   - **Few-Shot Examples**: These examples show the model the correct responses for specific prompts.
+   - **CompletionMessage**: This defines the model's expected completion for each prompt.
+
+3. **Invoke `chat_completion` with Few-Shot Examples**
+
+   Use the few-shot examples as the message input for `chat_completion`. The model will use the examples to generate contextually appropriate responses, allowing it to infer and complete new queries in a similar format.
+
+   ```python
+   response = client.inference.chat_completion(
+       messages=few_shot_examples, model="Llama3.2-11B-Vision-Instruct"
+   )
+   ```
+
+4. **Display the Model’s Response**
+
+   The `completion_message` contains the assistant’s generated content based on the few-shot examples provided. Output this content to see the model's response directly in the console.
+
+   ```python
+   from termcolor import cprint
+
+   cprint(f"> Response: {response.completion_message.content}", "cyan")
+   ```
+
+Few-shot prompting with Llama Stack’s `chat_completion` lets the model pick up a pattern from a handful of in-context examples, with no training or fine-tuning involved. It works best for tasks where a few clear input-output pairs pin down the expected format and content of the answer.
+
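The alternating user/assistant pattern used in this guide can be generated from a list of input-output pairs. The sketch below assumes plain role/content dicts rather than the client's `UserMessage` and `CompletionMessage` types (which also carry fields such as `stop_reason` and `tool_calls`), and `build_few_shot_messages` is a hypothetical helper name.

```python
from typing import Dict, List, Tuple

Message = Dict[str, str]  # simplified stand-in for the client's message types


def build_few_shot_messages(
    examples: List[Tuple[str, str]], query: str
) -> List[Message]:
    """Interleave (input, output) pairs, then append the new query last.

    Hypothetical helper for illustration; not a Llama Stack API.
    """
    messages: List[Message] = []
    for prompt, answer in examples:
        messages.append({"role": "user", "content": prompt})
        messages.append({"role": "assistant", "content": answer})
    # The final user message is the one the model is asked to complete.
    messages.append({"role": "user", "content": query})
    return messages


examples = [
    ("Have shorter, spear-shaped ears.", "That's Alpaca!"),
    (
        "Known for their calm nature and used as pack animals in mountainous regions.",
        "That's Llama!",
    ),
]
messages = build_few_shot_messages(
    examples, "Generally taller and more robust, commonly seen as guard animals."
)
for message in messages:
    print(f"{message['role']}: {message['content']}")
```

Keeping the examples as data makes it easy to add, remove, or reorder them when experimenting with how many shots the task needs.
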
+
+### Complete code
+Summing it up, here's the code for few-shot implementation with llama-stack:
+
+```python
+from llama_stack_client import LlamaStackClient
+from llama_stack_client.types import CompletionMessage, UserMessage
+from termcolor import cprint
+
+client = LlamaStackClient(base_url="http://localhost:5000")
+
+response = client.inference.chat_completion(
+    messages=[
+        UserMessage(content="Have shorter, spear-shaped ears.", role="user"),
+        CompletionMessage(
+            content="That's Alpaca!",
+            role="assistant",
+            stop_reason="end_of_message",
+            tool_calls=[],
+        ),
+        UserMessage(
+            content="Known for their calm nature and used as pack animals in mountainous regions.",
+            role="user",
+        ),
+        CompletionMessage(
+            content="That's Llama!",
+            role="assistant",
+            stop_reason="end_of_message",
+            tool_calls=[],
+        ),
+        UserMessage(
+            content="Has a straight, slender neck and is smaller in size compared to its relative.",
+            role="user",
+        ),
+        CompletionMessage(
+            content="That's Alpaca!",
+            role="assistant",
+            stop_reason="end_of_message",
+            tool_calls=[],
+        ),
+        UserMessage(
+            content="Generally taller and more robust, commonly seen as guard animals.",
+            role="user",
+        ),
+    ],
+    model="Llama3.2-11B-Vision-Instruct",
+)
+
+cprint(f"> Response: {response.completion_message.content}", "cyan")
+```
+
+---
+
+With these fundamentals, you should be well on your way to leveraging Llama Stack’s text generation capabilities! For more advanced features, refer to the [Llama Stack Documentation](https://llama-stack.readthedocs.io/en/latest/).
+
--- a/docs/zero_to_hero_guide/chat_local_cloud_guide.md
+++ b/docs/zero_to_hero_guide/chat_local_cloud_guide.md
@ -0,0 +1,140 @@
+
+# Switching between Local and Cloud Model with Llama Stack
+
+This guide provides a streamlined setup to switch between local and cloud clients for text generation with Llama Stack’s `chat_completion` API. This setup enables automatic fallback to a cloud instance if the local client is unavailable.
+
+
+### Prerequisite
+Before you begin, please ensure Llama Stack is installed and the distributions are set up by following the [Getting Started Guide](https://llama-stack.readthedocs.io/en/latest/). You will need to run two distributions, one local and one cloud, for this demo to work.
+
+<!--- [TODO: show how to create two distributions] --->
+
+### Implementation
+
+1. **Set Up Local and Cloud Clients**
+
+   Initialize both clients, specifying the `base_url` of each running instance. In this case, we have the local distribution running on `http://localhost:5000` and the cloud distribution running on `http://localhost:5001`.
+
+   ```python
+   from llama_stack_client import LlamaStackClient
+
+   # Configure local and cloud clients
+   local_client = LlamaStackClient(base_url="http://localhost:5000")
+   cloud_client = LlamaStackClient(base_url="http://localhost:5001")
+   ```
+
+2. **Client Selection with Fallback**
+
+   The `select_client` function checks if the local client is available using a lightweight `/health` check. If the local client is unavailable, it automatically switches to the cloud client.
+
+   ```python
+   import httpx
+   from termcolor import cprint
+
+   async def select_client() -> LlamaStackClient:
+       """Use local client if available; otherwise, switch to cloud client."""
+       try:
+           async with httpx.AsyncClient() as http_client:
+               response = await http_client.get(f"{local_client.base_url}/health")
+               if response.status_code == 200:
+                   cprint("Using local client.", "yellow")
+                   return local_client
+       except httpx.RequestError:
+           pass
+       cprint("Local client unavailable. Switching to cloud client.", "yellow")
+       return cloud_client
+   ```
+
+3. **Generate a Response**
+
+   After selecting the client, you can generate text using `chat_completion`. This example sends a sample prompt to the model and prints the response.
+
+   ```python
+   from llama_stack_client.lib.inference.event_logger import EventLogger
+   from llama_stack_client.types import UserMessage
+
+   async def get_llama_response(stream: bool = True):
+       client = await select_client()  # Selects the available client
+       message = UserMessage(content="hello world, write me a 2 sentence poem about the moon", role="user")
+       cprint(f"User> {message.content}", "green")
+
+       response = client.inference.chat_completion(
+           messages=[message],
+           model="Llama3.2-11B-Vision-Instruct",
+           stream=stream,
+       )
+
+       if not stream:
+           cprint(f"> Response: {response}", "cyan")
+       else:
+           # Stream tokens progressively
+           async for log in EventLogger().log(response):
+               log.print()
+   ```
+
+4. **Run the Asynchronous Response Generation**
+
+   Use `asyncio.run()` to execute `get_llama_response` in an asynchronous event loop.
+
+   ```python
+   import asyncio
+
+   # Initiate the response generation process
+   asyncio.run(get_llama_response())
+   ```
+
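The fallback logic in step 2 generalizes to any ordered list of endpoints. The sketch below uses only the standard library and assumes each server answers `GET /health` with HTTP 200 when it is up, as the `select_client` function above does. `is_healthy` and `first_available` are hypothetical helper names, and the demo call passes a stub probe so it runs without any server.

```python
import urllib.error
import urllib.request
from typing import Callable, Sequence


def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    url = f"{base_url.rstrip('/')}/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or malformed URL.
        return False


def first_available(
    base_urls: Sequence[str], probe: Callable[[str], bool] = is_healthy
) -> str:
    """Return the first URL whose probe succeeds.

    The last URL is the unconditional fallback and is not probed,
    mirroring how the cloud client is returned without a check.
    """
    if not base_urls:
        raise ValueError("base_urls must not be empty")
    for url in base_urls[:-1]:
        if probe(url):
            return url
    return base_urls[-1]


# Stub probe: pretend the local server on port 5000 is down.
chosen = first_available(
    ["http://localhost:5000", "http://localhost:5001"],
    probe=lambda url: not url.endswith(":5000"),
)
print(chosen)
```

Passing the probe as a parameter keeps the selection logic testable without network access; in real use the default `is_healthy` would be called instead.
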
+
+### Complete code
+Summing it up, here's the code for local-cloud model implementation with llama-stack:
+
+```python
+import asyncio
+
+import httpx
+from llama_stack_client import LlamaStackClient
+from llama_stack_client.lib.inference.event_logger import EventLogger
+from llama_stack_client.types import UserMessage
+from termcolor import cprint
+
+local_client = LlamaStackClient(base_url="http://localhost:5000")
+cloud_client = LlamaStackClient(base_url="http://localhost:5001")
+
+
+async def select_client() -> LlamaStackClient:
+    try:
+        async with httpx.AsyncClient() as http_client:
+            response = await http_client.get(f"{local_client.base_url}/health")
+            if response.status_code == 200:
+                cprint("Using local client.", "yellow")
+                return local_client
+    except httpx.RequestError:
+        pass
+    cprint("Local client unavailable. Switching to cloud client.", "yellow")
+    return cloud_client
+
+
+async def get_llama_response(stream: bool = True):
+    client = await select_client()
+    message = UserMessage(
+        content="hello world, write me a 2 sentence poem about the moon", role="user"
+    )
+    cprint(f"User> {message.content}", "green")
+
+    response = client.inference.chat_completion(
+        messages=[message],
+        model="Llama3.2-11B-Vision-Instruct",
+        stream=stream,
+    )
+
+    if not stream:
+        cprint(f"> Response: {response}", "cyan")
+    else:
+        async for log in EventLogger().log(response):
+            log.print()
+
+
+asyncio.run(get_llama_response())
+```
+
+---
+
+With these fundamentals, you should be well on your way to leveraging Llama Stack’s text generation capabilities! For more advanced features, refer to the [Llama Stack Documentation](https://llama-stack.readthedocs.io/en/latest/).
--- a/docs/zero_to_hero_guide/quickstart.md
+++ b/docs/zero_to_hero_guide/quickstart.md
@ -0,0 +1,184 @@
+# Llama Stack Quickstart Guide
+
+This guide will walk you through setting up an end-to-end workflow with Llama Stack, enabling you to perform text generation using the `Llama3.2-11B-Vision-Instruct` model. Follow these steps to get started quickly.
+
+## Table of Contents
+1. [Prerequisite](#prerequisite)
+2. [Installation](#installation)
+3. [Download Llama Models](#download-llama-models)
+4. [Build, Configure, and Run Llama Stack](#build-configure-and-run-llama-stack)
+5. [Testing with `curl`](#testing-with-curl)
+6. [Testing with Python](#testing-with-python)
+7. [Next Steps](#next-steps)
+
+---
+
+## Prerequisite
+
+Ensure you have the following installed on your system:
+
+- **Conda**: A package, dependency, and environment management tool.
+
+
+---
+
+## Installation
+
+The `llama` CLI tool helps you manage the Llama Stack toolchain and agent systems.
+
+**Install via PyPI:**
+
+```bash
+pip install llama-stack
+```
+
+*After installation, the `llama` command should be available in your PATH.*
+
+---
+
+## Download Llama Models
+
+Download the necessary Llama model checkpoints using the `llama` CLI:
+
+```bash
+llama download --model-id Llama3.2-11B-Vision-Instruct
+```
+
+*Follow the CLI prompts to complete the download. You may need to accept a license agreement. Obtain an instant license [here](https://www.llama.com/llama-downloads/).*
+
+---
+
+## Build, Configure, and Run Llama Stack
+
+### 1. Build the Llama Stack Distribution
+
+This guide builds the `meta-reference-gpu` distribution by default; you can read more about the other distributions [here](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/index.html).
+
+```bash
+llama stack build --template meta-reference-gpu --image-type conda
+```
+
+
+### 2. Run the Llama Stack Distribution
+> Launching a distribution initializes and configures the necessary APIs and Providers, enabling seamless interaction with the underlying model.
+
+Start the server with the configured stack:
+
+```bash
+cd llama-stack/distributions/meta-reference-gpu
+llama stack run ./run.yaml
+```
+
+*The server will start and listen on `http://localhost:5000` by default.*
+
+---
+
+## Testing with `curl`
+
+After setting up the server, verify it's working by sending a `POST` request using `curl`:
+
+```bash
+curl http://localhost:5000/inference/chat_completion \
+-H "Content-Type: application/json" \
+-d '{
+    "model": "Llama3.2-11B-Vision-Instruct",
+    "messages": [
+        {"role": "system", "content": "You are a helpful assistant."},
+        {"role": "user", "content": "Write me a 2-sentence poem about the moon"}
+    ],
+    "sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512}
+}'
+```
+
+**Expected Output:**
+```json
+{
+  "completion_message": {
+    "role": "assistant",
+    "content": "The moon glows softly in the midnight sky,\nA beacon of wonder, as it catches the eye.",
+    "stop_reason": "out_of_tokens",
+    "tool_calls": []
+  },
+  "logprobs": null
+}
+```
+
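For readers who prefer to see the raw request from Python before installing the client library, the following sketch builds the equivalent `POST` with the standard library. It assumes the endpoint path and payload shape shown in the `curl` command and uses the model downloaded earlier in this guide. `build_chat_request` is a hypothetical helper, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str, model: str, user_prompt: str
) -> urllib.request.Request:
    """Build (but do not send) a chat_completion POST request.

    Hypothetical helper; the path and fields follow the curl example.
    """
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_prompt},
        ],
        "sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512},
    }
    return urllib.request.Request(
        url=f"{base_url}/inference/chat_completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:5000",
    "Llama3.2-11B-Vision-Instruct",
    "Write me a 2-sentence poem about the moon",
)
print(request.get_method(), request.full_url)

# To send it against a running server:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["completion_message"]["content"])
```
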
+---
+
+## Testing with Python
+
+You can also interact with the Llama Stack server using a simple Python script. Below is an example:
+
+### 1. Install Required Python Packages
+The `llama-stack-client` library provides Python methods for interacting with the Llama Stack server.
+
+```bash
+pip install llama-stack-client
+```
+
+### 2. Create a Python Script (`test_llama_stack.py`)
+
+```python
+from llama_stack_client import LlamaStackClient
+from llama_stack_client.types import SystemMessage, UserMessage
+
+# Initialize the client
+client = LlamaStackClient(base_url="http://localhost:5000")
+
+# Create a chat completion request
+response = client.inference.chat_completion(
+    messages=[
+        SystemMessage(content="You are a helpful assistant.", role="system"),
+        UserMessage(content="Write me a 2-sentence poem about the moon", role="user")
+    ],
+    model="Llama3.2-11B-Vision-Instruct",
+)
+
+# Print the response
+print(response.completion_message.content)
+```
+
+### 3. Run the Python Script
+
+```bash
+python test_llama_stack.py
+```
+
+**Expected Output:**
+```
+The moon glows softly in the midnight sky,
+A beacon of wonder, as it catches the eye.
+```
+
+With these steps, you should have a functional Llama Stack setup capable of generating text using the specified model. For more detailed information and advanced configurations, refer to some of our documentation below.
+
+---
+
+## Next Steps
+
+- **Explore Other Guides**: Dive deeper into specific topics by following these guides:
+  - [Understanding Distributions](#)
+  - [Configure your Distro](#)
+  - [Doing Inference API Call and Fetching a Response from Endpoints](#)
+  - [Creating a Conversation Loop](#)
+  - [Sending Image to the Model](#)
+  - [Tool Calling: How to and Details](#)
+  - [Memory API: Show Simple In-Memory Retrieval](#)
+  - [Agents API: Explain Components](#)
+  - [Using Safety API in Conversation](#)
+  - [Prompt Engineering Guide](#)
+
+- **Explore Client SDKs**: Utilize our client SDKs for various languages to integrate Llama Stack into your applications:
+  - [Python SDK](https://github.com/meta-llama/llama-stack-client-python)
+  - [Node SDK](https://github.com/meta-llama/llama-stack-client-node)
+  - [Swift SDK](https://github.com/meta-llama/llama-stack-client-swift)
+  - [Kotlin SDK](https://github.com/meta-llama/llama-stack-client-kotlin)
+
+- **Advanced Configuration**: Learn how to customize your Llama Stack distribution by referring to the [Building a Llama Stack Distribution](./building_distro.md) guide.
+
+- **Explore Example Apps**: Check out [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) for example applications built using Llama Stack.
+
+
+---
+
+