diff --git a/docs/zero_to_hero_guide/01_Local_Cloud_Inference101.ipynb b/docs/zero_to_hero_guide/01_Local_Cloud_Inference101.ipynb index 19a7fe3be..4f6ca4080 100644 --- a/docs/zero_to_hero_guide/01_Local_Cloud_Inference101.ipynb +++ b/docs/zero_to_hero_guide/01_Local_Cloud_Inference101.ipynb @@ -1,262 +1,239 @@ { - "cells": [ - { - "cell_type": "markdown", - "id": "a0ed972d", - "metadata": {}, - "source": [ - "# Switching between Local and Cloud Model with Llama Stack\n", - "\n", - "This guide provides a streamlined setup to switch between local and cloud clients for text generation with Llama Stack’s `chat_completion` API. This setup enables automatic fallback to a cloud instance if the local client is unavailable.\n", - "\n", - "### Prerequisites\n", - "Before you begin, please ensure Llama Stack is installed and the distribution is set up by following the [Getting Started Guide](https://llama-stack.readthedocs.io/en/latest/). You will need to run two distributions, a local and a cloud distribution, for this demo to work.\n", - "\n", - "### Implementation" - ] - }, - { - "cell_type": "markdown", - "id": "bfac8382", - "metadata": {}, - "source": [ - "### 1. Configuration\n", - "Set up your connection parameters:" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "d80c0926", - "metadata": {}, - "outputs": [], - "source": [ - "HOST = \"localhost\" # Replace with your host\n", - "LOCAL_PORT = 8321 # Replace with your local distro port\n", - "CLOUD_PORT = 8322 # Replace with your cloud distro port" - ] - }, - { - "cell_type": "markdown", - "id": "df89cff7", - "metadata": {}, - "source": [ - "#### 2. Set Up Local and Cloud Clients\n", - "\n", - "Initialize both clients, specifying the `base_url` for each instance. In this case, we have the local distribution running on `http://localhost:8321` and the cloud distribution running on `http://localhost:8322`.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "7f868dfe", - "metadata": {}, - "outputs": [], - "source": [ - "from llama_stack_client import LlamaStackClient\n", - "\n", - "# Configure local and cloud clients\n", - "local_client = LlamaStackClient(base_url=f'http://{HOST}:{LOCAL_PORT}')\n", - "cloud_client = LlamaStackClient(base_url=f'http://{HOST}:{CLOUD_PORT}')" - ] - }, - { - "cell_type": "markdown", - "id": "894689c1", - "metadata": {}, - "source": [ - "#### 3. Client Selection with Fallback\n", - "\n", - "The `select_client` function checks if the local client is available using a lightweight `/health` check. If the local client is unavailable, it automatically switches to the cloud client.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "id": "ff0c8277", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\u001b[33mUsing local client.\u001b[0m\n" - ] - } - ], - "source": [ - "import httpx\n", - "from termcolor import cprint\n", - "\n", - "async def check_client_health(client, client_name: str) -> bool:\n", - " try:\n", - " async with httpx.AsyncClient() as http_client:\n", - " response = await http_client.get(f'{client.base_url}/health')\n", - " if response.status_code == 200:\n", - " cprint(f'Using {client_name} client.', 'yellow')\n", - " return True\n", - " else:\n", - " cprint(f'{client_name} client health check failed.', 'red')\n", - " return False\n", - " except httpx.RequestError:\n", - " cprint(f'Failed to connect to {client_name} client.', 'red')\n", - " return False\n", - "\n", - "async def select_client(use_local: bool) -> LlamaStackClient:\n", - " if use_local and await check_client_health(local_client, 'local'):\n", - " return local_client\n", - "\n", - " if await check_client_health(cloud_client, 'cloud'):\n", - " return cloud_client\n", - "\n", - " raise ConnectionError('Unable to connect to any client.')\n", - "\n", - "# Example usage: pass True for local, False for cloud\n", - "client = await select_client(use_local=True)\n" - ] - }, - { - "cell_type": "markdown", - "id": "9ccfe66f", - "metadata": {}, - "source": [ - "#### 4. Generate a Response\n", - "\n", - "After selecting the client, you can generate text using `chat_completion`. This example sends a sample prompt to the model and prints the response.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "id": "5e19cc20", - "metadata": {}, - "outputs": [], - "source": [ - "from termcolor import cprint\n", - "from llama_stack_client.lib.inference.event_logger import EventLogger\n", - "\n", - "async def get_llama_response(stream: bool = True, use_local: bool = True):\n", - " client = await select_client(use_local) # Selects the available client\n", - " message = {\n", - " \"role\": \"user\",\n", - " \"content\": 'hello world, write me a 2 sentence poem about the moon'\n", - " }\n", - " cprint(f'User> {message[\"content\"]}', 'green')\n", - "\n", - " response = client.inference.chat_completion(\n", - " messages=[message],\n", - " model='Llama3.2-11B-Vision-Instruct',\n", - " stream=stream,\n", - " )\n", - "\n", - " if not stream:\n", - " cprint(f'> Response: {response.completion_message.content}', 'cyan')\n", - " else:\n", - " async for log in EventLogger().log(response):\n", - " log.print()\n" - ] - }, - { - "cell_type": "markdown", - "id": "6edf5e57", - "metadata": {}, - "source": [ - "#### 5. Run with Cloud Model\n", - "\n", - "Use `asyncio.run()` to execute `get_llama_response` in an asynchronous event loop.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "id": "c10f487e", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\u001b[33mUsing cloud client.\u001b[0m\n", - "\u001b[32mUser> hello world, write me a 2 sentence poem about the moon\u001b[0m\n", - "\u001b[36mAssistant> \u001b[0m\u001b[33mSilver\u001b[0m\u001b[33m cres\u001b[0m\u001b[33mcent\u001b[0m\u001b[33m in\u001b[0m\u001b[33m the\u001b[0m\u001b[33m midnight\u001b[0m\u001b[33m sky\u001b[0m\u001b[33m,\n", - "\u001b[0m\u001b[33mA\u001b[0m\u001b[33m gentle\u001b[0m\u001b[33m glow\u001b[0m\u001b[33m that\u001b[0m\u001b[33m whispers\u001b[0m\u001b[33m,\u001b[0m\u001b[33m \"\u001b[0m\u001b[33mI\u001b[0m\u001b[33m'm\u001b[0m\u001b[33m passing\u001b[0m\u001b[33m by\u001b[0m\u001b[33m.\"\u001b[0m\u001b[97m\u001b[0m\n" - ] - } - ], - "source": [ - "import asyncio\n", - "\n", - "\n", - "# Run this function directly in a Jupyter Notebook cell with `await`\n", - "await get_llama_response(use_local=False)\n", - "# To run it in a python file, use this line instead\n", - "# asyncio.run(get_llama_response(use_local=False))" - ] - }, - { - "cell_type": "markdown", - "id": "5c433511-9321-4718-ab7f-e21cf6b5ca79", - "metadata": {}, - "source": [ - "#### 6. Run with Local Model\n" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "id": "02eacfaf-c7f1-494b-ac28-129d2a0258e3", - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "\u001b[33mUsing local client.\u001b[0m\n", - "\u001b[32mUser> hello world, write me a 2 sentence poem about the moon\u001b[0m\n", - "\u001b[36mAssistant> \u001b[0m\u001b[33mSilver\u001b[0m\u001b[33m cres\u001b[0m\u001b[33mcent\u001b[0m\u001b[33m in\u001b[0m\u001b[33m the\u001b[0m\u001b[33m midnight\u001b[0m\u001b[33m sky\u001b[0m\u001b[33m,\n", - "\u001b[0m\u001b[33mA\u001b[0m\u001b[33m gentle\u001b[0m\u001b[33m glow\u001b[0m\u001b[33m that\u001b[0m\u001b[33m whispers\u001b[0m\u001b[33m,\u001b[0m\u001b[33m \"\u001b[0m\u001b[33mI\u001b[0m\u001b[33m'm\u001b[0m\u001b[33m passing\u001b[0m\u001b[33m by\u001b[0m\u001b[33m.\"\u001b[0m\u001b[97m\u001b[0m\n" - ] - } - ], - "source": [ - "import asyncio\n", - "\n", - "await get_llama_response(use_local=True)" - ] - }, - { - "cell_type": "markdown", - "id": "7e3a3ffa", - "metadata": {}, - "source": [ - "Thanks for checking out this notebook! \n", - "\n", - "The next one will be a guide on [Prompt Engineering](./02_Prompt_Engineering101.ipynb), please continue learning!" - ] - } - ], - "metadata": { - "fileHeader": "", - "fileUid": "e11939ac-dfbc-4a1c-83be-e494c7f803b8", - "isAdHoc": false, - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.15" - } + "cells": [ + { + "cell_type": "markdown", + "id": "a0ed972d", + "metadata": {}, + "source": [ + "# Switching between Local and Cloud Model with Llama Stack\n", + "\n", + "This guide provides a streamlined setup to switch between local and cloud clients for text generation with Llama Stack’s `chat_completion` API. This setup enables automatic fallback to a cloud instance if the local client is unavailable.\n", + "\n", + "### Prerequisites\n", + "Before you begin, please ensure Llama Stack is installed and the distribution is set up by following the [Getting Started Guide](https://llama-stack.readthedocs.io/en/latest/). You will need to run two distributions, a local and a cloud distribution, for this demo to work.\n", + "\n", + "### Implementation" + ] }, - "nbformat": 4, - "nbformat_minor": 5 + { + "cell_type": "markdown", + "id": "bfac8382", + "metadata": {}, + "source": [ + "### 1. Configuration\n", + "Set up your connection parameters:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d80c0926", + "metadata": {}, + "outputs": [], + "source": [ + "HOST = \"localhost\" # Replace with your host\n", + "LOCAL_PORT = 8321 # Replace with your local distro port\n", + "CLOUD_PORT = 8322 # Replace with your cloud distro port" + ] + }, + { + "cell_type": "markdown", + "id": "df89cff7", + "metadata": {}, + "source": [ + "#### 2. Set Up Local and Cloud Clients\n", + "\n", + "Initialize both clients, specifying the `base_url` for each instance. In this case, we have the local distribution running on `http://localhost:8321` and the cloud distribution running on `http://localhost:8322`.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7f868dfe", + "metadata": {}, + "outputs": [], + "source": [ + "from llama_stack_client import AsyncLlamaStackClient\n", + "\n", + "# Configure local and cloud clients\n", + "local_client = AsyncLlamaStackClient(base_url=f'http://{HOST}:{LOCAL_PORT}')\n", + "cloud_client = AsyncLlamaStackClient(base_url=f'http://{HOST}:{CLOUD_PORT}')" + ] + }, + { + "cell_type": "markdown", + "id": "894689c1", + "metadata": {}, + "source": [ + "#### 3. Client Selection with Fallback\n", + "\n", + "The `select_client` function checks if the local client is available using a lightweight `/v1/health` check. If the local client is unavailable, it automatically switches to the cloud client.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ff0c8277", + "metadata": {}, + "outputs": [], + "source": [ + "import httpx\n", + "from termcolor import cprint\n", + "\n", + "async def check_client_health(client, client_name: str) -> bool:\n", + " try:\n", + " async with httpx.AsyncClient() as http_client:\n", + " response = await http_client.get(f'{client.base_url}/v1/health')\n", + " if response.status_code == 200:\n", + " cprint(f'Using {client_name} client.', 'yellow')\n", + " return True\n", + " else:\n", + " cprint(f'{client_name} client health check failed.', 'red')\n", + " return False\n", + " except httpx.RequestError:\n", + " cprint(f'Failed to connect to {client_name} client.', 'red')\n", + " return False\n", + "\n", + "async def select_client(use_local: bool) -> AsyncLlamaStackClient:\n", + " if use_local and await check_client_health(local_client, 'local'):\n", + " return local_client\n", + "\n", + " if await check_client_health(cloud_client, 'cloud'):\n", + " return cloud_client\n", + "\n", + " raise ConnectionError('Unable to connect to any client.')\n", + "\n", + "# Example usage: pass True for local, False for cloud\n", + "client = await select_client(use_local=True)\n" + ] + }, + { + "cell_type": "markdown", + "id": "9ccfe66f", + "metadata": {}, + "source": [ + "#### 4. Generate a Response\n", + "\n", + "After selecting the client, you can generate text using `chat_completion`. This example sends a sample prompt to the model and prints the response.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5e19cc20", + "metadata": {}, + "outputs": [], + "source": [ + "from termcolor import cprint\n", + "\n", + "async def get_llama_response(stream: bool = True, use_local: bool = True):\n", + " client = await select_client(use_local) # Selects the available client\n", + " message = {\n", + " \"role\": \"user\",\n", + " \"content\": 'hello world, write me a 2 sentence poem about the moon'\n", + " }\n", + " cprint(f'User> {message[\"content\"]}', 'green')\n", + "\n", + " response = await client.inference.chat_completion(\n", + " messages=[message],\n", + " model_id='meta-llama/Llama3.2-11B-Vision-Instruct',\n", + " stream=stream,\n", + " )\n", + "\n", + " cprint(f'Assistant> ', color='cyan', end='')\n", + " if not stream:\n", + " cprint(response.completion_message.content, color='yellow')\n", + " else:\n", + " async for chunk in response:\n", + " cprint(chunk.event.delta.text, color='yellow', end='')\n", + " cprint('')" + ] + }, + { + "cell_type": "markdown", + "id": "6edf5e57", + "metadata": {}, + "source": [ + "#### 5. Run with Cloud Model\n", + "\n", + "Use `asyncio.run()` to execute `get_llama_response` in an asynchronous event loop.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c10f487e", + "metadata": {}, + "outputs": [], + "source": [ + "import asyncio\n", + "\n", + "\n", + "# Run this function directly in a Jupyter Notebook cell with `await`\n", + "await get_llama_response(use_local=False)\n", + "# To run it in a python file, use this line instead\n", + "# asyncio.run(get_llama_response(use_local=False))" + ] + }, + { + "cell_type": "markdown", + "id": "5c433511-9321-4718-ab7f-e21cf6b5ca79", + "metadata": {}, + "source": [ + "#### 6. Run with Local Model\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "02eacfaf-c7f1-494b-ac28-129d2a0258e3", + "metadata": {}, + "outputs": [], + "source": [ + "import asyncio\n", + "\n", + "await get_llama_response(use_local=True)" + ] + }, + { + "cell_type": "markdown", + "id": "7e3a3ffa", + "metadata": {}, + "source": [ + "Thanks for checking out this notebook! \n", + "\n", + "The next one will be a guide on [Prompt Engineering](./02_Prompt_Engineering101.ipynb), please continue learning!" + ] + }, + { + "cell_type": "markdown", + "id": "3ad6db48", + "metadata": {}, + "source": [] + } + ], + "metadata": { + "fileHeader": "", + "fileUid": "e11939ac-dfbc-4a1c-83be-e494c7f803b8", + "isAdHoc": false, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.15" + } + }, + "nbformat": 4, + "nbformat_minor": 5 }