# Switching between Local and Cloud Model with Llama Stack

This guide provides a streamlined setup to switch between local and cloud clients for text generation with Llama Stack's `chat_completion` API. This setup enables automatic fallback to a cloud instance if the local client is unavailable.

## Pre-requisites

Before you begin, please ensure Llama Stack is installed and the distributions are set up by following the Getting Started Guide. You will need to run two distributions, a local and a cloud distribution, for this demo to work.

## Implementation

1. **Set Up Local and Cloud Clients**

   Initialize both clients, specifying the `base_url` of each instance you initialized. In this case, we have the local distribution running on `http://localhost:5000` and the cloud distribution running on `http://localhost:5001`.

   ```python
   from llama_stack_client import LlamaStackClient

   # Configure local and cloud clients
   local_client = LlamaStackClient(base_url="http://localhost:5000")
   cloud_client = LlamaStackClient(base_url="http://localhost:5001")
   ```

2. **Client Selection with Fallback**

   The `select_client` function checks whether the local client is available using a lightweight `/health` check. If the local client is unavailable, it automatically switches to the cloud client. (A variant that adds an explicit timeout to this health check is sketched after the steps below.)

   ```python
   import httpx
   from termcolor import cprint

   async def select_client() -> LlamaStackClient:
       """Use local client if available; otherwise, switch to cloud client."""
       try:
           async with httpx.AsyncClient() as http_client:
               response = await http_client.get(f"{local_client.base_url}/health")
               if response.status_code == 200:
                   cprint("Using local client.", "yellow")
                   return local_client
       except httpx.RequestError:
           pass
       cprint("Local client unavailable. Switching to cloud client.", "yellow")
       return cloud_client
   ```

3. **Generate a Response**

   After selecting the client, you can generate text using `chat_completion`. This example sends a sample prompt to the model and prints the response.

   ```python
   from llama_stack_client.lib.inference.event_logger import EventLogger
   from llama_stack_client.types import UserMessage

   async def get_llama_response(stream: bool = True):
       client = await select_client()  # Selects the available client
       message = UserMessage(
           content="hello world, write me a 2 sentence poem about the moon",
           role="user",
       )
       cprint(f"User> {message.content}", "green")

       response = client.inference.chat_completion(
           messages=[message],
           model="Llama3.2-11B-Vision-Instruct",
           stream=stream,
       )

       if not stream:
           cprint(f"> Response: {response}", "cyan")
       else:
           # Stream tokens progressively
           async for log in EventLogger().log(response):
               log.print()
   ```

4. **Run the Asynchronous Response Generation**

   Use `asyncio.run()` to execute `get_llama_response` in an asynchronous event loop.

   ```python
   import asyncio

   # Initiate the response generation process
   asyncio.run(get_llama_response())
   ```
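
The health check above has no explicit timeout, so a hanging connection to the local server can delay the fallback. As a minimal sketch reusing the `local_client` and `cloud_client` from step 1, you can cap the check by passing a `timeout` to `httpx.AsyncClient`; the one-second default below is an arbitrary illustrative choice, not something this guide prescribes.

```python
import httpx
from llama_stack_client import LlamaStackClient
from termcolor import cprint


async def select_client(timeout_s: float = 1.0) -> LlamaStackClient:
    """Prefer the local client, but cap the health check at `timeout_s` seconds."""
    try:
        # httpx.RequestError also covers timeouts, so a slow or hanging local
        # server falls through to the cloud client below.
        async with httpx.AsyncClient(timeout=timeout_s) as http_client:
            response = await http_client.get(f"{local_client.base_url}/health")
            if response.status_code == 200:
                cprint("Using local client.", "yellow")
                return local_client
    except httpx.RequestError:
        pass
    cprint("Local client unavailable. Switching to cloud client.", "yellow")
    return cloud_client
```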
    

## Complete code

Summing it up, here's the complete code for the local-cloud model implementation with Llama Stack:

```python
import asyncio

import httpx
from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.inference.event_logger import EventLogger
from llama_stack_client.types import UserMessage
from termcolor import cprint

local_client = LlamaStackClient(base_url="http://localhost:5000")
cloud_client = LlamaStackClient(base_url="http://localhost:5001")


async def select_client() -> LlamaStackClient:
    try:
        async with httpx.AsyncClient() as http_client:
            response = await http_client.get(f"{local_client.base_url}/health")
            if response.status_code == 200:
                cprint("Using local client.", "yellow")
                return local_client
    except httpx.RequestError:
        pass
    cprint("Local client unavailable. Switching to cloud client.", "yellow")
    return cloud_client


async def get_llama_response(stream: bool = True):
    client = await select_client()
    message = UserMessage(
        content="hello world, write me a 2 sentence poem about the moon", role="user"
    )
    cprint(f"User> {message.content}", "green")

    response = client.inference.chat_completion(
        messages=[message],
        model="Llama3.2-11B-Vision-Instruct",
        stream=stream,
    )

    if not stream:
        cprint(f"> Response: {response}", "cyan")
    else:
        async for log in EventLogger().log(response):
            log.print()


asyncio.run(get_llama_response())
```
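
One optional refinement, not something the guide above requires: instead of hard-coding the two endpoints, you can read them from environment variables so the same script runs unchanged on different machines. The variable names `LOCAL_STACK_URL` and `CLOUD_STACK_URL` below are hypothetical placeholders; the defaults mirror the ports used in this guide.

```python
import os

from llama_stack_client import LlamaStackClient

# Hypothetical variable names; the defaults match the ports used in this guide.
local_client = LlamaStackClient(
    base_url=os.getenv("LOCAL_STACK_URL", "http://localhost:5000")
)
cloud_client = LlamaStackClient(
    base_url=os.getenv("CLOUD_STACK_URL", "http://localhost:5001")
)
```

You can also call `asyncio.run(get_llama_response(stream=False))` to print the full response at once instead of streaming tokens.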

With these fundamentals, you should be well on your way to leveraging Llama Stack's text generation capabilities! For more advanced features, refer to the Llama Stack Documentation.