# Switching between Local and Cloud Model with Llama Stack

This guide provides a streamlined setup to switch between local and cloud clients for text generation with Llama Stack's `chat_completion` API. This setup enables automatic fallback to a cloud instance if the local client is unavailable.

## Pre-requisites

Before you begin, please ensure Llama Stack is installed and the distributions are set up by following the Getting Started Guide. You will need to run two distributions, a local and a cloud distribution, for this demo to work.

## Implementation

1. **Set Up Local and Cloud Clients**

   Initialize both clients, specifying the `base_url` of each instance you initialized. In this case, we have the local distribution running on `http://localhost:5000` and the cloud distribution running on `http://localhost:5001`.

   ```python
   from llama_stack_client import LlamaStackClient

   # Configure local and cloud clients
   local_client = LlamaStackClient(base_url="http://localhost:5000")
   cloud_client = LlamaStackClient(base_url="http://localhost:5001")
   ```

2. **Client Selection with Fallback**

   The `select_client` function checks whether the local client is available using a lightweight `/health` check. If the local client is unavailable, it automatically switches to the cloud client. (A variant that adds an explicit timeout to this health check is sketched after the steps below.)

   ```python
   import httpx
   from termcolor import cprint

   async def select_client() -> LlamaStackClient:
       """Use local client if available; otherwise, switch to cloud client."""
       try:
           async with httpx.AsyncClient() as http_client:
               response = await http_client.get(f"{local_client.base_url}/health")
               if response.status_code == 200:
                   cprint("Using local client.", "yellow")
                   return local_client
       except httpx.RequestError:
           pass
       cprint("Local client unavailable. Switching to cloud client.", "yellow")
       return cloud_client
   ```

3. **Generate a Response**

   After selecting the client, you can generate text using `chat_completion`. This example sends a sample prompt to the model and prints the response.

   ```python
   from llama_stack_client.lib.inference.event_logger import EventLogger
   from llama_stack_client.types import UserMessage

   async def get_llama_response(stream: bool = True):
       client = await select_client()  # Selects the available client
       message = UserMessage(
           content="hello world, write me a 2 sentence poem about the moon",
           role="user",
       )
       cprint(f"User> {message.content}", "green")

       response = client.inference.chat_completion(
           messages=[message],
           model="Llama3.2-11B-Vision-Instruct",
           stream=stream,
       )

       if not stream:
           cprint(f"> Response: {response}", "cyan")
       else:
           # Stream tokens progressively
           async for log in EventLogger().log(response):
               log.print()
   ```

4. **Run the Asynchronous Response Generation**

   Use `asyncio.run()` to execute `get_llama_response` in an asynchronous event loop.

   ```python
   import asyncio

   # Initiate the response generation process
   asyncio.run(get_llama_response())
   ```
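
The health check above has no explicit timeout, so a hanging connection to the local server can delay the fallback. As a minimal sketch reusing the `local_client` and `cloud_client` from step 1, you can cap the check by passing a `timeout` to `httpx.AsyncClient`; the one-second default below is an arbitrary illustrative choice, not something this guide prescribes.

```python
import httpx
from llama_stack_client import LlamaStackClient
from termcolor import cprint


async def select_client(timeout_s: float = 1.0) -> LlamaStackClient:
    """Prefer the local client, but cap the health check at `timeout_s` seconds."""
    try:
        # httpx.RequestError also covers timeouts, so a slow or hanging local
        # server falls through to the cloud client below.
        async with httpx.AsyncClient(timeout=timeout_s) as http_client:
            response = await http_client.get(f"{local_client.base_url}/health")
            if response.status_code == 200:
                cprint("Using local client.", "yellow")
                return local_client
    except httpx.RequestError:
        pass
    cprint("Local client unavailable. Switching to cloud client.", "yellow")
    return cloud_client
```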
    

## Complete code

Summing it up, here's the complete code for the local-cloud model implementation with Llama Stack:

```python
import asyncio

import httpx
from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.inference.event_logger import EventLogger
from llama_stack_client.types import UserMessage
from termcolor import cprint

local_client = LlamaStackClient(base_url="http://localhost:5000")
cloud_client = LlamaStackClient(base_url="http://localhost:5001")


async def select_client() -> LlamaStackClient:
    try:
        async with httpx.AsyncClient() as http_client:
            response = await http_client.get(f"{local_client.base_url}/health")
            if response.status_code == 200:
                cprint("Using local client.", "yellow")
                return local_client
    except httpx.RequestError:
        pass
    cprint("Local client unavailable. Switching to cloud client.", "yellow")
    return cloud_client


async def get_llama_response(stream: bool = True):
    client = await select_client()
    message = UserMessage(
        content="hello world, write me a 2 sentence poem about the moon", role="user"
    )
    cprint(f"User> {message.content}", "green")

    response = client.inference.chat_completion(
        messages=[message],
        model="Llama3.2-11B-Vision-Instruct",
        stream=stream,
    )

    if not stream:
        cprint(f"> Response: {response}", "cyan")
    else:
        async for log in EventLogger().log(response):
            log.print()


asyncio.run(get_llama_response())
```
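
One optional refinement, not something the guide above requires: instead of hard-coding the two endpoints, you can read them from environment variables so the same script runs unchanged on different machines. The variable names `LOCAL_STACK_URL` and `CLOUD_STACK_URL` below are hypothetical placeholders; the defaults mirror the ports used in this guide.

```python
import os

from llama_stack_client import LlamaStackClient

# Hypothetical variable names; the defaults match the ports used in this guide.
local_client = LlamaStackClient(
    base_url=os.getenv("LOCAL_STACK_URL", "http://localhost:5000")
)
cloud_client = LlamaStackClient(
    base_url=os.getenv("CLOUD_STACK_URL", "http://localhost:5001")
)
```

You can also call `asyncio.run(get_llama_response(stream=False))` to print the full response at once instead of streaming tokens.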

With these fundamentals, you should be well on your way to leveraging Llama Stack's text generation capabilities! For more advanced features, refer to the Llama Stack Documentation.