## LlamaStack + LangChain Integration Tutorial

This notebook demonstrates how to integrate **LlamaStack** with **LangChain** to build a complete RAG (Retrieval-Augmented Generation) system.

### Overview

- **LlamaStack**: Provides the infrastructure for running LLMs and OpenAI-compatible vector stores
- **LangChain**: Provides the framework for chaining operations and prompt templates
- **Integration**: Uses LlamaStack's OpenAI-compatible API with LangChain

### What You'll See

1. Setting up LlamaStack server with Fireworks AI provider
2. Creating and querying vector stores
3. Building RAG chains with LangChain + LlamaStack
4. Querying the chain for relevant information

### Prerequisites

- Fireworks API key

---

### 1. Installation and Setup

#### Install Required Dependencies

First, install the packages this notebook needs: LangChain (with its OpenAI, community, and text-splitter packages), FastAPI, Uvicorn, and FAISS.

In [1]:
!pip install uv
!uv pip install fastapi uvicorn "langchain>=0.2" langchain-openai \
             langchain-community langchain-text-splitters \
             faiss-cpu

Using Python 3.12.11 environment at: /Users/swapna942/miniconda3
Audited 7 packages in 42ms


### 2. LlamaStack Server Setup

#### Build and Start LlamaStack Server

This section sets up the LlamaStack server with:
- **Fireworks AI** as the inference provider
- **Sentence Transformers** for embeddings

The server runs on `localhost:8321` and provides OpenAI-compatible endpoints.

In [2]:
import os
import subprocess
import time

# Remove UV_SYSTEM_PYTHON to ensure uv creates a proper virtual environment
# instead of trying to use system Python globally, which could cause permission issues
# and package conflicts with the system's Python installation
if "UV_SYSTEM_PYTHON" in os.environ:
    del os.environ["UV_SYSTEM_PYTHON"]

def run_llama_stack_server_background():
    """Install the starter distribution's dependencies, then run the LlamaStack server."""
    log_file = open("llama_stack_server.log", "w")
    # Popen accepts a single command, so the two steps are chained with "&&":
    # the server starts only if the dependency install succeeds.
    process = subprocess.Popen(
        "uv run --with llama-stack llama stack list-deps starter | xargs -L1 uv pip install && "
        "uv run --with llama-stack llama stack run starter",
        shell=True,
        stdout=log_file,
        stderr=log_file,
        text=True,
    )

    print(f"Building and starting Llama Stack server with PID: {process.pid}")
    return process


def wait_for_server_to_start():
    import requests
    from requests.exceptions import ConnectionError

    url = "http://0.0.0.0:8321/v1/health"
    max_retries = 30
    retry_interval = 1

    print("Waiting for server to start", end="")
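    for _ in range(max_retries):
        try:
            response = requests.get(url)
            if response.status_code == 200:
                print("\nServer is ready!")
                return True
        except ConnectionError:
            pass  # the server is not accepting connections yet
        # Wait before the next attempt, whether the request failed or returned a non-200 status
        print(".", end="", flush=True)
        time.sleep(retry_interval)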

    print("\nServer failed to start after", max_retries * retry_interval, "seconds")
    return False


def kill_llama_stack_server():
    # Kill any existing llama stack server processes using pkill command
    os.system("pkill -f llama_stack.core.server.server")

In [3]:
server_process = run_llama_stack_server_background()
assert wait_for_server_to_start()

Building and starting Llama Stack server with PID: 19747
Waiting for server to start....
Server is ready!


#### Install LlamaStack Client

Install the client library to interact with the LlamaStack server.

In [4]:
!uv pip install llama_stack_client

Using Python 3.12.11 environment at: /Users/swapna942/miniconda3
Audited 1 package in 27ms


### 3. Initialize LlamaStack Client

Create a client connection to the LlamaStack server. Provider API keys are passed through `provider_data`:

- **Fireworks API Key**: For Fireworks models

In [5]:
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(
    base_url="http://0.0.0.0:8321",
    provider_data={"fireworks_api_key": "***"},
)
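
The cell above passes the key as a literal string. A minimal sketch of reading it from the environment instead is shown here; the variable name `FIREWORKS_API_KEY` and the helper `build_provider_data` are assumptions made for illustration, not something LlamaStack requires.

```python
import os


def build_provider_data(env_var: str = "FIREWORKS_API_KEY") -> dict:
    """Return a provider_data dict, reading the key from the environment.

    The environment variable name is an assumption; use whatever name
    your setup exports.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable before creating the client")
    return {"fireworks_api_key": api_key}


# Possible usage (requires the variable to be set):
# client = LlamaStackClient(
#     base_url="http://0.0.0.0:8321",
#     provider_data=build_provider_data(),
# )
```

This keeps the key out of the notebook itself, which matters if the notebook is shared or committed.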

#### Explore Available Models and Safety Features

Check what models and safety shields are available through your LlamaStack instance.

In [6]:
print("Available Fireworks models:")
for m in client.models.list():
    if m.identifier.startswith("fireworks/"):
        print(f"- {m.identifier}")

print("----")
print("Available shields (safety models):")
for s in client.shields.list():
    print(s.identifier)
print("----")

INFO:httpx:HTTP Request: GET http://0.0.0.0:8321/v1/models "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET http://0.0.0.0:8321/v1/shields "HTTP/1.1 200 OK"


Available Fireworks models:
- fireworks/accounts/fireworks/models/llama-v3p1-8b-instruct
- fireworks/accounts/fireworks/models/llama-v3p1-70b-instruct
- fireworks/accounts/fireworks/models/llama-v3p1-405b-instruct
- fireworks/accounts/fireworks/models/llama-v3p2-3b-instruct
- fireworks/accounts/fireworks/models/llama-v3p2-11b-vision-instruct
- fireworks/accounts/fireworks/models/llama-v3p2-90b-vision-instruct
- fireworks/accounts/fireworks/models/llama-v3p3-70b-instruct
- fireworks/accounts/fireworks/models/llama4-scout-instruct-basic
- fireworks/accounts/fireworks/models/llama4-maverick-instruct-basic
- fireworks/nomic-ai/nomic-embed-text-v1.5
- fireworks/accounts/fireworks/models/llama-guard-3-8b
- fireworks/accounts/fireworks/models/llama-guard-3-11b-vision
----
Available shields (safety models):
code-scanner
llama-guard
nemo-guardrail
----


### 4. Vector Store Setup

#### Create a Vector Store with File Upload

Create a vector store using the OpenAI-compatible vector stores API:

- **Vector Store**: OpenAI-compatible vector store for document storage
- **File Upload**: Automatic chunking and embedding of uploaded files  
- **Embedding Model**: Sentence Transformers model for text embeddings
- **Dimensions**: 384-dimensional embeddings

In [7]:
from io import BytesIO

docs = [
    ("Acme ships globally in 3-5 business days.", {"title": "Shipping Policy"}),
    ("Returns are accepted within 30 days of purchase.", {"title": "Returns Policy"}),
    ("Support is available 24/7 via chat and email.", {"title": "Support"}),
]

file_ids = []
for content, metadata in docs:
  with BytesIO(content.encode()) as file_buffer:
      file_buffer.name = f"{metadata['title'].replace(' ', '_').lower()}.txt"
      create_file_response = client.files.create(file=file_buffer, purpose="assistants")
      print(create_file_response)
      file_ids.append(create_file_response.id)

# Create vector store with files
vector_store = client.vector_stores.create(
  name="acme_docs",
  file_ids=file_ids,
  embedding_model="sentence-transformers/all-MiniLM-L6-v2",
  embedding_dimension=384,
  provider_id="faiss"
)

INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/openai/v1/files "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/openai/v1/files "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/openai/v1/files "HTTP/1.1 200 OK"


File(id='file-54652c95c56c4c34918a97d7ff8a4320', bytes=41, created_at=1757442621, expires_at=1788978621, filename='shipping_policy.txt', object='file', purpose='assistants')
File(id='file-fb1227c1d1854da1bd774d21e5b7e41c', bytes=48, created_at=1757442621, expires_at=1788978621, filename='returns_policy.txt', object='file', purpose='assistants')
File(id='file-673f874852fe42798675a13d06a256e2', bytes=45, created_at=1757442621, expires_at=1788978621, filename='support.txt', object='file', purpose='assistants')


INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/openai/v1/vector_stores "HTTP/1.1 200 OK"


#### Test Vector Store Search

Query the vector store. This performs semantic search to find relevant documents based on the query.

In [8]:
search_response = client.vector_stores.search(
  vector_store_id=vector_store.id,
  query="How long does shipping take?",
  max_num_results=2
)
for result in search_response.data:
  content = result.content[0].text
  print(content)

INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/openai/v1/vector_stores/vs_708c060b-45da-423e-8354-68529b4fd1a6/search "HTTP/1.1 200 OK"


Acme ships globally in 3-5 business days.
Returns are accepted within 30 days of purchase.
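
To make the idea of semantic search concrete, here is a toy sketch that ranks documents by cosine similarity between vectors. The 3-dimensional vectors are made up for illustration (the real store uses 384-dimensional embeddings), and this is not a description of how LlamaStack or FAISS scores results internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for real document embeddings
toy_index = {
    "shipping_policy.txt": [0.9, 0.1, 0.0],
    "returns_policy.txt": [0.2, 0.9, 0.1],
    "support.txt": [0.1, 0.2, 0.9],
}
# Made-up vector standing in for the embedded query
toy_query = [0.8, 0.3, 0.1]


def top_k(query_vec, index, k=2):
    """Return the k document names most similar to the query vector."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


print(top_k(toy_query, toy_index))
```

With these toy values the shipping document ranks first and the returns document second, the same ordering the real search returned above.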


### 5. LangChain Integration

#### Configure LangChain with LlamaStack

Set up LangChain to use LlamaStack's OpenAI-compatible API:

- **Base URL**: Points to LlamaStack's OpenAI endpoint
- **Headers**: Include Fireworks API key for model access
- **Model**: Use the Llama 3.1 8B Instruct model (`llama-v3p1-8b-instruct`) for inference

In [9]:
import os

from langchain_openai import ChatOpenAI

# Point LangChain to the LlamaStack server
llm = ChatOpenAI(
    base_url="http://0.0.0.0:8321/v1/openai/v1",
    api_key="dummy",
    model="fireworks/accounts/fireworks/models/llama-v3p1-8b-instruct",
    default_headers={"X-LlamaStack-Provider-Data": '{"fireworks_api_key": "***"}'},
)
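
The `X-LlamaStack-Provider-Data` header carries a JSON object as a string. Rather than writing that string by hand, it can be produced with `json.dumps`, which avoids quoting mistakes. The helper name `provider_data_header` below is hypothetical.

```python
import json


def provider_data_header(fireworks_api_key: str) -> dict:
    """Build a default_headers dict with the provider data serialized as JSON."""
    payload = {"fireworks_api_key": fireworks_api_key}
    return {"X-LlamaStack-Provider-Data": json.dumps(payload)}


headers = provider_data_header("***")
print(headers["X-LlamaStack-Provider-Data"])  # {"fireworks_api_key": "***"}
```

The resulting dict can be passed as `default_headers=headers` when constructing `ChatOpenAI`, in place of the hand-written string.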

#### Test LLM Connection

Verify that LangChain can successfully communicate with the LlamaStack server.

In [10]:
# Test llm with simple message
messages = [
    {"role": "system", "content": "You are a friendly assistant."},
    {"role": "user", "content": "Write a two-sentence poem about llama."},
]
llm.invoke(messages)

INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/openai/v1/chat/completions "HTTP/1.1 200 OK"


AIMessage(content="A llama's gentle eyes shine bright,\nIn the Andes, it roams through morning light.", additional_kwargs={'refusal': None}, response_metadata={'token_usage': None, 'model_name': 'fireworks/accounts/fireworks/models/llama-v3p1-8b-instruct', 'system_fingerprint': None, 'id': 'chatcmpl-602b5967-82a3-476b-9cd2-7d3b29b76ee8', 'service_tier': None, 'finish_reason': 'stop', 'logprobs': None}, id='run--0933c465-ff4d-4a7b-b7fb-fd97dd8244f3-0')

### 6. Building the RAG Chain

#### Create a Complete RAG Pipeline

Build a LangChain pipeline that combines:

1. **Vector Search**: Query LlamaStack's OpenAI-compatible vector store
2. **Context Assembly**: Format retrieved documents
3. **Prompt Template**: Structure the input for the LLM
4. **LLM Generation**: Generate answers using context
5. **Output Parsing**: Extract the final response

**Chain Flow**: `Query → Vector Search → Context + Question → LLM → Response`

In [11]:
# LangChain handles the prompt template and chaining; LlamaStack provides vector search and chat completion
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda, RunnablePassthrough


def join_docs(docs):
    return "\n\n".join([f"[{d.filename}] {d.content[0].text}" for d in docs.data])

PROMPT = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful assistant. Use the following context to answer."),
        ("user", "Question: {question}\n\nContext:\n{context}"),
    ]
)

vector_step = RunnableLambda(
      lambda x: client.vector_stores.search(
          vector_store_id=vector_store.id,
          query=x,
          max_num_results=2
      )
  )

chain = (
    {"context": vector_step | RunnableLambda(join_docs), "question": RunnablePassthrough()}
    | PROMPT
    | llm
    | StrOutputParser()
)
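
To show what the chain does at each step, here is the same flow written as plain Python functions. `fake_search` and `fake_llm` are stand-ins invented for this sketch so that it runs without a server; they are not LlamaStack or LangChain APIs, and the result shape is simplified to `(filename, text)` pairs.

```python
def fake_search(question):
    """Stand-in for the vector store search: returns (filename, text) pairs."""
    return [
        ("shipping_policy.txt", "Acme ships globally in 3-5 business days."),
        ("returns_policy.txt", "Returns are accepted within 30 days of purchase."),
    ]


def format_context(results):
    """Join retrieved chunks into one context string, as join_docs does."""
    return "\n\n".join(f"[{name}] {text}" for name, text in results)


def build_prompt(question, context):
    """Fill the system and user messages, mirroring the PROMPT template."""
    return [
        ("system", "You are a helpful assistant. Use the following context to answer."),
        ("user", f"Question: {question}\n\nContext:\n{context}"),
    ]


def fake_llm(messages):
    """Stand-in for the chat model: reports how many messages it received."""
    return {"content": f"(stub answer built from {len(messages)} messages)"}


def parse_output(message):
    """Extract the text, which is the role StrOutputParser plays in the chain."""
    return message["content"]


def run_chain(question):
    context = format_context(fake_search(question))  # the "context" branch
    messages = build_prompt(question, context)       # the question passes through unchanged
    return parse_output(fake_llm(messages))


print(run_chain("How long does shipping take?"))
```

The LangChain version expresses the same sequence with the `|` operator, which makes each step swappable without rewriting the function that ties them together.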

### 7. Testing the RAG System

#### Example 1: Shipping Query

In [12]:
query = "How long does shipping take?"
response = chain.invoke(query)
print("❓", query)
print("💡", response)

INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/openai/v1/vector_stores/vs_708c060b-45da-423e-8354-68529b4fd1a6/search "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/openai/v1/chat/completions "HTTP/1.1 200 OK"


❓ How long does shipping take?
💡 Acme ships globally in 3-5 business days. This means that shipping typically takes between 3 to 5 working days from the date of dispatch or order fulfillment.


#### Example 2: Returns Policy Query

In [13]:
query = "Can I return a product after 40 days?"
response = chain.invoke(query)
print("❓", query)
print("💡", response)

INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/openai/v1/vector_stores/vs_708c060b-45da-423e-8354-68529b4fd1a6/search "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/openai/v1/chat/completions "HTTP/1.1 200 OK"


❓ Can I return a product after 40 days?
💡 Based on the provided context, you cannot return a product after 40 days. The return window is limited to 30 days from the date of purchase.


---
We have successfully built a RAG system that combines:

- **LlamaStack** for infrastructure (LLM serving + Vector Store)
- **LangChain** for orchestration (prompts + chains)
- **Fireworks** for high-quality language models

### Key Benefits

1. **Unified Infrastructure**: Single server for LLMs and Vector Store
2. **OpenAI Compatibility**: Easy integration with existing LangChain code
3. **Multi-Provider Support**: Switch between different LLM providers
4. **Production Ready**: Built-in safety shields and monitoring

### Next Steps

- Add more sophisticated document processing
- Implement conversation memory
- Add safety filtering and monitoring
- Scale to larger document collections
- Integrate with web frameworks like FastAPI or Streamlit

---

##### 🔧 Cleanup

Don't forget to stop the LlamaStack server when you're done:

```python
kill_llama_stack_server()
```

In [14]:
kill_llama_stack_server()
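
As a complement to `pkill`, a process started with `subprocess.Popen` can also be stopped through its handle. The sketch below uses a hypothetical helper, `stop_process`, and demonstrates it on a harmless sleeping child rather than the real server. One caveat, stated as an assumption about typical shell behavior: because the server above is launched with `shell=True`, the handle refers to the shell, so terminating it may leave the server itself running, which is likely why the notebook matches the server by name instead.

```python
import subprocess
import sys


def stop_process(process, timeout=10):
    """Ask a child process to exit, then force-kill it if it does not stop in time."""
    if process.poll() is not None:
        return process.returncode  # already exited
    process.terminate()
    try:
        process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        process.kill()
        process.wait()
    return process.returncode


# Demonstration on a throwaway child process that would otherwise sleep for a minute
demo = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
stop_process(demo)
print("stopped:", demo.poll() is not None)
```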