## LlamaStack + LangChain Integration Tutorial

This notebook demonstrates how to integrate **LlamaStack** with **LangChain** to build a complete RAG (Retrieval-Augmented Generation) system.

### Overview

- **LlamaStack**: Provides the infrastructure for running LLMs and vector databases
- **LangChain**: Provides the framework for chaining operations and prompt templates
- **Integration**: Uses LlamaStack's OpenAI-compatible API with LangChain

### What You'll See

1. Setting up a LlamaStack server with the Together AI provider
2. Creating and managing vector databases
3. Building RAG chains with LangChain + LlamaStack
4. Querying the chain for relevant information

### Prerequisites

- Together AI API key

---

### 1. Installation and Setup

#### Install Required Dependencies

First, we install all the necessary packages for LangChain and FastAPI integration.

In [1]:
!pip install fastapi uvicorn "langchain>=0.2" langchain-openai \
             langchain-community langchain-text-splitters \
             faiss-cpu



### 2. LlamaStack Server Setup

#### Build and Start LlamaStack Server

This section sets up the LlamaStack server with:
- **Together AI** as the inference provider
- **FAISS** as the vector database
- **Sentence Transformers** for embeddings

The server runs on `localhost:8321` and provides OpenAI-compatible endpoints.

In [2]:
import os
import subprocess
import time

!pip install uv

if "UV_SYSTEM_PYTHON" in os.environ:
    del os.environ["UV_SYSTEM_PYTHON"]

# this command installs all the dependencies needed for the llama stack server with the together inference provider
!uv run --with llama-stack llama stack build --distro starter --image-type venv


def run_llama_stack_server_background():
    log_file = open("llama_stack_server.log", "w")
    process = subprocess.Popen(
        # Assumes the run config lives under ~/.llama, as in the original setup; adjust the path if yours differs.
        "uv run --with llama-stack llama stack run ~/.llama/distributions/starter/starter-run.yaml --image-type venv",
        shell=True,
        stdout=log_file,
        stderr=log_file,
        text=True,
    )

    print(f"Starting Llama Stack server with PID: {process.pid}")
    return process


def wait_for_server_to_start():
    import requests
    from requests.exceptions import ConnectionError

    url = "http://0.0.0.0:8321/v1/health"
    max_retries = 30
    retry_interval = 1

    print("Waiting for server to start", end="")
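    for _ in range(max_retries):
        try:
            response = requests.get(url)
            if response.status_code == 200:
                print("\nServer is ready!")
                return True
        except ConnectionError:
            pass  # the server is not accepting connections yet
        # Wait before the next attempt, whether the request failed or returned a non-200 status
        print(".", end="", flush=True)
        time.sleep(retry_interval)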

    print("\nServer failed to start after", max_retries * retry_interval, "seconds")
    return False


# use this helper if needed to kill the server
def kill_llama_stack_server():
    # Kill any existing llama stack server processes
    os.system("ps aux | grep -v grep | grep llama_stack.core.server.server | awk '{print $2}' | xargs kill -9")

Environment '/Users/swapna942/llama-stack/.venv' already exists, re-using it.
Virtual environment /Users/swapna942/llama-stack/.venv is already active
Audited 1 package in 86ms
Installing pip dependencies
Resolved 178 packages in 462ms
Uninstalled 2 packages in 28ms
Installed 2 packages in 5ms
 - protobuf==5.29.5
 + protobuf==5.29.4
 - ruff==0.12.5
 + ruff==0.9.10
Installing special provider module: torch torchvision --index-url https://download.pytorch.org/whl/cpu
Audited 2 packages in 5ms
Installing special provider module: sentence-transformers --no-deps
Audited 1 package in 9ms
Build Successful!
You can find the newly-built distribution h

In [3]:
server_process = run_llama_stack_server_background()
assert wait_for_server_to_start()

Starting Llama Stack server with PID: 99016
Waiting for server to start....
Server is ready!
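
The polling pattern behind `wait_for_server_to_start` can be factored into a helper that does not care what is being probed. The sketch below is one possible shape, not part of LlamaStack: `wait_until` and its parameters are hypothetical names, and the probe is any callable that returns `True` once the service is ready.

```python
import time
from typing import Callable


def wait_until(probe: Callable[[], bool], max_retries: int = 30, retry_interval: float = 1.0) -> bool:
    """Call `probe` until it returns True or the retry budget is spent."""
    for attempt in range(max_retries):
        try:
            if probe():
                return True
        except Exception:
            # Treat any probe failure (connection refused, timeout, ...) as "not ready yet".
            pass
        # Sleep after every unsuccessful attempt except the last one.
        if attempt < max_retries - 1:
            time.sleep(retry_interval)
    return False


# Hypothetical usage against the health endpoint used in this notebook:
# import requests
# wait_until(lambda: requests.get("http://0.0.0.0:8321/v1/health").status_code == 200)
```

Passing the probe as a callable keeps the retry logic testable without a running server.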


#### Install LlamaStack Client

Install the client library to interact with the LlamaStack server.

In [4]:
import sys

# Install directly to the current Python environment
subprocess.check_call([sys.executable, "-m", "pip", "install", "llama_stack_client"])



0

### 3. Initialize LlamaStack Client

Create a client connection to the LlamaStack server with API keys for different providers:

- **OpenAI API Key**: For OpenAI models
- **Gemini API Key**: For Google's Gemini models  
- **Together API Key**: For Together AI models



In [5]:
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(
    base_url="http://0.0.0.0:8321",
    provider_data={"openai_api_key": "****", "gemini_api_key": "****", "together_api_key": "****"},
)
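
Rather than pasting keys into the notebook, the `provider_data` dictionary can be assembled from environment variables. This is a minimal sketch: the environment variable names in `ENV_VARS` and the helper `build_provider_data` are assumptions made for illustration, while the dictionary keys match the ones used in the cell above.

```python
import os
from typing import Mapping, Optional

# Hypothetical mapping from provider_data keys to environment variable names.
ENV_VARS = {
    "openai_api_key": "OPENAI_API_KEY",
    "gemini_api_key": "GEMINI_API_KEY",
    "together_api_key": "TOGETHER_API_KEY",
}


def build_provider_data(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the API keys that are actually set, skipping missing or empty ones."""
    env = os.environ if env is None else env
    return {key: env[var] for key, var in ENV_VARS.items() if env.get(var)}


# The result can be passed as `provider_data=...` when constructing the client.
provider_data = build_provider_data()
```

Only the keys for providers you intend to use need to be set; unset variables are simply left out of the dictionary.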

#### Explore Available Models and Safety Features

Check what models and safety shields are available through your LlamaStack instance.

In [6]:
print("Available models:")
for m in client.models.list():
    print(f"- {m.identifier}")

print("----")
print("Available shields (safety models):")
for s in client.shields.list():
    print(s.identifier)
print("----")

INFO:httpx:HTTP Request: GET http://0.0.0.0:8321/v1/models "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET http://0.0.0.0:8321/v1/shields "HTTP/1.1 200 OK"


Available models:
- all-minilm
- ollama/all-minilm:l6-v2
- ollama/llama-guard3:1b
- ollama/llama-guard3:8b
- ollama/llama3.2:3b-instruct-fp16
- ollama/nomic-embed-text
- fireworks/accounts/fireworks/models/llama-v3p1-8b-instruct
- fireworks/accounts/fireworks/models/llama-v3p1-70b-instruct
- fireworks/accounts/fireworks/models/llama-v3p1-405b-instruct
- fireworks/accounts/fireworks/models/llama-v3p2-3b-instruct
- fireworks/accounts/fireworks/models/llama-v3p2-11b-vision-instruct
- fireworks/accounts/fireworks/models/llama-v3p2-90b-vision-instruct
- fireworks/accounts/fireworks/models/llama-v3p3-70b-instruct
- fireworks/accounts/fireworks/models/llama4-scout-instruct-basic
- fireworks/accounts/fireworks/models/llama4-maverick-instruct-basic
- fireworks/nomic-ai/nomic-embed-text-v1.5
- fireworks/accounts/fireworks/models/llama-guard-3-8b
- fireworks/accounts/fireworks/models/llama-guard-3-11b-vision
- together/meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo
- together/meta-llama/Meta-Llama-3

### 4. Vector Database Setup

#### Register a Vector Database

Create a FAISS vector database for storing document embeddings:

- **Vector DB ID**: Unique identifier for the database
- **Provider**: FAISS (Facebook AI Similarity Search)
- **Embedding Model**: Sentence Transformers model for text embeddings
- **Dimensions**: 384-dimensional embeddings

In [7]:
# Register a new clean vector database
vector_db = client.vector_dbs.register(
    vector_db_id="acme_docs",  # Use a new unique name
    provider_id="faiss",
    provider_vector_db_id="acme_docs_v2",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    embedding_dimension=384,
)
print("Registered new vector DB:", vector_db)

# List all registered vector databases
dbs = client.vector_dbs.list()
print("Existing vector DBs:", dbs)

INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/vector-dbs "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: GET http://0.0.0.0:8321/v1/vector-dbs "HTTP/1.1 200 OK"


Registered new vector DB: VectorDBRegisterResponse(embedding_dimension=384, embedding_model='sentence-transformers/all-MiniLM-L6-v2', identifier='acme_docs', provider_id='faiss', type='vector_db', provider_resource_id='acme_docs_v2', owner=None, source='via_register_api', vector_db_name=None)
Existing vector DBs: [VectorDBListResponseItem(embedding_dimension=384, embedding_model='sentence-transformers/all-MiniLM-L6-v2', identifier='acme_docs', provider_id='faiss', type='vector_db', provider_resource_id='acme_docs_v2', vector_db_name=None)]


#### Prepare Sample Documents

Create LlamaStack `Chunk` objects to insert into the FAISS vector store.

In [None]:
from llama_stack_client.types.vector_io_insert_params import Chunk

docs = [
    ("Acme ships globally in 3-5 business days.", {"title": "Shipping Policy"}),
    ("Returns are accepted within 30 days of purchase.", {"title": "Returns Policy"}),
    ("Support is available 24/7 via chat and email.", {"title": "Support"}),
]

# Convert to Chunk objects
chunks = []
for content, metadata in docs:
    # Transform metadata to the required format, using the title as document_id
    metadata = {"document_id": metadata["title"]}
    chunk = Chunk(
        content=content,  # Required[InterleavedContent]
        metadata=metadata,  # Required[Dict]
    )
    chunks.append(chunk)
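
The sample documents are one sentence each, so each fits in a single chunk. Longer documents usually need to be split first. The sketch below shows one simple approach, overlapping word windows. The function name `split_into_chunks` and its parameters are hypothetical, and it returns plain dictionaries with the same `content` and `metadata` fields as above; whether your client version accepts plain dictionaries in place of `Chunk` is an assumption to verify.

```python
from typing import Dict, List


def split_into_chunks(text: str, document_id: str, max_words: int = 50, overlap: int = 10) -> List[Dict]:
    """Split text into overlapping word windows shaped like the notebook's chunks."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        window = words[start : start + max_words]
        chunks.append({"content": " ".join(window), "metadata": {"document_id": document_id}})
        if start + max_words >= len(words):
            break  # this window already reaches the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a window boundary retrievable from at least one chunk, at the cost of storing some words twice.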

#### Insert Documents into Vector Database

Store the prepared documents in the FAISS vector database. This process:
1. Generates embeddings for each document
2. Stores embeddings with metadata
3. Enables semantic search capabilities

In [9]:
# Insert chunks into FAISS vector store

response = client.vector_io.insert(vector_db_id="acme_docs", chunks=chunks)
print("Documents inserted:", response)

INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/vector-io/insert "HTTP/1.1 200 OK"


Documents inserted: None


#### Test Vector Search

Query the vector database to verify it's working correctly. This performs semantic search to find relevant documents based on the query.
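
To make "semantic search" concrete, here is a toy illustration of ranking documents by vector similarity. It is not how the FAISS provider is implemented, and the provider may use a different metric (for example L2 distance). The three-dimensional vectors are made up; real embeddings in this notebook have 384 dimensions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec: Sequence[float], doc_vecs: Dict[str, Sequence[float]]) -> List[Tuple[str, float]]:
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up "embeddings" for the three sample documents and a shipping-related query.
toy_docs = {
    "Shipping Policy": [0.9, 0.1, 0.0],
    "Returns Policy": [0.1, 0.9, 0.0],
    "Support": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]
ranking = rank(toy_query, toy_docs)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The query vector points in nearly the same direction as the "Shipping Policy" vector, so that document ranks first.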

In [10]:
# Query chunks from FAISS vector store

query_chunk_response = client.vector_io.query(
    vector_db_id="acme_docs",
    query="How long does Acme take to ship orders?",
)
for chunk in query_chunk_response.chunks:
    print("metadata", ":", chunk.metadata)
    print("content", ":", chunk.content)

INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/vector-io/query "HTTP/1.1 200 OK"


metadata : {'document_id': 'Shipping Policy'}
content : Acme ships globally in 3–5 business days.
metadata : {'document_id': 'Shipping Policy'}
content : Acme ships globally in 3–5 business days.
metadata : {'document_id': 'Returns Policy'}
content : Returns are accepted within 30 days of purchase.
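
The output above lists the shipping chunk twice. One plausible cause (a guess, not confirmed by the notebook) is that the insert cell ran more than once against the same database. If duplicates like this show up, they can be filtered before the context is built. The helper `dedupe_chunks` is hypothetical and assumes each chunk's `content` is a plain string.

```python
from types import SimpleNamespace
from typing import Iterable, List


def dedupe_chunks(chunks: Iterable) -> List:
    """Keep the first occurrence of each (document_id, content) pair, preserving order."""
    seen = set()
    unique = []
    for chunk in chunks:
        key = (chunk.metadata.get("document_id"), chunk.content)
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique


# Stand-ins for the objects found in `query_chunk_response.chunks`.
sample = [
    SimpleNamespace(metadata={"document_id": "Shipping Policy"}, content="Acme ships globally in 3-5 business days."),
    SimpleNamespace(metadata={"document_id": "Shipping Policy"}, content="Acme ships globally in 3-5 business days."),
    SimpleNamespace(metadata={"document_id": "Returns Policy"}, content="Returns are accepted within 30 days of purchase."),
]
unique = dedupe_chunks(sample)
print([c.metadata["document_id"] for c in unique])
```

Because the first occurrence is kept, the relevance order returned by the query is preserved.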


### 5. LangChain Integration

#### Configure LangChain with LlamaStack

Set up LangChain to use LlamaStack's OpenAI-compatible API:

- **Base URL**: Points to LlamaStack's OpenAI endpoint
- **Headers**: Include Together AI API key for model access
- **Model**: Use Meta Llama 3.1 8B model via Together AI

In [11]:
import os

from langchain_openai import ChatOpenAI

# Point LangChain to the LlamaStack server
os.environ["OPENAI_API_KEY"] = "dummy"
os.environ["OPENAI_BASE_URL"] = "http://0.0.0.0:8321/v1/openai/v1"

# LLM served by LlamaStack through the Together provider
llm = ChatOpenAI(
    model="together/meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
    default_headers={"X-LlamaStack-Provider-Data": '{"together_api_key": "***"}'},
)
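
The `X-LlamaStack-Provider-Data` header carries a JSON string. Writing that JSON by hand is error-prone once a real key is involved, so a small helper that serializes it with `json.dumps` may be safer. This is a sketch: `provider_data_headers` is a hypothetical helper, and `TOGETHER_API_KEY` is an assumed environment variable name.

```python
import json
import os
from typing import Dict


def provider_data_headers(together_api_key: str) -> Dict[str, str]:
    """Build a header dict in the shape passed as `default_headers` above."""
    payload = json.dumps({"together_api_key": together_api_key})
    return {"X-LlamaStack-Provider-Data": payload}


# Falls back to the same "***" placeholder the notebook uses if the variable is unset.
headers = provider_data_headers(os.environ.get("TOGETHER_API_KEY", "***"))
```

The resulting `headers` dictionary can be passed as `default_headers=headers` when constructing `ChatOpenAI`.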

#### Test LLM Connection

Verify that LangChain can successfully communicate with the LlamaStack server.

In [12]:
# Test llm with simple message
messages = [
    {"role": "system", "content": "You are a friendly assistant."},
    {"role": "user", "content": "Write a two-sentence poem about llama."},
]
llm.invoke(messages)

INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/openai/v1/chat/completions "HTTP/1.1 200 OK"


AIMessage(content="In the Andes, a gentle soul resides, \nA llama's soft eyes, with kindness abide.", additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 22, 'prompt_tokens': 50, 'total_tokens': 72, 'completion_tokens_details': None, 'prompt_tokens_details': None, 'cached_tokens': 0}, 'model_name': 'meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo', 'system_fingerprint': None, 'id': 'o86Jy3i-2j9zxn-972d7b27f8f22aaa', 'service_tier': None, 'finish_reason': 'stop', 'logprobs': None}, id='run--4797f8b9-a5f6-4730-aece-80c1fd88ac55-0', usage_metadata={'input_tokens': 50, 'output_tokens': 22, 'total_tokens': 72, 'input_token_details': {}, 'output_token_details': {}})

### 6. Building the RAG Chain

#### Create a Complete RAG Pipeline

Build a LangChain pipeline that combines:

1. **Vector Search**: Query LlamaStack's vector database
2. **Context Assembly**: Format retrieved documents
3. **Prompt Template**: Structure the input for the LLM
4. **LLM Generation**: Generate answers using context
5. **Output Parsing**: Extract the final response

**Chain Flow**: `Query → Vector Search → Context + Question → LLM → Response`

In [None]:
# LangChain for prompt templates and chaining; LlamaStack client for vector search and chat completion
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda, RunnablePassthrough


def join_docs(docs):
    return "\n\n".join([f"[{d.metadata.get('document_id')}] {d.content}" for d in docs.chunks])


PROMPT = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful assistant. Use the following context to answer."),
        ("user", "Question: {question}\n\nContext:\n{context}"),
    ]
)

vector_step = RunnableLambda(
    lambda x: client.vector_io.query(
        vector_db_id="acme_docs",
        query=x,
    )
)

chain = (
    {"context": vector_step | RunnableLambda(join_docs), "question": RunnablePassthrough()}
    | PROMPT
    | llm
    | StrOutputParser()
)
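
The dictionary at the head of the chain sends the same input string down two branches: one retrieves and formats context, the other passes the question through unchanged. The plain-Python sketch below mimics that data flow without LangChain or a running server. `fake_vector_query` and `build_prompt_inputs` are hypothetical stand-ins, while `join_docs` and the user message template are copied from the cell above.

```python
from types import SimpleNamespace


def join_docs(docs):
    # Same formatting as the notebook: "[document_id] content", separated by blank lines.
    return "\n\n".join(f"[{d.metadata.get('document_id')}] {d.content}" for d in docs.chunks)


def fake_vector_query(question: str):
    """Stub standing in for client.vector_io.query; always returns one fixed chunk."""
    chunk = SimpleNamespace(
        metadata={"document_id": "Shipping Policy"},
        content="Acme ships globally in 3-5 business days.",
    )
    return SimpleNamespace(chunks=[chunk])


def build_prompt_inputs(question: str) -> dict:
    """Mimic the dict step: the same input feeds both the context and question branches."""
    return {"context": join_docs(fake_vector_query(question)), "question": question}


inputs = build_prompt_inputs("How long does shipping take?")
user_message = "Question: {question}\n\nContext:\n{context}".format(**inputs)
print(user_message)
```

In the real chain, the resulting dictionary fills the `{question}` and `{context}` placeholders of `PROMPT` before the messages are sent to the LLM.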

### 7. Testing the RAG System

#### Example 1: Shipping Query

In [14]:
query = "How long does shipping take?"
response = chain.invoke(query)
print("❓", query)
print("💡", response)

INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/vector-io/query "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/openai/v1/chat/completions "HTTP/1.1 200 OK"


❓ How long does shipping take?
💡 According to the Shipping Policy, shipping from Acme takes 3-5 business days.


#### Example 2: Returns Policy Query

In [15]:
query = "Can I return a product after 40 days?"
response = chain.invoke(query)
print("❓", query)
print("💡", response)

INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/vector-io/query "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/openai/v1/chat/completions "HTTP/1.1 200 OK"


❓ Can I return a product after 40 days?
💡 Based on the provided returns policy, it appears that returns are only accepted within 30 days of purchase. Since you're asking about returning a product after 40 days, it would not be within the specified 30-day return window.

Unfortunately, it seems that you would not be eligible for a return in this case. However, I would recommend reaching out to the support team via chat or email to confirm their policy and see if there are any exceptions or alternative solutions available.


---
We have successfully built a RAG system that combines:

- **LlamaStack** for infrastructure (LLM serving + vector database)
- **LangChain** for orchestration (prompts + chains)
- **Together AI** for high-quality language models

### Key Benefits

1. **Unified Infrastructure**: Single server for LLMs and vector databases
2. **OpenAI Compatibility**: Easy integration with existing LangChain code
3. **Multi-Provider Support**: Switch between different LLM providers
4. **Production Ready**: Built-in safety shields and monitoring

### Next Steps

- Add more sophisticated document processing
- Implement conversation memory
- Add safety filtering and monitoring
- Scale to larger document collections
- Integrate with web frameworks like FastAPI or Streamlit

---

##### 🔧 Cleanup

Don't forget to stop the LlamaStack server when you're done:

```python
kill_llama_stack_server()
```
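
If the `server_process` handle from step 2 is still available, a gentler alternative is to ask the process to exit before resorting to `kill -9`. The sketch below uses only the standard library; `stop_process` is a hypothetical helper. Note that the notebook launches the server with `shell=True`, so the handle refers to the shell, and the server itself may outlive it; in that case `kill_llama_stack_server()` remains the fallback.

```python
import subprocess


def stop_process(process: subprocess.Popen, timeout: float = 10.0) -> int:
    """Ask the process to exit, then force-kill it if it ignores the request."""
    if process.poll() is None:  # still running
        process.terminate()
        try:
            process.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            process.kill()
            process.wait()
    return process.returncode


# Hypothetical usage with the handle created earlier in this notebook:
# stop_process(server_process)
```

Calling the helper on a process that has already exited is harmless and simply returns its exit code.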