docs: concepts and building_applications migration (#3534)

# What does this PR do?

- Migrates the remaining documentation sections to the new documentation format


## Test Plan

- Partial migration

Alexey Rybak 2025-09-24 14:05:30 -07:00 committed by GitHub
parent 05ff4c4420
commit c71ce8df61
82 changed files with 2535 additions and 1237 deletions


@@ -0,0 +1,244 @@
# Contributing to Llama Stack
We want to make contributing to this project as easy and transparent as
possible.
## Set up your development environment
We use [uv](https://github.com/astral-sh/uv) to manage python dependencies and virtual environments.
You can install `uv` by following this [guide](https://docs.astral.sh/uv/getting-started/installation/).
You can install the dependencies by running:
```bash
cd llama-stack
uv sync --group dev
uv pip install -e .
source .venv/bin/activate
```
```{note}
You can use a specific version of Python with `uv` by adding the `--python <version>` flag (e.g. `--python 3.12`).
Otherwise, `uv` will automatically select a Python version according to the `requires-python` section of the `pyproject.toml`.
For more info, see the [uv docs around Python versions](https://docs.astral.sh/uv/concepts/python-versions/).
```
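For example, to create the development environment with a pinned Python version:
```bash
uv sync --group dev --python 3.12
```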
Note that you can create a dotenv file `.env` that includes necessary environment variables:
```
LLAMA_STACK_BASE_URL=http://localhost:8321
LLAMA_STACK_CLIENT_LOG=debug
LLAMA_STACK_PORT=8321
LLAMA_STACK_CONFIG=<provider-name>
TAVILY_SEARCH_API_KEY=
BRAVE_SEARCH_API_KEY=
```
And then use this dotenv file when running client SDK tests via the following:
```bash
uv run --env-file .env -- pytest -v tests/integration/inference/test_text_inference.py --text-model=meta-llama/Llama-3.1-8B-Instruct
```
### Pre-commit Hooks
We use [pre-commit](https://pre-commit.com/) to run linting and formatting checks on your code. You can install the pre-commit hooks by running:
```bash
uv run pre-commit install
```
After that, pre-commit hooks will run automatically before each commit.
Alternatively, if you don't want to install the pre-commit hooks, you can run the checks manually by running:
```bash
uv run pre-commit run --all-files
```
```{caution}
Before pushing your changes, make sure that the pre-commit hooks have passed successfully.
```
## Discussions -> Issues -> Pull Requests
We actively welcome your pull requests. However, please read the following. This is heavily inspired by [Ghostty](https://github.com/ghostty-org/ghostty/blob/main/CONTRIBUTING.md).
If in doubt, please open a [discussion](https://github.com/meta-llama/llama-stack/discussions); we can always convert that to an issue later.
### Issues
We use GitHub issues to track public bugs. Please ensure your description is
clear and has sufficient instructions to be able to reproduce the issue.
Meta has a [bounty program](http://facebook.com/whitehat/info) for the safe
disclosure of security bugs. In those cases, please go through the process
outlined on that page and do not file a public issue.
### Contributor License Agreement ("CLA")
In order to accept your pull request, we need you to submit a CLA. You only need
to do this once to work on any of Meta's open source projects.
Complete your CLA here: <https://code.facebook.com/cla>
**I'd like to contribute!**
If you are new to the project, start by looking at the issues tagged with "good first issue". If you're interested,
leave a comment on the issue and a triager will assign it to you.
Please avoid picking up too many issues at once. This helps you stay focused and ensures that others in the community also have opportunities to contribute.
- Try to work on only 1-2 issues at a time, especially if you're still getting familiar with the codebase.
- Before taking an issue, check if it's already assigned or being actively discussed.
- If you're blocked or can't continue with an issue, feel free to unassign yourself or leave a comment so others can step in.
**I have a bug!**
1. Search the issue tracker and discussions for similar issues.
2. If you don't have steps to reproduce, open a discussion.
3. If you have steps to reproduce, open an issue.
**I have an idea for a feature!**
1. Open a discussion.
**I've implemented a feature!**
1. If there is an issue for the feature, open a pull request.
2. If there is no issue, open a discussion and link to your branch.
**I have a question!**
1. Open a discussion or use [Discord](https://discord.gg/llama-stack).
**Opening a Pull Request**
1. Fork the repo and create your branch from `main`.
2. If you've changed APIs, update the documentation.
3. Ensure the test suite passes.
4. Make sure your code lints using `pre-commit`.
5. If you haven't already, complete the Contributor License Agreement ("CLA").
6. Ensure your pull request follows the [conventional commits format](https://www.conventionalcommits.org/en/v1.0.0/); an example title is shown below.
7. Ensure your pull request follows the [coding style](#coding-style).
Please keep pull requests (PRs) small and focused. If you have a large set of changes, consider splitting them into logically grouped, smaller PRs to facilitate review and testing.
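For instance, a PR title following the conventional commits format might look like this (a made-up example):
```
feat: add hybrid search support to the example vector provider
```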
```{tip}
As a general guideline:
- Experienced contributors should try to keep no more than 5 open PRs at a time.
- New contributors are encouraged to have only one open PR at a time until they're familiar with the codebase and process.
```
## Repository guidelines
### Coding Style
* Comments should provide meaningful insights into the code. Avoid filler comments that simply
  describe the next step, as they create unnecessary clutter; the same goes for docstrings.
* Prefer comments that clarify surprising behavior and/or relationships between parts of the code
  rather than explain what the next line of code does.
* When catching exceptions, prefer a specific exception type rather than a broad catch-all like
  `Exception`.
* Error messages should be prefixed with "Failed to ..."
* Use 4 spaces for indentation rather than tabs.
* When using `# noqa` to suppress a style or linter warning, include a comment explaining the
justification for bypassing the check.
* When using `# type: ignore` to suppress a mypy warning, include a comment explaining the
justification for bypassing the check.
* Don't use unicode characters in the codebase; ASCII-only is preferred for compatibility and
  readability.
* Provider configuration classes should be Pydantic models whose fields include a `description`
  (via `Field`) that describes the configuration. These descriptions will be used to generate the
  provider documentation (see the example after this list).
* When possible, pass arguments by keyword when calling functions.
* Llama Stack utilizes [custom Exception classes](llama_stack/apis/common/errors.py) for certain Resources that should be used where applicable.
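To illustrate a few of these rules together, here is a short, hypothetical snippet (the config class and loader are invented for this example):

```python
from pydantic import BaseModel, Field


class ExampleProviderConfig(BaseModel):
    """Hypothetical provider configuration; each field carries a description."""

    api_key: str | None = Field(
        default=None,
        description="API key used to authenticate with the example service.",
    )
    timeout: int = Field(
        default=30,
        description="Request timeout in seconds.",
    )


def load_config(path: str) -> ExampleProviderConfig:
    try:
        with open(path) as f:
            return ExampleProviderConfig.model_validate_json(f.read())
    except OSError as e:  # specific exception type, not a broad `Exception`
        raise RuntimeError(f"Failed to load provider config from {path}") from e
```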
### License
By contributing to Llama, you agree that your contributions will be licensed
under the LICENSE file in the root directory of this source tree.
## Common Tasks
Some tips about common tasks you may work on while contributing to Llama Stack:
### Using `llama stack build`
Building a stack image will use the production version of the `llama-stack` and `llama-stack-client` packages. If you are developing with a llama-stack repository checked out and need your code to be reflected in the stack image, set `LLAMA_STACK_DIR` and `LLAMA_STACK_CLIENT_DIR` to the appropriate checked out directories when running any of the `llama` CLI commands.
Example:
```bash
cd work/
git clone https://github.com/meta-llama/llama-stack.git
git clone https://github.com/meta-llama/llama-stack-client-python.git
cd llama-stack
LLAMA_STACK_DIR=$(pwd) LLAMA_STACK_CLIENT_DIR=../llama-stack-client-python llama stack build --distro <...>
```
### Updating distribution configurations
If you have made changes to a provider's configuration in any form (introducing a new config key, or
changing models, etc.), you should run `./scripts/distro_codegen.py` to re-generate various YAML
files as well as the documentation. You should not change `docs/source/.../distributions/` files
manually as they are auto-generated.
### Updating the provider documentation
If you have made changes to a provider's configuration, you should run `./scripts/provider_codegen.py`
to re-generate the documentation. You should not change `docs/source/.../providers/` files manually
as they are auto-generated.
Note that the provider "description" field will be used to generate the provider documentation.
### Building the Documentation
If you are making changes to the documentation at [https://llamastack.github.io/latest/](https://llamastack.github.io/latest/), you can use the following commands to build the documentation and preview your changes. You will need [Sphinx](https://www.sphinx-doc.org/en/master/) and the readthedocs theme.
```bash
# This rebuilds the documentation pages.
uv run --group docs make -C docs/ html
# This will start a local server (usually at http://127.0.0.1:8000) that automatically rebuilds and refreshes when you make changes to the documentation.
uv run --group docs sphinx-autobuild docs/source docs/build/html --write-all
```
### Update API Documentation
If you modify or add new API endpoints, update the API documentation accordingly. You can do this by running the following command:
```bash
uv run ./docs/openapi_generator/run_openapi_generator.sh
```
The generated API documentation will be available in `docs/_static/`. Make sure to review the changes before committing.
## Adding a New Provider
See:
- [Adding a New API Provider Page](new_api_provider.md) which describes how to add new API providers to the Stack.
- [Vector Database Page](new_vector_database.md) which describes how to add a new vector database to Llama Stack.
- [External Provider Page](../providers/external/index.md) which describes how to add external providers to the Stack.
```{toctree}
:maxdepth: 1
:hidden:
new_api_provider
new_vector_database
```
## Testing
```{include} ../../../tests/README.md
```
## Advanced Topics
For developers who need deeper understanding of the testing system internals:
```{toctree}
:maxdepth: 1
testing/record-replay
```
### Benchmarking
```{include} ../../../benchmarking/k8s-benchmark/README.md
```


@@ -0,0 +1,98 @@
---
title: Adding a New API Provider
description: Guide for adding new API providers to Llama Stack
sidebar_label: New API Provider
sidebar_position: 2
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
This guide will walk you through the process of adding a new API provider to Llama Stack.
- Begin by reviewing the [core concepts](../concepts/index.md) of Llama Stack and choose the API your provider belongs to (Inference, Safety, VectorIO, etc.)
- Determine the provider type ({repopath}`Remote::llama_stack/providers/remote` or {repopath}`Inline::llama_stack/providers/inline`). Remote providers make requests to external services, while inline providers execute their implementation locally.
- Add your provider to the appropriate {repopath}`Registry::llama_stack/providers/registry/`. Specify any necessary pip dependencies.
- Update any distribution {repopath}`Templates::llama_stack/distributions/` `build.yaml` and `run.yaml` files if they should include your provider by default. Run {repopath}`./scripts/distro_codegen.py` if necessary. Note that `distro_codegen.py` will fail if the new provider causes any distribution template to attempt to import provider-specific dependencies. This usually means the distribution's `get_distribution_template()` code path should only import any necessary Config or model alias definitions from each provider and not the provider's actual implementation.
Here are some example PRs to help you get started:
- [Grok Inference Implementation](https://github.com/meta-llama/llama-stack/pull/609)
- [Nvidia Inference Implementation](https://github.com/meta-llama/llama-stack/pull/355)
- [Model context protocol Tool Runtime](https://github.com/meta-llama/llama-stack/pull/665)
## Guidelines for creating Internal or External Providers
|**Type** |Internal (in-tree) |External (out-of-tree) |
|---------|-------------------|-----------------------|
|**Description** |A provider that lives directly in the Llama Stack codebase. |A provider that lives outside the Llama Stack core codebase but is still accessible and usable by Llama Stack. |
|**Benefits** |Interact with the provider with minimal additional configuration or installation. |Contributors do not have to add code directly to Llama Stack to make providers available, and provider-specific code stays separate from the core Llama Stack code. |
## Inference Provider Patterns
When implementing Inference providers for OpenAI-compatible APIs, Llama Stack provides several mixin classes to simplify development and ensure consistent behavior across providers.
### OpenAIMixin
The `OpenAIMixin` class provides direct OpenAI API functionality for providers that work with OpenAI-compatible endpoints. It includes:
#### Direct API Methods
- **`openai_completion()`**: Legacy text completion API with full parameter support
- **`openai_chat_completion()`**: Chat completion API supporting streaming, tools, and function calling
- **`openai_embeddings()`**: Text embeddings generation with customizable encoding and dimensions
#### Model Management
- **`check_model_availability()`**: Queries the API endpoint to verify if a model exists and is accessible
#### Client Management
- **`client` property**: Automatically creates and configures AsyncOpenAI client instances using your provider's credentials
#### Required Implementation
To use `OpenAIMixin`, your provider must implement these abstract methods:
```python
from abc import abstractmethod


@abstractmethod
def get_api_key(self) -> str:
    """Return the API key for authentication"""
    pass


@abstractmethod
def get_base_url(self) -> str:
    """Return the OpenAI-compatible API base URL"""
    pass
```
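Putting this together, a minimal provider built on the mixin might look like the sketch below. The class, config, and endpoint are illustrative only; the import path assumes the mixin's location under `llama_stack/providers/utils/inference/`:

```python
from llama_stack.providers.utils.inference.openai_mixin import OpenAIMixin


class ExampleInferenceAdapter(OpenAIMixin):
    """Hypothetical adapter for an OpenAI-compatible inference endpoint."""

    def __init__(self, config):
        self.config = config  # assumed to carry the API key

    def get_api_key(self) -> str:
        return self.config.api_key

    def get_base_url(self) -> str:
        return "https://api.example.com/v1"  # illustrative endpoint
```

With these two methods in place, the mixin's `client` property and the `openai_*` methods work without further wiring.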
## Testing the Provider
Before running tests, you must have the required dependencies installed. This depends on the providers or distributions you are testing. For example, if you are testing the `together` distribution, you should install dependencies via `llama stack build --distro together`.
### 1. Integration Testing
Integration tests are located in {repopath}`tests/integration`. These tests use the Python client SDK APIs (from the `llama_stack_client` package) to test functionality. Since these tests use client APIs, they can be run either by pointing to an instance of the Llama Stack server or "inline" by using `LlamaStackAsLibraryClient`.
Consult {repopath}`tests/integration/README.md` for more details on how to run the tests.
Note that each provider's `sample_run_config()` method (in the configuration class for that provider)
typically references some environment variables for specifying API keys and the like. You can set these in the environment or pass these via the `--env` flag to the test command.
### 2. Unit Testing
Unit tests are located in {repopath}`tests/unit`. Provider-specific unit tests are located in {repopath}`tests/unit/providers`. These tests are all run automatically as part of the CI process.
Consult {repopath}`tests/unit/README.md` for more details on how to run the tests manually.
### 3. Additional end-to-end testing
1. Start a Llama Stack server with your new provider
2. Verify compatibility with existing client scripts in the [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main) repository
3. Document which scripts are compatible with your provider
## Submitting Your PR
1. Ensure all tests pass
2. Include a comprehensive test plan in your PR summary
3. Document any known limitations or considerations


@@ -0,0 +1,83 @@
---
title: Adding a New Vector Database
description: Guide for adding new vector database providers to Llama Stack
sidebar_label: New Vector Database
sidebar_position: 3
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
This guide will walk you through the process of adding a new vector database to Llama Stack.
> **_NOTE:_** For reference, see the example pull request for the [Milvus Vector Database Provider](https://github.com/meta-llama/llama-stack/pull/1467).

Vector database providers are used to store and retrieve vector embeddings. Vector databases are not limited to vector
search; they can also support keyword and hybrid search, as well as operations like filtering, sorting, and
aggregating vectors.
## Steps to Add a New Vector Database Provider
1. **Choose the Database Type**: Determine if your vector database is a remote service, inline, or both.
   - Remote databases make requests to external services, while inline databases execute locally. Some providers support both.
2. **Implement the Provider**: Create a new provider class that inherits from `VectorDatabaseProvider` and implements the required methods.
   - Implement methods for vector storage, retrieval, search, and any additional features your database supports.
   - You will need to implement the following methods for `YourVectorIndex` (a skeleton sketch of both classes appears after these lists):
     - `YourVectorIndex.create()`
     - `YourVectorIndex.initialize()`
     - `YourVectorIndex.add_chunks()`
     - `YourVectorIndex.delete_chunk()`
     - `YourVectorIndex.query_vector()`
     - `YourVectorIndex.query_keyword()`
     - `YourVectorIndex.query_hybrid()`
   - You will need to implement the following methods for `YourVectorIOAdapter`:
     - `YourVectorIOAdapter.initialize()`
     - `YourVectorIOAdapter.shutdown()`
     - `YourVectorIOAdapter.list_vector_dbs()`
     - `YourVectorIOAdapter.register_vector_db()`
     - `YourVectorIOAdapter.unregister_vector_db()`
     - `YourVectorIOAdapter.insert_chunks()`
     - `YourVectorIOAdapter.query_chunks()`
     - `YourVectorIOAdapter.delete_chunks()`
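Taken together, a skeleton of the two classes might look like the sketch below. Method signatures are simplified and the names are placeholders; consult an existing provider for the exact interfaces:

```python
class YourVectorIndex:
    """Sketch: wraps a single index/collection in your vector database."""

    @classmethod
    async def create(cls, *args, **kwargs) -> "YourVectorIndex": ...
    async def initialize(self) -> None: ...
    async def add_chunks(self, chunks) -> None: ...
    async def delete_chunk(self, chunk_id: str) -> None: ...
    async def query_vector(self, embedding, k, score_threshold): ...
    async def query_keyword(self, query, k, score_threshold): ...
    async def query_hybrid(self, embedding, query, k, score_threshold): ...


class YourVectorIOAdapter:
    """Sketch: manages vector DB lifecycles and delegates I/O to indexes."""

    async def initialize(self) -> None: ...
    async def shutdown(self) -> None: ...
    async def list_vector_dbs(self): ...
    async def register_vector_db(self, vector_db) -> None: ...
    async def unregister_vector_db(self, vector_db_id: str) -> None: ...
    async def insert_chunks(self, vector_db_id: str, chunks) -> None: ...
    async def query_chunks(self, vector_db_id: str, query, params=None): ...
    async def delete_chunks(self, vector_db_id: str, chunk_ids) -> None: ...
```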
3. **Add to Registry**: Register your provider in the appropriate registry file.
   - Update {repopath}`llama_stack/providers/registry/vector_io.py` to include your new provider.
```python
from llama_stack.providers.registry.specs import InlineProviderSpec
from llama_stack.providers.registry.api import Api

# Excerpt: this entry is added to the list of available vector_io providers.
InlineProviderSpec(
    api=Api.vector_io,
    provider_type="inline::milvus",
    pip_packages=["pymilvus>=2.4.10"],
    module="llama_stack.providers.inline.vector_io.milvus",
    config_class="llama_stack.providers.inline.vector_io.milvus.MilvusVectorIOConfig",
    api_dependencies=[Api.inference],
    optional_api_dependencies=[Api.files],
    description="",
),
```
4. **Add Tests**: Create unit tests and integration tests for your provider in the `tests/` directory.
   - Unit Tests
     - By following the structure of the class methods, you will be able to easily run unit and integration tests for your database.
       1. Configure the tests for your provider in `/tests/unit/providers/vector_io/conftest.py`.
       2. Update the `vector_provider` fixture to include your provider if it is an inline provider.
       3. Create a `your_vectorprovider_index` fixture that initializes your vector index.
       4. Create a `your_vectorprovider_adapter` fixture that initializes your vector adapter.
       5. Add your provider to the `vector_io_providers` fixture dictionary.
     - Please follow the naming convention of `your_vectorprovider_index` and `your_vectorprovider_adapter`, as the tests require these names to execute properly.
   - Integration Tests
     - Integration tests are located in {repopath}`tests/integration`. These tests use the Python client SDK APIs (from the `llama_stack_client` package) to test functionality.
     - The two sets of integration tests are:
       - `tests/integration/vector_io/test_vector_io.py`: tests registration, insertion, and retrieval.
       - `tests/integration/vector_io/test_openai_vector_stores.py`: tests OpenAI-compatible vector stores and OpenAI API compatibility.
     - You will need to update `skip_if_provider_doesnt_support_openai_vector_stores`, as well as `skip_if_provider_doesnt_support_openai_vector_stores_search`, to include your provider so the appropriate search functionality is tested.
   - Running the tests in the GitHub CI
     - You will need to update the `.github/workflows/integration-vector-io-tests.yml` file to include your provider.
     - If your provider is a remote provider, you will also have to add a container to spin it up and run it in the action.
   - Updating `pyproject.toml`
     - If you are adding tests for an `inline` provider, update the `unit` group:
       - `uv add new_pip_package --group unit`
     - If you are adding tests for a `remote` provider, update the `test` group, which is used in the GitHub CI for integration tests:
       - `uv add new_pip_package --group test`
5. **Update Documentation**: Please update the documentation for end users.
   - Generate the provider documentation by running {repopath}`./scripts/provider_codegen.py`.
   - Update the provider `description` in `llama_stack/providers/registry/vector_io.py` with information about your provider. Please see other providers for examples.


@@ -0,0 +1,241 @@
---
title: Record-Replay Testing System
description: Understanding how Llama Stack captures and replays API interactions for testing
sidebar_label: Record-Replay System
sidebar_position: 4
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# Record-Replay System
Understanding how Llama Stack captures and replays API interactions for testing.
## Overview
The record-replay system solves a fundamental challenge in AI testing: how do you test against expensive, non-deterministic APIs without breaking the bank or dealing with flaky tests?
The solution: intercept API calls, store real responses, and replay them later. This gives you real API behavior without the cost or variability.
## How It Works
### Request Hashing
Every API request gets converted to a deterministic hash for lookup:
```python
import hashlib
import json
from urllib.parse import urlparse


def normalize_request(method: str, url: str, headers: dict, body: dict) -> str:
    normalized = {
        "method": method.upper(),
        "endpoint": urlparse(url).path,  # Just the path, not the full URL
        "body": body,  # Request parameters
    }
    return hashlib.sha256(json.dumps(normalized, sort_keys=True).encode()).hexdigest()
```
**Key insight:** The hashing is intentionally precise. Different whitespace, float precision, or parameter order produces different hashes. This prevents subtle bugs from false cache hits.
```python
# These produce DIFFERENT hashes:
{"content": "Hello world"}
{"content": "Hello world\n"}
{"temperature": 0.7}
{"temperature": 0.7000001}
```
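For instance, calling the helper on two nearly identical requests yields different digests (the URL here is a placeholder):

```python
h1 = normalize_request("POST", "https://api.example.com/v1/chat/completions", {}, {"content": "Hello world"})
h2 = normalize_request("POST", "https://api.example.com/v1/chat/completions", {}, {"content": "Hello world\n"})
assert h1 != h2  # the trailing newline alone changes the hash
```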
### Client Interception
The system patches OpenAI and Ollama client methods to intercept calls before they leave your application. This happens transparently - your test code doesn't change.
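A simplified sketch of the patching idea, assuming the OpenAI Python SDK v1 module layout (the real implementation lives in the test harness and also covers Ollama):

```python
from openai.resources.chat import completions

_original_create = completions.AsyncCompletions.create


async def _intercepted_create(self, *args, **kwargs):
    # In RECORD mode: call through, then persist the response keyed by request hash.
    # In REPLAY mode: look up the hash and return the stored response instead.
    return await _original_create(self, *args, **kwargs)


completions.AsyncCompletions.create = _intercepted_create
```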
### Storage Architecture
Recordings are stored as JSON files in the recording directory. They are looked up by their request hash.
```
recordings/
└── responses/
├── abc123def456.json # Individual response files
└── def789ghi012.json
```
**JSON files** store complete request/response pairs in human-readable format for debugging.
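A minimal sketch of storing and looking up a recording by hash (paths and field names follow the layout above but are otherwise illustrative):

```python
import json
from pathlib import Path


def store_recording(storage_dir: str, request_hash: str, request: dict, response: dict) -> None:
    # Persist one request/response pair under its hash.
    path = Path(storage_dir) / "responses" / f"{request_hash}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"request": request, "response": response}, indent=2))


def load_recording(storage_dir: str, request_hash: str) -> dict | None:
    # Look up a stored request/response pair; None if never recorded.
    path = Path(storage_dir) / "responses" / f"{request_hash}.json"
    return json.loads(path.read_text()) if path.exists() else None
```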
## Recording Modes
### LIVE Mode
Direct API calls with no recording or replay:
```python
with inference_recording(mode=InferenceMode.LIVE):
    response = await client.chat.completions.create(...)
```
Use for initial development and debugging against real APIs.
### RECORD Mode
Captures API interactions while passing through real responses:
```python
with inference_recording(mode=InferenceMode.RECORD, storage_dir="./recordings"):
    response = await client.chat.completions.create(...)
    # Real API call made, response captured AND returned
```
The recording process:
1. Request intercepted and hashed
2. Real API call executed
3. Response captured and serialized
4. Recording stored to disk
5. Original response returned to caller
### REPLAY Mode
Returns stored responses instead of making API calls:
```python
with inference_recording(mode=InferenceMode.REPLAY, storage_dir="./recordings"):
    response = await client.chat.completions.create(...)
    # No API call made, cached response returned instantly
```
The replay process:
1. Request intercepted and hashed
2. Hash looked up in SQLite index
3. Response loaded from JSON file
4. Response deserialized and returned
5. Error if no recording found
## Streaming Support
Streaming APIs present a unique challenge: how do you capture an async generator?
### The Problem
```python
# How do you record this?
async for chunk in client.chat.completions.create(stream=True):
process(chunk)
```
### The Solution
The system captures all chunks immediately before yielding any:
```python
async def handle_streaming_record(response):
    # Capture complete stream first
    chunks = []
    async for chunk in response:
        chunks.append(chunk)

    # Store complete recording
    storage.store_recording(
        request_hash, request_data, {"body": chunks, "is_streaming": True}
    )

    # Return generator that replays captured chunks
    async def replay_stream():
        for chunk in chunks:
            yield chunk

    return replay_stream()
```
This ensures:
- **Complete capture** - The entire stream is saved atomically
- **Interface preservation** - The returned object behaves like the original API
- **Deterministic replay** - Same chunks in the same order every time
## Serialization
API responses contain complex Pydantic objects that need careful serialization:
```python
def _serialize_response(response):
    if hasattr(response, "model_dump"):
        # Preserve type information for proper deserialization
        return {
            "__type__": f"{response.__class__.__module__}.{response.__class__.__qualname__}",
            "__data__": response.model_dump(mode="json"),
        }
    return response
```
This preserves type safety - when replayed, you get the same Pydantic objects with all their validation and methods.
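The inverse operation (a sketch; the real deserializer may differ) re-imports the recorded class and validates the stored payload:

```python
import importlib


def _deserialize_response(data):
    if isinstance(data, dict) and "__type__" in data:
        # Note: rpartition handles the common case; nested classes whose
        # qualname contains dots would need extra care.
        module_name, _, class_name = data["__type__"].rpartition(".")
        cls = getattr(importlib.import_module(module_name), class_name)
        return cls.model_validate(data["__data__"])
    return data
```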
## Environment Integration
### Environment Variables
Control recording behavior globally:
```bash
export LLAMA_STACK_TEST_INFERENCE_MODE=replay # this is the default
export LLAMA_STACK_TEST_RECORDING_DIR=/path/to/recordings # default is tests/integration/recordings
pytest tests/integration/
```
### Pytest Integration
The system integrates automatically based on environment variables, requiring no changes to test code.
## Debugging Recordings
### Inspecting Storage
```bash
# See what's recorded
sqlite3 recordings/index.sqlite "SELECT endpoint, model, timestamp FROM recordings LIMIT 10;"
# View specific response
cat recordings/responses/abc123def456.json | jq '.response.body'
# Find recordings by endpoint
sqlite3 recordings/index.sqlite "SELECT * FROM recordings WHERE endpoint='/v1/chat/completions';"
```
### Common Issues
**Hash mismatches:** Request parameters changed slightly between record and replay
```bash
# Compare request details
cat recordings/responses/abc123.json | jq '.request'
```
**Serialization errors:** Response types changed between versions
```bash
# Re-record with updated types
rm recordings/responses/failing_hash.json
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest test_failing.py
```
**Missing recordings:** New test or changed parameters
```bash
# Record the missing interaction
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest test_new.py
```
## Design Decisions
### Why Not Mocks?
Traditional mocking breaks down with AI APIs because:
- Response structures are complex and evolve frequently
- Streaming behavior is hard to mock correctly
- Edge cases in real APIs get missed
- Mocks become brittle maintenance burdens
### Why Precise Hashing?
Loose hashing (normalizing whitespace, rounding floats) seems convenient but hides bugs. If a test changes slightly, you want to know about it rather than accidentally getting the wrong cached response.
### Why JSON + SQLite?
- **JSON** - Human readable, diff-friendly, easy to inspect and modify
- **SQLite** - Fast indexed lookups without loading response bodies
- **Hybrid** - Best of both worlds for different use cases
This system provides reliable, fast testing against real AI APIs while maintaining the ability to debug issues when they arise.