feat: Add synthetic-data-kit for file_search doc conversion

This adds a `builtin::document_conversion` tool, backed by
meta-llama/synthetic-data-kit, that converts documents for use with
file_search. I also have another local implementation that uses
Docling, but I need to debug some segfaults I'm hitting locally with
it, so I'm pushing this first as a simpler reference implementation.

Long-term I think we'll want a remote implementation here as well,
perhaps docling-serve or unstructured.io, but I need to look into that
more.

This passes the existing
`tests/verifications/openai_api/test_responses.py` but doesn't yet add
any new tests for file types other than text and PDF.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
Ben Browning 2025-06-20 18:09:14 -04:00
parent 6fde601765
commit e56690abef
18 changed files with 230 additions and 18 deletions

@@ -34,6 +34,14 @@ def available_providers() -> list[ProviderSpec]:
            config_class="llama_stack.providers.inline.tool_runtime.rag.config.RagToolRuntimeConfig",
            api_dependencies=[Api.vector_io, Api.inference],
        ),
        InlineProviderSpec(
            api=Api.tool_runtime,
            provider_type="inline::synthetic-data-kit",
            pip_packages=["synthetic-data-kit"],
            module="llama_stack.providers.inline.tool_runtime.synthetic-data-kit",
            config_class="llama_stack.providers.inline.tool_runtime.synthetic-data-kit.config.SyntheticDataKitToolRuntimeConfig",
            api_dependencies=[Api.files],
        ),
        remote_provider_spec(
            api=Api.tool_runtime,
            adapter=AdapterSpec(