feat: Add synthetic-data-kit for file_search doc conversion

This adds a `builtin::document_conversion` tool, backed by
meta-llama/synthetic-data-kit, for converting documents when used
with file_search. I also have another local implementation that uses
Docling, but I need to debug some segfaults I'm hitting locally with
it, so I'm pushing this one first as a simpler reference
implementation.

Long-term I think we'll want a remote implementation here as well,
perhaps docling-serve or unstructured.io, but I need to look into
that more.

This passes the existing
`tests/verifications/openai_api/test_responses.py` but doesn't yet add
any new tests for file types besides text and pdf.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
Ben Browning 2025-06-20 18:09:14 -04:00
parent 9baa16e498
commit 8bf1d91d38
18 changed files with 230 additions and 18 deletions

@@ -31,6 +31,7 @@ distribution_spec:
- remote::brave-search
- remote::tavily-search
- inline::rag-runtime
- inline::synthetic-data-kit
- remote::model-context-protocol
- remote::wolfram-alpha
image_type: conda

@@ -36,6 +36,7 @@ def get_distribution_template() -> DistributionTemplate:
"remote::brave-search",
"remote::tavily-search",
"inline::rag-runtime",
"inline::synthetic-data-kit",
"remote::model-context-protocol",
"remote::wolfram-alpha",
],
@@ -91,6 +92,10 @@ def get_distribution_template() -> DistributionTemplate:
toolgroup_id="builtin::wolfram_alpha",
provider_id="wolfram-alpha",
),
ToolGroupInput(
toolgroup_id="builtin::document_conversion",
provider_id="synthetic-data-kit",
),
]
return DistributionTemplate(

@@ -115,6 +115,9 @@ providers:
- provider_id: rag-runtime
provider_type: inline::rag-runtime
config: {}
- provider_id: synthetic-data-kit
provider_type: inline::synthetic-data-kit
config: {}
- provider_id: model-context-protocol
provider_type: remote::model-context-protocol
config: {}
@@ -159,5 +162,7 @@ tool_groups:
provider_id: rag-runtime
- toolgroup_id: builtin::wolfram_alpha
provider_id: wolfram-alpha
- toolgroup_id: builtin::document_conversion
provider_id: synthetic-data-kit
server:
port: 8321

@@ -113,6 +113,9 @@ providers:
- provider_id: rag-runtime
provider_type: inline::rag-runtime
config: {}
- provider_id: synthetic-data-kit
provider_type: inline::synthetic-data-kit
config: {}
- provider_id: model-context-protocol
provider_type: remote::model-context-protocol
config: {}
@@ -149,5 +152,7 @@ tool_groups:
provider_id: rag-runtime
- toolgroup_id: builtin::wolfram_alpha
provider_id: wolfram-alpha
- toolgroup_id: builtin::document_conversion
provider_id: synthetic-data-kit
server:
port: 8321
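
The run config hunks above add two entries that have to agree: a provider with `provider_id: synthetic-data-kit` and a tool group that points at that same `provider_id`. A minimal sketch of a consistency check for that wiring is shown below. It is not part of this commit; the helper name `missing_providers` is hypothetical, and the data is modeled as plain dicts holding only the entries visible in the diff.

```python
# Hypothetical helper, not part of the commit: report tool groups whose
# provider_id does not match any declared provider.
def missing_providers(providers, tool_groups):
    """Return the toolgroup_ids that reference an undeclared provider_id."""
    declared = {p["provider_id"] for p in providers}
    return [
        g["toolgroup_id"]
        for g in tool_groups
        if g["provider_id"] not in declared
    ]


# Provider entries copied from the hunks above (only those shown in the diff).
providers = [
    {"provider_id": "rag-runtime", "provider_type": "inline::rag-runtime", "config": {}},
    {"provider_id": "synthetic-data-kit", "provider_type": "inline::synthetic-data-kit", "config": {}},
    {"provider_id": "model-context-protocol", "provider_type": "remote::model-context-protocol", "config": {}},
]

# The tool group entry added by this commit.
tool_groups = [
    {"toolgroup_id": "builtin::document_conversion", "provider_id": "synthetic-data-kit"},
]

# An empty list means every tool group resolves to a declared provider.
unresolved = missing_providers(providers, tool_groups)
print(unresolved)
```

Dropping the `synthetic-data-kit` provider from the list would make the check return `builtin::document_conversion`, which is the mismatch a half-applied version of this change would produce.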