Commit graph

2 commits

Author SHA1 Message Date
Ben Browning
1485f3bb4c Expand file types tested with file_search
This expands the file types tested with file_search to include Word
documents (.docx), Markdown (.md), text (.txt), PDF (.pdf), and
PowerPoint (.pptx) files.

Python's mimetypes library doesn't actually recognize markdown docs as
text, so we have to handle that case specifically instead of relying
on mimetypes to get it right.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-06-27 13:42:05 -04:00
Ben Browning
8bf1d91d38 feat: Add synthetic-data-kit for file_search doc conversion
This adds a `builtin::document_conversion` tool for converting
documents when used with file_search that uses
meta-llama/synthetic-data-kit. I also have another local
implementation that uses Docling, but need to debug some segfault
issues I'm hitting locally with that so pushing this first as a
simpler reference implementation.

Long-term I think we'll want a remote implemention here as well - like
perhaps docling-serve or unstructured.io - but need to look more into
that.

This passes the existing
`tests/verifications/openai_api/test_responses.py` but doesn't yet add
any new tests for file types besides text and pdf.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-06-27 13:31:38 -04:00