This expands the file types tested with file_search to include Word
documents (.docx), Markdown (.md), text (.txt), PDF (.pdf), and
PowerPoint (.pptx) files.
Python's mimetypes library doesn't recognize Markdown docs as text, so
we handle that case explicitly instead of relying on mimetypes to get
it right.
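A minimal sketch of that special case, not the code in this change: the
helper name `guess_mime_type`, the `MARKDOWN_EXTENSIONS` set, and the
`application/octet-stream` fallback are all assumptions made for
illustration. Only `mimetypes.guess_type` is the real standard-library
call.

```python
import mimetypes
from pathlib import Path

# Hypothetical set of extensions treated as Markdown (an assumption).
MARKDOWN_EXTENSIONS = {".md", ".markdown"}


def guess_mime_type(filename: str) -> str:
    """Return a MIME type for filename, special-casing Markdown."""
    suffix = Path(filename).suffix.lower()
    # Check Markdown first so the result does not depend on whether
    # mimetypes knows about .md on this Python version or platform.
    if suffix in MARKDOWN_EXTENSIONS:
        return "text/markdown"
    mime_type, _encoding = mimetypes.guess_type(filename)
    # Assumed fallback for unknown types; the real code may differ.
    return mime_type or "application/octet-stream"
```

Checking the extension before calling `mimetypes.guess_type` keeps the
Markdown result the same everywhere, while every other file type still
goes through the standard library.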
Signed-off-by: Ben Browning <bbrownin@redhat.com>
This adds a `builtin::document_conversion` tool, built on
meta-llama/synthetic-data-kit, that converts documents for use with
file_search. I also have another local implementation that uses
Docling, but I still need to debug some segfaults I'm hitting locally
with it, so I'm pushing this one first as a simpler reference
implementation.
Long-term I think we'll want a remote implementation here as well,
perhaps docling-serve or unstructured.io, but I need to look into that
more.
This passes the existing
`tests/verifications/openai_api/test_responses.py` but doesn't yet add
any new tests for file types besides text and pdf.
Signed-off-by: Ben Browning <bbrownin@redhat.com>