fix: Gracefully handle errors when listing MCP tools

When listing (and lazily indexing) tools, it's possible for an error to get thrown by individual toolgroups if for example an MCP toolgroup is unable to connect to its `mcp_endpoint`. This logs a warning in the server when that happens, logs a full stack trace of the error if debug logging is enabled, and just returns the list of tools from all working toolgroups instead of throwing an error to the client when a single toolgroup is temporarily or permanently misbehaving.

Fixes #2540

A new unit test was added to test this exception handling, which is run as part of our regular test suite but also manually run to specifically verify this fix via:

```
uv run pytest -sv --asyncio-mode=auto \
  tests/unit/distribution/routers/test_routing_tables.py
```

To verify the additional debug logging is printing properly:

```
LLAMA_STACK_LOGGING=core=debug \
  uv run pytest -sv --asyncio-mode=auto \
  tests/unit/distribution/routers/test_routing_tables.py
```

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-12-08 11:07:22 +00:00 · 2025-06-27 11:43:39 -04:00 · 2025-06-27 11:43:39 -04:00 · c658c5912c
commit c658c5912c
parent c88c4ff2c6
2 changed files with 32 additions and 3 deletions
--- a/llama_stack/core/routing_tables/toolgroups.py
+++ b/llama_stack/core/routing_tables/toolgroups.py
@@ -53,9 +53,14 @@ class ToolGroupsRoutingTable(CommonRoutingTableImpl, ToolGroups):

        all_tools = []
        for toolgroup in toolgroups:
-            if toolgroup.identifier not in self.toolgroups_to_tools:
-                await self._index_tools(toolgroup)
-            all_tools.extend(self.toolgroups_to_tools[toolgroup.identifier])
+            try:
+                if toolgroup.identifier not in self.toolgroups_to_tools:
+                    await self._index_tools(toolgroup)
+                all_tools.extend(self.toolgroups_to_tools[toolgroup.identifier])
+            except Exception as e:
+                logger.warning(f"Error listing tools for toolgroup {toolgroup.identifier}: {e}")
+                logger.debug(e, exc_info=True)
+                continue

        return ListToolsResponse(data=all_tools)
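
The behavior of the hunk above can be tried in isolation with a small standalone sketch. This is not the llama-stack implementation: `FakeToolGroup`, `FakeRoutingTable`, and `demo` are hypothetical stand-ins invented for illustration, and a plain list is returned instead of `ListToolsResponse`. Only the try/except loop mirrors the diff.

```python
import asyncio
import logging

logger = logging.getLogger("toolgroups_sketch")


class FakeToolGroup:
    """Hypothetical stand-in for a registered toolgroup (not a llama-stack type)."""

    def __init__(self, identifier, tools=None, fail=False):
        self.identifier = identifier
        self.tools = tools or []
        self.fail = fail  # simulate an unreachable mcp_endpoint


class FakeRoutingTable:
    """Hypothetical routing table; only list_tools mirrors the diff above."""

    def __init__(self, toolgroups):
        self.toolgroups = toolgroups
        self.toolgroups_to_tools = {}

    async def _index_tools(self, toolgroup):
        # Assumed failure mode: indexing raises when the endpoint is down.
        if toolgroup.fail:
            raise ConnectionError(f"cannot reach endpoint for {toolgroup.identifier}")
        self.toolgroups_to_tools[toolgroup.identifier] = list(toolgroup.tools)

    async def list_tools(self):
        all_tools = []
        for toolgroup in self.toolgroups:
            try:
                if toolgroup.identifier not in self.toolgroups_to_tools:
                    await self._index_tools(toolgroup)
                all_tools.extend(self.toolgroups_to_tools[toolgroup.identifier])
            except Exception as e:
                # Warn briefly; the stack trace only appears at debug level.
                logger.warning(f"Error listing tools for toolgroup {toolgroup.identifier}: {e}")
                logger.debug(e, exc_info=True)
                continue
        return all_tools


def demo():
    table = FakeRoutingTable(
        [
            FakeToolGroup("web", tools=["search", "fetch"]),
            FakeToolGroup("broken-mcp", fail=True),
            FakeToolGroup("math", tools=["calc"]),
        ]
    )
    tools = asyncio.run(table.list_tools())
    return tools, table.toolgroups_to_tools


if __name__ == "__main__":
    print(demo()[0])
```

Because the failing toolgroup never gets an entry in `toolgroups_to_tools`, the next call to `list_tools` attempts to index it again, so a toolgroup that was only temporarily unreachable would be picked up once it recovers.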