From d95853d784491c9f5475ed8ed117e1dddcb3f7c2 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon, 29 Sep 2025 10:02:51 +0200 Subject: [PATCH 01/13] chore(ui-deps): bump shiki from 1.29.2 to 3.13.0 in /llama_stack/ui (#3585) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Bumps [shiki](https://github.com/shikijs/shiki/tree/HEAD/packages/shiki) from 1.29.2 to 3.13.0.
Release notes

Sourced from shiki's releases (per-release details truncated in this excerpt).

  • v3.13.0: 🚀 Features
  • v3.12.3: 🐞 Bug Fixes
  • v3.12.2: 🐞 Bug Fixes
  • v3.12.1: No significant changes
  • v3.12.0: 🚀 Features, 🐞 Bug Fixes
  • v3.11.0: 🚀 Features
  • v3.10.0: 🚀 Features

... (truncated)

Commits
Maintainer changes

This version was pushed to npm by GitHub Actions, a new releaser for shiki since your current version.


[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=shiki&package-manager=npm_and_yarn&previous-version=1.29.2&new-version=3.13.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---
Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR:

- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> --- llama_stack/ui/package-lock.json | 119 +++++++++++++++---------------- llama_stack/ui/package.json | 2 +- 2 files changed, 59 insertions(+), 62 deletions(-) diff --git a/llama_stack/ui/package-lock.json b/llama_stack/ui/package-lock.json index 21dec59c3..979bc0781 100644 --- a/llama_stack/ui/package-lock.json +++ b/llama_stack/ui/package-lock.json @@ -28,7 +28,7 @@ "react-markdown": "^10.1.0", "remark-gfm": "^4.0.1", "remeda": "^2.32.0", - "shiki": "^1.29.2", + "shiki": "^3.13.0", "sonner": "^2.0.7", "tailwind-merge": "^3.3.1" }, @@ -3250,65 +3250,63 @@ "license": "MIT" }, "node_modules/@shikijs/core": { - "version": "1.29.2", - "resolved": "https://registry.npmjs.org/@shikijs/core/-/core-1.29.2.tgz", - "integrity": "sha512-vju0lY9r27jJfOY4Z7+Rt/nIOjzJpZ3y+nYpqtUZInVoXQ/TJZcfGnNOGnKjFdVZb8qexiCuSlZRKcGfhhTTZQ==", + "version": "3.13.0", + "resolved": "https://registry.npmjs.org/@shikijs/core/-/core-3.13.0.tgz", + "integrity": "sha512-3P8rGsg2Eh2qIHekwuQjzWhKI4jV97PhvYjYUzGqjvJfqdQPz+nMlfWahU24GZAyW1FxFI1sYjyhfh5CoLmIUA==", "license": "MIT", "dependencies": { - "@shikijs/engine-javascript": "1.29.2", - "@shikijs/engine-oniguruma": "1.29.2", - "@shikijs/types": "1.29.2", - "@shikijs/vscode-textmate": "^10.0.1", + "@shikijs/types": "3.13.0", + "@shikijs/vscode-textmate": "^10.0.2", "@types/hast": "^3.0.4", - "hast-util-to-html": "^9.0.4" + "hast-util-to-html": "^9.0.5" } }, "node_modules/@shikijs/engine-javascript": { - "version": "1.29.2", - "resolved": "https://registry.npmjs.org/@shikijs/engine-javascript/-/engine-javascript-1.29.2.tgz", - "integrity": "sha512-iNEZv4IrLYPv64Q6k7EPpOCE/nuvGiKl7zxdq0WFuRPF5PAE9PRo2JGq/d8crLusM59BRemJ4eOqrFrC4wiQ+A==", + "version": "3.13.0", + "resolved": "https://registry.npmjs.org/@shikijs/engine-javascript/-/engine-javascript-3.13.0.tgz", + "integrity": "sha512-Ty7xv32XCp8u0eQt8rItpMs6rU9Ki6LJ1dQOW3V/56PKDcpvfHPnYFbsx5FFUP2Yim34m/UkazidamMNVR4vKg==", "license": "MIT", "dependencies": { - "@shikijs/types": "1.29.2", - "@shikijs/vscode-textmate": "^10.0.1", - "oniguruma-to-es": "^2.2.0" + "@shikijs/types": "3.13.0", + "@shikijs/vscode-textmate": "^10.0.2", + "oniguruma-to-es": "^4.3.3" } }, "node_modules/@shikijs/engine-oniguruma": { - "version": "1.29.2", - "resolved": "https://registry.npmjs.org/@shikijs/engine-oniguruma/-/engine-oniguruma-1.29.2.tgz", - "integrity": "sha512-7iiOx3SG8+g1MnlzZVDYiaeHe7Ez2Kf2HrJzdmGwkRisT7r4rak0e655AcM/tF9JG/kg5fMNYlLLKglbN7gBqA==", + "version": "3.13.0", + "resolved": "https://registry.npmjs.org/@shikijs/engine-oniguruma/-/engine-oniguruma-3.13.0.tgz", + "integrity": "sha512-O42rBGr4UDSlhT2ZFMxqM7QzIU+IcpoTMzb3W7AlziI1ZF7R8eS2M0yt5Ry35nnnTX/LTLXFPUjRFCIW+Operg==", "license": "MIT", "dependencies": { - "@shikijs/types": "1.29.2", - "@shikijs/vscode-textmate": "^10.0.1" + "@shikijs/types": "3.13.0", + "@shikijs/vscode-textmate": "^10.0.2" } }, "node_modules/@shikijs/langs": { - "version": "1.29.2", - "resolved": "https://registry.npmjs.org/@shikijs/langs/-/langs-1.29.2.tgz", - "integrity": "sha512-FIBA7N3LZ+223U7cJDUYd5shmciFQlYkFXlkKVaHsCPgfVLiO+e12FmQE6Tf9vuyEsFe3dIl8qGWKXgEHL9wmQ==", + "version": "3.13.0", + "resolved": "https://registry.npmjs.org/@shikijs/langs/-/langs-3.13.0.tgz", + "integrity": "sha512-672c3WAETDYHwrRP0yLy3W1QYB89Hbpj+pO4KhxK6FzIrDI2FoEXNiNCut6BQmEApYLfuYfpgOZaqbY+E9b8wQ==", "license": "MIT", "dependencies": { - "@shikijs/types": "1.29.2" + "@shikijs/types": "3.13.0" } }, 
"node_modules/@shikijs/themes": { - "version": "1.29.2", - "resolved": "https://registry.npmjs.org/@shikijs/themes/-/themes-1.29.2.tgz", - "integrity": "sha512-i9TNZlsq4uoyqSbluIcZkmPL9Bfi3djVxRnofUHwvx/h6SRW3cwgBC5SML7vsDcWyukY0eCzVN980rqP6qNl9g==", + "version": "3.13.0", + "resolved": "https://registry.npmjs.org/@shikijs/themes/-/themes-3.13.0.tgz", + "integrity": "sha512-Vxw1Nm1/Od8jyA7QuAenaV78BG2nSr3/gCGdBkLpfLscddCkzkL36Q5b67SrLLfvAJTOUzW39x4FHVCFriPVgg==", "license": "MIT", "dependencies": { - "@shikijs/types": "1.29.2" + "@shikijs/types": "3.13.0" } }, "node_modules/@shikijs/types": { - "version": "1.29.2", - "resolved": "https://registry.npmjs.org/@shikijs/types/-/types-1.29.2.tgz", - "integrity": "sha512-VJjK0eIijTZf0QSTODEXCqinjBn0joAHQ+aPSBzrv4O2d/QSbsMw+ZeSRx03kV34Hy7NzUvV/7NqfYGRLrASmw==", + "version": "3.13.0", + "resolved": "https://registry.npmjs.org/@shikijs/types/-/types-3.13.0.tgz", + "integrity": "sha512-oM9P+NCFri/mmQ8LoFGVfVyemm5Hi27330zuOBp0annwJdKH1kOLndw3zCtAVDehPLg9fKqoEx3Ht/wNZxolfw==", "license": "MIT", "dependencies": { - "@shikijs/vscode-textmate": "^10.0.1", + "@shikijs/vscode-textmate": "^10.0.2", "@types/hast": "^3.0.4" } }, @@ -6084,12 +6082,6 @@ "dev": true, "license": "MIT" }, - "node_modules/emoji-regex-xs": { - "version": "1.0.0", - "resolved": "https://registry.npmjs.org/emoji-regex-xs/-/emoji-regex-xs-1.0.0.tgz", - "integrity": "sha512-LRlerrMYoIDrT6jgpeZ2YYl/L8EulRTt5hQcYjy5AInh7HWXKimpqx68aknBFpGL2+/IcogTcaydJEgaTmOpDg==", - "license": "MIT" - }, "node_modules/encodeurl": { "version": "2.0.0", "resolved": "https://registry.npmjs.org/encodeurl/-/encodeurl-2.0.0.tgz", @@ -11813,15 +11805,21 @@ "url": "https://github.com/sponsors/sindresorhus" } }, + "node_modules/oniguruma-parser": { + "version": "0.12.1", + "resolved": "https://registry.npmjs.org/oniguruma-parser/-/oniguruma-parser-0.12.1.tgz", + "integrity": "sha512-8Unqkvk1RYc6yq2WBYRj4hdnsAxVze8i7iPfQr8e4uSP3tRv0rpZcbGUDvxfQQcdwHt/e9PrMvGCsa8OqG9X3w==", + "license": "MIT" + }, "node_modules/oniguruma-to-es": { - "version": "2.3.0", - "resolved": "https://registry.npmjs.org/oniguruma-to-es/-/oniguruma-to-es-2.3.0.tgz", - "integrity": "sha512-bwALDxriqfKGfUufKGGepCzu9x7nJQuoRoAFp4AnwehhC2crqrDIAP/uN2qdlsAvSMpeRC3+Yzhqc7hLmle5+g==", + "version": "4.3.3", + "resolved": "https://registry.npmjs.org/oniguruma-to-es/-/oniguruma-to-es-4.3.3.tgz", + "integrity": "sha512-rPiZhzC3wXwE59YQMRDodUwwT9FZ9nNBwQQfsd1wfdtlKEyCdRV0avrTcSZ5xlIvGRVPd/cx6ZN45ECmS39xvg==", "license": "MIT", "dependencies": { - "emoji-regex-xs": "^1.0.0", - "regex": "^5.1.1", - "regex-recursion": "^5.1.1" + "oniguruma-parser": "^0.12.1", + "regex": "^6.0.1", + "regex-recursion": "^6.0.2" } }, "node_modules/openid-client": { @@ -12613,21 +12611,20 @@ } }, "node_modules/regex": { - "version": "5.1.1", - "resolved": "https://registry.npmjs.org/regex/-/regex-5.1.1.tgz", - "integrity": "sha512-dN5I359AVGPnwzJm2jN1k0W9LPZ+ePvoOeVMMfqIMFz53sSwXkxaJoxr50ptnsC771lK95BnTrVSZxq0b9yCGw==", + "version": "6.0.1", + "resolved": "https://registry.npmjs.org/regex/-/regex-6.0.1.tgz", + "integrity": "sha512-uorlqlzAKjKQZ5P+kTJr3eeJGSVroLKoHmquUj4zHWuR+hEyNqlXsSKlYYF5F4NI6nl7tWCs0apKJ0lmfsXAPA==", "license": "MIT", "dependencies": { "regex-utilities": "^2.3.0" } }, "node_modules/regex-recursion": { - "version": "5.1.1", - "resolved": "https://registry.npmjs.org/regex-recursion/-/regex-recursion-5.1.1.tgz", - "integrity": "sha512-ae7SBCbzVNrIjgSbh7wMznPcQel1DNlDtzensnFxpiNpXt1U2ju/bHugH422r+4LAVS1FpW1YCwilmnNsjum9w==", + "version": "6.0.2", + "resolved": 
"https://registry.npmjs.org/regex-recursion/-/regex-recursion-6.0.2.tgz", + "integrity": "sha512-0YCaSCq2VRIebiaUviZNs0cBz1kg5kVS2UKUfNIx8YVs1cN3AV7NTctO5FOKBA+UT2BPJIWZauYHPqJODG50cg==", "license": "MIT", "dependencies": { - "regex": "^5.1.1", "regex-utilities": "^2.3.0" } }, @@ -13165,18 +13162,18 @@ } }, "node_modules/shiki": { - "version": "1.29.2", - "resolved": "https://registry.npmjs.org/shiki/-/shiki-1.29.2.tgz", - "integrity": "sha512-njXuliz/cP+67jU2hukkxCNuH1yUi4QfdZZY+sMr5PPrIyXSu5iTb/qYC4BiWWB0vZ+7TbdvYUCeL23zpwCfbg==", + "version": "3.13.0", + "resolved": "https://registry.npmjs.org/shiki/-/shiki-3.13.0.tgz", + "integrity": "sha512-aZW4l8Og16CokuCLf8CF8kq+KK2yOygapU5m3+hoGw0Mdosc6fPitjM+ujYarppj5ZIKGyPDPP1vqmQhr+5/0g==", "license": "MIT", "dependencies": { - "@shikijs/core": "1.29.2", - "@shikijs/engine-javascript": "1.29.2", - "@shikijs/engine-oniguruma": "1.29.2", - "@shikijs/langs": "1.29.2", - "@shikijs/themes": "1.29.2", - "@shikijs/types": "1.29.2", - "@shikijs/vscode-textmate": "^10.0.1", + "@shikijs/core": "3.13.0", + "@shikijs/engine-javascript": "3.13.0", + "@shikijs/engine-oniguruma": "3.13.0", + "@shikijs/langs": "3.13.0", + "@shikijs/themes": "3.13.0", + "@shikijs/types": "3.13.0", + "@shikijs/vscode-textmate": "^10.0.2", "@types/hast": "^3.0.4" } }, diff --git a/llama_stack/ui/package.json b/llama_stack/ui/package.json index 70462b534..adfccc072 100644 --- a/llama_stack/ui/package.json +++ b/llama_stack/ui/package.json @@ -33,7 +33,7 @@ "react-markdown": "^10.1.0", "remark-gfm": "^4.0.1", "remeda": "^2.32.0", - "shiki": "^1.29.2", + "shiki": "^3.13.0", "sonner": "^2.0.7", "tailwind-merge": "^3.3.1" }, From 9fdfd3a2ad7ec5626fb6ee2ae00d06f9d9106a45 Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon, 29 Sep 2025 10:03:26 +0200 Subject: [PATCH 02/13] chore(ui-deps): bump tw-animate-css from 1.2.9 to 1.4.0 in /llama_stack/ui (#3583) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Bumps [tw-animate-css](https://github.com/Wombosvideo/tw-animate-css) from 1.2.9 to 1.4.0.
Release notes

Sourced from tw-animate-css's releases.

v1.4.0

Changelog

  • 902e37a019ffd165ba078e0b3c02634526c54bf0: fix: remove support for prefix, add new export for prefixed version. Closes #58.
  • fab2a5bf817605be1976e159976718a83489fc1c: chore: bump version to 1.4.0 and update dependencies
  • c20dc32e2b532a8e74546879b4ce7d9ce89ba710: fix(build): make transform.ts accept two arguments

⚠️ BREAKING CHANGE ⚠️

Support for Tailwind CSS's prefix option was moved to `tw-animate-css/prefix` because it was breaking the `--spacing` function. Users requiring prefixes should replace their import:

- import "tw-animate-css";
+ import "tw-animate-css/prefix";

I do not plan to introduce breaking changes like this to non-major releases in the future. But because more people use spacing rather than prefixes, reverting the previous version's (obviously breaking) change seems reasonable.

v1.3.8

Changelog

  • b5ff23a: fix: add support for global CSS variable prefix. Closes #48
  • 03e5f12: feat: add support for ng-primitives height variables #56 (thanks @immohammadjaved)
  • b076cfb: docs: fix various issues in accordion and collapsible docs
  • 9485e33: chore: bump version to 1.3.8 and update dependencies

⚠️ BREAKING CHANGE ⚠️

Adding support for prefixes broke custom spacing. It is recommended that you skip this version if you do not use Tailwind CSS's prefix option, and use v1.4.0 instead. If you are actually using prefixes, you can use a special version supporting prefixes:

- import "tw-animate-css"; /* Version
with spacing support */
+ import "tw-animate-css/prefix"; /* Version with prefix
support */

I do not plan to fix the incompatibility between the spacing and prefix versions due to time constraints. Feel free to investigate and open a pull request if you manage to fix it.

v1.3.7

Changelog

  • 80dbfcc: feat: add utilities for blur transitions #54 (thanks @coffeeispower)
  • dc294f9: docs: add upcoming changes warning
  • c640bb8: chore: update dependencies and package manager version
  • 9e63e34: chore: bump version to 1.3.7

v1.3.6

Changelog

  • 58f3396: fix: allow changing animation parameters for ready-to-use animations
  • 8313476: chore: update dependencies and package manager version
  • f81346c: chore: bump version to 1.3.6

... (truncated)

Commits

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=tw-animate-css&package-manager=npm_and_yarn&previous-version=1.2.9&new-version=1.4.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---
Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR:

- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> --- llama_stack/ui/package-lock.json | 8 ++++---- llama_stack/ui/package.json | 2 +- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/llama_stack/ui/package-lock.json b/llama_stack/ui/package-lock.json index 979bc0781..1d2bd0de7 100644 --- a/llama_stack/ui/package-lock.json +++ b/llama_stack/ui/package-lock.json @@ -51,7 +51,7 @@ "prettier": "3.6.2", "tailwindcss": "^4", "ts-node": "^10.9.2", - "tw-animate-css": "^1.2.9", + "tw-animate-css": "^1.4.0", "typescript": "^5" } }, @@ -13967,9 +13967,9 @@ "license": "0BSD" }, "node_modules/tw-animate-css": { - "version": "1.2.9", - "resolved": "https://registry.npmjs.org/tw-animate-css/-/tw-animate-css-1.2.9.tgz", - "integrity": "sha512-9O4k1at9pMQff9EAcCEuy1UNO43JmaPQvq+0lwza9Y0BQ6LB38NiMj+qHqjoQf40355MX+gs6wtlR6H9WsSXFg==", + "version": "1.4.0", + "resolved": "https://registry.npmjs.org/tw-animate-css/-/tw-animate-css-1.4.0.tgz", + "integrity": "sha512-7bziOlRqH0hJx80h/3mbicLW7o8qLsH5+RaLR2t+OHM3D0JlWGODQKQ4cxbK7WlvmUxpcj6Kgu6EKqjrGFe3QQ==", "dev": true, "license": "MIT", "funding": { diff --git a/llama_stack/ui/package.json b/llama_stack/ui/package.json index adfccc072..1e01c347c 100644 --- a/llama_stack/ui/package.json +++ b/llama_stack/ui/package.json @@ -56,7 +56,7 @@ "prettier": "3.6.2", "tailwindcss": "^4", "ts-node": "^10.9.2", - "tw-animate-css": "^1.2.9", + "tw-animate-css": "^1.4.0", "typescript": "^5" } } From 90bb9cfb0a32ffc0659a88e2b43df419aa5147ac Mon Sep 17 00:00:00 2001 From: "dependabot[bot]" <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon, 29 Sep 2025 10:04:58 +0200 Subject: [PATCH 03/13] chore(github-deps): bump actions/cache from 4.2.4 to 4.3.0 (#3577) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Bumps [actions/cache](https://github.com/actions/cache) from 4.2.4 to 4.3.0.
Release notes

Sourced from actions/cache's releases.

v4.3.0

What's Changed / New Contributors (details truncated in this excerpt)

Full Changelog: https://github.com/actions/cache/compare/v4...v4.3.0

Changelog

Sourced from actions/cache's changelog.

Releases

4.3.0

  • Bump @actions/cache to v4.1.0

4.2.4

  • Bump @actions/cache to v4.0.5

4.2.3

  • Bump @actions/cache to v4.0.3 (obfuscates SAS token in debug logs for cache entries)

4.2.2

  • Bump @actions/cache to v4.0.2

4.2.1

  • Bump @actions/cache to v4.0.1

4.2.0

TLDR; The cache backend service has been rewritten from the ground up for improved performance and reliability. actions/cache now integrates with the new cache service (v2) APIs.

The new service will gradually roll out as of February 1st, 2025. The legacy service will also be sunset on the same date. Changes in these releases are fully backward compatible.

We are deprecating some versions of this action. We recommend upgrading to version v4 or v3 as soon as possible before February 1st, 2025. (Upgrade instructions below).

If you are using pinned SHAs, please use the SHAs of versions v4.2.0 or v3.4.0

If you do not upgrade, all workflow runs using any of the deprecated actions/cache will fail.

Upgrading to the recommended versions will not break your workflows.

4.1.2

  • Add GitHub Enterprise Cloud instances hostname filters to inform API endpoint choices - #1474
  • Security fix: Bump braces from 3.0.2 to 3.0.3 - #1475

4.1.1

  • Restore original behavior of cache-hit output - #1467

4.1.0

  • Ensure cache-hit output is set when a cache is missed - #1404
  • Deprecate save-always input - #1452

... (truncated)

Commits
  • 0057852 Merge pull request #1655 from actions/Link-/prepare-4.3.0
  • 4f5ea67 Update licensed cache
  • 9fcad95 Upgrade actions/cache to 4.1.0 and prepare 4.3.0 release
  • 638ed79 Merge pull request #1642 from actions/GhadimiR-patch-1
  • 3862dcc Add note on runner versions
  • See full diff in compare view

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=actions/cache&package-manager=github_actions&previous-version=4.2.4&new-version=4.3.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---
Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR:

- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
Signed-off-by: dependabot[bot] Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> --- .github/workflows/conformance.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/conformance.yml b/.github/workflows/conformance.yml index 9383476f5..b19b77cce 100644 --- a/.github/workflows/conformance.yml +++ b/.github/workflows/conformance.yml @@ -43,7 +43,7 @@ jobs: # Cache oasdiff to avoid checksum failures and speed up builds - name: Cache oasdiff id: cache-oasdiff - uses: actions/cache@0400d5f644dc74513175e3cd8d07132dd4860809 + uses: actions/cache@0057852bfaa89a56745cba8c7296529d2fc39830 with: path: ~/oasdiff key: oasdiff-${{ runner.os }} From 7c466a7ec5b5e3180db475e38b9f0aff5c7f3433 Mon Sep 17 00:00:00 2001 From: Matthew Farrellee Date: Mon, 29 Sep 2025 05:15:58 -0400 Subject: [PATCH 04/13] chore: skip nvidia datastore tests when nvidia datastore is not enabled (#3590) # What does this PR do? the nvidia datastore tests were running when the datastore was not configured. they would always fail. this introduces a skip when the nvidia datastore is not configured. ## Test Plan ci --- tests/integration/providers/nvidia/test_datastore.py | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/tests/integration/providers/nvidia/test_datastore.py b/tests/integration/providers/nvidia/test_datastore.py index 5cddd7781..ebae02bdd 100644 --- a/tests/integration/providers/nvidia/test_datastore.py +++ b/tests/integration/providers/nvidia/test_datastore.py @@ -14,6 +14,13 @@ from . import skip_in_github_actions # LLAMA_STACK_CONFIG="nvidia" pytest -v tests/integration/providers/nvidia/test_datastore.py +@pytest.fixture(autouse=True) +def skip_if_no_nvidia_provider(llama_stack_client): + provider_types = {p.provider_type for p in llama_stack_client.providers.list() if p.api == "datasetio"} + if "remote::nvidia" not in provider_types: + pytest.skip("datasetio=remote::nvidia provider not configured, skipping") + + # nvidia provider only @skip_in_github_actions @pytest.mark.parametrize( From 8ab6684a9483bee2f669e8682f3358f5a1f6a3b2 Mon Sep 17 00:00:00 2001 From: ehhuang Date: Mon, 29 Sep 2025 10:36:16 -0700 Subject: [PATCH 05/13] chore: introduce write queue for response_store (#3497) # What does this PR do? Mirroring the same changes that was used for inference_store: https://github.com/llamastack/llama-stack/pull/3383 Will follow up with a shared internal API for managing these write queues. 
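For context, the pattern being added is a bounded asyncio queue drained by background writer tasks, so `store_response_object` can return quickly while the SQL write happens in the background. The sketch below is a self-contained illustration of that pattern only; `QueuedWriter`, `slow_write`, and the default sizes are illustrative names and values rather than the actual Llama Stack API (the real `ResponsesStore` changes are in the diff below).

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Any


class QueuedWriter:
    """Bounded write queue drained by background worker tasks (illustration only)."""

    def __init__(
        self,
        write_fn: Callable[[Any], Awaitable[None]],
        max_queue_size: int = 10000,
        num_writers: int = 4,
    ) -> None:
        self._write_fn = write_fn
        self._queue: asyncio.Queue = asyncio.Queue(maxsize=max_queue_size)
        self._workers: list[asyncio.Task] = []
        self._num_writers = max(1, num_writers)

    async def start(self) -> None:
        # Spawn background workers that drain the queue.
        self._workers = [asyncio.create_task(self._worker_loop()) for _ in range(self._num_writers)]

    async def submit(self, item: Any) -> None:
        # Fast path: enqueue without blocking; if the queue is full, wait for room.
        try:
            self._queue.put_nowait(item)
        except asyncio.QueueFull:
            await self._queue.put(item)

    async def flush(self) -> None:
        # Block until every queued item has been written; handy in tests.
        await self._queue.join()

    async def shutdown(self) -> None:
        await self._queue.join()
        for task in self._workers:
            task.cancel()
        for task in self._workers:
            try:
                await task
            except asyncio.CancelledError:
                pass
        self._workers.clear()

    async def _worker_loop(self) -> None:
        while True:
            item = await self._queue.get()
            try:
                await self._write_fn(item)
            finally:
                self._queue.task_done()


async def main() -> None:
    written: list[int] = []

    async def slow_write(item: int) -> None:
        await asyncio.sleep(0.01)  # stand-in for a SQL insert
        written.append(item)

    writer = QueuedWriter(slow_write, num_writers=4)
    await writer.start()
    for i in range(10):
        await writer.submit(i)  # returns almost immediately
    await writer.flush()  # wait for the background writers to catch up
    assert len(written) == 10
    await writer.shutdown()


if __name__ == "__main__":
    asyncio.run(main())
```

Flushing via `queue.join()` is what lets tests wait deterministically for queued writes, which is why the test changes in this patch call `await store.flush()` before asserting.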
## Test Plan existing tests --- llama_stack/core/datatypes.py | 6 + .../utils/responses/responses_store.py | 110 +++++++++++++++++- .../meta_reference/test_openai_responses.py | 6 +- .../utils/responses/test_responses_store.py | 21 ++++ 4 files changed, 136 insertions(+), 7 deletions(-) diff --git a/llama_stack/core/datatypes.py b/llama_stack/core/datatypes.py index b5558c66f..6a297f012 100644 --- a/llama_stack/core/datatypes.py +++ b/llama_stack/core/datatypes.py @@ -433,6 +433,12 @@ class InferenceStoreConfig(BaseModel): num_writers: int = Field(default=4, description="Number of concurrent background writers") +class ResponsesStoreConfig(BaseModel): + sql_store_config: SqlStoreConfig + max_write_queue_size: int = Field(default=10000, description="Max queued writes for responses store") + num_writers: int = Field(default=4, description="Number of concurrent background writers") + + class StackRunConfig(BaseModel): version: int = LLAMA_STACK_RUN_CONFIG_VERSION diff --git a/llama_stack/providers/utils/responses/responses_store.py b/llama_stack/providers/utils/responses/responses_store.py index 829cd8a62..b9fceb1ab 100644 --- a/llama_stack/providers/utils/responses/responses_store.py +++ b/llama_stack/providers/utils/responses/responses_store.py @@ -3,6 +3,9 @@ # # This source code is licensed under the terms described in the LICENSE file in # the root directory of this source tree. +import asyncio +from typing import Any + from llama_stack.apis.agents import ( Order, ) @@ -14,24 +17,51 @@ from llama_stack.apis.agents.openai_responses import ( OpenAIResponseObject, OpenAIResponseObjectWithInput, ) -from llama_stack.core.datatypes import AccessRule +from llama_stack.core.datatypes import AccessRule, ResponsesStoreConfig from llama_stack.core.utils.config_dirs import RUNTIME_BASE_DIR +from llama_stack.log import get_logger from ..sqlstore.api import ColumnDefinition, ColumnType from ..sqlstore.authorized_sqlstore import AuthorizedSqlStore -from ..sqlstore.sqlstore import SqliteSqlStoreConfig, SqlStoreConfig, sqlstore_impl +from ..sqlstore.sqlstore import SqliteSqlStoreConfig, SqlStoreConfig, SqlStoreType, sqlstore_impl + +logger = get_logger(name=__name__, category="responses_store") class ResponsesStore: - def __init__(self, sql_store_config: SqlStoreConfig, policy: list[AccessRule]): - if not sql_store_config: - sql_store_config = SqliteSqlStoreConfig( + def __init__( + self, + config: ResponsesStoreConfig | SqlStoreConfig, + policy: list[AccessRule], + ): + # Handle backward compatibility + if not isinstance(config, ResponsesStoreConfig): + # Legacy: SqlStoreConfig passed directly as config + config = ResponsesStoreConfig( + sql_store_config=config, + ) + + self.config = config + self.sql_store_config = config.sql_store_config + if not self.sql_store_config: + self.sql_store_config = SqliteSqlStoreConfig( db_path=(RUNTIME_BASE_DIR / "sqlstore.db").as_posix(), ) - self.sql_store = AuthorizedSqlStore(sqlstore_impl(sql_store_config), policy) + self.sql_store = None + self.policy = policy + + # Disable write queue for SQLite to avoid concurrency issues + self.enable_write_queue = self.sql_store_config.type != SqlStoreType.sqlite + + # Async write queue and worker control + self._queue: asyncio.Queue[tuple[OpenAIResponseObject, list[OpenAIResponseInput]]] | None = None + self._worker_tasks: list[asyncio.Task[Any]] = [] + self._max_write_queue_size: int = config.max_write_queue_size + self._num_writers: int = max(1, config.num_writers) async def initialize(self): """Create the necessary tables if 
they don't exist.""" + self.sql_store = AuthorizedSqlStore(sqlstore_impl(self.sql_store_config), self.policy) await self.sql_store.create_table( "openai_responses", { @@ -42,9 +72,68 @@ class ResponsesStore: }, ) + if self.enable_write_queue: + self._queue = asyncio.Queue(maxsize=self._max_write_queue_size) + for _ in range(self._num_writers): + self._worker_tasks.append(asyncio.create_task(self._worker_loop())) + else: + logger.info("Write queue disabled for SQLite to avoid concurrency issues") + + async def shutdown(self) -> None: + if not self._worker_tasks: + return + if self._queue is not None: + await self._queue.join() + for t in self._worker_tasks: + if not t.done(): + t.cancel() + for t in self._worker_tasks: + try: + await t + except asyncio.CancelledError: + pass + self._worker_tasks.clear() + + async def flush(self) -> None: + """Wait for all queued writes to complete. Useful for testing.""" + if self.enable_write_queue and self._queue is not None: + await self._queue.join() + async def store_response_object( self, response_object: OpenAIResponseObject, input: list[OpenAIResponseInput] ) -> None: + if self.enable_write_queue: + if self._queue is None: + raise ValueError("Responses store is not initialized") + try: + self._queue.put_nowait((response_object, input)) + except asyncio.QueueFull: + logger.warning(f"Write queue full; adding response id={getattr(response_object, 'id', '')}") + await self._queue.put((response_object, input)) + else: + await self._write_response_object(response_object, input) + + async def _worker_loop(self) -> None: + assert self._queue is not None + while True: + try: + item = await self._queue.get() + except asyncio.CancelledError: + break + response_object, input = item + try: + await self._write_response_object(response_object, input) + except Exception as e: # noqa: BLE001 + logger.error(f"Error writing response object: {e}") + finally: + self._queue.task_done() + + async def _write_response_object( + self, response_object: OpenAIResponseObject, input: list[OpenAIResponseInput] + ) -> None: + if self.sql_store is None: + raise ValueError("Responses store is not initialized") + data = response_object.model_dump() data["input"] = [input_item.model_dump() for input_item in input] @@ -73,6 +162,9 @@ class ResponsesStore: :param model: The model to filter by. :param order: The order to sort the responses by. """ + if not self.sql_store: + raise ValueError("Responses store is not initialized") + if not order: order = Order.desc @@ -100,6 +192,9 @@ class ResponsesStore: """ Get a response object with automatic access control checking. 
""" + if not self.sql_store: + raise ValueError("Responses store is not initialized") + row = await self.sql_store.fetch_one( "openai_responses", where={"id": response_id}, @@ -113,6 +208,9 @@ class ResponsesStore: return OpenAIResponseObjectWithInput(**row["response_object"]) async def delete_response_object(self, response_id: str) -> OpenAIDeleteResponseObject: + if not self.sql_store: + raise ValueError("Responses store is not initialized") + row = await self.sql_store.fetch_one("openai_responses", where={"id": response_id}) if not row: raise ValueError(f"Response with id {response_id} not found") diff --git a/tests/unit/providers/agents/meta_reference/test_openai_responses.py b/tests/unit/providers/agents/meta_reference/test_openai_responses.py index a964bc219..38ce365c1 100644 --- a/tests/unit/providers/agents/meta_reference/test_openai_responses.py +++ b/tests/unit/providers/agents/meta_reference/test_openai_responses.py @@ -42,10 +42,12 @@ from llama_stack.apis.inference import ( ) from llama_stack.apis.tools.tools import Tool, ToolGroups, ToolInvocationResult, ToolParameter, ToolRuntime from llama_stack.core.access_control.access_control import default_policy +from llama_stack.core.datatypes import ResponsesStoreConfig from llama_stack.providers.inline.agents.meta_reference.responses.openai_responses import ( OpenAIResponsesImpl, ) from llama_stack.providers.utils.responses.responses_store import ResponsesStore +from llama_stack.providers.utils.sqlstore.sqlstore import SqliteSqlStoreConfig from tests.unit.providers.agents.meta_reference.fixtures import load_chat_completion_fixture @@ -677,7 +679,9 @@ async def test_responses_store_list_input_items_logic(): # Create mock store and response store mock_sql_store = AsyncMock() - responses_store = ResponsesStore(sql_store_config=None, policy=default_policy()) + responses_store = ResponsesStore( + ResponsesStoreConfig(sql_store_config=SqliteSqlStoreConfig(db_path="mock_db_path")), policy=default_policy() + ) responses_store.sql_store = mock_sql_store # Setup test data - multiple input items diff --git a/tests/unit/utils/responses/test_responses_store.py b/tests/unit/utils/responses/test_responses_store.py index 44d4b30da..4e5256c1b 100644 --- a/tests/unit/utils/responses/test_responses_store.py +++ b/tests/unit/utils/responses/test_responses_store.py @@ -67,6 +67,9 @@ async def test_responses_store_pagination_basic(): input_list = [create_test_response_input(f"Input for {response_id}", f"input-{response_id}")] await store.store_response_object(response, input_list) + # Wait for all queued writes to complete + await store.flush() + # Test 1: First page with limit=2, descending order (default) result = await store.list_responses(limit=2, order=Order.desc) assert len(result.data) == 2 @@ -110,6 +113,9 @@ async def test_responses_store_pagination_ascending(): input_list = [create_test_response_input(f"Input for {response_id}", f"input-{response_id}")] await store.store_response_object(response, input_list) + # Wait for all queued writes to complete + await store.flush() + # Test ascending order pagination result = await store.list_responses(limit=1, order=Order.asc) assert len(result.data) == 1 @@ -145,6 +151,9 @@ async def test_responses_store_pagination_with_model_filter(): input_list = [create_test_response_input(f"Input for {response_id}", f"input-{response_id}")] await store.store_response_object(response, input_list) + # Wait for all queued writes to complete + await store.flush() + # Test pagination with model filter result = await 
store.list_responses(limit=1, model="model-a", order=Order.desc) assert len(result.data) == 1 @@ -192,6 +201,9 @@ async def test_responses_store_pagination_no_limit(): input_list = [create_test_response_input(f"Input for {response_id}", f"input-{response_id}")] await store.store_response_object(response, input_list) + # Wait for all queued writes to complete + await store.flush() + # Test without limit (should use default of 50) result = await store.list_responses(order=Order.desc) assert len(result.data) == 2 @@ -212,6 +224,9 @@ async def test_responses_store_get_response_object(): input_list = [create_test_response_input("Test input content", "input-test-resp")] await store.store_response_object(response, input_list) + # Wait for all queued writes to complete + await store.flush() + # Retrieve the response retrieved = await store.get_response_object("test-resp") assert retrieved.id == "test-resp" @@ -242,6 +257,9 @@ async def test_responses_store_input_items_pagination(): ] await store.store_response_object(response, input_list) + # Wait for all queued writes to complete + await store.flush() + # Verify all items are stored correctly with explicit IDs all_items = await store.list_response_input_items("test-resp", order=Order.desc) assert len(all_items.data) == 5 @@ -319,6 +337,9 @@ async def test_responses_store_input_items_before_pagination(): ] await store.store_response_object(response, input_list) + # Wait for all queued writes to complete + await store.flush() + # Test before pagination with descending order # In desc order: [Fifth, Fourth, Third, Second, First] # before="before-3" should return [Fifth, Fourth] From fdb144f0095efa6b50929cd7b47562ddc205ea97 Mon Sep 17 00:00:00 2001 From: Ashwin Bharambe Date: Mon, 29 Sep 2025 10:41:04 -0700 Subject: [PATCH 06/13] revert: feat(ci): use @next branch from llama-stack-client (#3593) Reverts llamastack/llama-stack#3576 When I edit Stainless and codegen succeeds, the `next` branch is updated directly. It provides us no chance to see if there might be something unideal going on. If something is wrong, all CI will start breaking immediately. This is not ideal. I will likely create another staging branch `next-release` or something to accomodate the special workflow that Stainless requires. 
--- .github/actions/setup-runner/action.yml | 4 ++-- .github/actions/setup-test-environment/action.yml | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/.github/actions/setup-runner/action.yml b/.github/actions/setup-runner/action.yml index 448fdbbfe..905d6b73a 100644 --- a/.github/actions/setup-runner/action.yml +++ b/.github/actions/setup-runner/action.yml @@ -29,8 +29,8 @@ runs: # Install llama-stack-client-python based on the client-version input if [ "${{ inputs.client-version }}" = "latest" ]; then - echo "Installing latest llama-stack-client-python from next branch" - uv pip install git+https://github.com/llamastack/llama-stack-client-python.git@next + echo "Installing latest llama-stack-client-python from main branch" + uv pip install git+https://github.com/llamastack/llama-stack-client-python.git@main elif [ "${{ inputs.client-version }}" = "published" ]; then echo "Installing published llama-stack-client-python from PyPI" uv pip install llama-stack-client diff --git a/.github/actions/setup-test-environment/action.yml b/.github/actions/setup-test-environment/action.yml index ececca0f6..478e8f598 100644 --- a/.github/actions/setup-test-environment/action.yml +++ b/.github/actions/setup-test-environment/action.yml @@ -44,8 +44,8 @@ runs: run: | # Install llama-stack-client-python based on the client-version input if [ "${{ inputs.client-version }}" = "latest" ]; then - echo "Installing latest llama-stack-client-python from next branch" - export LLAMA_STACK_CLIENT_DIR=git+https://github.com/llamastack/llama-stack-client-python.git@next + echo "Installing latest llama-stack-client-python from main branch" + export LLAMA_STACK_CLIENT_DIR=git+https://github.com/llamastack/llama-stack-client-python.git@main elif [ "${{ inputs.client-version }}" = "published" ]; then echo "Installing published llama-stack-client-python from PyPI" unset LLAMA_STACK_CLIENT_DIR From aab22dc759b3de64a619911efbadfc8bd5799c5d Mon Sep 17 00:00:00 2001 From: Kai Wu Date: Mon, 29 Sep 2025 11:27:31 -0700 Subject: [PATCH 07/13] fix: adding mime type of application/json support (#3452) # What does this PR do? 
This PR fix #3300 by adding mime type of application/json support in [agent_instance.py](https://github.com/llamastack/llama-stack/blob/4a59961a6cd764db739aefeba06601dfaee68d88/llama_stack/providers/inline/agents/meta_reference/agent_instance.py#L923) ## Test Plan all related pytest passed, see log: ``` ./scripts/unit-tests.sh tests/unit/providers/agent/test_get_raw_document_text.py -vvv /Users/kaiwu/work/kaiwu/llama-stack/.venv/bin/python3 Uninstalled 22 packages in 5.65s Installed 47 packages in 1.24s ================= test session starts ================= platform darwin -- Python 3.12.9, pytest-8.4.2, pluggy-1.6.0 -- /Users/kaiwu/work/kaiwu/llama-stack/.venv/bin/python cachedir: .pytest_cache metadata: {'Python': '3.12.9', 'Platform': 'macOS-15.6.1-arm64-arm-64bit', 'Packages': {'pytest': '8.4.2', 'pluggy': '1.6.0'}, 'Plugins': {'anyio': '4.9.0', 'html': '4.1.1', 'socket': '0.7.0', 'asyncio': '1.1.0', 'json-report': '1.5.0', 'timeout': '2.4.0', 'metadata': '3.1.1', 'cov': '6.2.1', 'nbval': '0.11.0'}} rootdir: /Users/kaiwu/work/kaiwu/llama-stack configfile: pyproject.toml plugins: anyio-4.9.0, html-4.1.1, socket-0.7.0, asyncio-1.1.0, json-report-1.5.0, timeout-2.4.0, metadata-3.1.1, cov-6.2.1, nbval-0.11.0 asyncio: mode=Mode.AUTO, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function collected 14 items tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_supports_text_mime_types PASSED tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_supports_yaml_mime_type PASSED tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_supports_deprecated_text_yaml_with_warning PASSED tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_deprecated_text_yaml_with_url PASSED tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_deprecated_text_yaml_with_text_content_item PASSED tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_supports_json_mime_type PASSED tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_with_json_url PASSED tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_with_json_text_content_item PASSED tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_rejects_unsupported_mime_types PASSED tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_with_url_content PASSED tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_with_yaml_url PASSED tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_with_text_content_item PASSED tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_with_yaml_text_content_item PASSED tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_rejects_unexpected_content_type PASSED ================ slowest 10 durations ================= 0.00s call tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_deprecated_text_yaml_with_url 0.00s call tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_rejects_unsupported_mime_types 0.00s call tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_rejects_unexpected_content_type 0.00s setup 
tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_supports_text_mime_types 0.00s teardown tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_supports_text_mime_types 0.00s call tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_with_yaml_url 0.00s call tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_with_url_content 0.00s teardown tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_rejects_unsupported_mime_types 0.00s call tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_with_json_url 0.00s call tests/unit/providers/agent/test_get_raw_document_text.py::test_get_raw_document_text_supports_text_mime_types ================= 14 passed in 0.14s ================== Generating coverage report... Wrote HTML report to htmlcov-3.12/index.html ``` --- .../agents/meta_reference/agent_instance.py | 2 +- .../agent/test_get_raw_document_text.py | 24 +++++++++++++++++-- 2 files changed, 23 insertions(+), 3 deletions(-) diff --git a/llama_stack/providers/inline/agents/meta_reference/agent_instance.py b/llama_stack/providers/inline/agents/meta_reference/agent_instance.py index dcc08a482..467777b72 100644 --- a/llama_stack/providers/inline/agents/meta_reference/agent_instance.py +++ b/llama_stack/providers/inline/agents/meta_reference/agent_instance.py @@ -924,7 +924,7 @@ async def get_raw_document_text(document: Document) -> str: DeprecationWarning, stacklevel=2, ) - elif not (document.mime_type.startswith("text/") or document.mime_type == "application/yaml"): + elif not (document.mime_type.startswith("text/") or document.mime_type in ("application/yaml", "application/json")): raise ValueError(f"Unexpected document mime type: {document.mime_type}") if isinstance(document.content, URL): diff --git a/tests/unit/providers/agent/test_get_raw_document_text.py b/tests/unit/providers/agent/test_get_raw_document_text.py index eb481c0d8..302a893b1 100644 --- a/tests/unit/providers/agent/test_get_raw_document_text.py +++ b/tests/unit/providers/agent/test_get_raw_document_text.py @@ -107,14 +107,34 @@ async def test_get_raw_document_text_deprecated_text_yaml_with_text_content_item assert "text/yaml" in str(w[0].message) +async def test_get_raw_document_text_supports_json_mime_type(): + """Test that the function accepts application/json mime type.""" + json_content = '{"name": "test", "version": "1.0", "items": ["item1", "item2"]}' + + document = Document(content=json_content, mime_type="application/json") + + result = await get_raw_document_text(document) + assert result == json_content + + +async def test_get_raw_document_text_with_json_text_content_item(): + """Test that the function handles JSON TextContentItem correctly.""" + json_content = '{"key": "value", "nested": {"array": [1, 2, 3]}}' + + document = Document(content=TextContentItem(text=json_content), mime_type="application/json") + + result = await get_raw_document_text(document) + assert result == json_content + + async def test_get_raw_document_text_rejects_unsupported_mime_types(): """Test that the function rejects unsupported mime types.""" document = Document( content="Some content", - mime_type="application/json", # Not supported + mime_type="application/pdf", # Not supported ) - with pytest.raises(ValueError, match="Unexpected document mime type: application/json"): + with pytest.raises(ValueError, match="Unexpected document mime type: 
application/pdf"): await get_raw_document_text(document) From 975ead1d6ab04bcb67645c491593c8dc95b32aab Mon Sep 17 00:00:00 2001 From: Matthew Farrellee Date: Mon, 29 Sep 2025 14:45:09 -0400 Subject: [PATCH 08/13] chore(api): remove deprecated embeddings impls (#3301) # What does this PR do? remove deprecated embeddings implementations --- docs/static/llama-stack-spec.html | 118 ------------------ docs/static/llama-stack-spec.yaml | 101 --------------- llama_stack/apis/inference/inference.py | 22 +--- llama_stack/core/routers/inference.py | 23 ---- .../remote/inference/bedrock/bedrock.py | 36 ------ .../remote/inference/cerebras/cerebras.py | 14 --- .../remote/inference/databricks/databricks.py | 14 --- .../remote/inference/fireworks/fireworks.py | 31 ----- .../remote/inference/nvidia/nvidia.py | 59 --------- .../remote/inference/ollama/ollama.py | 27 ---- .../inference/passthrough/passthrough.py | 22 ---- .../remote/inference/runpod/runpod.py | 10 -- .../providers/remote/inference/tgi/tgi.py | 14 --- .../remote/inference/together/together.py | 26 ---- .../providers/remote/inference/vllm/vllm.py | 30 ----- .../remote/inference/watsonx/watsonx.py | 15 +-- .../utils/inference/embedding_mixin.py | 22 ---- .../utils/inference/litellm_openai_mixin.py | 25 ---- tests/unit/providers/vector_io/test_faiss.py | 26 +--- 19 files changed, 3 insertions(+), 632 deletions(-) diff --git a/docs/static/llama-stack-spec.html b/docs/static/llama-stack-spec.html index 9e28e0f42..7845fb068 100644 --- a/docs/static/llama-stack-spec.html +++ b/docs/static/llama-stack-spec.html @@ -1035,50 +1035,6 @@ ] } }, - "/v1/inference/embeddings": { - "post": { - "responses": { - "200": { - "description": "An array of embeddings, one for each content. Each embedding is a list of floats. The dimensionality of the embedding is model-specific; you can check model metadata using /models/{model_id}.", - "content": { - "application/json": { - "schema": { - "$ref": "#/components/schemas/EmbeddingsResponse" - } - } - } - }, - "400": { - "$ref": "#/components/responses/BadRequest400" - }, - "429": { - "$ref": "#/components/responses/TooManyRequests429" - }, - "500": { - "$ref": "#/components/responses/InternalServerError500" - }, - "default": { - "$ref": "#/components/responses/DefaultError" - } - }, - "tags": [ - "Inference" - ], - "summary": "Generate embeddings for content pieces using the specified model.", - "description": "Generate embeddings for content pieces using the specified model.", - "parameters": [], - "requestBody": { - "content": { - "application/json": { - "schema": { - "$ref": "#/components/schemas/EmbeddingsRequest" - } - } - }, - "required": true - } - } - }, "/v1alpha/eval/benchmarks/{benchmark_id}/evaluations": { "post": { "responses": { @@ -10547,80 +10503,6 @@ "title": "OpenAIDeleteResponseObject", "description": "Response object confirming deletion of an OpenAI response." }, - "EmbeddingsRequest": { - "type": "object", - "properties": { - "model_id": { - "type": "string", - "description": "The identifier of the model to use. The model must be an embedding model registered with Llama Stack and available via the /models endpoint." - }, - "contents": { - "oneOf": [ - { - "type": "array", - "items": { - "type": "string" - } - }, - { - "type": "array", - "items": { - "$ref": "#/components/schemas/InterleavedContentItem" - } - } - ], - "description": "List of contents to generate embeddings for. Each content can be a string or an InterleavedContentItem (and hence can be multimodal). 
The behavior depends on the model and provider. Some models may only support text." - }, - "text_truncation": { - "type": "string", - "enum": [ - "none", - "start", - "end" - ], - "description": "(Optional) Config for how to truncate text for embedding when text is longer than the model's max sequence length." - }, - "output_dimension": { - "type": "integer", - "description": "(Optional) Output dimensionality for the embeddings. Only supported by Matryoshka models." - }, - "task_type": { - "type": "string", - "enum": [ - "query", - "document" - ], - "description": "(Optional) How is the embedding being used? This is only supported by asymmetric embedding models." - } - }, - "additionalProperties": false, - "required": [ - "model_id", - "contents" - ], - "title": "EmbeddingsRequest" - }, - "EmbeddingsResponse": { - "type": "object", - "properties": { - "embeddings": { - "type": "array", - "items": { - "type": "array", - "items": { - "type": "number" - } - }, - "description": "List of embedding vectors, one per input content. Each embedding is a list of floats. The dimensionality of the embedding is model-specific; you can check model metadata using /models/{model_id}" - } - }, - "additionalProperties": false, - "required": [ - "embeddings" - ], - "title": "EmbeddingsResponse", - "description": "Response containing generated embeddings." - }, "AgentCandidate": { "type": "object", "properties": { diff --git a/docs/static/llama-stack-spec.yaml b/docs/static/llama-stack-spec.yaml index 1c06c74a5..8cbbccaa2 100644 --- a/docs/static/llama-stack-spec.yaml +++ b/docs/static/llama-stack-spec.yaml @@ -720,41 +720,6 @@ paths: required: true schema: type: string - /v1/inference/embeddings: - post: - responses: - '200': - description: >- - An array of embeddings, one for each content. Each embedding is a list - of floats. The dimensionality of the embedding is model-specific; you - can check model metadata using /models/{model_id}. - content: - application/json: - schema: - $ref: '#/components/schemas/EmbeddingsResponse' - '400': - $ref: '#/components/responses/BadRequest400' - '429': - $ref: >- - #/components/responses/TooManyRequests429 - '500': - $ref: >- - #/components/responses/InternalServerError500 - default: - $ref: '#/components/responses/DefaultError' - tags: - - Inference - summary: >- - Generate embeddings for content pieces using the specified model. - description: >- - Generate embeddings for content pieces using the specified model. - parameters: [] - requestBody: - content: - application/json: - schema: - $ref: '#/components/schemas/EmbeddingsRequest' - required: true /v1alpha/eval/benchmarks/{benchmark_id}/evaluations: post: responses: @@ -7795,72 +7760,6 @@ components: title: OpenAIDeleteResponseObject description: >- Response object confirming deletion of an OpenAI response. - EmbeddingsRequest: - type: object - properties: - model_id: - type: string - description: >- - The identifier of the model to use. The model must be an embedding model - registered with Llama Stack and available via the /models endpoint. - contents: - oneOf: - - type: array - items: - type: string - - type: array - items: - $ref: '#/components/schemas/InterleavedContentItem' - description: >- - List of contents to generate embeddings for. Each content can be a string - or an InterleavedContentItem (and hence can be multimodal). The behavior - depends on the model and provider. Some models may only support text. 
- text_truncation: - type: string - enum: - - none - - start - - end - description: >- - (Optional) Config for how to truncate text for embedding when text is - longer than the model's max sequence length. - output_dimension: - type: integer - description: >- - (Optional) Output dimensionality for the embeddings. Only supported by - Matryoshka models. - task_type: - type: string - enum: - - query - - document - description: >- - (Optional) How is the embedding being used? This is only supported by - asymmetric embedding models. - additionalProperties: false - required: - - model_id - - contents - title: EmbeddingsRequest - EmbeddingsResponse: - type: object - properties: - embeddings: - type: array - items: - type: array - items: - type: number - description: >- - List of embedding vectors, one per input content. Each embedding is a - list of floats. The dimensionality of the embedding is model-specific; - you can check model metadata using /models/{model_id} - additionalProperties: false - required: - - embeddings - title: EmbeddingsResponse - description: >- - Response containing generated embeddings. AgentCandidate: type: object properties: diff --git a/llama_stack/apis/inference/inference.py b/llama_stack/apis/inference/inference.py index 756896796..c6a4e4f60 100644 --- a/llama_stack/apis/inference/inference.py +++ b/llama_stack/apis/inference/inference.py @@ -17,7 +17,7 @@ from typing import ( from pydantic import BaseModel, Field, field_validator from typing_extensions import TypedDict -from llama_stack.apis.common.content_types import ContentDelta, InterleavedContent, InterleavedContentItem +from llama_stack.apis.common.content_types import ContentDelta, InterleavedContent from llama_stack.apis.common.responses import Order from llama_stack.apis.models import Model from llama_stack.apis.telemetry import MetricResponseMixin @@ -1070,26 +1070,6 @@ class InferenceProvider(Protocol): """ ... - @webmethod(route="/inference/embeddings", method="POST", level=LLAMA_STACK_API_V1) - async def embeddings( - self, - model_id: str, - contents: list[str] | list[InterleavedContentItem], - text_truncation: TextTruncation | None = TextTruncation.none, - output_dimension: int | None = None, - task_type: EmbeddingTaskType | None = None, - ) -> EmbeddingsResponse: - """Generate embeddings for content pieces using the specified model. - - :param model_id: The identifier of the model to use. The model must be an embedding model registered with Llama Stack and available via the /models endpoint. - :param contents: List of contents to generate embeddings for. Each content can be a string or an InterleavedContentItem (and hence can be multimodal). The behavior depends on the model and provider. Some models may only support text. - :param output_dimension: (Optional) Output dimensionality for the embeddings. Only supported by Matryoshka models. - :param text_truncation: (Optional) Config for how to truncate text for embedding when text is longer than the model's max sequence length. - :param task_type: (Optional) How is the embedding being used? This is only supported by asymmetric embedding models. - :returns: An array of embeddings, one for each content. Each embedding is a list of floats. The dimensionality of the embedding is model-specific; you can check model metadata using /models/{model_id}. - """ - ... 
- @webmethod(route="/inference/rerank", method="POST", experimental=True, level=LLAMA_STACK_API_V1) async def rerank( self, diff --git a/llama_stack/core/routers/inference.py b/llama_stack/core/routers/inference.py index fcf01a9c4..80f47fb5d 100644 --- a/llama_stack/core/routers/inference.py +++ b/llama_stack/core/routers/inference.py @@ -16,7 +16,6 @@ from pydantic import Field, TypeAdapter from llama_stack.apis.common.content_types import ( InterleavedContent, - InterleavedContentItem, ) from llama_stack.apis.common.errors import ModelNotFoundError, ModelTypeError from llama_stack.apis.inference import ( @@ -26,8 +25,6 @@ from llama_stack.apis.inference import ( CompletionMessage, CompletionResponse, CompletionResponseStreamChunk, - EmbeddingsResponse, - EmbeddingTaskType, Inference, ListOpenAIChatCompletionResponse, LogProbConfig, @@ -48,7 +45,6 @@ from llama_stack.apis.inference import ( ResponseFormat, SamplingParams, StopReason, - TextTruncation, ToolChoice, ToolConfig, ToolDefinition, @@ -312,25 +308,6 @@ class InferenceRouter(Inference): return response - async def embeddings( - self, - model_id: str, - contents: list[str] | list[InterleavedContentItem], - text_truncation: TextTruncation | None = TextTruncation.none, - output_dimension: int | None = None, - task_type: EmbeddingTaskType | None = None, - ) -> EmbeddingsResponse: - logger.debug(f"InferenceRouter.embeddings: {model_id}") - await self._get_model(model_id, ModelType.embedding) - provider = await self.routing_table.get_provider_impl(model_id) - return await provider.embeddings( - model_id=model_id, - contents=contents, - text_truncation=text_truncation, - output_dimension=output_dimension, - task_type=task_type, - ) - async def openai_completion( self, model: str, diff --git a/llama_stack/providers/remote/inference/bedrock/bedrock.py b/llama_stack/providers/remote/inference/bedrock/bedrock.py index 29b935bbd..2206aa641 100644 --- a/llama_stack/providers/remote/inference/bedrock/bedrock.py +++ b/llama_stack/providers/remote/inference/bedrock/bedrock.py @@ -11,21 +11,17 @@ from botocore.client import BaseClient from llama_stack.apis.common.content_types import ( InterleavedContent, - InterleavedContentItem, ) from llama_stack.apis.inference import ( ChatCompletionRequest, ChatCompletionResponse, ChatCompletionResponseStreamChunk, - EmbeddingsResponse, - EmbeddingTaskType, Inference, LogProbConfig, Message, OpenAIEmbeddingsResponse, ResponseFormat, SamplingParams, - TextTruncation, ToolChoice, ToolConfig, ToolDefinition, @@ -47,8 +43,6 @@ from llama_stack.providers.utils.inference.openai_compat import ( ) from llama_stack.providers.utils.inference.prompt_adapter import ( chat_completion_request_to_prompt, - content_has_media, - interleaved_content_as_str, ) from .models import MODEL_ENTRIES @@ -218,36 +212,6 @@ class BedrockInferenceAdapter( ), } - async def embeddings( - self, - model_id: str, - contents: list[str] | list[InterleavedContentItem], - text_truncation: TextTruncation | None = TextTruncation.none, - output_dimension: int | None = None, - task_type: EmbeddingTaskType | None = None, - ) -> EmbeddingsResponse: - model = await self.model_store.get_model(model_id) - - # Convert foundation model ID to inference profile ID - region_name = self.client.meta.region_name - inference_profile_id = _to_inference_profile_id(model.provider_resource_id, region_name) - - embeddings = [] - for content in contents: - assert not content_has_media(content), "Bedrock does not support media for embeddings" - input_text = 
interleaved_content_as_str(content) - input_body = {"inputText": input_text} - body = json.dumps(input_body) - response = self.client.invoke_model( - body=body, - modelId=inference_profile_id, - accept="application/json", - contentType="application/json", - ) - response_body = json.loads(response.get("body").read()) - embeddings.append(response_body.get("embedding")) - return EmbeddingsResponse(embeddings=embeddings) - async def openai_embeddings( self, model: str, diff --git a/llama_stack/providers/remote/inference/cerebras/cerebras.py b/llama_stack/providers/remote/inference/cerebras/cerebras.py index 6662f004d..6be39fa5d 100644 --- a/llama_stack/providers/remote/inference/cerebras/cerebras.py +++ b/llama_stack/providers/remote/inference/cerebras/cerebras.py @@ -11,21 +11,17 @@ from cerebras.cloud.sdk import AsyncCerebras from llama_stack.apis.common.content_types import ( InterleavedContent, - InterleavedContentItem, ) from llama_stack.apis.inference import ( ChatCompletionRequest, CompletionRequest, CompletionResponse, - EmbeddingsResponse, - EmbeddingTaskType, Inference, LogProbConfig, Message, OpenAIEmbeddingsResponse, ResponseFormat, SamplingParams, - TextTruncation, ToolChoice, ToolConfig, ToolDefinition, @@ -187,16 +183,6 @@ class CerebrasInferenceAdapter( **get_sampling_options(request.sampling_params), } - async def embeddings( - self, - model_id: str, - contents: list[str] | list[InterleavedContentItem], - text_truncation: TextTruncation | None = TextTruncation.none, - output_dimension: int | None = None, - task_type: EmbeddingTaskType | None = None, - ) -> EmbeddingsResponse: - raise NotImplementedError() - async def openai_embeddings( self, model: str, diff --git a/llama_stack/providers/remote/inference/databricks/databricks.py b/llama_stack/providers/remote/inference/databricks/databricks.py index 6eac6e4f4..d85b477f5 100644 --- a/llama_stack/providers/remote/inference/databricks/databricks.py +++ b/llama_stack/providers/remote/inference/databricks/databricks.py @@ -11,15 +11,12 @@ from databricks.sdk import WorkspaceClient from llama_stack.apis.common.content_types import ( InterleavedContent, - InterleavedContentItem, ) from llama_stack.apis.inference import ( ChatCompletionResponse, ChatCompletionResponseStreamChunk, CompletionResponse, CompletionResponseStreamChunk, - EmbeddingsResponse, - EmbeddingTaskType, Inference, LogProbConfig, Message, @@ -27,7 +24,6 @@ from llama_stack.apis.inference import ( OpenAICompletion, ResponseFormat, SamplingParams, - TextTruncation, ToolChoice, ToolConfig, ToolDefinition, @@ -118,16 +114,6 @@ class DatabricksInferenceAdapter( ) -> ChatCompletionResponse | AsyncIterator[ChatCompletionResponseStreamChunk]: raise NotImplementedError() - async def embeddings( - self, - model_id: str, - contents: list[str] | list[InterleavedContentItem], - text_truncation: TextTruncation | None = TextTruncation.none, - output_dimension: int | None = None, - task_type: EmbeddingTaskType | None = None, - ) -> EmbeddingsResponse: - raise NotImplementedError() - async def list_models(self) -> list[Model] | None: self._model_cache = {} # from OpenAIMixin ws_client = WorkspaceClient(host=self.config.url, token=self.get_api_key()) # TODO: this is not async diff --git a/llama_stack/providers/remote/inference/fireworks/fireworks.py b/llama_stack/providers/remote/inference/fireworks/fireworks.py index 069a0a674..ed4b56fad 100644 --- a/llama_stack/providers/remote/inference/fireworks/fireworks.py +++ b/llama_stack/providers/remote/inference/fireworks/fireworks.py @@ 
-10,22 +10,18 @@ from fireworks.client import Fireworks from llama_stack.apis.common.content_types import ( InterleavedContent, - InterleavedContentItem, ) from llama_stack.apis.inference import ( ChatCompletionRequest, ChatCompletionResponse, CompletionRequest, CompletionResponse, - EmbeddingsResponse, - EmbeddingTaskType, Inference, LogProbConfig, Message, ResponseFormat, ResponseFormatType, SamplingParams, - TextTruncation, ToolChoice, ToolConfig, ToolDefinition, @@ -48,8 +44,6 @@ from llama_stack.providers.utils.inference.openai_mixin import OpenAIMixin from llama_stack.providers.utils.inference.prompt_adapter import ( chat_completion_request_to_prompt, completion_request_to_prompt, - content_has_media, - interleaved_content_as_str, request_has_media, ) @@ -259,28 +253,3 @@ class FireworksInferenceAdapter(OpenAIMixin, ModelRegistryHelper, Inference, Nee logger.debug(f"params to fireworks: {params}") return params - - async def embeddings( - self, - model_id: str, - contents: list[str] | list[InterleavedContentItem], - text_truncation: TextTruncation | None = TextTruncation.none, - output_dimension: int | None = None, - task_type: EmbeddingTaskType | None = None, - ) -> EmbeddingsResponse: - model = await self.model_store.get_model(model_id) - - kwargs = {} - if model.metadata.get("embedding_dimension"): - kwargs["dimensions"] = model.metadata.get("embedding_dimension") - assert all(not content_has_media(content) for content in contents), ( - "Fireworks does not support media for embeddings" - ) - response = self._get_client().embeddings.create( - model=model.provider_resource_id, - input=[interleaved_content_as_str(content) for content in contents], - **kwargs, - ) - - embeddings = [data.embedding for data in response.data] - return EmbeddingsResponse(embeddings=embeddings) diff --git a/llama_stack/providers/remote/inference/nvidia/nvidia.py b/llama_stack/providers/remote/inference/nvidia/nvidia.py index 92094a0f3..a31981adb 100644 --- a/llama_stack/providers/remote/inference/nvidia/nvidia.py +++ b/llama_stack/providers/remote/inference/nvidia/nvidia.py @@ -11,8 +11,6 @@ from openai import NOT_GIVEN, APIConnectionError from llama_stack.apis.common.content_types import ( InterleavedContent, - InterleavedContentItem, - TextContentItem, ) from llama_stack.apis.inference import ( ChatCompletionRequest, @@ -21,8 +19,6 @@ from llama_stack.apis.inference import ( CompletionRequest, CompletionResponse, CompletionResponseStreamChunk, - EmbeddingsResponse, - EmbeddingTaskType, Inference, LogProbConfig, Message, @@ -31,7 +27,6 @@ from llama_stack.apis.inference import ( OpenAIEmbeddingUsage, ResponseFormat, SamplingParams, - TextTruncation, ToolChoice, ToolConfig, ) @@ -156,60 +151,6 @@ class NVIDIAInferenceAdapter(OpenAIMixin, Inference): # we pass n=1 to get only one completion return convert_openai_completion_choice(response.choices[0]) - async def embeddings( - self, - model_id: str, - contents: list[str] | list[InterleavedContentItem], - text_truncation: TextTruncation | None = TextTruncation.none, - output_dimension: int | None = None, - task_type: EmbeddingTaskType | None = None, - ) -> EmbeddingsResponse: - if any(content_has_media(content) for content in contents): - raise NotImplementedError("Media is not supported") - - # - # Llama Stack: contents = list[str] | list[InterleavedContentItem] - # -> - # OpenAI: input = str | list[str] - # - # we can ignore str and always pass list[str] to OpenAI - # - flat_contents = [content.text if isinstance(content, TextContentItem) else content for 
content in contents] - input = [content.text if isinstance(content, TextContentItem) else content for content in flat_contents] - provider_model_id = await self._get_provider_model_id(model_id) - - extra_body = {} - - if text_truncation is not None: - text_truncation_options = { - TextTruncation.none: "NONE", - TextTruncation.end: "END", - TextTruncation.start: "START", - } - extra_body["truncate"] = text_truncation_options[text_truncation] - - if output_dimension is not None: - extra_body["dimensions"] = output_dimension - - if task_type is not None: - task_type_options = { - EmbeddingTaskType.document: "passage", - EmbeddingTaskType.query: "query", - } - extra_body["input_type"] = task_type_options[task_type] - - response = await self.client.embeddings.create( - model=provider_model_id, - input=input, - extra_body=extra_body, - ) - # - # OpenAI: CreateEmbeddingResponse(data=[Embedding(embedding=list[float], ...)], ...) - # -> - # Llama Stack: EmbeddingsResponse(embeddings=list[list[float]]) - # - return EmbeddingsResponse(embeddings=[embedding.embedding for embedding in response.data]) - async def openai_embeddings( self, model: str, diff --git a/llama_stack/providers/remote/inference/ollama/ollama.py b/llama_stack/providers/remote/inference/ollama/ollama.py index 3fb10445f..16b104fb5 100644 --- a/llama_stack/providers/remote/inference/ollama/ollama.py +++ b/llama_stack/providers/remote/inference/ollama/ollama.py @@ -14,7 +14,6 @@ from ollama import AsyncClient as AsyncOllamaClient from llama_stack.apis.common.content_types import ( ImageContentItem, InterleavedContent, - InterleavedContentItem, TextContentItem, ) from llama_stack.apis.common.errors import UnsupportedModelError @@ -25,8 +24,6 @@ from llama_stack.apis.inference import ( CompletionRequest, CompletionResponse, CompletionResponseStreamChunk, - EmbeddingsResponse, - EmbeddingTaskType, GrammarResponseFormat, InferenceProvider, JsonSchemaResponseFormat, @@ -34,7 +31,6 @@ from llama_stack.apis.inference import ( Message, ResponseFormat, SamplingParams, - TextTruncation, ToolChoice, ToolConfig, ToolDefinition, @@ -66,9 +62,7 @@ from llama_stack.providers.utils.inference.openai_mixin import OpenAIMixin from llama_stack.providers.utils.inference.prompt_adapter import ( chat_completion_request_to_prompt, completion_request_to_prompt, - content_has_media, convert_image_content_to_url, - interleaved_content_as_str, request_has_media, ) @@ -363,27 +357,6 @@ class OllamaInferenceAdapter( async for chunk in process_chat_completion_stream_response(stream, request): yield chunk - async def embeddings( - self, - model_id: str, - contents: list[str] | list[InterleavedContentItem], - text_truncation: TextTruncation | None = TextTruncation.none, - output_dimension: int | None = None, - task_type: EmbeddingTaskType | None = None, - ) -> EmbeddingsResponse: - model = await self._get_model(model_id) - - assert all(not content_has_media(content) for content in contents), ( - "Ollama does not support media for embeddings" - ) - response = await self.ollama_client.embed( - model=model.provider_resource_id, - input=[interleaved_content_as_str(content) for content in contents], - ) - embeddings = response["embeddings"] - - return EmbeddingsResponse(embeddings=embeddings) - async def register_model(self, model: Model) -> Model: if await self.check_model_availability(model.provider_model_id): return model diff --git a/llama_stack/providers/remote/inference/passthrough/passthrough.py b/llama_stack/providers/remote/inference/passthrough/passthrough.py 
index a2bdf0369..ae482b7b0 100644 --- a/llama_stack/providers/remote/inference/passthrough/passthrough.py +++ b/llama_stack/providers/remote/inference/passthrough/passthrough.py @@ -14,8 +14,6 @@ from llama_stack.apis.inference import ( ChatCompletionResponse, ChatCompletionResponseStreamChunk, CompletionMessage, - EmbeddingsResponse, - EmbeddingTaskType, Inference, LogProbConfig, Message, @@ -27,7 +25,6 @@ from llama_stack.apis.inference import ( OpenAIResponseFormatParam, ResponseFormat, SamplingParams, - TextTruncation, ToolChoice, ToolConfig, ToolDefinition, @@ -190,25 +187,6 @@ class PassthroughInferenceAdapter(Inference): chunk = convert_to_pydantic(ChatCompletionResponseStreamChunk, chunk) yield chunk - async def embeddings( - self, - model_id: str, - contents: list[InterleavedContent], - text_truncation: TextTruncation | None = TextTruncation.none, - output_dimension: int | None = None, - task_type: EmbeddingTaskType | None = None, - ) -> EmbeddingsResponse: - client = self._get_client() - model = await self.model_store.get_model(model_id) - - return await client.inference.embeddings( - model_id=model.provider_resource_id, - contents=contents, - text_truncation=text_truncation, - output_dimension=output_dimension, - task_type=task_type, - ) - async def openai_embeddings( self, model: str, diff --git a/llama_stack/providers/remote/inference/runpod/runpod.py b/llama_stack/providers/remote/inference/runpod/runpod.py index ff2fe6401..82252b04d 100644 --- a/llama_stack/providers/remote/inference/runpod/runpod.py +++ b/llama_stack/providers/remote/inference/runpod/runpod.py @@ -136,16 +136,6 @@ class RunpodInferenceAdapter( **get_sampling_options(request.sampling_params), } - async def embeddings( - self, - model: str, - contents: list[str] | list[InterleavedContentItem], - text_truncation: TextTruncation | None = TextTruncation.none, - output_dimension: int | None = None, - task_type: EmbeddingTaskType | None = None, - ) -> EmbeddingsResponse: - raise NotImplementedError() - async def openai_embeddings( self, model: str, diff --git a/llama_stack/providers/remote/inference/tgi/tgi.py b/llama_stack/providers/remote/inference/tgi/tgi.py index 27597900f..e1632e4a0 100644 --- a/llama_stack/providers/remote/inference/tgi/tgi.py +++ b/llama_stack/providers/remote/inference/tgi/tgi.py @@ -12,14 +12,11 @@ from pydantic import SecretStr from llama_stack.apis.common.content_types import ( InterleavedContent, - InterleavedContentItem, ) from llama_stack.apis.inference import ( ChatCompletionRequest, ChatCompletionResponse, CompletionRequest, - EmbeddingsResponse, - EmbeddingTaskType, Inference, LogProbConfig, Message, @@ -27,7 +24,6 @@ from llama_stack.apis.inference import ( ResponseFormat, ResponseFormatType, SamplingParams, - TextTruncation, ToolChoice, ToolConfig, ToolDefinition, @@ -306,16 +302,6 @@ class _HfAdapter( **self._build_options(request.sampling_params, request.response_format), ) - async def embeddings( - self, - model_id: str, - contents: list[str] | list[InterleavedContentItem], - text_truncation: TextTruncation | None = TextTruncation.none, - output_dimension: int | None = None, - task_type: EmbeddingTaskType | None = None, - ) -> EmbeddingsResponse: - raise NotImplementedError() - async def openai_embeddings( self, model: str, diff --git a/llama_stack/providers/remote/inference/together/together.py b/llama_stack/providers/remote/inference/together/together.py index c199677be..083c528bb 100644 --- a/llama_stack/providers/remote/inference/together/together.py +++ 
b/llama_stack/providers/remote/inference/together/together.py @@ -12,14 +12,11 @@ from together.constants import BASE_URL from llama_stack.apis.common.content_types import ( InterleavedContent, - InterleavedContentItem, ) from llama_stack.apis.inference import ( ChatCompletionRequest, ChatCompletionResponse, CompletionRequest, - EmbeddingsResponse, - EmbeddingTaskType, Inference, LogProbConfig, Message, @@ -27,7 +24,6 @@ from llama_stack.apis.inference import ( ResponseFormat, ResponseFormatType, SamplingParams, - TextTruncation, ToolChoice, ToolConfig, ToolDefinition, @@ -50,8 +46,6 @@ from llama_stack.providers.utils.inference.openai_mixin import OpenAIMixin from llama_stack.providers.utils.inference.prompt_adapter import ( chat_completion_request_to_prompt, completion_request_to_prompt, - content_has_media, - interleaved_content_as_str, request_has_media, ) @@ -247,26 +241,6 @@ class TogetherInferenceAdapter(OpenAIMixin, ModelRegistryHelper, Inference, Need logger.debug(f"params to together: {params}") return params - async def embeddings( - self, - model_id: str, - contents: list[str] | list[InterleavedContentItem], - text_truncation: TextTruncation | None = TextTruncation.none, - output_dimension: int | None = None, - task_type: EmbeddingTaskType | None = None, - ) -> EmbeddingsResponse: - model = await self.model_store.get_model(model_id) - assert all(not content_has_media(content) for content in contents), ( - "Together does not support media for embeddings" - ) - client = self._get_client() - r = await client.embeddings.create( - model=model.provider_resource_id, - input=[interleaved_content_as_str(content) for content in contents], - ) - embeddings = [item.embedding for item in r.data] - return EmbeddingsResponse(embeddings=embeddings) - async def list_models(self) -> list[Model] | None: self._model_cache = {} # Together's /v1/models is not compatible with OpenAI's /v1/models. 
Together support ticket #13355 -> will not fix, use Together's own client diff --git a/llama_stack/providers/remote/inference/vllm/vllm.py b/llama_stack/providers/remote/inference/vllm/vllm.py index 8fbb4b815..bef5cbf2c 100644 --- a/llama_stack/providers/remote/inference/vllm/vllm.py +++ b/llama_stack/providers/remote/inference/vllm/vllm.py @@ -16,7 +16,6 @@ from openai.types.chat.chat_completion_chunk import ( from llama_stack.apis.common.content_types import ( InterleavedContent, - InterleavedContentItem, TextDelta, ToolCallDelta, ToolCallParseStatus, @@ -31,8 +30,6 @@ from llama_stack.apis.inference import ( CompletionRequest, CompletionResponse, CompletionResponseStreamChunk, - EmbeddingsResponse, - EmbeddingTaskType, GrammarResponseFormat, Inference, JsonSchemaResponseFormat, @@ -41,7 +38,6 @@ from llama_stack.apis.inference import ( ModelStore, ResponseFormat, SamplingParams, - TextTruncation, ToolChoice, ToolConfig, ToolDefinition, @@ -74,8 +70,6 @@ from llama_stack.providers.utils.inference.openai_compat import ( from llama_stack.providers.utils.inference.openai_mixin import OpenAIMixin from llama_stack.providers.utils.inference.prompt_adapter import ( completion_request_to_prompt, - content_has_media, - interleaved_content_as_str, request_has_media, ) @@ -550,27 +544,3 @@ class VLLMInferenceAdapter(OpenAIMixin, LiteLLMOpenAIMixin, Inference, ModelsPro "stream": request.stream, **options, } - - async def embeddings( - self, - model_id: str, - contents: list[str] | list[InterleavedContentItem], - text_truncation: TextTruncation | None = TextTruncation.none, - output_dimension: int | None = None, - task_type: EmbeddingTaskType | None = None, - ) -> EmbeddingsResponse: - model = await self._get_model(model_id) - - kwargs = {} - assert model.model_type == ModelType.embedding - assert model.metadata.get("embedding_dimension") - kwargs["dimensions"] = model.metadata.get("embedding_dimension") - assert all(not content_has_media(content) for content in contents), "VLLM does not support media for embeddings" - response = await self.client.embeddings.create( - model=model.provider_resource_id, - input=[interleaved_content_as_str(content) for content in contents], - **kwargs, - ) - - embeddings = [data.embedding for data in response.data] - return EmbeddingsResponse(embeddings=embeddings) diff --git a/llama_stack/providers/remote/inference/watsonx/watsonx.py b/llama_stack/providers/remote/inference/watsonx/watsonx.py index cb8b45565..00b9acc06 100644 --- a/llama_stack/providers/remote/inference/watsonx/watsonx.py +++ b/llama_stack/providers/remote/inference/watsonx/watsonx.py @@ -11,13 +11,11 @@ from ibm_watsonx_ai.foundation_models import Model from ibm_watsonx_ai.metanames import GenTextParamsMetaNames as GenParams from openai import AsyncOpenAI -from llama_stack.apis.common.content_types import InterleavedContent, InterleavedContentItem +from llama_stack.apis.common.content_types import InterleavedContent from llama_stack.apis.inference import ( ChatCompletionRequest, ChatCompletionResponse, CompletionRequest, - EmbeddingsResponse, - EmbeddingTaskType, GreedySamplingStrategy, Inference, LogProbConfig, @@ -30,7 +28,6 @@ from llama_stack.apis.inference import ( OpenAIResponseFormatParam, ResponseFormat, SamplingParams, - TextTruncation, ToolChoice, ToolConfig, ToolDefinition, @@ -265,16 +262,6 @@ class WatsonXInferenceAdapter(Inference, ModelRegistryHelper): } return params - async def embeddings( - self, - model_id: str, - contents: list[str] | list[InterleavedContentItem], - 
text_truncation: TextTruncation | None = TextTruncation.none, - output_dimension: int | None = None, - task_type: EmbeddingTaskType | None = None, - ) -> EmbeddingsResponse: - raise NotImplementedError("embedding is not supported for watsonx") - async def openai_embeddings( self, model: str, diff --git a/llama_stack/providers/utils/inference/embedding_mixin.py b/llama_stack/providers/utils/inference/embedding_mixin.py index 9bd0aa8ce..facc59f65 100644 --- a/llama_stack/providers/utils/inference/embedding_mixin.py +++ b/llama_stack/providers/utils/inference/embedding_mixin.py @@ -15,16 +15,11 @@ if TYPE_CHECKING: from sentence_transformers import SentenceTransformer from llama_stack.apis.inference import ( - EmbeddingsResponse, - EmbeddingTaskType, - InterleavedContentItem, ModelStore, OpenAIEmbeddingData, OpenAIEmbeddingsResponse, OpenAIEmbeddingUsage, - TextTruncation, ) -from llama_stack.providers.utils.inference.prompt_adapter import interleaved_content_as_str EMBEDDING_MODELS = {} @@ -35,23 +30,6 @@ log = get_logger(name=__name__, category="providers::utils") class SentenceTransformerEmbeddingMixin: model_store: ModelStore - async def embeddings( - self, - model_id: str, - contents: list[str] | list[InterleavedContentItem], - text_truncation: TextTruncation | None = TextTruncation.none, - output_dimension: int | None = None, - task_type: EmbeddingTaskType | None = None, - ) -> EmbeddingsResponse: - model = await self.model_store.get_model(model_id) - embedding_model = await self._load_sentence_transformer_model(model.provider_resource_id) - embeddings = await asyncio.to_thread( - embedding_model.encode, - [interleaved_content_as_str(content) for content in contents], - show_progress_bar=False, - ) - return EmbeddingsResponse(embeddings=embeddings) - async def openai_embeddings( self, model: str, diff --git a/llama_stack/providers/utils/inference/litellm_openai_mixin.py b/llama_stack/providers/utils/inference/litellm_openai_mixin.py index b1e38f323..966081e9f 100644 --- a/llama_stack/providers/utils/inference/litellm_openai_mixin.py +++ b/llama_stack/providers/utils/inference/litellm_openai_mixin.py @@ -11,14 +11,11 @@ import litellm from llama_stack.apis.common.content_types import ( InterleavedContent, - InterleavedContentItem, ) from llama_stack.apis.inference import ( ChatCompletionRequest, ChatCompletionResponse, ChatCompletionResponseStreamChunk, - EmbeddingsResponse, - EmbeddingTaskType, InferenceProvider, JsonSchemaResponseFormat, LogProbConfig, @@ -32,7 +29,6 @@ from llama_stack.apis.inference import ( OpenAIResponseFormatParam, ResponseFormat, SamplingParams, - TextTruncation, ToolChoice, ToolConfig, ToolDefinition, @@ -50,9 +46,6 @@ from llama_stack.providers.utils.inference.openai_compat import ( get_sampling_options, prepare_openai_completion_params, ) -from llama_stack.providers.utils.inference.prompt_adapter import ( - interleaved_content_as_str, -) logger = get_logger(name=__name__, category="providers::utils") @@ -269,24 +262,6 @@ class LiteLLMOpenAIMixin( ) return api_key - async def embeddings( - self, - model_id: str, - contents: list[str] | list[InterleavedContentItem], - text_truncation: TextTruncation | None = TextTruncation.none, - output_dimension: int | None = None, - task_type: EmbeddingTaskType | None = None, - ) -> EmbeddingsResponse: - model = await self.model_store.get_model(model_id) - - response = litellm.embedding( - model=self.get_litellm_model_name(model.provider_resource_id), - input=[interleaved_content_as_str(content) for content in contents], - ) 
- - embeddings = [data["embedding"] for data in response["data"]] - return EmbeddingsResponse(embeddings=embeddings) - async def openai_embeddings( self, model: str, diff --git a/tests/unit/providers/vector_io/test_faiss.py b/tests/unit/providers/vector_io/test_faiss.py index 90108d7a0..9ee5c82f4 100644 --- a/tests/unit/providers/vector_io/test_faiss.py +++ b/tests/unit/providers/vector_io/test_faiss.py @@ -5,13 +5,12 @@ # the root directory of this source tree. import asyncio -from unittest.mock import AsyncMock, MagicMock, patch +from unittest.mock import MagicMock, patch import numpy as np import pytest from llama_stack.apis.files import Files -from llama_stack.apis.inference import EmbeddingsResponse, Inference from llama_stack.apis.vector_dbs import VectorDB from llama_stack.apis.vector_io import Chunk, QueryChunksResponse from llama_stack.providers.datatypes import HealthStatus @@ -70,13 +69,6 @@ def mock_vector_db(vector_db_id, embedding_dimension) -> MagicMock: return mock_vector_db -@pytest.fixture -def mock_inference_api(sample_embeddings): - mock_api = MagicMock(spec=Inference) - mock_api.embeddings = AsyncMock(return_value=EmbeddingsResponse(embeddings=sample_embeddings)) - return mock_api - - @pytest.fixture def mock_files_api(): mock_api = MagicMock(spec=Files) @@ -96,22 +88,6 @@ async def faiss_index(embedding_dimension): yield index -@pytest.fixture -async def faiss_adapter(faiss_config, mock_inference_api, mock_files_api) -> FaissVectorIOAdapter: - # Create the adapter - adapter = FaissVectorIOAdapter(config=faiss_config, inference_api=mock_inference_api, files_api=mock_files_api) - - # Create a mock KVStore - mock_kvstore = MagicMock() - mock_kvstore.values_in_range = AsyncMock(return_value=[]) - - # Patch the initialize method to avoid the kvstore_impl call - with patch.object(FaissVectorIOAdapter, "initialize"): - # Set the kvstore directly - adapter.kvstore = mock_kvstore - yield adapter - - async def test_faiss_query_vector_returns_infinity_when_query_and_embedding_are_identical( faiss_index, sample_chunks, sample_embeddings, embedding_dimension ): From aac42ddcc2832133873ba1b7cd1d74996e21564a Mon Sep 17 00:00:00 2001 From: Charlie Doern Date: Mon, 29 Sep 2025 15:42:09 -0400 Subject: [PATCH 09/13] feat(api): level inference/rerank and remove experimental (#3565) # What does this PR do? inference/rerank is the one route in the API intended to not be deprecated. Level it as v1alpha. 
Additionally, remove `experimental` and opt to instead use `v1alpha` which itself implies an experimental state based on the original proposal Signed-off-by: Charlie Doern --- docs/static/llama-stack-spec.html | 2 +- docs/static/llama-stack-spec.yaml | 2 +- llama_stack/apis/inference/inference.py | 4 ++-- llama_stack/core/resolver.py | 11 +++++++++-- llama_stack/schema_utils.py | 4 ---- 5 files changed, 13 insertions(+), 10 deletions(-) diff --git a/docs/static/llama-stack-spec.html b/docs/static/llama-stack-spec.html index 7845fb068..32ead1764 100644 --- a/docs/static/llama-stack-spec.html +++ b/docs/static/llama-stack-spec.html @@ -5431,7 +5431,7 @@ } } }, - "/v1/inference/rerank": { + "/v1alpha/inference/rerank": { "post": { "responses": { "200": { diff --git a/docs/static/llama-stack-spec.yaml b/docs/static/llama-stack-spec.yaml index 8cbbccaa2..3b5b92060 100644 --- a/docs/static/llama-stack-spec.yaml +++ b/docs/static/llama-stack-spec.yaml @@ -3895,7 +3895,7 @@ paths: schema: $ref: '#/components/schemas/QueryTracesRequest' required: true - /v1/inference/rerank: + /v1alpha/inference/rerank: post: responses: '200': diff --git a/llama_stack/apis/inference/inference.py b/llama_stack/apis/inference/inference.py index c6a4e4f60..134da5bf8 100644 --- a/llama_stack/apis/inference/inference.py +++ b/llama_stack/apis/inference/inference.py @@ -21,7 +21,7 @@ from llama_stack.apis.common.content_types import ContentDelta, InterleavedConte from llama_stack.apis.common.responses import Order from llama_stack.apis.models import Model from llama_stack.apis.telemetry import MetricResponseMixin -from llama_stack.apis.version import LLAMA_STACK_API_V1 +from llama_stack.apis.version import LLAMA_STACK_API_V1, LLAMA_STACK_API_V1ALPHA from llama_stack.models.llama.datatypes import ( BuiltinTool, StopReason, @@ -1070,7 +1070,7 @@ class InferenceProvider(Protocol): """ ... 
- @webmethod(route="/inference/rerank", method="POST", experimental=True, level=LLAMA_STACK_API_V1) + @webmethod(route="/inference/rerank", method="POST", level=LLAMA_STACK_API_V1ALPHA) async def rerank( self, model: str, diff --git a/llama_stack/core/resolver.py b/llama_stack/core/resolver.py index 373446de6..f421c47ed 100644 --- a/llama_stack/core/resolver.py +++ b/llama_stack/core/resolver.py @@ -29,6 +29,7 @@ from llama_stack.apis.telemetry import Telemetry from llama_stack.apis.tools import ToolGroups, ToolRuntime from llama_stack.apis.vector_dbs import VectorDBs from llama_stack.apis.vector_io import VectorIO +from llama_stack.apis.version import LLAMA_STACK_API_V1ALPHA from llama_stack.core.client import get_client_impl from llama_stack.core.datatypes import ( AccessRule, @@ -412,8 +413,14 @@ def check_protocol_compliance(obj: Any, protocol: Any) -> None: mro = type(obj).__mro__ for name, value in inspect.getmembers(protocol): - if inspect.isfunction(value) and hasattr(value, "__webmethod__"): - if value.__webmethod__.experimental: + if inspect.isfunction(value) and hasattr(value, "__webmethods__"): + has_alpha_api = False + for webmethod in value.__webmethods__: + if webmethod.level == LLAMA_STACK_API_V1ALPHA: + has_alpha_api = True + break + # if this API has multiple webmethods, and one of them is an alpha API, this API should be skipped when checking for missing or not callable routes + if has_alpha_api: continue if not hasattr(obj, name): missing_methods.append((name, "missing")) diff --git a/llama_stack/schema_utils.py b/llama_stack/schema_utils.py index 4f8b4edff..c58fcdd01 100644 --- a/llama_stack/schema_utils.py +++ b/llama_stack/schema_utils.py @@ -22,7 +22,6 @@ class WebMethod: raw_bytes_request_body: bool | None = False # A descriptive name of the corresponding span created by tracing descriptive_name: str | None = None - experimental: bool | None = False required_scope: str | None = None deprecated: bool | None = False @@ -39,7 +38,6 @@ def webmethod( response_examples: list[Any] | None = None, raw_bytes_request_body: bool | None = False, descriptive_name: str | None = None, - experimental: bool | None = False, required_scope: str | None = None, deprecated: bool | None = False, ) -> Callable[[T], T]: @@ -50,7 +48,6 @@ def webmethod( :param public: True if the operation can be invoked without prior authentication. :param request_examples: Sample requests that the operation might take. Pass a list of objects, not JSON. :param response_examples: Sample responses that the operation might produce. Pass a list of objects, not JSON. - :param experimental: True if the operation is experimental and subject to change. :param required_scope: Required scope for this endpoint (e.g., 'monitoring.viewer'). """ @@ -64,7 +61,6 @@ def webmethod( response_examples=response_examples, raw_bytes_request_body=raw_bytes_request_body, descriptive_name=descriptive_name, - experimental=experimental, required_scope=required_scope, deprecated=deprecated, ) From 45f438c027176c783a1068894b9147eb008ffa5e Mon Sep 17 00:00:00 2001 From: Matthew Farrellee Date: Mon, 29 Sep 2025 16:11:37 -0400 Subject: [PATCH 10/13] chore: skip safety tests when shield not available (#3592) # What does this PR do? we skip embedding tests when the embedding_model_id isn't provided. same for completion / chat tests when text_model_id isn't given. instead of failing safety tests when a shield_id isn't provided, we'll skip them too. 
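A minimal sketch of how such an autouse skip fixture can work is below. The `model_fixtures` list and the signature introspection come from the diff that follows; the final skip call is an assumption about the part of the fixture body this patch does not touch, and it presumes the listed fixtures are defined elsewhere in the test suite's conftest.

```python
import inspect

import pytest


@pytest.fixture(autouse=True)
def skip_if_no_model(request):
    # Fixtures that name a model (or shield) the test depends on.
    model_fixtures = ["text_model_id", "vision_model_id", "embedding_model_id", "judge_model_id", "shield_id"]
    test_func = request.node.function
    actual_params = inspect.signature(test_func).parameters.keys()
    for fixture_name in model_fixtures:
        # Assumed skip logic (not shown in this diff): if the test requests one of
        # these fixtures but no value was provided, skip instead of failing.
        if fixture_name in actual_params and not request.getfixturevalue(fixture_name):
            pytest.skip(f"{fixture_name} not provided, skipping test")
```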
## Test Plan ci Co-authored-by: Ashwin Bharambe --- tests/integration/fixtures/common.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tests/integration/fixtures/common.py b/tests/integration/fixtures/common.py index ee4c5755a..835a701bc 100644 --- a/tests/integration/fixtures/common.py +++ b/tests/integration/fixtures/common.py @@ -166,7 +166,7 @@ def model_providers(llama_stack_client): @pytest.fixture(autouse=True) def skip_if_no_model(request): - model_fixtures = ["text_model_id", "vision_model_id", "embedding_model_id", "judge_model_id"] + model_fixtures = ["text_model_id", "vision_model_id", "embedding_model_id", "judge_model_id", "shield_id"] test_func = request.node.function actual_params = inspect.signature(test_func).parameters.keys() From 7c888fc0daa97fd550bb7cdc35f8fa1e3648c443 Mon Sep 17 00:00:00 2001 From: Matthew Farrellee Date: Mon, 29 Sep 2025 16:13:53 -0400 Subject: [PATCH 11/13] feat: update eval runner to use openai endpoints (#3588) # What does this PR do? move the eval=inline::meta-reference implementation to use openai_completion/openai_chat_completion note: this breaks backward compatibility if eval setup used sampling params' repetition_penalty or strategy ## Test Plan ci w/ new recordings Co-authored-by: Ashwin Bharambe --- .../inline/eval/meta_reference/eval.py | 31 ++++++---- .../recordings/responses/8d035e153b6f.json | 56 +++++++++++++++++++ .../recordings/responses/92a9a916ef02.json | 56 +++++++++++++++++++ .../recordings/responses/c62eb5d7115e.json | 56 +++++++++++++++++++ .../recordings/responses/e25ab43491af.json | 56 +++++++++++++++++++ .../recordings/responses/f28a44c97ea7.json | 56 +++++++++++++++++++ 6 files changed, 300 insertions(+), 11 deletions(-) create mode 100644 tests/integration/recordings/responses/8d035e153b6f.json create mode 100644 tests/integration/recordings/responses/92a9a916ef02.json create mode 100644 tests/integration/recordings/responses/c62eb5d7115e.json create mode 100644 tests/integration/recordings/responses/e25ab43491af.json create mode 100644 tests/integration/recordings/responses/f28a44c97ea7.json diff --git a/llama_stack/providers/inline/eval/meta_reference/eval.py b/llama_stack/providers/inline/eval/meta_reference/eval.py index a03e8951c..0dfe23dca 100644 --- a/llama_stack/providers/inline/eval/meta_reference/eval.py +++ b/llama_stack/providers/inline/eval/meta_reference/eval.py @@ -12,7 +12,7 @@ from llama_stack.apis.agents import Agents, StepType from llama_stack.apis.benchmarks import Benchmark from llama_stack.apis.datasetio import DatasetIO from llama_stack.apis.datasets import Datasets -from llama_stack.apis.inference import Inference, SystemMessage, UserMessage +from llama_stack.apis.inference import Inference, OpenAISystemMessageParam, OpenAIUserMessageParam, UserMessage from llama_stack.apis.scoring import Scoring from llama_stack.providers.datatypes import BenchmarksProtocolPrivate from llama_stack.providers.inline.agents.meta_reference.agent_instance import ( @@ -159,31 +159,40 @@ class MetaReferenceEvalImpl( ) -> list[dict[str, Any]]: candidate = benchmark_config.eval_candidate assert candidate.sampling_params.max_tokens is not None, "SamplingParams.max_tokens must be provided" + sampling_params = {"max_tokens": candidate.sampling_params.max_tokens} generations = [] for x in tqdm(input_rows): if ColumnName.completion_input.value in x: + if candidate.sampling_params.stop: + sampling_params["stop"] = candidate.sampling_params.stop + input_content = 
json.loads(x[ColumnName.completion_input.value]) - response = await self.inference_api.completion( + response = await self.inference_api.openai_completion( model=candidate.model, - content=input_content, - sampling_params=candidate.sampling_params, + prompt=input_content, + **sampling_params, ) - generations.append({ColumnName.generated_answer.value: response.completion_message.content}) + generations.append({ColumnName.generated_answer.value: response.choices[0].text}) elif ColumnName.chat_completion_input.value in x: chat_completion_input_json = json.loads(x[ColumnName.chat_completion_input.value]) - input_messages = [UserMessage(**x) for x in chat_completion_input_json if x["role"] == "user"] + input_messages = [ + OpenAIUserMessageParam(**x) for x in chat_completion_input_json if x["role"] == "user" + ] + messages = [] if candidate.system_message: messages.append(candidate.system_message) - messages += [SystemMessage(**x) for x in chat_completion_input_json if x["role"] == "system"] + + messages += [OpenAISystemMessageParam(**x) for x in chat_completion_input_json if x["role"] == "system"] + messages += input_messages - response = await self.inference_api.chat_completion( - model_id=candidate.model, + response = await self.inference_api.openai_chat_completion( + model=candidate.model, messages=messages, - sampling_params=candidate.sampling_params, + **sampling_params, ) - generations.append({ColumnName.generated_answer.value: response.completion_message.content}) + generations.append({ColumnName.generated_answer.value: response.choices[0].message.content}) else: raise ValueError("Invalid input row") diff --git a/tests/integration/recordings/responses/8d035e153b6f.json b/tests/integration/recordings/responses/8d035e153b6f.json new file mode 100644 index 000000000..18f3ee3cd --- /dev/null +++ b/tests/integration/recordings/responses/8d035e153b6f.json @@ -0,0 +1,56 @@ +{ + "request": { + "method": "POST", + "url": "http://0.0.0.0:11434/v1/v1/chat/completions", + "headers": {}, + "body": { + "model": "llama3.2:3b-instruct-fp16", + "messages": [ + { + "role": "user", + "content": "Who is the CEO of Meta?" 
+ } + ], + "max_tokens": 0 + }, + "endpoint": "/v1/chat/completions", + "model": "llama3.2:3b-instruct-fp16" + }, + "response": { + "body": { + "__type__": "openai.types.chat.chat_completion.ChatCompletion", + "__data__": { + "id": "chatcmpl-708", + "choices": [ + { + "finish_reason": "stop", + "index": 0, + "logprobs": null, + "message": { + "content": "Mark Zuckerberg is the founder, chairman and CEO of Meta, which he originally founded as Facebook in 2004.", + "refusal": null, + "role": "assistant", + "annotations": null, + "audio": null, + "function_call": null, + "tool_calls": null + } + } + ], + "created": 1759012142, + "model": "llama3.2:3b-instruct-fp16", + "object": "chat.completion", + "service_tier": null, + "system_fingerprint": "fp_ollama", + "usage": { + "completion_tokens": 24, + "prompt_tokens": 32, + "total_tokens": 56, + "completion_tokens_details": null, + "prompt_tokens_details": null + } + } + }, + "is_streaming": false + } +} diff --git a/tests/integration/recordings/responses/92a9a916ef02.json b/tests/integration/recordings/responses/92a9a916ef02.json new file mode 100644 index 000000000..5fe294826 --- /dev/null +++ b/tests/integration/recordings/responses/92a9a916ef02.json @@ -0,0 +1,56 @@ +{ + "request": { + "method": "POST", + "url": "http://0.0.0.0:11434/v1/v1/chat/completions", + "headers": {}, + "body": { + "model": "llama3.2:3b-instruct-fp16", + "messages": [ + { + "role": "user", + "content": "What is the currency of Japan?" + } + ], + "max_tokens": 0 + }, + "endpoint": "/v1/chat/completions", + "model": "llama3.2:3b-instruct-fp16" + }, + "response": { + "body": { + "__type__": "openai.types.chat.chat_completion.ChatCompletion", + "__data__": { + "id": "chatcmpl-343", + "choices": [ + { + "finish_reason": "stop", + "index": 0, + "logprobs": null, + "message": { + "content": "The currency of Japan is the Japanese yen (, ry\u014d) and its symbol, \u00a5.", + "refusal": null, + "role": "assistant", + "annotations": null, + "audio": null, + "function_call": null, + "tool_calls": null + } + } + ], + "created": 1759012146, + "model": "llama3.2:3b-instruct-fp16", + "object": "chat.completion", + "service_tier": null, + "system_fingerprint": "fp_ollama", + "usage": { + "completion_tokens": 20, + "prompt_tokens": 32, + "total_tokens": 52, + "completion_tokens_details": null, + "prompt_tokens_details": null + } + } + }, + "is_streaming": false + } +} diff --git a/tests/integration/recordings/responses/c62eb5d7115e.json b/tests/integration/recordings/responses/c62eb5d7115e.json new file mode 100644 index 000000000..fa872ac44 --- /dev/null +++ b/tests/integration/recordings/responses/c62eb5d7115e.json @@ -0,0 +1,56 @@ +{ + "request": { + "method": "POST", + "url": "http://0.0.0.0:11434/v1/v1/chat/completions", + "headers": {}, + "body": { + "model": "llama3.2:3b-instruct-fp16", + "messages": [ + { + "role": "user", + "content": "What is the smallest country in the world?" + } + ], + "max_tokens": 0 + }, + "endpoint": "/v1/chat/completions", + "model": "llama3.2:3b-instruct-fp16" + }, + "response": { + "body": { + "__type__": "openai.types.chat.chat_completion.ChatCompletion", + "__data__": { + "id": "chatcmpl-842", + "choices": [ + { + "finish_reason": "stop", + "index": 0, + "logprobs": null, + "message": { + "content": "The smallest country in the world is the Vatican City, an independent city-state located within Rome, Italy. 
It has a total area of approximately 0.44 km\u00b2 (0.17 sq mi) and a population of around 800 people.\n\nDespite its tiny size, the Vatican City is a sovereign state with its own government, currency, postal system, and even a small army (the Gendarmeria Romana). It's also home to numerous iconic landmarks, including St. Peter's Basilica, the Sistine Chapel, and the Vatican Museums.\n\nThe Vatican City is so small that it can fit entirely within an average American city park!", + "refusal": null, + "role": "assistant", + "annotations": null, + "audio": null, + "function_call": null, + "tool_calls": null + } + } + ], + "created": 1759012145, + "model": "llama3.2:3b-instruct-fp16", + "object": "chat.completion", + "service_tier": null, + "system_fingerprint": "fp_ollama", + "usage": { + "completion_tokens": 133, + "prompt_tokens": 34, + "total_tokens": 167, + "completion_tokens_details": null, + "prompt_tokens_details": null + } + } + }, + "is_streaming": false + } +} diff --git a/tests/integration/recordings/responses/e25ab43491af.json b/tests/integration/recordings/responses/e25ab43491af.json new file mode 100644 index 000000000..9fb331942 --- /dev/null +++ b/tests/integration/recordings/responses/e25ab43491af.json @@ -0,0 +1,56 @@ +{ + "request": { + "method": "POST", + "url": "http://0.0.0.0:11434/v1/v1/chat/completions", + "headers": {}, + "body": { + "model": "llama3.2:3b-instruct-fp16", + "messages": [ + { + "role": "user", + "content": "What is the capital of France?" + } + ], + "max_tokens": 0 + }, + "endpoint": "/v1/chat/completions", + "model": "llama3.2:3b-instruct-fp16" + }, + "response": { + "body": { + "__type__": "openai.types.chat.chat_completion.ChatCompletion", + "__data__": { + "id": "chatcmpl-808", + "choices": [ + { + "finish_reason": "stop", + "index": 0, + "logprobs": null, + "message": { + "content": "The capital of France is Paris.", + "refusal": null, + "role": "assistant", + "annotations": null, + "audio": null, + "function_call": null, + "tool_calls": null + } + } + ], + "created": 1759012142, + "model": "llama3.2:3b-instruct-fp16", + "object": "chat.completion", + "service_tier": null, + "system_fingerprint": "fp_ollama", + "usage": { + "completion_tokens": 8, + "prompt_tokens": 32, + "total_tokens": 40, + "completion_tokens_details": null, + "prompt_tokens_details": null + } + } + }, + "is_streaming": false + } +} diff --git a/tests/integration/recordings/responses/f28a44c97ea7.json b/tests/integration/recordings/responses/f28a44c97ea7.json new file mode 100644 index 000000000..d50851dfd --- /dev/null +++ b/tests/integration/recordings/responses/f28a44c97ea7.json @@ -0,0 +1,56 @@ +{ + "request": { + "method": "POST", + "url": "http://0.0.0.0:11434/v1/v1/chat/completions", + "headers": {}, + "body": { + "model": "llama3.2:3b-instruct-fp16", + "messages": [ + { + "role": "user", + "content": "What is the largest planet in our solar system?" + } + ], + "max_tokens": 0 + }, + "endpoint": "/v1/chat/completions", + "model": "llama3.2:3b-instruct-fp16" + }, + "response": { + "body": { + "__type__": "openai.types.chat.chat_completion.ChatCompletion", + "__data__": { + "id": "chatcmpl-282", + "choices": [ + { + "finish_reason": "stop", + "index": 0, + "logprobs": null, + "message": { + "content": "The largest planet in our solar system is Jupiter. It is a gas giant, with a diameter of approximately 142,984 kilometers (88,846 miles). 
This makes it more than 11 times the diameter of the Earth and more than 2.5 times the mass of all the other planets in our solar system combined.", + "refusal": null, + "role": "assistant", + "annotations": null, + "audio": null, + "function_call": null, + "tool_calls": null + } + } + ], + "created": 1759012143, + "model": "llama3.2:3b-instruct-fp16", + "object": "chat.completion", + "service_tier": null, + "system_fingerprint": "fp_ollama", + "usage": { + "completion_tokens": 67, + "prompt_tokens": 35, + "total_tokens": 102, + "completion_tokens_details": null, + "prompt_tokens_details": null + } + } + }, + "is_streaming": false + } +} From 498be131a182ff9d6eddd8ebaa8869d3d4ceb123 Mon Sep 17 00:00:00 2001 From: Alexey Rybak <50731695+reluctantfuturist@users.noreply.github.com> Date: Mon, 29 Sep 2025 13:14:05 -0700 Subject: [PATCH 12/13] docs: update image paths (#3599) # What does this PR do? * Updates image paths for images in docs/resources/ to proper static image locations ## Test Plan * `npm run build` builds documentation properly --- docs/docs/references/llama_cli_reference/index.md | 2 +- docs/getting_started.ipynb | 2 +- docs/{resources => static/img}/agentic-system.png | Bin docs/{resources => static/img}/list-templates.png | Bin docs/{resources => static/img}/llama-stack.png | Bin docs/{resources => static/img}/model-lifecycle.png | Bin docs/{resources => static/img}/prompt-format.png | Bin 7 files changed, 2 insertions(+), 2 deletions(-) rename docs/{resources => static/img}/agentic-system.png (100%) rename docs/{resources => static/img}/list-templates.png (100%) rename docs/{resources => static/img}/llama-stack.png (100%) rename docs/{resources => static/img}/model-lifecycle.png (100%) rename docs/{resources => static/img}/prompt-format.png (100%) diff --git a/docs/docs/references/llama_cli_reference/index.md b/docs/docs/references/llama_cli_reference/index.md index fe3aa51ab..9b71a6795 100644 --- a/docs/docs/references/llama_cli_reference/index.md +++ b/docs/docs/references/llama_cli_reference/index.md @@ -261,7 +261,7 @@ You can even run `llama model prompt-format` see all of the templates and their ``` llama model prompt-format -m Llama3.2-3B-Instruct ``` -![alt text](../../../resources/prompt-format.png) +![alt text](/img/prompt-format.png) You will be shown a Markdown formatted description of the model interface and how prompts / messages are formatted for various scenarios. 
diff --git a/docs/getting_started.ipynb b/docs/getting_started.ipynb index 641cf4224..449bd2be1 100644 --- a/docs/getting_started.ipynb +++ b/docs/getting_started.ipynb @@ -1013,7 +1013,7 @@ "\n", "\n", "\n", - "\"drawing\"\n", + "\"drawing\"\n", "\n", "\n", "Agents are characterized by having access to\n", diff --git a/docs/resources/agentic-system.png b/docs/static/img/agentic-system.png similarity index 100% rename from docs/resources/agentic-system.png rename to docs/static/img/agentic-system.png diff --git a/docs/resources/list-templates.png b/docs/static/img/list-templates.png similarity index 100% rename from docs/resources/list-templates.png rename to docs/static/img/list-templates.png diff --git a/docs/resources/llama-stack.png b/docs/static/img/llama-stack.png similarity index 100% rename from docs/resources/llama-stack.png rename to docs/static/img/llama-stack.png diff --git a/docs/resources/model-lifecycle.png b/docs/static/img/model-lifecycle.png similarity index 100% rename from docs/resources/model-lifecycle.png rename to docs/static/img/model-lifecycle.png diff --git a/docs/resources/prompt-format.png b/docs/static/img/prompt-format.png similarity index 100% rename from docs/resources/prompt-format.png rename to docs/static/img/prompt-format.png From e9eb004bf8d6fbd466793e93370eb90e98f28552 Mon Sep 17 00:00:00 2001 From: Matthew Farrellee Date: Mon, 29 Sep 2025 16:14:41 -0400 Subject: [PATCH 13/13] fix: remove inference.completion from docs (#3589) # What does this PR do? now that /v1/inference/completion has been removed, no docs should refer to it this cleans up remaining references ## Test Plan ci Co-authored-by: Ashwin Bharambe --- .../references/python_sdk_reference/index.md | 1 - docs/getting_started.ipynb | 14 ++----- .../Llama_Stack_NVIDIA_E2E_Flow.ipynb | 38 +++++++------------ .../remote/inference/nvidia/NVIDIA.md | 19 ---------- .../remote/post_training/nvidia/README.md | 12 +++--- tests/integration/README.md | 6 +-- 6 files changed, 26 insertions(+), 64 deletions(-) diff --git a/docs/docs/references/python_sdk_reference/index.md b/docs/docs/references/python_sdk_reference/index.md index e0b29363e..bce87e14a 100644 --- a/docs/docs/references/python_sdk_reference/index.md +++ b/docs/docs/references/python_sdk_reference/index.md @@ -217,7 +217,6 @@ from llama_stack_client.types import ( Methods: - client.inference.chat_completion(\*\*params) -> InferenceChatCompletionResponse -- client.inference.completion(\*\*params) -> InferenceCompletionResponse - client.inference.embeddings(\*\*params) -> EmbeddingsResponse ## VectorIo diff --git a/docs/getting_started.ipynb b/docs/getting_started.ipynb index 449bd2be1..56aef2b7d 100644 --- a/docs/getting_started.ipynb +++ b/docs/getting_started.ipynb @@ -824,16 +824,10 @@ "\n", "\n", "user_input = \"Michael Jordan was born in 1963. He played basketball for the Chicago Bulls. He retired in 2003. Extract this information into JSON for me. 
\"\n", - "response = client.inference.completion(\n", - " model_id=\"meta-llama/Llama-3.1-8B-Instruct\",\n", - " content=user_input,\n", - " stream=False,\n", - " sampling_params={\n", - " \"strategy\": {\n", - " \"type\": \"greedy\",\n", - " },\n", - " \"max_tokens\": 50,\n", - " },\n", + "response = client.chat.completions.create(\n", + " model=\"meta-llama/Llama-3.1-8B-Instruct\",\n", + " messages=[{\"role\": \"user\", \"content\": user_input}],\n", + " max_tokens=50,\n", " response_format={\n", " \"type\": \"json_schema\",\n", " \"json_schema\": Output.model_json_schema(),\n", diff --git a/docs/notebooks/nvidia/beginner_e2e/Llama_Stack_NVIDIA_E2E_Flow.ipynb b/docs/notebooks/nvidia/beginner_e2e/Llama_Stack_NVIDIA_E2E_Flow.ipynb index d8f29d999..601276526 100644 --- a/docs/notebooks/nvidia/beginner_e2e/Llama_Stack_NVIDIA_E2E_Flow.ipynb +++ b/docs/notebooks/nvidia/beginner_e2e/Llama_Stack_NVIDIA_E2E_Flow.ipynb @@ -706,20 +706,15 @@ " provider_id=\"nvidia\",\n", ")\n", "\n", - "response = client.inference.completion(\n", - " content=\"Complete the sentence using one word: Roses are red, violets are \",\n", + "response = client.completions.create(\n", + " prompt=\"Complete the sentence using one word: Roses are red, violets are \",\n", " stream=False,\n", - " model_id=CUSTOMIZED_MODEL_DIR,\n", - " sampling_params={\n", - " \"strategy\": {\n", - " \"type\": \"top_p\",\n", - " \"temperature\": 0.7,\n", - " \"top_p\": 0.9\n", - " },\n", - " \"max_tokens\": 20,\n", - " },\n", + " model=CUSTOMIZED_MODEL_DIR,\n", + " temperature=0.7,\n", + " top_p=0.9,\n", + " max_tokens=20,\n", ")\n", - "print(f\"Inference response: {response.content}\")" + "print(f\"Inference response: {response.choices[0].text}\")" ] }, { @@ -1233,20 +1228,15 @@ " provider_id=\"nvidia\",\n", ")\n", "\n", - "response = client.inference.completion(\n", - " content=\"Complete the sentence using one word: Roses are red, violets are \",\n", + "response = client.completions.create(\n", + " prompt=\"Complete the sentence using one word: Roses are red, violets are \",\n", " stream=False,\n", - " model_id=customized_chat_model_dir,\n", - " sampling_params={\n", - " \"strategy\": {\n", - " \"type\": \"top_p\",\n", - " \"temperature\": 0.7,\n", - " \"top_p\": 0.9\n", - " },\n", - " \"max_tokens\": 20,\n", - " },\n", + " model=customized_chat_model_dir,\n", + " temperature=0.7,\n", + " top_p=0.9,\n", + " max_tokens=20,\n", ")\n", - "print(f\"Inference response: {response.content}\")" + "print(f\"Inference response: {response.choices[0].text}\")" ] }, { diff --git a/llama_stack/providers/remote/inference/nvidia/NVIDIA.md b/llama_stack/providers/remote/inference/nvidia/NVIDIA.md index d9c18533a..4cb2dc394 100644 --- a/llama_stack/providers/remote/inference/nvidia/NVIDIA.md +++ b/llama_stack/providers/remote/inference/nvidia/NVIDIA.md @@ -39,25 +39,6 @@ client = LlamaStackAsLibraryClient("nvidia") client.initialize() ``` -### Create Completion - -The following example shows how to create a completion for an NVIDIA NIM. - -> [!NOTE] -> The hosted NVIDIA Llama NIMs (for example ```meta-llama/Llama-3.1-8B-Instruct```) that have ```NVIDIA_BASE_URL="https://integrate.api.nvidia.com"``` do not support the ```completion``` method, while locally deployed NIMs do. 
- -```python -response = client.inference.completion( - model_id="meta-llama/Llama-3.1-8B-Instruct", - content="Complete the sentence using one word: Roses are red, violets are :", - stream=False, - sampling_params={ - "max_tokens": 50, - }, -) -print(f"Response: {response.content}") -``` - ### Create Chat Completion The following example shows how to create a chat completion for an NVIDIA NIM. diff --git a/llama_stack/providers/remote/post_training/nvidia/README.md b/llama_stack/providers/remote/post_training/nvidia/README.md index 6647316df..9b088a615 100644 --- a/llama_stack/providers/remote/post_training/nvidia/README.md +++ b/llama_stack/providers/remote/post_training/nvidia/README.md @@ -140,13 +140,11 @@ client.models.register( #### 2. Inference with the fine-tuned model ```python -response = client.inference.completion( - content="Complete the sentence using one word: Roses are red, violets are ", +response = client.completions.create( + prompt="Complete the sentence using one word: Roses are red, violets are ", stream=False, - model_id="test-example-model@v1", - sampling_params={ - "max_tokens": 50, - }, + model="test-example-model@v1", + max_tokens=50, ) -print(response.content) +print(response.choices[0].text) ``` diff --git a/tests/integration/README.md b/tests/integration/README.md index 467f97e02..b68526410 100644 --- a/tests/integration/README.md +++ b/tests/integration/README.md @@ -178,10 +178,10 @@ Note that when re-recording tests, you must use a Stack pointing to a server (i. ### Basic Test Pattern ```python -def test_basic_completion(llama_stack_client, text_model_id): - response = llama_stack_client.inference.completion( +def test_basic_chat_completion(llama_stack_client, text_model_id): + response = llama_stack_client.inference.chat_completion( model_id=text_model_id, - content=CompletionMessage(role="user", content="Hello"), + messages=[{"role": "user", "content": "Hello"}], ) # Test structure, not AI output quality