llama-stack

forked from phoenix-oss/llama-stack-mirror

Author	SHA1	Message	Date
Ashwin Bharambe	2089427d60	Make all methods `async def` again; add completion() for meta-reference (#270 ) PR #201 had made several changes while trying to fix issues with getting the stream=False branches of inference and agents API working. As part of this, it made a change which was slightly gratuitous. Namely, making chat_completion() and brethren "def" instead of "async def". The rationale was that this allowed the user (within llama-stack) of this to use it as: ``` async for chunk in api.chat_completion(params) ``` However, it causes unnecessary confusion for several folks. Given that clients (e.g., llama-stack-apps) anyway use the SDK methods (which are completely isolated) this choice was not ideal. Let's revert back so the call now looks like: ``` async for chunk in await api.chat_completion(params) ``` Bonus: Added a completion() implementation for the meta-reference provider. Technically should have been another PR :)	2024-10-18 20:50:59 -07:00
Ashwin Bharambe	71a905e93f	Allow overridding checkpoint_dir via config	2024-10-18 14:28:06 -07:00
Ashwin Bharambe	33afd34e6f	Add an option to not use elastic agents for meta-reference inference (#269 )	2024-10-18 12:51:10 -07:00
Ashwin Bharambe	09b793c4d6	Fix fp8 implementation which had bit-rotten a bit I only tested with "on-the-fly" bf16 -> fp8 conversion, not the "load from fp8" codepath. YAML I tested with: ``` providers: - provider_id: quantized provider_type: meta-reference-quantized config: model: Llama3.1-8B-Instruct quantization: type: fp8 ```	2024-10-15 13:57:01 -07:00
Yuan Tang	2128e61da2	Fix incorrect completion() signature for Databricks provider (#236 )	2024-10-11 08:47:57 -07:00
Xi Yan	ca29980c6b	fix agents context retriever	2024-10-10 20:17:29 -07:00
Ashwin Bharambe	1ff0476002	Split off meta-reference-quantized provider	2024-10-10 16:03:19 -07:00
Ashwin Bharambe	6bb57e72a7	Remove "routing_table" and "routing_key" concepts for the user (#201 ) This PR makes several core changes to the developer experience surrounding Llama Stack. Background: PR #92 introduced the notion of "routing" to the Llama Stack. It introduces three object types: (1) models, (2) shields and (3) memory banks. Each of these objects can be associated with a distinct provider. So you can get model A to be inferenced locally while model B, C can be inference remotely (e.g.) However, this had a few drawbacks: you could not address the provider instances -- i.e., if you configured "meta-reference" with a given model, you could not assign an identifier to this instance which you could re-use later. the above meant that you could not register a "routing_key" (e.g. model) dynamically and say "please use this existing provider I have already configured" for a new model. the terms "routing_table" and "routing_key" were exposed directly to the user. in my view, this is way too much overhead for a new user (which almost everyone is.) people come to the stack wanting to do ML and encounter a completely unexpected term. What this PR does: This PR structures the run config with only a single prominent key: - providers Providers are instances of configured provider types. Here's an example which shows two instances of the remote::tgi provider which are serving two different models. providers: inference: - provider_id: foo provider_type: remote::tgi config: { ... } - provider_id: bar provider_type: remote::tgi config: { ... } Secondly, the PR adds dynamic registration of { models \| shields \| memory_banks } to the API surface. The distribution still acts like a "routing table" (as previously) except that it asks the backing providers for a listing of these objects. For example it asks a TGI or Ollama inference adapter what models it is serving. Only the models that are being actually served can be requested by the user for inference. Otherwise, the Stack server will throw an error. When dynamically registering these objects, you can use the provider IDs shown above. Info about providers can be obtained using the Api.inspect set of endpoints (/providers, /routes, etc.) The above examples shows the correspondence between inference providers and models registry items. Things work similarly for the safety <=> shields and memory <=> memory_banks pairs. Registry: This PR also makes it so that Providers need to implement additional methods for registering and listing objects. For example, each Inference provider is now expected to implement the ModelsProtocolPrivate protocol (naming is not great!) which consists of two methods register_model list_models The goal is to inform the provider that a certain model needs to be supported so the provider can make any relevant backend changes if needed (or throw an error if the model cannot be supported.) There are many other cleanups included some of which are detailed in a follow-up comment.	2024-10-10 10:24:13 -07:00
Dalton Flanagan	7a8aa775e5	JSON serialization for parallel processing queue (#232 ) * send/recv pydantic json over socket * fixup * address feedback * bidirectional wrapper * second round of feedback	2024-10-09 17:24:12 -04:00
Xi Yan	4d5f7459aa	[bugfix] Fix logprobs on meta-reference impl (#213 ) * fix log probs * add back LogProbsConfig * error handling * bugfix	2024-10-07 19:42:39 -07:00
Mindaugas	53d440e952	Fix ValueError in case chunks are empty (#206 )	2024-10-07 08:55:06 -07:00
Ashwin Bharambe	f913b57397	fix fp8 imports	2024-10-03 14:40:21 -07:00
Ashwin Bharambe	210b71b0ba	fix prompt guard (#177 ) Several other fixes to configure. Add support for 1b/3b models in ollama.	2024-10-03 11:07:53 -07:00
Ashwin Bharambe	19ce6bf009	Don't validate prompt-guard anymore	2024-10-02 20:43:57 -07:00
Ashwin Bharambe	4a75d922a9	Make Llama Guard 1B the default	2024-10-02 09:48:26 -07:00
Ashwin Bharambe	eb2d8a31a5	Add a RoutableProvider protocol, support for multiple routing keys (#163 ) * Update configure.py to use multiple routing keys for safety * Refactor distribution/datatypes into a providers/datatypes * Cleanup	2024-09-30 17:30:21 -07:00
Xi Yan	4ae8c63a2b	pre-commit lint	2024-09-28 16:04:41 -07:00
Ashwin Bharambe	0a3999a9a4	Use inference APIs for executing Llama Guard (#121 ) We should use Inference APIs to execute Llama Guard instead of directly needing to use HuggingFace modeling related code. The actual inference consideration is handled by Inference.	2024-09-28 15:40:06 -07:00
Russell Bryant	5828ffd53b	inference: Fix download command in error msg (#133 ) I got this error message and tried to the run the command presented and it didn't work. The model needs to be give with `--model-id` instead of as a positional argument. Signed-off-by: Russell Bryant <rbryant@redhat.com>	2024-09-27 13:31:11 -07:00
Kate Plawiak	3ae1597b9b	load models using hf model id (#108 )	2024-09-25 18:40:09 -07:00
Xi Yan	82f420c4f0	fix safety using inference (#99 )	2024-09-25 11:30:27 -07:00
Ashwin Bharambe	d442af0818	Add safety impl for llama guard vision	2024-09-25 11:07:19 -07:00
Ashwin Bharambe	4fcda00872	Re-apply revert	2024-09-25 11:00:43 -07:00
Ashwin Bharambe	56aed59eb4	Support for Llama3.2 models and Swift SDK (#98 )	2024-09-25 10:29:58 -07:00
Xi Yan	45be9f3b85	fix agent's embedding model config	2024-09-24 22:49:49 -07:00
Ashwin Bharambe	a2465f3f9c	Revert parts of `0d2eb3bd25`	2024-09-24 19:20:51 -07:00
Ashwin Bharambe	0d2eb3bd25	Use inference APIs for running llama guard Test Plan: First, start a TGI container with `meta-llama/Llama-Guard-3-8B` model serving on port 5099. See https://github.com/meta-llama/llama-stack/pull/53 and its description for how. Then run llama-stack with the following run config: ``` image_name: safety docker_image: null conda_env: safety apis_to_serve: - models - inference - shields - safety api_providers: inference: providers: - remote::tgi safety: providers: - meta-reference telemetry: provider_id: meta-reference config: {} routing_table: inference: - provider_id: remote::tgi config: url: http://localhost:5099 api_token: null hf_endpoint_name: null routing_key: Llama-Guard-3-8B safety: - provider_id: meta-reference config: llama_guard_shield: model: Llama-Guard-3-8B excluded_categories: [] disable_input_check: false disable_output_check: false prompt_guard_shield: null routing_key: llama_guard ``` Now simply run `python -m llama_stack.apis.safety.client localhost <port>` and check that the llama_guard shield calls run correctly. (The injection_shield calls fail as expected since we have not set up a router for them.)	2024-09-24 17:02:57 -07:00
Xi Yan	f92ff86b96	fix shields in agents safety	2024-09-23 21:22:22 -07:00
Ashwin Bharambe	c9005e95ed	Another attempt at a proper bugfix for safety violations	2024-09-23 19:06:30 -07:00
Xi Yan	e5bdd6615a	bug fix for safety violation	2024-09-23 18:17:15 -07:00
Xi Yan	70fb70a71c	fix URL issue with agents	2024-09-23 16:44:25 -07:00
Ashwin Bharambe	ec4fc800cc	[API Updates] Model / shield / memory-bank routing + agent persistence + support for private headers (#92 ) This is yet another of those large PRs (hopefully we will have less and less of them as things mature fast). This one introduces substantial improvements and some simplifications to the stack. Most important bits: * Agents reference implementation now has support for session / turn persistence. The default implementation uses sqlite but there's also support for using Redis. * We have re-architected the structure of the Stack APIs to allow for more flexible routing. The motivating use cases are: - routing model A to ollama and model B to a remote provider like Together - routing shield A to local impl while shield B to a remote provider like Bedrock - routing a vector memory bank to Weaviate while routing a keyvalue memory bank to Redis * Support for provider specific parameters to be passed from the clients. A client can pass data using `x_llamastack_provider_data` parameter which can be type-checked and provided to the Adapter implementations.	2024-09-23 14:22:22 -07:00
Hardik Shah	8bf8c07eb3	Respect user sent instructions in agent config and add them to system prompt	2024-09-21 16:46:10 -07:00
Ashwin Bharambe	132f9429b1	Add a test for CLI, but not fully done so disabled	2024-09-19 13:27:07 -07:00
Ashwin Bharambe	8b3ffa33de	Add another test case	2024-09-19 13:02:57 -07:00
Ashwin Bharambe	abb43936ab	Add a test runner and 2 very simple tests for agents	2024-09-19 12:22:48 -07:00
Ashwin Bharambe	f5eda1decf	Add default for max_seq_len	2024-09-18 21:59:10 -07:00
Ashwin Bharambe	8cdc2f0cfb	No RunShieldRequest	2024-09-18 20:38:21 -07:00
Xi Yan	e6fdb9df29	fix context retriever (#75 )	2024-09-18 08:24:36 -07:00
Ashwin Bharambe	9fd431e710	make shield imports more lazy	2024-09-17 21:27:37 -07:00
Ashwin Bharambe	25adc83de8	Fix for safety	2024-09-17 19:56:58 -07:00
Ashwin Bharambe	9487ad8294	API Updates (#73 ) * API Keys passed from Client instead of distro configuration * delete distribution registry * Rename the "package" word away * Introduce a "Router" layer for providers Some providers need to be factorized and considered as thin routing layers on top of other providers. Consider two examples: - The inference API should be a routing layer over inference providers, routed using the "model" key - The memory banks API is another instance where various memory bank types will be provided by independent providers (e.g., a vector store is served by Chroma while a keyvalue memory can be served by Redis or PGVector) This commit introduces a generalized routing layer for this purpose. * update `apis_to_serve` * llama_toolchain -> llama_stack * Codemod from llama_toolchain -> llama_stack - added providers/registry - cleaned up api/ subdirectories and moved impls away - restructured api/api.py - from llama_stack.apis.<api> import foo should work now - update imports to do llama_stack.apis.<api> - update many other imports - added __init__, fixed some registry imports - updated registry imports - create_agentic_system -> create_agent - AgenticSystem -> Agent * Moved some stuff out of common/; re-generated OpenAPI spec * llama-toolchain -> llama-stack (hyphens) * add control plane API * add redis adapter + sqlite provider * move core -> distribution * Some more toolchain -> stack changes * small naming shenanigans * Removing custom tool and agent utilities and moving them client side * Move control plane to distribution server for now * Remove control plane from API list * no codeshield dependency randomly plzzzzz * Add "fire" as a dependency * add back event loggers * stack configure fixes * use brave instead of bing in the example client * add init file so it gets packaged * add init files so it gets packaged * Update MANIFEST * bug fix --------- Co-authored-by: Hardik Shah <hjshah@fb.com> Co-authored-by: Xi Yan <xiyan@meta.com> Co-authored-by: Ashwin Bharambe <ashwin@meta.com>	2024-09-17 19:51:35 -07:00

42 commits