Commit graph

54 commits

Author SHA1 Message Date
Ashwin Bharambe
fcd22b6baa Make Safety test work, other cleanup 2024-10-09 21:09:50 -07:00
Ashwin Bharambe
b55034c0de Another round of simplification and clarity for models/shields/memory_banks stuff 2024-10-09 19:19:26 -07:00
Ashwin Bharambe
8eee5b9adc Fix server conditional awaiting on coroutines 2024-10-08 17:23:42 -07:00
Ashwin Bharambe
216e7eb4d5 Move async with SEMAPHORE inside the async methods 2024-10-08 17:23:42 -07:00
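A minimal sketch of the pattern this commit describes, assuming a hypothetical provider class and a module-level semaphore: acquiring the semaphore inside the async method itself means every caller is throttled, rather than relying on each call site to remember to wrap the call.

```
import asyncio

# Module-level concurrency limit; the name SEMAPHORE mirrors the commit message.
SEMAPHORE = asyncio.Semaphore(1)

class InferenceProvider:
    """Hypothetical provider, for illustration only."""

    async def chat_completion(self, request: dict) -> dict:
        # The semaphore is acquired inside the async method, so every
        # entry point is serialized regardless of how it is invoked.
        async with SEMAPHORE:
            return await self._run_model(request)

    async def _run_model(self, request: dict) -> dict:
        await asyncio.sleep(0.1)  # stand-in for actual model execution
        return {"completion": "..."}
```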
Ashwin Bharambe
4540d8bd87 move codeshield into an independent safety provider 2024-10-08 17:23:42 -07:00
Ashwin Bharambe
7f1160296c Updates to server.py to clean up streaming vs non-streaming stuff
Also make sure agent turn create is correctly marked
2024-10-08 17:23:42 -07:00
Ashwin Bharambe
640c5c54f7 rename augment_messages 2024-10-08 17:23:42 -07:00
Ashwin Bharambe
336cf7a674 update vllm; not quite tested yet 2024-10-08 17:23:42 -07:00
Ashwin Bharambe
0c9eb3341c Separate chat_completion stream and non-stream implementations
This is a pretty important requirement. The streaming response type is
an AsyncGenerator while the non-stream one is a single object. So far
this has worked _sometimes_ due to various pre-existing hacks (and in
some cases, just failed.)
2024-10-08 17:23:40 -07:00
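A hedged sketch of what the split can look like, using illustrative names rather than the actual llama-stack signatures: the non-stream method is a plain coroutine that resolves to a single object, while the streaming method is an async generator that callers iterate with `async for`.

```
import asyncio
from typing import Any, AsyncGenerator

class ChatProvider:
    """Illustrative only, not the real interface."""

    async def chat_completion(self, request: Any) -> dict:
        # Non-streaming: one awaited call, one response object.
        await asyncio.sleep(0)  # stand-in for model execution
        return {"completion": "full response"}

    async def chat_completion_stream(self, request: Any) -> AsyncGenerator[dict, None]:
        # Streaming: an async generator of chunks.
        for token in ["full", " response"]:
            await asyncio.sleep(0)
            yield {"delta": token}
```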
Ashwin Bharambe
4fa467731e Fix a bug in meta-reference inference when stream=False
Also introduce a gross hack (to cover a grosser(?) hack) to ensure
non-stream requests don't send back responses in SSE format. Not sure
which of these hacks is grosser.
2024-10-08 17:23:02 -07:00
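A rough sketch of the server-side distinction this commit is about, using generic FastAPI responses (the route path and parameter are illustrative; the actual server.py logic may differ): streaming requests are answered with SSE framing, non-stream requests with a plain JSON body.

```
from fastapi import FastAPI
from fastapi.responses import JSONResponse, StreamingResponse

app = FastAPI()

async def sse_events():
    # Each streamed chunk is framed as a Server-Sent Event.
    yield 'data: {"delta": "hello"}\n\n'
    yield "data: [DONE]\n\n"

@app.post("/inference/chat_completion")
async def chat_completion(stream: bool = False):
    if stream:
        return StreamingResponse(sse_events(), media_type="text/event-stream")
    # Non-stream requests must get a normal JSON payload, never SSE framing.
    return JSONResponse({"completion": "hello"})
```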
Ashwin Bharambe
1550187cd8 cleanup 2024-10-08 17:23:02 -07:00
Ashwin Bharambe
91e0063593 Introduce model_store, shield_store, memory_bank_store 2024-10-08 17:23:02 -07:00
Ashwin Bharambe
e45a417543 more fixes, plug shutdown handlers
still, FastAPI's SIGINT handler is not calling ours
2024-10-08 17:23:02 -07:00
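A hedged sketch of what plugging shutdown handlers next to a FastAPI app can look like; as the note above says, uvicorn/FastAPI installs its own SIGINT handling, so a handler registered this way may never be invoked on Ctrl-C (all names here are illustrative).

```
import signal
from fastapi import FastAPI

app = FastAPI()

def cleanup_providers() -> None:
    # Illustrative: close provider connections, flush telemetry, etc.
    print("shutting down providers")

@app.on_event("shutdown")
async def on_shutdown() -> None:
    # Invoked by FastAPI/uvicorn on a clean shutdown.
    cleanup_providers()

def handle_sigint(signum, frame) -> None:
    # Our own SIGINT hook; if the server framework registered its own
    # handler first, this one may not be called (the issue noted above).
    cleanup_providers()
    raise KeyboardInterrupt

signal.signal(signal.SIGINT, handle_sigint)
```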
Ashwin Bharambe
59302a86df inference registry updates 2024-10-08 17:23:02 -07:00
Ashwin Bharambe
4215cc9331 Push registration methods onto the backing providers 2024-10-08 17:23:02 -07:00
Ashwin Bharambe
5a7b01d292 Significantly upgrade the interactive configuration experience 2024-10-08 17:23:02 -07:00
Ashwin Bharambe
f3923e3f0b Redo the { models, shields, memory_banks } typeset 2024-10-08 17:23:02 -07:00
Xi Yan
4d5f7459aa
[bugfix] Fix logprobs on meta-reference impl (#213)
* fix log probs

* add back LogProbsConfig

* error handling

* bugfix
2024-10-07 19:42:39 -07:00
Mindaugas
53d440e952
Fix ValueError in case chunks are empty (#206) 2024-10-07 08:55:06 -07:00
Russell Bryant
f73e247ba1
Inline vLLM inference provider (#181)
This is just like `local`, which uses `meta-reference` for everything,
except it uses `vllm` for inference.

Docker works, but so far `conda` is a bit easier to use with the vllm
provider. The default container base image does not include all the
libraries needed for every vllm feature; more CUDA dependencies are
necessary.

I started changing the base image used in this template, but that also
required changes to the Dockerfile, so it was getting too involved to
include in the first PR.

Working so far:

* `python -m llama_stack.apis.inference.client localhost 5000 --model Llama3.2-1B-Instruct --stream True`
* `python -m llama_stack.apis.inference.client localhost 5000 --model Llama3.2-1B-Instruct --stream False`

Example:

```
$ python -m llama_stack.apis.inference.client localhost 5000 --model Llama3.2-1B-Instruct --stream False
User>hello world, write me a 2 sentence poem about the moon
Assistant>
The moon glows bright in the midnight sky
A beacon of light,
```

I have only tested these models:

* `Llama3.1-8B-Instruct` - across 4 GPUs (tensor_parallel_size = 4)
* `Llama3.2-1B-Instruct` - on a single GPU (tensor_parallel_size = 1)
2024-10-05 23:34:16 -07:00
Ashwin Bharambe
f913b57397 fix fp8 imports 2024-10-03 14:40:21 -07:00
Ashwin Bharambe
210b71b0ba
fix prompt guard (#177)
Several other fixes to configure. Add support for 1b/3b models in ollama.
2024-10-03 11:07:53 -07:00
Ashwin Bharambe
19ce6bf009 Don't validate prompt-guard anymore 2024-10-02 20:43:57 -07:00
Ashwin Bharambe
4a75d922a9 Make Llama Guard 1B the default 2024-10-02 09:48:26 -07:00
Ashwin Bharambe
eb2d8a31a5
Add a RoutableProvider protocol, support for multiple routing keys (#163)
* Update configure.py to use multiple routing keys for safety
* Refactor distribution/datatypes into a providers/datatypes
* Cleanup
2024-09-30 17:30:21 -07:00
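A minimal sketch of what such a protocol might look like; the actual `RoutableProvider` definition and method names in the repo may differ, so treat this shape as an assumption.

```
from typing import List, Protocol

class RoutableProvider(Protocol):
    """Hypothetical shape: a provider that can be bound to routing keys."""

    async def validate_routing_keys(self, routing_keys: List[str]) -> None:
        # Accept one or more routing keys (e.g. model names, shield types,
        # memory bank ids) that the routing table maps to this provider.
        ...
```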
Xi Yan
4ae8c63a2b pre-commit lint 2024-09-28 16:04:41 -07:00
Ashwin Bharambe
0a3999a9a4
Use inference APIs for executing Llama Guard (#121)
We should use the Inference APIs to execute Llama Guard instead of needing to use HuggingFace modeling code directly. The actual inference is handled by the Inference provider.
2024-09-28 15:40:06 -07:00
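A rough sketch of the idea, with hypothetical names: the shield formats the Llama Guard conversation and delegates generation to the Inference API instead of loading HuggingFace weights itself. The response attribute access below is an assumption, not the exact llama-stack type.

```
class LlamaGuardShield:
    """Illustrative only, not the actual implementation."""

    def __init__(self, inference_api, model: str = "Llama-Guard-3-8B"):
        self.inference_api = inference_api  # an Inference API client/provider
        self.model = model

    async def run(self, messages) -> bool:
        # Delegate generation to the Inference API; no HuggingFace
        # modeling code is needed here.
        response = await self.inference_api.chat_completion(
            model=self.model,
            messages=messages,
            stream=False,
        )
        # Llama Guard answers with "safe" or "unsafe S<category>" as text.
        text = response.completion_message.content
        return text.strip().startswith("safe")
```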
Russell Bryant
5828ffd53b
inference: Fix download command in error msg (#133)
I got this error message, tried to run the command presented,
and it didn't work. The model needs to be given with `--model-id`
instead of as a positional argument.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-09-27 13:31:11 -07:00
Kate Plawiak
3ae1597b9b
load models using hf model id (#108) 2024-09-25 18:40:09 -07:00
Xi Yan
82f420c4f0
fix safety using inference (#99) 2024-09-25 11:30:27 -07:00
Dalton Flanagan
5c4f73d52f
Drop header from LocalInference.h 2024-09-25 11:27:37 -07:00
Ashwin Bharambe
d442af0818 Add safety impl for llama guard vision 2024-09-25 11:07:19 -07:00
Dalton Flanagan
b3b0349931 Update LocalInference to use public repos 2024-09-25 11:05:51 -07:00
Ashwin Bharambe
4fcda00872 Re-apply revert 2024-09-25 11:00:43 -07:00
Ashwin Bharambe
56aed59eb4
Support for Llama3.2 models and Swift SDK (#98) 2024-09-25 10:29:58 -07:00
Xi Yan
45be9f3b85 fix agent's embedding model config 2024-09-24 22:49:49 -07:00
Ashwin Bharambe
a2465f3f9c Revert parts of 0d2eb3bd25 2024-09-24 19:20:51 -07:00
Ashwin Bharambe
0d2eb3bd25 Use inference APIs for running llama guard
Test Plan:

First, start a TGI container with the `meta-llama/Llama-Guard-3-8B` model
serving on port 5099. See https://github.com/meta-llama/llama-stack/pull/53 and its
description for how.

Then run llama-stack with the following run config:

```
image_name: safety
docker_image: null
conda_env: safety
apis_to_serve:
- models
- inference
- shields
- safety
api_providers:
  inference:
    providers:
    - remote::tgi
  safety:
    providers:
    - meta-reference
  telemetry:
    provider_id: meta-reference
    config: {}
routing_table:
  inference:
  - provider_id: remote::tgi
    config:
      url: http://localhost:5099
      api_token: null
      hf_endpoint_name: null
    routing_key: Llama-Guard-3-8B
  safety:
  - provider_id: meta-reference
    config:
      llama_guard_shield:
        model: Llama-Guard-3-8B
        excluded_categories: []
        disable_input_check: false
        disable_output_check: false
      prompt_guard_shield: null
    routing_key: llama_guard
```

Now simply run `python -m llama_stack.apis.safety.client localhost
<port>` and check that the llama_guard shield calls run correctly. (The
injection_shield calls fail as expected since we have not set up a
router for them.)
2024-09-24 17:02:57 -07:00
Xi Yan
d04cd97aba remove providers/impls/sqlite/* 2024-09-24 01:03:40 -07:00
Xi Yan
f92ff86b96 fix shields in agents safety 2024-09-23 21:22:22 -07:00
Ashwin Bharambe
c9005e95ed Another attempt at a proper bugfix for safety violations 2024-09-23 19:06:30 -07:00
Xi Yan
e5bdd6615a bug fix for safety violation 2024-09-23 18:17:15 -07:00
Xi Yan
70fb70a71c fix URL issue with agents 2024-09-23 16:44:25 -07:00
Ashwin Bharambe
ec4fc800cc
[API Updates] Model / shield / memory-bank routing + agent persistence + support for private headers (#92)
This is yet another of those large PRs (hopefully we will have fewer and fewer of them as things mature quickly). This one introduces substantial improvements and some simplifications to the stack.

Most important bits:

* Agents reference implementation now has support for session / turn persistence. The default implementation uses sqlite but there's also support for using Redis.

* We have re-architected the structure of the Stack APIs to allow for more flexible routing; a sample routing-table sketch follows this entry. The motivating use cases are:
  - routing model A to ollama and model B to a remote provider like Together
  - routing shield A to local impl while shield B to a remote provider like Bedrock
  - routing a vector memory bank to Weaviate while routing a keyvalue memory bank to Redis

* Support for provider-specific parameters to be passed from the clients. A client can pass data using the `x_llamastack_provider_data` parameter, which can be type-checked and provided to the Adapter implementations.
2024-09-23 14:22:22 -07:00
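A hedged example of the routing described above, following the run-config format shown in the `0d2eb3bd25` test plan earlier in this log; the provider ids, URLs, and routing keys below are assumptions for illustration, not a tested configuration.

```
routing_table:
  inference:
  - provider_id: remote::ollama      # model A served locally by ollama
    config:
      url: http://localhost:11434
    routing_key: Llama3.2-1B-Instruct
  - provider_id: remote::together    # model B served by a remote provider
    config:
      api_key: <api key>
    routing_key: Llama3.1-8B-Instruct
```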
Hardik Shah
8bf8c07eb3 Respect user sent instructions in agent config and add them to system prompt 2024-09-21 16:46:10 -07:00
Ashwin Bharambe
132f9429b1 Add a test for CLI, but not fully done so disabled 2024-09-19 13:27:07 -07:00
Ashwin Bharambe
8b3ffa33de Add another test case 2024-09-19 13:02:57 -07:00
Ashwin Bharambe
abb43936ab Add a test runner and 2 very simple tests for agents 2024-09-19 12:22:48 -07:00
Ashwin Bharambe
f5eda1decf Add default for max_seq_len 2024-09-18 21:59:10 -07:00
Ashwin Bharambe
8cdc2f0cfb No RunShieldRequest 2024-09-18 20:38:21 -07:00