llama-stack-mirror

mirror of https://github.com/meta-llama/llama-stack.git synced 2025-12-09 03:19:20 +00:00

Author	SHA1	Message	Date
Xi Yan	6abef716dd	rebase on top of registry	2024-10-08 23:41:03 -07:00
Xi Yan	0919072a33	eleuther custom tasks	2024-10-08 23:22:50 -07:00
Ashwin Bharambe	8eee5b9adc	Fix server conditional awaiting on coroutines	2024-10-08 17:23:42 -07:00
Ashwin Bharambe	216e7eb4d5	Move `async with SEMAPHORE` inside the async methods	2024-10-08 17:23:42 -07:00
Ashwin Bharambe	4540d8bd87	move codeshield into an independent safety provider	2024-10-08 17:23:42 -07:00
Ashwin Bharambe	7f1160296c	Updates to server.py to clean up streaming vs non-streaming stuff. Also make sure agent turn create is correctly marked	2024-10-08 17:23:42 -07:00
Ashwin Bharambe	640c5c54f7	rename augment_messages	2024-10-08 17:23:42 -07:00
Ashwin Bharambe	336cf7a674	update vllm; not quite tested yet	2024-10-08 17:23:42 -07:00
Ashwin Bharambe	0c9eb3341c	Separate chat_completion stream and non-stream implementations. This is a pretty important requirement. The streaming response type is an AsyncGenerator while the non-stream one is a single object. So far this has worked _sometimes_ due to various pre-existing hacks (and in some cases, just failed.)	2024-10-08 17:23:40 -07:00
Ashwin Bharambe	4fa467731e	Fix a bug in meta-reference inference when stream=False. Also introduce a gross hack (to cover grosser(?) hack) to ensure non-stream requests don't send back responses in SSE format. Not sure which of these hacks is grosser.	2024-10-08 17:23:02 -07:00
Ashwin Bharambe	1550187cd8	cleanup	2024-10-08 17:23:02 -07:00
Ashwin Bharambe	91e0063593	Introduce model_store, shield_store, memory_bank_store	2024-10-08 17:23:02 -07:00
Ashwin Bharambe	e45a417543	more fixes, plug shutdown handlers still, FastAPI's SIGINT handler is not calling ours	2024-10-08 17:23:02 -07:00
Ashwin Bharambe	59302a86df	inference registry updates	2024-10-08 17:23:02 -07:00
Ashwin Bharambe	4215cc9331	Push registration methods onto the backing providers	2024-10-08 17:23:02 -07:00
Ashwin Bharambe	5a7b01d292	Significantly upgrade the interactive configuration experience	2024-10-08 17:23:02 -07:00
Ashwin Bharambe	f3923e3f0b	Redo the { models, shields, memory_banks } typeset	2024-10-08 17:23:02 -07:00
Xi Yan	b87bdd0176	registry refactor	2024-10-08 15:44:02 -07:00
Xi Yan	a56ea48d71	excel dataset	2024-10-07 21:56:13 -07:00
Xi Yan	4d5f7459aa	[bugfix] Fix logprobs on meta-reference impl (#213) * fix log probs * add back LogProbsConfig * error handling * bugfix	2024-10-07 19:42:39 -07:00
Xi Yan	5b7d24b1c3	wip	2024-10-07 17:27:06 -07:00
Xi Yan	4764762dd4	tasks registry	2024-10-07 15:57:39 -07:00
Mindaugas	53d440e952	Fix ValueError in case chunks are empty (#206)	2024-10-07 08:55:06 -07:00
Russell Bryant	f73e247ba1	Inline vLLM inference provider (#181). This is just like `local` using `meta-reference` for everything except it uses `vllm` for inference. Docker works, but so far `conda` is a bit easier to use with the vllm provider. The default container base image does not include all the necessary libraries for all vllm features. More cuda dependencies are necessary. I started changing this base image used in this template, but it also required changes to the Dockerfile, so it was getting too involved to include in the first PR. Working so far: * `python -m llama_stack.apis.inference.client localhost 5000 --model Llama3.2-1B-Instruct --stream True` * `python -m llama_stack.apis.inference.client localhost 5000 --model Llama3.2-1B-Instruct --stream False` Example: ``` $ python -m llama_stack.apis.inference.client localhost 5000 --model Llama3.2-1B-Instruct --stream False User>hello world, write me a 2 sentence poem about the moon Assistant> The moon glows bright in the midnight sky A beacon of light, ``` I have only tested these models: * `Llama3.1-8B-Instruct` - across 4 GPUs (tensor_parallel_size = 4) * `Llama3.2-1B-Instruct` - on a single GPU (tensor_parallel_size = 1)	2024-10-05 23:34:16 -07:00
Xi Yan	041634192a	move folder	2024-10-05 11:57:21 -07:00
Xi Yan	2441e66d14	evals api mvp	2024-10-04 00:50:03 -07:00
Xi Yan	3cbe3a72e8	mvp	2024-10-04 00:25:57 -07:00
Xi Yan	4f07aca309	get task	2024-10-03 17:31:46 -07:00
Ashwin Bharambe	f913b57397	fix fp8 imports	2024-10-03 14:40:21 -07:00
Xi Yan	8339b2cef3	wip api	2024-10-03 13:47:15 -07:00
Ashwin Bharambe	210b71b0ba	fix prompt guard (#177). Several other fixes to configure. Add support for 1b/3b models in ollama.	2024-10-03 11:07:53 -07:00
Ashwin Bharambe	19ce6bf009	Don't validate prompt-guard anymore	2024-10-02 20:43:57 -07:00
Ashwin Bharambe	4a75d922a9	Make Llama Guard 1B the default	2024-10-02 09:48:26 -07:00
Ashwin Bharambe	eb2d8a31a5	Add a RoutableProvider protocol, support for multiple routing keys (#163) * Update configure.py to use multiple routing keys for safety * Refactor distribution/datatypes into a providers/datatypes * Cleanup	2024-09-30 17:30:21 -07:00
Xi Yan	4ae8c63a2b	pre-commit lint	2024-09-28 16:04:41 -07:00
Ashwin Bharambe	0a3999a9a4	Use inference APIs for executing Llama Guard (#121). We should use Inference APIs to execute Llama Guard instead of directly needing to use HuggingFace modeling related code. The actual inference consideration is handled by Inference.	2024-09-28 15:40:06 -07:00
Russell Bryant	5828ffd53b	inference: Fix download command in error msg (#133). I got this error message and tried to run the command presented and it didn't work. The model needs to be given with `--model-id` instead of as a positional argument. Signed-off-by: Russell Bryant <rbryant@redhat.com>	2024-09-27 13:31:11 -07:00
Kate Plawiak	3ae1597b9b	load models using hf model id (#108)	2024-09-25 18:40:09 -07:00
Xi Yan	82f420c4f0	fix safety using inference (#99)	2024-09-25 11:30:27 -07:00
Dalton Flanagan	5c4f73d52f	Drop header from LocalInference.h	2024-09-25 11:27:37 -07:00
Ashwin Bharambe	d442af0818	Add safety impl for llama guard vision	2024-09-25 11:07:19 -07:00
Dalton Flanagan	b3b0349931	Update LocalInference to use public repos	2024-09-25 11:05:51 -07:00
Ashwin Bharambe	4fcda00872	Re-apply revert	2024-09-25 11:00:43 -07:00
Ashwin Bharambe	56aed59eb4	Support for Llama3.2 models and Swift SDK (#98)	2024-09-25 10:29:58 -07:00
Xi Yan	45be9f3b85	fix agent's embedding model config	2024-09-24 22:49:49 -07:00
Ashwin Bharambe	a2465f3f9c	Revert parts of `0d2eb3bd25`	2024-09-24 19:20:51 -07:00
Ashwin Bharambe	0d2eb3bd25	Use inference APIs for running llama guard Test Plan: First, start a TGI container with `meta-llama/Llama-Guard-3-8B` model serving on port 5099. See https://github.com/meta-llama/llama-stack/pull/53 and its description for how. Then run llama-stack with the following run config: ``` image_name: safety docker_image: null conda_env: safety apis_to_serve: - models - inference - shields - safety api_providers: inference: providers: - remote::tgi safety: providers: - meta-reference telemetry: provider_id: meta-reference config: {} routing_table: inference: - provider_id: remote::tgi config: url: http://localhost:5099 api_token: null hf_endpoint_name: null routing_key: Llama-Guard-3-8B safety: - provider_id: meta-reference config: llama_guard_shield: model: Llama-Guard-3-8B excluded_categories: [] disable_input_check: false disable_output_check: false prompt_guard_shield: null routing_key: llama_guard ``` Now simply run `python -m llama_stack.apis.safety.client localhost <port>` and check that the llama_guard shield calls run correctly. (The injection_shield calls fail as expected since we have not set up a router for them.)	2024-09-24 17:02:57 -07:00
Xi Yan	d04cd97aba	remove providers/impls/sqlite/*	2024-09-24 01:03:40 -07:00
Xi Yan	f92ff86b96	fix shields in agents safety	2024-09-23 21:22:22 -07:00
Ashwin Bharambe	c9005e95ed	Another attempt at a proper bugfix for safety violations	2024-09-23 19:06:30 -07:00

63 commits
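
Commit 0c9eb3341c above notes that the streaming response type is an AsyncGenerator while the non-stream one is a single object. A minimal sketch of that separation follows. It is not the llama-stack implementation: every name in it (`_generate_tokens`, `chat_completion_stream`, `chat_completion_nonstream`, and the `chat_completion` dispatcher) is hypothetical and chosen only for illustration, and the "model" is a stand-in that upper-cases words.

```python
import asyncio
import inspect
from typing import AsyncGenerator


# Hypothetical stand-in for a model call; not part of llama-stack.
async def _generate_tokens(prompt: str) -> AsyncGenerator[str, None]:
    for word in prompt.split():
        await asyncio.sleep(0)  # yield control, as a real backend call would
        yield word.upper()


async def chat_completion_nonstream(prompt: str) -> str:
    """Non-stream path: collect everything and return one object."""
    tokens = [tok async for tok in _generate_tokens(prompt)]
    return " ".join(tokens)


async def chat_completion_stream(prompt: str) -> AsyncGenerator[str, None]:
    """Stream path: an async generator that yields chunks as they arrive."""
    async for tok in _generate_tokens(prompt):
        yield tok


def chat_completion(prompt: str, stream: bool = False):
    """Dispatch to exactly one implementation, so each path has one return type."""
    if stream:
        return chat_completion_stream(prompt)   # consume with `async for`
    return chat_completion_nonstream(prompt)    # consume with `await`


async def main():
    full = await chat_completion("hello world", stream=False)
    chunks = [c async for c in chat_completion("hello world", stream=True)]
    return full, chunks


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The caller has to treat the two results differently: the non-stream result is awaited, while the stream result is iterated with `async for`. A server that awaits conditionally can tell the two apart with `inspect.isasyncgen` and `inspect.iscoroutine`.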