llama-stack-mirror/toolchain/configs/ashwin.yaml
Ashwin Bharambe ad62e2e1f3 make inference server load checkpoints for fp8 inference
- introduce quantization-related args for inference config
- also kill GeneratorArgs
2024-07-20 22:54:48 -07:00

model_inference_config:
  impl_type: "inline"
  inline_config:
    checkpoint_type: "pytorch"
    checkpoint_dir: /home/ashwin/local/checkpoints/Meta-Llama-3.1-8B-Instruct-20240710150000
    tokenizer_path: /home/ashwin/local/checkpoints/Meta-Llama-3.1-8B-Instruct-20240710150000/tokenizer.model
    model_parallel_size: 1
    max_seq_len: 2048
    max_batch_size: 1
    quantization:
      type: "fp8"
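
For context, a minimal sketch of how a config like this could be loaded and validated in Python, using PyYAML and dataclasses. The class and field names below simply mirror the YAML keys above; they are illustrative, not the actual llama-stack types.

```python
# Minimal sketch: parse the config above into typed Python objects.
# Assumes PyYAML is installed; class names mirror the YAML keys and are
# illustrative, not the real llama-stack config classes.
from dataclasses import dataclass
from typing import Optional

import yaml


@dataclass
class QuantizationConfig:
    type: str  # e.g. "fp8"


@dataclass
class InlineConfig:
    checkpoint_type: str
    checkpoint_dir: str
    tokenizer_path: str
    model_parallel_size: int
    max_seq_len: int
    max_batch_size: int
    quantization: Optional[QuantizationConfig] = None


@dataclass
class ModelInferenceConfig:
    impl_type: str
    inline_config: InlineConfig


def load_config(path: str) -> ModelInferenceConfig:
    with open(path) as f:
        raw = yaml.safe_load(f)["model_inference_config"]
    inline = dict(raw["inline_config"])
    quant = inline.pop("quantization", None)
    return ModelInferenceConfig(
        impl_type=raw["impl_type"],
        inline_config=InlineConfig(
            quantization=QuantizationConfig(**quant) if quant else None,
            **inline,
        ),
    )


if __name__ == "__main__":
    cfg = load_config("toolchain/configs/ashwin.yaml")
    print(cfg.inline_config.quantization)  # QuantizationConfig(type='fp8')
```

Keeping `quantization` optional matches the commit's intent: a config without that block would fall back to unquantized inference, while `type: "fp8"` opts in to fp8 checkpoint loading.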