llama-stack-mirror/llama_stack
yyymeta b79e0435de
fix: avoid tensor memory error (#1688)
# What does this PR do?

we randomly get errors like the following, it's most likely due to
accessing an object that is already deallocated

```

E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] Traceback (most recent call last):
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]   File "/home/yyy/.conda/envs/myenv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]     fn(i, *args)
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]   File "/home/yyy/.conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 611, in _wrap
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]     ret = record(fn)(*args_)
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]   File "/home/yyy/.conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]     return f(*args, **kwargs)
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]   File "/home/yyy/internal-llama-stack/llama_stack/providers/inline/inference/meta_reference/parallel_utils.py", line 249, in worker_process_entrypoint
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]     task = req_gen.send(result)
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]   File "/home/yyy/internal-llama-stack/llama_stack/providers/inline/inference/meta_reference/parallel_utils.py", line 156, in retrieve_requests
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]     torch.distributed.broadcast_object_list(
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]   File "/home/yyy/.conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]     return func(*args, **kwargs)
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]   File "/home/yyy/.conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3504, in broadcast_object_list
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]     object_list[i] = _tensor_to_object(obj_view, obj_size, group)
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]   File "/home/yyy/.conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2961, in _tensor_to_object
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]     return _unpickler(io.BytesIO(buf)).load()
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] EOFError: Ran out of input
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]
Process SpawnProcess-1:
Traceback (most recent call last):
```

## Test Plan
start server

```
llama-stack-client eval run-benchmark mmmu_v1  --model-id meta-llama/Llama-4-17B-Omni-Instruct  --output-dir /tmp/mmmu_standard --num-examples 30
```

[//]: # (## Documentation)
2025-03-18 16:17:29 -07:00
..
apis chore: fix mypy violations in post_training modules (#1548) 2025-03-18 14:58:16 -07:00
cli refactor: simplify command execution and remove PTY handling (#1641) 2025-03-17 15:03:14 -07:00
distribution fix: Fixed import errors for UI and playground (#1666) 2025-03-18 15:00:48 -07:00
models/llama refactor: move all datetime.now() calls to UTC (#1589) 2025-03-13 15:34:53 -07:00
providers fix: avoid tensor memory error (#1688) 2025-03-18 16:17:29 -07:00
strong_typing Ensure that deprecations for fields follow through to OpenAPI 2025-02-19 13:54:04 -08:00
templates fix: Add the option to not verify SSL at remote-vllm provider (#1585) 2025-03-18 09:33:35 -04:00
__init__.py export LibraryClient 2024-12-13 12:08:00 -08:00
env.py refactor(test): move tools, evals, datasetio, scoring and post training tests (#1401) 2025-03-04 14:53:47 -08:00
log.py feat: add support for logging config in the run.yaml (#1408) 2025-03-14 12:36:25 -07:00
schema_utils.py ci: add mypy for static type checking (#1101) 2025-02-21 13:15:40 -08:00