test: Make text-based chat completion tests run 10x faster (#1016)

# What does this PR do?

This significantly shortens the test time (about 10x faster), since most
of the time was spent outputting long answers such as "there are several
planets in our solar system that have...". Questions that elicit short
answers return results much faster, which matters especially when
testing even larger models.

## Test Plan

```
LLAMA_STACK_BASE_URL=http://localhost:5002 pytest -v tests/client-sdk/inference/test_text_inference.py -k "test_text_chat_completion_non_streaming or test_text_chat_completion_streaming"
================================================================== test session starts ===================================================================
platform linux -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0 -- /home/yutang/.conda/envs/myenv/bin/python3.10
cachedir: .pytest_cache
rootdir: /home/yutang/repos/llama-stack
configfile: pyproject.toml
plugins: anyio-4.7.0
collected 12 items / 8 deselected / 4 selected                                                                                                           

tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_non_streaming[meta-llama/Llama-3.1-8B-Instruct-Which planet do humans live on?-Earth] PASSED [ 25%]
tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_non_streaming[meta-llama/Llama-3.1-8B-Instruct-Which planet has rings around it with a name starting with letter S?-Saturn] PASSED [ 50%]
tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_streaming[meta-llama/Llama-3.1-8B-Instruct-What's the name of the Sun in latin?-Sol] PASSED [ 75%]
tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_streaming[meta-llama/Llama-3.1-8B-Instruct-What is the name of the US captial?-Washington] PASSED [100%]


```

---------

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Author: Yuan Tang, 2025-02-08 14:49:46 -05:00 (committed by GitHub)
commit 413099ef6a
parent 7766e68e92


```diff
@@ -156,8 +156,8 @@ def test_text_completion_structured_output(llama_stack_client, text_model_id, in
 @pytest.mark.parametrize(
     "question,expected",
     [
-        ("What are the names of planets in our solar system?", "Earth"),
-        ("What are the names of the planets that have rings around them?", "Saturn"),
+        ("Which planet do humans live on?", "Earth"),
+        ("Which planet has rings around it with a name starting with letter S?", "Saturn"),
     ],
 )
 def test_text_chat_completion_non_streaming(llama_stack_client, text_model_id, question, expected):
```
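The test body itself is outside this hunk, but the pattern is that each parametrized question is sent to the model and the expected keyword is asserted in the reply. Below is a minimal self-contained sketch of that shape; the stub client, `check_chat_completion` helper, and canned response are hypothetical stand-ins for the live `llama_stack_client` and server, used only to illustrate the assertion logic:

```python
# Hypothetical sketch: the real test talks to a running Llama Stack
# server; a stub client stands in here so the example is runnable.

class StubResponse:
    def __init__(self, content):
        # Mimics a response object exposing completion_message.content.
        self.completion_message = type("Msg", (), {"content": content})()

class StubClient:
    class inference:
        @staticmethod
        def chat_completion(model_id, messages):
            # Canned short answer; a real server would generate this.
            return StubResponse("Humans live on Earth.")

def check_chat_completion(client, model_id, question, expected):
    response = client.inference.chat_completion(
        model_id=model_id,
        messages=[{"role": "user", "content": question}],
    )
    # Short, pointed questions elicit short answers, which is exactly
    # what makes the reworded parametrized cases run faster.
    assert expected.lower() in response.completion_message.content.lower()

check_chat_completion(
    StubClient,
    "meta-llama/Llama-3.1-8B-Instruct",
    "Which planet do humans live on?",
    "Earth",
)
```

The assertion is case-insensitive substring matching, so the model can answer in a full sentence as long as the expected keyword appears.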