Mirror of https://github.com/meta-llama/llama-stack.git, synced 2025-10-15 22:47:59 +00:00
feat(tests): auto-merge all model list responses and unify recordings (#3320)
Some checks failed
Integration Auth Tests / test-matrix (oauth2_token) (push) Failing after 1s
Test External Providers Installed via Module / test-external-providers-from-module (venv) (push) Has been skipped
Integration Tests (Replay) / Integration Tests (, , , client=, vision=) (push) Failing after 3s
SqlStore Integration Tests / test-postgres (3.12) (push) Failing after 4s
SqlStore Integration Tests / test-postgres (3.13) (push) Failing after 7s
Update ReadTheDocs / update-readthedocs (push) Failing after 3s
Test External API and Providers / test-external (venv) (push) Failing after 5s
Vector IO Integration Tests / test-matrix (push) Failing after 7s
Python Package Build Test / build (3.13) (push) Failing after 8s
Python Package Build Test / build (3.12) (push) Failing after 8s
Unit Tests / unit-tests (3.13) (push) Failing after 14s
Unit Tests / unit-tests (3.12) (push) Failing after 14s
UI Tests / ui-tests (22) (push) Successful in 1m7s
Pre-commit / pre-commit (push) Successful in 2m34s
Previously, one had to specify record-replay-related environment variables to run the integration tests. We could not use defaults because the integration tests may run against Ollama instances serving different models. For example, the text and vision tests needed separate Ollama instances, since a single instance typically cannot serve both models on a standard GitHub CI worker. As a result, the `client.list()` response returned by the Ollama client differed between these runs, and we would end up overwriting recorded responses.

This PR "solves" the problem by adding a small amount of complexity: we store model list responses specially, keyed by the hashes of the models they return. At replay time, we merge all of them and pretend that the union of all models is available (a sketch of the idea is shown below).

## Test Plan

Re-recorded all the tests using `scripts/integration-tests.sh --inference-mode record`, including the vision tests.
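The following is a minimal sketch of that scheme, not the actual implementation: it assumes a plain dict-backed recording store, and the helper names (`record_model_list`, `replay_model_list`, `_models_digest`) are illustrative only.

```python
import hashlib
import json


def _models_digest(models: list[dict]) -> str:
    # Key a model-list response by a hash of the model identifiers it contains.
    ids = sorted(m["id"] for m in models)
    return hashlib.sha256(json.dumps(ids).encode("utf-8")).hexdigest()


def record_model_list(store: dict, models: list[dict]) -> None:
    # Store each distinct model-list response under its own digest, so runs
    # against Ollama instances serving different models do not overwrite
    # each other's recordings.
    store[f"models-{_models_digest(models)}"] = models


def replay_model_list(store: dict) -> list[dict]:
    # At replay time, merge every recorded model list and pretend the union
    # of all models is available.
    merged: dict[str, dict] = {}
    for key, models in store.items():
        if key.startswith("models-"):
            for model in models:
                merged[model["id"]] = model
    return list(merged.values())
```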
parent d948e63340
commit c3d3a0b833

122 changed files with 17460 additions and 16129 deletions
@@ -1,7 +1,7 @@
 {
   "request": {
     "method": "POST",
-    "url": "http://localhost:11434/v1/v1/chat/completions",
+    "url": "http://0.0.0.0:11434/v1/v1/chat/completions",
     "headers": {},
     "body": {
       "model": "llama3.2:3b-instruct-fp16",
@@ -20,14 +20,14 @@
     "body": {
       "__type__": "openai.types.chat.chat_completion.ChatCompletion",
       "__data__": {
-        "id": "chatcmpl-99",
+        "id": "chatcmpl-713",
         "choices": [
           {
             "finish_reason": "stop",
             "index": 0,
             "logprobs": null,
             "message": {
-              "content": "I'd be happy to help you test the OpenAI 2 architecture!\n\nOpenAI 2 is a neural network model developed by OpenAI, and it's not exactly possible for me to directly \"test\" it. However, I can guide you through a simplified example of how to verify if OpenAI 2 has been implemented correctly in a specific codebase.\n\nHere's an outline of the steps:\n\n1. **Understand the basics**: Before we dive into testing, make sure you understand the architectural and functional details of OpenAI 2.\n2. **Get access to the model**: You'll need to obtain a trained OpenAI 2 model or implement it from scratch using a language like PyTorch or TensorFlow.\n3. **Implement a validation framework**: Create a simple validation framework that uses common tasks, such as classification on the GLUE benchmark, to evaluate the performance of your OpenAI 2 model.\n\nHere's a simplified code example in PyTorch:\n```python\nimport torch\nfrom transformers import AutoModelForSequenceClassification, AutoTokenizer\n\n# Load pre-trained OpenAI 2 Model(s)\nmodel_name = \"github/openai/OpenAIAccelerated-Text-To-Speech\"\nmodel_class = AutoModelForSequenceClassification\ntokenizer = AutoTokenizer.from_pretrained(model_name)\n\n# Initialize the model and tokenizer\nmodel = model_class(pretrained=True, num_labels=8) # adjust label number according to your task\ntokenizer = tokenizer\n\ndef evaluate_model():\n batch_size = 100\n device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n\n # Create a validation dataset from the GLUE benchmark\n glue_datasets = [ datasets[\"glue\"][\"sst2\"], datasets[\"glue\"][\"mnli\"] ]\n val_dataset = []\n for i, gds in enumerate(glue_datasets):\n data = gds[:10000] # take only first few examples to speed up evaluation\n input_ids = tokenizer encodings(data[\"sentence1\"], \n attention_mask=data[\"attention_mask\"],\n max_length=512,\n padding=\"max_length\",\n truncation=True,\n return_tensors=\"pt\")\n\n for example in data:\n for sentence in [example['sentence1'], example['sentence2']]:\n input_ids = input_ids.to(device)\n outputs = model(input_ids, labels=None) # adjust label to empty\n \n\n # Compute evaluation metrics\n predictions = []\n labels = []\n accuracy = torch.zeros(8).to(device)\n\n for sentence in data.values():\n sentenceids = [input_ids[\"input_ids\"].flatten()]\n _, pred_labels = model(sentenceids)\n if len(predictions) == 0:\n labels.extend([1, 2])\n else:\n assert len(labels)==len(sentences), 'error'\n labels.append(preds[-1]) \n\n # Append the prediction to the list\n predictions.append(pred)\n\n return accuracy\n\naccuracy = evaluate_model()\nprint(\"Accuracy:\", accuracy)\n```\n\nAfter running this code, you should get an estimate of the model's performance on the GLUE benchmark. Keep in mind that this is a simplified example and real-world openai 2 testing may require more involved validation processes.\n\nI hope this helps! Let me know if you have any further questions or if there are any specific areas where you'd like more information.",
+              "content": "I'd be happy to help you test the Transformer-XL (denoted as \"Test Trace OpenAI 2\") model, but first I need to clarify a few things:\n\n1. **Transformer-XL** is not an official name from OpenAI. It's actually a variant of the Transformer model proposed in the paper \"Long-Short Term Memory Are General: A Study on The Curvature of Time\" (2017) by Jinyu Chen, et al.\n2. **Trace OpenAI 2** sounds like a specific version or configuration of the Transformer-XL model, possibly developed by OpenAI.\n\nGiven these clarifications, I'll provide you with a general idea of how to test the Transformer-XL (or \"Test Trace OpenAI 2\") model using popular frameworks and libraries. Please note that this is not an exhaustive or definitive guide.\n\nTo test the Transformer-XL model, you can follow these steps:\n\n1. **Install the necessary dependencies**: You'll need a deep learning framework like TensorFlow or PyTorch, as well as a library for natural language processing (NLP) like Hugging Face's transformers.\n2. **Load the pre-trained weights**: You can use a pre-trained model checkpoint from Hugging Face's Transformers library or load your own weights trained on a specific task or dataset.\n3. **Prepare your data**: Load your text data into tokens, such as words or characters, and preprocess it according to the specific requirements of the Transformer-XL architecture (e.g., tokenization, padding, etc.).\n4. **Configure the model**: Adjust the hyperparameters to suit your specific test case, including the model's configuration, batch size, learning rate, etc.\n5. **Run the inference**: Use the loaded pre-trained weights to perform inference on your test data.\n\nHere's some sample Python code using PyTorch and Hugging Face's Transformers library to get you started:\n```python\nimport torch\nfrom transformers import LongformerForSequenceClassification, LongformerTokenizer\n\n# Load pre-trained weights\nmodel = LongformerForSequenceClassification.from_pretrained('test-trace-openai-2')\n\n# Prepare data\ntokenizer = model.tokenizer\ntext = \"This is a test sentence\"\ninputs = tokenizer(text, return_tensors='pt')\noutput = model(inputs['input_ids'], attention_mask=inputs['attention_mask'])\n\n# Print the results\nprint(output.logits)\n```\nPlease note that this code snippet is just an example and may not work as-is. You'll need to adapt it to your specific requirements and test data.\n\nKeep in mind that testing a model's performance on a specific task or dataset requires careful consideration of factors like:\n\n* **Test data quality**: Your test data should accurately represent the underlying distribution of your target dataset.\n* **Model evaluation metrics**: Choose relevant evaluation metrics that measure the model's performance on your specific task, such as accuracy, precision, recall, F1-score, etc.\n\nFeel free to ask if you have any further questions or need more guidance!",
               "refusal": null,
               "role": "assistant",
               "annotations": null,
@@ -37,15 +37,15 @@
             }
           }
         ],
-        "created": 1754510064,
+        "created": 1756921250,
         "model": "llama3.2:3b-instruct-fp16",
         "object": "chat.completion",
         "service_tier": null,
         "system_fingerprint": "fp_ollama",
         "usage": {
-          "completion_tokens": 694,
+          "completion_tokens": 614,
           "prompt_tokens": 31,
-          "total_tokens": 725,
+          "total_tokens": 645,
           "completion_tokens_details": null,
           "prompt_tokens_details": null
         }