{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "hTIfyoGtjoWD" }, "source": [ "[](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb)\n", "\n", "# Llama Stack Benchmark Evals\n", "\n", "This notebook walks you through the main sets of APIs that Llama Stack offers for running benchmark evaluations, with working examples to explore the possibilities that Llama Stack opens up for you.\n", "\n", "Read more about Llama Stack: https://llamastack.github.io/latest/index.html" ] }, { "cell_type": "markdown", "metadata": { "id": "bxs0FJ1ckGa6" }, "source": [ "## 0. Bootstrapping Llama Stack Library\n", "\n", "##### 0.1. Prerequisite: Create TogetherAI account\n", "\n", "To run inference for the Llama models, you will need an inference provider. Llama Stack supports a number of inference [providers](https://github.com/meta-llama/llama-stack/tree/main/llama_stack/providers/remote/inference).\n", "\n", "In this showcase, we will use [together.ai](https://www.together.ai/) as the inference provider. 
If you don't already have one, first get an API key from Together.\n", "You can also use Fireworks.ai or even Ollama if you would like to.\n", "\n", "\n", "> **Note:** Set the API key in the Secrets of this notebook as `TOGETHER_API_KEY`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "id": "O9pGVlPIjpix" }, "outputs": [], "source": [ "# NBVAL_SKIP\n", "!pip install -U llama-stack" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "id": "JQpLUSNjlGAM" }, "outputs": [], "source": [ "# NBVAL_SKIP\n", "!uv run llama stack list-deps together | xargs -L1 uv pip install\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "collapsed": true, "id": "KkT2qVeTlI-b", "outputId": "9198fbfc-a126-4409-e2f5-5f5bf5cdf9a7" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Not in Google Colab environment\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Warning: `bwrap` is not available. Code interpreter tool will not work correctly.\n" ] }, { "data": { "text/html": [ "
Using config together:\n",
       "\n"
      ],
      "text/plain": [
       "Using config \u001b[34mtogether\u001b[0m:\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "apis:\n",
       "- agents\n",
       "- datasetio\n",
       "- eval\n",
       "- inference\n",
       "- safety\n",
       "- scoring\n",
       "- telemetry\n",
       "- tool_runtime\n",
       "- vector_io\n",
       "benchmarks: []\n",
       "container_image: null\n",
       "datasets: []\n",
       "image_name: together\n",
       "logging: null\n",
       "metadata_store:\n",
       "  db_path: /Users/xiyan/.llama/distributions/together/registry.db\n",
       "  namespace: null\n",
       "  type: sqlite\n",
       "models:\n",
       "- metadata: {}\n",
       "  model_id: meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo\n",
       "- metadata: {}\n",
       "  model_id: meta-llama/Llama-3.1-8B-Instruct\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo\n",
       "- metadata: {}\n",
       "  model_id: meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo\n",
       "- metadata: {}\n",
       "  model_id: meta-llama/Llama-3.1-70B-Instruct\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo\n",
       "- metadata: {}\n",
       "  model_id: meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo\n",
       "- metadata: {}\n",
       "  model_id: meta-llama/Llama-3.1-405B-Instruct-FP8\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo\n",
       "- metadata: {}\n",
       "  model_id: meta-llama/Llama-3.2-3B-Instruct-Turbo\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Llama-3.2-3B-Instruct-Turbo\n",
       "- metadata: {}\n",
       "  model_id: meta-llama/Llama-3.2-3B-Instruct\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Llama-3.2-3B-Instruct-Turbo\n",
       "- metadata: {}\n",
       "  model_id: meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo\n",
       "- metadata: {}\n",
       "  model_id: meta-llama/Llama-3.2-11B-Vision-Instruct\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo\n",
       "- metadata: {}\n",
       "  model_id: meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo\n",
       "- metadata: {}\n",
       "  model_id: meta-llama/Llama-3.2-90B-Vision-Instruct\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo\n",
       "- metadata: {}\n",
       "  model_id: meta-llama/Llama-3.3-70B-Instruct-Turbo\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Llama-3.3-70B-Instruct-Turbo\n",
       "- metadata: {}\n",
       "  model_id: meta-llama/Llama-3.3-70B-Instruct\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Llama-3.3-70B-Instruct-Turbo\n",
       "- metadata: {}\n",
       "  model_id: meta-llama/Meta-Llama-Guard-3-8B\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Meta-Llama-Guard-3-8B\n",
       "- metadata: {}\n",
       "  model_id: meta-llama/Llama-Guard-3-8B\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Meta-Llama-Guard-3-8B\n",
       "- metadata: {}\n",
       "  model_id: meta-llama/Llama-Guard-3-11B-Vision-Turbo\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Llama-Guard-3-11B-Vision-Turbo\n",
       "- metadata: {}\n",
       "  model_id: meta-llama/Llama-Guard-3-11B-Vision\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Llama-Guard-3-11B-Vision-Turbo\n",
       "- metadata:\n",
       "    context_length: 8192\n",
       "    embedding_dimension: 768\n",
       "  model_id: togethercomputer/m2-bert-80M-8k-retrieval\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - embedding\n",
       "  provider_id: together\n",
       "  provider_model_id: togethercomputer/m2-bert-80M-8k-retrieval\n",
       "- metadata:\n",
       "    context_length: 32768\n",
       "    embedding_dimension: 768\n",
       "  model_id: togethercomputer/m2-bert-80M-32k-retrieval\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - embedding\n",
       "  provider_id: together\n",
       "  provider_model_id: togethercomputer/m2-bert-80M-32k-retrieval\n",
       "- metadata:\n",
       "    embedding_dimension: 384\n",
       "  model_id: all-MiniLM-L6-v2\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - embedding\n",
       "  provider_id: sentence-transformers\n",
       "  provider_model_id: null\n",
       "providers:\n",
       "  agents:\n",
       "  - config:\n",
       "      persistence_store:\n",
       "        db_path: /Users/xiyan/.llama/distributions/together/agents_store.db\n",
       "        namespace: null\n",
       "        type: sqlite\n",
       "    provider_id: meta-reference\n",
       "    provider_type: inline::meta-reference\n",
       "  datasetio:\n",
       "  - config:\n",
       "      kvstore:\n",
       "        db_path: /Users/xiyan/.llama/distributions/together/huggingface_datasetio.db\n",
       "        namespace: null\n",
       "        type: sqlite\n",
       "    provider_id: huggingface\n",
       "    provider_type: remote::huggingface\n",
       "  - config:\n",
       "      kvstore:\n",
       "        db_path: /Users/xiyan/.llama/distributions/together/localfs_datasetio.db\n",
       "        namespace: null\n",
       "        type: sqlite\n",
       "    provider_id: localfs\n",
       "    provider_type: inline::localfs\n",
       "  eval:\n",
       "  - config:\n",
       "      kvstore:\n",
       "        db_path: /Users/xiyan/.llama/distributions/together/meta_reference_eval.db\n",
       "        namespace: null\n",
       "        type: sqlite\n",
       "    provider_id: meta-reference\n",
       "    provider_type: inline::meta-reference\n",
       "  inference:\n",
       "  - config:\n",
       "      api_key: '********'\n",
       "      url: https://api.together.xyz/v1\n",
       "    provider_id: together\n",
       "    provider_type: remote::together\n",
       "  - config: {}\n",
       "    provider_id: sentence-transformers\n",
       "    provider_type: inline::sentence-transformers\n",
       "  safety:\n",
       "  - config:\n",
       "      excluded_categories: []\n",
       "    provider_id: llama-guard\n",
       "    provider_type: inline::llama-guard\n",
       "  scoring:\n",
       "  - config: {}\n",
       "    provider_id: basic\n",
       "    provider_type: inline::basic\n",
       "  - config: {}\n",
       "    provider_id: llm-as-judge\n",
       "    provider_type: inline::llm-as-judge\n",
       "  - config:\n",
       "      openai_api_key: '********'\n",
       "    provider_id: braintrust\n",
       "    provider_type: inline::braintrust\n",
       "  telemetry:\n",
       "  - config:\n",
       "      service_name: llama-stack\n",
       "      sinks: sqlite\n",
       "      sqlite_db_path: /Users/xiyan/.llama/distributions/together/trace_store.db\n",
       "    provider_id: meta-reference\n",
       "    provider_type: inline::meta-reference\n",
       "  tool_runtime:\n",
       "  - config:\n",
       "      api_key: '********'\n",
       "      max_results: 3\n",
       "    provider_id: brave-search\n",
       "    provider_type: remote::brave-search\n",
       "  - config:\n",
       "      api_key: '********'\n",
       "      max_results: 3\n",
       "    provider_id: tavily-search\n",
       "    provider_type: remote::tavily-search\n",
       "  - config: {}\n",
       "    provider_id: rag-runtime\n",
       "    provider_type: inline::rag-runtime\n",
       "  - config: {}\n",
       "    provider_id: model-context-protocol\n",
       "    provider_type: remote::model-context-protocol\n",
       "  - config:\n",
       "      api_key: '********'\n",
       "    provider_id: wolfram-alpha\n",
       "    provider_type: remote::wolfram-alpha\n",
       "  vector_io:\n",
       "  - config:\n",
       "      kvstore:\n",
       "        db_path: /Users/xiyan/.llama/distributions/together/faiss_store.db\n",
       "        namespace: null\n",
       "        type: sqlite\n",
       "    provider_id: faiss\n",
       "    provider_type: inline::faiss\n",
       "scoring_fns: []\n",
       "server:\n",
       "  port: 8321\n",
       "  tls_certfile: null\n",
       "  tls_keyfile: null\n",
       "shields:\n",
       "- params: null\n",
       "  provider_id: null\n",
       "  provider_shield_id: null\n",
       "  shield_id: meta-llama/Llama-Guard-3-8B\n",
       "tool_groups:\n",
       "- args: null\n",
       "  mcp_endpoint: null\n",
       "  provider_id: tavily-search\n",
       "  toolgroup_id: builtin::websearch\n",
       "- args: null\n",
       "  mcp_endpoint: null\n",
       "  provider_id: rag-runtime\n",
       "  toolgroup_id: builtin::rag\n",
       "- args: null\n",
       "  mcp_endpoint: null\n",
       "  provider_id: wolfram-alpha\n",
       "  toolgroup_id: builtin::wolfram_alpha\n",
       "vector_dbs: []\n",
       "version: '2'\n",
       "\n",
       "\n"
      ],
      "text/plain": [
       "apis:\n",
       "- agents\n",
       "- datasetio\n",
       "- eval\n",
       "- inference\n",
       "- safety\n",
       "- scoring\n",
       "- telemetry\n",
       "- tool_runtime\n",
       "- vector_io\n",
       "benchmarks: \u001b[1m[\u001b[0m\u001b[1m]\u001b[0m\n",
       "container_image: null\n",
       "datasets: \u001b[1m[\u001b[0m\u001b[1m]\u001b[0m\n",
       "image_name: together\n",
       "logging: null\n",
       "metadata_store:\n",
       "  db_path: \u001b[35m/Users/xiyan/.llama/distributions/together/\u001b[0m\u001b[95mregistry.db\u001b[0m\n",
       "  namespace: null\n",
       "  type: sqlite\n",
       "models:\n",
       "- metadata: \u001b[1m{\u001b[0m\u001b[1m}\u001b[0m\n",
       "  model_id: meta-llama/Meta-Llama-\u001b[1;36m3.1\u001b[0m-8B-Instruct-Turbo\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Meta-Llama-\u001b[1;36m3.1\u001b[0m-8B-Instruct-Turbo\n",
       "- metadata: \u001b[1m{\u001b[0m\u001b[1m}\u001b[0m\n",
       "  model_id: meta-llama/Llama-\u001b[1;36m3.1\u001b[0m-8B-Instruct\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Meta-Llama-\u001b[1;36m3.1\u001b[0m-8B-Instruct-Turbo\n",
       "- metadata: \u001b[1m{\u001b[0m\u001b[1m}\u001b[0m\n",
       "  model_id: meta-llama/Meta-Llama-\u001b[1;36m3.1\u001b[0m-70B-Instruct-Turbo\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Meta-Llama-\u001b[1;36m3.1\u001b[0m-70B-Instruct-Turbo\n",
       "- metadata: \u001b[1m{\u001b[0m\u001b[1m}\u001b[0m\n",
       "  model_id: meta-llama/Llama-\u001b[1;36m3.1\u001b[0m-70B-Instruct\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Meta-Llama-\u001b[1;36m3.1\u001b[0m-70B-Instruct-Turbo\n",
       "- metadata: \u001b[1m{\u001b[0m\u001b[1m}\u001b[0m\n",
       "  model_id: meta-llama/Meta-Llama-\u001b[1;36m3.1\u001b[0m-405B-Instruct-Turbo\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Meta-Llama-\u001b[1;36m3.1\u001b[0m-405B-Instruct-Turbo\n",
       "- metadata: \u001b[1m{\u001b[0m\u001b[1m}\u001b[0m\n",
       "  model_id: meta-llama/Llama-\u001b[1;36m3.1\u001b[0m-405B-Instruct-FP8\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Meta-Llama-\u001b[1;36m3.1\u001b[0m-405B-Instruct-Turbo\n",
       "- metadata: \u001b[1m{\u001b[0m\u001b[1m}\u001b[0m\n",
       "  model_id: meta-llama/Llama-\u001b[1;36m3.2\u001b[0m-3B-Instruct-Turbo\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Llama-\u001b[1;36m3.2\u001b[0m-3B-Instruct-Turbo\n",
       "- metadata: \u001b[1m{\u001b[0m\u001b[1m}\u001b[0m\n",
       "  model_id: meta-llama/Llama-\u001b[1;36m3.2\u001b[0m-3B-Instruct\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Llama-\u001b[1;36m3.2\u001b[0m-3B-Instruct-Turbo\n",
       "- metadata: \u001b[1m{\u001b[0m\u001b[1m}\u001b[0m\n",
       "  model_id: meta-llama/Llama-\u001b[1;36m3.2\u001b[0m-11B-Vision-Instruct-Turbo\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Llama-\u001b[1;36m3.2\u001b[0m-11B-Vision-Instruct-Turbo\n",
       "- metadata: \u001b[1m{\u001b[0m\u001b[1m}\u001b[0m\n",
       "  model_id: meta-llama/Llama-\u001b[1;36m3.2\u001b[0m-11B-Vision-Instruct\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Llama-\u001b[1;36m3.2\u001b[0m-11B-Vision-Instruct-Turbo\n",
       "- metadata: \u001b[1m{\u001b[0m\u001b[1m}\u001b[0m\n",
       "  model_id: meta-llama/Llama-\u001b[1;36m3.2\u001b[0m-90B-Vision-Instruct-Turbo\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Llama-\u001b[1;36m3.2\u001b[0m-90B-Vision-Instruct-Turbo\n",
       "- metadata: \u001b[1m{\u001b[0m\u001b[1m}\u001b[0m\n",
       "  model_id: meta-llama/Llama-\u001b[1;36m3.2\u001b[0m-90B-Vision-Instruct\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Llama-\u001b[1;36m3.2\u001b[0m-90B-Vision-Instruct-Turbo\n",
       "- metadata: \u001b[1m{\u001b[0m\u001b[1m}\u001b[0m\n",
       "  model_id: meta-llama/Llama-\u001b[1;36m3.3\u001b[0m-70B-Instruct-Turbo\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Llama-\u001b[1;36m3.3\u001b[0m-70B-Instruct-Turbo\n",
       "- metadata: \u001b[1m{\u001b[0m\u001b[1m}\u001b[0m\n",
       "  model_id: meta-llama/Llama-\u001b[1;36m3.3\u001b[0m-70B-Instruct\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Llama-\u001b[1;36m3.3\u001b[0m-70B-Instruct-Turbo\n",
       "- metadata: \u001b[1m{\u001b[0m\u001b[1m}\u001b[0m\n",
       "  model_id: meta-llama/Meta-Llama-Guard-\u001b[1;36m3\u001b[0m-8B\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Meta-Llama-Guard-\u001b[1;36m3\u001b[0m-8B\n",
       "- metadata: \u001b[1m{\u001b[0m\u001b[1m}\u001b[0m\n",
       "  model_id: meta-llama/Llama-Guard-\u001b[1;36m3\u001b[0m-8B\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Meta-Llama-Guard-\u001b[1;36m3\u001b[0m-8B\n",
       "- metadata: \u001b[1m{\u001b[0m\u001b[1m}\u001b[0m\n",
       "  model_id: meta-llama/Llama-Guard-\u001b[1;36m3\u001b[0m-11B-Vision-Turbo\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Llama-Guard-\u001b[1;36m3\u001b[0m-11B-Vision-Turbo\n",
       "- metadata: \u001b[1m{\u001b[0m\u001b[1m}\u001b[0m\n",
       "  model_id: meta-llama/Llama-Guard-\u001b[1;36m3\u001b[0m-11B-Vision\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - llm\n",
       "  provider_id: together\n",
       "  provider_model_id: meta-llama/Llama-Guard-\u001b[1;36m3\u001b[0m-11B-Vision-Turbo\n",
       "- metadata:\n",
       "    context_length: \u001b[1;36m8192\u001b[0m\n",
       "    embedding_dimension: \u001b[1;36m768\u001b[0m\n",
       "  model_id: togethercomputer/m2-bert-80M-8k-retrieval\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - embedding\n",
       "  provider_id: together\n",
       "  provider_model_id: togethercomputer/m2-bert-80M-8k-retrieval\n",
       "- metadata:\n",
       "    context_length: \u001b[1;36m32768\u001b[0m\n",
       "    embedding_dimension: \u001b[1;36m768\u001b[0m\n",
       "  model_id: togethercomputer/m2-bert-80M-32k-retrieval\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - embedding\n",
       "  provider_id: together\n",
       "  provider_model_id: togethercomputer/m2-bert-80M-32k-retrieval\n",
       "- metadata:\n",
       "    embedding_dimension: \u001b[1;36m384\u001b[0m\n",
       "  model_id: all-MiniLM-L6-v2\n",
       "  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType\n",
       "  - embedding\n",
       "  provider_id: sentence-transformers\n",
       "  provider_model_id: null\n",
       "providers:\n",
       "  agents:\n",
       "  - config:\n",
       "      persistence_store:\n",
       "        db_path: \u001b[35m/Users/xiyan/.llama/distributions/together/\u001b[0m\u001b[95magents_store.db\u001b[0m\n",
       "        namespace: null\n",
       "        type: sqlite\n",
       "    provider_id: meta-reference\n",
       "    provider_type: inline::meta-reference\n",
       "  datasetio:\n",
       "  - config:\n",
       "      kvstore:\n",
       "        db_path: \u001b[35m/Users/xiyan/.llama/distributions/together/\u001b[0m\u001b[95mhuggingface_datasetio.db\u001b[0m\n",
       "        namespace: null\n",
       "        type: sqlite\n",
       "    provider_id: huggingface\n",
       "    provider_type: remote::huggingface\n",
       "  - config:\n",
       "      kvstore:\n",
       "        db_path: \u001b[35m/Users/xiyan/.llama/distributions/together/\u001b[0m\u001b[95mlocalfs_datasetio.db\u001b[0m\n",
       "        namespace: null\n",
       "        type: sqlite\n",
       "    provider_id: localfs\n",
       "    provider_type: inline::localfs\n",
       "  eval:\n",
       "  - config:\n",
       "      kvstore:\n",
       "        db_path: \u001b[35m/Users/xiyan/.llama/distributions/together/\u001b[0m\u001b[95mmeta_reference_eval.db\u001b[0m\n",
       "        namespace: null\n",
       "        type: sqlite\n",
       "    provider_id: meta-reference\n",
       "    provider_type: inline::meta-reference\n",
       "  inference:\n",
       "  - config:\n",
       "      api_key: \u001b[32m'********'\u001b[0m\n",
       "      url: \u001b[4;94mhttps://api.together.xyz/v1\u001b[0m\n",
       "    provider_id: together\n",
       "    provider_type: remote::together\n",
       "  - config: \u001b[1m{\u001b[0m\u001b[1m}\u001b[0m\n",
       "    provider_id: sentence-transformers\n",
       "    provider_type: inline::sentence-transformers\n",
       "  safety:\n",
       "  - config:\n",
       "      excluded_categories: \u001b[1m[\u001b[0m\u001b[1m]\u001b[0m\n",
       "    provider_id: llama-guard\n",
       "    provider_type: inline::llama-guard\n",
       "  scoring:\n",
       "  - config: \u001b[1m{\u001b[0m\u001b[1m}\u001b[0m\n",
       "    provider_id: basic\n",
       "    provider_type: inlin\u001b[1;92me::ba\u001b[0msic\n",
       "  - config: \u001b[1m{\u001b[0m\u001b[1m}\u001b[0m\n",
       "    provider_id: llm-as-judge\n",
       "    provider_type: inline::llm-as-judge\n",
       "  - config:\n",
       "      openai_api_key: \u001b[32m'********'\u001b[0m\n",
       "    provider_id: braintrust\n",
       "    provider_type: inlin\u001b[1;92me::b\u001b[0mraintrust\n",
       "  telemetry:\n",
       "  - config:\n",
       "      service_name: llama-stack\n",
       "      sinks: sqlite\n",
       "      sqlite_db_path: \u001b[35m/Users/xiyan/.llama/distributions/together/\u001b[0m\u001b[95mtrace_store.db\u001b[0m\n",
       "    provider_id: meta-reference\n",
       "    provider_type: inline::meta-reference\n",
       "  tool_runtime:\n",
       "  - config:\n",
       "      api_key: \u001b[32m'********'\u001b[0m\n",
       "      max_results: \u001b[1;36m3\u001b[0m\n",
       "    provider_id: brave-search\n",
       "    provider_type: remot\u001b[1;92me::b\u001b[0mrave-search\n",
       "  - config:\n",
       "      api_key: \u001b[32m'********'\u001b[0m\n",
       "      max_results: \u001b[1;36m3\u001b[0m\n",
       "    provider_id: tavily-search\n",
       "    provider_type: remote::tavily-search\n",
       "  - config: \u001b[1m{\u001b[0m\u001b[1m}\u001b[0m\n",
       "    provider_id: rag-runtime\n",
       "    provider_type: inline::rag-runtime\n",
       "  - config: \u001b[1m{\u001b[0m\u001b[1m}\u001b[0m\n",
       "    provider_id: model-context-protocol\n",
       "    provider_type: remote::model-context-protocol\n",
       "  - config:\n",
       "      api_key: \u001b[32m'********'\u001b[0m\n",
       "    provider_id: wolfram-alpha\n",
       "    provider_type: remote::wolfram-alpha\n",
       "  vector_io:\n",
       "  - config:\n",
       "      kvstore:\n",
       "        db_path: \u001b[35m/Users/xiyan/.llama/distributions/together/\u001b[0m\u001b[95mfaiss_store.db\u001b[0m\n",
       "        namespace: null\n",
       "        type: sqlite\n",
       "    provider_id: faiss\n",
       "    provider_type: inlin\u001b[1;92me::fa\u001b[0miss\n",
       "scoring_fns: \u001b[1m[\u001b[0m\u001b[1m]\u001b[0m\n",
       "server:\n",
       "  port: \u001b[1;36m8321\u001b[0m\n",
       "  tls_certfile: null\n",
       "  tls_keyfile: null\n",
       "shields:\n",
       "- params: null\n",
       "  provider_id: null\n",
       "  provider_shield_id: null\n",
       "  shield_id: meta-llama/Llama-Guard-\u001b[1;36m3\u001b[0m-8B\n",
       "tool_groups:\n",
       "- args: null\n",
       "  mcp_endpoint: null\n",
       "  provider_id: tavily-search\n",
       "  toolgroup_id: builtin::websearch\n",
       "- args: null\n",
       "  mcp_endpoint: null\n",
       "  provider_id: rag-runtime\n",
       "  toolgroup_id: builtin::rag\n",
       "- args: null\n",
       "  mcp_endpoint: null\n",
       "  provider_id: wolfram-alpha\n",
       "  toolgroup_id: builtin::wolfram_alpha\n",
       "vector_dbs: \u001b[1m[\u001b[0m\u001b[1m]\u001b[0m\n",
       "version: \u001b[32m'2'\u001b[0m\n",
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import os\n",
    "\n",
    "try:\n",
    "    from google.colab import userdata\n",
    "    os.environ['TOGETHER_API_KEY'] = userdata.get('TOGETHER_API_KEY')\n",
    "    os.environ['TAVILY_SEARCH_API_KEY'] = userdata.get('TAVILY_SEARCH_API_KEY')\n",
    "except ImportError:\n",
    "    print(\"Not in Google Colab environment\")\n",
    "\n",
    "from llama_stack.core.library_client import LlamaStackAsLibraryClient\n",
    "\n",
    "client = LlamaStackAsLibraryClient(\"together\")\n",
    "_ = client.initialize()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "qwXHwHq4lS1s"
   },
   "source": [
    "## 1. Open Benchmark Model Evaluation\n",
    "\n",
    "The first example walks you through how to evaluate a model candidate served by Llama Stack on open benchmarks. We will use the following benchmarks:\n",
    "\n",
    "- [MMMU](https://arxiv.org/abs/2311.16502) (A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI): Benchmark designed to evaluate multimodal models.\n",
    "- [SimpleQA](https://openai.com/index/introducing-simpleqa/): Benchmark designed to assess a model's ability to answer short, fact-seeking questions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "dqXLFtcao1oI"
   },
   "source": [
    "#### 1.1 Running MMMU\n",
    "- We will use a pre-processed MMMU dataset from [llamastack/mmmu](https://huggingface.co/datasets/llamastack/mmmu). The preprocessing code is shown in this [GitHub Gist](https://gist.github.com/yanxi0830/118e9c560227d27132a7fd10e2c92840). The dataset is obtained by transforming the original [MMMU/MMMU](https://huggingface.co/datasets/MMMU/MMMU) dataset into the format accepted by the `inference/chat-completion` API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "TC_IwIAQo4q-"
   },
   "outputs": [],
   "source": [
    "name = \"llamastack/mmmu\"\n",
    "subset = \"Agriculture\"\n",
    "split = \"dev\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "id": "DJkmoG2kq1_P"
   },
   "outputs": [],
   "source": [
    "import datasets\n",
    "\n",
    "ds = datasets.load_dataset(path=name, name=subset, split=split)\n",
    "ds = ds.select_columns([\"chat_completion_input\", \"input_query\", \"expected_answer\"])\n",
    "eval_rows = ds.to_pandas().to_dict(orient=\"records\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "sqBA5LbNq7Xm"
   },
   "source": [
    "- **Run Evaluation on Model Candidate**\n",
    "  - Define a system prompt\n",
    "  - Define an `EvalCandidate`\n",
    "  - Run the evaluation on the dataset rows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 441
    },
    "collapsed": true,
    "id": "1r6qYTp9q5l7",
    "outputId": "f1607a9b-c3a3-43cc-928f-0487d0438748"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 5/5 [00:33<00:00,  6.71s/it]\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "EvaluateResponse(\n", "│ generations=[\n", "│ │ {\n", "│ │ │ 'generated_answer': '**Potato Pests**\\n\\nThe two insects depicted are:\\n\\n* **Colorado Potato Beetle (Leptinotarsa decemlineata)**: Characterized by black and yellow stripes, this beetle is a significant pest of potatoes. It feeds on the leaves and can cause substantial damage to the crop.\\n* **False Potato Beetle (Leptinotarsa juncta)**: Also known as the false Colorado beetle, this species has similar coloring but is not as harmful to potatoes as the Colorado potato beetle.'\n", "│ │ },\n", "│ │ {\n", "│ │ │ 'generated_answer': \"The image shows a sunflower leaf with a powdery mildew, which is a fungal disease caused by various species of fungi. The white powdery coating on the leaves is a characteristic symptom of this disease. The leaf also has some black spots, which could be indicative of a secondary infection or another type of disease. However, without more information or a closer examination, it's difficult to determine the exact cause of the black spots.\\n\\nBased on the image alone, we can see at least two types of symptoms: the powdery mildew and the black spots. This suggests that there may be more than one pathogen involved, but it's also possible that the black spots are a result of the same fungal infection causing the powdery mildew.\\n\\nAnswer: B) Two pathogens\"\n", "│ │ },\n", "│ │ {\n", "│ │ │ 'generated_answer': 'The symptoms observed, characterized by the massive gum production on the trunks of the grapefruit trees in Cyprus, suggest a physiological or pathological response. Given the absence of visible signs of damage or pests from a higher point on a hillside, and considering the specific nature of the symptom (gum production), we can infer that the cause is more likely related to an internal process within the tree rather than external damage from harvesting. 
While physiological stress (B) could lead to such symptoms, the primary reason for gum production in trees, especially in citrus species, is typically linked to disease. Among the options provided, fungal gummosis (E) is a condition known to cause gumming in citrus trees, which aligns with the observed symptoms. Therefore, without direct evidence of external damage (harvesting) or confirmation of physiological stress being the primary cause, the most appropriate answer based on the information given is:\\n\\nAnswer: E'\n", "│ │ },\n", "│ │ {'generated_answer': 'Answer: D'},\n", "│ │ {\n", "│ │ │ 'generated_answer': \"**Analysis of the Image**\\n\\nThe image provided shows a rhubarb plant with split petioles. To determine the cause of this issue, we need to consider various factors that could lead to such damage.\\n\\n**Possible Causes of Petiole Splitting**\\n\\n* **Physiological Problems**: Rhubarb plants can experience physiological stress due to environmental factors like extreme temperatures, waterlogging, or nutrient deficiencies. This stress can cause the petioles to split.\\n* **Phytoplasma Infection**: Phytoplasma is a type of bacteria that can infect plants, including rhubarb. It can cause symptoms such as yellowing leaves, stunted growth, and splitting of petioles.\\n* **Animal Damage**: Animals like rabbits, deer, or insects can damage rhubarb plants by eating the leaves or stems, which can lead to splitting of the petioles.\\n* **Bacteria**: Bacterial infections can also cause damage to rhubarb plants, including splitting of the petioles.\\n\\n**Conclusion**\\n\\nBased on the analysis, it is clear that all the options listed (A) Physiological problems, B) Phytoplasma infection, D) Animal damage, and E) Bacteria) could potentially cause the petioles of the rhubarb plant to split. 
Therefore, there is no single option that would not be a cause for the petioles splitting.\\n\\n**Answer**: C) I don't know and don't want to guess.\"\n", "│ │ }\n", "│ ],\n", "│ scores={\n", "│ │ 'basic::regex_parser_multiple_choice_answer': ScoringResult(\n", "│ │ │ aggregated_results={'accuracy': {'accuracy': 0.2, 'num_correct': 1.0, 'num_total': 5}},\n", "│ │ │ score_rows=[{'score': 0.0}, {'score': 0.0}, {'score': 0.0}, {'score': 1.0}, {'score': 0.0}]\n", "│ │ )\n", "│ }\n", ")\n", "\n" ], "text/plain": [ "\u001b[1;35mEvaluateResponse\u001b[0m\u001b[1m(\u001b[0m\n", "\u001b[2;32m│ \u001b[0m\u001b[33mgenerations\u001b[0m=\u001b[1m[\u001b[0m\n", "\u001b[2;32m│ │ \u001b[0m\u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[32m'generated_answer'\u001b[0m: \u001b[32m'**Potato Pests**\\n\\nThe two insects depicted are:\\n\\n* **Colorado Potato Beetle \u001b[0m\u001b[32m(\u001b[0m\u001b[32mLeptinotarsa decemlineata\u001b[0m\u001b[32m)\u001b[0m\u001b[32m**: Characterized by black and yellow stripes, this beetle is a significant pest of potatoes. It feeds on the leaves and can cause substantial damage to the crop.\\n* **False Potato Beetle \u001b[0m\u001b[32m(\u001b[0m\u001b[32mLeptinotarsa juncta\u001b[0m\u001b[32m)\u001b[0m\u001b[32m**: Also known as the false Colorado beetle, this species has similar coloring but is not as harmful to potatoes as the Colorado potato beetle.'\u001b[0m\n", "\u001b[2;32m│ │ \u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ \u001b[0m\u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[32m'generated_answer'\u001b[0m: \u001b[32m\"The image shows a sunflower leaf with a powdery mildew, which is a fungal disease caused by various species of fungi. The white powdery coating on the leaves is a characteristic symptom of this disease. The leaf also has some black spots, which could be indicative of a secondary infection or another type of disease. 
However, without more information or a closer examination, it's difficult to determine the exact cause of the black spots.\\n\\nBased on the image alone, we can see at least two types of symptoms: the powdery mildew and the black spots. This suggests that there may be more than one pathogen involved, but it's also possible that the black spots are a result of the same fungal infection causing the powdery mildew.\\n\\nAnswer: B\u001b[0m\u001b[32m)\u001b[0m\u001b[32m Two pathogens\"\u001b[0m\n", "\u001b[2;32m│ │ \u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ \u001b[0m\u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[32m'generated_answer'\u001b[0m: \u001b[32m'The symptoms observed, characterized by the massive gum production on the trunks of the grapefruit trees in Cyprus, suggest a physiological or pathological response. Given the absence of visible signs of damage or pests from a higher point on a hillside, and considering the specific nature of the symptom \u001b[0m\u001b[32m(\u001b[0m\u001b[32mgum production\u001b[0m\u001b[32m)\u001b[0m\u001b[32m, we can infer that the cause is more likely related to an internal process within the tree rather than external damage from harvesting. While physiological stress \u001b[0m\u001b[32m(\u001b[0m\u001b[32mB\u001b[0m\u001b[32m)\u001b[0m\u001b[32m could lead to such symptoms, the primary reason for gum production in trees, especially in citrus species, is typically linked to disease. Among the options provided, fungal gummosis \u001b[0m\u001b[32m(\u001b[0m\u001b[32mE\u001b[0m\u001b[32m)\u001b[0m\u001b[32m is a condition known to cause gumming in citrus trees, which aligns with the observed symptoms. 
Therefore, without direct evidence of external damage \u001b[0m\u001b[32m(\u001b[0m\u001b[32mharvesting\u001b[0m\u001b[32m)\u001b[0m\u001b[32m or confirmation of physiological stress being the primary cause, the most appropriate answer based on the information given is:\\n\\nAnswer: E'\u001b[0m\n", "\u001b[2;32m│ │ \u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ \u001b[0m\u001b[1m{\u001b[0m\u001b[32m'generated_answer'\u001b[0m: \u001b[32m'Answer: D'\u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ \u001b[0m\u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[32m'generated_answer'\u001b[0m: \u001b[32m\"**Analysis of the Image**\\n\\nThe image provided shows a rhubarb plant with split petioles. To determine the cause of this issue, we need to consider various factors that could lead to such damage.\\n\\n**Possible Causes of Petiole Splitting**\\n\\n* **Physiological Problems**: Rhubarb plants can experience physiological stress due to environmental factors like extreme temperatures, waterlogging, or nutrient deficiencies. This stress can cause the petioles to split.\\n* **Phytoplasma Infection**: Phytoplasma is a type of bacteria that can infect plants, including rhubarb. 
It can cause symptoms such as yellowing leaves, stunted growth, and splitting of petioles.\\n* **Animal Damage**: Animals like rabbits, deer, or insects can damage rhubarb plants by eating the leaves or stems, which can lead to splitting of the petioles.\\n* **Bacteria**: Bacterial infections can also cause damage to rhubarb plants, including splitting of the petioles.\\n\\n**Conclusion**\\n\\nBased on the analysis, it is clear that all the options listed \u001b[0m\u001b[32m(\u001b[0m\u001b[32mA\u001b[0m\u001b[32m)\u001b[0m\u001b[32m Physiological problems, B\u001b[0m\u001b[32m)\u001b[0m\u001b[32m Phytoplasma infection, D\u001b[0m\u001b[32m)\u001b[0m\u001b[32m Animal damage, and E\u001b[0m\u001b[32m)\u001b[0m\u001b[32m Bacteria\u001b[0m\u001b[32m)\u001b[0m\u001b[32m could potentially cause the petioles of the rhubarb plant to split. Therefore, there is no single option that would not be a cause for the petioles splitting.\\n\\n**Answer**: C\u001b[0m\u001b[32m)\u001b[0m\u001b[32m I don't know and don't want to guess.\"\u001b[0m\n", "\u001b[2;32m│ │ \u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[2;32m│ \u001b[0m\u001b[1m]\u001b[0m,\n", "\u001b[2;32m│ \u001b[0m\u001b[33mscores\u001b[0m=\u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ \u001b[0m\u001b[32m'basic::regex_parser_multiple_choice_answer'\u001b[0m: \u001b[1;35mScoringResult\u001b[0m\u001b[1m(\u001b[0m\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[33maggregated_results\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'accuracy'\u001b[0m: \u001b[1m{\u001b[0m\u001b[32m'accuracy'\u001b[0m: \u001b[1;36m0.2\u001b[0m, \u001b[32m'num_correct'\u001b[0m: \u001b[1;36m1.0\u001b[0m, \u001b[32m'num_total'\u001b[0m: \u001b[1;36m5\u001b[0m\u001b[1m}\u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[33mscore_rows\u001b[0m=\u001b[1m[\u001b[0m\u001b[1m{\u001b[0m\u001b[32m'score'\u001b[0m: \u001b[1;36m0.0\u001b[0m\u001b[1m}\u001b[0m, \u001b[1m{\u001b[0m\u001b[32m'score'\u001b[0m: \u001b[1;36m0.0\u001b[0m\u001b[1m}\u001b[0m, 
\u001b[1m{\u001b[0m\u001b[32m'score'\u001b[0m: \u001b[1;36m0.0\u001b[0m\u001b[1m}\u001b[0m, \u001b[1m{\u001b[0m\u001b[32m'score'\u001b[0m: \u001b[1;36m1.0\u001b[0m\u001b[1m}\u001b[0m, \u001b[1m{\u001b[0m\u001b[32m'score'\u001b[0m: \u001b[1;36m0.0\u001b[0m\u001b[1m}\u001b[0m\u001b[1m]\u001b[0m\n", "\u001b[2;32m│ │ \u001b[0m\u001b[1m)\u001b[0m\n", "\u001b[2;32m│ \u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from rich.pretty import pprint\n", "from tqdm import tqdm\n", "\n", "SYSTEM_PROMPT_TEMPLATE = \"\"\"\n", "You are an expert in {subject} whose job is to answer questions from the user using images.\n", "\n", "First, reason about the correct answer.\n", "\n", "Then write the answer in the following format where X is exactly one of A,B,C,D:\n", "\n", "Answer: X\n", "\n", "Make sure X is one of A,B,C,D.\n", "\n", "If you are uncertain of the correct answer, guess the most likely one.\n", "\"\"\"\n", "\n", "system_message = {\n", " \"role\": \"system\",\n", " \"content\": SYSTEM_PROMPT_TEMPLATE.format(subject=subset),\n", "}\n", "\n", "client.benchmarks.register(\n", " benchmark_id=\"meta-reference::mmmu\",\n", " # Note: we can use any value as `dataset_id` because we'll be using the `evaluate_rows` API which accepts the\n", " # `input_rows` argument and does not fetch data from the dataset.\n", " dataset_id=f\"mmmu-{subset}-{split}\",\n", " # Note: for the same reason as above, we can use any value as `scoring_functions`.\n", " scoring_functions=[],\n", ")\n", "\n", "response = client.eval.evaluate_rows(\n", " benchmark_id=\"meta-reference::mmmu\",\n", " input_rows=eval_rows,\n", " # Note: Here we define the actual scoring functions.\n", " scoring_functions=[\"basic::regex_parser_multiple_choice_answer\"],\n", " benchmark_config={\n", " \"eval_candidate\": {\n", " \"type\": \"model\",\n", " \"model\": \"meta-llama/Llama-3.2-90B-Vision-Instruct\",\n", " \"sampling_params\": {\n", " 
\"strategy\": {\n", " \"type\": \"top_p\",\n", " \"temperature\": 1.0,\n", " \"top_p\": 0.95,\n", " },\n", " \"max_tokens\": 4096,\n", " \"repeat_penalty\": 1.0,\n", " },\n", " \"system_message\": system_message,\n", " },\n", " },\n", ")\n", "pprint(response)\n" ] }, { "cell_type": "markdown", "metadata": { "id": "vYlb9wKzwg-s" }, "source": [ "#### 1.2. Running SimpleQA\n", "- We will use a pre-processed SimpleQA dataset from [llamastack/evals](https://huggingface.co/datasets/llamastack/evals/viewer/evals__simpleqa) which is obtained by transforming the input query into correct format accepted by `inference/chat-completion` API.\n", "- Since we will be using this same dataset in our next example for Agentic evaluation, we will register it using the `/datasets` API, and interact with it through `/datasetio` API." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "HXmZf3Ymw-aX" }, "outputs": [], "source": [ "simpleqa_dataset_id = \"huggingface::simpleqa\"\n", "\n", "register_dataset_response = client.datasets.register(\n", " purpose=\"eval/messages-answer\",\n", " source={\n", " \"type\": \"uri\",\n", " \"uri\": \"huggingface://datasets/llamastack/simpleqa?split=train\",\n", " },\n", " dataset_id=simpleqa_dataset_id,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Gc8azb4Rxr5J" }, "outputs": [], "source": [ "eval_rows = client.datasets.iterrows(\n", " dataset_id=simpleqa_dataset_id,\n", " limit=5,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 506 }, "id": "zSYAUnBUyRaG", "outputId": "038cf42f-4e3c-4053-b3c4-cf16547483dd" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 0%| | 0/5 [00:00, ?it/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 5/5 [00:13<00:00, 2.71s/it]\n" ] }, { "data": { "text/html": [ "
EvaluateResponse(\n", "│ generations=[\n", "│ │ {'generated_answer': \"I'm not sure who received the IEEE Frank Rosenblatt Award in 2010.\"},\n", "│ │ {'generated_answer': \"I'm not aware of the information about the 2018 Jerlov Award recipient.\"},\n", "│ │ {\n", "│ │ │ 'generated_answer': \"Radcliffe College was a women's liberal arts college in Cambridge, Massachusetts. However, it merged with Harvard University in 1977 and is now known as the Radcliffe Institute for Advanced Study at Harvard University.\"\n", "│ │ },\n", "│ │ {'generated_answer': 'I am unable to verify in whose honor the Leipzig 1877 tournament was organized.'},\n", "│ │ {\n", "│ │ │ 'generated_answer': \"I am unable to verify what Empress Elizabeth of Austria's favorite sculpture depicted at her villa Achilleion at Corfu, according to Karl Küchler.\"\n", "│ │ }\n", "│ ],\n", "│ scores={\n", "│ │ 'llm-as-judge::405b-simpleqa': ScoringResult(\n", "│ │ │ aggregated_results={'categorical_count': {'categorical_count': {'A': 1, 'C': 4}}},\n", "│ │ │ score_rows=[\n", "│ │ │ │ {'score': 'C', 'judge_feedback': 'C'},\n", "│ │ │ │ {'score': 'C', 'judge_feedback': 'C'},\n", "│ │ │ │ {'score': 'A', 'judge_feedback': 'A'},\n", "│ │ │ │ {'score': 'C', 'judge_feedback': 'C'},\n", "│ │ │ │ {'score': 'C', 'judge_feedback': 'C'}\n", "│ │ │ ]\n", "│ │ )\n", "│ }\n", ")\n", "\n" ], "text/plain": [ "\u001b[1;35mEvaluateResponse\u001b[0m\u001b[1m(\u001b[0m\n", "\u001b[2;32m│ \u001b[0m\u001b[33mgenerations\u001b[0m=\u001b[1m[\u001b[0m\n", "\u001b[2;32m│ │ \u001b[0m\u001b[1m{\u001b[0m\u001b[32m'generated_answer'\u001b[0m: \u001b[32m\"I'm not sure who received the IEEE Frank Rosenblatt Award in 2010.\"\u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ \u001b[0m\u001b[1m{\u001b[0m\u001b[32m'generated_answer'\u001b[0m: \u001b[32m\"I'm not aware of the information about the 2018 Jerlov Award recipient.\"\u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ \u001b[0m\u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ 
\u001b[0m\u001b[32m'generated_answer'\u001b[0m: \u001b[32m\"Radcliffe College was a women's liberal arts college in Cambridge, Massachusetts. However, it merged with Harvard University in 1977 and is now known as the Radcliffe Institute for Advanced Study at Harvard University.\"\u001b[0m\n", "\u001b[2;32m│ │ \u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ \u001b[0m\u001b[1m{\u001b[0m\u001b[32m'generated_answer'\u001b[0m: \u001b[32m'I am unable to verify in whose honor the Leipzig 1877 tournament was organized.'\u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ \u001b[0m\u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[32m'generated_answer'\u001b[0m: \u001b[32m\"I am unable to verify what Empress Elizabeth of Austria's favorite sculpture depicted at her villa Achilleion at Corfu, according to Karl Küchler.\"\u001b[0m\n", "\u001b[2;32m│ │ \u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[2;32m│ \u001b[0m\u001b[1m]\u001b[0m,\n", "\u001b[2;32m│ \u001b[0m\u001b[33mscores\u001b[0m=\u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ \u001b[0m\u001b[32m'llm-as-judge::405b-simpleqa'\u001b[0m: \u001b[1;35mScoringResult\u001b[0m\u001b[1m(\u001b[0m\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[33maggregated_results\u001b[0m=\u001b[1m{\u001b[0m\u001b[32m'categorical_count'\u001b[0m: \u001b[1m{\u001b[0m\u001b[32m'categorical_count'\u001b[0m: \u001b[1m{\u001b[0m\u001b[32m'A'\u001b[0m: \u001b[1;36m1\u001b[0m, \u001b[32m'C'\u001b[0m: \u001b[1;36m4\u001b[0m\u001b[1m}\u001b[0m\u001b[1m}\u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[33mscore_rows\u001b[0m=\u001b[1m[\u001b[0m\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m{\u001b[0m\u001b[32m'score'\u001b[0m: \u001b[32m'C'\u001b[0m, \u001b[32m'judge_feedback'\u001b[0m: \u001b[32m'C'\u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m{\u001b[0m\u001b[32m'score'\u001b[0m: \u001b[32m'C'\u001b[0m, \u001b[32m'judge_feedback'\u001b[0m: \u001b[32m'C'\u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ │ │ 
\u001b[0m\u001b[1m{\u001b[0m\u001b[32m'score'\u001b[0m: \u001b[32m'A'\u001b[0m, \u001b[32m'judge_feedback'\u001b[0m: \u001b[32m'A'\u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m{\u001b[0m\u001b[32m'score'\u001b[0m: \u001b[32m'C'\u001b[0m, \u001b[32m'judge_feedback'\u001b[0m: \u001b[32m'C'\u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m{\u001b[0m\u001b[32m'score'\u001b[0m: \u001b[32m'C'\u001b[0m, \u001b[32m'judge_feedback'\u001b[0m: \u001b[32m'C'\u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[1m]\u001b[0m\n", "\u001b[2;32m│ │ \u001b[0m\u001b[1m)\u001b[0m\n", "\u001b[2;32m│ \u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# register 405B as LLM Judge model\n", "client.models.register(\n", " model=\"meta-llama/Llama-3.1-405B-Instruct\",\n", " provider_model_id=\"meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo\",\n", " provider_id=\"together\",\n", ")\n", "\n", "client.benchmarks.register(\n", " benchmark_id=\"meta-reference::simpleqa\",\n", " dataset_id=simpleqa_dataset_id,\n", " scoring_functions=[\"llm-as-judge::405b-simpleqa\"],\n", ")\n", "\n", "response = client.eval.evaluate_rows(\n", " benchmark_id=\"meta-reference::simpleqa\",\n", " input_rows=eval_rows.data,\n", " scoring_functions=[\"llm-as-judge::405b-simpleqa\"],\n", " benchmark_config={\n", " \"eval_candidate\": {\n", " \"type\": \"model\",\n", " \"model\": \"meta-llama/Llama-3.2-90B-Vision-Instruct\",\n", " \"sampling_params\": {\n", " \"strategy\": {\n", " \"type\": \"greedy\",\n", " },\n", " \"max_tokens\": 4096,\n", " \"repeat_penalty\": 1.0,\n", " },\n", " },\n", " },\n", ")\n", "pprint(response)\n" ] }, { "cell_type": "markdown", "metadata": { "id": "eyziqe_Em6d6" }, "source": [ "## 2. 
Agentic Evaluation\n", "\n", "- In this example, we will demonstrate how to evaluate an agent candidate served by Llama Stack via the `/agent` API.\n", "\n", "- We will continue to use the SimpleQA dataset we used in the previous example.\n", "\n", "- Instead of running the evaluation on a model, we will run it on a Search Agent with access to a search tool. We will define our agent evaluation candidate through `AgentConfig`.\n", "\n", "> You will need to set the `TAVILY_SEARCH_API_KEY` in the Secrets of this notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 538 }, "id": "mxLCsP4MvFqP", "outputId": "8be2a32f-2a47-4443-8992-0000c23ca678" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "5it [00:06, 1.33s/it]\n" ] }, { "data": { "text/html": [ "
EvaluateResponse(\n", "│ generations=[\n", "│ │ {\n", "│ │ │ 'generated_answer': 'The IEEE Frank Rosenblatt Award was given to Professor John Shawe-Taylor in 2010 for his contributions to the foundations of kernel methods.'\n", "│ │ },\n", "│ │ {\n", "│ │ │ 'generated_answer': 'The Jerlov Award is given by The Oceanography Society to recognize outstanding contributions to the field of ocean optics. The 2018 Jerlov Award was awarded to Dr. Kendall L. Carder.'\n", "│ │ },\n", "│ │ {\n", "│ │ │ 'generated_answer': \"The women's liberal arts college in Cambridge, Massachusetts is Radcliffe College. However, in 1999, Radcliffe College merged with Harvard University to form the Radcliffe Institute for Advanced Study at Harvard University. The institute is still located in Cambridge, Massachusetts, and is dedicated to supporting women's education and research.\"\n", "│ │ },\n", "│ │ {'generated_answer': 'The Leipzig 1877 tournament was organized in honor of Adolf Anderssen.'},\n", "│ │ {\n", "│ │ │ 'generated_answer': \"According to Karl Küchler, Empress Elizabeth of Austria's favorite sculpture, which was made for her villa Achilleion at Corfu, depicted the Dying Achilles.\"\n", "│ │ }\n", "│ ],\n", "│ scores={\n", "│ │ 'llm-as-judge::405b-simpleqa': ScoringResult(\n", "│ │ │ aggregated_results={},\n", "│ │ │ score_rows=[\n", "│ │ │ │ {'score': 'B', 'judge_feedback': 'B'},\n", "│ │ │ │ {'score': 'B', 'judge_feedback': 'B'},\n", "│ │ │ │ {'score': 'A', 'judge_feedback': 'A'},\n", "│ │ │ │ {'score': 'A', 'judge_feedback': 'A'},\n", "│ │ │ │ {'score': 'B', 'judge_feedback': 'B'}\n", "│ │ │ ]\n", "│ │ )\n", "│ }\n", ")\n", "\n" ], "text/plain": [ "\u001b[1;35mEvaluateResponse\u001b[0m\u001b[1m(\u001b[0m\n", "\u001b[2;32m│ \u001b[0m\u001b[33mgenerations\u001b[0m=\u001b[1m[\u001b[0m\n", "\u001b[2;32m│ │ \u001b[0m\u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[32m'generated_answer'\u001b[0m: \u001b[32m'The IEEE Frank Rosenblatt Award was given to Professor John 
Shawe-Taylor in 2010 for his contributions to the foundations of kernel methods.'\u001b[0m\n", "\u001b[2;32m│ │ \u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ \u001b[0m\u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[32m'generated_answer'\u001b[0m: \u001b[32m'The Jerlov Award is given by The Oceanography Society to recognize outstanding contributions to the field of ocean optics. The 2018 Jerlov Award was awarded to Dr. Kendall L. Carder.'\u001b[0m\n", "\u001b[2;32m│ │ \u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ \u001b[0m\u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[32m'generated_answer'\u001b[0m: \u001b[32m\"The women's liberal arts college in Cambridge, Massachusetts is Radcliffe College. However, in 1999, Radcliffe College merged with Harvard University to form the Radcliffe Institute for Advanced Study at Harvard University. The institute is still located in Cambridge, Massachusetts, and is dedicated to supporting women's education and research.\"\u001b[0m\n", "\u001b[2;32m│ │ \u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ \u001b[0m\u001b[1m{\u001b[0m\u001b[32m'generated_answer'\u001b[0m: \u001b[32m'The Leipzig 1877 tournament was organized in honor of Adolf Anderssen.'\u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ \u001b[0m\u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[32m'generated_answer'\u001b[0m: \u001b[32m\"According to Karl Küchler, Empress Elizabeth of Austria's favorite sculpture, which was made for her villa Achilleion at Corfu, depicted the Dying Achilles.\"\u001b[0m\n", "\u001b[2;32m│ │ \u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[2;32m│ \u001b[0m\u001b[1m]\u001b[0m,\n", "\u001b[2;32m│ \u001b[0m\u001b[33mscores\u001b[0m=\u001b[1m{\u001b[0m\n", "\u001b[2;32m│ │ \u001b[0m\u001b[32m'llm-as-judge::405b-simpleqa'\u001b[0m: \u001b[1;35mScoringResult\u001b[0m\u001b[1m(\u001b[0m\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[33maggregated_results\u001b[0m=\u001b[1m{\u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ 
│ \u001b[0m\u001b[33mscore_rows\u001b[0m=\u001b[1m[\u001b[0m\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m{\u001b[0m\u001b[32m'score'\u001b[0m: \u001b[32m'B'\u001b[0m, \u001b[32m'judge_feedback'\u001b[0m: \u001b[32m'B'\u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m{\u001b[0m\u001b[32m'score'\u001b[0m: \u001b[32m'B'\u001b[0m, \u001b[32m'judge_feedback'\u001b[0m: \u001b[32m'B'\u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m{\u001b[0m\u001b[32m'score'\u001b[0m: \u001b[32m'A'\u001b[0m, \u001b[32m'judge_feedback'\u001b[0m: \u001b[32m'A'\u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m{\u001b[0m\u001b[32m'score'\u001b[0m: \u001b[32m'A'\u001b[0m, \u001b[32m'judge_feedback'\u001b[0m: \u001b[32m'A'\u001b[0m\u001b[1m}\u001b[0m,\n", "\u001b[2;32m│ │ │ │ \u001b[0m\u001b[1m{\u001b[0m\u001b[32m'score'\u001b[0m: \u001b[32m'B'\u001b[0m, \u001b[32m'judge_feedback'\u001b[0m: \u001b[32m'B'\u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[2;32m│ │ │ \u001b[0m\u001b[1m]\u001b[0m\n", "\u001b[2;32m│ │ \u001b[0m\u001b[1m)\u001b[0m\n", "\u001b[2;32m│ \u001b[0m\u001b[1m}\u001b[0m\n", "\u001b[1m)\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "agent_config = {\n", " \"model\": \"meta-llama/Llama-3.3-70B-Instruct\",\n", " \"instructions\": \"You are a helpful assistant that have access to tool to search the web. 
\",\n", " \"sampling_params\": {\n", " \"strategy\": {\n", " \"type\": \"top_p\",\n", " \"temperature\": 0.5,\n", " \"top_p\": 0.9,\n", " }\n", " },\n", " \"toolgroups\": [\n", " \"builtin::websearch\",\n", " ],\n", " \"tool_choice\": \"auto\",\n", " \"tool_prompt_format\": \"json\",\n", " \"input_shields\": [],\n", " \"output_shields\": [],\n", " \"enable_session_persistence\": False,\n", "}\n", "\n", "response = client.eval.evaluate_rows(\n", " benchmark_id=\"meta-reference::simpleqa\",\n", " input_rows=eval_rows.data,\n", " scoring_functions=[\"llm-as-judge::405b-simpleqa\"],\n", " benchmark_config={\n", " \"eval_candidate\": {\n", " \"type\": \"agent\",\n", " \"config\": agent_config,\n", " },\n", " },\n", ")\n", "pprint(response)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "lxc9-eXYK5Av" }, "outputs": [], "source": [] } ], "metadata": { "colab": { "collapsed_sections": [ "bxs0FJ1ckGa6", "eyziqe_Em6d6" ], "provenance": [] }, "kernelspec": { "display_name": "master", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.16" } }, "nbformat": 4, "nbformat_minor": 0 }