About the LLM Leaderboard

This page explains how we built the LLM Leaderboard: the benchmarks we used, the core skills we define and measure, the evaluation metrics and parameters, and how to reproduce our results. For background and motivation see our release blog post. Technical reports are available in the pocket-ml-testbench repository.

You can inspect summarized results and per-supplier details on the public LLM Leaderboard.

Benchmarks

We evaluate models using one in-house generative benchmark (bAbISteps) plus six established benchmarks via the EleutherAI Language Model Evaluation Harness, a unified framework for testing LLMs on a large number of different evaluation tasks. These tasks measure a range of skills, including reasoning, knowledge, instruction following, and arithmetic.

  • bAbISteps - A comprehensive basic reasoning benchmark inspired by Meta's bAbI, enhanced with a powerful generative engine currently under development by Pnyx.
  • BBH (Big Bench Hard) (https://arxiv.org/abs/2210.09261) – A subset of 23 challenging tasks from the BigBench dataset to evaluate language models. The tasks use objective metrics, are highly difficult, and have sufficient sample sizes for statistical significance. They include multistep arithmetic, algorithmic reasoning (e.g., boolean expressions, SVG shapes), language understanding (e.g., sarcasm detection, name disambiguation), and world knowledge. BBH performance correlates well with human preferences, providing valuable insights into model capabilities.
  • GPQA (Graduate-Level Google-Proof Q&A Benchmark) (https://arxiv.org/abs/2311.12022) – GPQA is a highly challenging knowledge dataset with questions crafted by PhD-level domain experts in fields like biology, physics, and chemistry. These questions are designed to be difficult for laypersons but relatively easy for experts. The dataset has undergone multiple rounds of validation to ensure both difficulty and factual accuracy. Access to GPQA is restricted through gating mechanisms to minimize the risk of data contamination. Consequently, we do not provide plain text examples from this dataset, as requested by the authors.
  • GSM8K (https://arxiv.org/abs/2110.14168) – GSM8K consists of 8.5K high quality grade school math problems created by human problem writers. Problems were segmented into 7.5K training problems and 1K test problems. These problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ - / *) to reach the final answer. A bright middle school student should be able to solve every problem.
  • IFEval (https://arxiv.org/abs/2311.07911) – IFEval is a dataset designed to test a model's ability to follow explicit instructions, such as "include keyword x" or "use format y." The focus is on the model's adherence to formatting instructions rather than the content generated, allowing for the use of strict and rigorous metrics.
  • MMLU (Measuring Massive Multitask Language Understanding) (https://arxiv.org/abs/2009.03300) – MMLU is a test that measures a text model's multitask accuracy. It covers 57 tasks, including elementary mathematics, US history, computer science, law, and more.
  • MMLU-PRO (https://arxiv.org/abs/2406.01574) – MMLU-Pro is a refined version of the MMLU dataset, which has been a standard for multiple-choice knowledge assessment. Recent research identified issues with the original MMLU, such as noisy data (some unanswerable questions) and decreasing difficulty due to advances in model capabilities and increased data contamination. MMLU-Pro addresses these issues by presenting models with 10 choices instead of 4, requiring reasoning on more questions, and undergoing expert review to reduce noise. As a result, MMLU-Pro is of higher quality and currently more challenging than the original.

💡 For all these evaluations, a higher score is better. We chose these benchmarks because together they test reasoning and general knowledge across a wide variety of fields, using zero- to few-shot prompts.
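
As an illustration, the established benchmarks can be run through the harness's standard Python entry point. The sketch below evaluates the stock gsm8k task on a placeholder Hugging Face model; the customized leaderboard tasks (gsm8k_chat, bbh_split, etc.) are defined in the pocket-ml-testbench repository and are not shipped with the stock harness.

    # Minimal sketch: evaluating a stock harness task (not the customized
    # leaderboard variants). Requires `pip install lm-eval`; the model name
    # is only a placeholder.
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",                                          # Hugging Face backend
        model_args="pretrained=Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model
        tasks=["gsm8k"],                                     # stock task, not gsm8k_chat
        num_fewshot=0,                                       # zero-shot prompting
        batch_size=1,
    )

    # Per-task metrics live under results["results"].
    print(results["results"]["gsm8k"])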

Task Evaluations and Parameters

Note: The attentive reader will notice that we use the A-VERT score (a-vert_match). This new evaluation method was developed by the Pnyx team to address the limitations of the traditional methods for evaluating natural language generative models (Exact Matching and Accuracy), while also bringing the evaluation environment closer to the real-world use of LLMs, since few-shot prompts are no longer necessary. For more details, please refer to the published work here. Related to this change, the "Customized: yes/no" label that appears below indicates whether the default task from the EleutherAI Language Model Evaluation Harness was modified to adapt its evaluation metric to A-VERT. The changes to each task can be found in the pocket-ml-testbench repository.
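
To make the motivation concrete, here is a toy Python contrast (not the A-VERT implementation; see the published work for the actual method) between strict exact matching and a more tolerant extract-then-compare check on a free-form generative answer:

    # Toy contrast only: strict exact match vs. a relaxed numeric check.
    # A-VERT itself is a different, published method; this merely shows why
    # exact matching under-scores free-form generative answers.
    import re

    reference = "42"
    generation = "Let's reason step by step. The final answer is 42."

    # Strict exact match fails even though the answer is correct.
    print(generation.strip() == reference)             # False

    # Relaxed check: extract the last number and compare it.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", generation)
    print(bool(numbers) and numbers[-1] == reference)  # True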

  • bAbISteps
    • Overview Task: babisteps-chat-zero-shot-all
    • Customized: yes
    • Measure: A-VERT score (a-vert_match)

  • Big Bench Hard (BBH)
    • Overview Task: bbh_split
    • Customized: yes
    • Measure: A-VERT score (a-vert_match)

  • Graduate-Level Google-Proof Q&A (GPQA)
    • Overview Task: gpqa_subtask
    • Customized: yes
    • Measure: A-VERT score (a-vert_match)

  • GSM8K
    • Task: gsm8k_chat
    • Customized: yes
    • Measure: A-VERT score (a-vert_match)

  • IFEval
    • Task: ifeval
    • Customized: no
    • Measure: Prompt Level Loose Accuracy (prompt_level_loose_acc)

  • MMLU
    • Overview Task: mmlu_chat
    • Customized: yes
    • Measure: A-VERT score (a-vert_match)

  • MMLU-PRO
    • Overview Task: mmlu_pro_categories
    • Customized: yes
    • Measure: A-VERT score (a-vert_match)

Skills

As mentioned in the release blog post, at Pnyx we have developed a taxonomy that unifies and relates core skills with the benchmarks detailed above. We present below the list of skills:

  • Reasoning & Logic: Ability to analyze information, identify patterns, and solve complex problems using logical and deductive thinking.
  • Foundational Knowledge: Ability to understand and apply basic knowledge in various disciplines and fields of study.
  • Interpretation & Communication: Ability to accurately interpret information and communicate ideas clearly and effectively.
  • Social & Ethical: Ability to understand and navigate social and ethical contexts, making informed and responsible decisions.
  • Creative & Interpersonal: Ability to generate original ideas and collaborate effectively with others in diverse environments.

To evaluate these skills, we mapped sub-tasks from the benchmarks mentioned above to each core skill. The selection of sub-tasks associated with each benchmark was refined by our team to strike a balance between specificity, difficulty, and metric saturation. Below is the assignment of sub-tasks (from each benchmark) to these skills; a short aggregation sketch follows the lists.

Reasoning & Logic

  • bAbISteps:

    babisteps-chat_zero_shot-task_01-simpletracking, babisteps-chat_zero_shot-task_02-immediateorder, babisteps-chat_zero_shot-task_03-complextracking, babisteps-chat_zero_shot-task_04-listing, babisteps-chat_zero_shot-task_05-sizeorder, babisteps-chat_zero_shot-task_06-spatialorder, babisteps-chat_zero_shot-task_07-temporalorder, babisteps-chat_zero_shot-task_08-pathfinding, babisteps-chat_zero_shot-task_09-timetracking

  • BBH:

    bbh-split_01-boolean_expressions, bbh-split_02-causal_judgement, bbh-split_03-date_understanding, bbh-split_05-dyck_languages, bbh-split_06-formal_fallacies, bbh-split_10-logical_deduction_seven_objects, bbh-split_15-object_counting, bbh-split_27-word_sorting

  • GSM8K:

    gsm8k_chat

  • MMLU:

    mmlu_abstract_algebra_chat_generative, mmlu_logical_fallacies_chat_generative

  • MMLU-PRO:

    mmlu_pro-category_math

Foundational Knowledge

  • BBH:

    bbh-split_21-sports_understanding

  • GPQA:

    gpqa_subtask_main_biology, gpqa_subtask_main_chemistry, gpqa_subtask_main_physics

  • MMLU:

    mmlu_anatomy_chat_generative, mmlu_human_aging_chat_generative, mmlu_nutrition_chat_generative, mmlu_high_school_world_history_chat_generative, mmlu_high_school_macroeconomics_chat_generative, mmlu_high_school_physics_chat_generative, mmlu_virology_chat_generative, mmlu_computer_security_chat_generative, mmlu_college_computer_science_chat_generative

  • MMLU-PRO:

    mmlu_pro-category_psychology, mmlu_pro-category_health, mmlu_pro-category_biology, mmlu_pro-category_chemistry, mmlu_pro-category_physics, mmlu_pro-category_engineering, mmlu_pro-category_computer_science, mmlu_pro-category_other

Interpretation & Communication

  • BBH:

    bbh-split_04-disambiguation_qa, bbh-split_08-hyperbaton, bbh-split_19-salient_translation_error_detection, bbh-split_20-snarks

  • IFEval:

    ifeval

Social & Ethical

  • MMLU:

    mmlu_sociology_chat_generative, mmlu_human_sexuality_chat_generative, mmlu_professional_law_chat_generative, mmlu_global_facts_chat_generative, mmlu_business_ethics_chat_generative, mmlu_moral_disputes_chat_generative, mmlu_moral_scenarios_chat_generative

  • MMLU-PRO:

    mmlu_pro-category_history, mmlu_pro-category_philosophy, mmlu_pro-category_law

Creative & Interpersonal

  • BBH:

    bbh-split_12-movie_recommendation, bbh-split_18-ruin_names

  • MMLU:

    mmlu_public_relations_chat_generative, mmlu_management_chat_generative

  • MMLU-PRO:

    mmlu_pro-category_business
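
As a rough sketch of how this taxonomy is applied, per-task scores can be aggregated into per-skill scores by averaging over the mapped sub-tasks. The exact aggregation used by the leaderboard is defined in the pocket-ml-testbench repository; an unweighted mean is assumed below purely for illustration, with made-up scores.

    # Sketch (assumption: unweighted mean per skill), using a small subset
    # of the sub-task mapping listed above and invented example scores.
    from statistics import mean

    SKILL_TASKS = {
        "Reasoning & Logic": ["gsm8k_chat", "mmlu_pro-category_math"],
        "Interpretation & Communication": ["ifeval", "bbh-split_20-snarks"],
        # ... the remaining skills map to the sub-task lists above
    }

    def skill_scores(task_scores):
        """Average the scores of the sub-tasks mapped to each skill."""
        return {
            skill: mean(task_scores[t] for t in tasks if t in task_scores)
            for skill, tasks in SKILL_TASKS.items()
        }

    print(skill_scores({
        "gsm8k_chat": 0.61, "mmlu_pro-category_math": 0.47,
        "ifeval": 0.72, "bbh-split_20-snarks": 0.55,
    }))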

Results

Summarized numerical results (by skill and by benchmark) and per-supplier breakdowns are available on the public LLM Leaderboard. Click on any supplier row to view detailed benchmark- and skill-level results.

Reproducibility

To reproduce our results, please refer to our README.md file. In summary, you need to clone the repository with:

git clone https://github.com/pnyxai/pocket-ml-testbench.git
cd pocket-ml-testbench/tilt

and, once the environment variables are defined (see README.md), run the following command:

tilt up

Note: results may vary slightly between runs due to randomness in sampling.

Contact and contributions

If you find issues, missing citations, or want to contribute, please open an issue or a pull request in the pocket-ml-testbench repository.