A‑VERT: How to Judge an AI By Its Answers, Not Its Code
A‑VERT offers a new way to judge AI by meaning, not mechanics—evaluating answers directly to ensure fair, transparent, and human‑aligned comparison across diverse models.
In any open network where AI models are deployed—whether centralized or decentralized—the challenge isn’t just finding the “best” model, but being able to evaluate all of them in a reliable way.
A‑VERT (Agnostic Verification with Embedding Ranking Targets)1 is a new method for evaluating AI responses that doesn’t care about the model’s architecture, internals, or how it was trained. Instead, A‑VERT focuses entirely on the meaning of the answer, making it ideal for real-world deployments where models can be as diverse as the suppliers running them. It is open-source2 and available via pip3.
Why judging answers is the real problem
In real-world AI systems, especially in open or decentralized networks, two endpoints might claim to run the same model, but differ in size, fine-tuning, quantization, or even the way answers are generated. From the outside, these differences are invisible—all a client sees is the text going in and out. This means quality can only be measured by the answers themselves, not by inspecting the model or its internals.
Traditional scoring methods struggle here:
- Exact match only works for short, precise answers and penalizes even small wording changes.
- String overlap metrics can be fooled by answers that look similar but have very different meanings.
- Log-probability scoring requires access to internal model probabilities, which are often unavailable in real-world APIs.
These methods work well in controlled lab environments but fall short when dealing with long, conversational, or free-form answers from a variety of models. A-VERT is designed to solve this by focusing on what matters: whether the answer is right, regardless of how it’s expressed.
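To make those failure modes concrete, here is a toy illustration (ours, not from the A-VERT paper): exact match rejects a correct free-form answer outright, and character-level overlap barely separates it from a plausible wrong one.

```python
# Toy illustration: why exact match and surface overlap break down
# on free-form answers (not part of A-VERT itself).
from difflib import SequenceMatcher

reference = "Paris"
correct_answer = "Sure! The capital of France is Paris."
wrong_answer = "Sure! The capital of France is Lyon."

# Exact match rejects the correct answer because of the extra wording.
print(correct_answer == reference)  # False

# Character-level overlap gives low, similar scores to both answers,
# so it cannot reliably tell right from wrong.
print(SequenceMatcher(None, reference, correct_answer).ratio())
print(SequenceMatcher(None, reference, wrong_answer).ratio())
```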
From one perfect answer to groups of meanings
A-VERT takes inspiration from how humans judge answers. Instead of requiring a single “perfect” response, it looks for clusters of answers that express the same correct or incorrect idea. For each question, A-VERT can create:
- A correct group of phrasings that all mean the right thing.
- A wrong group of plausible but incorrect alternatives.
When a model answers, A-VERT uses a compact semantic model (an embedding or reranker) to see how close the answer is to each group. It picks the best match from each group and decides: is the answer more like the correct group or the wrong group? There’s no arbitrary threshold—just a direct comparison of meaning.
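As a rough sketch of that comparison, the snippet below embeds the answer and both reference groups with a small off-the-shelf embedding model, takes the best match from each group, and keeps whichever side is closer. The specific model and the example groups are our illustrative choices, not necessarily what the paper uses.

```python
# Minimal sketch of the group-comparison idea behind A-VERT.
# The embedding model and example data are illustrative only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, practical encoder

correct_group = [
    "The capital of France is Paris.",
    "Paris is France's capital city.",
]
wrong_group = [
    "The capital of France is Lyon.",
    "France's capital is Marseille.",
]

def judge(answer: str) -> bool:
    """Return True if the answer sits closer to the correct group."""
    ans_emb = model.encode(answer, convert_to_tensor=True)
    correct_emb = model.encode(correct_group, convert_to_tensor=True)
    wrong_emb = model.encode(wrong_group, convert_to_tensor=True)
    # Best match within each group, then a direct comparison:
    # no arbitrary threshold involved.
    best_correct = util.cos_sim(ans_emb, correct_emb).max().item()
    best_wrong = util.cos_sim(ans_emb, wrong_emb).max().item()
    return best_correct > best_wrong

print(judge("Sure! After checking, the answer is Paris."))  # expected: True
```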
How A-VERT reads meaning
A-VERT uses small, efficient models to capture the meaning of answers and compare them to reference groups:
- Embedding models convert text into numerical vectors, so similar meanings are close together in space.
- Reranker models directly compare pairs of texts and score how well they match.
A-VERT can work with either type and uses models small enough to be practical for large-scale evaluation, unlike giant “LLM-as-a-judge” setups. This makes it perfect for real-world deployments, where efficiency and cost matter.
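The reranker variant looks much the same in practice. This sketch uses a compact off-the-shelf cross-encoder (again an illustrative choice, not the paper’s specific model) to score (answer, reference) pairs directly instead of comparing embeddings.

```python
# The same group comparison with a reranker (cross-encoder)
# instead of an embedding model. Model choice is illustrative.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

answer = "Sure! After checking, the answer is Paris."
correct_group = ["The capital of France is Paris."]
wrong_group = ["The capital of France is Lyon."]

# A reranker scores (answer, reference) pairs directly.
best_correct = max(reranker.predict([(answer, ref) for ref in correct_group]))
best_wrong = max(reranker.predict([(answer, ref) for ref in wrong_group]))

print("correct" if best_correct > best_wrong else "wrong")
```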

Putting A-VERT to work in real benchmarks
A-VERT was tested on three major benchmarks—bAbI, MMLU, and MMLU-Pro—using a mix of models (Llama 3.3, Qwen3, and GPT-OSS). In the A-VERT setting, the models were allowed to answer freely in chat mode, including reasoning traces and tool use. In each setting, the same prompts were evaluated by:
- Human annotators.
- Traditional scoring methods.
- A-VERT using different semantic models.
This allowed a direct comparison of how well A-VERT matches human judgment when assessing a model’s ability to answer a given prompt.
How close is A-VERT to human judgment?
When using the best semantic models, A-VERT achieves:
- A balanced accuracy of about 0.96, meaning it’s reliable for both correct and incorrect answers.
- An F1 score near 0.99, with very few mistakes.
- A correlation (R²) of about 0.97 with human scores across all tasks.
These results were achieved while:
- Allowing models to answer freely, without forcing short or rigid formats.
- Treating models as pure black boxes, only seeing the text output.
- Using small, practical semantic models suitable for large-scale use.
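For readers who want to run the same kind of agreement analysis on their own data, the snippet below shows how these metrics can be computed with scikit-learn from per-answer human labels and A-VERT decisions. The numbers are toy values, not data from the paper.

```python
# Computing the agreement metrics from human labels and A-VERT
# decisions for the same answers (toy data, illustrative only).
from sklearn.metrics import balanced_accuracy_score, f1_score, r2_score

human  = [1, 1, 0, 1, 0, 0, 1, 1]   # human judgment per answer (1 = correct)
a_vert = [1, 1, 0, 1, 0, 0, 1, 0]   # A-VERT decision for the same answers

print(balanced_accuracy_score(human, a_vert))
print(f1_score(human, a_vert))

# R² is computed on per-task scores (fraction of answers judged correct).
human_task_scores  = [0.82, 0.54, 0.91]
a_vert_task_scores = [0.80, 0.57, 0.90]
print(r2_score(human_task_scores, a_vert_task_scores))
```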

What this unlocks for real-world AI
A-VERT’s agnostic, outcome-based approach means it can be used anywhere—whether in a decentralized marketplace, a centralized cloud, or any setting where models are treated as black boxes. It doesn’t matter if the model is open-source, proprietary, or running on a unique architecture: A-VERT evaluates answers based on their meaning, not their origin. This makes it a powerful tool for ensuring quality and fairness in real-world AI deployments, where transparency and trust are more important than ever.
By focusing on what answers mean, rather than how they’re produced, A-VERT helps turn diverse, anonymous AI endpoints into reliable, measurable actors—making open, fair AI evaluation not just possible, but practical.
A-VERT is not just a research idea—it’s already powering real-world evaluation. Our live, continuously running benchmark4 uses A-VERT to measure the performance of AI models as they serve requests in real time. This benchmark feeds directly into our public leaderboard, making it possible for anyone to see which models are delivering the best answers, regardless of their architecture or provenance. By using A-VERT, we ensure that the leaderboard reflects actual, observable quality, giving users confidence in the models they choose and enabling suppliers to compete on the merit of their answers, not just their branding.