Feedbacker

Zongqi Wang, Tianle Gu, Chen Gong, Xin Tian, Siqi Bao, Yujiu Yang
Tsinghua University, Baidu Inc
Illustration of Feedbacker

The framework of Feedbacker. Feedbacker consists of four key components: a tree-structured query taxonomy builder, a query synthesis scheme, a pre-comparison-derived criteria pointwise evaluation method, and a set of visualization and analysis toolkits.

Abstract

Automatic evaluation benchmarks such as MT-Bench, Arena-Hard, and Auto-Arena are seeing growing adoption for the evaluation of Large Language Models (LLMs). Existing research has primarily focused on approximating human-based model rankings using limited data and LLM-as-a-Judge. However, the fundamental premise of these studies, namely replicating human rankings, is flawed. Specifically, these benchmarks typically offer only overall scores, limiting their utility to leaderboard rankings rather than providing feedback that can guide model optimization and support model profiling. Therefore, we advocate for an evaluation paradigm shift from approximating human-based model rankings to providing feedback with analytical value. To this end, we introduce Feedbacker, an evaluation framework that provides comprehensive and fine-grained results, thereby enabling thorough identification of a model’s specific strengths and weaknesses. Such feedback not only supports the targeted optimization of the model but also deepens the understanding of its behavior. Feedbacker comprises three key components: an extensible tree-based query taxonomy builder, an automated query synthesis scheme, and a suite of visualization and analysis tools. Furthermore, we propose a novel LLM-as-a-Judge method: PC² (Pre-Comparison-derived Criteria) pointwise evaluation. This method derives evaluation criteria by pre-comparing the differences between several auxiliary responses, achieving the accuracy of pairwise evaluation while maintaining the time complexity of pointwise evaluation. Finally, leveraging the evaluation results of 17 mainstream LLMs, we demonstrate the usage of Feedbacker and highlight its effectiveness and potential.

Taxonomy and Dataset. Existing datasets lack a comprehensive taxonomy as a foundation. To conduct a comprehensive and fine-grained evaluation, we have designed an automatic taxonomy building method (TaxBuilder) and an automatic query synthesis approach (RealMix).
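To make the tree-based taxonomy concrete, the sketch below illustrates the kind of extensible node structure such a builder could maintain and the fine-grained leaf categories that query synthesis would then populate. It is a minimal sketch only; the class and method names (TaxonomyNode, insert, leaves) are illustrative assumptions, not TaxBuilder's actual interface.

from dataclasses import dataclass, field

@dataclass
class TaxonomyNode:
    """One node in a tree-structured query taxonomy (illustrative sketch)."""
    name: str
    children: dict[str, "TaxonomyNode"] = field(default_factory=dict)

    def insert(self, path: list[str]) -> None:
        """Insert a category path (e.g. ["coding", "debugging"]) under this node."""
        if not path:
            return
        head, *rest = path
        child = self.children.setdefault(head, TaxonomyNode(head))
        child.insert(rest)

    def leaves(self, prefix: tuple[str, ...] = ()) -> list[tuple[str, ...]]:
        """Return all root-to-leaf paths; each leaf is a fine-grained query category."""
        if not self.children:
            return [prefix + (self.name,)]
        paths = []
        for child in self.children.values():
            paths.extend(child.leaves(prefix + (self.name,)))
        return paths

root = TaxonomyNode("root")
root.insert(["writing", "creative", "poetry"])
root.insert(["coding", "debugging"])
print(root.leaves())  # fine-grained categories that query synthesis can target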

Evaluation Method. Existing evaluation methods are either inaccurate (pointwise evaluation) or inefficient (pairwise evaluation). Our evaluation method, PC² (pre-comparison-derived criteria) pointwise evaluation, first pre-compares multiple auxiliary LLM responses to derive more effective, query-specific evaluation criteria, and then conducts pointwise evaluation against these criteria (see the sketch below). With this approach, our method achieves higher accuracy than pairwise evaluation while maintaining efficiency comparable to pointwise evaluation.
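The following sketch shows one way the pre-comparison step and the subsequent pointwise scoring could be wired together. All function names, prompts, and the judge callable are hypothetical placeholders for illustration, not Feedbacker's actual implementation.

# Hedged sketch of pre-comparison-derived criteria (PC^2) pointwise evaluation.
# `judge` stands in for any LLM-as-a-Judge call mapping a prompt string to a reply;
# every name and prompt here is an assumption for illustration.

def derive_criteria(query: str, auxiliary_responses: list[str], judge) -> list[str]:
    """Pre-compare several auxiliary responses once to extract query-specific criteria."""
    prompt = (
        f"Question: {query}\n\n"
        + "\n\n".join(f"Response {i + 1}:\n{r}" for i, r in enumerate(auxiliary_responses))
        + "\n\nCompare these responses and list the key criteria that distinguish "
          "better answers from worse ones, one per line."
    )
    return [line.strip("- ").strip() for line in judge(prompt).splitlines() if line.strip()]

def pointwise_score(query: str, response: str, criteria: list[str], judge) -> float:
    """Score a single response against the pre-derived criteria (one judge call per model)."""
    prompt = (
        f"Question: {query}\n\nResponse:\n{response}\n\n"
        "Rate the response from 1 to 10 according to these criteria:\n"
        + "\n".join(f"- {c}" for c in criteria)
        + "\nReply with a single number."
    )
    return float(judge(prompt).strip())

Because the criteria are derived once per query, evaluating N models takes on the order of N judge calls (the cost of pointwise evaluation) rather than the roughly N² calls required for exhaustive pairwise comparison.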

Leaderboard

Below are the comprehensive evaluation results of 17 mainstream LLMs obtained with our Feedbacker framework. Note that these results are aggregated at the first level of the taxonomy; Feedbacker also provides much finer-grained results. Please refer to the Visualization & Analysis section for more details.

Main results of Feedbacker

BibTeX

@article{wang2025fromrankings,
      title={From Rankings to Insights: Evaluation Should Shift Focus from Leaderboard to Feedback},
      author={Wang, Zongqi and Gu, Tianle and Gong, Chen and Tian, Xin and Bao, Siqi and Yang, Yujiu},
      journal={arXiv preprint arXiv:2505.06698},
      year={2025},
}