Feedbacker

Zongqi Wang, Tianle Gu, Chen Gong, Xin Tian, Siqi Bao, Yujiu Yang
Tsinghua University, Baidu Inc
Illustration of Feedbacker

The framework of Feedbacker. Feedbacker consists of four key components: a tree-structured query taxonomy builder, a query synthesis scheme, a pre-comparison-derived criteria pointwise evaluation method, and a set of visualization and analysis toolkits.

Abstract

Automatic evaluation benchmarks such as MT-Bench, Arena-Hard, and Auto-Arena are seeing growing adoption for the evaluation of Large Language Models (LLMs). Existing research has primarily focused on approximating human-based model rankings using limited data and LLM-as-a-Judge. However, the fundamental premise of these studies, namely replicating human rankings, is flawed. Specifically, these benchmarks typically offer only overall scores, limiting their utility to leaderboard rankings rather than providing feedback that can guide model optimization and support model profiling. Therefore, we advocate for an evaluation paradigm shift from approximating human-based model rankings to providing feedback with analytical value. To this end, we introduce Feedbacker, an evaluation framework that provides comprehensive and fine-grained results, thereby enabling thorough identification of a model’s specific strengths and weaknesses. Such feedback not only supports the targeted optimization of the model but also deepens the understanding of its behavior. Feedbacker comprises three key components: an extensible tree-based query taxonomy builder, an automated query synthesis scheme, and a suite of visualization and analysis tools. Furthermore, we propose a novel LLM-as-a-Judge method: PC² (Pre-Comparison-derived Criteria) pointwise evaluation. This method derives evaluation criteria by pre-comparing the differences between several auxiliary responses, achieving the accuracy of pairwise evaluation while maintaining the time complexity of pointwise evaluation. Finally, leveraging the evaluation results of 17 mainstream LLMs, we demonstrate the usage of Feedbacker and highlight its effectiveness and potential.

Taxonomy and Dataset. Existing datasets lack a comprehensive taxonomy as a foundation. To conduct a comprehensive and fine-grained evaluation, we have designed an automatic taxonomy building method (TaxBuilder) and an automatic query synthesis approach (RealMix).
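To make the tree-based taxonomy concrete, the sketch below illustrates the kind of extensible node structure such a builder could maintain and the fine-grained leaf categories that query synthesis would then populate. It is a minimal sketch only; the class and method names (TaxonomyNode, insert, leaves) are illustrative assumptions, not TaxBuilder's actual interface.

from dataclasses import dataclass, field

@dataclass
class TaxonomyNode:
    """One node in a tree-structured query taxonomy (illustrative sketch)."""
    name: str
    children: dict[str, "TaxonomyNode"] = field(default_factory=dict)

    def insert(self, path: list[str]) -> None:
        """Insert a category path (e.g. ["coding", "debugging"]) under this node."""
        if not path:
            return
        head, *rest = path
        child = self.children.setdefault(head, TaxonomyNode(head))
        child.insert(rest)

    def leaves(self, prefix: tuple[str, ...] = ()) -> list[tuple[str, ...]]:
        """Return all root-to-leaf paths; each leaf is a fine-grained query category."""
        if not self.children:
            return [prefix + (self.name,)]
        paths = []
        for child in self.children.values():
            paths.extend(child.leaves(prefix + (self.name,)))
        return paths

root = TaxonomyNode("root")
root.insert(["writing", "creative", "poetry"])
root.insert(["coding", "debugging"])
print(root.leaves())  # fine-grained categories that query synthesis can target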

Evaluation Method. Existing evaluation methods are either inaccurate (pointwise evaluation) or inefficient (pairwise evaluation). Our evaluation method, PC² (pre-comparison-derived criteria) pointwise evaluation, first pre-compares multiple auxiliary LLM responses to derive more effective, query-specific evaluation criteria, and then conducts pointwise evaluation against these criteria (see the sketch below). With this approach, our method achieves higher accuracy than pairwise evaluation while maintaining efficiency comparable to pointwise evaluation.
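The following sketch shows one way the pre-comparison step and the subsequent pointwise scoring could be wired together. All function names, prompts, and the judge callable are hypothetical placeholders for illustration, not Feedbacker's actual implementation.

# Hedged sketch of pre-comparison-derived criteria (PC^2) pointwise evaluation.
# `judge` stands in for any LLM-as-a-Judge call mapping a prompt string to a reply;
# every name and prompt here is an assumption for illustration.

def derive_criteria(query: str, auxiliary_responses: list[str], judge) -> list[str]:
    """Pre-compare several auxiliary responses once to extract query-specific criteria."""
    prompt = (
        f"Question: {query}\n\n"
        + "\n\n".join(f"Response {i + 1}:\n{r}" for i, r in enumerate(auxiliary_responses))
        + "\n\nCompare these responses and list the key criteria that distinguish "
          "better answers from worse ones, one per line."
    )
    return [line.strip("- ").strip() for line in judge(prompt).splitlines() if line.strip()]

def pointwise_score(query: str, response: str, criteria: list[str], judge) -> float:
    """Score a single response against the pre-derived criteria (one judge call per model)."""
    prompt = (
        f"Question: {query}\n\nResponse:\n{response}\n\n"
        "Rate the response from 1 to 10 according to these criteria:\n"
        + "\n".join(f"- {c}" for c in criteria)
        + "\nReply with a single number."
    )
    return float(judge(prompt).strip())

Because the criteria are derived once per query, evaluating N models takes on the order of N judge calls (the cost of pointwise evaluation) rather than the roughly N² calls required for exhaustive pairwise comparison.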

Leaderboard

Below are the comprehensive evaluation results of 17 mainstream LLMs obtained with our Feedbacker framework. Note that these results are aggregated at the first level of the taxonomy; Feedbacker also provides much finer-grained results. Please refer to the Visualization & Analysis section for more details.

Main results of Feedbacker

BibTeX

@article{wang2025fromrankings,
      title={From Rankings to Insights: Evaluation Should Shift Focus from Leaderboard to Feedback},
      author={Wang, Zongqi and Gu, Tianle and Gong, Chen and Tian, Xin and Bao, Siqi and Yang, Yujiu},
      journal={arXiv preprint arXiv:2505.06698},
      year={2025},
}