Evaluating TnT-LLM Text Classification: Human Agreement and Scalable LLM Metrics

Table of Links
Abstract and 1 Introduction
2 Related Work
3 Method and 3.1 Phase 1: Taxonomy Generation
3.2 Phase 2: LLM-Augmented Text Classification
4 Evaluation Suite and 4.1 Phase 1 Evaluation Strategies
4.2 Phase 2 Evaluation Strategies
5 Experiments and 5.1 Data
5.2 Taxonomy Generation
5.3 LLM-Augmented Text Classification
5.4 Summary of Findings and Suggestions
6 Discussion and Future Work, and References
A. Taxonomies
B. Additional Results
C. Implementation Details
D. Prompt Templates
4.2 Phase 2 Evaluation Strategies
To quantitatively evaluate text classification, we create a benchmark dataset with reliable ground-truth annotations as follows:
Task and Annotation Reliability. We first assess the reliability of the label assignment task and of the human annotations by involving multiple human annotators and calculating interrater agreement (Cohen's Kappa [6] between two raters and Fleiss' Kappa [7] among multiple raters). We then resolve disagreements between human annotations through voting or deliberation, obtaining a consensus human annotation for each instance. Next, we use an LLM as an additional annotator to perform the same label assignment task and measure the agreement between the LLM annotation and the consensus human label. Intuitively, this agreement captures how well the LLM is aligned with (the majority of) human annotators and how reliable it is for this label assignment task.
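As an illustration of these reliability checks, the sketch below computes pairwise agreement (Cohen's Kappa), multi-rater agreement (Fleiss' Kappa), a majority-vote consensus label, and LLM-versus-consensus agreement. The label arrays are hypothetical placeholders, and the scikit-learn/statsmodels calls are one possible way to implement the checks, not the paper's actual pipeline.

```python
# Minimal sketch of the reliability checks, with hypothetical label data.
from collections import Counter

import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical label assignments: rows are instances, columns are human raters.
human_labels = np.array([
    ["billing", "billing", "refund"],
    ["refund",  "refund",  "refund"],
    ["account", "billing", "account"],
    ["billing", "billing", "billing"],
])
# Hypothetical labels produced by the LLM annotator for the same instances.
llm_labels = np.array(["billing", "refund", "account", "billing"])

# Pairwise agreement between two raters (Cohen's Kappa).
kappa_pair = cohen_kappa_score(human_labels[:, 0], human_labels[:, 1])

# Agreement among all raters (Fleiss' Kappa over an instance-by-category count table).
counts, _ = aggregate_raters(human_labels)
kappa_all = fleiss_kappa(counts, method="fleiss")

# Resolve disagreements by majority vote to obtain a consensus label per instance.
consensus = np.array([Counter(row).most_common(1)[0][0] for row in human_labels])

# Agreement between the LLM annotator and the consensus human label.
kappa_llm = cohen_kappa_score(consensus, llm_labels)

print(f"Cohen's kappa (rater 0 vs rater 1): {kappa_pair:.2f}")
print(f"Fleiss' kappa (all raters):         {kappa_all:.2f}")
print(f"Cohen's kappa (LLM vs consensus):   {kappa_llm:.2f}")
```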
Classification Metrics. We apply both human and LLM annotators to a small-scale corpus sample and calculate conventional multiclass and multilabel classification metrics (e.g., Accuracy, F1) with the human annotations as ground truth. These metrics evaluate how well the label classifier aligns with human preferences on a small subset of the corpus. We then apply the LLM annotator to a larger-scale corpus sample and use the resulting annotations as the oracle for the same classification metrics. These metrics enable a comprehensive diagnosis of label classifier performance at scale across different facets of the corpus, such as domains, languages, and time ranges.
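The sketch below illustrates this two-stage metric computation with hypothetical predictions and annotations: Accuracy and macro-F1 against the human consensus on a small sample, and the same metrics against LLM annotations used as the oracle on a larger sample, sliced by one corpus facet (a made-up language field) for diagnosis at scale.

```python
# Minimal sketch of the two-stage classification metrics, with hypothetical data.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Small-scale sample: human consensus labels serve as ground truth.
human_truth = np.array(["billing", "refund", "account", "billing"])
clf_preds_small = np.array(["billing", "refund", "billing", "billing"])

acc_human = accuracy_score(human_truth, clf_preds_small)
f1_human = f1_score(human_truth, clf_preds_small, average="macro")
print(f"vs. human consensus: accuracy={acc_human:.2f}, macro-F1={f1_human:.2f}")

# Larger-scale sample: LLM annotations serve as the oracle, which allows
# slicing the diagnosis by corpus facets such as language (hypothetical field).
llm_oracle = np.array(["billing", "refund", "refund", "account", "billing", "refund"])
clf_preds_large = np.array(["billing", "billing", "refund", "account", "billing", "refund"])
languages = np.array(["en", "en", "de", "de", "en", "de"])

for lang in np.unique(languages):
    mask = languages == lang
    acc = accuracy_score(llm_oracle[mask], clf_preds_large[mask])
    f1 = f1_score(llm_oracle[mask], clf_preds_large[mask], average="macro")
    print(f"vs. LLM oracle [{lang}]: accuracy={acc:.2f}, macro-F1={f1:.2f}")
```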
In practice, we recommend combining human evaluation and LLM-based metrics into a holistic evaluation suite, while also accounting for task and annotation reliability. This approach helps identify and mitigate biases that may arise from either method or from task complexity, and allows us to scale the evaluation and annotation to a large corpus sample with confidence, yielding more robust and informative evaluation results.
Authors:
(1) Mengting Wan, Microsoft Corporation;
(2) Tara Safavi (Corresponding author), Microsoft Corporation;
(3) Sujay Kumar Jauhar, Microsoft Corporation;
(4) Yujin Kim, Microsoft Corporation;
(5) Scott Counts, Microsoft Corporation;
(6) Jennifer Neville, Microsoft Corporation;
(7) Siddharth Suri, Microsoft Corporation;
(8) Chirag Shah, University of Washington (work done while at Microsoft);
(9) Ryen W. White, Microsoft Corporation;
(10) Longqi Yang, Microsoft Corporation;
(11) Reid Andersen, Microsoft Corporation;
(12) Georg Buscher, Microsoft Corporation;
(13) Dhruv Joshi, Microsoft Corporation;
(14) Nagu Rangan, Microsoft Corporation.