Additional Results: Cross-Lingual Taxonomy Evaluation and In-Depth Classification Analysis
Table of Links
Abstract and 1 Introduction
2 Related Work
3 Method and 3.1 Phase 1: Taxonomy Generation
3.2 Phase 2: LLM-Augmented Text Classification
4 Evaluation Suite and 4.1 Phase 1 Evaluation Strategies
4.2 Phase 2 Evaluation Strategies
5 Experiments and 5.1 Data
5.2 Taxonomy Generation
5.3 LLM-Augmented Text Classification
5.4 Summary of Findings and Suggestions
6 Discussion and Future Work, and References
A. Taxonomies
B. Additional Results
C. Implementation Details
D. Prompt Templates
B ADDITIONAL RESULTS
We present additional results from the experiments conducted for the taxonomy generation phase and the label assignment phase.
B.1 Phase 1: Taxonomy Generation
In addition to the taxonomy evaluation results on BingChatPhase1-S-Eng, we also investigate how the label taxonomy produced by our proposed TnT-LLM framework performs across different languages. We present the label accuracy results from the GPT-4 rater in Figure 5, where we generally do not find significant differences between its performance on English conversations and on non-English conversations.
B.2 Phase 2: Label Assignment
B.2.1 Annotation Agreement Analysis. We conduct an in-depth investigation of the agreement among human annotators and the LLM annotator on the label assignment task. The agreement results between different pairs of human annotators are presented in Figure 6. The confusion matrix between the GPT-4 annotations and the (resolved) human annotations for the primary label on the BingChatPhase2-S-Eng dataset is provided in Figure 7. For user intent, most disagreements occur at two boundaries: between “Fact-based information seeking” and “Clarification and concept explanation”, and between “General solution and advice seeking” and “Technical assistance and problem solving”. This suggests that human annotators and the GPT-4 annotator judge differently how “technical” a user query is or how much elaboration it requires. Note that all our human annotators have high technical expertise, which may lead them to apply different implicit standards than the general population, resulting in potentially biased annotations. We observe similar patterns in the domain label assignment task, where “General digital support” and “Software development and hardware issues” are often confused, and the GPT-4 annotator has a high false positive rate on “Software development and hardware issues” when human annotations are treated as the oracle. We argue that this kind of analysis can help identify and reduce potential biases in both human and LLM annotations, and thus improve both the clarity of the label descriptions in the taxonomy and the consistency of label annotations.
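The analysis described above can be illustrated with a minimal sketch. It assumes Cohen's kappa as the pairwise agreement statistic and treats the human labels as the reference when computing a per-label false positive rate. The function names, the shortened label strings, and the toy data are all hypothetical and are not taken from the paper's implementation or results.

```python
from collections import Counter

def confusion_counts(reference, predicted):
    """Count how often each (reference label, predicted label) pair occurs."""
    return Counter(zip(reference, predicted))

def cohen_kappa(a, b):
    """Chance-corrected agreement between two annotators (Cohen's kappa)."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Agreement expected if both annotators labeled independently
    # according to their own label frequencies.
    expected = sum(freq_a[l] * freq_b[l] for l in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)

def false_positive_rate(reference, predicted, label):
    """Share of items NOT carrying `label` in the reference that were predicted as `label`."""
    negatives = sum(r != label for r in reference)
    false_pos = sum(r != label and p == label for r, p in zip(reference, predicted))
    return false_pos / negatives if negatives else 0.0

# Toy annotations (hypothetical, for illustration only).
human = ["general", "general", "software", "general", "software", "other"]
llm = ["software", "general", "software", "software", "software", "other"]

matrix = confusion_counts(human, llm)
kappa = cohen_kappa(human, llm)
fpr_software = false_positive_rate(human, llm, "software")
```

In this toy data, the off-diagonal cell `matrix[("general", "software")]` is the largest source of disagreement, mirroring the kind of boundary confusion discussed above. Inspecting such cells label by label is one way to decide which label descriptions may need to be clarified.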
B.2.2 Full Classification Results. We present the full multiclass classification results for predicting the primary label of a conversation in Figure 11, the full multilabel classification results for predicting all applicable labels in Figure 12, and the by-language classification results in Figure 13. We confirm that the conclusions in Section 5.3 still hold.
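To make the distinction between the two settings concrete, the sketch below scores a multiclass prediction (one primary label per conversation) with accuracy and a multilabel prediction (a set of applicable labels per conversation) with micro-averaged precision, recall, and F1. These metric choices are assumptions made for illustration; the metrics actually reported are those shown in the figures. All names and data are hypothetical.

```python
def primary_accuracy(gold, pred):
    """Multiclass setting: fraction of conversations whose primary label matches."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def micro_prf(gold_sets, pred_sets):
    """Multilabel setting: micro-averaged precision, recall, and F1 over label sets."""
    tp = sum(len(g & p) for g, p in zip(gold_sets, pred_sets))  # labels in both sets
    fp = sum(len(p - g) for g, p in zip(gold_sets, pred_sets))  # predicted but not gold
    fn = sum(len(g - p) for g, p in zip(gold_sets, pred_sets))  # gold but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy data (hypothetical): three conversations with placeholder labels "a", "b", "c".
gold_primary = ["a", "c", "a"]
pred_primary = ["a", "b", "a"]
gold_all = [{"a", "b"}, {"c"}, {"a"}]
pred_all = [{"a"}, {"c", "b"}, {"a"}]

accuracy = primary_accuracy(gold_primary, pred_primary)
precision, recall, f1 = micro_prf(gold_all, pred_all)
```

A by-language breakdown, as in Figure 13, would apply the same functions to the subset of conversations in each language.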
Authors:
(1) Mengting Wan, Microsoft Corporation;
(2) Tara Safavi (Corresponding authors), Microsoft Corporation;
(3) Sujay Kumar Jauhar, Microsoft Corporation;
(4) Yujin Kim, Microsoft Corporation;
(5) Scott Counts, Microsoft Corporation;
(6) Jennifer Neville, Microsoft Corporation;
(7) Siddharth Suri, Microsoft Corporation;
(8) Chirag Shah, University of Washington (work done while working at Microsoft);
(9) Ryen W. White, Microsoft Corporation;
(10) Longqi Yang, Microsoft Corporation;
(11) Reid Andersen, Microsoft Corporation;
(12) Georg Buscher, Microsoft Corporation;
(13) Dhruv Joshi, Microsoft Corporation;
(14) Nagu Rangan, Microsoft Corporation.