Additional Results: Cross-Lingual Taxonomy Evaluation and In-Depth Classification Analysis

mrarup82, April 22, 2025

Table of Links

Abstract and 1 Introduction

2 Related Work

3 Method and 3.1 Phase 1: Taxonomy Generation

3.2 Phase 2: LLM-Augmented Text Classification

4 Evaluation Suite and 4.1 Phase 1 Evaluation Strategies

4.2 Phase 2 Evaluation Strategies

5 Experiments and 5.1 Data

5.2 Taxonomy Generation

5.3 LLM-Augmented Text Classification

5.4 Summary of Findings and Suggestions

6 Discussion and Future Work, and References

A. Taxonomies

B. Additional Results

C. Implementation Details

D. Prompt Templates

B ADDITIONAL RESULTS

We present additional results from the experiments conducted for the taxonomy generation phase and the label assignment phase.

B.1 Phase 1: Taxonomy Generation

In addition to the taxonomy evaluation results on BingChat-Phase1-S-Eng, we investigate how the label taxonomies produced by our proposed TnT-LLM framework perform across different languages. Figure 5 presents the label accuracy results from the GPT-4 rater; we generally do not find significant differences between performance on English conversations and non-English conversations.

Figure 5: Taxonomy evaluation results by language on multilingual conversations (BingChat-Phase1-L-Multi) from the GPT-4 rater.
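As a rough illustration of the per-language breakdown behind a plot like Figure 5, the sketch below groups rater judgments by conversation language and computes label accuracy for each group. The function name `accuracy_by_language`, the `(language, is_accurate)` record layout, and the toy ratings are assumptions made for this example; they are not taken from the paper's pipeline or data.

```python
from collections import defaultdict


def accuracy_by_language(ratings):
    """Return {language: fraction of labels the rater judged accurate}.

    `ratings` is an iterable of (language, is_accurate) pairs. This record
    layout is hypothetical, chosen only for illustration.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for language, is_accurate in ratings:
        totals[language] += 1
        correct[language] += int(is_accurate)
    return {lang: correct[lang] / totals[lang] for lang in totals}


# Toy, made-up judgments (not results from the paper).
ratings = [
    ("en", True), ("en", True), ("en", False),
    ("es", True), ("es", False),
    ("ja", True),
]
per_language = accuracy_by_language(ratings)
```

With real data, small per-language groups would produce noisy estimates, so any comparison between English and non-English conversations should also report group sizes or confidence intervals.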

B.2 Phase 2: Label Assignment

B.2.1 Annotation Agreement Analysis. We conduct an in-depth investigation of the agreement among human annotators and the LLM annotator on the label assignment task. The agreement results between different pairs of human annotators are presented in Figure 6. The confusion matrix between the GPT-4 annotations and the (resolved) human annotations for the primary label on the BingChat-Phase2-S-Eng dataset is provided in Figure 7. For user intent, most disagreements occur at two boundaries: between “Fact-based information seeking” and “Clarification and concept explanation”, and between “General solution and advice seeking” and “Technical assistance and problem solving”. This suggests that human annotators and the GPT-4 annotator judge differently how “technical” a user query is or how much elaboration it requires. Note that all our human annotators have high technical expertise, which may lead them to apply different implicit standards than the general population, resulting in potentially biased annotations. We observe similar patterns in the domain label assignment task, where “General digital support” and “Software development and hardware issues” are often confused, and the GPT-4 annotator has a high false positive rate on “Software development and hardware issues” if the human annotations are treated as the oracle. We argue that this kind of analysis can help identify and reduce potential biases in both human and LLM annotations, and thus improve the clarity of the label descriptions in the taxonomy and the consistency of label annotations.
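The confusion analysis described above can be sketched in a few lines. The example below counts (human label, LLM label) pairs and computes a per-label false positive rate with the human annotations treated as the oracle. The helper names `confusion_counts` and `false_positive_rate` and the five toy annotations are hypothetical; only the two domain label names come from the text.

```python
from collections import Counter

GDS = "General digital support"
SDH = "Software development and hardware issues"


def confusion_counts(human_labels, llm_labels):
    """Count (human, llm) label pairs; human labels play the oracle role."""
    if len(human_labels) != len(llm_labels):
        raise ValueError("label sequences must have the same length")
    return Counter(zip(human_labels, llm_labels))


def false_positive_rate(human_labels, llm_labels, label):
    """Share of items NOT labeled `label` by humans that the LLM labeled `label`."""
    pairs = list(zip(human_labels, llm_labels))
    negatives = sum(1 for human, _ in pairs if human != label)
    false_positives = sum(
        1 for human, llm in pairs if human != label and llm == label
    )
    return false_positives / negatives if negatives else 0.0


# Toy, made-up annotations (not data from the paper).
human = [GDS, GDS, SDH, GDS, SDH]
llm = [SDH, GDS, SDH, SDH, SDH]

matrix = confusion_counts(human, llm)
fpr_sdh = false_positive_rate(human, llm, SDH)
```

Large off-diagonal cells such as `matrix[(GDS, SDH)]` point to label pairs whose descriptions may need sharpening in the taxonomy.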

B.2.2 Full Classification Results. We present the full multiclass classification results for predicting the primary label of a conversation in Figure 11, the full multilabel classification results for predicting all applicable labels in Figure 12, and the by-language classification results in Figure 13. We confirm that the conclusions in Section 5.3 still hold.

Figure 6: Pairwise agreement (in Cohen’s Kappa) between human annotators on the label assignment task.
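For readers unfamiliar with the agreement statistic in Figure 6, the following is a minimal sketch of Cohen's Kappa for two annotators, applied to every pair of annotators. It uses the standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each annotator's label frequencies. The annotator names and toy labels are made up for illustration and do not reproduce the paper's numbers.

```python
from collections import Counter
from itertools import combinations


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa between two annotators' labels for the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        freq_a[k] * freq_b[k] for k in freq_a.keys() & freq_b.keys()
    ) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # undefined here, and this sketch reports perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


def pairwise_kappa(annotations):
    """Map each annotator pair to its Kappa; `annotations` is {name: labels}."""
    return {
        (p, q): cohens_kappa(annotations[p], annotations[q])
        for p, q in combinations(sorted(annotations), 2)
    }


# Toy, made-up annotations from three hypothetical annotators.
annotations = {
    "annotator_1": ["x", "x", "y", "y"],
    "annotator_2": ["x", "x", "y", "x"],
    "annotator_3": ["x", "x", "y", "y"],
}
agreement = pairwise_kappa(annotations)
```

Unlike raw percent agreement, Kappa discounts matches that would occur by chance, which matters when a few labels dominate the distribution.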

Authors:

(1) Mengting Wan, Microsoft Corporation;

(2) Tara Safavi (Corresponding authors), Microsoft Corporation;

(3) Sujay Kumar Jauhar, Microsoft Corporation;

(4) Yujin Kim, Microsoft Corporation;

(5) Scott Counts, Microsoft Corporation;

(6) Jennifer Neville, Microsoft Corporation;

(7) Siddharth Suri, Microsoft Corporation;

(8) Chirag Shah, University of Washington (work done while working at Microsoft);

(9) Ryen W. White, Microsoft Corporation;

(10) Longqi Yang, Microsoft Corporation;

(11) Reid Andersen, Microsoft Corporation;

(12) Georg Buscher, Microsoft Corporation;

(13) Dhruv Joshi, Microsoft Corporation;

(14) Nagu Rangan, Microsoft Corporation.
