
TnT-LLM for Automated Taxonomy Generation: Outperforming Clustering Baselines

Abstract and 1 Introduction

2 Related Work

3 Method and 3.1 Phase 1: Taxonomy Generation

3.2 Phase 2: LLM-Augmented Text Classification

4 Evaluation Suite and 4.1 Phase 1 Evaluation Strategies

4.2 Phase 2 Evaluation Strategies

5 Experiments and 5.1 Data

5.2 Taxonomy Generation

5.3 LLM-Augmented Text Classification

5.4 Summary of Findings and Suggestions

6 Discussion and Future Work, and References

A. Taxonomies

B. Additional Results

C. Implementation Details

D. Prompt Templates

5.2 Taxonomy Generation

5.2.1 Methods. To evaluate the effectiveness of TnT-LLM, we compare it with baseline methods that rely on embedding-based clustering to group conversations and then assign LLM-generated labels to each cluster. We use two state-of-the-art LLMs, GPT-4 (0613) and GPT-3.5-Turbo (0613), as label generators and evaluators, and two different embedding methods, ada2 [2] and Instructor-XL [26], to represent the conversations. The methods considered in our experiments are as follows:

• GPT-4 (TnT-LLM): the proposed TnT-LLM with GPT-4 to perform label taxonomy generation and assignment.

• GPT-3.5 (TnT-LLM): the proposed TnT-LLM with GPT-3.5-Turbo to perform label taxonomy generation and assignment.

• ada2 + GPT-4: the embedding-based clustering approach where conversations are represented via ada2 embeddings and the K-means algorithm is applied to generate clusters (a code sketch of this pipeline follows below). We randomly sample 200 conversations within each cluster, prompt GPT-4 to summarize each conversation, and then ask it to produce a label name and description from these summaries, conditioned on the use-case instruction.

• ada2 + GPT-3.5-Turbo: similar to the above method, with GPT-3.5-Turbo as the label generator.

• Instructor-XL + GPT-4: similar to the above embedding-based methods, with Instructor-XL and GPT-4 as the underlying embedding and the label generator, respectively.

• Instructor-XL + GPT-3.5-Turbo: similar to the above method, with GPT-3.5-Turbo as the label generator.

Note that all the taxonomies evaluated in this section are fully automatic and do not involve any human intervention.
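To make the baseline pipeline concrete, below is a minimal Python sketch of the embedding-and-cluster baselines described above (e.g., ada2 + GPT-4). The `embed_texts` and `call_llm` helpers are hypothetical stand-ins for the embedding API (ada2 or Instructor-XL) and the chat-completion call (GPT-4 or GPT-3.5-Turbo), and the prompt wording is illustrative rather than the exact prompt used in the paper.

```python
# Minimal sketch of the embedding-based clustering baselines.
# `embed_texts` and `call_llm` are hypothetical stand-ins, not the paper's code.
import random
import numpy as np
from sklearn.cluster import MiniBatchKMeans


def embedding_cluster_baseline(conversations, use_case_instruction,
                               embed_texts, call_llm,
                               n_clusters=10, batch_size=200, sample_size=200):
    # 1) Represent each conversation with a dense embedding (ada2 or Instructor-XL).
    X = np.asarray(embed_texts(conversations))

    # 2) Group conversations with minibatch K-means (K-means++ initialization).
    km = MiniBatchKMeans(n_clusters=n_clusters, batch_size=batch_size,
                         init="k-means++", n_init=10, random_state=0)
    cluster_ids = km.fit_predict(X)

    taxonomy = []
    for c in range(n_clusters):
        members = [conv for conv, cid in zip(conversations, cluster_ids) if cid == c]
        sample = random.sample(members, min(sample_size, len(members)))

        # 3) Summarize each sampled conversation with the LLM.
        summaries = [call_llm(f"Summarize this conversation:\n{conv}") for conv in sample]

        # 4) Ask the LLM for a label name and description for the cluster,
        #    conditioned on the use-case instruction.
        label = call_llm(
            f"Use case: {use_case_instruction}\n"
            "Given the following conversation summaries from one cluster, "
            "produce a short label name and a one-sentence description.\n"
            + "\n".join(summaries)
        )
        taxonomy.append(label)
    return taxonomy
```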

5.2.2 Implementation Details. We instruct our LLMs to generate 10 intent categories and 25 domain categories for taxonomy generation. Likewise, we learn 10 intent clusters and 25 domain clusters with our embedding-based baselines. We use a minibatch size of 200 for our proposed taxonomy generation pipeline. We also apply a minibatch version of the K-means algorithm in all embedding-based clustering approaches, where the same batch size is used with a K-means++ [2] initialization. We run 10 different trials of the clustering algorithm and select the best one based on the Silhouette coefficient [21] on the validation set. We also devise a "model" selection prompt, which takes a batch of conversation summaries, multiple label taxonomies, and a use-case instruction as input, and outputs the index of the taxonomy that best fits the data and the instructional desiderata. We then run TnT-LLM for 10 trials and select the best outcome based on its performance on the validation set.
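As an illustration of the trial-selection step above, the following sketch runs the minibatch K-means clustering several times and keeps the trial with the highest Silhouette coefficient on a held-out validation set. `X_train` and `X_val` are assumed to be precomputed embedding matrices; the helper name and defaults are ours, not the paper's.

```python
# Sketch: select the best of several clustering trials by Silhouette coefficient.
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score


def best_of_n_trials(X_train, X_val, n_clusters=25, batch_size=200, n_trials=10):
    best_model, best_score = None, -np.inf
    for seed in range(n_trials):
        km = MiniBatchKMeans(n_clusters=n_clusters, batch_size=batch_size,
                             init="k-means++", n_init=1, random_state=seed)
        km.fit(X_train)
        # Score this trial by how well-separated the validation points are
        # under the fitted model's cluster assignments.
        score = silhouette_score(X_val, km.predict(X_val))
        if score > best_score:
            best_model, best_score = km, score
    return best_model, best_score
```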

Table 1: Inter-rater reliability (Fleiss' Kappa and Cohen's Kappa) among human raters and between LLM raters and the resolved human rating through majority voting. Agreement considered moderate and above (> 0.4) is highlighted with *. Evaluation is performed on BingChat-Phase1-S-Eng.

Human Evaluation. To evaluate the quality of the taxonomies generated by the methods listed above, three of the authors performed the label accuracy and use-case relevance tasks; each conversation was evaluated by all three raters. While the raters possessed a high degree of familiarity with the Bing Copilot system, as well as the desired use cases, they were unaware of the correspondence between methods and their generated labels. The position of the options in the pairwise label accuracy comparison task was also fully randomized. We also use two LLM systems, GPT-4 and GPT-3.5-Turbo, to perform the same evaluation tasks as the human raters. However, we notice that the LLM systems tend to exhibit a position bias [16] in the pairwise comparison task, favoring one option over another based on its position in the prompt. This bias is more evident when the taxonomy quality is low and the task is more challenging. To mitigate it, we average the results over multiple runs with randomized positions of the options in our experiments.
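The position-bias mitigation described above can be illustrated with a small sketch: each pairwise comparison is issued several times with the order of the two candidate labels randomized, and the LLM's choices are mapped back to labels and averaged. `llm_pairwise_judge` is a hypothetical helper for the LLM call, and its "first"/"second" return convention is our assumption.

```python
import random


def debiased_pairwise_preference(item, label_a, label_b, llm_pairwise_judge, n_runs=4):
    """Average an LLM rater's pairwise preference over runs with randomized
    option positions to reduce position bias. Returns the fraction of runs
    in which label_a is preferred."""
    wins_a = 0
    for _ in range(n_runs):
        options = [label_a, label_b]
        random.shuffle(options)  # randomize which label appears first in the prompt
        choice = llm_pairwise_judge(item, first=options[0], second=options[1])
        # `choice` is assumed to be "first" or "second"; map it back to a label.
        chosen_label = options[0] if choice == "first" else options[1]
        if chosen_label == label_a:
            wins_a += 1
    return wins_a / n_runs
```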

5.2.3 Results. We first calculate the coverage of the LLM-generated taxonomies on the BingChat-Phase1-L-Multi dataset, where both LLM systems achieve very high coverage (>99.5%) on both user intent and conversational domain taxonomies.
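The coverage metric is not restated here; assuming it denotes the fraction of conversations that receive a defined taxonomy label rather than an "Other"/"Undefined" bucket, a minimal sketch might look as follows (this reading is our assumption, not a definition quoted from the paper).

```python
# Hedged sketch: one plausible way to compute taxonomy coverage, assuming
# coverage = fraction of conversations assigned to a defined label rather
# than an "Other"/"Undefined" bucket.
def taxonomy_coverage(assigned_labels, undefined_labels=("Other", "Undefined")):
    covered = sum(1 for lbl in assigned_labels if lbl not in undefined_labels)
    return covered / len(assigned_labels)

# Under this reading, a value above 0.995 corresponds to the >99.5% reported above.
```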

We then conduct the accuracy and relevance evaluation tasks to assess the quality of the taxonomies generated by different methods on the small English-only evaluation dataset BingChat-Phase1-S-Eng. We report the inter-rater agreement (Cohen's Kappa [6] between two raters and Fleiss' Kappa [7] among multiple raters) in Table 1. The agreement is moderate (𝜅 > 0.4) on intent and domain accuracy as well as intent relevance, while the agreement on domain relevance is fair (Fleiss' 𝜅 = 0.379).[3] Interestingly, for the tasks with moderate agreement, the GPT-4 evaluator agrees more with the human majority than the humans do among themselves. This suggests that GPT-4 can be a consistent and reliable evaluator.
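For readers who want to reproduce this style of agreement analysis, the sketch below computes Fleiss' Kappa among the human raters and Cohen's Kappa between an LLM rater and the majority-voted human rating, using standard libraries. The data layout (an items × raters array of categorical ratings) and the function name are assumptions.

```python
# Sketch of the agreement computation in the style of Table 1.
import numpy as np
from collections import Counter
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa


def agreement_report(human_ratings, llm_ratings):
    """human_ratings: (n_items, n_raters) array of categorical ratings.
    llm_ratings: length-n_items array of ratings from one LLM rater."""
    human_ratings = np.asarray(human_ratings)

    # Fleiss' Kappa among the human raters.
    counts, _ = aggregate_raters(human_ratings)
    fleiss = fleiss_kappa(counts)

    # Resolve the human ratings by majority vote, then compute Cohen's Kappa
    # between the LLM rater and the resolved human rating.
    majority = [Counter(row).most_common(1)[0][0] for row in human_ratings]
    cohen = cohen_kappa_score(majority, llm_ratings)
    return fleiss, cohen
```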

Figure 4a shows the main results on label accuracy and use-case relevance from human evaluations on BingChat-Phase1-S-Eng. We observe that our TnT-LLM using GPT-4 outperforms other methods in most cases. Compared to GPT-4, we find that GPT-3.5-Turbo tends to capture conversation topics (domains) well, but often fails to generate labels that are aligned with the user intent instruction. Likewise, we notice that some embedding-based methods (ada2 + GPT-4, Instructor-XL + GPT-4) perform well in terms of producing accurate domain labels, on par with TnT-LLM instantiated with GPT-3.5-Turbo, but fail to capture the user intent behind the conversations. This is likely because the domain labels reflect the topical theme of the conversations, which can be easily derived from the semantic information captured by unsupervised embeddings, while intent labels require deeper reasoning and understanding of the use-case instruction.

Figure 4: Taxonomy evaluation results on BingChat-Phase1-S-Eng from human raters and the GPT-4 rater, where error bars indicate 95% confidence intervals.

With regard to our baselines, we find that GPT-4 consistently outperforms GPT-3.5-Turbo in producing more accurate labels when using the same embedding method for clustering. For the intent use case, GPT-4 generates more relevant labels than GPT-3.5-Turbo, while the difference is less noticeable for the domain use case; again, this may be because GPT-3.5-Turbo is better at capturing topical information in conversations than at reasoning about user intent.

Finally, given the high agreement between GPT-4 and human raters on the label accuracy task, we use GPT-4 to evaluate label accuracy on the larger multilingual dataset BingChat-Phase1-L-Multi (Figure 4b). We observe patterns similar to those in our human evaluation, where our TnT-LLM achieves the highest accuracy, in particular the instantiation that uses GPT-4.

Table 2: Inter-rater reliability (Fleiss' Kappa and Cohen's Kappa) among human annotators and between LLM annotations and the resolved human annotations. Agreement considered moderate ((0.4, 0.6]) is highlighted with *, and substantial and above (> 0.6) with **.


[2] https://openai.com/blog/new-and-improved-embedding-model

[3] Note that these evaluation tasks are cognitively challenging, especially for low-quality taxonomies (e.g., from some baseline methods).

Authors:

(1) Mengting Wan, Microsoft Corporation;

(2) Tara Safavi (Corresponding author), Microsoft Corporation;

(3) Sujay Kumar Jauhar, Microsoft Corporation;

(4) Yujin Kim, Microsoft Corporation;

(5) Scott Counts, Microsoft Corporation;

(6) Jennifer Neville, Microsoft Corporation;

(7) Siddharth Suri, Microsoft Corporation;

(8) Chirag Shah, University of Washington (work done while at Microsoft);

(9) Ryen W. White, Microsoft Corporation;

(10) Longqi Yang, Microsoft Corporation;

(11) Reid Andersen, Microsoft Corporation;

(12) Georg Buscher, Microsoft Corporation;

(13) Dhruv Joshi, Microsoft Corporation;

(14) Nagu Rangan, Microsoft Corporation.
