
LLM-Augmented Text Classification: Distilling GPT-4 Labels into Efficient Classifiers

mrarup82, April 23, 2025


Table of Links

Abstract and 1 Introduction

2 Related Work

3 Method and 3.1 Phase 1: Taxonomy Generation

3.2 Phase 2: LLM-Augmented Text Classification

4 Evaluation Suite and 4.1 Phase 1 Evaluation Strategies

4.2 Phase 2 Evaluation Strategies

5 Experiments and 5.1 Data

5.2 Taxonomy Generation

5.3 LLM-Augmented Text Classification

5.4 Summary of Findings and Suggestions

6 Discussion and Future Work, and References

A. Taxonomies

B. Additional Results

C. Implementation Details

D. Prompt Templates

5.3 LLM-Augmented Text Classification

At the end of the label taxonomy generation phase, we conduct a lightweight human calibration [23] on the intent taxonomy and domain taxonomy generated from TnT-LLM with GPT-4 to improve their clarity. These calibrated taxonomies are then used in the label assignment phase. The full label and description texts of each taxonomy are provided in Table 5 and Table 6. As a reminder, our main goal in this section is to assess how distilled lightweight classifiers trained on LLM labels compare to a full LLM classifier; we aim for a favorable tradeoff between accuracy and efficiency relative to a more expensive but potentially more powerful LLM.

5.3.1 Methods. We apply GPT-4 as an automated annotator to assign both the primary label and any other relevant labels to each conversation in the corpus. We then train classifiers based on the GPT-4 annotated training and validation sets. We extract features from each conversation using two embedding methods: ada2 and Instructor-XL. For each embedding method, we train three types of classifiers with the GPT-4 labels: logistic regression, the gradient-boosting method LightGBM [13], and a two-layer multilayer perceptron (MLP) [9]. We use multinomial logit in logistic regression for the primary label classification, and a standard ‘one-vs-all’ scheme for the multilabel classification with all three classifiers.
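The ‘one-vs-all’ multilabel scheme described above can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the 2-d vectors stand in for ada2 or Instructor-XL embeddings, the label names and all function names are hypothetical, and the hand-written gradient descent replaces whatever library the authors used. It only shows the idea of fitting one independent binary logistic regression per label and returning every label whose probability clears a threshold.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(w, b, x):
    # Probability that the label applies to embedding x.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_binary_logreg(X, y, lr=0.5, epochs=2000):
    """Fit one binary logistic regression with batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            err = score(w, b, xi) - yi
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def train_one_vs_all(X, label_sets, labels):
    """Train one independent binary classifier per label."""
    return {
        label: train_binary_logreg(
            X, [1.0 if label in s else 0.0 for s in label_sets]
        )
        for label in labels
    }

def predict_labels(models, x, threshold=0.5):
    """Return every label whose classifier fires on x."""
    return {
        label for label, (w, b) in models.items()
        if score(w, b, x) >= threshold
    }

# Toy data (assumption): 2-d "embeddings" with LLM-assigned label sets.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0],
     [0.1, 0.9], [0.8, 0.8], [0.9, 0.9]]
label_sets = [{"coding"}, {"coding"}, {"travel"},
              {"travel"}, {"coding", "travel"}, {"coding", "travel"}]
models = train_one_vs_all(X, label_sets, ["coding", "travel"])
```

Because each label gets its own classifier, a conversation can receive zero, one, or several labels, which is what distinguishes this setup from the multinomial primary-label classifier.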

Additionally, four of the authors manually labeled 400 English conversations (BingChat-Phase2-S-Eng) with the given intent and domain taxonomy. Each conversation was labeled by three annotators, and the majority vote determined the final labels. For a few conversations (<10%) where all three annotators disagreed on the primary label, the fourth annotator served as a tie-breaker.
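The resolution rule for primary labels can be sketched as follows. The text does not specify exactly how the fourth annotator's judgment was applied, so this sketch assumes the tie-breaker's label is simply adopted when no two of the three annotators agree; the function name and label strings are hypothetical.

```python
from collections import Counter

def resolve_primary_label(votes, tie_breaker=None):
    """Resolve three annotators' primary labels by majority vote.

    Assumption: when all three disagree, the tie-breaker's label
    (from a fourth annotator) is used as the final label.
    """
    label, count = Counter(votes).most_common(1)[0]
    if count >= 2:  # at least two of the three annotators agree
        return label
    return tie_breaker

majority = resolve_primary_label(["learning", "learning", "planning"])
broken_tie = resolve_primary_label(
    ["learning", "planning", "creation"], tie_breaker="planning"
)
```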

We thus obtain two annotated test sets: BingChat-Phase2-S-Eng with 400 English conversations with both human and GPT-4 annotations, and BingChat-Phase2-L-Multi with around 10k conversations with GPT-4 annotations only.

5.3.2 Results. We first evaluate the agreement between annotators to assess the task complexity and reliability. As Table 2 shows, human annotators have substantial agreement on the primary domain label (𝜅 > 0.6), and moderate agreement on the primary intent label (Fleiss' 𝜅 = 0.553). Both of these values indicate a high degree of mutual understanding among raters and clarity in the instructions and taxonomies. We also note that the domain taxonomy has more categories (25) than the intent taxonomy (10). One might expect a larger taxonomy to be more difficult to comprehend, but we find the smaller intent taxonomy to be more challenging for humans to agree on. We attribute this to the task complexity and ambiguity, as it requires more reasoning; this aligns well with our finding in the previous evaluation that GPT-4 greatly outperforms GPT-3.5-Turbo on intent detection, as GPT-4 is generally considered to be a stronger reasoner.
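For readers unfamiliar with the agreement statistic, the following is a small sketch of the standard Fleiss' 𝜅 computation for a fixed number of raters per item. It is written from the textbook definition, not taken from the authors' evaluation code, and the toy ratings are invented for illustration.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for items each rated by the same number of raters.

    ratings: list of items, each a list of category labels (one per rater).
    Undefined (division by zero) if every rating falls in one category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    totals = Counter()
    per_item_agreement = 0.0
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Fraction of rater pairs that agree on this item.
        per_item_agreement += (
            sum(c * c for c in counts.values()) - n_raters
        ) / (n_raters * (n_raters - 1))
    p_observed = per_item_agreement / n_items
    # Chance agreement from the overall category proportions.
    p_expected = sum(
        (c / (n_items * n_raters)) ** 2 for c in totals.values()
    )
    return (p_observed - p_expected) / (1 - p_expected)

# Toy ratings (assumption): three raters, two conversations.
perfect = fleiss_kappa([["a", "a", "a"], ["b", "b", "b"]])
mixed = fleiss_kappa([["a", "a", "b"], ["b", "b", "a"]])
```

A value of 1 means perfect agreement, while values at or below 0 mean agreement no better than chance, which is why the reported values above 0.4 and 0.6 are read as moderate and substantial agreement.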

Similar to the label accuracy evaluation (Table 1), GPT-4 agrees more with the resolved human labels than humans do among themselves on the primary label assignment. We observe that human agreement on all applicable labels is moderate (𝜅 > 0.4) with both intent and domain taxonomies, which is surprisingly good considering that this agreement is calculated based on exact match (i.e., an agreement is counted only if all selected labels are matched). However, the agreement between GPT-4 and human annotations on this task is much lower. A closer inspection reveals that GPT-4 tends to be more liberal than humans on label assignment, applying all relevant categories, resulting in low precision but high recall.
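The interaction between exact-match agreement and liberal labeling can be made concrete with a toy sketch. The label sets below are invented, the function names are hypothetical, and micro-averaging over all label assignments is an assumption, since the text does not state how precision and recall were aggregated.

```python
def exact_match_rate(predicted, reference):
    """Share of items whose full label sets are identical."""
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)

def micro_precision_recall(predicted, reference):
    """Micro-averaged precision and recall over all label assignments."""
    true_pos = sum(len(p & r) for p, r in zip(predicted, reference))
    n_predicted = sum(len(p) for p in predicted)
    n_reference = sum(len(r) for r in reference)
    return true_pos / n_predicted, true_pos / n_reference

# Toy data (assumption): a "liberal" annotator adds extra labels.
human = [{"a"}, {"b"}, {"a", "c"}]
liberal = [{"a", "b"}, {"b", "c"}, {"a", "c"}]

em = exact_match_rate(liberal, human)
precision, recall = micro_precision_recall(liberal, human)
```

In this toy case the liberal annotator recovers every human label (recall of 1) but matches the full label set on only one of three items, which mirrors how extra labels depress both precision and exact-match agreement.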

We then evaluate the classification performance of the distilled embedding-based classifiers on two datasets: BingChat-Phase2-S-Eng, where human annotations are the oracle, and BingChat-Phase2-L-Multi, where GPT-4 annotations are the oracle. The results for the primary label classification are presented in Table 3, where we observe that lightweight embedding-based classifiers can achieve promising results. In particular, ada2 embeddings achieve strong results with logistic regression; nonlinearity does not seem to improve performance significantly in most cases. When using human annotations as the gold standard, we find that the performance of these lightweight models is comparable to, and sometimes slightly better than, directly using GPT-4 as a classifier on BingChat-Phase2-S-Eng. We also perform evaluation on the multilingual test set BingChat-Phase2-L-Multi, where GPT-4 annotations are considered the oracle. We observe that the performance on non-English conversations is lower than that on English conversations (Table 3), especially with the Instructor embedding, indicating the importance of choosing an embedding method that suits the characteristics of the corpus.

On the multilabel classification task (Table 4), we observe that the distilled classifiers achieve higher precision at the expense of some recall compared to GPT-4. Here, nonlinearity also seems to help more, as MLP-based classifiers achieve the highest accuracy and precision.

Authors:

(1) Mengting Wan, Microsoft Corporation;

(2) Tara Safavi (Corresponding authors), Microsoft Corporation;

(3) Sujay Kumar Jauhar, Microsoft Corporation;

(4) Yujin Kim, Microsoft Corporation;

(5) Scott Counts, Microsoft Corporation;

(6) Jennifer Neville, Microsoft Corporation;

(7) Siddharth Suri, Microsoft Corporation;

(8) Chirag Shah, University of Washington (work done while working at Microsoft);

(9) Ryen W. White, Microsoft Corporation;

(10) Longqi Yang, Microsoft Corporation;

(11) Reid Andersen, Microsoft Corporation;

(12) Georg Buscher, Microsoft Corporation;

(13) Dhruv Joshi, Microsoft Corporation;

(14) Nagu Rangan, Microsoft Corporation.
