LLM Probabilities, Training Size, and Perturbation Thresholds in Entity Recognition
Authors:
(1) Anthi Papadopoulou, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway and Corresponding author ([email protected]);
(2) Pierre Lison, Norwegian Computing Center, Gaustadalleen 23A, 0373 Oslo, Norway;
(3) Mark Anderson, Norwegian Computing Center, Gaustadalleen 23A, 0373 Oslo, Norway;
(4) Lilja Øvrelid, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway;
(5) Ildiko Pilan, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway.
Table of Links
Abstract and 1 Introduction
2 Background
2.1 Definitions
2.2 NLP Approaches
2.3 Privacy-Preserving Data Publishing
2.4 Differential Privacy
3 Datasets and 3.1 Text Anonymization Benchmark (TAB)
3.2 Wikipedia Biographies
4 Privacy-oriented Entity Recognizer
4.1 Wikidata Properties
4.2 Silver Corpus and Model Fine-tuning
4.3 Evaluation
4.4 Label Disagreement
4.5 MISC Semantic Type
5 Privacy Risk Indicators
5.1 LLM Probabilities
5.2 Span Classification
5.3 Perturbations
5.4 Sequence Labelling and 5.5 Web Search
6 Analysis of Privacy Risk Indicators and 6.1 Evaluation Metrics
6.2 Experimental Results and 6.3 Discussion
6.4 Combination of Risk Indicators
7 Conclusions and Future Work
Declarations
References
Appendices
A. Human properties from Wikidata
B. Training parameters of entity recognizer
C. Label Agreement
D. LLM probabilities: base models
E. Training size and performance
F. Perturbation thresholds
A Human properties from Wikidata
The two tables below list the Wikidata properties selected in Section 4.1 to constitute the DEM and MISC gazetteers.
DEM-related properties
MISC-related properties
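As an illustration of how such gazetteers can be populated, the sketch below retrieves the values of one Wikidata property via the public SPARQL endpoint. The property used (P106, occupation) and the result limit are assumptions for illustration only, not necessarily entries from the tables above.

```python
# Minimal sketch: fetching the values of one Wikidata property to
# populate a gazetteer. Assumes the SPARQLWrapper package; P106
# (occupation) is an example property, not necessarily one of the
# properties listed in the tables above.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://query.wikidata.org/sparql",
                         agent="gazetteer-builder/0.1")

endpoint.setQuery("""
SELECT DISTINCT ?valueLabel WHERE {
  ?person wdt:P31 wd:Q5 ;        # instance of: human
          wdt:P106 ?value .      # example property: occupation
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 1000
""")
endpoint.setReturnFormat(JSON)
results = endpoint.query().convert()

# Collect the English labels into a gazetteer set.
gazetteer = {row["valueLabel"]["value"]
             for row in results["results"]["bindings"]}
print(len(gazetteer), "gazetteer entries")
```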
B Training parameters of entity recognizer
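The fine-tuning setup of Section 4.2 can be sketched as follows; the base checkpoint, label count, and every hyperparameter value shown are placeholders for illustration, not the settings reported in this appendix.

```python
# Minimal sketch of setting up token-classification fine-tuning as in
# Section 4.2. All values below are placeholders, NOT the paper's
# actual training parameters.
from transformers import AutoModelForTokenClassification, TrainingArguments

model = AutoModelForTokenClassification.from_pretrained(
    "roberta-base", num_labels=9)   # checkpoint and label count assumed

args = TrainingArguments(
    output_dir="entity-recognizer",
    learning_rate=2e-5,              # placeholder
    per_device_train_batch_size=16,  # placeholder
    num_train_epochs=3,              # placeholder
    weight_decay=0.01,               # placeholder
)
# Trainer(model=model, args=args, train_dataset=...) would complete the
# fine-tuning loop given a token-classification dataset.
```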
C Label Agreement
Frequently confused label pairs (see Section 4.4) are shown in Figure 4.
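A minimal sketch of how such confusion counts can be derived from gold and predicted labels (the label values here are illustrative, not the paper's tag set):

```python
# Count frequently confused label pairs from aligned gold/predicted
# sequences. Labels below are illustrative placeholders.
from collections import Counter

gold = ["PERSON", "DEM", "LOC", "MISC", "DEM"]   # placeholder labels
pred = ["PERSON", "MISC", "LOC", "DEM", "MISC"]  # placeholder labels

confusions = Counter((g, p) for g, p in zip(gold, pred) if g != p)
for (g, p), n in confusions.most_common():
    print(f"{g} -> {p}: {n}")
```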
D LLM probabilities: base models
Table 11 lists the (ordered) base models that the AutoGluon tabular predictor employs for the LLM-probability-based approach of Section 5.1.
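For reference, a minimal sketch of fitting such a predictor, assuming the LLM-probability features have already been extracted to CSV files; the file names and column names are hypothetical:

```python
# Minimal sketch of fitting an AutoGluon tabular predictor on
# LLM-probability features (Section 5.1). The file names and the
# "is_high_risk" label column are hypothetical, not the paper's
# actual feature set.
import pandas as pd
from autogluon.tabular import TabularPredictor

train_df = pd.read_csv("train_features.csv")  # hypothetical file
test_df = pd.read_csv("test_features.csv")    # hypothetical file

predictor = TabularPredictor(label="is_high_risk", eval_metric="f1")
predictor.fit(train_df)

# The leaderboard shows the fitted base models, ordered by score.
print(predictor.leaderboard(test_df))
```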
E Training size and performance
Figure 5 shows the F1 score of both the Tabular and the Multimodal AutoGluon predictors (LLM probabilities, Section 5.1, and span classification, Section 5.2, respectively) at different training sizes on both datasets. For each training dataset split, we use random samples ranging from 1% to 100% of the data.
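The learning-curve experiment can be sketched as follows; the file names and label column are hypothetical, as in the sketch of Appendix D:

```python
# Minimal sketch of the learning-curve experiment in Appendix E: fit an
# AutoGluon tabular predictor on increasing random fractions of the
# training split and record F1 on held-out data. File names and the
# "is_high_risk" label column are hypothetical.
import pandas as pd
from autogluon.tabular import TabularPredictor

train_df = pd.read_csv("train_features.csv")  # hypothetical file
test_df = pd.read_csv("test_features.csv")    # hypothetical file

scores = {}
for frac in [0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 1.00]:
    sample = train_df.sample(frac=frac, random_state=42)
    predictor = TabularPredictor(label="is_high_risk", eval_metric="f1")
    predictor.fit(sample)
    scores[frac] = predictor.evaluate(test_df)["f1"]

for frac, f1 in scores.items():
    print(f"{frac:>5.0%} of training data -> F1 = {f1:.3f}")
```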
F Perturbation thresholds
Figure 6 shows the performance of different perturbation thresholds on the training split of both datasets, with the black line indicating the threshold used for evaluation in Section 5.3.
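Such a threshold sweep can be sketched as follows; the per-span perturbation scores and gold labels below are randomly generated stand-ins for the quantities computed in Section 5.3.

```python
# Minimal sketch of a perturbation-threshold sweep (Appendix F): given
# a perturbation score per span (e.g., the probability drop when the
# span is perturbed, Section 5.3), sweep candidate thresholds on the
# training split and keep the F1-maximizing one. `scores` and `gold`
# are synthetic placeholders, not the paper's data.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
scores = rng.random(1000)                                      # toy scores
gold = (scores + rng.normal(0, 0.3, 1000) > 0.5).astype(int)   # toy labels

thresholds = np.linspace(0.0, 1.0, 101)
f1s = [f1_score(gold, (scores >= t).astype(int)) for t in thresholds]

best = thresholds[int(np.argmax(f1s))]
print(f"Best threshold on training split: {best:.2f} (F1 = {max(f1s):.3f})")
```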