What Is Text Sanitization? Definitions, Privacy Laws, and NLP Approaches
Authors:
(1) Anthi Papadopoulou, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway and Corresponding author ([email protected]);
(2) Pierre Lison, Norwegian Computing Center, Gaustadalleen 23A, 0373 Oslo, Norway;
(3) Mark Anderson, Norwegian Computing Center, Gaustadalleen 23A, 0373 Oslo, Norway;
(4) Lilja Øvrelid, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway;
(5) Ildikó Pilán, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway.
2.1 Definitions
The right to privacy is a fundamental human right, as evidenced by its inclusion in the Universal Declaration of Human Rights and the European Convention on Human Rights. In the digital sphere, data privacy is enforced through multiple national and international regulations, such as the General Data Protection Regulation (GDPR) in Europe, the California Consumer Privacy Act (CCPA) in the United States, or China’s Personal Information Protection Law (PIPL). Although those regulations differ in both scope and interpretation, their common principle is that individuals should remain in control of their own data. In particular, the processing of personal data must have a legal ground, and the data cannot be shared with third parties without the explicit and informed consent of the person(s) it refers to.
One alternative strategy is to anonymize the data so that it is no longer personal and therefore falls outside the scope of privacy regulations. Anonymization, according to the GDPR, refers to the complete and irrevocable removal of all information that may directly or indirectly lead to re-identification. However, as shown by Weitzenboeck et al. (2022), making unstructured data such as text completely anonymous is almost impossible to achieve in practice, unless the content of the text is radically altered or the original source of the document is deleted.
Although complete anonymization is hard to attain, text sanitization remains a crucial tool to comply with the general requirement of data minimization enshrined in the GDPR and most privacy regulations (Goldsteen et al., 2021). The principle of data minimization states that one should only collect and retain the personal data that is strictly necessary to fulfill a given purpose.
The process of editing text documents to conceal the identity of a person is described with a somewhat confusing terminology (Lison et al., 2021; Pilán et al., 2022). The GDPR uses the term pseudonymization to denote a process that transforms data to conceal at least some personal identifiers, but in a way that does not amount to complete anonymization. The term de-identification is also common (Chevrier et al., 2019; Johnson et al., 2020), especially for work on medical patient records. De-identification approaches are typically restricted to the recognition of predefined entities, such as the categories listed in HIPAA (2004). In contrast, we define text sanitization as the process of detecting and masking any type of personal information in a text document that can lead to the identification of the individual whose identity we wish to protect.
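To make the distinction concrete, the toy sketch below contrasts the two tasks on a single sentence. The spans and labels are hand-crafted for illustration and do not come from any of the systems discussed here:

```python
# Toy example contrasting de-identification with text sanitization.
# All spans and labels below are hand-crafted for illustration.

text = "John Doe, a 43-year-old surgeon from Tromsø, was admitted on 12 May 2021."

# De-identification masks a fixed inventory of categories (e.g. HIPAA PHI).
deid_spans = [("John Doe", "NAME"), ("12 May 2021", "DATE")]

# Sanitization also masks quasi-identifiers (age, profession, location, ...)
# whose combination could single out the individual.
sanitization_spans = deid_spans + [
    ("43-year-old", "AGE"),
    ("surgeon", "PROFESSION"),
    ("Tromsø", "LOC"),
]

def mask(text: str, spans: list[tuple[str, str]]) -> str:
    """Replace each detected span with a placeholder carrying its semantic type."""
    for span, label in spans:
        text = text.replace(span, f"[{label}]")
    return text

print(mask(text, deid_spans))          # masks only name and date
print(mask(text, sanitization_spans))  # masks every potentially identifying span
```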
Text sanitization is a topic of investigation in several research fields, notably natural language processing (NLP) and privacy-preserving data publishing (PPDP). Approaches to text rewriting based on differential privacy have also been proposed. We review those approaches below.
2.2 NLP Approaches
NLP approaches to text sanitization have mainly relied on sequence labelling, inspired by the large body of work on named entity recognition (NER). Such approaches aim to detect text spans containing personal identifiers (Chiu and Nichols, 2016; Lample et al., 2016). Most research in this field to date has focused on the medical domain, where the Health Insurance Portability and Accountability Act of 1996 (HIPAA, 2004) offers concrete rules that allow for the standardization of this task. HIPAA defines a set of Protected Health Information (PHI) data types that encompass direct identifiers (such as names or social security numbers) as well as domain-specific attributes such as treatments received and health conditions. A wide variety of NLP methods have been developed for this task, including rule-based, machine-learning-based and hybrid approaches (Sweeney, 1996; Neamatullah et al., 2008; Yang and Garibaldi, 2015; Yogarajan et al., 2018). Character-based recurrent neural networks (Dernoncourt et al., 2017; Liu et al., 2017) and transformer architectures have also been investigated for this purpose (Johnson et al., 2020). A recent initiative focused on replacing sensitive information is INCOGNITUS (Ribeiro et al., 2023), a clinical note de-identification tool. The system allows documents to be redacted either with a NER-based method or with an embedding-based approach that substitutes each token with a semantically related one. Recent large language models from the GPT family have also been explored. Liu et al. (2023) proposed DeID-GPT for masking PHI categories and showed that, with zero-shot in-context learning that explicitly incorporates HIPAA requirements in the prompts, GPT-4 outperformed fine-tuned transformer models on the same annotated medical texts.
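As a concrete illustration of the sequence-labelling paradigm, the sketch below masks every entity detected by an off-the-shelf token-classification model. The model name is a stand-in assumption: a general-purpose NER model rather than one fine-tuned on PHI annotations, which a real de-identification system would require:

```python
from transformers import pipeline

# Minimal sketch of sequence-labelling-based de-identification.
# "dslim/bert-base-NER" is a general-purpose NER model used as a stand-in;
# a PHI-specific model would be needed for actual clinical de-identification.
ner = pipeline("token-classification",
               model="dslim/bert-base-NER",
               aggregation_strategy="simple")

def deidentify(text: str, threshold: float = 0.5) -> str:
    """Mask every detected entity span with its predicted semantic label."""
    spans = [s for s in ner(text) if s["score"] >= threshold]
    # Replace from right to left so earlier character offsets stay valid.
    for s in sorted(spans, key=lambda s: s["start"], reverse=True):
        text = text[:s["start"]] + f"[{s['entity_group']}]" + text[s["end"]:]
    return text

print(deidentify("John Doe was treated at St. Olavs Hospital in Trondheim."))
```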
Text sanitization outside the medical domain includes approaches such as Juez-Hernandez et al. (2023), who propose AGORA, a document de-identification system combined with geoparsing (automatic location extraction from text), using LSTMs and CRFs and trained on Spanish law enforcement data. The authors focus on offering a complete pipeline and on location information, while demographic attributes are not part of the information to de-identify. Yermilov et al. (2023) compared three systems for detecting and pseudonymizing PII: (1) a NER-based system relying on Wikidata; (2) a single-step sequence-to-sequence model trained on a parallel corpus; and (3) a large language model approach where named entities are first detected with a 1-shot prompt to GPT-3 and then pseudonymized with 1-shot prompts to ChatGPT (GPT-3.5). The authors find that the NER-based approach is best at preserving privacy, while LLMs best preserve utility on downstream text classification and summarization tasks. Finally, Papadopoulou et al. (2022) present an end-to-end approach to text sanitization, from the detection of personal information to privacy risk estimation through the use of language model probabilities, web queries, and a classifier trained on manually labeled data. The present paper builds upon this work.
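The language-model probability indicator mentioned above can be sketched as follows: spans that a language model cannot predict from the surrounding context tend to be specific to the individual, and are therefore riskier to leave in the text. The masked-LM scoring below is an illustrative reading of that idea rather than the exact setup of Papadopoulou et al. (2022); the model choice and the per-token averaging are assumptions:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Rough sketch of a language-model probability risk indicator.
# Low average log-probability = the span is hard to predict from context,
# suggesting it is specific to the individual and thus riskier to keep.
tok = AutoTokenizer.from_pretrained("bert-base-cased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-cased")
mlm.eval()

@torch.no_grad()
def span_log_prob(text: str, start: int, end: int) -> float:
    """Average log-probability of the span's tokens, masking each in turn."""
    enc = tok(text, return_offsets_mapping=True, return_tensors="pt")
    offsets = enc.pop("offset_mapping")[0]
    ids = enc["input_ids"][0]
    # Token positions whose character offsets fall inside the target span.
    span_pos = [i for i, (s, e) in enumerate(offsets.tolist())
                if s >= start and e <= end and e > s]
    if not span_pos:
        return float("nan")
    log_probs = []
    for pos in span_pos:
        masked = ids.clone()
        original = masked[pos].item()
        masked[pos] = tok.mask_token_id
        logits = mlm(input_ids=masked.unsqueeze(0)).logits[0, pos]
        log_probs.append(torch.log_softmax(logits, dim=-1)[original].item())
    return sum(log_probs) / len(log_probs)

text = "The patient, a surgeon from Tromsø, was admitted in May."
# A low score for "Tromsø" would flag the location as a risky span.
print(span_log_prob(text, text.index("Tromsø"), text.index("Tromsø") + len("Tromsø")))
```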