Teaching AI to Say “I Don’t Know”: A Four-Step Guide to Contextual Data Imputation

Authors:
(1) Ahatsham Hayat, Department of Electrical and Computer Engineering, University of Nebraska-Lincoln ([email protected]);
(2) Mohammad Rashedul Hasan, Department of Electrical and Computer Engineering, University of Nebraska-Lincoln ([email protected]).
Table of Links
Abstract and 1 Introduction
2 Method
2.1 Problem Formulation and 2.2 Missingness Patterns
2.3 Generating Missing Values
2.4 Description of CLAIM
3 Experiments
3.1 Results
4 Related Work
5 Conclusion and Future Directions
6 Limitations and References
2 Method
2.1 Problem Formulation
2.2 Missingness Patterns
We represent the missing data mechanism as the conditional distribution of the missingness mask M given the data X, parameterized by an unknown ϕ:

P(M | X; ϕ).
The literature defines the following three standard mechanisms for missing data [21].
Missing completely at random (MCAR). An MCAR case occurs when the probability that a value of a variable is missing is independent of both that variable and all other variables:

P(M | X; ϕ) = P(M; ϕ).

In MCAR, the missingness probability depends neither on the missing values nor on the observed values.
Missing at random (MAR). The probability that the value of a variable is missing depends only on the observed values of the other variables X_O. The missingness is thus independent of the missing values themselves, and a missing value is predictable from the observed variables:

P(M | X; ϕ) = P(M | X_O; ϕ).
Missing not at random (MNAR). This case covers missing mechanisms that are neither MCAR nor MAR. In MNAR, the probability that a value is missing can depend not only on other variables but also on the missing value itself. Unlike MAR, the missingness in MNAR cannot be predicted from the observed variables alone, and there is no general method for handling MNAR missing data properly [14].
When the missingness is MCAR or MAR, its cause can often be ignored, which simplifies imputation [33]. For this reason, the majority of research covers the cases where missing data are of the MCAR or MAR type.
2.3 Generating Missing Values
We constructed synthetic datasets with up to 30% missing values by applying three missingness mechanisms (MCAR, MAR, and MNAR) to complete datasets. Our implementations of these mechanisms are modified from [20].
MCAR. Missingness is introduced by randomly removing 30% of the observations from each feature.
MAR. First, we select all observations whose value of an independent feature (usually the first column in the dataset) falls within the 30th percentile. Then, we randomly remove 60% of the corresponding observations from each dependent feature.
MNAR. We remove the observations of a feature whose values fall within the 30th percentile of that feature.
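The three masking procedures above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation (which is modified from [20]); the function names and the reading of "30th percentile range" as "at or below the 30th percentile" are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mcar(X, p=0.30):
    """Randomly mask a fraction p of the observations in each feature."""
    X = X.astype(float).copy()
    n = X.shape[0]
    for j in range(X.shape[1]):
        idx = rng.choice(n, size=int(p * n), replace=False)
        X[idx, j] = np.nan
    return X

def mar(X, cond_col=0, pct=30, p=0.60):
    """Select rows whose value in an independent feature falls at or below
    its pct-th percentile; within those rows, randomly mask a fraction p
    of each dependent feature."""
    X = X.astype(float).copy()
    cutoff = np.percentile(X[:, cond_col], pct)
    rows = np.flatnonzero(X[:, cond_col] <= cutoff)
    for j in range(X.shape[1]):
        if j == cond_col:
            continue  # the conditioning feature stays fully observed
        picked = rng.choice(rows, size=int(p * len(rows)), replace=False)
        X[picked, j] = np.nan
    return X

def mnar(X, pct=30):
    """Mask a feature's observations whose own value falls at or below
    that feature's pct-th percentile."""
    X = X.astype(float).copy()
    for j in range(X.shape[1]):
        cutoff = np.percentile(X[:, j], pct)
        X[X[:, j] <= cutoff, j] = np.nan
    return X
```

Note that only `mcar` masks independently of the data; `mar` conditions on an observed column, while `mnar` conditions on the value being removed.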
2.4 Description of CLAIM
Figure 1 illustrates the CLAIM process, which encompasses four stages: (1) constructing a contextualized natural language dataset, (2) generating suitable descriptors for
missing values, (3) creating a missingness-aware contextualized dataset, and (4) adapting an LLM for downstream tasks. We detail these stages below.
Constructing a Contextualized Natural Language Dataset. We construct a contextualized natural language dataset from a numeric dataset X containing missing values. The objective is to generate a contextually suitable description of each attribute and its measured value in natural language. For instance, a record from the UCI Wine dataset [12] with numeric input and output attributes is contextualized as follows: “The alcohol content in the wine is 12.47. The level of malic acid in the wine is 1.52 … The class of the wine is classified as class 1 wine.”[1] This step converts numeric values into detailed descriptions, preparing the dataset for embedding missing value descriptors.
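This serialization step can be sketched as a simple per-feature template fill, modeled on the Wine example above. The templates and feature names here are hypothetical; the paper does not specify the exact templates it uses.

```python
# Hypothetical per-feature templates, modeled on the Wine example in the text.
TEMPLATES = {
    "alcohol": "The alcohol content in the wine is {}.",
    "malic_acid": "The level of malic acid in the wine is {}.",
}
TARGET_TEMPLATE = "The class of the wine is classified as class {} wine."

def contextualize(record, target):
    """Render one numeric record (feature name -> value) as a
    natural-language description ending with the target sentence."""
    parts = [TEMPLATES[name].format(value) for name, value in record.items()]
    parts.append(TARGET_TEMPLATE.format(target))
    return " ".join(parts)
```

For example, `contextualize({"alcohol": 12.47, "malic_acid": 1.52}, 1)` reproduces the two-feature fragment of the sentence quoted above.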
Generating Suitable Descriptors for Missing Values. Unlike conventional imputation methods that estimate missing values from observed data using numerical methods, we utilize contextually relevant descriptors of missing values for imputation. We generate these descriptors using a conversational LLM (e.g., OpenAI’s ChatGPT-3.5 [2]). We prompt the LLM with a dataset description and instruct it to generate missing value descriptors, such as: “For any missing attribute values, suggest a descriptor for the missing data that I can place in those cells.” This method relies on the LLM’s extensive knowledge base to produce appropriate missing value descriptors. A list of feature-specific, contextually relevant missing-value descriptors for selected datasets is provided in the Appendix.
Creating a Missingness-Aware Contextualized Dataset. We construct the missingness-aware contextualized natural language dataset Xmissingness_aware by replacing the missing values with the generated descriptors. This ensures that each data instance explicitly signals its missing attributes, which can improve the LLM’s ability to learn from incomplete data by providing explicit context. Furthermore, we use distinct descriptors for each feature in the dataset that contains missing values, thereby implicitly informing the LLM to handle each feature’s missingness in a contextually suitable way, improving performance on the downstream task.
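The substitution step can be sketched as follows: render each feature with its value template when the value is present, and with its feature-specific descriptor when it is missing. The descriptor strings below are hypothetical placeholders; the actual LLM-generated descriptors are listed in the paper's Appendix.

```python
import math

# Hypothetical feature-specific missing-value descriptors (illustrative only;
# see the paper's Appendix for the LLM-generated descriptors actually used).
DESCRIPTORS = {
    "alcohol": "The alcohol content in the wine is not recorded.",
    "malic_acid": "The level of malic acid in the wine was not measured.",
}

def contextualize_with_missingness(record, templates, target_sentence):
    """Render a record as natural language, substituting a feature-specific
    descriptor wherever the numeric value is missing (None or NaN)."""
    parts = []
    for name, value in record.items():
        if value is None or (isinstance(value, float) and math.isnan(value)):
            parts.append(DESCRIPTORS[name])       # missingness-aware descriptor
        else:
            parts.append(templates[name].format(value))
    parts.append(target_sentence)
    return " ".join(parts)
```

Because each feature gets its own descriptor, the resulting text tells the model both *that* a value is missing and *which* attribute it belongs to.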
Adapting an LLM for Solving Downstream Tasks. The final step involves fine-tuning a pre-trained LLM with the missingness-aware, contextually rich dataset. We incorporate specific task instructions and strategies for handling missing data into the fine-tuning process. For instance, for classification tasks, we might include instructions like: “Predict the class based on the given measurements. Use the context provided by missing value descriptors to inform your prediction.”
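Assembling a fine-tuning example from the task instruction and a missingness-aware description might look like the sketch below. The prompt/completion layout is our assumption; the paper does not specify its exact fine-tuning template.

```python
# Task instruction quoted from the text above; the surrounding prompt
# formatting (separator, "Answer:" cue) is a hypothetical choice.
INSTRUCTION = ("Predict the class based on the given measurements. "
               "Use the context provided by missing value descriptors "
               "to inform your prediction.")

def build_example(description, label):
    """Assemble one instruction-tuning example (prompt/completion pair)
    from a contextualized record description and its class label."""
    prompt = f"{INSTRUCTION}\n\n{description}\n\nAnswer:"
    return {"prompt": prompt, "completion": f" class {label}"}
```

Each training pair therefore carries the handling strategy for missing data inside the prompt itself, rather than relying on the model to infer it.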
This structured approach, from transforming datasets to fine-tuning LLMs, constitutes a comprehensive method for addressing data missingness through the capabilities of LLMs.
This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.