Missing data is an interesting data imperfection since it may arise naturally due to the nature of the domain, or be inadvertently created during data collection, transmission, or processing.
In essence, missing data is characterized by the appearance of absent values in the data, i.e., missing values in some records or observations within the dataset, and may be either univariate (one feature has missing values) or multivariate (several features have missing values):
Let’s consider an example: say we’re conducting a study on a patient cohort regarding diabetes.
Medical data is a compelling example here, since it is usually highly subject to missing values: patient values are taken from both surveys and laboratory results, may be measured several times throughout the course of diagnosis or treatment, are stored in several formats (sometimes distributed across institutions), and are often handled by different people. It can (and most definitely will) get messy!
In our diabetes study, the presence of missing values is likely to be related to the study being conducted or the information being collected.
For instance, missing data may arise because of a faulty sensor that shuts down for high values of blood pressure. Another possibility is that missing values in the feature “weight” are more likely for older women, who are less inclined to disclose this information. Or obese patients may be less likely to share their weight.
Alternatively, data may also be missing for reasons that are not at all related to the study.
A patient may have some of their information missing because a flat tire caused them to miss a doctor’s appointment. Data might also be missing due to human error: for instance, if the person conducting the analysis misplaces or misreads some documents.
Regardless of the reason why data is missing, it is important to investigate whether datasets contain missing data prior to model building, as this problem can have severe consequences for classifiers:
- Some classifiers cannot handle missing values internally: This makes them inapplicable to datasets with missing data. In some scenarios, these values are encoded with a pre-defined value, e.g., “0”, so that machine learning algorithms are able to cope with them, although this is not the best practice, especially for higher percentages of missing data (or more complex missing mechanisms);
- Predictions based on missing data can be biased and unreliable: Although some classifiers can handle missing data internally, their predictions might be compromised, since an important piece of information might be missing from the training data.
Furthermore, although missing values may “all look the same”, the truth is that their underlying mechanisms (the reasons why they are missing) can follow 3 main patterns: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR).
Keeping these different types of missing mechanisms in mind is important because they determine both the choice of appropriate methods to handle missing data efficiently and the validity of the inferences derived from them.
Let’s go over each mechanism real quick!
Missing Data Mechanisms
If you’re a mathy person, I’d suggest going through this paper (cough cough), namely Sections II and III, which contain all the notation and mathematical formulation you might be looking for (I was actually inspired by this book, which is also a very interesting primer; check Sections 2.2.3 and 2.2.4).
If you’re also a visual learner like me, you’d like to “see” it, right?
For that matter, we’ll take a look at the adolescent tobacco study example used in the paper. We’ll consider dummy data to showcase each missing mechanism:
One thing to keep in mind: the missing mechanisms describe whether and how the missingness pattern can be explained by the observed data and/or the missing data. It’s tricky, I know. But it will become clearer with the example!
In our tobacco study, we’re focusing on adolescent tobacco use. There are 20 observations, corresponding to 20 participants. The feature `Age` is completely observed, whereas the `Number of Cigarettes` (smoked per day) will be missing according to different mechanisms.
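If you’d like to play along, here’s a minimal sketch of how such dummy data could be generated; the exact ages and cigarette counts are made up for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Dummy data for the adolescent tobacco study: 20 participants,
# Age fully observed, Number of Cigarettes to be masked later
df_study = pd.DataFrame({
    "Age": rng.integers(15, 19, size=20),  # ages 15 to 18
    "Number of Cigarettes": rng.integers(0, 25, size=20).astype(float),
})
```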
Missing Completely At Random (MCAR): No harm, no foul!
In the Missing Completely At Random (MCAR) mechanism, the missingness process is completely unrelated to both the observed and missing data. That means the probability of a feature having missing values is completely random.
In our example, I simply removed some values at random. Note how the missing values are not located in a particular range of `Age` or `Number of Cigarettes` values. This mechanism can therefore occur due to unexpected events happening during the study: say, the person responsible for registering the participants’ responses accidentally skipped a question of the survey.
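Under the same dummy-data assumptions, MCAR could be simulated by masking values completely at random:

```python
# MCAR: every value has the same chance of going missing,
# regardless of Age or of the cigarette count itself
mcar = df_study.copy()
mask = rng.random(len(mcar)) < 0.3  # roughly 30% missing
mcar.loc[mask, "Number of Cigarettes"] = np.nan
```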
Missing At Random (MAR): Look for the tell-tale signs!
The name is actually misleading, since Missing At Random (MAR) occurs when the missingness process can be linked to the observed information in the data (though not to the missing information itself).
Consider the next example, where I removed the values of `Number of Cigarettes` for younger participants only (between 15 and 16 years old). Note that, despite the missingness process being clearly related to the observed values in `Age`, it is completely unrelated to the number of cigarettes smoked by these teens, had it been reported (note the “Complete” column, where both high and low numbers of cigarettes can be found among the missing values, had they been observed).
This would be the case if younger kids were less inclined to disclose their number of smoked cigarettes per day, avoiding admitting that they are regular smokers (regardless of the amount they smoke).
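In the same spirit, MAR could be simulated by masking the cigarette counts based only on the observed `Age` values:

```python
# MAR: missingness depends only on the observed Age values,
# never on the cigarette counts themselves
mar = df_study.copy()
mar.loc[mar["Age"].between(15, 16), "Number of Cigarettes"] = np.nan
```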
Missing Not At Random (MNAR): That ah-ha moment!
As expected, the Missing Not At Random (MNAR) mechanism is the trickiest of them all, since the missingness process may depend on both the observed and missing information in the data. This means that the probability of missing values occurring in a feature may be related to the observed values of other features in the data, as well as to the missing values of that feature itself!
Take a look at the next example: values are missing for higher amounts of `Number of Cigarettes`, which means that the probability of missing values in `Number of Cigarettes` is related to the missing values themselves, had they been observed (note the “Complete” column).
This would be the case of teens who refused to report their number of smoked cigarettes per day because they smoked a very large amount.
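And MNAR could be simulated by masking values based on the (soon-to-be-missing) cigarette counts themselves; the threshold of 15 cigarettes is arbitrary:

```python
# MNAR: missingness depends on the values that end up missing
mnar = df_study.copy()
mnar.loc[mnar["Number of Cigarettes"] > 15, "Number of Cigarettes"] = np.nan
```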
Throughout our simple example, we’ve seen how MCAR is the simplest of the missing mechanisms. In such a scenario, we may ignore many of the complexities that arise due to the appearance of missing values, and some simple fixes, such as listwise or casewise deletion, as well as simpler statistical imputation techniques, may do the trick.
Nevertheless, although convenient, the truth is that in real-world domains, MCAR is often unrealistic, and most researchers usually assume at least MAR in their studies, which is more general and realistic than MCAR. In this scenario, we may consider more robust strategies that can infer the missing information from the observed data. In this regard, data imputation strategies based on machine learning are generally preferred.
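As a quick taste of what that can look like, here’s a sketch using scikit-learn’s `KNNImputer` on the MAR-masked dummy data from above (the choice of 3 neighbors is arbitrary):

```python
from sklearn.impute import KNNImputer

# Infer the masked cigarette counts from the observed Age values
# (reuses the `mar` frame from the MAR example above)
imputer = KNNImputer(n_neighbors=3)
mar_imputed = pd.DataFrame(
    imputer.fit_transform(mar), columns=mar.columns
)
```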
Finally, MNAR is by far the most complex case, since it is very difficult to infer the causes for the missingness. Current approaches focus on mapping the causes for the missing values using correction factors defined by domain experts, inferring missing data from distributed systems, extending state-of-the-art models (e.g., generative models) to incorporate multiple imputation, or performing sensitivity analysis to determine how results change under different circumstances.
Also, when it comes to identifiability, the problem doesn’t get any easier.
Although there are some tests to distinguish MCAR from MAR, they are not widely used and have restrictive assumptions that do not hold for complex, real-world datasets. It is also impossible to distinguish MNAR from MAR, since the information that would be needed is missing.
To diagnose and distinguish missing mechanisms in practice, we may focus on hypothesis testing, sensitivity analysis, getting insights from domain experts, and investigating visualization techniques that can provide some understanding of the domain.
Naturally, there are other complexities to account for that condition the application of treatment strategies for missing data, namely the percentage of data that is missing, the number of features it affects, and the end goal of the technique (e.g., feeding a training model for classification or regression, or reconstructing the original values in the most authentic way possible).
All in all, not a simple job.
Let’s take this one step at a time. We’ve just taken in an overload of information on missing data and its complex entanglements.
In this example, we’ll cover the basics of how to mark and visualize missing data in a real-world dataset, and confirm the problems that missing data introduces to data science projects.
For that purpose, we’ll use the Pima Indians Diabetes dataset, available on Kaggle (License: CC0 Public Domain). If you’d like to follow along with the tutorial, feel free to download the notebook from the Data-Centric AI Community GitHub repository.
To quickly profile our data, we’ll also use `ydata-profiling`, which gives us a full overview of our dataset in just a few lines of code. Let’s start by installing it:
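```bash
pip install ydata-profiling
```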
Now, we can load the data and create a quick profile (a minimal sketch, assuming the CSV was saved locally as `diabetes.csv`):
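```python
import pandas as pd
from ydata_profiling import ProfileReport

# Load the Pima Indians Diabetes dataset
df = pd.read_csv("diabetes.csv")

# Build the profiling report and save it to an HTML file
profile = ProfileReport(df, title="Pima Indians Diabetes Profiling")
profile.to_file("diabetes_profile.html")
```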
From the profiling report, we can determine that this dataset consists of 768 records/rows/observations (768 patients) and 9 attributes or features. In fact, `Outcome` is the target class (1/0), so we have 8 numerical predictors, with the target being the single categorical feature.
At first glance, the dataset doesn’t seem to have missing data. However, this dataset is known to be affected by missing data! How can we confirm that?
In the “Alerts” section, we can see several “Zeros” alerts indicating that there are several features for which zero values make no sense or are biologically impossible: e.g., a zero value for body mass index or blood pressure is invalid!
Skimming through all features, we can determine that `Pregnancies` seems fine (having zero pregnancies is reasonable), but for the remaining features, zero values are suspicious:
In most real-world datasets, missing data is encoded by sentinel values:

- Out-of-range entries, such as `999`;
- Negative numbers where the feature has only positive values, e.g., `-1`;
- Zero values in a feature that could never be 0.
In our case, `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, and `BMI` all have missing data. Let’s count the number of zeros that these features have:
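One way to do it, reusing the data frame loaded above:

```python
# Features for which a zero value is physiologically implausible
suspicious = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count the zero entries in each suspicious feature
print((df[suspicious] == 0).sum())
```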
We can see that `Glucose`, `BloodPressure`, and `BMI` have just a few zero values, whereas `SkinThickness` and `Insulin` have a lot more, covering nearly half of the existing observations. This means we might consider different strategies to handle these features: some might require more complex imputation techniques than others, for instance.
To keep our dataset consistent with standard conventions, we should mark these missing values as `NaN` values.
This is the standard way to treat missing data in Python and the convention followed by popular packages like `pandas` and `scikit-learn`. `NaN` values are ignored by certain computations like `sum` or `count`, and are recognized by some functions to perform other operations (e.g., dropping the missing values, imputing them, replacing them with a fixed value, etc.).
We’ll mark our missing values using the `replace()` function, and then call `isnull()` to confirm that they were correctly encoded:
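A minimal sketch, reusing the `suspicious` list from above:

```python
import numpy as np

# Replace the zero sentinels with NaN in the affected features
df[suspicious] = df[suspicious].replace(0, np.nan)

# Verify: the NaN counts should match the zero counts from before
print(df[suspicious].isnull().sum())
```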
The count of `NaN` values matches the count of `0` values, which means we have marked our missing values correctly! We can then use the profile report again to check that the missing data is now recognized. Here’s what our “new” data looks like:
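Regenerating the report over the NaN-marked data could look like this (file names are just suggestions):

```python
# Profile the data again, now with missing values properly marked
profile = ProfileReport(df, title="Pima Indians Diabetes (NaN-marked)")
profile.to_file("diabetes_profile_nan.html")
```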
We can further inspect the characteristics of the missingness process by skimming through the “Missing Values” section of the report:
Besides the “Count” plot, which gives us an overview of all missing values per feature, we can explore the “Matrix” and “Heatmap” plots in more detail to hypothesize about the underlying missing mechanisms the data may suffer from. In particular, the correlation between missing features might be informative. In this case, there seems to be a significant correlation between `Insulin` and `SkinThickness`: both values seem to be simultaneously missing for some patients. Whether this is a coincidence (unlikely), or whether the missingness process can be explained by known factors (namely, portraying MAR or MNAR mechanisms), is something for us to stick our noses into!
Regardless, our data is now ready for analysis! Unfortunately, the process of handling missing data is far from over. Many classic machine learning algorithms cannot handle missing data, and we need to find expert ways to mitigate the issue. Let’s try to evaluate the Linear Discriminant Analysis (LDA) algorithm on this dataset:
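The exact snippet isn’t shown here, but a minimal sketch of such an evaluation could look like this (fitting directly, so the NaN problem surfaces right away):

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Separate the predictors from the target class
X = df.drop(columns="Outcome")
y = df["Outcome"]

# Fitting LDA on data that still contains NaN values fails:
# scikit-learn raises a ValueError about NaN inputs
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
```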
If you try to run this code, it will immediately throw an error:
The simplest way to fix this (and the most naive!) would be to remove all records containing missing values. We can do that by creating a new data frame with the rows containing missing values removed, using the `dropna()` function…
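```python
# Keep only the complete cases (rows without any missing values)
df_complete = df.dropna()
print(df_complete.shape)  # (392, 9): nearly half the rows are gone
```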
… and trying again:
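```python
# Re-fit LDA on the complete cases only: no NaN values, no error
X = df_complete.drop(columns="Outcome")
y = df_complete["Outcome"]

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
print(f"Training accuracy: {lda.score(X, y):.3f}")
```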
And there you have it! By dropping the missing values, the LDA algorithm can now operate normally.
However, the dataset size was substantially reduced to only 392 observations, which means we are losing nearly half of the available information.
For that reason, instead of simply dropping observations, we should look for imputation strategies, either statistical or machine-learning based. We could also use synthetic data to replace the missing values, depending on our final application.
And for that, we might try to get some insight into the underlying missing mechanisms in the data. Something to look forward to in future articles?