Missing Data Demystified: The Absolute Primer for Data Scientists
The Problem of Missing Data
The Impact of Missing Data Mechanisms
Identifying and Marking Missing Data
Final Thoughts
About me

Missing data is an interesting data imperfection since it may arise naturally due to the nature of the domain, or be inadvertently created during data collection, transmission, or processing.

In essence, missing data is characterized by the appearance of absent values in data, i.e., missing values in some records or observations within the dataset, and may either be univariate (one feature has missing values) or multivariate (several features have missing values):

Univariate versus multivariate missing data patterns. Image by Author.

Let’s consider an example: say we’re conducting a study on a patient cohort regarding diabetes.

Medical data is a relevant example here, since it is usually highly susceptible to missing values: patient values are taken from both surveys and laboratory results, may be measured several times throughout the course of diagnosis or treatment, are stored in several formats (sometimes distributed across institutions), and are often handled by different people. It can (and most definitely will) get messy!

In our diabetes study, the presence of missing values is likely to be related to the study being conducted or the information being collected.

As an example, missing data may arise due to a faulty sensor that shuts down for high values of blood pressure. Another possibility is that missing values in feature “weight” are more likely for older women, who are less inclined to disclose this information. Or obese patients may be less likely to share their weight.

Alternatively, data may also be missing for reasons that are in no way related to the study.

A patient may have some of their information missing because a flat tire caused them to miss a doctor’s appointment. Data may also be missing due to human error: for instance, if the person conducting the analysis misplaces or misreads some documents.

Regardless of the reason why data is missing, it is important to investigate whether datasets contain missing data prior to model building, as this problem can have severe consequences for classifiers:

  • Some classifiers cannot handle missing values internally: This makes them inapplicable when handling datasets with missing data. In some scenarios, these values are encoded with a pre-defined value, e.g., “0”, so that machine learning algorithms are able to cope with them, although this is not the best practice, especially for higher percentages of missing data (or more complex missing mechanisms);
  • Predictions based on missing data can be biased and unreliable: Although some classifiers can handle missing data internally, their predictions might be compromised, since an important piece of information might be missing from the training data.

Furthermore, although missing values may “all look the same”, the truth is that their underlying mechanisms (the reason why they are missing) can follow three main patterns: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR).

Keeping these different types of missing mechanisms in mind is important because they determine the choice of appropriate methods to handle missing data efficiently, as well as the validity of the inferences derived from them.

Let’s go over each mechanism real quick!

Missing Data Mechanisms

If you’re a math person, I’d suggest going through this paper (cough cough), namely Sections II and III, which contains all the notation and mathematical formulation you might be looking for (I was actually inspired by this book, which is also a very interesting primer; check Sections 2.2.3 and 2.2.4).

If you’re also a visual learner like me, you’d like to “see” it, right?

For that matter, we’ll take a look at the adolescent tobacco study example used in the paper. We’ll consider dummy data to showcase each missing mechanism:

Missing mechanisms example: a simulated dataset of a study on adolescent tobacco use, where the daily average of smoked cigarettes is missing under different mechanisms (MCAR, MAR, and MNAR). Image by Author.

One thing to keep in mind is this: the missing mechanisms describe whether and how the missingness pattern can be explained by the observed data and/or the missing data. It’s tricky, I know. But it will become clearer with the example!

In our tobacco study, we’re focusing on adolescent tobacco use. There are 20 observations, relative to 20 participants; feature Age is completely observed, whereas the Number of Cigarettes (smoked per day) will be missing according to different mechanisms.

Missing Completely At Random (MCAR): No harm, no foul!

In the Missing Completely At Random (MCAR) mechanism, the missingness process is completely unrelated to both the observed and missing data. That means that the probability of a feature having missing values is completely random.

MCAR mechanism: (a) Missing values in the number of cigarettes are completely random; (b) Example of an MCAR pattern in a real-world dataset. Image by Author.

In our example, I simply removed some values randomly. Note how the missing values are not located in a particular range of Age or Number of Cigarettes values. This mechanism can therefore occur due to unexpected events during the study: say, the person responsible for registering the participants’ responses accidentally skipped a question of the survey.
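As a sketch, MCAR could be simulated on dummy data like this (the column names and values are my own illustration, not taken from the paper):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Dummy tobacco-study data: 20 participants (illustrative columns)
df = pd.DataFrame({
    "Age": rng.integers(15, 20, size=20),
    "NumCigarettes": rng.integers(0, 30, size=20).astype(float),
})

# MCAR: drop ~25% of the values completely at random,
# irrespective of Age or of the values themselves
mcar_mask = rng.random(len(df)) < 0.25
df.loc[mcar_mask, "NumCigarettes"] = np.nan

print(df["NumCigarettes"].isna().sum(), "values missing completely at random")
```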

Missing At Random (MAR): Look for the tell-tale signs!

The name is actually misleading, since Missing At Random (MAR) occurs when the missingness process can be linked to the observed information in data (though not to the missing information itself).

Consider the next example, where I removed the values of Number of Cigarettes for younger participants only (between 15 and 16 years). Note that, despite the missingness process being clearly related to the observed values in Age, it is completely unrelated to the number of cigarettes smoked by these teens, had it been reported (note the “Complete” column, where both high and low numbers of cigarettes can be found among the missing values, had they been observed).

MAR mechanism: (a) Missing values in the number of cigarettes are related to Age; (b) Example of a MAR pattern in a real-world dataset: values in X_miss_1, X_miss_3, and X_miss_p are missing depending on the values of X_obs. Values corresponding to the highest/darkest values are missing. Image by Author.

This would be the case if younger kids were less inclined to disclose their number of smoked cigarettes per day, avoiding admitting that they are regular smokers (regardless of the amount they smoke).
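Simulating MAR on the same kind of dummy data is equally short; here the mask depends only on the observed Age (again, an illustration of mine rather than the paper’s exact setup):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Dummy tobacco-study data (illustrative columns)
df = pd.DataFrame({
    "Age": rng.integers(15, 20, size=20),
    "NumCigarettes": rng.integers(0, 30, size=20).astype(float),
})

# MAR: missingness is fully determined by the *observed* Age
# (the 15-16 year olds don't disclose), not by the value itself
young = df["Age"].between(15, 16)
df.loc[young, "NumCigarettes"] = np.nan

print(int(young.sum()), "young participants did not disclose their consumption")
```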

Missing Not At Random (MNAR): That ah-ha moment!

As expected, the Missing Not At Random (MNAR) mechanism is the trickiest of them all, since the missingness process may depend on both the observed and missing information in the data. This means that the probability of missing values occurring in a feature may be related to the observed values of other features in the data, as well as to the missing values of that feature itself!

Take a look at the next example: values are missing for higher amounts of Number of Cigarettes, which means that the probability of missing values in Number of Cigarettes is related to the missing values themselves, had they been observed (note the “Complete” column).

MNAR mechanism: (a) Missing values in the number of cigarettes correspond to the highest values, had they been observed; (b) Example of a MNAR pattern in a real-world dataset: values in X_miss depend on the values themselves (highest/darkest values are removed). Image by Author.

This would be the case of teens who refused to report their number of smoked cigarettes per day because they smoked a very large amount.
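And a sketch of MNAR: the mask now depends on the (soon to be unobserved) value itself, so every value that remains observed is necessarily small (dummy data of my own, as before):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Dummy tobacco-study data (illustrative columns)
df = pd.DataFrame({
    "Age": rng.integers(15, 20, size=20),
    "NumCigarettes": rng.integers(0, 30, size=20).astype(float),
})

# MNAR: the heaviest smokers do not report their consumption,
# so missingness depends on the missing value itself
heavy = df["NumCigarettes"] > 20
df.loc[heavy, "NumCigarettes"] = np.nan

# Every observed value is now necessarily <= 20
print("Largest observed value:", df["NumCigarettes"].dropna().max())
```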

Throughout our simple example, we’ve seen how MCAR is the simplest of the missing mechanisms. In such a scenario, we may ignore many of the complexities that arise due to the appearance of missing values, and some simple fixes such as listwise or casewise deletion, as well as simpler statistical imputation techniques, may do the trick.

Nevertheless, although convenient, the truth is that in real-world domains MCAR is often unrealistic, and most researchers usually assume at least MAR in their studies, which is more general and realistic than MCAR. In this scenario, we may consider more robust strategies that can infer the missing information from the observed data. In this regard, data imputation strategies based on machine learning are generally the most popular.

Finally, MNAR is by far the most complex case, since it is very difficult to infer the causes for the missingness. Current approaches focus on mapping the causes for the missing values using correction factors defined by domain experts, inferring missing data from distributed systems, extending state-of-the-art models (e.g., generative models) to incorporate multiple imputation, or performing sensitivity analysis to determine how results change under different circumstances.

Also, when it comes to identifiability, the problem doesn’t get any easier.

Although there are some tests to distinguish MCAR from MAR, they are not widely popular and have restrictive assumptions that do not hold for complex, real-world datasets. It is also impossible to distinguish MNAR from MAR, since the information that would be needed is missing.

To diagnose and distinguish missing mechanisms in practice, we may focus on hypothesis testing, sensitivity analysis, getting some insights from domain experts, and investigating visualization techniques that can provide some understanding of the domains.

Naturally, there are other complexities to account for which condition the application of treatment strategies for missing data, namely the percentage of data that is missing, the number of features it affects, and the end goal of the technique (e.g., feed a training model for classification or regression, or reconstruct the original values in the most authentic way possible?).

All in all, not a simple job.

Let’s take this step by step. We’ve just covered an overload of information on missing data and its complex entanglements.

In this example, we’ll cover the basics of how to mark and visualize missing data in a real-world dataset, and confirm the problems that missing data introduces to data science projects.

For that purpose, we’ll use the Pima Indians Diabetes dataset, available on Kaggle (License: CC0 Public Domain). If you’d like to follow along the tutorial, feel free to download the notebook from the Data-Centric AI Community GitHub repository.

To make a quick profiling of our data, we’ll also use ydata-profiling, which gives us a full overview of our dataset in just a few lines of code. Let’s start by installing it:

Installing the latest release of ydata-profiling. Snippet by Author.
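If you're following along, the install step might look like this (assuming pip and a recent Python environment; ydata-profiling is distributed on PyPI):

```shell
pip install ydata-profiling
```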

Now, we can load the data and make a quick profile:

Loading the data and creating the profiling report. Snippet by Author.

From the data overview, we can determine that this dataset consists of 768 records/rows/observations (768 patients) and 9 attributes or features. In fact, Outcome is the target class (1/0), so we have 8 predictors (8 numerical features and 1 categorical).

Profiling Report: Overall data characteristics. Image by Author.

At first glance, the dataset doesn’t appear to have missing data. However, this dataset is known to be affected by missing data! How can we confirm that?

In the “Alerts” section, we can see several “Zeros” alerts indicating that there are several features for which zero values make no sense or are biologically impossible: e.g., a zero value for body mass index or blood pressure is invalid!

Skimming through all features, we can determine that Pregnancies seems fine (having zero pregnancies is reasonable), but for the remaining features, zero values are suspicious:

Profiling Report: Data Quality Alerts. Image by Author.

In most real-world datasets, missing data is encoded by sentinel values:

  • Out-of-range entries, such as 999;
  • Negative numbers where the feature only takes positive values, e.g., -1;
  • Zero values in a feature that could never be 0.
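A quick scan for all three kinds of sentinels can be sketched with a small helper (sentinel_report is a hypothetical name of mine, not a library function):

```python
import pandas as pd

def sentinel_report(df: pd.DataFrame, sentinels=(999, -1, 0)) -> pd.DataFrame:
    """Count occurrences of each candidate sentinel value, per column."""
    return pd.DataFrame({s: (df == s).sum() for s in sentinels})

# Toy frame exhibiting all three kinds of sentinels
df = pd.DataFrame({
    "Glucose": [148, 85, 0],
    "SkinThickness": [35, 0, 0],
    "Pulse": [72, 999, -1],
})
print(sentinel_report(df))
```

Whether a flagged value is truly a sentinel still requires domain knowledge, as the Pregnancies case shows.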

In our case, Glucose, BloodPressure, SkinThickness, Insulin, and BMI all have missing data. Let’s count the number of zeros these features have:

Counting the number of zero values. Snippet by Author.
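The counting snippet might look like the following; a few stand-in rows replace the full 768-row CSV so the sketch stays self-contained:

```python
import pandas as pd

# A few stand-in rows in place of pd.read_csv("diabetes.csv")
df = pd.DataFrame({
    "Glucose": [148, 85, 183, 0, 137],
    "BloodPressure": [72, 66, 64, 0, 40],
    "SkinThickness": [35, 29, 0, 0, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 0.0, 43.1],
})

# Count zeros per suspicious feature
suspect = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
zero_counts = (df[suspect] == 0).sum()
print(zero_counts)
```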

We can see that Glucose, BloodPressure and BMI have just a few zero values, whereas SkinThickness and Insulin have a lot more, covering nearly half of the existing observations. This means we might consider different strategies to handle these features: some might require more complex imputation techniques than others, for instance.

To make our dataset consistent with data-specific conventions, we should mark these missing values as NaN values.

This is the standard way to treat missing data in Python and the convention followed by popular packages like pandas and scikit-learn. These values are ignored in certain computations like sum or count, and are recognized by some functions to perform other operations (e.g., drop the missing values, impute them, replace them with a fixed value, etc.).

We’ll mark our missing values using the replace() function, and then call isnan() to verify that they were correctly encoded:

Marking zero values as NaN values. Snippet by Author.
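A sketch of the marking step (stand-in rows again; on the real dataset the same two calls apply):

```python
import numpy as np
import pandas as pd

# Stand-in rows in place of the full dataset
df = pd.DataFrame({
    "Glucose": [148, 85, 0],
    "BloodPressure": [72, 0, 64],
    "SkinThickness": [35, 0, 0],
    "Insulin": [0, 0, 168],
    "BMI": [33.6, 0.0, 23.3],
})
cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

zero_counts = (df[cols] == 0).sum()

# Mark the invalid zeros as NaN, the Python convention for missing data
df[cols] = df[cols].replace(0, np.nan)

# Verify: the NaN counts should match the original zero counts
nan_counts = np.isnan(df[cols]).sum()
print(nan_counts.equals(zero_counts))
```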

The count of NaN values matches that of the 0 values, which means that we have marked our missing values correctly! We could then use the profile report again to check that the missing data is now recognized. Here’s how our “new” data looks:

Checking the generated alerts: “Missing” alerts are now highlighted. Image by Author.

We can further check some characteristics of the missingness process by skimming through the “Missing Values” section of the report:

Profiling Report: Investigating Missing Data. Screencast by Author.

Besides the “Count” plot, which gives us an overview of all missing values per feature, we can explore the “Matrix” and “Heatmap” plots in more detail to hypothesize on the underlying missing mechanisms the data may suffer from. In particular, the correlation between missing features might be informative. In this case, there seems to be a significant correlation between Insulin and SkinThickness: both values seem to be simultaneously missing for some patients. Whether this is a coincidence (unlikely), or whether the missingness process can be explained by known factors, namely portraying MAR or MNAR mechanisms, would be something for us to dive into!

Regardless, now we have our data ready for analysis! Unfortunately, the process of handling missing data is far from over. Many classic machine learning algorithms cannot handle missing data, and we need to find expert ways to mitigate the issue. Let’s try to evaluate the Linear Discriminant Analysis (LDA) algorithm on this dataset:

Evaluating the Linear Discriminant Analysis (LDA) algorithm with missing values. Snippet by Author.
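A sketch of what happens, on small stand-in arrays: scikit-learn's LinearDiscriminantAnalysis validates its input and rejects NaNs with a ValueError.

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Stand-in features containing the NaNs we just marked
X = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 137.0],
    "BMI": [33.6, np.nan, 23.3, 43.1],
})
y = pd.Series([1, 0, 1, 0], name="Outcome")

lda = LinearDiscriminantAnalysis()
try:
    lda.fit(X, y)
except ValueError as err:
    print(f"LDA cannot handle missing values: {err}")
```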

If you try to run this code, it will immediately throw an error:

The LDA algorithm cannot handle missing values internally, throwing an error message. Image by Author.

The simplest way to fix this (and the most naive!) would be to remove all records that contain missing values. We can do this by creating a new data frame with the rows containing missing values removed, using the dropna() function…

Dropping all rows/observations with missing values. Snippet by Author.
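The deletion step is a one-liner; sketched on stand-in rows:

```python
import numpy as np
import pandas as pd

# Stand-in rows with some missing values
df = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0, 137.0],
    "BMI": [33.6, 26.6, np.nan, 43.1],
    "Outcome": [1, 0, 1, 0],
})

# Listwise deletion: keep only the fully observed rows
df_clean = df.dropna(axis=0)
print(len(df), "rows before,", len(df_clean), "rows after")
```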

… and trying again:

Evaluating the LDA algorithm without missing values. Snippet by Author.
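On fully observed data, the evaluation could be sketched like this (synthetic features stand in for the remaining patients; cross-validated accuracy is one reasonable choice of metric):

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

# Synthetic stand-in for the fully observed rows left after dropna()
X = pd.DataFrame({
    "Glucose": rng.normal(120, 30, size=60),
    "BMI": rng.normal(32, 7, size=60),
})
y = pd.Series(rng.integers(0, 2, size=60), name="Outcome")

# With no NaNs left, LDA fits and cross-validates without error
scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=3, scoring="accuracy")
print("Mean accuracy:", scores.mean().round(3))
```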
LDA can now operate, although the dataset size is almost cut in half. Image by Author.

And there you have it! By dropping the missing values, the LDA algorithm can now operate normally.

However, the dataset size was substantially reduced to only 392 observations, which means we are losing nearly half of the available information.

For that reason, instead of simply dropping observations, we should look for imputation strategies, either statistical or machine-learning based. We could also use synthetic data to replace the missing values, depending on our final application.

And for that, we might try to get some insight into the underlying missing mechanisms in the data. Something to look forward to in future articles?
