
Missing values are quite common in real datasets. Over time, many methods have been proposed to cope with this issue. Typically, they consist either in removing the data that contain missing values or in imputing them with some technique.
In this article, I'll test a third alternative:
Doing nothing.
Indeed, some of the best models for tabular datasets (namely, XGBoost, LightGBM, and CatBoost) can natively handle missing values. So, the question I'll try to answer is:
Can these models handle missing values effectively, or would we obtain a better result with a preliminary imputation?
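To make the claim of native handling concrete, here is a minimal sketch (on a made-up toy dataset) of XGBoost training directly on data containing NaNs; LightGBM and CatBoost accept missing values in the same spirit:

```python
import numpy as np
from xgboost import XGBRegressor

# Toy data with missing values: XGBoost treats np.nan as "missing"
# and learns a default direction for it at each tree split.
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

model = XGBRegressor(n_estimators=10)
model.fit(X, y)           # no error, no imputation needed
print(model.predict(X))   # predictions even for rows with NaNs
```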
There seems to be a widespread belief that we must do something about missing values. For instance, I asked ChatGPT what I should do if my dataset contains missing values, and it suggested 10 different ways to get rid of them (you can read the full answer here).
But where does this belief come from?
Usually, these kinds of beliefs originate from historical models, in particular from linear regression. This is also the case here. Let's see why.
Suppose that we have this dataset:
If we tried to train a linear regression on these features, we would get an error. In fact, to be able to make predictions, linear regression must multiply each feature by a numeric coefficient. If one or more features are missing, it's impossible to make a prediction for that row.
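As a sketch, here is what happens with scikit-learn's LinearRegression on a toy stand-in for such a dataset (the exact error message may vary by version):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy stand-in for a dataset with missing values.
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])

# Raises ValueError: scikit-learn's linear regression
# refuses input that contains NaNs.
LinearRegression().fit(X, y)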
This is why many imputation methods have been proposed. For instance, one of the simplest possibilities is to replace the nulls with the feature's mean.
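As a sketch, mean imputation with scikit-learn's SimpleImputer looks like this:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan]])

# Replace each NaN with the mean of its column.
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
print(X_imputed)  # [[1.  2. ] [2.5 3. ] [4.  2.5]]
```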