An empirical evaluation of whether ML models make more mistakes when predicting on outliers
Outliers are observations that are very different from most of the population. Traditionally, practitioners harbor a certain mistrust of outliers, which is why ad-hoc measures such as removing them from the dataset are often adopted.
However, when working with real data, outliers are the order of the day. Sometimes, they are even more important than the other observations! Take, for instance, individuals who are outliers because they are very high-paying customers: you don't want to discard them; on the contrary, you most likely want to treat them with extra care.
An interesting, and rather unexplored, aspect of outliers is how they interact with ML models. My feeling is that data scientists believe that outliers harm the performance of their models. But this belief may be based more on a preconception than on actual evidence.
Thus, the question I will try to answer in this article is the following:
Is an ML model more likely to make mistakes when making predictions on outliers?
Suppose that we have a model that has been trained on these data points:
We then receive new data points for which the model must make predictions.
Let’s consider two cases:
- the new data point is an outlier, i.e. it is different from most of the training observations;
- the new data point is “standard”, i.e. it lies in an area that is densely populated with training points (a simple way to quantify this distinction is sketched below).
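To make the distinction between the two cases measurable, one option is to score each new point by its average distance to its k nearest training points: the larger the distance, the more "outlying" the point. Here is a minimal sketch of this idea; the function name and the choice of k are illustrative assumptions, not part of the original setup:

```python
from sklearn.neighbors import NearestNeighbors

def outlier_score(X_train, X_new, k=10):
    """Average Euclidean distance from each new point to its k nearest
    training points: the higher the score, the more "outlying" the point."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    distances, _ = nn.kneighbors(X_new)
    return distances.mean(axis=1)
```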
We would like to understand whether, in general, the outlier is harder to predict than the standard observation.
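One way to check this empirically is to train a model, score each test point by how far it lies from the training set, and compare the prediction error on the most outlying points with the error on the remaining ones. The sketch below uses synthetic data, a gradient boosting regressor, and a 95th-percentile threshold purely for illustration; these are my own assumptions, not the article's actual experimental setup:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

# Synthetic regression data, purely for illustration.
X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
abs_errors = np.abs(model.predict(X_test) - y_test)

# Score each test point by its average distance to the 10 nearest training
# points (same idea as the outlier_score helper sketched above).
distances, _ = NearestNeighbors(n_neighbors=10).fit(X_train).kneighbors(X_test)
scores = distances.mean(axis=1)

# Call the top 5% most distant test points "outliers" and compare errors.
is_outlier = scores > np.quantile(scores, 0.95)
print("MAE on outliers:       ", abs_errors[is_outlier].mean())
print("MAE on standard points:", abs_errors[~is_outlier].mean())
```

A single synthetic run like this proves nothing on its own; it only makes the question operational: compare an error metric across groups of points defined by an outlier score.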