Common mistakes and how to avoid them, with practical examples

I’ve done it over and over myself: hitting run on some model training code and having a “WOW” moment when the error scoring comes out great. Suspiciously great. Digging through the feature engineering code, there’s a calculation that baked future data into the training data, and fixing the feature brings those mean squared errors back up to reality. Now where’s that whiteboard again…
Time series problems have quite a few unique pitfalls. Luckily, with some diligence and a little practice, you’ll be accounting for these pitfalls long before typing from sklearn import into your notebook. Below are three things to look out for, and a few scenarios where you may run into them.
First up: look-ahead bias. This is almost certainly the first hazard you’ll encounter with time series, and overwhelmingly the most frequent one I see in entry-level portfolios (looking at you, generic stock market forecasting project). The good news is that it’s generally the easiest to avoid.
The Problem: Simply put, look-ahead bias is when your model is trained using future data it would not have access to in reality.
The typical way you’d introduce this issue into your code is by randomly splitting training and testing data into two chunks of a predetermined size (e.g. 80/20). Random sampling means both your training and test data cover the same time period, so you’ll have “leaked” knowledge of the future into your model.
When it comes time to validate with the test data, the model already knows what happens. You’ll inevitably get some pretty stellar, yet bogus, error scores this way.
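Here’s a minimal sketch of how that leak typically creeps in. The DataFrame, its feature, and its target are made up purely for illustration; the point is the shuffled split:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy daily series covering 2013-2023 (illustrative data only).
dates = pd.date_range("2013-01-01", "2023-12-31", freq="D")
df = pd.DataFrame(
    {"feature": np.random.randn(len(dates)),
     "target": np.random.randn(len(dates))},
    index=dates,
)

# DON'T: train_test_split shuffles rows by default, so both chunks mix
# rows from every year -- the model trains on dates later than its
# "test" dates, i.e. look-ahead bias.
X_train, X_test, y_train, y_test = train_test_split(
    df[["feature"]], df["target"], test_size=0.2
)
```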
The Fix: Split your dataset using a cutoff in time rather than randomly holding out a percentage of the data.
For instance, if I have data that covers 2013–2023, I might set 2013–2021 as my training data and 2022–2023 as my test data. In a straightforward use case, the test data then covers a time period the model is totally naive to, and your error scoring should be accurate. Remember, this also applies to the likes of k-fold cross-validation: standard k-fold shuffles observations across folds, so use a time-aware variant instead.
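A minimal sketch of the fix, continuing the toy df from above. The cutoff dates mirror the 2013–2023 example, and scikit-learn’s TimeSeriesSplit stands in for shuffled k-fold when you need cross-validation:

```python
from sklearn.model_selection import TimeSeriesSplit

# DO: hold out the most recent years instead of a random sample.
train = df.loc[:"2021-12-31"]  # 2013-2021 for training
test = df.loc["2022-01-01":]   # 2022-2023 for testing
X_train, y_train = train[["feature"]], train["target"]
X_test, y_test = test[["feature"]], test["target"]

# For cross-validation, TimeSeriesSplit keeps every validation fold
# strictly later in time than its training fold, unlike shuffled k-fold.
for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(df):
    assert train_idx.max() < val_idx.min()  # train always precedes validation
```

With this split, every row the model validates against sits after everything it trained on, which is exactly the guarantee a random 80/20 split throws away.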