The performance of ML models degrades as time passes and data distribution changes.
A recent study from MIT, Harvard, The University of Monterrey, and Cambridge showed that 91% of ML models degrade over time. This study is one of the first of its kind, where researchers focus on machine learning models’ behavior after deployment and how their performance evolves with unseen data.
“While much research has been done on various types and markers of temporal data drifts, there is no comprehensive study of how the models themselves can respond to these drifts.”
This blog post reviews the most critical parts of the research, highlights its results, and stresses their importance, especially for the ML industry.
If you have been previously exposed to concepts like covariate shift or concept drift, you may remember that changes in the distribution of the production data can affect a model’s performance. This phenomenon is one of the challenges of maintaining an ML model in production.
By definition, an ML model depends on the data it was trained on, meaning that if the distribution of the production data starts to change, the model may no longer perform as well as before. And as time passes, the model’s performance may degrade more and more. The authors refer to this phenomenon as “AI aging.” I prefer to call it model performance degradation, and depending on how significant the drop in performance is, we may even consider it a model failure.
To get a better understanding of this phenomenon, the authors developed a framework for identifying temporal model degradation. They applied the framework to 32 datasets from four industries, using four standard ML models, and investigated how temporal model degradation can develop under minimal drifts in the data.
To avoid any model bias, the authors selected four different standard ML methods (Linear Regression, Random Forest Regressor, XGBoost, and a Multilayer Perceptron Neural Network). Each of these methods represents a different mathematical approach to learning from data. By selecting different model types, they were able to investigate similarities and differences in the way diverse models can age on the same data.
Similarly, to avoid domain bias, they selected 32 datasets from four industries (Healthcare, Weather, Airport Traffic, and Financial).
Another critical decision is that they only investigated model-dataset pairs with good initial performance. This decision is crucial because it is not worthwhile to investigate the degradation of a model with a poor initial fit.
To identify temporal model performance degradation, the authors designed a framework that emulates a typical production ML model and ran multiple dataset-model experiments following it.
For each experiment, they did four things:
- Randomly select one yr of historical data as training data
- Select an ML model
- Randomly pick a future datetime point at which to test the model
- Calculate the model’s performance change
To better understand the framework, we need a couple of definitions. The most recent point in the training data is defined as t_0. The number of days between t_0 and the point in the future where the model is tested is defined as dT, which represents the model’s age.
For example, imagine a weather forecasting model trained with data from January 1st to December 31st, 2022. On February 1st, 2023, we ask it to make a weather forecast.
In this case:
- t_0 = December 31st, 2022, since it is the most recent point in the training data.
- dT = 32 days (the number of days between December 31st and February 1st). This is the age of the model. The short snippet after this list shows the same calculation in code.
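For readers who prefer code, here is a minimal sketch of the same calculation using Python’s standard `datetime` module (the variable names are just illustrative):

```python
from datetime import date

t_0 = date(2022, 12, 31)             # most recent point in the training data
prediction_date = date(2023, 2, 1)   # the day we ask the model for a forecast

dT = (prediction_date - t_0).days    # model age in days
print(dT)  # 32
```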
The diagram below summarizes how they performed every “history-future” simulation. We have added annotations to make it easier to follow.
To quantify the model’s performance change, they measured the mean squared error (MSE) at time t_0, denoted MSE(t_0), and at the time of the model evaluation, denoted MSE(t_1).
Since MSE(t_0) is expected to be low (every model was generalizing well at dates close to training), one can measure the relative performance error as the ratio of MSE(t_1) to MSE(t_0):
E_rel = MSE(t_1)/MSE(t_0)
The researchers ran 20,000 experiments of this kind for each dataset-model pair, with t_0 and dT randomly sampled from a uniform distribution.
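To make the “history-future” protocol concrete, here is a minimal, self-contained sketch on synthetic data. This is not the authors’ code or datasets: the drifting synthetic series, the 30-day evaluation windows, the linear regression model, and the 1,000 runs are all assumptions made to keep the example small and runnable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic daily data with a slowly drifting relationship between features and target.
dates = pd.date_range("2018-01-01", "2023-12-31", freq="D")
X = rng.normal(size=(len(dates), 3))
drift = np.linspace(0, 2, len(dates))  # slow concept drift over time
y = X @ np.array([1.0, -0.5, 0.3]) + drift * X[:, 0] + rng.normal(scale=0.1, size=len(dates))
df = pd.DataFrame(X, index=dates, columns=["x1", "x2", "x3"]).assign(y=y)

results = []
for _ in range(1000):  # the paper uses 20,000 runs per dataset-model pair
    # Randomly choose the last day of a one-year training window (t_0) and a model age dT.
    t_0 = dates[rng.integers(365, len(dates) - 400)]
    dT = int(rng.integers(30, 365))
    train = df.loc[t_0 - pd.Timedelta(days=365): t_0]
    test_t0 = df.loc[t_0: t_0 + pd.Timedelta(days=30)]                                # window near training
    test_t1 = df.loc[t_0 + pd.Timedelta(days=dT): t_0 + pd.Timedelta(days=dT + 30)]   # window dT days later

    model = LinearRegression().fit(train[["x1", "x2", "x3"]], train["y"])
    mse_t0 = mean_squared_error(test_t0["y"], model.predict(test_t0[["x1", "x2", "x3"]]))
    mse_t1 = mean_squared_error(test_t1["y"], model.predict(test_t1[["x1", "x2", "x3"]]))
    results.append({"dT": dT, "E_rel": mse_t1 / mse_t0})  # relative performance error

aging = pd.DataFrame(results)
```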
After running all of these experiments, they reported an aging model chart for each dataset-model pair. This chart contains 20,000 purple points, each representing the relative performance error E_rel obtained dT days after training.
The chart summarizes how the model’s performance changes as the model’s age increases. Key takeaways:
- The error increases over time: the model becomes less and less performant as time passes. This may happen due to drift in any of the model’s features or due to concept drift.
- The error variability increases over time: the gap between the best and worst-case scenarios widens as the model ages. When an ML model has high error variability, it sometimes performs well and sometimes performs badly. The model’s performance is not just degrading; it is also becoming erratic.
The reasonably low median model error may still create the illusion of accurate model performance while the actual outcomes become less and less certain.
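One way to see this variability is to compute percentile bands of E_rel by model age, like the 25th percentile, median, and 75th percentile lines discussed further below. The snippet assumes the `aging` DataFrame from the sketch above, not the paper’s exact procedure, and the weekly binning is an arbitrary choice:

```python
# Bin the relative errors by model age (weeks) and compute the 25th percentile,
# median, and 75th percentile, mirroring the bands on the aging chart.
aging["age_bin"] = (aging["dT"] // 7) * 7
bands = (
    aging.groupby("age_bin")["E_rel"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
    .rename(columns={0.25: "p25", 0.5: "median", 0.75: "p75"})
)
print(bands.head())
```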
After performing the experiments for all 4 (models) x 32 (datasets) = 128 (model, dataset) pairs, temporal model degradation was observed in 91% of the cases. Here we’ll look at the four most common degradation patterns and their impact on ML model implementations.
Although no strong degradation was observed in the two examples below, these results still present a challenge. Looking at the original Patient and Weather datasets, we can see that the patient data has many outliers in the Delay variable, while the weather data has seasonal shifts in the Temperature variable. But even with these two behaviors in the target variables, both models appear to perform accurately over time.
The authors claim that these and similar results reveal that data drifts alone cannot be used to explain model failures or to trigger model quality checks and retraining.
We have also observed this in practice. Data drift does not necessarily translate into model performance degradation. That is why, in our ML monitoring workflow, we focus on performance monitoring and use data drift detection tools only to investigate plausible explanations of a degradation issue, since data drift alone should not be used to trigger model quality checks.
Model performance degradation can escalate very abruptly. Looking at the plot below, we can see that both models were performing well in the first year. But at some point, they began to degrade at an explosive rate. The authors claim that these degradations cannot be explained by a particular drift in the data alone.
Let’s compare two model aging plots built from the same dataset but with different ML models. On the left, we see an explosive degradation pattern, while on the right, almost no degradation is visible. Both models were performing well initially, but the neural network appeared to degrade in performance faster than the linear regression (labeled as the RV model).
Given these and similar results, the authors concluded that temporal model quality depends on the choice of the ML model and its stability on a particular dataset.
In practice, we can deal with this kind of phenomenon by continuously monitoring the estimated model performance. This allows us to address performance issues before an explosive degradation occurs.
While the yellow (25th percentile) and black (median) lines remain at relatively low error levels, the gap between them and the red line (75th percentile) increases significantly with time. As mentioned before, this may create the illusion of accurate model performance while the true model outcomes become less and less certain.
Neither the data nor the model alone can be used to guarantee consistent predictive quality. Instead, temporal model quality is determined by the stability of a particular model applied to particular data at a particular time.
Once we have found the underlying reason for the model aging problem, we can search for the best way to fix it. The right solution is context-dependent, so there is no simple fix that suits every problem.
Every time we see model performance degradation, we should investigate the issue and understand the reason behind it. Automatic fixes are almost impossible to generalize to every situation, since the degradation can have multiple causes.
In the paper, the authors proposed a potential solution to the temporal degradation problem. It is focused on ML model retraining and assumes that we have access to newly labeled data, that there are no data quality issues, and that there is no concept drift. To make this solution practically feasible, they mentioned that one needs the following:
1. Alert when your model must be retrained.
Alerting when the model’s performance has been degrading is not a trivial task. One needs access to the latest ground truth or the ability to estimate the model’s performance. Solutions like DLE and CBPE from NannyML can help with this. For example, DLE (Direct Loss Estimation) and CBPE (Confidence-Based Performance Estimation) use probabilistic methods to estimate the model’s performance even when targets are absent. They monitor the estimated performance and alert when the model has degraded.
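As an illustration (not NannyML’s actual API), a simple alerting rule on top of an estimated-performance series could look like the sketch below; the `estimated_mse` series, the baseline value, and the threshold are assumptions:

```python
import pandas as pd

def check_degradation(estimated_mse: pd.Series, baseline_mse: float, threshold: float = 1.5) -> pd.Series:
    """Flag monitoring periods where the estimated MSE exceeds `threshold` times
    the baseline MSE observed right after training (i.e. E_rel > threshold)."""
    e_rel = estimated_mse / baseline_mse
    return e_rel > threshold

# Hypothetical usage: `estimated_mse` would come from a performance estimator
# (e.g. a tool like NannyML's DLE for regression), indexed by monitoring period.
# alerts = check_degradation(estimated_mse, baseline_mse=0.12)
# if alerts.any():
#     notify_team(alerts[alerts].index)  # hypothetical notification hook
```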
2. Develop an efficient and robust mechanism for automatic model retraining.
If we know that there is no data quality issue or concept drift, frequently retraining the ML model with the latest labeled data could help. However, this may create new challenges, such as lack of model convergence, suboptimal changes to the training parameters, and “catastrophic forgetting,” the tendency of an artificial neural network to abruptly forget previously learned information upon learning new information.
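One simple retraining mechanism is to refit the model from scratch on a rolling window of the most recent labeled data, which sidesteps incremental updates and, with them, catastrophic forgetting. The sketch below is only an illustration under the assumptions above (fresh labels available, no concept drift or data quality issues); the window length is arbitrary:

```python
import pandas as pd
from sklearn.base import clone

def retrain_on_rolling_window(model, labeled_df: pd.DataFrame, feature_cols, target_col,
                              as_of: pd.Timestamp, window_days: int = 365):
    """Refit a fresh copy of `model` on the most recent `window_days` of labeled data.

    `labeled_df` is assumed to have a DatetimeIndex. A new, unfitted copy of the
    model (same hyperparameters) is trained, so no previous weights are reused.
    """
    window = labeled_df.loc[as_of - pd.Timedelta(days=window_days): as_of]
    fresh = clone(model)
    fresh.fit(window[feature_cols], window[target_col])
    return fresh
```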
3. Have constant access to the most recent ground truth.
The most recent ground truth allows us to retrain the ML model and to calculate the realized performance. The problem is that, in practice, ground truth is often delayed, or it is expensive and time-consuming to get newly labeled data.
When retraining is very expensive, one potential solution is to keep a model catalog and use the estimated performance to select the model with the best expected performance. This could fix the issue of different models aging differently on the same dataset.
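A sketch of that idea, assuming we already have an estimated-performance score per candidate model (higher is better); the dictionaries and names are purely illustrative:

```python
def pick_best_model(catalog: dict, estimated_scores: dict):
    """Return the catalog model with the highest estimated performance.

    `catalog` maps model names to fitted models; `estimated_scores` maps the same
    names to estimated performance scores (e.g. negative estimated MSE).
    """
    best_name = max(estimated_scores, key=estimated_scores.get)
    return best_name, catalog[best_name]
```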
Other popular solutions used in the industry are reverting your model back to a previous checkpoint, fixing the issue downstream, or changing the business process. To learn more about when you should use each solution, check out our previous blog post on how to address data distribution shift.
The study by Vela et al. showed that an ML model’s performance does not remain static, even when it achieves high accuracy at the time of deployment, and that different ML models age at different rates even when trained on the same datasets. Another relevant observation is that not all temporal drifts cause performance degradation. Therefore, the choice of the model and its stability on the data becomes one of the most critical aspects of dealing with temporal performance degradation.
These results provide theoretical backing for why monitoring solutions are vital for the ML industry. Moreover, they show that ML model performance is susceptible to degradation. That is why every production ML model must be monitored; otherwise, it may fail without alerting the business.
Vela, D., Sharp, A., Zhang, R., et al. Temporal quality degradation in AI models. Sci Rep 12, 11654 (2022). https://doi.org/10.1038/s41598-022-15245-z