Theoretical Concepts & Tools
Data Validation: Data validation is the process of ensuring data quality and integrity. What do I mean by that?
As you automatically gather data from different sources (in our case, an API), you need a way to continually validate that the data you just extracted follows a set of rules that your system expects.
For instance, you expect that the energy consumption values are:
- of type float,
- not null,
- ≥0.
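As a minimal sketch of what checking these three rules looks like before ingestion, here is a plain-Python version (the `validate_energy_consumption` helper name is mine, purely for illustration):

```python
import math


def validate_energy_consumption(values):
    """Check the expected rules: each value must be a non-null float >= 0.

    Returns a list of (index, reason) tuples, one per violation.
    """
    errors = []
    for i, value in enumerate(values):
        if value is None:
            errors.append((i, "null"))
        elif not isinstance(value, float) or math.isnan(value):
            errors.append((i, "not a valid float"))
        elif value < 0:
            errors.append((i, "negative"))
    return errors


# A clean batch passes; a dirty one reports every violation.
print(validate_energy_consumption([12.5, 0.0, 3.7]))    # []
print(validate_energy_consumption([12.5, None, -3.1]))  # [(1, 'null'), (2, 'negative')]
```

In practice you would not hand-roll this; that is exactly what a tool like Great Expectations automates, as described next.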
While you developed the ML pipeline, the API returned only values that respected these rules, which data people call a “data contract.”
But as you let your system run in production for 1 month, 1 year, 2 years, and so on, you never know what could change in data sources you have no control over.
Thus, you need a way to continuously check these properties before ingesting the data into the Feature Store.
Note: To see how to extend this idea to unstructured data, such as images, check out my Master Data Integrity to Clean Your Computer Vision Datasets article.
Great Expectations (aka GE): GE is a popular tool that lets you easily validate data and report the results. Hopsworks has GE support: you can add a GE validation suite to Hopsworks and choose how to behave when new data is inserted and the validation step fails. Read more about GE + Hopsworks [2].
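To make this concrete, here is a sketch of a validation suite for our energy consumption rules, written in GE's JSON-style suite format. The expectation types below are standard GE built-ins; the suite name and column name are illustrative, not taken from the actual pipeline:

```python
# A Great Expectations suite, expressed as the plain dict that GE serializes
# to JSON. Each entry maps one rule of the data contract to a GE expectation.
energy_suite = {
    "expectation_suite_name": "energy_consumption_suite",
    "expectations": [
        {  # rule: not null
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "energy_consumption"},
        },
        {  # rule: of type float
            "expectation_type": "expect_column_values_to_be_of_type",
            "kwargs": {"column": "energy_consumption", "type_": "float64"},
        },
        {  # rule: >= 0 (no max bound)
            "expectation_type": "expect_column_values_to_be_between",
            "kwargs": {"column": "energy_consumption", "min_value": 0},
        },
    ],
}

# In Hopsworks, a suite like this is attached to a feature group so that
# every insert is validated automatically (see [2] for the exact API).
for expectation in energy_suite["expectations"]:
    print(expectation["expectation_type"])
```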
Ground Truth Types: While your model is running in production, you can access your ground truth (GT) in 3 different scenarios:
- real-time: an ideal scenario where you can easily access your target. For instance, when you recommend an ad, the user either clicks it or not.
- delayed: eventually, you will access the ground truth. But, unfortunately, it will be too late to react adequately in time.
- none: you can't automatically collect any GT. Usually, in these cases, you have to hire human annotators if you need any actuals.
In our case, we are somewhere between #1 and #2. The GT is not precisely real-time, but it has a delay of only 1 hour.
Whether a delay of 1 hour is acceptable depends a lot on the business context, but let's say that, in our case, it is okay.
Since we considered a 1-hour delay acceptable for our use case, we are in luck: we have access to the GT in real time(ish).
This means we can use metrics such as MAPE to monitor the model's performance in real time(ish).
In scenarios 2 or 3, we would have needed to use data & concept drift as proxy metrics to compute performance signals in time.
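Since the GT arrives with only a 1-hour delay, computing the real-time(ish) MAPE is straightforward. A minimal sketch (the `mape` helper is mine; scikit-learn also ships an equivalent metric):

```python
def mape(actuals, predictions):
    """Mean Absolute Percentage Error, in percent.

    Zero actuals are skipped to avoid division by zero; a production
    version would need an explicit policy for them (e.g., SMAPE).
    """
    pairs = [(a, p) for a, p in zip(actuals, predictions) if a != 0]
    if not pairs:
        raise ValueError("no non-zero actuals to score against")
    return 100 * sum(abs((a - p) / a) for a, p in pairs) / len(pairs)


# One hour later, the actual consumption arrives and we score the forecast.
print(round(mape([100.0, 200.0, 50.0], [110.0, 180.0, 50.0]), 2))  # 6.67
```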
ML Monitoring: ML monitoring is the process of ensuring that your production system works well over time. It also gives you a mechanism to proactively adapt your system, such as retraining your model in time or adapting it to new changes in the environment.
In our case, we will continually compute the MAPE metric. Thus, if the error suddenly spikes, you can create an alarm to notify you or automatically trigger a hyperparameter-tuning step to adapt the model configuration to the new environment.
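One simple way to wire up such an alarm is a threshold check over the recent MAPE history. The `should_alarm` helper, the 15% threshold, and the window size below are all illustrative choices, not part of the original pipeline:

```python
def should_alarm(mape_history, threshold=15.0, window=3):
    """Return True when the mean MAPE (in %) over the last `window` runs
    exceeds `threshold`, smoothing out single-run noise before alerting."""
    if len(mape_history) < window:
        return False
    recent = mape_history[-window:]
    return sum(recent) / window > threshold


print(should_alarm([5.2, 6.1, 5.8]))               # healthy history: False
print(should_alarm([5.2, 6.1, 21.0, 24.5, 26.3]))  # sustained spike: True
```

In a real deployment, a `True` result would publish a notification or enqueue the retraining/tuning job instead of just printing.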