Ensuring Trustworthy ML Systems With Data Validation and Real-Time Monitoring
Table of Contents:
Course Introduction
Course Lessons:
Data Source
Lesson 5
Lesson 5: Code
Conclusion
References

Theoretical Concepts & Tools

Data Validation: Data validation refers to the process of ensuring data quality and integrity. What do I mean by that?

As you automatically gather data from different sources (in our case, an API), you need a way to continually validate that the data you just extracted follows the set of rules your system expects.

For instance, you expect that the energy consumption values are:

  • of type float,
  • not null,
  • ≥0.

While you developed the ML pipeline, the API returned only values that respected these terms, or, as data people call it, a “data contract.”

But, as you let your system run in production for 1 month, 1 year, 2 years, etc., you will never know what could change in data sources you don't control.

Thus, you need a way to continuously check these characteristics before ingesting the data into the Feature Store.
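As a rough illustration of what such a check could look like, here is a minimal sketch in plain Python that tests the three rules above (float, not null, ≥0) on a batch of records. The field name `energy_consumption`, the function name, and the sample batch are assumptions made up for this example, not part of the course code.

```python
import math


def validate_energy_consumption(records, field="energy_consumption"):
    """Return (row_index, reason) pairs for values that break the data contract."""
    violations = []
    for index, record in enumerate(records):
        value = record.get(field)
        if value is None:
            violations.append((index, "null value"))
        elif not isinstance(value, float):
            violations.append((index, f"expected float, got {type(value).__name__}"))
        elif math.isnan(value):
            # NaN is a float, but it carries no usable measurement.
            violations.append((index, "null value (NaN)"))
        elif value < 0:
            violations.append((index, "negative value"))
    return violations


# Made-up sample batch: one valid row and three rows that break the contract.
batch = [
    {"energy_consumption": 12.5},
    {"energy_consumption": None},
    {"energy_consumption": -3.0},
    {"energy_consumption": "7.1"},
]

violations = validate_energy_consumption(batch)
if violations:
    print(f"Rejecting batch, {len(violations)} contract violations: {violations}")
```

In a real pipeline, you would run a check like this right after extraction and decide whether to drop the bad rows, reject the whole batch, or only raise a warning.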

Note: To see how to extend this concept to unstructured data, such as images, you can check my Master Data Integrity to Clean Your Computer Vision Datasets article.

Great Expectations (aka GE): GE is a popular tool that makes it easy to validate data and report the results. Hopsworks has GE support. You can add a GE validation suite to Hopsworks and choose how it should behave when new data is inserted and the validation step fails. Read more about GE + Hopsworks [2].
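To make this more concrete, the three rules for the energy consumption values could be described as a GE expectation suite. The sketch below writes the suite as a plain Python dictionary that mirrors the JSON layout used by pre-1.0 Great Expectations releases (newer releases use different key names). The suite name and the column name `energy_consumption` are hypothetical, and the dictionary is only a description of the rules: it does not call GE or Hopsworks.

```python
# Assumed layout: the serialized (JSON) form of an expectation suite
# in pre-1.0 Great Expectations releases.
energy_suite = {
    "expectation_suite_name": "energy_consumption_suite",  # hypothetical name
    "expectations": [
        {
            # not null
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "energy_consumption"},
        },
        {
            # of type float (dtype name as used by the pandas backend)
            "expectation_type": "expect_column_values_to_be_of_type",
            "kwargs": {"column": "energy_consumption", "type_": "float64"},
        },
        {
            # >= 0
            "expectation_type": "expect_column_values_to_be_between",
            "kwargs": {"column": "energy_consumption", "min_value": 0},
        },
    ],
}

for expectation in energy_suite["expectations"]:
    print(expectation["expectation_type"], expectation["kwargs"])
```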

Screenshot of GE data validation runs inside Hopsworks [Image by the Author].

Ground Truth Types: While your model is running in production, you can have access to your ground truth (GT) in 3 different scenarios:

  1. real-time: an ideal scenario where you can easily access your target. For example, when you recommend an ad and the consumer either clicks it or not.
  2. delayed: eventually, you will access the ground truth. Unfortunately, it will be too late to react adequately in time.
  3. none: you can't automatically collect any GT. Usually, in these cases, you have to hire human annotators if you need any actuals.

Ground truth/targets/actuals types [Image by the Author].

In our case, we are somewhere between #1 and #2. The GT isn't precisely in real-time, but it has a delay of only 1 hour.

Whether a delay of 1 hour is OK depends a lot on the business context, but let's say that, in your case, it is okay.

As we considered that a delay of 1 hour is okay for our use case, we are in luck: we have access to the GT in real-time(ish).

This means we can use metrics such as MAPE to monitor the model's performance in real-time(ish).
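As a reminder, MAPE (mean absolute percentage error) is the average of |actual - prediction| / |actual| over all scored points. Below is a minimal sketch of how you could compute it only for the hours where the delayed GT has already arrived. The function names, the hourly keys, and the sample values are made up for illustration, and skipping zero actuals is an assumption of this sketch (the percentage error is undefined there), not something the course prescribes.

```python
def mape(actuals, predictions):
    """Mean absolute percentage error as a fraction (0.1 means 10%)."""
    if len(actuals) != len(predictions):
        raise ValueError("actuals and predictions must have the same length")
    # Assumption: zero actuals are skipped because the ratio is undefined there.
    errors = [abs(a - p) / abs(a) for a, p in zip(actuals, predictions) if a != 0]
    if not errors:
        raise ValueError("no non-zero actuals to score")
    return sum(errors) / len(errors)


def mape_on_available_gt(predictions_by_hour, observations_by_hour):
    """Score only the hours for which the ground truth is already available."""
    hours = sorted(set(predictions_by_hour) & set(observations_by_hour))
    return mape(
        [observations_by_hour[hour] for hour in hours],
        [predictions_by_hour[hour] for hour in hours],
    )


# Made-up sample values: the observation for the last hour has not arrived yet.
predictions = {"2023-05-01T00": 110.0, "2023-05-01T01": 180.0, "2023-05-01T02": 95.0}
observations = {"2023-05-01T00": 100.0, "2023-05-01T01": 200.0}

score = mape_on_available_gt(predictions, observations)
print(f"MAPE over hours with ground truth: {score:.1%}")
```

Forecasts without a matching observation are simply left out of the score until their GT shows up.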

In scenarios 2 or 3, we would have had to use data and concept drift as proxy metrics to compute performance signals in time.
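To give an idea of what such a proxy could look like, here is a small sketch of one possible data drift signal: the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the empirical distributions of a reference window and a recent window of a feature. This is just one common choice that I picked for illustration. The course does not prescribe it, and the sample values are made up.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest absolute gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref = sorted(reference)
    cur = sorted(current)
    max_gap = 0.0
    # The largest gap is always reached at one of the observed values.
    for x in sorted(set(ref) | set(cur)):
        ref_cdf = bisect_right(ref, x) / len(ref)
        cur_cdf = bisect_right(cur, x) / len(cur)
        max_gap = max(max_gap, abs(ref_cdf - cur_cdf))
    return max_gap


# Made-up sample values: the recent window is shifted toward higher consumption.
training_window = [10.0, 11.0, 12.0, 13.0, 14.0]
recent_window = [13.0, 14.0, 15.0, 16.0, 17.0]

print(f"KS statistic: {ks_statistic(training_window, recent_window):.2f}")
```

A value close to 0 suggests the feature still looks like the data the model was trained on, while a value close to 1 signals a strong shift. The threshold at which you react is something you would have to calibrate for your own data.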

Screenshot with the observations and predictions overlapped over time. As you can see, the GT is not available for the latest 24 hours of forecasts [Image by the Author].

ML Monitoring: ML monitoring is the process of ensuring that your production system works well over time. It also gives you a mechanism to proactively adapt your system, such as retraining your model in time or adapting it to new changes in the environment.

In our case, we will continually compute the MAPE metric. Thus, if the error suddenly spikes, you can create an alarm to inform you or automatically trigger a hyperparameter tuning step to adapt the model configuration to the new environment.
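One simple way to define "suddenly spikes" is to compare the latest MAPE against the average of the recent values. The sketch below does exactly that. The function name, the `ratio` and `min_history` parameters, and the sample numbers are assumptions made up for this example, so tune them to your own use case.

```python
def should_alert(mape_history, latest_mape, ratio=2.0, min_history=3):
    """Flag a spike when the latest MAPE exceeds `ratio` times the recent average."""
    if len(mape_history) < min_history:
        return False  # not enough history to define a baseline
    baseline = sum(mape_history) / len(mape_history)
    return latest_mape > ratio * baseline


# Made-up sample values: recent MAPE scores hover around 9%.
history = [0.08, 0.10, 0.09]

for latest in (0.11, 0.25):
    if should_alert(history, latest):
        print(f"MAPE spiked to {latest:.0%}: notify someone or trigger a tuning step")
```

A relative threshold like this adapts to the normal error level of your model, at the cost of reacting slowly if the error degrades gradually instead of spiking.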

Screenshot with the mean MAPE metric between all the time series computed over time [Image by the Author].
