From Data Platform to ML Platform
Start of the Journey: Online Service + OLTP + OLAP
Data lake: Storage-Compute Separation + Schema on Read
Realtime Data/ML Infra: Data River + Data Streaming + Feature Store + Metric Server
MLOps: Abstraction, Observability and Scalability
What Next

There is nothing wrong with those systems as long as they fulfill business requirements. All systems that fulfill our business needs are good systems. If they are simple, even better.

At this stage, there are multiple ways of doing data analysis:

  1. Simply submit queries to the OLTP database’s replica node. (Not recommended.)
  2. Enabling CDC (Change Data Capture) on the OLTP database and ingesting that data into the OLAP database. When it comes to choosing an ingestion service for CDC logs, you can select based on the OLAP database you’ve chosen. For instance, Flink data streaming with CDC connectors is one way to handle this. Many enterprise services come with their own recommended solution, e.g. Snowpipe for Snowflake. It is also advisable to load data from a replica node to preserve the CPU/IO bandwidth of the master node for online traffic.
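To make the CDC path concrete, here is a minimal, self-contained sketch of how a stream of change events could be replayed onto a downstream table. The event shape (Debezium-like `op` codes `c`/`u`/`d`) and all field names are illustrative assumptions, not the API of any particular connector.

```python
# Sketch: replaying CDC (Change Data Capture) events onto a downstream table.
# The event format below is a simplified, Debezium-like shape and is an
# assumption for illustration: op "c" = create, "u" = update, "d" = delete.

def apply_cdc_events(table: dict, events: list) -> dict:
    """Replay CDC events onto an in-memory table keyed by primary key."""
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("c", "u"):      # create or update: upsert the row
            table[key] = event["row"]
        elif op == "d":           # delete: drop the row if present
            table.pop(key, None)
    return table

events = [
    {"op": "c", "key": 1, "row": {"id": 1, "status": "new"}},
    {"op": "u", "key": 1, "row": {"id": 1, "status": "paid"}},
    {"op": "c", "key": 2, "row": {"id": 2, "status": "new"}},
    {"op": "d", "key": 2, "row": None},
]
print(apply_cdc_events({}, events))  # {1: {'id': 1, 'status': 'paid'}}
```

A real pipeline would consume these events from a log (e.g. via a Flink CDC connector) rather than a Python list, but the upsert/delete semantics are the same.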

At this stage, ML workloads are likely running in your local environment. You can set up a Jupyter notebook locally, load structured data from the OLAP database, and then train your ML model locally.
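As a rough sketch of this local workflow, the snippet below pulls rows out of a database (sqlite3 stands in for the OLAP warehouse here) and fits a toy one-feature linear model with closed-form least squares. The `orders` table and its columns are hypothetical.

```python
import sqlite3

# Stand-in for the OLAP database: an in-memory SQLite table of orders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (amount REAL, shipping_cost REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(10.0, 1.0), (20.0, 2.0), (40.0, 4.0)],
)

# "Load structured data from the OLAP database" step.
rows = conn.execute("SELECT amount, shipping_cost FROM orders").fetchall()
xs, ys = zip(*rows)

# "Train locally" step: closed-form least squares for y = w * x
# (no intercept): w = sum(x*y) / sum(x*x).
w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
print(f"fitted weight: {w:.2f}")  # fitted weight: 0.10
```

In practice you would load through a JDBC/ODBC client into a dataframe and use a real training library; the point is that both the data pull and the training run on a single local machine at this stage.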

The potential challenges of this architecture include, but are not limited to:

  • It is difficult to manage unstructured or semi-structured data with an OLAP database.
  • OLAP databases might suffer performance regressions when it comes to massive data processing (more than a TB of data required for a single ETL task).
  • Lack of support for various compute engines, e.g. Spark or Presto. Most compute engines do support connecting to an OLAP database through a JDBC endpoint, but parallel processing will be badly limited by the IO bottleneck of the JDBC endpoint itself.
  • The cost of storing massive data in an OLAP database is high.

You might already know the direction to solve this: build a data lake! Bringing in a data lake does not necessarily mean you need to completely sunset the OLAP database. It is still common to see companies keep the two systems co-existing for different use-cases.

A data lake allows you to persist unstructured and semi-structured data, and performs schema-on-read. It lets you reduce cost by storing large data volumes in a specialized storage solution and spinning up compute clusters on demand. It further allows you to manage TB/PB-scale datasets effortlessly by scaling up the compute clusters.
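Schema-on-read means raw records land in the lake as-is, and a schema is enforced only when the data is read back. A minimal sketch (the file layout and field names are illustrative assumptions):

```python
import json
import os
import tempfile

# Raw semi-structured records are written to the lake exactly as they arrive
# (JSON lines here). Note the inconsistent shapes: no schema check at write time.
raw_records = [
    '{"user_id": "42", "event": "click", "extra": {"page": "/home"}}',
    '{"user_id": "7", "event": "view"}',
]

path = os.path.join(tempfile.mkdtemp(), "events.jsonl")
with open(path, "w") as f:
    f.write("\n".join(raw_records))

def read_with_schema(path):
    """Apply the schema at read time: cast types, default missing fields."""
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            yield {
                "user_id": int(rec["user_id"]),            # cast string -> int
                "event": rec.get("event", "unknown"),      # default if absent
                "page": rec.get("extra", {}).get("page"),  # optional nested field
            }

rows = list(read_with_schema(path))
print(rows[0])  # {'user_id': 42, 'event': 'click', 'page': '/home'}
```

Engines like Spark or Presto do the same thing at scale: the table definition lives in a catalog and is projected onto the raw files at query time.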

Here is how your infrastructure might look:
