A Data Mesh Implementation: Expediting Value Extraction from ERP/CRM Systems
The challenge when facing the ‘monster’
Looking for data
Accessing Operational Data
Understanding Operational Data
Operational data management in Data Mesh
Source-aligned Data Products
Consumer experience
Conclusion
References

Enabling fast data development from big operational systems

Photo by Benjamin Zanatta on Unsplash

For a data engineer building analytics from transactional systems such as ERP (enterprise resource planning) and CRM (customer relationship management), the main challenge lies in bridging the gap between raw operational data and domain knowledge. ERP and CRM systems are designed and built to fulfil a broad range of business processes and functions. This generalisation makes their data models complex and cryptic, and interpreting them requires domain expertise.

Even harder to manage, a typical setup in large organisations is to have several instances of those systems, with some underlying processes responsible for transmitting data among them, which can lead to duplications, inconsistencies, and opacity.

The disconnect between the operational teams immersed in the day-to-day functions and the people extracting business value from the data generated within those operational processes remains a big friction point.

Imagine being a data engineer/analyst tasked with identifying the top-selling products in your organisation. Your first step might be to locate the orders. You start researching database objects and find a few views, but there are some inconsistencies between them, so you have no idea which one to use. Moreover, it is really hard to find the owners, and one of them has even recently left the company. As you don’t want to start your development with uncertainty, you decide to go for the operational raw data directly. Does it sound familiar?

I used to connect to views in transactional databases or APIs offered by operational systems to request the raw data.

Order snapshots are stored in my own development area (image by the author)

To prevent my extractions from impacting performance on the operational side, I queried this data periodically and stored it in a persistent staging area (PSA) within my data warehouse. This allowed me to execute complex queries and data pipelines using these snapshots without consuming any resources from the operational systems, but it could lead to unnecessary duplication of data if I was not aware of other teams doing the same extraction.
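
For illustration, such a snapshot load into a PSA can be as simple as an insert-select tagged with the extraction date; the schema and table names below are hypothetical, not taken from any particular system.

-- Minimal sketch of a periodic snapshot load into a persistent staging area (PSA).
-- Schema and table names (erp_raw, psa) are hypothetical.
insert into psa.sales_orders_snapshots
select
    current_date as snapshot_date,  -- tag each snapshot with the extraction date
    o.*
from erp_raw.sales_orders as o;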

Once the raw operational data was available, I had to face the next challenge: deciphering all the cryptic objects and properties and dealing with the labyrinth of dozens of relationships between them (e.g. General Material Data in SAP, documented at https://leanx.eu/en/sap/table/mara.html).

Even though standard objects within ERP or CRM systems are well documented, I had to deal with numerous custom objects and properties that require domain expertise, as these objects can’t be found in the standard data models. Most of the time I found myself throwing ‘trial-and-error’ queries in an attempt to align keys across operational objects, interpreting the meaning of properties according to their values, and checking my assumptions against operational UI screenshots.
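
As a sketch of what that ‘trial-and-error’ exploration can look like, the query below probes whether sales order items join to the material master on the material number, and which values a custom category property actually holds. VBAP and MARA are standard SAP tables, but the erp_raw schema and the custom "BIC/MARACAT" column are assumptions used for illustration.

-- Exploratory query: do VBAP (sales document items) and MARA (general material
-- data) align on MATNR, and what values does a custom category field hold?
select
    mara."BIC/MARACAT" as custom_material_category,
    count(*)           as item_count
from erp_raw.vbap as vbap
left join erp_raw.mara as mara
    on vbap.matnr = mara.matnr
group by mara."BIC/MARACAT"
order by item_count desc;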

A Data Mesh implementation improved my experience in these aspects:

  • Knowledge: I could quickly identify the owners of the exposed data. The proximity of the owner to the domain that generated the data is key to expediting further analytical development.
  • Discoverability: A shared data platform provides a catalog of operational datasets in the form of source-aligned data products, which helped me understand the status and nature of the data exposed.
  • Accessibility: I could easily request access to these data products. As this data is stored in the shared data platform and not in the operational systems, I didn’t have to align with operational teams on available windows to run my own data extraction without impacting operational performance.

According to the Data Mesh taxonomy, data products built on top of operational sources are called Source-aligned Data Products:

Source domain datasets represent closely the raw data at the point of creation, and are not fitted or modelled for a particular consumer — Zhamak Dehghani

Source-aligned data products aim to represent operational sources within a shared data platform in a one-to-one relationship with operational entities, and they shouldn’t hold any business logic that could alter any of their properties.

Ownership

In a Data Mesh implementation, these data products should strictly be owned by the business domain that generates the raw data. The owner is responsible for the quality, reliability, and accessibility of their data, and the data is treated as a product that can be used by the same team and by data teams in other parts of the organisation.

This ownership ensures domain knowledge stays close to the exposed data. This is critical to enabling the fast development of analytical data products, as any clarification needed by other data teams can be handled quickly and effectively.

Implementation

Following this approach, the Sales domain is responsible for publishing a ‘sales_orders’ data product and making it available in a shared data catalog.

Sales Orders DP exposing sales_orders_dataset (image by the author)

The data pipeline in charge of maintaining the data product could be defined like this:

Data pipeline steps (image by the author)

Data extraction

The first step to building source-aligned data products is to extract the data we want to expose from operational sources. There are plenty of data integration tools that offer a UI to simplify the ingestion. Data teams can create a job there to extract raw data from operational sources using JDBC connections or APIs. To save computational work, and whenever possible, only the raw data updated since the last extraction should be incrementally added to the data product.
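
A minimal sketch of that incremental logic, assuming the source table exposes a last-changed timestamp and the pipeline keeps a watermark table (both names are assumptions):

-- Pull only the rows changed since the last successful extraction, based on a
-- watermark the pipeline maintains. Table and column names are assumptions.
select o.*
from erp_raw.sales_orders as o
where o.last_changed_at > (
    select coalesce(max(extracted_up_to), timestamp '1900-01-01 00:00:00')
    from pipeline_metadata.extraction_watermarks
    where source_table = 'sales_orders'
);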

Data cleansing

Now that we have obtained the desired data, the next step involves some curation, so consumers don’t have to deal with existing inconsistencies in the actual sources. Although business logic shouldn’t be implemented when building source-aligned data products, basic cleansing and standardisation are allowed.

-- Example of property standardisation in a SQL query used to extract data
case
    when lower(SalesDocumentCategory) = 'invoice'   then 'Invoice'
    when lower(SalesDocumentCategory) = 'invoicing' then 'Invoice'
    else SalesDocumentCategory
end as SALES_DOCUMENT_CATEGORY

Data update

Once the extracted operational data is ready for consumption, the data product’s internal dataset is incrementally updated with the latest snapshot.

One of the requirements for a data product is to be interoperable. This means we need to expose global identifiers so our data product can be universally used across other domains.
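
A sketch of how such an incremental update could be expressed as an upsert, using a global identifier (here assumed to be built from the source system id and the local order number) as the merge key; all table and column names are illustrative:

-- Upsert the latest snapshot into the data product's internal dataset, keyed on a
-- global identifier so other domains can reference orders unambiguously.
merge into data_products.sales_orders_dataset as target
using (
    select
        'ERP01' || '-' || sales_order_number as sales_order_global_id,  -- assumed global id scheme
        sales_document_category,
        net_amount,
        last_changed_at
    from staging.sales_orders_latest
) as source
on target.sales_order_global_id = source.sales_order_global_id
when matched then update set
    sales_document_category = source.sales_document_category,
    net_amount              = source.net_amount,
    last_changed_at         = source.last_changed_at
when not matched then insert (
    sales_order_global_id, sales_document_category, net_amount, last_changed_at
) values (
    source.sales_order_global_id, source.sales_document_category,
    source.net_amount, source.last_changed_at
);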

Metadata update

Data products must be understandable. Producers need to include meaningful metadata for the entities and properties they contain. This metadata should cover these aspects for each property (a minimal example follows the list):

  • Business description: What each property represents for the business. For example, “Business category for the sales order”.
  • Source system: Establish a mapping with the original property in the operational domain. For instance, “Original Source: ERP | MARA-MTART table BIC/MARACAT property”.
  • Data characteristics: Specific characteristics of the data, such as enumerations and options. For example, “It’s an enumeration with these options: Invoice, Payment, Complaint”.
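
One lightweight way to attach this kind of metadata, assuming the platform supports column comments (syntax varies by engine, and the dataset and column names are illustrative):

-- Attach the business description, source mapping and data characteristics as a
-- column comment on the data product's internal dataset.
comment on column data_products.sales_orders_dataset.sales_document_category is
    'Business category for the sales order. Original source: ERP | MARA-MTART table BIC/MARACAT property. Enumeration: Invoice, Payment, Complaint.';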

Data products also need to be discoverable. Producers need to publish them in a shared data catalog and indicate how the data is to be consumed by defining output port assets that serve as interfaces through which the data is exposed.

And data products need to be observable. Producers need to deploy a set of monitors that can be displayed within the catalog. When a potential consumer discovers a data product in the catalog, they can quickly understand the health of the data it contains.
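
As an example of what one of those monitors could check, a simple freshness query flags the data product as stale when nothing has changed within an agreed window; the 24-hour threshold, names, and interval syntax are assumptions and vary by engine.

-- Simple freshness monitor for the catalog: stale if no rows changed in 24 hours.
select
    case
        when max(last_changed_at) < current_timestamp - interval '24' hour then 'STALE'
        else 'HEALTHY'
    end as freshness_status
from data_products.sales_orders_dataset;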

Now, again, imagine being a data engineer tasked with identifying the top-selling products in your organisation. But this time, imagine that you have access to a data catalog offering data products that represent the truth of each domain shaping the business. You simply type ‘orders’ into the data product catalog and find the entry published by the Sales data team. And, at a glance, you can assess the quality and freshness of the data and read a detailed description of its contents.

Entry for Sales Orders DP within the Data Catalog example (image by the author)

This upgraded experience eliminates the uncertainties of traditional discovery, allowing you to start working with the data right away. What’s more, you know who is responsible for the data in case further information is required. And whenever there is an issue with the Sales orders data product, you’ll receive a notification so you can take action ahead of time.

We have identified several benefits of exposing operational data through source-aligned data products, especially when they are owned by the data producers:

  • Curated operational data accessibility: In large organisations, source-aligned data products build a bridge between the operational and analytical planes.
  • Reduced collision with operational work: Access to operational systems is isolated within the source-aligned data product pipelines.
  • Source of truth: A common data catalog with an inventory of curated operational business objects reduces duplication and inconsistencies across the organisation.
  • Clear data ownership: Source-aligned data products should be owned by the domain that generates the operational data, ensuring domain knowledge stays close to the exposed data.

Based on my own experience, this approach works exceptionally well in scenarios where large organisations struggle with data inconsistencies across different domains and with friction when building their own analytics on top of operational data. Data Mesh encourages each domain to build the ‘source of truth’ for the core entities it generates and to make them available in a shared catalog, allowing other teams to access them and create consistent metrics across the whole organisation. This enables analytical data teams to speed up their work and generate analytics that drive real business value.

Dehghani, Zhamak. Data Mesh: Delivering Data-Driven Value at Scale. O’Reilly Media. https://www.oreilly.com/library/view/data-mesh/9781492092384/

Thanks to my Thoughtworks colleagues Arne (twice!), Pablo, Ayush and Samvardhan for taking the time to review the early versions of this article.
