A Glossary with Use Cases for First-Timers in Data Engineering
Are you a data engineering rookie eager to learn more about modern data infrastructures? If so, this article is for you!
In this guide, Data Engineering meets Formula 1. But we'll keep it simple.
I strongly believe that the best way to explain a concept is through examples, even though some of my university professors used to say, "If you need an example to explain it, it means you didn't understand it."
In any case, I wasn't paying enough attention during university classes, and today I'll walk you through data layers using (guess what?) an example.
Imagine this: next year, a new team on the grid, Red Thunder Racing, will call us (yes, you and me) to set up their brand-new data infrastructure.
In today's Formula 1, data is at the core, far more than it was 20 or 30 years ago. Racing teams are improving performance with an exceptional data-driven approach, making improvements millisecond by millisecond.
It's not just about lap times; Formula 1 is a multi-billion-dollar business. Boosting fan engagement and making the sport more attractive aren't just for fun: these activities generate revenue.
A strong data infrastructure is a must-have to compete in the F1 business.
We'll build a data architecture to support our racing team, starting from the three canonical layers: Data Lake, Data Warehouse, and Data Mart.
Data Lake
A data lake serves as a repository for raw and unstructured data generated from various sources across the Formula 1 ecosystem: telemetry data from the cars (e.g. per-second tyre pressure, speed, fuel consumption), driver configurations, lap times, weather conditions, social media feeds, ticketing, fans registered to marketing events, merchandise purchases, …
All kinds of data can be stored in our consolidated data lake: unstructured (audio, video, images), semi-structured (JSON, XML), and structured (CSV, Parquet, Avro).
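As a sketch of what this could look like in practice, raw files of any format can land in the lake under a path convention partitioned by source and ingestion date. The root directory, source names, and path layout below are illustrative assumptions, not a standard:

```python
from datetime import date
from pathlib import PurePosixPath

def lake_path(root: str, source: str, ingestion_date: date, filename: str) -> PurePosixPath:
    """Build a partitioned landing key: <root>/raw/<source>/year=YYYY/month=MM/day=DD/<file>."""
    return (
        PurePosixPath(root)
        / "raw"
        / source
        / f"year={ingestion_date:%Y}"
        / f"month={ingestion_date:%m}"
        / f"day={ingestion_date:%d}"
        / filename
    )

# The same convention works for telemetry JSON, marketing CSV exports, and onboard video:
# the lake stores everything as-is, and the path tells us where it came from and when.
print(lake_path("/data/lake", "car_telemetry", date(2024, 5, 12), "lap_042.json"))
# /data/lake/raw/car_telemetry/year=2024/month=05/day=12/lap_042.json
```

Date-based partitioning like this keeps raw data cheap to append and easy to reprocess later, which is exactly what the downstream layers will rely on.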
We'll face our first challenge when we integrate and consolidate everything in a single place. We'll create batch jobs extracting records from marketing tools, and we'll also deal with real-time streaming of telemetry data (and rest assured, there will be very low latency requirements there).
We'll have a long list of systems to integrate, each supporting a different protocol or interface: Kafka streaming, SFTP, MQTT, REST APIs, and more.
We won't be alone in this data collection effort; thankfully, there are data integration tools on the market that can be adopted to configure and maintain ingestion pipelines in a single place (e.g. in alphabetical order: Fivetran, Hevo, Informatica, Segment, Stitch, Talend, …).
Instead of relying on hundreds of Python scripts scheduled via crontab, or on custom processes handling data streaming from Kafka topics, these tools will help us simplify, automate, and orchestrate all these processes.
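To make the batch side concrete, here is a minimal sketch of a landing job: it pulls records from some extractor and writes them, untouched, as JSON lines into the lake. The extractor shown is a hypothetical stand-in; a real one would call a marketing tool's REST API or read an SFTP drop:

```python
import json
from pathlib import Path
from typing import Callable, Iterable

def batch_ingest(extract: Callable[[], Iterable[dict]], landing_file: Path) -> int:
    """Run one batch: pull records from a source and land them as JSON lines, unmodified."""
    landing_file.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with landing_file.open("w", encoding="utf-8") as fh:
        for record in extract():
            fh.write(json.dumps(record) + "\n")
            count += 1
    return count

# Hypothetical extractor standing in for a real connector.
def marketing_events():
    yield {"fan_id": 1, "event": "newsletter_signup"}
    yield {"fan_id": 2, "event": "merch_purchase"}

landed = batch_ingest(marketing_events, Path("/tmp/lake/raw/marketing/events.jsonl"))
print(f"landed {landed} records")  # landed 2 records
```

Note that the job applies no transformation at all: in the lake layer, landing data raw and rerunnable is the whole point. Cleaning comes later.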
Data Warehouse
After a few weeks spent defining all the data streams we need to integrate, we are now ingesting a remarkable amount of data into our data lake. It's time to move on to the next layer.
The data warehouse is used to clean, structure, and store processed data from the data lake, providing a structured, high-performance environment for analytics and reporting.
At this stage, it's no longer about ingesting data; we'll focus more and more on business use cases. We should consider how the data will be used by our colleagues, offering structured, regularly refreshed datasets about:
- Automotive Performance: telemetry data is cleaned, normalised, and integrated to provide a unified view.
- Strategy and Trend Review: past race data is used to identify trends, assess driver performance, and understand the impact of specific strategies.
- Team KPIs: pit stop times, tyre temperatures before pit stops, budget control on car developments.
We'll have numerous pipelines dedicated to data transformation and normalisation.
As with data integration, there are plenty of products on the market to simplify and efficiently manage data pipelines. These tools can streamline our data processes, reducing operational costs and increasing development effectiveness (e.g. in alphabetical order: Apache Airflow, Azure Data Factory, dbt, Google Dataform, …).
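A typical transformation step, stripped to its essence, looks like this. The sketch uses an in-memory SQLite database as a stand-in for the real warehouse engine, and the table names and sample telemetry values are made up for illustration:

```python
import sqlite3

# In-memory SQLite stands in for the warehouse engine (Snowflake, BigQuery, ...).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_laps (driver TEXT, lap INTEGER, lap_time_ms INTEGER)")
conn.executemany(
    "INSERT INTO raw_laps VALUES (?, ?, ?)",
    [
        ("VER", 1, 92345),
        ("VER", 1, 92345),   # duplicate record from a flaky feed
        ("HAM", 1, None),    # sensor glitch: missing lap time
        ("HAM", 2, 91877),
    ],
)

# Clean and normalise: deduplicate rows and drop unusable readings,
# producing an analytics-ready table for the warehouse layer.
conn.execute("""
    CREATE TABLE fact_laps AS
    SELECT DISTINCT driver, lap, lap_time_ms
    FROM raw_laps
    WHERE lap_time_ms IS NOT NULL
""")
rows = conn.execute(
    "SELECT driver, lap, lap_time_ms FROM fact_laps ORDER BY driver, lap"
).fetchall()
print(rows)  # [('HAM', 2, 91877), ('VER', 1, 92345)]
```

Tools like dbt essentially let us express this kind of SELECT-based transformation declaratively, version it, test it, and schedule it, instead of hand-rolling scripts.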
Data Marts
There is a thin line between Data Warehouses and Data Marts.
Let's not forget that we're working for Red Thunder Racing, a big company with thousands of employees involved in diverse areas.
Data must be accessible and tailored to specific business units' requirements. Data models are built around business needs.
Data marts are specialised subsets of data warehouses that focus on specific business functions.
- Automotive Performance Mart: the R&D Team analyses data related to engine efficiency, aerodynamics, and reliability. Engineers use this data mart to optimise the car's setup for different race tracks, or to run simulations and understand the best car configuration for given weather conditions.
- Fan Engagement Mart: the Marketing Team analyses social media data, fan surveys, and viewer ratings to understand fan preferences, using this data to drive tailored marketing strategies and merchandise development, and to improve their Fan360 knowledge.
- Bookkeeping Analytics Mart: the Finance Team needs data as well (lots of numbers, I imagine!). Now more than ever, racing teams have to deal with budget restrictions and regulations. It's essential to keep track of budget allocations, revenues, and cost overviews in general.
Furthermore, it's often a requirement to ensure that sensitive data remains accessible only to authorised teams. For instance, the Research and Development team may require exclusive access to telemetry information, analysed through a specific data model, while not being permitted (or interested) to access financial reports.
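In many platforms, a mart is little more than a curated, business-shaped view over warehouse tables, with the platform's grant system deciding who may query it. Continuing the SQLite stand-in from before (table and view names are again illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_laps (driver TEXT, track TEXT, lap_time_ms INTEGER)")
conn.executemany(
    "INSERT INTO fact_laps VALUES (?, ?, ?)",
    [("VER", "Monza", 92345), ("VER", "Monza", 91980), ("HAM", "Monza", 92110)],
)

# A performance mart for the R&D team: an aggregated, business-shaped view.
# In a real warehouse, access control would be enforced with grants on this view.
conn.execute("""
    CREATE VIEW performance_mart AS
    SELECT driver, track, MIN(lap_time_ms) AS best_lap_ms, COUNT(*) AS laps
    FROM fact_laps
    GROUP BY driver, track
""")
mart = conn.execute("SELECT * FROM performance_mart ORDER BY best_lap_ms").fetchall()
print(mart)  # [('VER', 'Monza', 91980, 2), ('HAM', 'Monza', 92110, 1)]
```

The underlying `fact_laps` table never needs to be exposed: each team sees only the views shaped for its own questions.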
Our layered data architecture will enable Red Thunder Racing to leverage the power of data for car performance optimisation, strategic decision-making, enhanced marketing campaigns… and beyond!
That’s it?
Absolutely not! We barely scratched the surface of a data architecture. There are probably hundreds of other integration points we should consider; moreover, we didn't go beyond merely mentioning data transformation and data modelling.
We didn't cover the Data Science domain at all, which probably deserves its own article; the same goes for data governance, data observability, data security, and more.
But hey, as they say, "Rome wasn't built in a day". We already have quite a lot on our plate for today, including the first draft of our data architecture (below).
Data Engineering is a magical realm, with a plethora of books dedicated to it.
Throughout the journey, data engineers will engage with countless integration tools and diverse data platforms aiming to cover one or more of the layers mentioned above (e.g. in alphabetical order: AWS Redshift, Azure Synapse, Databricks, Google BigQuery, Snowflake, …), business intelligence tools (e.g. Looker, Power BI, Tableau, ThoughtSpot, …), and data pipeline tools.
Our data engineering journey at Red Thunder Racing has just begun, and we should leave plenty of room for flexibility in our toolkit!
Data layers are often combined, sometimes within a single platform. Data platforms and tools are raising the bar and closing gaps day by day by releasing new features. The competition in this market is intense.
- Do you always need a data lake? It depends.
- Do you always need data available as soon as possible (a.k.a. streaming and real-time processing)? It depends: what data freshness do business users actually require?
- Do you always need to rely on third-party tools for data pipeline management? It depends!
If you have any questions or suggestions, please feel free to reach out to me on LinkedIn. I promise I'll answer with something different from: it depends!
Opinions expressed in this article are solely my own and do not reflect the views of my employer. Unless otherwise noted, all images are by the author.
The story, all names, and incidents portrayed in this article are fictitious. No identification with actual places, buildings, or products is intended or should be inferred.