Strategies for efficiently managing dimension changes and data restatement in enterprise data warehousing
Imagine this, you’re a knowledge engineer working for a big retail company that utilizes the incremental load technique in data warehousing. This system involves selectively updating or loading only the brand new or modified data for the reason that last update. What could occur when the product R&D department decides to vary the name or description of a current product? How would such updates impact your existing data pipeline and data warehouse? How do you propose to deal with challenges like these? This text provides a comprehensive guide with solutions, utilizing Slowly Changing Dimensions (SCD), to tackle potential issues during data restatement.
What are Slowly Changing Dimensions (SCD)?
Slowly changing dimensions consult with infrequent changes in dimension values, which occur sporadically and aren’t tied to a each day or regular time-based schedule, as dimensions typically change less regularly than transaction entries in a system. For instance, a jewellery company that has its customers placing a brand new order on their website will grow to be a brand new row within the order fact table. Alternatively, the jewellery company rarely changes their product name and their product description but that doesn’t mean it is going to never occur in the longer term.
Managing changes in these dimensions requires employing Slowly Changing Dimension (SCD) management techniques, that are categorized into defined SCD types, starting from Type 0 through Type 6, including some combination or hybrid types. We will employ certainly one of the next methods:
SCD Type 0: Ignore
Changes to dimension values are completely disregarded, and the values of dimensions remain unchanged from the time they were initially created in the info warehouse.
SCD Type 1: Overwrite/ Replace
This approach is applicable when the previous value of the dimension attribute isn’t any longer relevant or essential. Nevertheless, historical tracking of changes is just not mandatory.
SCD Type 2: Create a Latest Dimension Row
This approach is advisable as the first technique for addressing changing dimension values, involving the creation of a second row for the dimension with a start date, end date, and potentially a “current/expired” flag. It’s suitable for our scenarios like product description or address changes, ensuring a transparent partitioning of history. The brand new dimension row is linked to newly inserted fact rows, with each dimension record linked to a subset of fact rows based on insertion times — those before the change linked to the old dimension row, and people after linked to the brand new dimension row.
SCD Type 3: Create a “PREV” Column
This method is suitable when each the old and recent values are relevant, and users will probably want to conduct historical evaluation using either value. Nevertheless, it is just not practical to use this method to all dimension attributes, as it will involve providing two columns for every attribute in dimension tables or more if multiple “PREV” values need preservation. It must be selectively used where appropriate.
SCD Type 4: Rapidly Changing Large Dimensions
What if in a scenario you might want to capture every change to each dimension attribute for a really large dimension of retail, say 1,000,000 plus customers of your huge jewelry company? Using type 2 above will in a short time explode the variety of rows in the shopper dimension table to tens and even lots of of hundreds of thousands of rows and using type 3 is just not viable.
A simpler solution for rapidly changing and enormous volume dimension tables is to categorize attributes (e.g., customer age category, gender, purchasing power, birthday, etc.) and separate them right into a secondary dimension, like a customer profile dimension. This table, acting as a “full coverage” dimension table all potential values for each category of dimension attributes preloaded into the table, which may higher manage the granularity of changes while avoiding excessive row expansion within the important customer dimension.
For instance, if we’ve got 8 age categories, 3 different genders, 6 purchasing power categories, and 366 possible birthdays. Our “full coverage” dimension table for customer profiles that accommodates all of the above mixtures shall be 8 x 3 x 6 x 366 mixtures or 52704 rows.
We’ll must generate surrogate_key
for this dimension table and establish a connection to a brand new foreign key in the very fact table. When a modification occurs in certainly one of these dimension categories, there’s no necessity so as to add one other row to the shopper dimension. As a substitute, we generate a brand new fact row and associate it with each the shopper dimension and the brand new customer profile dimension.
SCD Type 5: An Extension to Type 4
To boost the Type 4 approach mentioned earlier, we will establish a connection between the shopper dimension and the shopper profile dimension. This linkage enables the tracking of the “current” customer profile for a selected customer. The important thing facilitates the connection of the shopper with the most recent customer profile, which allows seamless traversal from the shopper dimension to essentially the most recent customer profile dimension without the necessity to link through the very fact table.
SCD Type 6: A Hybrid Technique
With this approach, you integrate each Type 2 (recent row) and Type 3 (“PREV” column). This blended approach offers the benefits of each methodologies. You possibly can retrieve facts using the “ PREV “ column, which provides historical values and presents facts related to the product category at that specific time. Concurrently, querying by the “recent” column provides all facts for each the present and all preceding values of the product category.
Bonus and Conclusion
Normally, data extraction is available in STAR schema, which incorporates one fact table and multiple dimension tables in an enterprise. While the dimension tables store all of the descriptive data and first keys, the very fact table accommodates numeric and additive data that references the first keys of every dimension around it.
Nevertheless, in case your marketing sales data extract is provided as a single denormalized table without distinct dimension tables and lacks the first key for its descriptive data, future updates to product names may pose challenges. Handling such scenarios in your existing pipeline will be more complicated.
The absence of primary keys within the descriptive data can result in issues during data restatement, especially if you end up coping with large datasets. For example, if a product name is updated within the restatement extract and not using a unique product_key
, the incremental load pipeline may treat it as a brand new product, impacting the historical data in your consumption layer. To handle this, creating surrogate_key
for the product dimension and a mapping table to link original and restated product names is mandatory for maintaining data integrity.
In conclusion, every aspect of information warehouse design must be fastidiously considered, taking into consideration potential edge cases.