Data quality dimensions
Taking a consumer viewpoint of data quality is undoubtedly a worthwhile first step. Nevertheless, it may not cover the full scope of testing. Extensive literature reviews have addressed this issue for us, offering a variety of data quality dimensions that are relevant to most use cases. It’s advisable to review the list with data consumers, collectively determine which dimensions apply, and create tests accordingly.
| | | |
| --- | --- | --- |
| Accuracy | Format | Comparability |
| Reliability | Interpretability | Conciseness |
| Timeliness | Content | Freedom from bias |
| Relevance | Efficiency | Informativeness |
| Completeness | Importance | Level of detail |
| Currency | Sufficiency | Quantitativeness |
| Consistency | Usableness | Scope |
| Flexibility | Usefulness | Understandability |
| Precision | Clarity | |
You might find this list too long and wonder where to start. Data products, or any information system, can be observed and analyzed from two perspectives: the external view and the internal view.
External view
The external view concerns the use of the data and its relation to the organization. The system is often treated as a “black box” whose function is to represent a real-world system. The dimensions that fall into the external view are highly business-driven. The evaluation of these dimensions can be subjective, so it’s not always easy to create automated tests for them. But let’s go through a few well-known dimensions:
- Relevancy: The extent to which data is applicable and helpful for the analysis. Consider a marketing campaign aimed at promoting a new product. All data attributes should directly contribute to the success of the campaign, such as customer demographic data and purchase data. Data like city weather or stock market prices is irrelevant in this case. Another example is the level of detail (granularity). If the business wants the market data at the day level, but it’s delivered at the weekly level, then it’s not relevant and useful.
- Representation: The extent to which data is interpretable for data consumers and the data format is consistent and descriptive. The importance of the representation layer is often missed when assessing data quality. It covers the format of the data, which should be consistent and user-friendly, and the meaning of the data, which should be comprehensible. For instance, consider a scenario where data is expected to arrive as a CSV file with descriptive column names, and monetary values are expected to be in EUR rather than in cents (a minimal format check is sketched right after this list).
- Timeliness: The extent to which data is fresh for data consumers. For example, the business needs sales transaction data with a maximum delay of one hour from the point of sale, which means the data pipeline should be refreshed at least every hour.
- Accuracy: The extent to which data complies with business rules. Data metrics are often tied to complicated business rules such as data mappings, rounding modes, etc. Automated tests of the data logic are highly recommended, and the more, the better.
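To make the representation dimension more tangible, here is a minimal sketch of such a format check in Python. The file name `sales.csv`, the column names, and the cents-versus-EUR heuristic are all hypothetical assumptions for illustration, not prescriptions:

```python
import pandas as pd

# Hypothetical expectations agreed with data consumers:
# descriptive column names and amounts in whole EUR, not cents.
EXPECTED_COLUMNS = {"order_id", "customer_id", "order_date", "amount_eur"}

def check_representation(path: str) -> list[str]:
    """Return a list of representation issues found in a CSV file."""
    issues = []
    df = pd.read_csv(path)

    # The file should contain the agreed, descriptive columns.
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        issues.append(f"missing columns: {sorted(missing)}")

    # Crude heuristic: if the median order value is implausibly large,
    # the amounts were probably delivered in cents rather than EUR.
    if "amount_eur" in df.columns and df["amount_eur"].median() > 10_000:
        issues.append("amount_eur looks like cents, not EUR")

    return issues

issues = check_representation("sales.csv")  # hypothetical file
assert not issues, issues
```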
Of the four dimensions, timeliness and accuracy are the more straightforward ones when it comes to creating data tests. A timeliness test compares the timestamp column with the current timestamp. Accuracy tests are feasible through custom queries.
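For illustration, here is a minimal sketch of both tests in Python, assuming a pandas DataFrame with a `created_at` timestamp column stored in UTC. The one-hour threshold comes from the example above, and the non-negative-amount rule is a hypothetical stand-in for a real business rule:

```python
import pandas as pd

def check_timeliness(df: pd.DataFrame, max_delay_hours: int = 1) -> bool:
    """Timeliness: the newest record is at most `max_delay_hours` old."""
    newest = pd.to_datetime(df["created_at"], utc=True).max()
    delay = pd.Timestamp.now(tz="UTC") - newest
    return delay <= pd.Timedelta(hours=max_delay_hours)

def check_accuracy(df: pd.DataFrame) -> bool:
    """Accuracy: a custom query encoding a hypothetical business rule,
    i.e. sale amounts are never negative and discounts never exceed them."""
    violations = df[(df["amount_eur"] < 0) | (df["discount_eur"] > df["amount_eur"])]
    return violations.empty
```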
Internal view
In contrast, the internal view is concerned with the operation of the system, independent of specific requirements. These dimensions are essential regardless of the use cases at hand. Dimensions in the internal view are more technically driven, as opposed to the business-driven dimensions in the external view. It also means that data tests depend less on consumers and can be automated most of the time. Here are a few key perspectives:
- Quality of the data source: The quality of the data source significantly impacts the overall quality of the final data. The data contract is a great initiative for ensuring source data quality. As data consumers of the source, we can employ a similar approach to monitoring the source data as our own stakeholders do when evaluating our data products.
- Completeness: The extent to which data is retained in its entirety. As the complexity of the data pipeline increases, there is a higher likelihood of data loss in the intermediate stages. Consider a financial system that stores customer transaction data. A completeness test ensures that all transactions successfully traverse the entire lifecycle without being dropped or left out. For example, the final account balance should accurately reflect the real-world situation, capturing every transaction without omission.
- Uniqueness: This dimension goes hand in hand with the completeness test. While completeness guarantees that nothing is lost, uniqueness ensures that nothing is duplicated within the data.
- Consistency: The extent to which data is consistent across internal systems and from day to day. Discrepancies are a common data issue that often stems from data silos or inconsistent metric calculations. Another aspect of consistency shows up between days when data is expected to follow a steady growth pattern; any deviation should raise a flag for further investigation, as sketched below.
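As a sketch of what these internal-view tests could look like in Python, the snippet below checks completeness, uniqueness, and day-over-day consistency on a pandas DataFrame of transactions. The `transaction_id`, `amount_eur`, and `created_at` columns and the 20% tolerance are hypothetical assumptions for illustration:

```python
import pandas as pd

def check_completeness(source: pd.DataFrame, target: pd.DataFrame) -> bool:
    """Completeness: every source transaction must survive the pipeline."""
    lost = set(source["transaction_id"]) - set(target["transaction_id"])
    return not lost

def check_uniqueness(df: pd.DataFrame) -> bool:
    """Uniqueness: no transaction may appear twice."""
    return not df["transaction_id"].duplicated().any()

def check_day_over_day(df: pd.DataFrame, tolerance: float = 0.2) -> bool:
    """Consistency: daily totals should follow a steady pattern.

    Flags any day whose total deviates from the previous day by more
    than `tolerance` (an arbitrary 20% placeholder threshold).
    """
    daily = (df.assign(day=pd.to_datetime(df["created_at"]).dt.date)
               .groupby("day")["amount_eur"].sum())
    change = daily.pct_change().abs().dropna()
    return (change <= tolerance).all()
```

In practice, checks like these would run on each pipeline refresh, with the thresholds agreed together with data consumers rather than hard-coded.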
It’s worth noting that each dimension can be associated with one or more data tests. What’s crucial is understanding which dimensions apply to which tables or metrics. Only then does “the more tests, the better” hold.
So far, we’ve discussed the dimensions of the external view and the internal view. When designing future data tests, it’s vital to consider both perspectives. By asking the right questions of the right people, we can improve efficiency and reduce miscommunication.