
Designing Data Quality Rules Solely from Data: Is It a Good Idea?

Tereza Mlynarova
April 5, 2024
Growing reliance on data-driven insights demands robust data quality management (DQM). But is data alone sufficient for building effective DQM rules, particularly in complex domains like transportation? This question fueled our recent collaboration with the Czech Republic's Technology Agency and the Faculty of Transportation Sciences at the Czech Technical University in Prague. Our objective: to design transportation-specific data quality rules that go beyond basic data checks.

This focus on expertise arose not only from our project but also from industry trends. Clients increasingly seek automated data quality management solutions, prompting us to explore the limitations of purely data-driven approaches.

Data Quality: Navigating Between Expertise and Data-Driven Approaches

Let's explore the foundational principles of data quality management, with a specific focus on how data quality rules are created. At its core, the approach to data quality hinges on whether it's primarily driven by the expertise of professionals in the field or by the data itself.

Let's now look at each approach in more detail.

Expertise-Driven Approach:

If expertise is the main driver for data quality rules, the whole process works as follows:

  1. Begin by creating a comprehensive business glossary, capturing the essential business terms that define the company's landscape.

  2. Document the data catalog to understand the available data sources.

  3. Establish a business data lineage between the business glossary and the data catalog, ensuring alignment between business requirements and data assets.

  4. Collaborate with business data owners across various domains to design logical business rules that align with the organization's objectives.

  5. Implement these logical rules for the relevant data sources and assess data quality against them.

This approach empowers organizations to efficiently measure complex data quality metrics, leveraging the expertise of domain professionals. Accurity is designed with this methodology in mind, facilitating the implementation of business-oriented data quality rules without requiring SQL expertise.


Image 1: An example of a logical data quality rule in Accurity

The image shows a snapshot from Accurity with an example of a logical data quality rule. With such a rule, you don't need any proficiency in technical language syntax; you simply devise the logic grounded in defined business terms.
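For readers who like to see the mechanics, below is a minimal, hypothetical sketch of what such a logical rule ultimately boils down to: business glossary terms are tied to physical columns through business data lineage, and the rule is evaluated purely in those terms. All names and the rule itself are illustrative assumptions, not Accurity's internal implementation.

```python
import pandas as pd

# Hypothetical business data lineage: business glossary terms -> physical columns.
# (Illustrative names; in practice the mapping comes from the data catalog and lineage.)
LINEAGE = {
    "Order Amount": "amount_eur",
    "Order Status": "status",
}

def completed_orders_have_positive_amount(orders: pd.DataFrame) -> pd.Series:
    """Logical rule stated in business terms:
    'Every completed order must have a positive Order Amount.'"""
    status = orders[LINEAGE["Order Status"]]
    amount = orders[LINEAGE["Order Amount"]]
    return (status != "COMPLETED") | (amount > 0)

# Illustrative data from a hypothetical 'orders' source documented in the data catalog.
orders = pd.DataFrame(
    {"amount_eur": [120.0, 0.0, 35.5], "status": ["COMPLETED", "COMPLETED", "NEW"]}
)

passed = completed_orders_have_positive_amount(orders)
print(f"Rule pass rate: {passed.mean():.0%}")  # share of rows satisfying the rule
```

The rule logic is expressed against business terms; only the lineage mapping touches physical column names, which is exactly the separation the expertise-driven approach relies on.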

Data-Driven Approach:

If the data itself is the main basis for data quality rules, then input from business-oriented people is set aside and the whole process looks like this:

  1. Begin by documenting the data catalog, focusing solely on the available data sources.

  2. Utilize automated tools to generate and measure data quality rules.

In this approach, the primary emphasis is on the data itself rather than expert input. Automated tools measure data quality against standard conditions such as null checks or email mask validation. However, more intricate data quality rules often remain unmeasured, because they require the domain expertise captured in a business glossary. So the business glossary is back in the picture.
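For illustration, here is a minimal sketch of the kind of checks such tools typically generate on their own: completeness (null) checks and a simple format (mask) check. The table, column names, and email pattern are assumptions made for the example.

```python
import pandas as pd

# Illustrative customer extract; in a real setup this comes from the documented data catalog.
customers = pd.DataFrame(
    {
        "customer_id": [1, 2, 3, None],
        "email": ["a@example.com", "not-an-email", None, "c@example.com"],
    }
)

# Standard, automatically generated checks: completeness and a simple email mask.
EMAIL_MASK = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"

completeness = customers.notna().mean()  # non-null share per column
email_mask_pass = customers["email"].dropna().str.match(EMAIL_MASK).mean()

print(completeness.to_dict())
print(f"email mask pass rate: {email_mask_pass:.0%}")
```

Checks like these can be derived from the data alone; anything that depends on business meaning (for example, which combinations of values are valid for a completed order) cannot.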

Ultimately, to effectively measure data quality while leveraging organizational expertise, initiating the process with a business glossary is essential. This foundational step ensures that data quality assessments are aligned with business objectives, emphasizing the importance of domain knowledge in driving meaningful insights.

In summary, whether the egg or the chicken comes first in data quality management depends on whether the process is guided by business expertise or solely by data. Starting with a business glossary ensures that data quality assessments are rooted in organizational knowledge, setting the stage for comprehensive and relevant data quality initiatives. In this case, then, we can say the egg came before the chicken.

The Drawbacks of Relying Solely on Data-Driven Approaches for Data Quality

Before delving into our project's specific approach, it's important to highlight the drawbacks of relying solely on a data-driven approach when defining data quality rules:

  • Basic Rule Design: Typically, a data-driven approach leads to the creation of only basic technical data quality rules, which may overlook nuanced business requirements and intricacies.

  • Limited Adaptability: It's challenging to tailor data quality rules to accommodate specific characteristics or anomalies present in the documented data catalog, hindering their effectiveness in diverse data environments.

  • Lack of Business Insight: Without input from domain experts, there's a lack of understanding regarding the contextual relevance and significance of data elements stored in the catalog, potentially resulting in rule inaccuracies or oversights.

  • Dependency on Data Scientists: Implementing more advanced data-driven approaches often requires significant involvement from data scientists, adding complexity and resource requirements to the process.

  • Quality of Training Data: Algorithms developed solely based on data may suffer from issues related to the quality of the training data, leading to suboptimal performance and reliability of designed data quality rules.

  • Specificity to Training Data: Enhanced algorithms developed through unsupervised learning methods are often tailored to the specific dataset they were trained on, limiting their applicability to new or different datasets.

  • Recalculation Overhead: Enhanced algorithms may need frequent recalibration or adjustment based on newly delivered data, posing logistical challenges and increasing computational overhead.

  • Performance Demands: Implementing data-driven techniques for data quality assessment can be resource-intensive, potentially impacting system performance and scalability.

  • Lack of Expertise Integration: Without insights from domain experts, navigating and optimizing these algorithms can be challenging, highlighting the importance of leveraging business glossaries and domain knowledge.

In essence, despite the allure of automated approaches, the expertise of domain professionals remains paramount in ensuring effective and contextually relevant data quality management practices. This again underscores the foundational importance of incorporating business glossaries into data quality initiatives.

 

Challenges Encountered During the Transportation Project   

Our project was supported by the Road and Motorway Directorate of the Czech Republic, which provided invaluable assistance by supplying historical data from the transportation sector. Initially, the abundance of data seemed promising for quickly identifying data quality rules. However, we quickly realized that without a comprehensive business glossary, interpreting the data proved challenging.

With the guidance of experts from the directorate, we successfully categorized key parameters such as speed, intensity, and road sections. This initial achievement was overshadowed by the discovery of data quality issues. Attempting to apply algorithms to data of subpar quality proved counterproductive, requiring extensive data cleaning and problem-solving in the transportation domain.

Our approach employed statistical modeling to analyze trends and seasonal variations in both intensity and speed for individual road sections. By establishing confidence intervals based on historical data, we aimed to identify potential deviations in newly measured data that fell outside these intervals. Such deviations would serve as indicators of potential data quality issues, prompting further investigation and validation.

This method allowed us to systematically monitor the consistency of data patterns over time, enabling us to detect anomalies that might signify data inaccuracies or irregularities. By integrating statistical rigor with real-time data monitoring, we strived to ensure the reliability and integrity of the transportation data being analyzed.
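A minimal sketch of this idea, assuming hourly intensity measurements per road section: estimate the expected value and spread from historical data per (road section, weekday, hour), build a confidence band, and flag new measurements that fall outside it. The column names, bucketing, and band width are assumptions rather than the project's exact model.

```python
import pandas as pd

Z = 3.0  # width of the confidence band in standard deviations (assumed)

def fit_baseline(history: pd.DataFrame) -> pd.DataFrame:
    """Per (road_section, weekday, hour): mean and std of historical intensity."""
    hist = history.copy()
    hist["weekday"] = hist["timestamp"].dt.weekday
    hist["hour"] = hist["timestamp"].dt.hour
    return (
        hist.groupby(["road_section", "weekday", "hour"])["intensity"]
        .agg(["mean", "std"])
        .reset_index()
    )

def flag_deviations(new_data: pd.DataFrame, baseline: pd.DataFrame) -> pd.DataFrame:
    """Mark measurements outside the mean +/- Z*std band as potential DQ issues."""
    df = new_data.copy()
    df["weekday"] = df["timestamp"].dt.weekday
    df["hour"] = df["timestamp"].dt.hour
    df = df.merge(baseline, on=["road_section", "weekday", "hour"], how="left")
    df["suspect"] = (df["intensity"] < df["mean"] - Z * df["std"]) | (
        df["intensity"] > df["mean"] + Z * df["std"]
    )
    return df[["road_section", "timestamp", "intensity", "suspect"]]
```

Trend and seasonal components can be handled in the same spirit, for example by de-trending the series with a statistical model before computing the band.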

In one instance, we encountered a significant anomaly on a particular road section where vehicle data suddenly went missing, despite being present before and after the occurrence. Without further investigation, this anomaly could have misled our trend and seasonality modeling or led us to assume the sensor was malfunctioning or sending bad data. However, upon conducting a thorough investigation, we discovered that the traffic flow had been redirected to another road due to ongoing construction. Equipped with this insight, we were able to refine our analysis and continue our work effectively.

Similarly, on another road section, we observed a sudden decrease in traffic flow and intensity on a day when the flow had previously been consistent. Through meticulous examination, we traced this anomaly back to adverse weather conditions, which had a significant impact on traffic patterns. Despite the deviation from the norm, the data remained of good quality. However, had we not considered external factors like weather conditions in our analysis, we might have erroneously flagged the data as poor quality.

These examples underscore the importance of conducting thorough investigations and considering external and expert-specific factors when assessing data quality. By incorporating such insights into our analysis, we ensure more accurate and reliable interpretations of data anomalies, thereby enhancing the overall quality and integrity of our findings.
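One lightweight way to bake this lesson into the statistical check is to keep a register of known external events (roadworks, severe weather), maintained with domain experts, and use it to annotate flagged measurements so that explainable deviations are not mistaken for bad data. The event register and column names below are hypothetical and build on the previous sketch's `suspect` flag.

```python
import pandas as pd

# Hypothetical register of external events affecting traffic, maintained with domain experts.
events = pd.DataFrame(
    {
        "road_section": ["D1-042", "R6-011"],
        "start": pd.to_datetime(["2023-05-02", "2023-11-20"]),
        "end": pd.to_datetime(["2023-06-15", "2023-11-21"]),
        "reason": ["construction detour", "heavy snowfall"],
    }
)

def annotate_with_events(flagged: pd.DataFrame, events: pd.DataFrame) -> pd.DataFrame:
    """Attach a known external cause (if any) to each flagged measurement."""
    out = flagged.copy()
    out["known_cause"] = None
    for ev in events.itertuples(index=False):
        mask = (
            (out["road_section"] == ev.road_section)
            & (out["timestamp"] >= ev.start)
            & (out["timestamp"] <= ev.end)
        )
        out.loc[mask, "known_cause"] = ev.reason
    # Only anomalies with no known external cause remain candidate data quality issues.
    out["dq_issue_candidate"] = out["suspect"] & out["known_cause"].isna()
    return out
```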

 

Integrate the Data-Driven Approach with Expertise to Establish Robust Data Quality Rules

Ultimately, relying solely on measured data for defining data quality rules proved insufficient. While suitable for creating standard rules such as completeness checks (null checks), this approach lacked the depth and specificity necessary for meaningful data quality assessments. Instead, involving domain experts in defining rules through a business-centric approach proves to be more effective.

This is where Accurity excels, as it's designed to prioritize business perspectives in data management and quality initiatives. By leveraging a business glossary and engaging stakeholders in rule definition, Accurity facilitates the creation of tailored, contextually relevant data quality standards.

Explore the Accurity data quality and data observability solution or request a demo to find out how it suits your particular use case.

Tereza Mlynarova
Head of Product Development
