Let us start with a real-world parallel of building a new house and then living in it. I am sure we can all agree that the goal is clear: you want a nice, fully functional house where you can happily live for a long time. Maybe at the beginning, just after finishing the build, there are no issues. But you already have some complex systems installed (e.g., electrical wiring, water pipes, heating and ventilation systems, etc.). And you might add to the building or upgrade it once you start living in it, such as adding solar/wind power or building an extension/extra garage. And, of course, you must also continuously maintain it once it is built.
The longer you live there, the more probable it is that there will be some issues to handle. For some time, you might not even know about a problem, as the issues might not yet have had a detrimental impact on you living there, but the longer a problem exists, the worse it can get.
That is why you will want to know when something is going wrong, so that you can fix it. It all starts with having everything well documented: knowing which pipes go through the house, where exactly they are installed, and what their quality is. Only then can you say whether something is working well or not. And it is much better to have everything documented before problems start occurring – hence, you document a house when it is built, not afterwards.
And it is the same with data quality – you must start by documenting your data, so that you know what you can work with, where the data comes from, and what conditions it should fulfil. Only then, if your data fulfils what it should by design, can you measure it effectively.
Let us dig deeper into the idea of measuring data quality without any context around it. A common misconception we have seen is a business that simply wants to start measuring data quality while believing it will not need a business glossary or a data catalog. The initiative therefore usually starts in some technical department, and the steps taken are often similar to the following:
A technical person is asked to start measuring the data quality. The first questions this person has (if they are able to get access to the appropriate data) are: “What am I looking at?”, “What information does the data store?”, and potentially “How are the metrics calculated?”.
They must find out who is responsible for the data domains they are about to measure the data quality for, so that they know who to ask. This takes time.
Once they know, they can go to somebody from the business department to find out what is stored in the data. This again takes quite a lot of time.
If they are already familiar with the fundamentals, they will probably start writing simple, obvious technical checks, such as whether values stored in a date column are in the proper format or whether an email address fulfils the proper mask (a sketch of such checks follows after these steps). This step might even be redundant, because the data source might already have such validation in place (but this person does not know that). The technical person might be able to go one step further by putting in place some more advanced data quality rules based on business knowledge they have happened to pick up.
Then they will move on to another data domain, repeat the process from the second step, and discover that this new person they are talking to gives them a different meaning for the data or the calculated metrics, and therefore a different view of what the data should fulfil. This might contradict what they have been told previously.
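As referenced in the steps above, here is a minimal sketch of what such simple technical checks might look like in Python. The column names, sample records, and validation patterns are illustrative assumptions only; real formats and masks depend on the source systems.

```python
import re
from datetime import datetime

# Hypothetical sample records pulled from a customer table.
records = [
    {"signup_date": "2023-04-17", "email": "jane.doe@example.com"},
    {"signup_date": "17/04/2023", "email": "not-an-email"},
]

EMAIL_MASK = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # illustrative mask only


def is_valid_date(value: str) -> bool:
    """Check that a date column value matches the expected ISO format (YYYY-MM-DD)."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def is_valid_email(value: str) -> bool:
    """Check that an email value fulfils the expected mask."""
    return bool(EMAIL_MASK.match(value))


for row in records:
    print(row, "date ok:", is_valid_date(row["signup_date"]), "email ok:", is_valid_email(row["email"]))
```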
With this approach, the technical person will never be able to measure data quality in a way that is consistent across the whole company. And without the deeper business-defined rules, they can only measure basic technical checks. For example, a basic technical check might verify that customers with the “active customer” flag set to 1 fulfil the condition of being active, but the deeper business rule is that an active customer has also placed an order of more than 10 EUR in the last year. Without this business context, the technical person is not defining the check properly – they need this information, consistent across the company (see the sketch below).
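For contrast with the technical checks above, here is a minimal sketch of the deeper business rule from this example. The field names and the fixed reference date are hypothetical; the point is that the business definition (an order of more than 10 EUR in the last year) cannot be derived from the technical flag alone.

```python
from datetime import date, timedelta

# Hypothetical customer and order records; field names are assumptions.
customers = [
    {"customer_id": 1, "active_flag": 1},
    {"customer_id": 2, "active_flag": 1},
]
orders = [
    {"customer_id": 1, "order_date": date(2024, 3, 1), "amount_eur": 25.0},
    {"customer_id": 2, "order_date": date(2021, 6, 1), "amount_eur": 50.0},
]


def is_active_by_business_rule(customer, all_orders, today):
    """Business rule: active flag is set AND an order over 10 EUR exists in the last year."""
    one_year_ago = today - timedelta(days=365)
    recent_big_order = any(
        o["customer_id"] == customer["customer_id"]
        and o["order_date"] >= one_year_ago
        and o["amount_eur"] > 10
        for o in all_orders
    )
    return customer["active_flag"] == 1 and recent_big_order


# Data quality check: does the technical flag agree with the business definition?
reference_date = date(2024, 6, 1)  # fixed date to keep the example deterministic
for c in customers:
    if c["active_flag"] == 1 and not is_active_by_business_rule(c, orders, reference_date):
        print(f"Customer {c['customer_id']}: flag says active, but the business rule disagrees")
```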
In addition, we can see that all of this is a time-consuming process, especially if they keep discovering new rules for data with similar or identical meaning without having them well documented. What is the way out of this corner? We recommend starting by describing business terms together with their calculations and defining business data lineage. Only then are you able to understand what data you are looking at and have a consolidated view of reality. In addition, if described and defined well, it is possible to know exactly where these metrics are stored, too.
Based on the single defined truth captured in the business terms, business users can define rules that the data must fulfil, using the business term definitions and without any technical knowledge. This type of information is crucial for the whole of data quality monitoring.
Your technical people will then be very happy, as they will know what to measure without any investigation and will have clearer requirements. This has a positive impact on their effectiveness and speeds up the technical implementation. They will also be able to do it correctly by design from the business perspective. Overall, this approach improves the clarity, speed, and efficiency of internal projects. As a result, the business sees value as the initiative brings benefits back to the business – it’s a win-win for both the business and technical teams.
Another reason why data quality monitoring is not only about measurements is the opportunity to put business processes at the center and apply data quality solutions to a wider spectrum of business scenarios and data domains. This newfound understanding can then be used to help formulate data quality rules, effectively beginning the company’s initiative towards what is known as data governance.
Ideally, before you start measuring any data quality, you should decide who will be responsible for which data by assigning data stewards. Then, instead of having quality measurements just run somewhere, sometime, without bringing you much added value, you can consider where and under which conditions it makes sense to launch specific groups of measurements.
Again, here is an example from the real world: nobody wants bad-quality data to get into the production stages of your data warehouse (DWH). That is why it is wise to run specific groups of measurements before this stage and create from them a so-called “quality gateway” (a minimal sketch follows below). If the quality does not fulfil what it should, the entire process of loading the data into production can be stopped, and you will save a lot of effort on the data cleansing that would otherwise follow.
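The following Python sketch shows the idea of such a gateway. The check names, thresholds, and load step are hypothetical placeholders; in practice the checks would come from your business-defined rules and the load would be your actual DWH job.

```python
# Minimal sketch of a "quality gateway": run a group of checks on staged data
# and stop the production load if any of them fail.

def completeness_check(rows):
    """No required identifier may be missing."""
    return all(r.get("customer_id") is not None for r in rows)


def validity_check(rows):
    """Order amounts must be non-negative."""
    return all(r.get("amount_eur", 0) >= 0 for r in rows)


QUALITY_GATEWAY = [completeness_check, validity_check]


def load_to_production(rows):
    # Placeholder for the real DWH load job.
    print(f"Loading {len(rows)} rows into the production DWH...")


def run_pipeline(rows):
    failed = [check.__name__ for check in QUALITY_GATEWAY if not check(rows)]
    if failed:
        # Stop the load and leave the data in staging for cleansing.
        raise RuntimeError(f"Quality gateway failed: {failed}; load aborted")
    load_to_production(rows)


staged_rows = [{"customer_id": 1, "amount_eur": 12.5}]
run_pipeline(staged_rows)
```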
In other words, it is good to think about automating data quality rules with the understanding that automation only pays off if you incorporate the rules into your processes. Think about automating the process of making sure that the data you have is accurate. If you do this, you will be able to use fewer resources and spend less time on data quality tasks. Apart from the productivity benefits, such an initiative will further increase trust in data across the organization and help achieve business goals faster.
In this scenario of putting the business first, data quality processes will no longer be individual, distinct tasks; they become incorporated into the very business workflows where they are needed. The value of data quality should be seen as one part of the data governance initiative covering the overall company-wide processes and domains. This can maximize the benefits, increase competitive advantage, and play a role in data stewardship and regulatory compliance.
The monitoring and observability of data quality is not just about measurements. There are several other very important steps in a data quality journey.
Consistent and proper handling of a business’ data across the whole organization requires a data governance initiative that includes people, processes, and technology.
It is best practice to make your data quality management framework one part of that data governance initiative.
Define and harmonize your business terms in a business glossary.
You need to document where everything is – a data catalog.
And, preferably, you should also connect the conceptual data model or the business glossary with your technical systems and data sources – thus creating business data lineage (a simple sketch of how these pieces fit together follows below).
When the above are done well, you will not only be able to ensure consistent data quality from company-wide definitions, but you will be able to attain this more efficiently, saving time and money.
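As a rough illustration, here is a simplified Python sketch of how a business glossary entry, data catalog information, and business data lineage can hang together. The structures and names are purely illustrative assumptions, not how any particular tool models them.

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalColumn:
    """Data catalog view: where the data physically lives."""
    system: str
    table: str
    column: str


@dataclass
class BusinessTerm:
    """Business glossary view: the agreed, company-wide definition."""
    name: str
    definition: str
    # Business data lineage: the link from the term to its physical locations.
    mapped_columns: list[PhysicalColumn] = field(default_factory=list)


active_customer = BusinessTerm(
    name="Active customer",
    definition="Customer with the active flag set who placed an order over 10 EUR in the last year",
    mapped_columns=[
        PhysicalColumn("CRM", "customers", "active_flag"),
        PhysicalColumn("DWH", "fact_orders", "amount_eur"),
    ],
)

# With this mapping, a data quality rule written against the business term
# can be located and measured in the right systems and columns.
for col in active_customer.mapped_columns:
    print(f"{active_customer.name} -> {col.system}.{col.table}.{col.column}")
```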
You will need an easy-to-use tool that can get you started with the basic facilities (data quality and data catalog) but one that will also be able to grow and scale with your business’ needs and requirements, such as adding a business glossary and/or business data lineage.
If you would like to see how Accurity, our all-in-one data intelligence platform, can help you with your data quality, data catalog, business glossary, and/or business data lineage needs, you can schedule your own free, personalized 1-on-1 demo with us, at a date and time that suits you.