Data Governance

AI's Impact on Data Governance: Separating Hype from Reality

David Vavruska
June 27, 2023 | 13 min read
Claims of full automation that sound too good to be true most probably are. So how can one navigate this tangled web of science, spectacle, buzzwords, and marketing illusions? What really is artificial intelligence (AI) in data governance, and how much of it is just a circus trick?

Since the early twentieth century, people have been asking the question, “Why do I have to do this? Couldn’t a robot do it for me?” Wild stories of servitor mechanical constructs that would make human lives easier by taking over the burden of manual, menial, and repetitive tasks, allowing humans to focus on endeavors of the mind and spirit, have made their way into philosophy, popular culture, and even some political ideologies contemplating a hypothetical post-scarcity world order.

Fast forward to 2023, and we as a society discuss the same topics that fascinated our grandfathers, albeit in a slightly more negative light. Suddenly, the idea of robots taking over somebody’s manual, menial, or repetitive work isn’t so inviting for many people, who fear losing their livelihoods.

Where there is fear, there is logically also a degree of panic. Panic, just like overexcitement and fascination, enables misinformation and false assumptions about what a phenomenon can or cannot do. Once again, wild stories about whose job is going to be taken over by AI or automated industry systems permeate our society. Not even the otherwise strictly logical and pragmatic data governance industry has escaped the trend. Vendors and analysts alike now compete over who can conjure up the most fascinating feature that will do its task all by itself, without any need for human input.

AI, the tastiest piece of the pie in the storefront

When you look at the websites, whitepapers, and social media posts of data governance tool vendors, you get the feeling that the industry mastered artificial intelligence ten years ago. Everywhere you look, you see AI-powered automation, machine learning (ML) enrichment, and predictive anomaly detection features that give the impression that, from the point of purchase, you need not lift a finger; your data governance problems simply disappear at the press of a button. Perhaps it began honestly, with a genuine innovation that one of the data governance companies introduced. Then, however, its competitors took notice, and a huge feeling of FOMO set in.

So began the endless mental gymnastics tournament of vendors one-upping each other in coming up with flimsy justifications for why a certain feature is AI. It was a wonderful time because nobody really knew what AI meant.

Now that actual AI is making its presence known, it is becoming ever more apparent that most of those vendors might have been exaggerating. You could say that AI is now becoming an even more important trend than before, provided you can prove your solution is actually AI-powered.

In a strict sense, an AI system is defined as a rational agent capable of solving tasks based on past experience. It should be able to learn and adapt in order to pick the best viable solution.

But much more is being presented to us as AI. The issue is that the industry acts as if everything that isn’t done manually by the user is AI. As an example, let us take a look at a feature that contributes greatly to automation in Accurity: our lineage suggestion engine.

With the click of a button, the engine suggests a number of business terms for a column documented in the data catalog via vertical data lineage; these terms essentially explain what the data in that column mean. Each suggested connection is then scored by the algorithm’s confidence that the connection is true.

Sounds pretty AI-ish, right? Well, guess what. Not really.

What the feature does is compare the names and descriptions of the two objects and suggest those that display the highest degree of similarity. I admit, I could see it under a flashing sign loudly exclaiming, “AI-powered!” But the feature doesn’t learn anything. It doesn’t actively solve a rational problem. It makes no decisions.
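To make this concrete, here is a minimal sketch (in Python, with hypothetical field names and arbitrary weights) of roughly how a name-and-description comparison like this could work under the hood. It is plain string matching dressed up with a confidence score:

```python
from difflib import SequenceMatcher

def suggest_terms(column, business_terms, threshold=0.4):
    """Rank business terms by textual similarity to a catalog column.

    The 'confidence' is just a string-similarity ratio over names and
    descriptions; nothing here learns, adapts, or decides."""
    suggestions = []
    for term in business_terms:
        name_score = SequenceMatcher(
            None, column["name"].lower(), term["name"].lower()).ratio()
        desc_score = SequenceMatcher(
            None, column["description"].lower(), term["description"].lower()).ratio()
        confidence = 0.6 * name_score + 0.4 * desc_score  # arbitrary weighting
        if confidence >= threshold:
            suggestions.append((term["name"], round(confidence, 2)))
    return sorted(suggestions, key=lambda s: s[1], reverse=True)

column = {"name": "cust_email", "description": "customer e-mail address"}
terms = [
    {"name": "Customer Email", "description": "the e-mail address of a customer"},
    {"name": "Order Date", "description": "the date an order was placed"},
]
print(suggest_terms(column, terms))  # the email term scores highest
```

Useful? Absolutely. Learning? Not in the slightest.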

Let’s pick apart some of our competitors’ cherished AI-powered features and shed some light on how we could turn them into true AI…

Good and bad AI use cases in data governance

Business term suggestions

Many of our competitors boast an AI-powered feature of automatic business term suggestions. In most cases, it works like this: The user creates a technical metadata object in the data catalog, such as a table or a column representation. If the tool believes this technical metadata object currently has no business metadata counterpart in the business glossary, it proposes the creation of a business term. The logic is the same as with Accurity’s suggestion engine: a new data catalog object is compared for similarity against existing business terms, and if a certain similarity score isn’t reached with at least one of them, the tool suggests creating a new term instead. Once again, no learning or decision-making is actually involved.
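Again purely for illustration, a few lines of Python (hypothetical names, arbitrary threshold) capture the entire “intelligence” of such a feature: if no existing term is similar enough, suggest a new one:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def propose_action(object_name, term_names, threshold=0.7):
    """Pure thresholding, no learning: link to the best-matching term
    if it is similar enough, otherwise suggest creating a new one."""
    scores = {t: similarity(object_name, t) for t in term_names}
    best = max(scores, key=scores.get) if scores else None
    if best and scores[best] >= threshold:
        return ("link-existing-term", best, round(scores[best], 2))
    return ("create-new-term", object_name, round(scores.get(best, 0.0), 2))

print(propose_action("CUSTOMER_ADDRESS", ["Customer Address", "Invoice Total"]))
print(propose_action("TXN_RISK_FLAG", ["Customer Address", "Invoice Total"]))
```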

Some of our more ingenious colleagues might even propose a definition of some sort. You might think there must be some degree of learning going on there, but you would be mistaken. The proposed definition is created by cannibalizing existing definitions of business terms whose connected data catalog objects share a high degree of similarity with the one currently in question; that similarity is based on shared tags, data classes, or types. Once again, no actual learning has taken place. All the feature does is copy and paste bits of existing text together; it does not create entirely new text based on something it has learned.
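A sketch of that copy-paste mechanism might look like the following (hypothetical structures; real implementations are certainly more elaborate, but the principle is the same):

```python
def draft_definition(new_object_tags, catalog, donors=2):
    """Stitch a 'definition' together from the definitions of objects
    that share tags with the new one. Pure text reuse, no learning."""
    ranked = sorted(catalog, key=lambda o: len(new_object_tags & o["tags"]),
                    reverse=True)
    snippets = [o["definition"].split(".")[0] + "."
                for o in ranked[:donors] if new_object_tags & o["tags"]]
    return " ".join(snippets)

catalog = [
    {"tags": {"pii", "contact"},
     "definition": "A customer's phone number. Maintained by support."},
    {"tags": {"finance"},
     "definition": "The gross invoice amount. Reported monthly."},
]
print(draft_definition({"pii", "email"}, catalog))  # recycles the first definition
```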

A proper use of AI in a business term suggestion feature could be achieved by employing an adjacent discipline: natural language processing (NLP). NLP algorithms are able to scour through chunks of unstructured data, such as a company’s internal documentation, BI reports, etc., in order to detect ontologies relevant to the company and its data management. Just to be able to do that, the algorithm needs to learn what is and isn’t relevant; right at the beginning, we see an actual learning mechanism. With each new business term defined, it becomes more accurate in its suggestions, and human input is absolutely necessary in the beginning phase if the algorithm is ever to get anything right. But at the end of the process, you are left with a feature that can detect relevant ontologies on its own and suggest their definitions, not based on similarity to other existing definitions, but by having learned the context of what the user regards as relevant and irrelevant. Learning, solving tasks, and decision-making: AI wrapped up in a neat little package.
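A minimal sketch of that learning step, assuming a scikit-learn environment (the NLP pipeline that extracts candidate phrases from documents is left out, and all names and example data are hypothetical):

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# A stateless vectorizer lets the model learn incrementally as feedback arrives.
vectorizer = HashingVectorizer(n_features=2**16)
model = SGDClassifier()

# Candidate phrases extracted from company documents, labeled by the user:
# 1 = accepted as a relevant business term, 0 = rejected as noise.
candidates = ["customer lifetime value", "page 3 of 12",
              "invoice due date", "lorem ipsum placeholder"]
labels = [1, 0, 1, 0]

# Each batch of user feedback nudges the model's weights: the learning step.
model.partial_fit(vectorizer.transform(candidates), labels, classes=[0, 1])

# Over time the model generalizes to phrases it has never seen before.
new_candidates = ["customer churn rate", "table of contents"]
print(model.predict(vectorizer.transform(new_candidates)))
```

Every accept or reject decision flows back into partial_fit, which is exactly the feedback loop the copy-paste approach lacks.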

Anomaly detection

Many data governance tools claim to be able to detect anomalies, and almost all of them claim to do it using machine learning. This is most beneficial in the area of data quality monitoring, where it is important to detect problems with the quality of the data a company’s management depends on for making informed decisions. In many cases, this is achieved by aggregating past results of relevant data quality rules. With enough data about the results, an algorithm can calculate the average result range and paint two lines onto a result visualization that describe the boundaries that must not be crossed. If they are crossed, the tool simply raises an alert that an anomaly has happened and asks for an investigation.
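Those two lines typically boil down to something like this minimal sketch (hypothetical numbers; a three-sigma band is a common choice):

```python
import statistics

def static_bands(history, k=3):
    """Classic mean +/- k standard deviations: pure statistics, no learning."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return mean - k * stdev, mean + k * stdev

# Past pass rates (%) of a data quality rule
history = [98.2, 97.9, 98.5, 98.1, 97.7, 98.4]
low, high = static_bands(history)

new_result = 91.0
if not (low <= new_result <= high):
    print(f"Anomaly: {new_result} is outside [{low:.2f}, {high:.2f}]")
```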

However, how can such a functionality tell an anomaly apart from a newly normalized development? It can only change its tolerance thresholds by accepting enough newly recorded anomalous values. The problem is that this algorithm only uses statistics to calculate a probable threshold. It isn’t learning anything, nor is it adapting; it only ingests new variables and updates a calculation. This is a very poor way of accounting for the seasonality of results, and it will leave the user plagued with falsely assumed anomalies for half of the year.

The true AI approach to anomaly detection comes in the form of predictive anomaly detection. Yes, once again, the algorithm ingests past result data, but instead of making a statistical inference, it learns the trends and context of the results. In the end, the anomaly tolerance threshold is not two lines but two curves created to reflect result seasonality. And with learned seasonality comes the algorithm’s ability to detect not only result-based anomalies but also anomalies signaled by the unusual timing of a result’s arrival or an unusual size of the checked dataset.
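As a deliberately oversimplified illustration of the difference, the sketch below replaces the two flat lines with per-month bands (hypothetical data; a real predictive approach would fit an actual forecasting model that learns trend and seasonality, not just per-bucket statistics):

```python
from collections import defaultdict
import statistics

def seasonal_bands(results, k=2):
    """Tolerance bands computed per season (here: per month), so the
    threshold becomes a curve following the expected seasonal shape."""
    by_month = defaultdict(list)
    for month, value in results:
        by_month[month].append(value)
    return {m: (statistics.mean(v) - k * statistics.stdev(v),
                statistics.mean(v) + k * statistics.stdev(v))
            for m, v in by_month.items() if len(v) > 1}

# (month, pass rate %) over two years; December legitimately dips each year
results = [(m, 98.0) for m in range(1, 12)] + [(12, 90.0)] \
        + [(m, 97.8) for m in range(1, 12)] + [(12, 90.5)]

low, high = seasonal_bands(results)[12]
print(f"December band: [{low:.2f}, {high:.2f}]")  # ~90% is normal here
```

A single static band over this same data would treat every December as anomalous, while the seasonal curve expects the dip.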

The main difference between these two features that look so alike is that with a truly AI-based approach, you are not simply switching numbers in an equation; instead, you can come to trust that the algorithm knows what to expect and what not to expect.

There is still plenty of time

There are many other examples one could name. The line between traditional statistics and AI/ML is still debated, with researchers arguing over the very meaning of machine learning. But that only tells us one thing: AI as a technology is still in its infancy. Generative AI has given us just the first glimpses of how this field of research sets itself apart from other methods of automation.

The more applications for AI are out there, the clearer a picture we will have of what it means to have an artificial creation take over repetitive and menial data governance tasks, giving us more freedom to perfect the human element of data governance, which becomes ever more important as more business users get involved.

Accurity will not sit this race out. In fact, the examples above were not chosen at random. Both are a part of Accurity’s roadmap. And we want to make damn well sure that we are using true AI before shouting it out into the world.

If you are interested in what else is on our roadmap or just keen to see our features in general, feel free to contact us with questions or schedule a live demo. We can tailor it to fit your use case.

David Vavruska
Product Analyst