Originally published 7 October 2009
The movie "The Matrix" was a big hit a few years ago and remains popular today, partly because it had a clever plot whereby what people perceived as reality was actually completely simulated by computers. In the "real reality" of the film, the computers had enslaved humans and provided the simulated reality to keep them quiet.
Based on anecdotal conversations with colleagues, the movie appealed for more than its entertainment value. Somehow it touched on an unease within us all, one that perhaps consists only of vague worries about where the information age is taking us. More specifically, at least for those of us who work with data, it is possible to see grounds for asking an odd question whose answer is usually taken for granted: "Is this real?"
If there is any data that should be solidly connected to reality, then surely it will be master data. Master data is about solid, physical things like people and products and we can easily understand what they are. It seems silly to ask if such master data is real – it just has to be.
However, before we can answer this question we need to consider what data is. The currently accepted definition in the data management profession is:
Data is the stored representation of a fact.
That data is persisted and represents something else is quite correct, but just what a fact is has never been determined. I have seen data that clearly represents lies, and data which is incoherent due to data quality problems.
Nevertheless, let us accept this definition, and take a closer look at master data management (MDM). As noted, master data management is concerned with the representation of things. In the real, physical world around us, we see things like people, airplanes, toasters, and so on. These are individual. There are other things that are not individual, like air, water, chemicals and the like. However, these can be measured, and they can even be packaged so that each package has an individual identity, which makes them more amenable to master data management.
One of the central problems of master data management, which is often poorly stated, is the need to determine if one individual thing is the same as another individual thing. But the only way we have to do this is by matching records, and a record is not the same as the thing it represents. Unlike in The Matrix, our danger is that we confound two "realities" instead of recognizing them as distinct.
This can be shown when we think of individual things. An individual thing is always an instance of a concept. John Doe and Richard Roe are instances of human beings. We represent them as individual records in a database, as shown in Table 1, and hopefully relate them to the original instances.
| Instances in Reality | Master Data in Application A | Master Data in Application B |
|---|---|---|
| John Doe | Record for John Doe | Record for John Doe |
| Richard Roe | Record for Richard Roe | Record for Richard Roe |
In master data, one instance in the real world is represented by a record. The instance has reality, and the record represents the instance. Of course, there are all kinds of ways a record can be implemented, but for master data, let us keep to a single record in a normalized relational database for now.
All this might seem blindingly obvious. However, a major problem arises when we fail to keep the distinction between the real instance and the record that represents it. We tend to speak of the data as if it "is" the instance, as if it has a life of its own. Yet the data is not "real" in the same sense that the instance is.
Let us look a little more closely at instances and records, and their interactions. Suppose we have a master data application for the Division of Motor Vehicles (DMV), and that John Doe has to visit a DMV office to get his driving license renewed. Table 2 shows some examples of the kinds of questions we can ask as we relate instance and record.
| | Instance | Record |
|---|---|---|
| Instance | 1. Is that John Doe going into the DMV? | 2. Mr. Doe, I have to ask you for some personal details to verify we are dealing with you. |
| Record | 3. Please be patient, Mr. Doe, as I try to find your record. | 4. Is this record in the DMV database for John Doe the same as the one in the outstanding arrest warrants database? |
Table 2: Matrix of Comparisons of Instance to Record
The first question may be asked by a prying neighbor who thinks he sees John Doe going into the DMV. This does not involve data at all, just the neighbor recognizing John Doe.
In item 2, where the instance (John Doe) has to be matched to a database record, there is a need to verify that the instance and the record match. Personal details, or photo ID, or biometrics may be used in this process. Here we are validating the instance against the record: the record is assumed to be true, and the instance may be an impostor.
In item 3, we try to match a record to the instance. The instance is the source of truth in this process, and it is necessary to create a representation of the instance in terms of search criteria. We are typically a lot less patient in this situation than when an instance is validated against a record.
Finally, we have one of the big problems in MDM – figuring out if two records are for the same instance. But in this, we are not dealing at all with the instance in question, we are only dealing with data that represents the instance.
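To make item 4 concrete, here is a minimal sketch of a record-to-record comparison. The `PersonRecord` type, its fields, and the sample values are hypothetical, chosen only for illustration. What matters is that the function sees nothing but stored attributes. John Doe himself never enters the comparison.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PersonRecord:
    """A stored representation of a person (hypothetical fields)."""
    source: str          # which database holds this record
    full_name: str
    date_of_birth: date


def normalize(name: str) -> str:
    # Lower-case the name and collapse runs of whitespace.
    return " ".join(name.lower().split())


def records_agree(a: PersonRecord, b: PersonRecord) -> bool:
    """True when the compared attributes agree.

    This is a statement about two records only. It cannot prove that
    both records represent the same real person.
    """
    return (
        normalize(a.full_name) == normalize(b.full_name)
        and a.date_of_birth == b.date_of_birth
    )


# Hypothetical sample data for the two databases in item 4.
dmv = PersonRecord("DMV", "John  Doe", date(1970, 1, 1))
warrant = PersonRecord("Warrants", "JOHN DOE", date(1970, 1, 1))
other = PersonRecord("Warrants", "John Doe", date(1982, 5, 17))
```

With this sample data, `records_agree(dmv, warrant)` is true and `records_agree(dmv, other)` is false. Even the true result only says that the data agrees: two different people with the same name and birth date would produce it as well.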
One way to determine if two records are for a given instance is to match them to the instance. This is almost always impractical, though it can be done in a few important cases. Since the instances are nearly always absent from their records, the records can take on the mantle of reality when we try to match them. We point to a record and talk about "this customer" or "that product" and so on. No doubt this is a convenient shorthand, but we need to be careful because representation can happen in multiple valid ways, which can cause problems for MDM.
Not long ago I had to travel to Southern California. The airport I was traveling to had the code "SNA," but had three names: "Orange County," "Santa Ana" and "John Wayne." Initially, I did not know this, and I found I had to do a lot of cross-checking when I bought my tickets, looked at the departure monitors, and so on. Eventually, I learned what was going on, mainly because the airport code was fixed. Worse situations can arise. Suppose a product is made in the USA and France, and is identified differently and given a different name in each location. The enterprise will have two valid records for the same product, one from each location, but we have no basis for thinking that we can merge them based on identifier and name alone.
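The airport case was solvable because one identifier stayed fixed while the names varied. A minimal sketch of that idea follows, with a hypothetical alias table and helper functions. The table is an assumption for illustration: someone has to create and maintain it, and that upkeep is governance work that no software does on its own.

```python
from typing import Optional

# Hypothetical, manually governed alias table: every name seen for the
# airport maps to the one code that stays fixed.
AIRPORT_ALIASES = {
    "orange county": "SNA",
    "santa ana": "SNA",
    "john wayne": "SNA",
}


def resolve_airport(name: str) -> Optional[str]:
    """Return the fixed code for a known name, or None if the name is unknown."""
    return AIRPORT_ALIASES.get(name.strip().lower())


def same_airport(name_a: str, name_b: str) -> bool:
    """Two names refer to the same airport only if both resolve to the same code."""
    code_a = resolve_airport(name_a)
    code_b = resolve_airport(name_b)
    return code_a is not None and code_a == code_b
```

Here `same_airport("John Wayne", "Santa Ana")` is true because both names resolve to "SNA". A name missing from the table resolves to nothing and so never matches. That is the position the two product records are in until someone records that they represent the same product.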
Of course, there is often some kind of pattern matching that can be done, such as fuzzy matching of product descriptions, to see if a substring in one record is present in another record. The problem is that these algorithms can produce false positive matches and can also miss records that should match (false negatives).
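Both failure modes are easy to reproduce. The sketch below uses a simple token-overlap (Jaccard) score as a stand-in for whatever fuzzy matching a real MDM tool applies. The product descriptions and the 0.6 threshold are invented for illustration.

```python
import re


def tokens(description: str) -> set:
    # Split a description into lower-case alphanumeric tokens.
    return set(re.findall(r"[a-z0-9]+", description.lower()))


def similarity(a: str, b: str) -> float:
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def is_match(a: str, b: str, threshold: float = 0.6) -> bool:
    # The threshold is an arbitrary assumption, not a recommended value.
    return similarity(a, b) >= threshold


# False positive: two different products whose descriptions differ by one token.
two_slice = "Stainless steel toaster 2 slice"
four_slice = "Stainless steel toaster 4 slice"

# False negative: one product described in English and in French.
usa_record = "2-slice toaster, stainless"
france_record = "Grille-pain inox 2 tranches"
```

With these inputs, `is_match(two_slice, four_slice)` is true although the products differ, and `is_match(usa_record, france_record)` is false although they are the same product. Changing the threshold only trades one kind of error for the other.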
Records are not like real things. They are only representations of real things. They cannot be created in the absence of governance with the expectation that they can be matched up later in the same way that the prying neighbor saw John Doe going into the DMV. No technology can guarantee that. Yet many enterprises expect MDM technology to make that kind of magic happen. If we hold on to the distinction between reality and representation, between the instance and the record, master data management will become easier. We are already living in something like The Matrix, and we should not pretend that we are not.
Recent articles by Malcolm Chisholm