Originally published 12 July 2011
One of the earliest data quality "cleanup" projects I worked on involved remediation of a project tracking database. The application providing the front end for this database allowed users to manually update a Status Code for each project – which, roughly speaking, was "Proposed," "Active," or "Closed." However, the application also allowed users to update fields related to these Status Code values, such as Project Start Date and Project End Date, and the values in these date fields could also be set by application logic. As the reader can easily guess, the values in the Status Code were often inconsistent with the values in Project Start Date and Project End Date.
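The kind of inconsistency described above can be detected mechanically. The following is a minimal Python sketch, not the logic of the original application: the field names (status_code, start_date, end_date) and the rules tying each Status Code value to the two dates are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Project:
    # Hypothetical fields; the real database's column names are not known.
    project_id: int
    status_code: str  # "Proposed", "Active", or "Closed"
    start_date: Optional[date] = None
    end_date: Optional[date] = None


def status_inconsistencies(p: Project, today: date) -> List[str]:
    """Return the assumed consistency rules that this project record violates."""
    problems = []
    started = p.start_date is not None and p.start_date <= today
    ended = p.end_date is not None and p.end_date <= today
    if p.status_code == "Proposed" and started:
        problems.append("Proposed, but Project Start Date has passed")
    if p.status_code == "Active" and not started:
        problems.append("Active, but no Project Start Date on or before today")
    if p.status_code == "Active" and ended:
        problems.append("Active, but Project End Date has passed")
    if p.status_code == "Closed" and not ended:
        problems.append("Closed, but no Project End Date on or before today")
    return problems
```

Running such a check across every project would give a first estimate of how far the manually maintained Status Code has drifted from the dates. Which of the two is actually wrong in any given record remains a question for the business.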
This experience made me interested in "Status Code," which seems to crop up in many situations in data management. It seemed odd, at least to me, that the same abstract attribute – Status Code – appeared in so many different entity types, and that it was so dependent on time. Later, in dealing with historical data, I ran into the related, and perhaps more important, concept of state. An instance of an entity type can have a different Status Code value at different times, but ultimately each of these values reflects the instance being in a different state. But just what is a state, and what is status? How are they special? Do they differ from other kinds of attributes in ways that imply a need to devote particular attention to managing them?
One of my favorite logic authors is George Hayward Joyce. He defined state as "a condition of stable being." A better way of expressing this to a modern audience might be "a condition of stable existence." But even that description is not all that informative. I think that the "stable existence" part means that an instance of an entity type has a fixed set of attributes, and that this fixed set of attributes is part of the overall definition of the entity type. However, the attributes appear in different groups at different points in the life of an instance of the entity type.
An example outside of data management can make this clearer. My university degrees are in zoology, and I have always been fascinated by insects. Certain orders of insects, such as butterflies, pass through four completely different states in their life cycles: egg, larva, pupa, and adult. An adult butterfly has tubular (sucking) mouthparts and four wings, whereas the larva (caterpillar) has biting (mandibular) mouthparts and no wings. Thus, each state of the butterfly has its own particular attributes, some of which are not found in the other states. Within each state, the attributes may change their determinations (equivalent to values in data). For instance, a caterpillar will grow in size and its mouthparts will get larger. As an adult butterfly ages, its wings will get battered and progressively lose their microscopic scales. However, these changes in value are not the same as the presence or absence of specific attributes, which is what distinguishes one state from another.
Changes of state also need to be explained. In a change of state, there is a transition in which there is no stability. If there were stability, there would be an intermediate state. It is therefore a contradiction to use the definition of state discussed here to speak of a "state of transition" because in a transition there is no stable condition of existence – in other words, no state. Another feature of transitions in data management is that for the entity types whose data we manage, transitions often seem to be instantaneous. This is interesting and also, I think, simplifies the data management tasks. However, transitions between states cannot be guaranteed to be instantaneous for every entity type. It is not clear to me what data we might want to capture about a transition, other than when it began and ended and what (or who) caused it to happen. We cannot deal with this problem here, but it is certainly something for data managers to think about.
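As a starting point for thinking about what a transition record might hold, here is a minimal Python sketch that captures only the items mentioned above: when the transition began and ended, and what or who caused it. The class and field names are hypothetical, and the sketch treats "instantaneous" as simply meaning that the begin and end timestamps are equal.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Transition:
    # All names are hypothetical, chosen only for illustration.
    entity_id: int
    from_state: str
    to_state: str
    began_at: datetime
    ended_at: datetime
    caused_by: Optional[str] = None  # user or process that triggered the change

    def __post_init__(self):
        # A transition that ends before it begins is rejected outright.
        if self.ended_at < self.began_at:
            raise ValueError("a transition cannot end before it begins")

    @property
    def is_instantaneous(self) -> bool:
        # Assumption: equal timestamps mean the transition took no time.
        return self.began_at == self.ended_at
```

If most transitions for an entity type turn out to be instantaneous, a single timestamp per transition may be all that is needed. Keeping both timestamps leaves room for the entity types where that assumption does not hold.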
There is yet another issue that can impinge on the idea of state – evolution. This is when instances of an entity type appear with unexpected new attributes, or unexpected missing attributes. An entity type with states whose attributes are known in advance is a complete concept. When evolution occurs and instances appear with new or missing attributes, this is something different from a known state. It might seem rare, but it happens more often in enterprise data management than is commonly realized. The reasons are well known. IT typically sees itself as distinct from the business, and the business is often reluctant to get IT involved in any activity because of IT's reputation for being difficult to work with and for poor delivery. Yet the enterprise exists in an environment of constant change – driven by markets, opportunities, inventions, technology, regulation, and other factors. Many of these changes are small and tend to get operationalized in information systems in an ad hoc manner by business users. The applications may not change, but the data is adapted to these new requirements. For instance, a new record may appear in a Customer Type code table to signify a new type of customer. There is much we can discuss about this phenomenon, but the point here is that evolution is not the same as state. State is known in the definition of the entity type, and is quite different from the appearance of new attributes or disappearance of existing attributes in evolution. Management of state needs to be differentiated from the management of evolution. Both can be planned for to some extent, but management of evolution is definitely more difficult.
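One modest way to notice evolution of the Customer Type kind is to compare the code values actually present in the data with the values the entity type's definition says should exist. The Python sketch below does this; the function name and the sample Customer Type values are invented for illustration.

```python
from typing import Iterable, List, Set

# Hypothetical values taken from the documented definition of Customer Type.
KNOWN_CUSTOMER_TYPES: Set[str] = {"Retail", "Wholesale"}


def unexpected_codes(observed: Iterable[str], known: Set[str]) -> List[str]:
    """Return code values present in the data but absent from the definition."""
    return sorted(set(observed) - known)
```

For example, calling unexpected_codes(["Retail", "Online", "Retail"], KNOWN_CUSTOMER_TYPES) reports "Online" as a value that has appeared without being part of the known definition. Deciding what the new value means still requires talking to the business users who introduced it.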
Yet another question about state arises when states are compared to subtypes. In data modeling terms, it might seem that the different states of an entity type are just like subtypes. In some respects they are. There may be attributes shared by all states, just as a supertype holds the attributes shared by all the subtypes. Each state will have the same primary key, just as subtypes have the same primary key as their supertype. From the data modeler's perspective, therefore, states and subtypes can amount to the same thing. However, this is misleading. In a true supertype-subtype pattern, a single instance is either always confined to just one subtype, or changes from subtype to subtype in an unpredictable manner. For instance, Bank Account can have the subtypes Checking Account and Savings Account, and each account will be only one of these subtypes. But with states, we expect an entity instance to travel through most – if not all – of the states we have defined, and to do so in a predetermined manner. In the example given above, a project would be "Proposed," then "Active," and then "Closed." In terms of data storage, it might be possible to say that states and subtypes are the "same thing," but this does not capture the point that there are essential differences between states and subtypes. Here we get into the argument about the scope of data modeling. Good data modelers will want to capture as much information about states as possible, but the rules for how transitions occur are probably going to be textual rather than graphic. More important to the data modeler is the association of attributes with states. If individual states are not manifested as distinct entity types in a data model (i.e., there is just one entity type), then the modeler ought to carefully identify which attributes are associated with which state. Again, this is not usually supported by the graphical features of data modeling methodologies and puts additional pressure on the data modeler.
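Where a modeling tool cannot express these rules graphically, they can at least be written down in a form that is checkable. The Python sketch below records, for the project example, which transitions are permitted and which attributes are expected in each state. The strictly linear order of states and the attribute names are assumptions for illustration, not a description of the original project tracking database.

```python
from enum import Enum


class ProjectState(Enum):
    PROPOSED = "Proposed"
    ACTIVE = "Active"
    CLOSED = "Closed"


# Predetermined order of travel: each state lists the states it may move to.
# Assumption: a strict Proposed -> Active -> Closed sequence.
ALLOWED_TRANSITIONS = {
    ProjectState.PROPOSED: {ProjectState.ACTIVE},
    ProjectState.ACTIVE: {ProjectState.CLOSED},
    ProjectState.CLOSED: set(),
}

# Attributes expected to be populated in each state (hypothetical names).
STATE_ATTRIBUTES = {
    ProjectState.PROPOSED: {"project_name"},
    ProjectState.ACTIVE: {"project_name", "start_date"},
    ProjectState.CLOSED: {"project_name", "start_date", "end_date"},
}


def can_transition(current: ProjectState, target: ProjectState) -> bool:
    """True if the assumed rules allow moving from current to target."""
    return target in ALLOWED_TRANSITIONS[current]


def missing_attributes(state: ProjectState, record: dict) -> set:
    """Attributes the state expects that the record leaves empty."""
    return {a for a in STATE_ATTRIBUTES[state] if record.get(a) is None}
```

This is exactly what a subtype structure does not carry: the transition table encodes the expected path through the states, which has no counterpart for Checking Account and Savings Account.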
Unfortunately, space permits only a broad overview of state. In particular, Entity Life History (ELH) has not been discussed. ELH was introduced by Michael Jackson in the 1980s and has been written about by other authors such as David Hay. One important aspect of ELH is that it does have a graphical method of depiction. Another area not covered here is the relationship between state and history, e.g., do snapshots adequately capture state? Yet even at the high level at which we have looked at state, there are still many lines of investigation that could produce useful results for data management.
Recent articles by Malcolm Chisholm