What Is Data?

Originally published 3 February 2011

Data management is in demand as never before, and the realization that data is an enormous asset to enterprises seems to be accepted everywhere. Yet we do not always have a firm grasp of what data is. Perhaps, therefore, it is worth considering how to define "data," given that the data management profession has garnered some authority and is now being treated with respect.

But why bother? After all, don't we go to work every day and find the data there waiting for us as an objective reality we have to work with? I can only give my personal view, which is that data belongs to a different order of existence than do material things. As such, it does not regulate itself in the same way that the laws of nature regulate material things. Data is not subject to the laws of motion, thermodynamics, gravity, and so on. Instead, humans have to put a lot of effort into managing data to prevent it from descending into chaos. Simply put, data does not look after itself very well, and needs a lot of tending. This leads to the conclusion that we need to understand data to manage it properly. What follows is an attempt to begin defining data.


The word data is the plural of the Latin word datum, which means "something given." This sounds a bit odd until you realize "something given" relates to traditional logic. The basic units of traditional logic include terms, propositions, and arguments. Terms signify concepts, and have definitions that can be clear or unclear. Propositions are statements composed of a subject and a predicate and can be true or false. Arguments are ways of processing propositions and can be valid or invalid. Furthermore, traditional logic is divided into material logic and formal logic. Formal logic is mostly about the rules for reasoning and makes up the largest part of traditional logic. Material logic is about the content of the propositions that are being processed in logic. Hence the content of propositions has to be taken as "something given" in order to move on with formal logic. I use the following syllogism as an example:

 All men are mortal
 Socrates is a man
 Therefore, Socrates is mortal

We have to take it as given that no man is immortal and that the Socrates mentioned here is a man and not, for example, a cat or some kind of spirit. Thus the first two propositions (the first two lines) contain data—something given. The last line is the conclusion reached by formal logic.
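
The role of the "given" propositions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `given_facts`, `given_rules`, and `derive` are invented for this example and do not come from any library. The point is that the reasoning step processes whatever it is given and never asks whether the givens are true.

```python
# The "given" propositions (material logic): accepted as true, not checked here.
# All names below are hypothetical and exist only for this illustration.
given_facts = {("man", "Socrates")}   # "Socrates is a man"
given_rules = [("man", "mortal")]     # "All men are mortal": man(x) implies mortal(x)


def derive(facts, rules):
    """Apply every rule to the facts repeatedly until nothing new appears."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            for predicate, subject in list(derived):
                if predicate == premise and (conclusion, subject) not in derived:
                    derived.add((conclusion, subject))
                    changed = True
    return derived


conclusions = derive(given_facts, given_rules)
print(("mortal", "Socrates") in conclusions)  # True
```

If the given fact were `("cat", "Socrates")` instead, the same procedure would simply not conclude that Socrates is mortal. The formal step is indifferent to whether the givens are correct, which is the sense in which they are data.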

This long, historical connection between data and logic is what seems to have made data something special in the first place. There is also a strange echo of it today in data management. Just like the logicians, we data managers do not see it as our task to inquire as to whether the data is right or wrong—we are much more concerned about ensuring that we process it correctly and manage it well. Of course, we can test for data quality and audit data to detect inconsistencies or even fraud. But we do not check if the data really represents what it purports to represent. If that is done at all, it is not done by data managers. 

The Transition to the Modern Concept

With the Renaissance, science turned to contemplate nature more than the human mind, and the "data" of the logicians came to be understood more as the recorded "facts" of the scientists. At this point, we need to understand what "fact" means. It is a term that comes from "facta gestae," which is Latin for "things done," meaning deeds, the raw material of history. Today there are other understandings of what "fact" means, but this was what was originally meant.

Scientific data (scientific "facts") is about observations and measurements. This kind of data can indeed be found today in the databases that data managers look after, although it is usually in a minority. It is very important in some areas, such as petroleum exploration and production, and some aspects of manufacturing. However, it is fair to say that most modern data is not of this type, and so it is not really possible to define data solely as scientific data.

The Modern Concept of Data

So what is data? According to Wikipedia, data is "information in a form suitable for use with a computer." My opinion is that trying to define data in terms of information is always going to be a nonstarter. One reason is that "information" is more difficult to define than "data," and you cannot include a term more obscure than the term you are trying to define in its definition. Another reason is that I believe that the definition of "information" depends upon the definition of "data," so you get a circular definition if you include "information" in the definition of data. I am willing to be proven wrong on all this, but for now I would like to define data without having to bring in "information."

The starter definition that I use, although I can no longer find the source for it, is that data is "the stored representation of a fact."

That data is "stored" implies that it is housed in a physical medium. This also distinguishes data from a signal, which involves the transmission of encoded information from a sender to a receiver. This is important because perception is often referred to as consisting of "sense data." The problem here is that a receiver consumes the signal it receives, and the part of the signal that has been consumed is no longer available for further consumption. By contrast, data is available in a physical medium that can be re-read, and the act of reading the data does not consume it. Hence, a signal is not like data, and the distinction arises because data is stored in a physical medium.
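
A loose programming analogy may make the contrast concrete. In the Python sketch below, a generator stands in for a signal (each value is delivered once and is then gone) and a list stands in for stored data (it can be read any number of times). The function `signal` and the variable `stored` are hypothetical names chosen for this illustration, and the analogy is an assumption of this example, not a claim about how signals are physically transmitted.

```python
def signal():
    """A stand-in for a signal: each reading is delivered once, then gone."""
    for reading in (3, 1, 4):
        yield reading


# A stand-in for stored data: the same readings held in a re-readable medium.
stored = [3, 1, 4]

stream = signal()
first_pass_signal = list(stream)    # the receiver consumes the signal
second_pass_signal = list(stream)   # nothing is left to consume

first_pass_data = list(stored)      # reading the stored data...
second_pass_data = list(stored)     # ...does not use it up

print(first_pass_signal, second_pass_signal)  # [3, 1, 4] []
print(first_pass_data, second_pass_data)      # [3, 1, 4] [3, 1, 4]
```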

Data clearly involves representation because data has significance. It refers to something else. It is a sign that points to something other than itself. Here we have an important clue because it would seem that signs are the supertype of data, and to get a good definition all we need is to differentiate data from other kinds of signs. So what is a sign?  Charles Sanders Peirce defined "sign" in Baldwin's Dictionary of Philosophy as "Anything which determines something else (its interpretant) to refer to an object to which itself refers (its object) in the same way, the interpretant becoming in turn a sign, and so on ad infinitum."

Peirce may have been one of the most intelligent Americans who ever lived, but he is not exactly an easy read. He goes on to subtype signs into icons, indexes, and symbols. An icon is a sign that looks like what it signifies (e.g., a picture of a flame looks like fire). An index is a sign that does not need an interpretant (e.g., smoke always signifies fire). A symbol is a sign that does require an interpretant to use a convention to understand it (e.g., words are just noises we produce in the air that have to be understood to signify anything).

It is fairly obvious that data corresponds to symbols in Peirce's sense. However, its interpretant is not human consciousness, but machines. Data in the modern sense is never intended to be given directly to a human. We cannot even sense the north and south poles of the individual units of magnetic data storage; only a machine can do that. On the other hand, computers are not interpretants in the same way as humans are. Humans use a sign to identify a concept. Computers can only relate one sign to another according to rules that humans give them.

At this point we have thought through some of the elements in the starter definition of data being "the stored representation of a fact." More elements remain, especially whether we are even justified in having the term "fact" in the starter definition. That will have to wait for another article. However, we can see that "data" has been used, and is still used, to mean many things. The "data" of data management has only one of these meanings. It is symbols stored in a medium that can be read only by some form of technology that can relate these symbols to other symbols according to rules ultimately provided by humans. Not very elegant prose perhaps, but hopefully a step closer to a definition of data.
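
The closing definition can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the country codes, the lookup table `code_to_name`, and the function `relate` are assumptions made for this example only. It shows a machine relating stored symbols to other symbols by following a rule that a human supplied, without any grasp of the concept behind either symbol.

```python
# Hypothetical stored symbols: country codes as they might sit in a database.
stored_symbols = ["US", "CA", "US", "MX"]

# A rule supplied by humans: which other symbol each stored symbol relates to.
code_to_name = {"US": "United States", "CA": "Canada", "MX": "Mexico"}


def relate(symbols, rule):
    """Relate each stored symbol to another symbol; unknown symbols give None."""
    return [rule.get(symbol) for symbol in symbols]


related = relate(stored_symbols, code_to_name)
print(related)  # ['United States', 'Canada', 'United States', 'Mexico']
```

The program never identifies what a country is. It only follows the mapping it was given, and a symbol with no rule, such as a hypothetical "ZZ", is related to nothing.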

  • Malcolm Chisholm

    Malcolm Chisholm, Ph.D., has more than 25 years of experience in enterprise information management and data management and has worked in a wide range of sectors. He specializes in setting up and developing enterprise information management units, master data management, and business rules. His experience includes the financial, manufacturing, government, and pharmaceutical industries. He is the author of the books How to Build a Business Rules Engine; Managing Reference Data in Enterprise Databases; and Definition in Information Management. Malcolm writes numerous articles and is a frequent presenter at industry events. He runs the websites http://www.refdataportal.com; http://www.bizrulesengine.com; and http://www.data-definition.com. Malcolm is the winner of the 2011 DAMA International Professional Achievement Award.

    He can be contacted at mchisholm@refdataportal.com.
    Twitter: MDChisholm
    LinkedIn: Malcolm Chisholm
