Shewhart, Deming and Data

Originally published 4 November 2009

I have just finished re-reading Walter A. Shewhart's 1939 book Statistical Method from the Viewpoint of Quality Control. Mine is the 1986 edition, which has a foreword by W. Edwards Deming. Shewhart, a Bell Labs man, pioneered quality control and was a major inspiration to Deming (who met him at Bell Labs). Deming is well known in his own right for his contributions to such things as manufacturing quality in post-war Japan and the foundations of Six Sigma. While these and other accomplishments have made Deming justly famous, he was never shy about recognizing the contribution of Shewhart to science and industry, and the influence of Shewhart on his own career. Indeed, Deming may to some extent have communicated some of Shewhart's ideas in ways that were easier to understand and apply than Shewhart's own presentation of them.

Of course, the main interest that data management has in Shewhart and Deming is to learn from them how we can improve data quality. And indeed, it is not difficult to find mention of these names by authors such as Larry English, Danette McGilvray, and others who specialize in data quality. Yet, a brief reprise of some of the ideas in Shewhart's little 1939 book is in order, along with Deming's comments, because they seem to relate not just to data quality, but to data itself.

There is No True Value of Anything

One of the most astonishing passages in Deming's foreword is this:

There is no true value of anything. There is instead a figure that is produced by application of a master or ideal method of counting or measurement .... There is no true value of the speed of light; no true value of the number of inhabitants within the boundaries of (e.g.) Detroit. A count of the number of inhabitants of Detroit is dependent upon the arbitrary rules for carrying out the count. Repetition of an experiment or of a count will exhibit variation. Change in the method of measuring the speed of light produces a new result.

This is a good deal to think about, but one of the most valuable lessons that can be learned from it is that we need to set expectations about data. Too often this is not done. When a data warehouse is built, the business sponsors may not even stop to think about what its limitations may be. In my experience, they simply seem to think that a data warehouse must be 100% accurate and complete, and that any deviation from this is due to technical problems caused by someone or something in IT. Too often, the team building the warehouse fails to set any kind of expectation about the accuracy of what the data warehouse contains. If we think in terms of just the technology involved, then we have no reason to believe that it will not work perfectly. But if we think about the data itself, we know in our hearts that Deming's admonition is true. It might seem strange that we have to tell our users that a technically perfect environment will only be, say, 92% accurate on average, but that is the reality.

In case anyone should think that Deming did not know what he was talking about because he was too far removed from data, it should be remembered that he was involved in the 1940 U.S. Census and the 1951 Japanese Census. In the former, he introduced sampling techniques that greatly improved error rates.

There is one caveat about Deming's comments, however. They seem to echo Heisenberg's Uncertainty Principle, and what they have in common with that famous precept is the involvement of measurement. Measurement is itself a very complex subject, and we do not have space to deal with it here. However, although measurement predominates in scientific data, it does not always do so in the enterprises where data management professionals work. We deal with insurance policies, trades, employee actions, orders, and so on. These are not material objects like cars, telephones and people. Deming and Shewhart focused on manufactured products, where measurement is vital. Much of the data we deal with seems to belong to a different realm, and while the spirit of Deming's comment seems to apply to it, the detail of what he is telling us may be only partially relevant.

The Order of Production of Data

Shewhart's book, just 150 or so pages long, is packed with all kinds of implications for data management. One concept that Shewhart introduces early on is the order of production of data. This is of vital importance for Shewhart. It can be illustrated as follows. Suppose we are testing a printer cartridge to see how many pages it will print. Suppose we run 2,000 pages through it and find that after page 1,975 the print quality has deteriorated to the point where it is unacceptable. Now, if we look at the results as a whole we are forced to say that 1,975 pages out of 2,000, i.e., 98.75% of pages, are printed perfectly. This implies that our printer cartridge will continue to print for all eternity with a 1.25% error rate. Of course this is nonsense. After page 1,975, every page that comes out is unacceptable.

This may seem blindingly obvious, but I have rarely seen it applied to data. Shewhart is saying that the order of production of data contains extremely valuable information that is lost when the results are pooled and treated as a single population. We may find that a medical billing clerk has miscoded only 0.5% of his or her work in the past 5 years. That may seem acceptable, but if all those errors have occurred in the last 10 days, then we have a serious problem.
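The contrast between a pooled error rate and one that respects the order of production can be sketched in a few lines of Python. This is only an illustration built on the printer-cartridge figures above; the function names and the window size of 25 pages are assumptions made for the example, not anything taken from Shewhart.

```python
from typing import List


def pooled_error_rate(outcomes: List[bool]) -> float:
    """Fraction of failures when all results are pooled and order is ignored."""
    return sum(1 for ok in outcomes if not ok) / len(outcomes)


def windowed_error_rates(outcomes: List[bool], window: int) -> List[float]:
    """Error rate of each consecutive, non-overlapping window, in production order."""
    rates = []
    for start in range(0, len(outcomes), window):
        chunk = outcomes[start:start + window]
        rates.append(sum(1 for ok in chunk if not ok) / len(chunk))
    return rates


# Printer-cartridge example: pages 1 to 1,975 are good, pages 1,976 to 2,000 are not.
pages = [True] * 1975 + [False] * 25

pooled = pooled_error_rate(pages)                    # 25 / 2000
by_window = windowed_error_rates(pages, window=25)   # 80 windows of 25 pages (hypothetical size)

print(f"pooled error rate: {pooled:.2%}")
print(f"error rate in the last window: {by_window[-1]:.0%}")
```

The pooled figure looks reassuring, while the windowed series shows that every error falls in the final window. The same idea applies to the billing clerk: a rate computed per week or per batch, kept in order, reveals what the five-year total hides.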

If we take Shewhart seriously on this point, then we may need to rethink the way in which we design databases and the way in which we test for data quality. For instance, in a Type 2 Slowly Changing Dimension table, do we need to know which columns have changed from one record version to the next, and when they changed? Of course we can infer this by scanning the entire table, but if we had metadata flags to represent such changes, then we would be better positioned to run this kind of query. Perhaps variance in the relative rates of change between columns could alert us to underlying problems.
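As a rough sketch of what such change metadata might look like, the Python below compares each record version with its predecessor and records which tracked columns differ. Everything here is hypothetical: the table layout, the column names (customer_id, effective_from, address, segment) and the function names are invented for illustration, and a real warehouse would more likely compute these flags in SQL or in the load process.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def changed_columns(prev: Row, curr: Row, tracked: List[str]) -> List[str]:
    """Names of tracked columns whose values differ between two versions."""
    return [col for col in tracked if prev.get(col) != curr.get(col)]


def flag_changes(versions: List[Row], key: str, order_by: str,
                 tracked: List[str]) -> List[Tuple[Any, Any, List[str]]]:
    """For each non-initial version, return (key, effective date, changed columns)."""
    flags = []
    last_seen: Dict[Any, Row] = {}
    # Sort by business key, then by effective date, so versions are in production order.
    for row in sorted(versions, key=lambda r: (r[key], r[order_by])):
        prev = last_seen.get(row[key])
        if prev is not None:
            flags.append((row[key], row[order_by], changed_columns(prev, row, tracked)))
        last_seen[row[key]] = row
    return flags


# Hypothetical Type 2 dimension rows (ISO date strings sort chronologically).
dim_customer = [
    {"customer_id": 1, "effective_from": "2009-01-01", "address": "1 Main St", "segment": "retail"},
    {"customer_id": 1, "effective_from": "2009-06-01", "address": "9 Oak Ave", "segment": "retail"},
    {"customer_id": 1, "effective_from": "2009-09-15", "address": "9 Oak Ave", "segment": "wholesale"},
    {"customer_id": 2, "effective_from": "2009-03-10", "address": "5 Elm Rd", "segment": "retail"},
    {"customer_id": 2, "effective_from": "2009-10-01", "address": "7 Pine Ct", "segment": "retail"},
]

flags = flag_changes(dim_customer, "customer_id", "effective_from", ["address", "segment"])
change_counts = Counter(col for _, _, cols in flags for col in cols)
print(flags)
print(change_counts)
```

Storing the flags alongside each version keeps the order of production queryable, and the per-column counts give a first, crude view of relative rates of change.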

Statistical Control

A more difficult concept that Shewhart introduces (and he is not always an easy read) is that of statistical control. Shewhart divides up sources of variation into two categories:

  • Chance Causes. These are causes inherent in the manufacturing system itself and are the responsibility of whoever built or manages the system. Deming calls them common causes.
  • Assignable Causes. These are due to events that are not inherent in the system itself. Deming calls them special causes.

Shewhart was greatly concerned with detecting and identifying the assignable causes. These can then be overcome and removed from the system. Once this is done, only the variation caused by the system itself remains. The system is then in statistical control. That is, its performance is statistically predictable.
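One common way to put this into practice is a control chart with limits set three standard deviations either side of the process average. The sketch below uses the standard p-chart formula for proportions, applied to a hypothetical daily data load of 1,000 records; the rejection counts are invented, the function names are assumptions, and this is an illustration of the general technique, not a reproduction of anything in Shewhart's book.

```python
import math
from typing import List, Tuple


def p_chart_limits(defects: List[int], sample_size: int) -> Tuple[float, float, float]:
    """Lower limit, center line and upper limit of a p-chart with 3-sigma limits.

    Assumes every sample has the same size and that the baseline period
    reflects only chance (common) causes.
    """
    p_bar = sum(defects) / (len(defects) * sample_size)
    sigma = math.sqrt(p_bar * (1 - p_bar) / sample_size)
    lower = max(0.0, p_bar - 3 * sigma)
    upper = min(1.0, p_bar + 3 * sigma)
    return lower, p_bar, upper


def out_of_control(defects: List[int], sample_size: int,
                   lower: float, upper: float) -> List[int]:
    """Indexes of samples whose defect proportion falls outside the limits."""
    return [i for i, d in enumerate(defects)
            if not (lower <= d / sample_size <= upper)]


SAMPLE_SIZE = 1000  # hypothetical: records loaded per day

# Hypothetical baseline: rejected records per day over ten days.
baseline = [10, 12, 9, 11, 8, 10, 13, 9, 10, 8]
lower, center, upper = p_chart_limits(baseline, SAMPLE_SIZE)

# Hypothetical new days to monitor against the baseline limits.
new_days = [11, 9, 25, 10]
flagged = out_of_control(new_days, SAMPLE_SIZE, lower, upper)

print(f"center {center:.4f}, limits [{lower:.4f}, {upper:.4f}]")
print(f"days signalling a possible assignable cause: {flagged}")
```

A point outside the limits does not identify the assignable cause; it only signals that one is worth looking for. Points inside the limits are treated as variation belonging to the system itself.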

If we do not measure the performance of a system, we have no way of knowing if it is in statistical control. Something has to measure the system to monitor that it really is in statistical control and that its performance is as predicted. The things that do the measuring (instruments, in manufacturing processes) themselves have to be in statistical control. Here we have a "who guards the guardians" kind of problem, and it is ultimately necessary for people to judge whether the entire complex is under statistical control.

Just about everyone in data management hates production environments, but Shewhart's thoughts on statistical control lead us inevitably in that direction. Most of our emphasis tends to be on design. We tend to wait for processing jobs to blow up before we (or hopefully someone else) are called in to correct the error. But surely process control for data should involve something deeper, something that approaches the monitoring of statistical control.

Alas, space permits only a sketch of a few of the ideas of Shewhart and Deming, and really nothing about how they can be practically implemented for data. Writing in 1986, Deming lamented that it would take at least another 50 years before we would fully comprehend Shewhart's contribution and benefit from it. Perhaps it will take longer.


  • Malcolm Chisholm

    Malcolm Chisholm, Ph.D., has more than 25 years of experience in enterprise information management and data management and has worked in a wide range of sectors. He specializes in setting up and developing enterprise information management units, master data management, and business rules. His experience includes the financial, manufacturing, government, and pharmaceutical industries. He is the author of the books How to Build a Business Rules Engine; Managing Reference Data in Enterprise Databases; and Definition in Information Management. Malcolm writes numerous articles and is a frequent presenter at industry events. He is the winner of the 2011 DAMA International Professional Achievement Award.

    He can be contacted at
    Twitter: MDChisholm
    LinkedIn: Malcolm Chisholm

