It's the Data, Stupid!

Originally published 17 November 2009

BeyeNETWORK welcomes Bart Baesens as a new member of the team. He will write about all aspects of analytics and data mining. This first article will focus on the all-important aspect of data quality including a discussion of data accuracy, data completeness, data recency, data bias and data definition.

In today's business context, corporate decision making, operations and planning rely more and more on the input of high quality data. This is further amplified by recent trends such as regulatory compliance, the need for intelligent decision models and the discovery of interesting business patterns in data. Consider, for example, the recent introduction of the Basel II and Solvency II regulatory standards in the financial industry. These guidelines force banks to inventory their corporate data, preprocess it, and prepare it for input into business intelligence solutions and software that are used to predict future measures of interest (e.g., default, loss or exposure risk). More than ever, these predictions serve as inputs for determining and outlining the future strategy and corporate decisions of the firm.

The well-known GIGO (garbage in, garbage out) principle teaches us that bad data yields bad prediction models, which can then be the cause of bad strategic decisions and, in the end, corporate failure! Consider the example of what happened recently in the credit markets. The rating agencies (Moody's, S&P) had insufficient (bad?) data to rate exposures resulting from (black-box) securitization structures, resulting in well-known problems. Furthermore, past research has clearly shown that the best way to augment the performance of any business intelligence (BI) prediction model is not by exploring new, exotic data mining algorithms but by improving data quality.

Given the crucial importance of high quality data, let's take a further look into the different aspects of data quality. In this article, I will discuss data accuracy, data completeness, data recency, data bias and data definition.

Data accuracy refers to the extent to which the data measures, in a consistent and correct way, what it is supposed to measure. Bad data accuracy can be due to entry errors, measurement mistakes or the presence of extreme values (outliers). Data accuracy can be enforced through an adequate design of data entry processes, validation constraints, integrity rules and/or quality control. In the short term, bad data accuracy has to be dealt with using appropriate outlier handling schemes, such as Winsorisation. The latter scheme decreases the influence of outliers by capping extreme values at chosen percentiles, pulling them in toward the centre of the distribution.
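As a rough illustration, Winsorisation can be sketched in a few lines of Python. The `winsorize` helper and the 5th/95th-percentile cut-offs below are hypothetical choices of mine, not a reference to any particular BI package:

```python
def winsorize(values, lower_pct=0.05, upper_pct=0.95):
    """Cap extreme values at the given lower/upper percentiles."""
    s = sorted(values)
    n = len(s)
    lo = s[int(lower_pct * (n - 1))]  # value at the lower cut-off
    hi = s[int(upper_pct * (n - 1))]  # value at the upper cut-off
    return [min(max(v, lo), hi) for v in values]

# The outlier 100 is pulled back to the 95th-percentile value 4:
print(winsorize([1, 2, 3, 4, 100]))  # [1, 2, 3, 4, 4]
```

A production implementation would use proper percentile interpolation rather than this crude index lookup, but the idea is the same: extreme observations are replaced, not removed, so the sample size is preserved.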

Data completeness refers to the degree to which data fields contain missing information. Much BI software will simply impute missing information using standard imputation routines (e.g., replacement by the mean, median or mode). However, it is far preferable to avoid missing information from the outset as much as possible, again by carefully designing data entry processes. It is also worth mentioning that the very fact that a data item is missing (such as income) may itself carry important information with respect to the target concept (such as default risk).
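Both points can be combined in one simple sketch: impute the mean, but keep a 0/1 indicator of where the value was missing, so that the "missingness" itself remains available as a predictor. The helper name below is my own invention, not a standard routine:

```python
from statistics import mean

def impute_with_indicator(values):
    """Fill missing entries (None) with the mean of the observed values,
    and return a 0/1 indicator recording where data was missing."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    filled = [fill if v is None else v for v in values]
    missing = [1 if v is None else 0 for v in values]
    return filled, missing

incomes, was_missing = impute_with_indicator([10.0, None, 30.0])
print(incomes)      # [10.0, 20.0, 30.0]
print(was_missing)  # [0, 1, 0]
```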

Data recency measures how recent the collected data is. It seems obvious that BI systems should be based on recent data; in practice, however, it is not uncommon to find BI models that work on completely outdated data. With recent data, it is far easier to detect important trends and exploit them to create competitive advantages. Extending BI applications with incremental learning facilities, capable of recognizing newly observed patterns in recent data, is a key challenge here.
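Incremental learning can range from a full online classifier down to something as simple as a statistic that is updated record by record instead of being re-estimated from a stale snapshot. As a toy sketch of that idea (the `RunningMean` class is my own illustration, not part of any BI product):

```python
class RunningMean:
    """Maintain a mean one observation at a time, so the statistic
    always reflects the latest data without a full recompute."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n  # standard online update
        return self.mean

m = RunningMean()
for amount in [100.0, 200.0, 300.0]:  # hypothetical transaction amounts
    m.update(amount)
print(m.mean)  # 200.0
```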

Although never deliberately introduced, intrinsically biased data can be found in many databases. Consider the example of a credit approval process in a financial institution. Typically, one only stores data about accepted loan applications; no data is stored about the rejected ones. Clearly, a bias is introduced when analyzing the data of accepted loan applications using BI software and then generalizing those findings to the whole population that comes through the door of the financial institution. Furthermore, data bias might be introduced when a company changes its strategy or undergoes a merger and acquisition (endogenous effects) or when the environment/macro-economy in which it operates changes substantially (exogenous effects). Although data bias is typically hard to avoid, it is important that BI modelers are at least aware of its presence and impact!

Finally, let's consider data definition. Providing unambiguous, consistent, enterprise-wide data definitions is often a key challenge. Mergers and acquisitions (M&A) and the corresponding data consolidation often lead to inconsistent data definitions, resulting in bad data quality. Consider the data item loan-to-value, which measures the outstanding value/exposure of a loan against the value of the collateral, and which is a very important predictor for loss risk. This data item is typically defined in various ways, depending on how the collateral is valued (e.g., book versus market value). Hence, it is of crucial importance to agree upon a common set of data definitions and vocabulary, one that is interpreted and applied in a consistent way throughout the entire enterprise.
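To make the definitional ambiguity concrete, here is a small sketch with invented figures: the very same loan yields two different loan-to-value numbers depending on whether the collateral is valued at book or at market value.

```python
def loan_to_value(exposure, collateral_value):
    """LTV = outstanding exposure / value of the collateral."""
    return exposure / collateral_value

exposure = 80_000.0       # outstanding loan exposure (hypothetical)
book_value = 100_000.0    # collateral at book value (hypothetical)
market_value = 125_000.0  # collateral at market value (hypothetical)

print(loan_to_value(exposure, book_value))    # 0.8
print(loan_to_value(exposure, market_value))  # 0.64
```

Two departments reporting 80% and 64% for the same loan are both "correct" under their own convention, which is exactly why the convention has to be agreed upon enterprise-wide.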

Improving data quality is a crucial challenge for all the different types of data within a firm: quantitative data (e.g., GDP of a country), qualitative data (e.g., management quality of a firm), external data (e.g., Moody's credit rating), etc. Its importance cannot be stressed enough, given that the BI models built on these data sources are, now more than ever, steering the strategic decisions of the firm. As noted above, the best way to improve the performance of a BI system is not to look for fancy, new, exciting algorithms and/or tools but to improve data quality. This will allow the creation of better BI models and, as such, better strategic decisions. Developing data quality scorecards that cover the criteria discussed above is one of the key challenges lying ahead in order to create competitive and strategic advantages in the future.
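Such a scorecard could start as simply as one score per criterion per field. The sketch below scores only completeness (the share of non-missing entries) and is purely illustrative; every name in it is an assumption of mine, and real scorecards would add analogous scores for accuracy, recency and the other criteria:

```python
def completeness(values):
    """Share of non-missing (non-None) entries in a column."""
    return sum(v is not None for v in values) / len(values)

def quality_scorecard(columns):
    """One completeness score per column; accuracy, recency and the
    other criteria would be scored analogously and reported alongside."""
    return {name: round(completeness(vals), 2)
            for name, vals in columns.items()}

print(quality_scorecard({
    "income": [1200, None, 3400, 2100],  # hypothetical data
    "age": [34, 45, 29, 51],
}))  # {'income': 0.75, 'age': 1.0}
```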

  • Bart Baesens

    Dr. Bart Baesens is an assistant professor at the Faculty of Applied Economic Sciences at the K.U.Leuven (Belgium) and the School of Management of the University of Southampton (United Kingdom). He has done extensive research on predictive analytics, data mining, customer relationship management, fraud detection and credit risk management. His findings have been published in well-known international journals (e.g., Machine Learning, Management Science, IEEE Transactions on Neural Networks, IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Evolutionary Computation, Journal of Machine Learning Research) and presented at international conferences. He is also co-author of the book "Credit Risk Management: Basic Concepts," published in 2008. He regularly tutors, advises and provides consulting support to international firms with respect to data mining, predictive analytics and credit risk management policy.
