Judgment Day on Enterprise Data, Part 2

Originally published 6 December 2006

Requirements for Creating Trusted Data
What are the requirements and the technology components needed to create trusted data?

Enterprise compliance is based on understanding processes and policies and then monitoring those processes and reporting on business activity. All of this requires trusted data both in operational and analytical systems.

Fundamental to creating trusted data is to establish an enterprise data quality “firewall” where a shared data quality “service” (with common shared rules) is used by applications and infrastructure to prevent bad data entering and spreading across the enterprise. To establish this firewall, an organisation must identify existing data quality issues. This means being able to assess data quality in any file, database or message to identify incorrect and incomplete data.

Once this has been done, it is then possible to set up shared business rules to clean up anomalies and maintain high quality data. Data content cleanup and completion also needs to be done both in batch processing and in real time so that all bases are covered. Note that by real time we mean either on-demand (e.g., at the request of an application) or event-driven (e.g., on arrival of an inbound message from a customer over EDI or via the Internet). Having provided this capability, we must then be able to integrate shared enterprise data quality infrastructure technology with every application, every process and all types of business integration software that manage the flow of data into and out of the enterprise and between internal systems.
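As a simple illustration of this point (not part of the original architecture), the Python sketch below shows how one shared set of cleansing rules might be exposed both on-demand for a single record and in batch for a whole extract. All function and field names here are illustrative assumptions, not a specific product's API.

```python
# Minimal sketch: one shared rule set used both on-demand (per record,
# in real time) and in batch (whole files). Names such as
# clean_customer_record and the field names are illustrative assumptions.

def clean_customer_record(record: dict) -> dict:
    """Apply shared cleansing rules to a single record."""
    cleaned = dict(record)
    # Standardise country codes to two-letter uppercase values.
    cleaned["country"] = cleaned.get("country", "").strip().upper()[:2]
    # Repair missing postal codes with an explicit marker rather than
    # silently passing empty data downstream.
    if not cleaned.get("postcode"):
        cleaned["postcode"] = "UNKNOWN"
    return cleaned

def clean_on_demand(record: dict) -> dict:
    """Real-time entry point, e.g. called by an application at data entry."""
    return clean_customer_record(record)

def clean_batch(records: list) -> list:
    """Batch entry point, e.g. applied to a whole extract file overnight."""
    return [clean_customer_record(r) for r in records]

if __name__ == "__main__":
    print(clean_on_demand({"name": "Acme Ltd", "country": " gb ", "postcode": ""}))
```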

Figure 1 shows the points where data quality needs to be controlled.

Figure 1

Data quality is, however, not just about cleaning and repairing data content or preventing bad data from entering the enterprise. Trust in data is also about users knowing what data means so that they have the confidence to use it. If the content is correct but the meaning is unclear, then a user may still not trust the data. As a result, they may revert to using their own personal databases and spreadsheets, reinventing “their version” of data that they believe to be correct. Once this happens, we spiral out of control again as people cut and paste “their data” and pass it to others in e-mail attachments and spreadsheets. For example, what is the difference between three metrics used in three different systems when they are called Revenue, Total Revenue and Total Sales respectively? Does the user know the difference? The point here is that when implementing enterprise data quality, the data names, data definitions and data integrity rules are just as important as cleaning and repairing the data content itself.

Enterprise data quality is, therefore, really about two things:

  • Managing data quality throughout the enterprise to validate, clean and complete data content. 

  • Managing metadata quality so that data carries consistent and unambiguous meaning with it everywhere it goes.

Metadata quality is about making sure that the enterprise uses a shared business vocabulary so that the meaning of that data is never in doubt. This shared business vocabulary should then be used to describe or “markup” data being presented to a user or flowing between systems. Having a shared business vocabulary means that each data item has an enterprise-wide common name, a common definition and common data integrity rules so as to guarantee understanding. It also means that we need to map disparate definitions in systems around the enterprise to these common data definitions so as to fully understand how and where data is used. This should also allow us to understand how data has been created and how to track that data back to a common place where a clear definition of it resides.
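To make the idea of a shared business vocabulary and vocabulary mapping concrete, the hedged Python sketch below shows one possible (and deliberately simplified) representation: a common term with a definition and an integrity rule, plus a mapping from application-specific names such as Revenue, Total Revenue and Total Sales to the common name. The structures and names are assumptions for illustration only.

```python
# Illustrative sketch only: a shared business vocabulary entry and a mapping
# from application-specific names to the common name. This is not any
# particular product's metadata model.

SHARED_VOCABULARY = {
    "NetRevenue": {
        "definition": "Invoiced sales net of returns and discounts, in EUR",
        "integrity_rule": lambda value: isinstance(value, (int, float)) and value >= 0,
    }
}

# Disparate names used by three different systems, mapped to the common term.
APPLICATION_TO_COMMON = {
    ("billing_system", "Revenue"): "NetRevenue",
    ("finance_mart", "Total Revenue"): "NetRevenue",
    ("sales_app", "Total Sales"): "NetRevenue",
}

def to_common_name(system: str, local_name: str) -> str:
    """Translate a local data item name to its shared vocabulary name."""
    return APPLICATION_TO_COMMON.get((system, local_name), local_name)

print(to_common_name("sales_app", "Total Sales"))   # -> NetRevenue
```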

Taking Figure 1 and this discussion into account, the following set of technical requirements attempts to define what is needed to help achieve trusted data:

  1. Data quality assessment software is required to profile data in one or more data sources to determine its quality. Sources should include data in relational databases (e.g., IBM DB2, Informix, Oracle, Microsoft SQL Server), non-relational DBMSs, flat files, XML files or messages, and office documents such as spreadsheets.

  2. The data quality software should also allow business rules to be defined via a graphical user interface to manage data content cleanup. This includes correcting erroneous data, standardizing data to common data types, formats and lengths, repairing missing data, and enhancing data (e.g., adding spatial data to address information).

  3. Data quality business rules should be capable of being reused and nested inside other business rules to create more complex rules (a brief sketch illustrating this appears after this list).

  4. Data quality software should support data comparison precedence rules so that, when sources disagree, it is clear which source’s value takes precedence and the data can be corrected accordingly.

  5. Data quality business rules should be definable using shared business vocabulary data names, application-specific vocabularies, or a combination of both.

  6. The data quality software should have the ability to link business rules together in a sequence or workflow to manage the order in which data content is checked and cleaned.

  7. Business analysts should be able to define data rejection rules to denote data quality failure and to define automatic actions to take if such data is detected (e.g., send an incoming EDI or XML message back to its source with a rejection notification). This would prevent poor quality data from entering the enterprise.

  8. Data quality business rules should be held in a shared metadata repository in a relational database of choice.

  9. The data quality software should be capable of matching data records from multiple sources based on a subset of data items that meet match criteria. It should also be possible to define a minimum set of matching rules that determine a match. In particular for customer, supplier and partner information, it should be possible to do this on name and address information with support for different address formats in multiple countries.

  10. Enterprise data quality software should be deployable on a shared server and have an industry standard API (e.g., a Web services API) so it can be easily integrated with all applications and business processes in an industry standard way. If industry-standard interfaces are not available, then at least a published API should be available so that applications can call the data quality server to clean and repair data on demand.

  11. Operational applications should be able to invoke data quality software as an on-demand service during process execution in order to perform real-time checking of data entered at a keyboard. This is to prevent invalid or incomplete data from entering the enterprise. Data quality products should also offer pre-built integration with popular operational application packages to speed up integration.

  12. It should also be possible to invoke enterprise data quality in batch and on a pre-scheduled basis to carefully examine data in various files and databases.

  13. Enterprise data quality software should be able to seamlessly integrate with third party vendor data integration software products (preferably in an industry standard way) to enforce data quality as part of a data integration process (i.e., data quality as part of an ETL process when building a data warehouse, in data migration or master data management).

  14. Enterprise data quality software should be able to integrate with application integration software in a standard way so that it can be triggered by external inbound events or by internal events as part of a business process. By event, we mean message. This requirement means that when any data message (e.g., an external inbound EDI message, an external inbound XML message, or an XML message between internal applications) appears on the message bus, the data within it can be checked and repaired “in-flight” while the message is en route to its destination system(s) and/or device(s).

  15. In the case of data representing names, it should also be possible to validate that name against known lists of names provided by legislative and regulatory bodies and alert nominated users if a name on those lists appears in the data. Data quality software should also be able to identify hidden relationships between individuals, often referred to as social networks. These capabilities will help enterprises to comply with anti-terrorist, fraud and anti-money laundering legislation and regulations.

  16. A log of all data cleansing activity should be supported to indicate what data has been cleaned, rejected or enhanced and when. This log should be capable of being viewed on-line or printed in a report for auditing purposes.

  17. Data quality software should be capable of automatically importing metadata describing the data structure of data sources or messages. If possible, metadata import should be done in an industry standard way.

  18. It should be possible for data quality software to either:
  • Provide tooling to allow the creation of a shared business vocabulary consisting of common names, definitions and integrity rules as well as tooling to discover disparate definitions for data and for mapping disparate data definitions to a shared business vocabulary. Tooling should also be provided to export these mappings to other software to facilitate data vocabulary translation to the shared business vocabulary as data moves around the enterprise, or

  • Integrate with third-party software that can provide this functionality (e.g., IBM Rational Data Architect, Informatica, SypherLink, etc.) so that disparate data names can be translated into common ones as data moves between applications or between applications and presentation devices. An example here is a message broker’s XSLT translation of XML vocabularies during message routing between systems. Satisfying this requirement means that the definition of the data and the data content is standardised and that the data content is correct.
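As a hedged illustration of requirements 2, 3 and 7 above (reusable rules, nested rules and rejection rules with automatic actions), the Python sketch below shows one possible shape for such rules. The class and function names are assumptions for illustration, not taken from any particular data quality product.

```python
# Sketch of reusable rules, nested rules and a rejection rule with an
# automatic action. All names here are illustrative assumptions.

class Rule:
    """A reusable data quality rule that can contain other rules (nesting)."""
    def __init__(self, name, check, sub_rules=None):
        self.name = name
        self.check = check            # callable(record) -> bool
        self.sub_rules = sub_rules or []

    def passes(self, record) -> bool:
        if not self.check(record):
            return False
        return all(rule.passes(record) for rule in self.sub_rules)

# Simple reusable rules...
has_name    = Rule("has_name",    lambda r: bool(r.get("name", "").strip()))
has_country = Rule("has_country", lambda r: len(r.get("country", "")) == 2)

# ...nested inside a more complex composite rule.
valid_customer = Rule("valid_customer", lambda r: True,
                      sub_rules=[has_name, has_country])

def screen_inbound(record, reject_action):
    """Rejection rule: if the composite rule fails, trigger an automatic
    action (e.g. send the inbound message back with a rejection notice)."""
    if valid_customer.passes(record):
        return record
    reject_action(record)
    return None

screen_inbound({"name": "", "country": "GB"},
               reject_action=lambda r: print("rejected:", r))
```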

Technical Architecture and Components Needed to Establish Enterprise Data Quality
In addition to requirements for trusted data, it is also important for business and IT staff to know how technology components should fit together to enforce enterprise data quality. This is shown in the high level technical architecture in Figure 2.


Figure 2


Enterprise compliance demands that companies must monitor process activity and report on what they know about their business activity and performance. To ensure that enterprise compliance is using trusted data, it is necessary for data quality software to be continuously active in operational processes. The reason for this is to prevent bad data entering the enterprise and also to stop it from spreading between systems anywhere in the enterprise. This strategy should ensure that data used internally for compliance in process monitoring and in reporting is correct.

Figure 2 shows data quality software as a shared service connected to the enterprise via an enterprise service bus in a service oriented architecture (SOA). It includes tools to:

  • Define an enterprise shared business vocabulary for data including common names, common definitions and common data integrity rules.

  • Discover data in multiple disparate sources with disparate definitions for data items and to map these to a shared set of common definitions.

  • Assess data quality in a wide range of data sources to compare it with shared vocabulary data integrity rules.

  • Create data quality enforcement and integration rules to resolve discrepancies in the data.

The enterprise service bus is the messaging backbone that manages message queuing, message routing (routing messages to the right systems) and message translation. Message translation is the translation of application specific XML tags (that mark up data to describe what the data means) into shared (common) business vocabulary XML tags as the data moves between systems and devices. Also connected to the service bus are operational applications, data warehousing systems, business process management software and an enterprise portal. In addition, the data quality software is capable of running in batch mode to vet data moving between systems during batch processing.

To maintain enterprise data quality, it is necessary to carefully examine all data entering the enterprise through keyboards, electronic file transfers or electronic messages.

Enforcing Data Quality at the Keyboard
With regard to keyboard data entry, data could be entered via the portal, or directly via the application user interface in the case of applications with a non-Web-based user interface or Web-enabled applications not yet integrated with a portal. Increasingly, many internal applications are becoming Web enabled and being integrated into portal technology for personalised access from desktop and mobile Web devices.

In the case of data entry through the portal, it depends on whether the portal is connected to the enterprise service bus (ESB) or directly to the application. If connected to the ESB, the data entered at a Web device (PC or mobile device) travels through the ESB en route to the application(s) that need it. In this case, data can be checked and cleaned or rejected “in-flight” by the data quality software as the first step in the process before the application(s) due to receive it actually get it.

If the portal is directly connected to a back-end application, or an application is not yet Web enabled, then the data may be entered directly via the application user interface. In this case, the application itself can call the data quality software as an on-demand service in real time to check and clean or reject the data. Either way, data quality is maintained.
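A minimal sketch of such an on-demand call follows, assuming a hypothetical HTTP endpoint for the shared data quality service; the URL, payload shape and field names are assumptions for illustration, and a real deployment would use whatever Web service interface the chosen product publishes.

```python
# Sketch of an application calling a shared data quality service on demand
# as a value is keyed in. The endpoint and payload are hypothetical.

import json
import urllib.request

DQ_SERVICE_URL = "http://dq-server.example.com/clean"   # hypothetical endpoint

def validate_at_entry(field_name: str, value: str) -> dict:
    """Send a single keyed-in value to the data quality service and return
    the cleaned value (or a rejection) before the application stores it."""
    payload = json.dumps({"field": field_name, "value": value}).encode("utf-8")
    request = urllib.request.Request(
        DQ_SERVICE_URL, data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())

# Example: check a postcode typed by a call-centre user before it is saved.
# result = validate_at_entry("postcode", "SW1A  1AA")
```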

Enforcing Data Quality on Electronic Messaging
The architecture in Figure 2 also allows data quality software to be triggered in real time during business process execution via a Web service interface. This is because data quality business rules can be published as Web services and then invoked via the Web service interface as part of a business process. Event-driven business processes running under the management of the business process management software (process engine) can be triggered by the arrival of an external message such as an inbound order or payment, for example. In this case, the first step in the business process is to invoke the data quality service needed to check and clean or reject the data in the inbound electronic message. Once this has been done, the data is passed on to the application that processes it. Therefore, bad data is prevented from entering the enterprise. Using exactly the same mechanism, data quality software can prevent bad data from spreading by “screening” messages in real time as data messages flow between applications in an executing business process.
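The sketch below illustrates this “first step in the process” screening idea for an inbound XML order message. It is a simplified assumption of what such a step might do; the element names and the rejection behaviour are illustrative, not drawn from any specific process engine.

```python
# Sketch of "in-flight" screening: the first step of an event-driven process
# parses an inbound XML order message, applies simple rules and either
# passes it on or rejects it. Element names are illustrative assumptions.

import xml.etree.ElementTree as ET

def screen_order_message(xml_text: str):
    """Return the parsed order if it passes the data quality check,
    otherwise raise so the process can send a rejection notification."""
    order = ET.fromstring(xml_text)
    customer = order.findtext("customer", default="").strip()
    quantity = order.findtext("quantity", default="")
    if not customer or not quantity.isdigit() or int(quantity) <= 0:
        raise ValueError("order failed data quality screening")
    return order

inbound = "<order><customer>Acme Ltd</customer><quantity>5</quantity></order>"
clean_order = screen_order_message(inbound)   # passed on to the application
```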

Even inbound files can be validated this way. For example, a process could be triggered by the arrival of a file which would invoke the data quality service to validate and clean the data within that file and perhaps also check for fraud as the first step in the process.

In addition, because of the support for shared business vocabulary definition and the mapping of disparate data definitions to common ones, it becomes possible for data quality software to generate XSLT (Extensible Stylesheet Language Transformations) message translations for message brokers so that the broker can translate application-specific data names (coming out of applications) into common ones as message data moves between applications or between applications and presentation devices. (Note: XSLT provides developers and applications with the ability to transform XML into other formats such as other XML vocabularies, HTML and XHTML.)
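As a hedged sketch of this idea, the Python snippet below generates a trivial element-renaming XSLT stylesheet from a vocabulary mapping, which a message broker could then apply in flight. The mapping and tag names are assumptions for illustration; a real product would generate far richer transformations.

```python
# Sketch: generate a simple element-renaming XSLT stylesheet from vocabulary
# mappings so a broker can translate application-specific tags into common
# ones. The mapping and tag names are illustrative assumptions.

RENAMES = {"TotalSales": "NetRevenue", "Cust": "Customer"}

def build_rename_stylesheet(renames: dict) -> str:
    templates = "".join(
        f'<xsl:template match="{old}">'
        f'<{new}><xsl:apply-templates select="@*|node()"/></{new}>'
        f'</xsl:template>'
        for old, new in renames.items())
    return (
        '<xsl:stylesheet version="1.0" '
        'xmlns:xsl="http://www.w3.org/1999/XSL/Transform">'
        # Identity template copies everything not explicitly renamed.
        '<xsl:template match="@*|node()">'
        '<xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>'
        '</xsl:template>'
        f'{templates}'
        '</xsl:stylesheet>')

print(build_rename_stylesheet(RENAMES))
```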

Enforcing Data Quality in Batch
In addition to real-time screening of keyboard data, message data and file data, it is also possible with this architecture to assess data quality in data sources, define rules to improve quality and run this cleanup in batch. The high quality data output from this batch process could then be loaded into a database or processed in other ways.
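A simplified sketch of such a pre-scheduled batch run follows, including the cleansing log called for in requirement 16. The file names, field layout and rules are assumptions for illustration only.

```python
# Sketch of a batch run: read a flat file, apply simple shared rules, write
# cleaned and rejected records separately, and log what was done so the
# activity can be audited. File names and fields are illustrative.

import csv
import logging

logging.basicConfig(filename="dq_batch.log", level=logging.INFO)

def run_batch(source_path: str, clean_path: str, reject_path: str) -> None:
    with open(source_path, newline="") as src, \
         open(clean_path, "w", newline="") as ok, \
         open(reject_path, "w", newline="") as bad:
        reader = csv.DictReader(src)
        clean_writer = csv.DictWriter(ok, fieldnames=reader.fieldnames)
        reject_writer = csv.DictWriter(bad, fieldnames=reader.fieldnames)
        clean_writer.writeheader()
        reject_writer.writeheader()
        for row in reader:
            if row.get("customer_name", "").strip():
                row["country"] = row.get("country", "").strip().upper()
                clean_writer.writerow(row)
                logging.info("cleaned record %s", row.get("id"))
            else:
                reject_writer.writerow(row)
                logging.info("rejected record %s", row.get("id"))

# run_batch("customers.csv", "customers_clean.csv", "customers_rejected.csv")
```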

Enforcing Data Quality in Data Warehousing and BAM
One way to use data quality in batch or real time is to integrate the enterprise data quality service into data integration tools (a.k.a., extract, transform and load – ETL – tools) to build business intelligence systems. This is often called data warehousing. The enterprise data quality Web service interface shown in Figure 2 allows seamless integration and execution within a batch or event-driven data integration process. This guarantees that data from multiple data sources is cleaned, enhanced, matched, etc., en route to a data warehouse so that compliance and business performance reporting takes place on trusted data.
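The following minimal sketch shows the shape of such integration: a data quality step embedded between extract and load so that only cleaned, validated records reach the warehouse. The extract and load functions are placeholders, not a real ETL tool's API.

```python
# Illustrative sketch only: a data quality step embedded in a simple ETL
# flow so records are cleaned and validated before loading.

def extract():
    """Placeholder extract step: yield source records."""
    yield {"customer": " acme ltd ", "revenue": "1200"}

def transform(record: dict) -> dict:
    """Data quality step: standardise and validate before loading."""
    record["customer"] = record["customer"].strip().title()
    record["revenue"] = float(record["revenue"])
    if record["revenue"] < 0:
        raise ValueError("negative revenue rejected by data quality rule")
    return record

def load(record: dict) -> None:
    """Placeholder load step: in practice an insert into the warehouse."""
    print("loading", record)

for source_record in extract():
    load(transform(source_record))
```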

In addition, because many data integration tools can also operate in event-driven mode, they can be triggered to integrate data “on the fly” for automatic analysis during business activity monitoring (BAM). This is necessary in process monitoring for compliance for example. By calling data quality services during real-time event-driven data integration, companies can ensure that the data being automatically analysed in BAM processing is trusted data. Therefore, any automated actions taken on the back of analysis results to remain compliant can be taken in confidence.

Conclusions
Enterprise data quality is fundamental to compliance, performance management, investor relations and business confidence. Without the establishment of trusted data and an enterprise data quality firewall, companies will never escape the risk of inadvertently providing inaccurate information to regulatory and legislative bodies, with all the consequences that could follow. Reporting and acting on inaccurate information not only causes business problems, it could also prove very costly at some later point if a compliance audit identifies breaches of regulations.

Companies that are intent on raising the standard of their business practices by establishing compliant business processes cannot do this without trusted data. Therefore, investment in enterprise data quality software is fundamental to preventing bad data from entering the enterprise and spreading across systems.
