From Business Intelligence to Enterprise IT Architecture, Part 3

Originally published 8 June 2010

The previous article in this series introduced the Business Integrated Insight (BI2) architecture as an evolution of data warehousing / business intelligence. The scope of BI2 is far broader than these previous approaches because of the growing need for a holistic and integrated view of all the information, processes and people’s activities that comprise a modern business. This article drills into the information layer, the Business Information Resource (BIR), to understand its drivers and to describe a key part of its structure.

Information Topography and the Information Space

Figure 1, reproduced from my previous article, shows the layered structure of BI2 and the placement of the Business Information Resource within the overall architecture. As shown by its position as the foundational layer, information is the basis for addressing the often competing needs of the business for integrity and flexibility, completeness and timeliness. And, as previously shown, these needs extend to all information accessed and used by a business – from operational transactions to data marts, and from contractual documents to images on or from the Web. With such a diverse set of information types, we need some form of classification to discuss their characteristics and to understand the opportunities and limitations of working with them.



Figure 1: The Three Layers of the Business Integrated Insight (BI2) Architecture

Information topography provides this structure. Like its physical counterpart, information topography is all about understanding the “lay of the land.” The objective is to create a map of the territory to enable navigation, decisions about use and so on. In physical topography, some map features are strictly measurable, such as map coordinates or the height of a contour above sea level, while others rest on human judgment: for example, is this land wet enough to be a marsh? Is this a hill or a mountain? As we look at information topography, we also use a combination of measurable and more subjective criteria to classify the information. But while physical topography is a well-established discipline with well-known criteria, information topography is much more recent and far less well defined.

When discussing information types, a binary classification approach has generally been used in the past. Information is classified as operational or informational, hard or soft, real-time or historical, personal or public, and so on. While this is a convenient shorthand, it cannot easily be used to decide how to treat a particular set of data; there are simply too many binary options. As a more usable approach, and in the absence of any established methodology, I have defined a three-dimensional information space, the three axes of which classify important characteristics of information. These three axes are shown in Figure 2 and are, of course, the source of the cubic grid representing the BIR in the overall architecture.



Figure 2: The Three Axes of the BIR Information Space

Before describing these axes in detail, there are two points of note. First, the axes represent, to some extent, combinations of related characteristics. I have done this to reduce the number of axes; I cannot draw a space of more than three dimensions, and only mathematicians can conceive of n-dimensional spaces (where n > 3) anyway. As a result, one can envisage other axes describing the same space, and there are potentially additional axes describing characteristics that may be of interest. Consequently, this representation may evolve. Second, the axes are continua, and the classifications along them generally merge from one to the next: they are subjective to a large extent. While this may be deemed unscientific, my experience suggests that this type of loose definition is the most that can be achieved in the real business world, and that it provides a sufficient level of definition to architect the virtual and distributed information store that is the BIR.

The independent axes of the three-dimensional space are:
  1. Timeliness / Consistency: Describes how information moves from creation for a specific purpose to broader, consistent usage and on to archival or deletion.
  2. Structure / Knowledge Density: Describes the journey of information from soft to hard and the related concept of the amount of knowledge embedded in it.
  3. Reliance / Usage: Describes the path of information from personal to enterprise usage and beyond and the implications for how far it can be relied upon.
The remainder of this article deals with the timeliness / consistency axis and the technological implications of this thinking. The other two axes will be considered in Part 4.
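By way of illustration, the three-axis classification can be sketched as a simple data structure. The sketch below, in Python, is purely illustrative: the axis names follow this article, but the specific class values (for example, DEPARTMENTAL and PUBLIC on the reliance / usage axis) and the InformationAsset record are my assumptions for the example, not part of any formal BI2 specification.

```python
from dataclasses import dataclass
from enum import Enum

# The three independent axes of the BIR information space. Values are
# illustrative waypoints along each continuum, not hard categories.
class TimelinessConsistency(Enum):
    IN_FLIGHT = 1
    LIVE = 2
    STABLE = 3
    RECONCILED = 4
    HISTORICAL = 5

class KnowledgeDensity(Enum):
    SOFT = 1      # documents, messages, web content
    ATOMIC = 2    # hard, detailed data
    DERIVED = 3   # hard data summarized or combined from atomic data

class RelianceUsage(Enum):
    PERSONAL = 1
    DEPARTMENTAL = 2   # assumed intermediate stop (personal to enterprise)
    ENTERPRISE = 3
    PUBLIC = 4         # "and beyond"

@dataclass
class InformationAsset:
    """A piece of business information, positioned along the three axes."""
    name: str
    tc: TimelinessConsistency
    kd: KnowledgeDensity
    ru: RelianceUsage

# Example: a reconciled, atomic, enterprise-wide customer record.
customer_master = InformationAsset(
    "customer_master",
    TimelinessConsistency.RECONCILED,
    KnowledgeDensity.ATOMIC,
    RelianceUsage.ENTERPRISE,
)
```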

The TC Axis: Timeliness / Consistency

Data warehousing has long described the concept of data moving from operational to informational or historical. This is the basic idea of timeliness. Operational data exists at a moment in time and is correct at that moment; it is not safe to assume that it is correct at some later time. Informational data, on the other hand, is deemed to be correct over a period of time – it has historical validity. It records valid states of the business over time. In order to do this, the consistency of data as it is combined from disparate sources has to be considered: data has to be reconciled as it moves from the operational to the informational class. This thinking leads directly to the TC axis and the classes depicted on it. And while the above is described specifically in terms of hard information, soft information can be viewed in the same manner.
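The distinction between point-in-time correctness and historical validity can be made concrete with period-stamped records. The following is a minimal sketch, assuming a simple valid_from / valid_to convention for recording the valid states of a unit price over time; the convention and names are illustrative, not prescribed by the architecture.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class PriceVersion:
    """An informational record: a unit price valid over a period."""
    unit_price: float
    valid_from: date   # inclusive
    valid_to: date     # exclusive; date.max marks the current version

# Operational data is only guaranteed correct "now"; informational data
# records every valid state of the business over time.
history = [
    PriceVersion(9.99, date(2009, 1, 1), date(2010, 3, 1)),
    PriceVersion(10.49, date(2010, 3, 1), date.max),
]

def price_as_of(history, when):
    """Answer 'what was the price on this date?' from the history."""
    for version in history:
        if version.valid_from <= when < version.valid_to:
            return version.unit_price
    raise LookupError(f"no price recorded for {when}")

assert price_as_of(history, date(2009, 6, 15)) == 9.99   # historical state
assert price_as_of(history, date(2010, 6, 15)) == 10.49  # current state
```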

Although, in principle, timeliness and consistency are unconnected, in practice in IT they are inversely related. Because logically related data is often created and manipulated in disconnected or dispersed processes, it can be very difficult and/or expensive to ensure that a newly created or changed data item is instantly consistent with related data items. A new invoice, for example, may be created with a local copy of a unit price at the precise moment when the master price file is being updated. Until reconciled, these data items are inconsistent, and if used together in an operational BI application, for example, will create errors or confusion. If we wish to use them together for more tactical or strategic decision making, we must wait until they have been reconciled.
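The invoice example can be expressed as a simple reconciliation check: the invoice carries a local copy of the unit price, captured at creation time, which may disagree with the master price file until the two are reconciled. The sketch below is an illustrative assumption about what such a check might look like, not a description of any real reconciliation engine.

```python
from dataclasses import dataclass

@dataclass
class Invoice:
    invoice_id: str
    product_id: str
    unit_price: float  # local copy, captured when the invoice was created

def find_unreconciled(invoices, master_prices):
    """Flag invoices whose captured price disagrees with the master file.

    Until such discrepancies are reconciled, combining these data items
    in, say, an operational BI query would produce errors or confusion.
    """
    return [inv for inv in invoices
            if master_prices.get(inv.product_id) != inv.unit_price]

master_prices = {"P-100": 10.49}               # master file, just updated
invoices = [Invoice("INV-1", "P-100", 9.99)]   # captured the old price
print(find_unreconciled(invoices, master_prices))  # flags INV-1
```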

These considerations lead to the five broad classes on the TC axis. In-flight data consists of messages on the wire or the enterprise service bus, valid only at the instant they pass by. Traditionally, such data is recorded in a database, where we term it live. (In high-volume, high-speed event-processing scenarios, in-flight data may not be stored at the atomic level.) Live data has a limited period of validity and may be inconsistent, as mentioned above. Stable and reconciled data, both valid over the medium term, are the next classes on the TC axis; in addition to being stable, reconciled data is also internally consistent in meaning and timing. Historical data extends the period of validity to, in principle, forever, and is wholly consistent.

Technology Implications of the TC Axis

In this section, we focus exclusively on hard information (the atomic and derived classes on the knowledge density axis in Figure 2) because of its familiarity to readers of BeyeNETWORK and its current high importance to decision making. Similar considerations apply to soft information, although with less urgency.

The TC axis is an enhancement of the view that led to the physical layering of the data warehouse architecture into operational systems, enterprise data warehouse and data marts. This physical separation of layers leads to a dubious conclusion: that the boundaries between these different types of information can be cleanly drawn. BI2, however, defines the TC and other axes as logical, and the classes on the axes are deliberately fluid. This reflects the real world. It can be very difficult to decide whether a particular data item, at a certain moment in its life cycle, is, for example, live or stable; each class boundary poses a similar problem. More interestingly, a particular data item can logically change from one class to another (usually the adjacent one) without any physical movement or copying. The physical layering of data in the traditional warehouse architecture along these logical and porous boundaries is now seen to be somewhat arbitrary, and strict adherence to it can lead to unnecessary physical data copying and significant problems with timeliness, consistency and the management of multiple data copies.
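One way to see how a data item can change class without moving: if its TC class is a pure function of the item's metadata – its age and reconciliation status, say – rather than of the store it happens to live in, then the item drifts along the axis simply as time passes. The thresholds below are illustrative assumptions (real boundaries would be business-specific and, as noted, deliberately fluid), and in-flight data is omitted since it may never be stored at all.

```python
from datetime import datetime, timedelta

# Illustrative, business-specific cut-offs; the class boundaries on the
# TC axis are deliberately fluid, so these values are assumptions only.
LIVE_HORIZON = timedelta(hours=24)
STABLE_HORIZON = timedelta(days=90)

def tc_class(created, reconciled, now):
    """Derive a data item's TC class from metadata, not from its location.

    Because the class is computed, an item moves from live toward
    historical as it ages – no copying between physical stores needed.
    """
    age = now - created
    if age < LIVE_HORIZON:
        return "live"
    if age < STABLE_HORIZON:
        return "reconciled" if reconciled else "stable"
    return "historical"

created = datetime(2010, 1, 5)
print(tc_class(created, True, now=datetime(2010, 1, 5, 12)))  # live
print(tc_class(created, True, now=datetime(2010, 6, 1)))      # historical
```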

So the question arises: Why were these physical layers instantiated in the first place? One answer is that technological limitations on database size, performance, optimization and so on meant that multiple data stores were required. The logical and loose boundaries between information classes became, almost by default, the physical margins of the data layers. Given recent advances in database technology and hardware performance/cost ratios, the opportunity now exists to revisit those boundary decisions and database choices. For much of the last decade and more, the dominant technology along the length of the TC axis has been the traditional, row-oriented, general-purpose relational database. For larger volumes of data and faster performance, most of these databases have taken advantage of some form of parallel processing.

The past few years have seen the emergence of column-based databases such as Vertica and ParAccel (not to forget the veteran Sybase IQ), which offer very significant improvements in query performance. The vendors have focused on data mart offerings in the stable and historical information classes for a combination of technical and marketing reasons, but there is every reason to expect that columnar databases could mature sufficiently in areas such as workload management to make inroads into the reconciled data area. These developments could allow a substantial reduction in the complexity and data duplication of the enterprise data warehouse (EDW) / data mart area.
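The source of the columnar speed-up can be shown in miniature: an analytic query that aggregates a single column touches only that column's values in a column store, whereas a row store must read every row in full. The sketch below is purely conceptual; real engines such as Vertica, ParAccel and Sybase IQ add compression, sophisticated execution and much more.

```python
# The same small sales table in the two physical layouts.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 75.5},
    {"order_id": 3, "region": "EU", "amount": 200.0},
]
columns = {
    "order_id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "amount": [120.0, 75.5, 200.0],
}

# Row store: the aggregate must visit every row in full.
total_from_rows = sum(r["amount"] for r in rows)

# Column store: the same aggregate scans one contiguous column only –
# far less I/O per query, which is where the performance gain comes from.
total_from_columns = sum(columns["amount"])

assert total_from_rows == total_from_columns == 395.5
```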

But what of the long-standing physical boundary between in-flight / live (operational) information and the classes to the right? The recent emergence of hybrid databases – a combination of row- and column-based approaches – from Oracle, Vertica and others suggests that even this boundary may be breached. (Of course, fundamental issues around data cleanliness and completeness in operational data must also be addressed.) Such databases could potentially serve all classes of hard information equally for most small and medium-sized businesses. With further development of parallel processing and in-memory database support, they could possibly support all but the largest data sizes. Of course, at such an early stage of its development, much will need to be done to prove and harden this technology to enable businesses to invest with confidence in it; it is a serious change with widespread possible implications, both negative and positive. The risks of such a deep-seated change in technology will be weighed carefully against the enormous potential benefits of significant reductions in the number of copies of data, improved quality, reduced management costs and increased timeliness and availability of information.
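Conceptually, such a hybrid engine might keep recent live data in row form for fast inserts and updates, while older stable, reconciled and historical data sits in column form for fast scans, with both presented as one logical table. The following is a deliberately naive sketch of that routing idea under my own assumptions; it does not describe any vendor's actual implementation.

```python
class HybridTable:
    """One logical table, two physical layouts behind it (naive sketch)."""

    def __init__(self):
        self.hot_rows = []                   # row-oriented: cheap writes
        self.cold_columns = {"amount": []}   # column-oriented: cheap scans

    def insert(self, record):
        """Operational writes land in the row-oriented area."""
        self.hot_rows.append(record)

    def age_out(self):
        """Periodically migrate aged rows into the columnar area."""
        for record in self.hot_rows:
            self.cold_columns["amount"].append(record["amount"])
        self.hot_rows.clear()

    def total_amount(self):
        """An analytic query spans both layouts transparently."""
        return (sum(r["amount"] for r in self.hot_rows)
                + sum(self.cold_columns["amount"]))

table = HybridTable()
table.insert({"order_id": 1, "amount": 50.0})
table.age_out()                        # order 1 is now columnar
table.insert({"order_id": 2, "amount": 25.0})
print(table.total_amount())            # 75.0, across both areas
```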

Conclusion

Our first dive into the structure and components of the Business Information Resource of BI2 has revealed some interesting possibilities to radically change the way information is structured and stored across the breadth of the business. Taking a more holistic view of information, together with recent advances in technology, we can see exciting near-term possibilities to simplify and streamline information for decision making. The next article in the series will look at the knowledge density axis of the BIR and tackle the thorny issue of data vs. content.

  • Barry Devlin
    Dr. Barry Devlin is among the foremost authorities in the world on business insight and data warehousing. He was responsible for the definition of IBM's data warehouse architecture in the mid '80s and authored the first paper on the topic in the IBM Systems Journal in 1988. He is a widely respected consultant and lecturer on this and related topics, and author of the comprehensive book Data Warehouse: From Architecture to Implementation.

    Barry's interest today covers the wider field of a fully integrated business, covering informational, operational and collaborative environments and, in particular, how to present the end user with an holistic experience of the business through IT. These aims, and a growing conviction that the original data warehouse architecture struggles to meet modern business needs for near real-time business intelligence (BI) and support for big data, drove Barry’s latest book, Business unIntelligence: Insight and Innovation Beyond Analytics, now available in print and eBook editions.

    Barry has worked in the IT industry for more than 30 years, mainly as a Distinguished Engineer for IBM in Dublin, Ireland. He is now founder and principal of 9sight Consulting, specializing in the human, organizational and IT implications and design of deep business insight solutions.

