The Flaws of the Classic Data Warehouse Architecture, Part 2: The Introduction of the Data Delivery Platform

Originally published 1 April 2009

In Part 1 of this series, we described the classic data warehouse architecture (CDWA). This architecture is based on a set of data stores linked by a chain of copy scripts; see Figure 1. Examples of data stores are the central data warehouse, the operational data store, the data marts, and the multidimensional cubes. This architecture has served us well for the last twenty years. Based on the available technology plus the demands and requirements of the users, it was the right architecture. But, on the one hand, technology has evolved: new database technology is available, in-memory analytics has been introduced, and from the world of the Internet we received the mashup. And, on the other hand, the demands and requirements of the users are changing.

Figure 1: The Classic Data Warehouse Architecture 

In that same article, we summarized the flaws of the CDWA with respect to those new demands and requirements. In this article, we introduce a new architecture, one that we call the data delivery platform (DDP). 

Figure 2 shows a high-level overview of the DDP. The main difference between the DDP and the CDWA is that the central data warehouse forms the heart of the CDWA, whereas in the DDP, it is the software layer residing between the data stores and the reports that can be seen as the heart of the system. 

Figure 2: The High-Level Architecture of the Data Delivery Platform 

In the rest of this article, we will give a more detailed description of the DDP, and we will compare the CDWA with the DDP based on the flaws mentioned in Part 1. Note that in this article, we will refer to data stores as data providers and to reports, KPIs, scorecards, and so on as data consumers.

The essence of the DDP is to decouple the data consumers from the data providers. The data consumers request the information they need, and the DDP supplies that information by retrieving it from the data providers. The data consumers have no idea whether the data they request comes from the central data warehouse, a data mart, a cube, a production database, an external source, or maybe a combination of all of those. In fact, this is not important to them; the data providers are hidden from the consumers. They see one large database. For a data consumer, it is more important to know that the data supplied has the right quality level, is exactly what was requested, is sufficiently up to date, and is returned with the right performance than to know which specific data store it comes from.
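
To make this decoupling concrete, here is a minimal sketch in Python. It is purely illustrative; the class names, the "sales" dataset, and the in-memory provider are invented for the example and do not describe any particular product. The point is only that the consumer names a logical dataset and never learns which store answers the request.

```python
# Illustrative sketch only: a DDP object that hides the physical data stores
# behind one logical interface. All names here are hypothetical.

class DataDeliveryPlatform:
    def __init__(self):
        # Maps a logical dataset name to the provider that actually stores it.
        self._routes = {}

    def register(self, logical_name, provider):
        self._routes[logical_name] = provider

    def query(self, logical_name, **filters):
        # The consumer only names the logical dataset; it never learns whether
        # the rows come from a warehouse, a mart, a cube, or a production system.
        provider = self._routes[logical_name]
        return provider.fetch(logical_name, filters)


class InMemoryProvider:
    """Stand-in for a real data store (warehouse, mart, cube, ...)."""
    def __init__(self, tables):
        self._tables = tables

    def fetch(self, name, filters):
        rows = self._tables[name]
        return [r for r in rows if all(r.get(k) == v for k, v in filters.items())]


if __name__ == "__main__":
    warehouse = InMemoryProvider({"sales": [{"region": "North", "amount": 120},
                                            {"region": "South", "amount": 80}]})
    ddp = DataDeliveryPlatform()
    ddp.register("sales", warehouse)
    # The consumer sees "one large database"; the routing stays invisible to it.
    print(ddp.query("sales", region="North"))
```

Swapping the in-memory provider for a warehouse, a mart, or a cube would not change the consumer's call; that is the essence of the decoupling.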

It is important to understand that we are not proposing to phase out the central data warehouse itself. There are, and will always be, good reasons for copying data entered in production systems to a central data warehouse. These are the classic reasons for introducing a data warehouse in the first place. For example, most production databases do not contain the historical data that we need for trend analysis. Another reason is that running complex queries on the production databases severely slows the data entry process. A third reason is that we quite often have to clean and filter the production data before it becomes usable for consumption. And there are more reasons why we would still want to have a data warehouse. So again, the DDP most likely needs a data warehouse.

If we look at the flaws described in Part 1, how does the data delivery platform prevent them? 

The first flaw had to do with the number of data layers in the CDWA. One disadvantage of too many layers is that it makes development of operational BI applications extremely complex. All the copying between data stores slows the process. Once the DDP is in place and all data consumers extract data through it, we can start to simplify the underlying storage structure. For example, we can drop a data mart and redirect all the data consumers accessing that data mart to the central data warehouse. Or, we could remove a cube and redirect its queries to a data mart or the data warehouse. In both cases, we are removing data layers. In other words, we are simplifying the architecture. Fewer data layers means less copying, and that means we will be able to get the data more quickly from the point of entry to the data consumers. In fact, we could even consider (if the systems are powerful enough) letting some data consumers access the production databases through the DDP to get access to 100% up-to-date data.
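
As a sketch of what this means in practice (the dataset and store names below are made up), removing a data layer becomes a change to the DDP's routing rules rather than to any report:

```python
# Purely illustrative: with a DDP in place, dropping a data mart is a change
# inside the platform's routing, not in any data consumer.

routes_before = {
    "sales_by_region": "finance_data_mart",   # consumers currently read from a mart
    "customer_profile": "central_warehouse",
}

# The finance data mart is decommissioned; the DDP simply redirects the
# logical dataset to the central data warehouse. No report changes.
routes_after = dict(routes_before, sales_by_region="central_warehouse")

print(routes_after)
```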

The second flaw relates to the enormous amount of duplicate data that we store. The DDP can minimize duplicate data storage in two ways. First of all, by decoupling consumers from providers, it becomes easier to replace one database server product with another. If the two are coupled directly, the queries probably contain product-specific features that make it hard to port those reports to another database server. If the DDP does what it should do, it should be able to handle different dialects. This should allow us to replace a more classic database server product, one that requires a lot of duplicate data storage to perform adequately, with, for example, a data warehouse appliance that needs a minimal amount of duplicate data storage. Secondly, because each data consumer accesses the DDP and not one specific data store, there is less need to create duplicate data, although performance issues could still demand duplication of data. But, hopefully, the DDP minimizes the need for storing duplicate data in whatever form.
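
The following toy example hints at what dialect handling involves. Real data virtualization servers do far more (query pushdown, optimization, and so on); the function shown here is invented purely to illustrate the principle that the report's logical request can stay the same while the generated SQL changes per product.

```python
# Toy illustration of dialect handling: the same logical request rendered
# into two SQL dialects. The function and dialect labels are invented.

def render(table, columns, limit, dialect):
    cols = ", ".join(columns)
    if dialect == "tsql":          # SQL Server style: SELECT TOP n ...
        return f"SELECT TOP {limit} {cols} FROM {table}"
    if dialect == "postgres":      # PostgreSQL style: ... LIMIT n
        return f"SELECT {cols} FROM {table} LIMIT {limit}"
    raise ValueError(f"unknown dialect: {dialect}")

# Replacing the underlying database server only changes the dialect argument;
# the data consumer's logical request is untouched.
print(render("sales", ["region", "amount"], 10, "tsql"))
print(render("sales", ["region", "amount"], 10, "postgres"))
```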

The third flaw relates to analytics and reporting on external and unstructured data sources. The DDP should be intelligent enough to access external sources through, for example, SOAP-based services and mashup technology. The DDP should also be able to access document management systems, email systems, and other systems that contain unstructured data. The fact that those systems will not be accessible through SQL, MDX, or other common database languages should not be an issue for the data consumer. The DDP should be able to convert the language the data consumer uses into the language the unstructured data source supports. If this is possible, there is no need to copy the huge amounts of data from unstructured sources to the data warehouse, or to extract data from those sources in advance.
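
As an illustration only (the email archive and its interface are invented for this sketch), a thin wrapper can make a non-SQL source answer the same kind of tabular request the data consumers already use:

```python
# Illustrative sketch: a wrapper that lets the DDP expose a non-SQL source
# (here, a fake email archive) as if it were a table. The archive API shown
# is made up for the example.

class EmailArchiveProvider:
    """Pretend document source: no SQL, just a list of message dicts."""
    def __init__(self, messages):
        self._messages = messages

    def fetch(self, name, filters):
        # Translate the consumer's filter into a scan over the documents and
        # return the result as rows, as if it came from a table.
        sender = filters.get("sender")
        return [{"sender": m["from"], "subject": m["subject"]}
                for m in self._messages
                if sender is None or m["from"] == sender]


archive = EmailArchiveProvider([{"from": "ann@example.com", "subject": "Q1 forecast"},
                                {"from": "bob@example.com", "subject": "Churn report"}])
print(archive.fetch("emails", {"sender": "ann@example.com"}))
```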

Non-sharable specifications form the fourth flaw of the CDWA. Currently, each data consumer tool has its own set of data-related specifications, and there is no way to share those specifications. The DDP should be "smart" enough to hold specifications related to data structures, and all the data consumers should be able to use those specifications. Whether a report is developed in Excel, Business Objects, Cognos, or Spotfire, they should all see and use the same specifications. For example, whether or not the Northern European Region includes the United Kingdom should be known to the DDP, and all the data consumers should be able to exploit that specification. Or, if an optional one-to-many relationship exists between two tables (even if those two tables are stored in separate data stores), that fact should be known to the DDP. It is also the DDP that should be aware that different users might have different definitions for the same concept. Specifications dealing with security, such as who is allowed to see what, should also be maintained by the DDP. Of course, each data store also needs to register security specifications, but those only cover its own data elements. The DDP can hold the rules that span stores; for example, we may want to indicate that some users are not allowed to integrate data elements coming from two different data stores.

Again, the fact that the DDP holds all those specifications does not mean that the data consumer tools may not store any specifications of their own. It might well be that they need their own specifications just to function properly. What it does mean is that the DDP holds the source of all those specifications. We could state that the DDP gives access to, and is the guardian of, all the data and all the specifications related to data access.
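
A small sketch, with made-up content, of what centrally held specifications could look like: the definition of a region and a cross-store security rule are stored once, in the DDP, and every consumer tool resolves them there instead of keeping its own copy.

```python
# Illustrative only: shared specifications held in one place. The region
# definition, store names, and user group are invented for the example.

specifications = {
    "regions": {
        # Whether the UK belongs to Northern Europe is decided once, here.
        "Northern Europe": ["Denmark", "Norway", "Sweden", "United Kingdom"],
    },
    "security": {
        # Example cross-store rule: this user group may not combine data
        # from the HR store with data from the finance store.
        "no_combine": [("hr_store", "finance_store", ["analyst_group_b"])],
    },
}

def countries_in(region):
    return specifications["regions"][region]

# Excel, Business Objects, Cognos, and Spotfire would all ask the DDP and
# therefore all agree on the answer.
print(countries_in("Northern Europe"))
```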

The last flaw discussed in Part 1 dealt with the concept of information hiding. This whole architecture is based on information hiding. The data consumers have been decoupled from the data providers by the DDP; in fact, the DDP is the information hider. Many changes to the data stores have no impact on the consumers. Of course, such changes influence specifications stored within the DDP, but they will not be reflected in the reports. The ideal situation is that if a specific change to the data providers is not relevant to a data consumer, that data consumer is not affected at all.
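
The fragment below (with invented column names) illustrates the principle: when a physical column is renamed in an underlying store, only the DDP's mapping changes, and existing reports keep using the logical name.

```python
# Illustrative sketch of information hiding via a logical-to-physical mapping.
# All column names are hypothetical.

# Mapping from logical column names (what reports use) to physical ones.
column_map = {"customer_name": "cust_nm", "revenue": "rev_amt"}

def to_physical(logical_columns):
    return [column_map[c] for c in logical_columns]

# The warehouse team renames rev_amt to revenue_amount; the fix is one line
# in the DDP, and every existing report keeps asking for "revenue".
column_map["revenue"] = "revenue_amount"

print(to_physical(["customer_name", "revenue"]))
```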

Aside from the fact that the data delivery platform does not have the flaws the CDWA has, or at least not with the same intensity, the DDP has some extra benefits. For example, the decoupling of data consumers and data providers makes it easier to outsource the data stores and their technical management. Those advantages will be discussed later in this series. Also, in the coming parts of the series, we will discuss topics such as the importance of contracts; the differences between the DDP and enterprise information integration, the virtual data warehouse, and ETL; the technologies we can use to develop a DDP; and how to introduce a DDP gradually.


Rick van der Lans

Rick is an independent consultant, speaker and author, specializing in data warehousing, business intelligence, database technology and data virtualization. He is managing director and founder of R20/Consultancy. An internationally acclaimed speaker who has lectured worldwide for the last 25 years, he is the chairman of the successful annual European Enterprise Data and Business Intelligence Conference held in London. In the summer of 2012, he published his new book Data Virtualization for Business Intelligence Systems. He is also the author of one of the most successful books on SQL, the popular Introduction to SQL, which is available in English, Chinese, Dutch, Italian and German. He has written many white papers for various software vendors. Rick can be contacted by sending an email to rick@r20.nl.

