The Flaws of the Classic Data Warehouse Architecture, Part 3: The Data Delivery Platform versus the Rest of the World

Originally published 12 May 2009


In Part 2 of this series, we introduced the data delivery platform (DDP). In a nutshell, the DDP is an architecture for developing data warehouse environments where the emphasis lies on decoupling the data stores (also known as the data providers) from the business intelligence (BI) applications (also known as the data consumers). What makes the DDP special is that it doesn’t have the same drawbacks as the classic data warehouse architecture (CDWA) (see Part 1 of this series). This article discusses the relationship between the DDP and other technologies, such as ETL tools, EII tools, the virtual data warehouse and service-oriented architectures (SOAs).

Let’s begin with extract, transform and load (ETL), which is still among the most widely used of these four technologies in the context of data warehousing. We all know the role of ETL (or ELT) in CDWAs, and we all know it’s quite important. First of all, an ETL tool helps to pump data from one data store to another. Secondly, the ETL tool is responsible for most of the data integration. Integration here means making sure the data fits together, is correct and adheres to all the data definitions. This is done through tasks such as data aggregation, data cleansing, data transformation and data joining. Thirdly, an ETL tool can manage the whole process of pumping and integration; in fact, the ETL tool is responsible for what you could call data logistics. The last reason ETL plays a crucial role in data warehouse environments is that these tools offer lineage and impact analysis capabilities. We have to be able to see the relationships between data structures, ETL scripts and reports. Through lineage and impact analysis, we get a better grip on the whole process.
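To make these steps concrete, here is a minimal sketch of a batch ETL job in Python. The source systems, record layouts and values are purely illustrative; a real job would read from the actual source databases and write to the data mart.

```python
# Minimal sketch of the ETL steps described above: extract, cleanse,
# transform, join, aggregate and load. All names and values are hypothetical.
from collections import defaultdict

# Extract: rows as they might arrive from two source systems.
crm_customers = [
    {"cust_id": 1, "country": "nl", "name": " Jansen "},
    {"cust_id": 2, "country": "NL", "name": "de Vries"},
]
erp_orders = [
    {"order_id": 10, "cust_id": 1, "amount": "250.00"},
    {"order_id": 11, "cust_id": 2, "amount": "99.95"},
    {"order_id": 12, "cust_id": 1, "amount": "10.00"},
]

# Cleanse and transform: trim names, normalize country codes, cast amounts.
customers = {
    c["cust_id"]: {"name": c["name"].strip(), "country": c["country"].upper()}
    for c in crm_customers
}
orders = [{**o, "amount": float(o["amount"])} for o in erp_orders]

# Join and aggregate: total order amount per customer country.
totals = defaultdict(float)
for o in orders:
    totals[customers[o["cust_id"]]["country"]] += o["amount"]

# Load: in a real job this result would be written to the data mart.
for country, total in totals.items():
    print(country, round(total, 2))
```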

The DDP allows us to minimize the number of data stores, and therefore to minimize data redundancy and the need to copy data. Does that mean there will be no role for ETL vendors anymore or, at most, a limited role? My answer would be no; see Figure 1. Of course, if we develop a DDP the right way, less data needs to be pumped, and thus less data logistics is required. But pumping of data will still occur. For example, it could still be necessary to create a data mart to get the required performance, and that data mart has to be populated with data.


Figure 1: The Need for ETL in a Data Delivery Platform

And don’t forget – and this is probably the most important reason why ETL is still needed – the data integration needs still exist. The amount and complexity of integration work is not determined by whatever warehouse architecture we come up with, but by the state of the source databases. And if we implement a DDP architecture, that state does not change.

The task of metadata management that is now handled by the ETL tools should be transferred to the DDP. The DDP should manage a repository accessible by any tool (including the ETL tool), in which all the metadata is stored. This approach will make those specifications more sharable. Note that the DDP should treat this repository the same way it treats the data itself. This implies that metadata stores are decoupled from the metadata consumers.
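As a rough illustration, here is a minimal sketch of such a shared metadata repository with a toy impact-analysis lookup. The element names, fields and sources are purely hypothetical.

```python
# Minimal sketch of a shared metadata repository that both the ETL tool and
# the DDP could read and write. All names and fields are illustrative.
metadata_repository = {}

def register(name, definition, sources):
    """Store the definition and lineage (sources) of a data element."""
    metadata_repository[name] = {"definition": definition, "sources": sources}

def impact_of(source):
    """Impact analysis: which data elements depend on a given source?"""
    return [n for n, m in metadata_repository.items() if source in m["sources"]]

register("customer_revenue",
         "Sum of order amounts per customer",
         sources=["erp.orders", "crm.customers"])
print(impact_of("erp.orders"))   # -> ['customer_revenue']
```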

In the coming years, I expect ETL vendors to upgrade their products in such a way that they can support the DDP. A few years ago, the ETL vendors started to upgrade their products from straightforward batch-oriented ETL tools to full-blown data integration tools. The DDP will be the next step of evolution for them. So, will the DDP and ETL coexist? Absolutely; as long as we have multiple data stores with duplicate data, we need ETL tools. But in the long run, I expect ETL to migrate to the DDP, or ETL to become one of the core features of the DDP.

Another technology that plays a role in the worlds of data warehousing and data integration is enterprise information integration (EII). In short, EII tools are able to present multiple, heterogeneous data sources as one logical database. If data consumers access these EII tools, it will feel as if they are accessing one large database. Instead of EII, the terms federated databases and data virtualization are regularly used. EII tools on the market today include Composite Software's Information Server, IBM’s InfoSphere Federation Server and Oracle's BI Server.

EII tools have been around for quite some time, but they have never had the same level of popularity as the ETL tools. The main difference between the two is that EII performs data integration on demand; therefore, let’s call this form on-demand integration. Data does not have to be stored before it becomes available for the data consumers. Classic ETL tools perform a form of batch integration; data is integrated and stored before it becomes available for reporting. Note that many commercially available ETL tools are now also able to do on-demand integration.
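The following sketch contrasts the two forms of integration. The source data and names are made up; the point is only that batch integration stores the integrated result in advance, while on-demand integration computes it at the moment the query arrives.

```python
# Sketch contrasting batch integration (integrate and store first) with
# on-demand integration (integrate when the query arrives). Hypothetical data.
customers = {1: "Jansen", 2: "de Vries"}          # source system A
orders = [(1, 250.0), (2, 99.95), (1, 10.0)]      # source system B

# Batch integration: the joined, aggregated result is computed and stored.
revenue_mart = {}
for cust_id, amount in orders:
    name = customers[cust_id]
    revenue_mart[name] = revenue_mart.get(name, 0.0) + amount

def query_batch(name):
    return revenue_mart.get(name, 0.0)

# On-demand integration: nothing is stored; integration runs per query.
def query_on_demand(name):
    return sum(a for cid, a in orders if customers[cid] == name)

assert query_batch("Jansen") == query_on_demand("Jansen") == 260.0
```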

The good thing about EII is that the data consumers do not access the data providers directly. In other words, on a very high conceptual level, EII has a lot in common with the DDP. It could well be that, in the coming years, the DDP is at heart an EII product. However, the current EII products do not have all the features we want. So let's say the DDP will be based on an EII++ tool.

Another concept we have to discuss is the virtual data warehouse. Different people have different ideas of what exactly a virtual data warehouse is. In most cases, the virtual warehouse is seen as a software layer that gives users the feeling there is one big database, but in fact, data is retrieved from a set of production databases or data marts. So, no physical central data warehouse exists. If a data consumer needs data, he accesses a layer that integrates data stored in different data stores in real time. In his article, The Elusive Virtual Data Warehouse, Bill Inmon presents the disadvantages of this approach. Most of the disadvantages he describes relate to all the real-time integration of data that has to take place and the amount of resources that will be required.

The DDP is not the same as a virtual data warehouse, although some people will think it is. In a virtual data warehouse, there really is no physical database that contains all (or most of) the data. In a DDP, this is optional. You choose. It might be that in a DDP all the data is stored only once, and every form of integration is done on demand. Or, for technical, financial or performance reasons, data is copied and pre-integrated into various databases, and integration is done in advance. The good thing about the DDP is that the architecture doesn't force you to choose between the virtual or non-virtual solution. Plus, the data consumers won't notice, because the location the data comes from is hidden from them. In fact, it is irrelevant to the data consumers as long as the result conforms to their requirements with respect to quality, performance, structure and so on. The DDP even allows an organization to start with a virtual solution and slowly migrate to a non-virtual one, or vice versa.
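The sketch below illustrates that decoupling. The dataset name and resolver functions are hypothetical; the point is only that the consumer's call does not change when the implementation is switched between a materialized and a virtual solution.

```python
# Sketch of the decoupling described above: the consumer always asks the
# platform for a logical data set; whether it is served from a materialized
# mart or assembled on demand is a configuration choice. Names are made up.
def from_mart():
    # Pre-integrated, physically stored result.
    return [("NL", 359.95)]

def on_demand():
    # Same result, assembled from the sources at query time.
    return [("NL", 250.0 + 99.95 + 10.0)]

RESOLVERS = {"revenue_by_country": from_mart}   # swap in on_demand at will

def get(dataset):
    """The only call a data consumer ever makes."""
    return RESOLVERS[dataset]()

print(get("revenue_by_country"))
```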

Another technology we have to mention is the service-oriented architecture. The strength of the SOA is the concept of a service itself. A service presents a clean and well-defined interface and hides the way it operates and the location of the data that produced the results – a notion that is very similar to that of the DDP. Therefore, the DDP should integrate quite well with the SOA. If the DDP makes data available, it should be accessible through a classic ODBC/JDBC/SQL-like interface but also through a more SOAP/XML-like interface.
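As a rough illustration, the sketch below serves one made-up result set both as plain rows, the way a SQL/ODBC-style consumer would expect it, and as an XML document, the way a SOAP/XML-style consumer would expect it.

```python
# Sketch of exposing one logical result through two interfaces. Element and
# column names are illustrative only.
import xml.etree.ElementTree as ET

rows = [("NL", 359.95), ("BE", 120.50)]

def as_table():
    # What a JDBC/ODBC-style client would receive: plain rows.
    return rows

def as_xml():
    # What a SOAP/XML-style client would receive: the same data as a document.
    root = ET.Element("revenueByCountry")
    for country, total in rows:
        entry = ET.SubElement(root, "entry", country=country)
        entry.text = f"{total:.2f}"
    return ET.tostring(root, encoding="unicode")

print(as_table())
print(as_xml())
```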

In addition, SOA governance products, such as AmberPoint and SOA Software, allow us to bring in service level agreements, and they support technology that reacts to those SLAs. The DDP needs this or comparable technology. For each data consumer, a contract has to be set up. This contract contains specifications related to data quality and definitions, but also to performance and availability. SOA governance products already have that technology; let's reuse it in the DDP.
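The sketch below shows what such a per-consumer contract could look like and how measured service levels might be checked against it. The fields and thresholds are purely illustrative.

```python
# Minimal sketch of a per-consumer contract and a simple SLA check.
# All field names and numbers are hypothetical.
contract = {
    "consumer": "finance_dashboard",
    "max_response_seconds": 5.0,
    "min_availability_pct": 99.5,
    "max_days_data_age": 1,
}

def violations(measured):
    """Compare measured service levels against the agreed contract."""
    issues = []
    if measured["response_seconds"] > contract["max_response_seconds"]:
        issues.append("response time exceeded")
    if measured["availability_pct"] < contract["min_availability_pct"]:
        issues.append("availability below agreement")
    if measured["days_data_age"] > contract["max_days_data_age"]:
        issues.append("data too old")
    return issues

print(violations({"response_seconds": 7.2,
                  "availability_pct": 99.9,
                  "days_data_age": 0}))   # -> ['response time exceeded']
```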

To summarize, ETL will keep playing an important role in DDP-driven data warehouse architectures. A future DDP may well be based on existing EII technology, but the products have to be extended to be able to fully implement the DDP. The DDP makes it possible to create a virtual data warehouse, but it is not a requirement. In addition, a virtual data warehouse is not a DDP. And finally, the SOA brings a lot of integration and governance technologies to the table; therefore, adoption of SOA within the DDP is mandatory.

In the next article of this series, we will continue explaining what the DDP is and what it could mean to an organization. The focus of that article will be on how to introduce a DDP. Again, stay tuned.

SOURCE: The Flaws of the Classic Data Warehouse Architecture, Part 3

  • Rick van der Lans

    Rick is an independent consultant, speaker and author, specializing in data warehousing, business intelligence, database technology and data virtualization. He is managing director and founder of R20/Consultancy. An internationally acclaimed speaker who has lectured worldwide for the last 25 years, he is the chairman of the successful annual European Enterprise Data and Business Intelligence Conference held in London. In the summer of 2012, he published his new book, Data Virtualization for Business Intelligence Systems. He is also the author of one of the most successful books on SQL, the popular Introduction to SQL, which is available in English, Chinese, Dutch, Italian and German. He has written many white papers for various software vendors. Rick can be contacted by sending an email to rick@r20.nl.

    Editor's Note: Rick's blog and more articles can be accessed through his BeyeNETWORK Expert Channel.
