The data delivery platform (DDP) is a modern architecture for developing business intelligence systems where data consumers, such as reporting and analytical tools, are decoupled from data stores. The DDP offers many practical advantages, including:
- Increased flexibility of the architecture
- Shareable transformation and reporting specifications
- Easy migration to other data store technologies
- Cost reduction due to simplification of the architecture
- Easy adoption of new technology
- Transparent archiving of data
The DDP can coexist with other, better-known architectures, such as the data warehouse bus architecture and the corporate information factory.
A DDP can be developed with different types of tools, such as a federation server or an enterprise service bus. Today, however, it’s most practical to use a federation server, such as Composite Information Server, IBM InfoSphere Federation Server, or the Denodo Platform. This article presents an overview of the features of Composite Information Server. The material is derived from a technical white paper that I wrote entitled "Developing a Data Delivery Platform with Composite Information Server."
Composite Information Server is an open federation server capable of presenting a heterogeneous set of data stores as one logical database. This unified view can be used by almost any reporting and analytical tool. In addition, it can be accessed by applications through service-oriented interfaces. All those tools and applications will share the transformation specifications managed within the server.
In classic business intelligence architectures, data transformations are executed by ETL (extract, transform, load) jobs. In most cases, the transformation is not done in one step but in multiple steps, with intermediate results stored in various data stores, such as staging areas, operational data stores, and data marts. Additionally, the ETL jobs are scheduled to run periodically, for example, once a week or every midnight. As a result, all the data transformation takes place before users run the reports. In other words, the transformation of data is done in advance and periodically. In this article, we will refer to this form of transformation as periodic transformation.
Unlike ETL tools, Composite Information Server, like most federation servers, delivers on-demand transformation. With on-demand transformation, data is retrieved from the data stores and transformed only when a reporting tool requests it. The advantages are that users can work with more timely data, there is less need for creating and managing derived data stores, and report and transformation changes can be applied more quickly.
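The difference between the two forms of transformation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `transform` function, the `PeriodicPipeline` and `OnDemandView` classes, and the order rows are hypothetical and have nothing to do with how Composite Information Server is implemented.

```python
Row = dict


def transform(rows: list[Row]) -> list[Row]:
    """Hypothetical transformation: keep shipped orders and add a total."""
    return [
        {**r, "total": r["qty"] * r["price"]}
        for r in rows
        if r["status"] == "shipped"
    ]


class PeriodicPipeline:
    """ETL style: transform in advance and serve the stored result."""

    def __init__(self, source: list[Row]) -> None:
        self.source = source
        self.stored: list[Row] = []

    def run_job(self) -> None:
        # In practice a scheduler would call this, e.g. every midnight.
        self.stored = transform(self.source)

    def query(self) -> list[Row]:
        # Reports read the derived data store, not the source.
        return self.stored


class OnDemandView:
    """Federation style: transform only when a consumer asks for data."""

    def __init__(self, source: list[Row]) -> None:
        self.source = source

    def query(self) -> list[Row]:
        return transform(self.source)


source = [{"id": 1, "qty": 2, "price": 5.0, "status": "shipped"}]
periodic = PeriodicPipeline(source)
periodic.run_job()
on_demand = OnDemandView(source)

# A new order arrives after the last scheduled job has run.
source.append({"id": 2, "qty": 1, "price": 3.0, "status": "shipped"})

periodic_ids = [r["id"] for r in periodic.query()]
on_demand_ids = [r["id"] for r in on_demand.query()]
```

In this sketch the periodic pipeline still reports only the first order until its next run, while the on-demand view already includes the second one.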
Views and data services are the core building blocks of Composite Information Server. Views are used by data consumers such as analytical and reporting tools to access data using relational methods. Data services are used by consumers such as applications and websites to access data using service-oriented methods. Apart from this difference in how they are consumed, Composite Information Server views and data services perform the same functions. For that reason, this article focuses primarily on views.
Views are used as transformation steps and are defined using SQL. Each view can hold a number of transformation steps, such as join, selection, projection, and aggregation. Views can be stacked on top of each other (see Figure 1). Different sets of views can be defined for different users or user groups. Shareable transformation specifications can be placed in views that must be used by all users. Views can also be used to transform data stored in XML documents, sequential files, MDX cubes, SOAP-based services, and Java components into relational tables. This makes it possible to seamlessly integrate non-relational data with relational data.
Figure 1: Joining MDX Cubes with Relational Data and XML Documents
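To make the idea of stacked views concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for a federation server. The tables, view names, and sample rows are hypothetical, and the SQL is SQLite's dialect rather than Composite Information Server syntax. Each view adds one transformation step on top of the previous one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER,
                         product TEXT, amount REAL);
    CREATE TABLE customers (customer_id INTEGER, region TEXT);

    INSERT INTO orders VALUES
        (1, 10, 'printer', 200.0),
        (2, 10, 'toner', 50.0),
        (3, 11, 'printer', 300.0);
    INSERT INTO customers VALUES (10, 'EMEA'), (11, 'APAC');

    -- Layer 1: a join, usable as a shared specification for all users.
    CREATE VIEW v_orders_with_region AS
    SELECT o.order_id, o.product, o.amount, c.region
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id;

    -- Layer 2: selection and projection on top of layer 1.
    CREATE VIEW v_printer_orders AS
    SELECT order_id, region, amount
    FROM v_orders_with_region
    WHERE product = 'printer';

    -- Layer 3: aggregation on top of layer 2.
    CREATE VIEW v_printer_revenue_by_region AS
    SELECT region, SUM(amount) AS revenue
    FROM v_printer_orders
    GROUP BY region;
    """
)

# A reporting tool only needs to know about the top-level view.
rows = conn.execute(
    "SELECT region, revenue FROM v_printer_revenue_by_region ORDER BY region"
).fetchall()
```

A change to the join in the bottom view is picked up by every view stacked on it, which is the sense in which the specification is shareable.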
Regardless of how efficient a federation server is, it’s an extra layer of software that sits between the reporting tools and the data stores, so it will consume CPU cycles and increase the response time of queries. The response time of a query is the time used by the federation server plus the time used by the underlying database server(s); the latter will consume most of the processing time and the former only a small fraction. Still, it’s important that a federation server optimizes query performance as much as possible. Therefore, Composite Information Server offers several mechanisms to optimize query performance, including advanced distributed join optimization, instant and scheduled caching, and push-down of query processing to the underlying database servers. In addition, developers can see how a query is being processed.
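Push-down can be illustrated with a small sketch. It assumes a single SQLite database playing the role of an underlying database server and two hypothetical functions playing the role of the federation layer; it does not show how Composite Information Server itself plans queries. The point is only that sending the filter to the source reduces the number of rows that have to be transferred.

```python
import sqlite3

# Stand-in for an underlying database server (hypothetical data).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (order_id INTEGER, region TEXT)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(i, "EMEA" if i % 10 == 0 else "APAC") for i in range(1000)],
)


def without_push_down(region: str) -> tuple[list[tuple], int]:
    """Fetch every row, then filter in the federation layer."""
    fetched = source.execute("SELECT order_id, region FROM orders").fetchall()
    result = [row for row in fetched if row[1] == region]
    return result, len(fetched)


def with_push_down(region: str) -> tuple[list[tuple], int]:
    """Send the filter to the source so that only matching rows come back."""
    fetched = source.execute(
        "SELECT order_id, region FROM orders WHERE region = ?", (region,)
    ).fetchall()
    return fetched, len(fetched)


naive_rows, naive_transferred = without_push_down("EMEA")
pushed_rows, pushed_transferred = with_push_down("EMEA")
```

Both functions return the same result, but in this sketch the first one moves all 1,000 rows out of the source while the second moves only the 100 matching rows.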
Another type of optimization technique is the use of a cache. For each view, a cache can be defined. If a cache is defined for a view, that view switches from on-demand transformation to periodic transformation. There can be various reasons for defining a cache:
- Load optimization: A cache might be useful to minimize the load on the underlying system. It could be that a view is defined on tables in an old system that already has issues with performance. Additional queries might be too much for this system. By defining a cache, fewer queries will be executed on the old system.
- Consistent reporting: A cache can also be useful when users want a report to return the same results every time it is run within a specific period (a day, week, or month). This is typical for users of reports, who can find it quite confusing if the same report returns different results. In this case, a cache might be necessary if the contents of the underlying database are constantly being updated.
- Source availability: If the underlying system is not always available, a periodically refreshed cache might enable 7x24 operation.
- Complex transformations: The transformations to be applied to the data might be so complex that doing them on-demand might be too slow. Storing the transformed result in the cache and reusing the result several times might be more efficient.
The side effect of caching is that the data returned when querying the view may no longer be 100% up to date.
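The trade-off between freshness and load can be sketched as follows. The `CachedView` class, its `max_age_seconds` parameter, and the injectable `clock` are all hypothetical names chosen for this illustration; the sketch uses a simple refresh-on-expiry policy and is not a description of the caching options in Composite Information Server.

```python
import time
from typing import Callable, Optional


class CachedView:
    """Hypothetical view whose result is recomputed at most once per period."""

    def __init__(
        self,
        compute: Callable[[], list],
        max_age_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._compute = compute          # runs the (expensive) transformation
        self._max_age = max_age_seconds  # how long a cached result stays valid
        self._clock = clock              # injectable so the sketch is testable
        self._rows: list = []
        self._loaded_at: Optional[float] = None

    def refresh(self) -> None:
        # This is the only place where the underlying source is queried.
        self._rows = self._compute()
        self._loaded_at = self._clock()

    def query(self) -> list:
        expired = (
            self._loaded_at is None
            or self._clock() - self._loaded_at >= self._max_age
        )
        if expired:
            self.refresh()
        return self._rows


# Usage with a fake clock and a fake source.
now = [0.0]
source_rows = ["a"]
source_hits = []


def compute() -> list:
    source_hits.append(1)
    return list(source_rows)


view = CachedView(compute, max_age_seconds=60, clock=lambda: now[0])
first = view.query()     # cache is empty, so the source is queried
source_rows.append("b")  # the source changes
second = view.query()    # served from the cache, so the change is not visible
now[0] = 61.0            # the cache period has passed
third = view.query()     # refreshed, so the change is now visible
```

The second query shows both sides of caching: the source is not touched again, and the result is no longer fully up to date.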
A large environment may end up with many views, and many relationships between the views and the data sources. It’s important that developers and administrators can easily see all those relationships. This makes it possible to answer questions such as: if the structure of a foreign table or view changes, which other views have to be changed as well?
Composite Information Server stores all the definitions of data sources, views, procedures, and so on, in one central repository. This makes it easy to show all the dependencies between those objects. For example, Figure 2 shows a dependency diagram that presents all the views and data sources that directly and indirectly make up a view called PRINTER_ORDERS (on the left-hand side of the diagram). This is sometimes called a lineage diagram. Such a diagram allows us to do impact analysis. If the structure of a foreign table or view changes, we can see on which other views it might have an impact.
Figure 2: A Lineage Diagram that Shows the Interdependencies between Views and Data Sources
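Lineage and impact analysis are both traversals of the same dependency graph, in opposite directions. The sketch below assumes a hypothetical repository held as a Python dictionary; apart from PRINTER_ORDERS, all object names are invented for illustration and do not come from the figure.

```python
from collections import deque

# Hypothetical repository: each object maps to the objects it reads from.
DEPENDS_ON = {
    "PRINTER_ORDERS": ["ORDERS_ENRICHED", "PRODUCTS"],
    "ORDERS_ENRICHED": ["ORDERS", "CUSTOMERS"],
    "REGION_SALES": ["ORDERS_ENRICHED"],
}


def reachable(start: str, edges: dict[str, list[str]]) -> set[str]:
    """Return every object reachable from start (breadth-first)."""
    seen: set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def invert(edges: dict[str, list[str]]) -> dict[str, list[str]]:
    """Turn 'reads from' edges into 'is read by' edges."""
    inverted: dict[str, list[str]] = {}
    for node, parents in edges.items():
        for parent in parents:
            inverted.setdefault(parent, []).append(node)
    return inverted


# Lineage: everything that directly or indirectly makes up a view.
lineage = reachable("PRINTER_ORDERS", DEPENDS_ON)

# Impact analysis: every view affected if a source table changes.
impacted = reachable("ORDERS", invert(DEPENDS_ON))
```

Keeping all definitions in one repository is what makes this kind of traversal possible: every edge of the graph is available in a single place.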
To summarize, Composite Information Server is a powerful and flexible open federation server that delivers data on demand from various data stores. Using the extensive caching mechanism, it can also offer periodic transformation. The product hides where and how data is stored from reporting and analytical tools and applications. It has a very modular structure based on views and data sources that allows developers to set up their federation solution the way they think is right. The popular language SQL is used to specify nearly all the required aggregations, transformations, and calculations, which makes the tool easy to learn for most developers.
The modular approach of Composite Information Server and its extensive optimization technologies make it very well suited for developing a business intelligence architecture based on the data delivery platform.
SOURCE: Using Composite Information Server to Develop a Data Delivery Platform
Recent articles by Rick van der Lans