Blog: Barry Devlin

Barry Devlin

Hello and welcome to my blog!

About the author

Dr. Barry Devlin is among the foremost authorities in the world on business insight and data warehousing. He was responsible for the definition of IBM's data warehouse architecture in the mid '80s and authored the first paper on the topic in the IBM Systems Journal in 1988. He is a widely respected consultant and lecturer on this and related topics, and author of the comprehensive book Data Warehouse: From Architecture to Implementation published by Addison-Wesley in 1997.

Over the past few years, Barry has extended his interest to cover the wider field of a fully integrated business, covering informational, operational and collaborative environments and, in particular, how to present the end user with a holistic experience of the business through IT.

Barry has worked in the IT industry for more than 25 years, mainly as a Distinguished Engineer for IBM in Dublin, Ireland. He is now founder and principal of 9sight Consulting, specializing in the human, organizational and IT implications and design of deep business insight solutions.

I was speaking to Susan Davis and Bob Zurek of Infobright the other day, and one statement that caught my attention was that they try to go to the actual data as little as possible.  An interesting objective for a product that's positioned as a "high performance database for analytic applications and data marts", don't you think?

It sounds somewhat counter-intuitive until you realize that in a world of exploding data volumes that need to be analyzed, you have only two choices if you want to maintain a reasonable response time for users: (1) throw lots of hardware at the problem--parallel processing, faster storage, and more--or (2) be a lot cleverer in what you access and when.  The first approach is pretty common and, based on recent developments, quite successful.  And as we move into solid-state disks (SSD) and in-memory databases, we'll see even more gains.  But let's play with the second option a bit.

How can we minimize access (disk I/O) to the actual data?  We can say immediately that the minimum number of times we have to touch the actual data is once!  In the case of a data warehouse or mart, that is when we load it.  In a traditional row-based RDBMS, that's also when we build any indexes we need to speed access for particular queries or further processes.  With column-based databases, we often hear that indexes are no longer needed or are much reduced, which cuts database size, load time and ongoing maintenance costs.  And it's certainly true that columnar databases improve query response time.  And yet, we might ask (and it applies to row-based databases as well): is there anything else we could do on that single, mandatory pass over all the data that could help reduce later data access during analysis?

Infobright's solution is the Knowledge Grid, a set of metadata based on Rough Set theory that is generated at load-time and used to limit the range of actual data a query has to retrieve in order to figure out which values match the query conditions.  For each block of 64K items on disk (a Data Pack), a set of metadata, such as maximum and minimum values, sum, count, etc. for numerical items, is calculated at load-time.  At query run-time, these statistics tell the database engine that some data packs are irrelevant because no item in them meets the query conditions.  Other data packs contain only data that meets the query conditions, and if the statistics already contain the result needed by the query, the data here need not be accessed either.  The remainder of the data packs contain some data that matches the query and will have to be accessed.  Given the right statistics, the amount of disk I/O can be significantly reduced.  Infobright also create metadata for character items at load-time and for joins at query-time.
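To make the three kinds of data pack concrete, here is a minimal Python sketch of the general idea.  It is emphatically not Infobright's implementation: the names (DataPack, load, sum_greater_than) are my own inventions for illustration, the pack size is shrunk from 64K items to 4 so the example stays readable, and an in-memory list stands in for the data on disk.

```python
from dataclasses import dataclass
from typing import List, Tuple

PACK_SIZE = 4  # illustrative only; the post describes packs of 64K items


@dataclass
class DataPack:
    values: List[int]  # stands in for the actual data on disk
    minimum: int       # the statistics below are computed once, at load-time
    maximum: int
    total: int
    count: int


def load(values: List[int], pack_size: int = PACK_SIZE) -> List[DataPack]:
    """Split the column into packs and compute per-pack statistics."""
    packs = []
    for start in range(0, len(values), pack_size):
        chunk = values[start:start + pack_size]
        packs.append(
            DataPack(chunk, min(chunk), max(chunk), sum(chunk), len(chunk))
        )
    return packs


def sum_greater_than(packs: List[DataPack], threshold: int) -> Tuple[int, int]:
    """Return (sum of all values > threshold, number of packs actually read)."""
    total = 0
    packs_read = 0
    for pack in packs:
        if pack.maximum <= threshold:
            # Irrelevant pack: no item can match, so skip it entirely.
            continue
        if pack.minimum > threshold:
            # Every item matches: the stored sum answers the query.
            total += pack.total
            continue
        # Mixed pack: only now do we touch the actual data.
        packs_read += 1
        total += sum(v for v in pack.values if v > threshold)
    return total, packs_read


column = [1, 2, 3, 4, 10, 11, 12, 13, 3, 9, 4, 8]
packs = load(column)
result, packs_read = sum_greater_than(packs, 5)
print(result, packs_read)
```

In this toy run only one of the three packs has to be read to answer the query; the other two are resolved from the statistics alone.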

Generalizing from the above, we can begin to imagine other possibilities.  What if you didn't load the actual data into the database, but just left it where it was and crawled through it to create metadata of a similar nature to allow irrelevant data for a particular query to be eliminated en masse?  Of course, that sounds a bit like the indexing approach used by search engines and extended by Attivio and others to cover relational databases as well.  The problem with indexes and similar metadata, however, is that they also tend to grow in volume until they reach a significant percentage of the actual data size; then we're back to square one.

My mathematical skills are far too rusty (if they were ever bright and shiny enough in the first place) to know if Rough Set theory has anything to say about that issue or how it could be applied beyond the way that Infobright have implemented it, but it does seem like an interesting area for exploration as data volumes continue to explode.  Any bright PhDs out there like to give it a try?


Posted July 29, 2010 2:03 PM
Permalink | No Comments |
Synchronicity is a wonderful thing! I get yet another follower notice from Twitter today, and for the first time in ages I am curious enough to check the profile. It turns out that @LaurelEarhart is marketing director for the Smart Content Conference, among other things, including Biz Dev Maven! And there, I read "Perfect storm: #Google acquired #Metaweb" announced on July 16. Having just done a webinar with Attivio yesterday on the topic "Beyond the Data Warehouse: A Unified Information Store for Data and Content" my interest was piqued. Let me tell you why.

I suspect that very few data warehouse vendors or developers have paid much attention to Metaweb or its acquisition. As far as I can tell, it hasn't turned up on the data warehouse or BI analyst blogs either. Perhaps the reason is that Metaweb's business is in providing a semantic data storage infrastructure for the web, and Freebase, an "open, shared database of the world's knowledge". For data warehouse geeks, the former is probably a bit off-message, while the latter may sound like Wikipedia, although the mention of a shared database may raise the interest level slightly.

But, if you're thinking about what lies beyond data warehousing (as I am), and wondering how on earth we're ever going to truly integrate relevant content with the data in our warehouses, what Metaweb and now Google are doing should be of some interest. Here's a quote from Jack Menzel, director of product management at Google, on his blog:

"Type [barack obama birthday] in the search box and see the answer right at the top of the page. Or search for [events in San Jose] and see a list of specific events and dates. We can offer this kind of experience because we understand facts about real people and real events out in the world. But what about [colleges on the west coast with tuition under $30,000] or [actors over 40 who have won at least one oscar]? These are hard questions, and we've acquired Metaweb because we believe working together we'll be able to provide better answers."

For me, the interesting point here is the inclusion in the hard questions of conditions that would make sense to even the most inexperienced BI user. Take either of these two hard questions and you can easily imagine the SQL statements required, provided you have defined and populated the right columns in your tables. The problem is that you need to have defined those columns and tables in advance of somebody asking the questions.
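As a rough illustration of the first hard question, here is a small Python sketch using the standard sqlite3 module.  Everything in it is an assumption made up for the example: the college table, its columns, the rows, and my reading of "west coast" as three US states.  The query itself is trivial; what matters is that it only works because the table was modelled and loaded beforehand.

```python
import sqlite3

# Hypothetical schema and rows, invented purely for this example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE college (name TEXT, state TEXT, tuition_usd INTEGER)"
)
conn.executemany(
    "INSERT INTO college VALUES (?, ?, ?)",
    [
        ("Example College A", "CA", 25000),
        ("Example College B", "OR", 42000),
        ("Example College C", "NY", 18000),
        ("Example College D", "WA", 29000),
    ],
)

# One possible reading of "west coast"; the real question is fuzzier.
WEST_COAST = ("CA", "OR", "WA")

rows = conn.execute(
    "SELECT name FROM college "
    "WHERE state IN (?, ?, ?) AND tuition_usd < ? "
    "ORDER BY name",
    (*WEST_COAST, 30000),
).fetchall()

names = [row[0] for row in rows]
print(names)
```

Had nobody thought to include a tuition column or a state column when the table was designed, the question simply could not be asked.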

What Metaweb on the Internet and Attivio on the intranet (and, of course, other vendors in both areas) are trying to do is to bridge the gap between data and content, so that users can ask mixed search and BI queries based on the implicit understanding that exists in the data/content stores of the semantics of the information. And, perhaps more importantly, to be able to do that in a fully ad hoc manner that doesn't require prior definition of a data model and its instantiation in columns and tables of a relational database. If you want to dig deeper, I invite you to take a look at my recent white paper.

In the meantime, my thanks to @LaurelEarhart and the wonder of synchronicity.

Posted July 22, 2010 3:39 AM
Permalink | No Comments |
Any acquisition in the database market, in this case, the July 6 announcement of EMC's plan to acquire Greenplum, generates a flurry of analyst activity speculating about the financial or technical rationale for the acquisition, winners and losers among other database vendors and the effect of the move on customers' buying patterns.  Personally, I find these opinions very interesting and highly informative.  And I invite you to check out, for example, Curt Monash or Merv Adrian to explore these aspects of the acquisition.

However, I'd like to take the opportunity to focus our minds once again on a more fundamental question: how is IT going to manage data quality and reliability in a rapidly expanding data environment, both in terms of data volumes and places to store the data?  I'm currently describing a logical enterprise architecture, Business Integrated Insight (BI2), that focuses on this.

So, for me, what the acquisition emphasizes, like that of Sybase by SAP, is that specialized databases, with their sophisticated features and functions, are rapidly entering the mainstream of database usage.  Their ability to handle large data volumes with vast improvements in query performance has become increasingly valuable in a wide range of industries that want to analyze enormous quantities of very detailed data at relatively low cost.  How to do this?  Vendors of these systems typically have a simple answer: copy all the required data into our machine and away you go!

My concern is that IT ends up with yet another copy of the corporate data, and a very large copy at that, that must be kept current in meaning, structure and content on an ongoing basis.  Any slippage in maintaining one or more of these characteristics leads inevitably to data quality problems and eventually to erroneous decisions.  Such issues typically emerge unexpectedly, in time-constrained or high-risk situations and lead to expensive and highly visible firefighting actions by IT.  Unfortunately, such occurrences are common in BI environments, but typically relate to unmanaged spreadsheets or relatively small data marts.  We have just jumped the problem size up by a couple of orders of magnitude.

So, am I suggesting that you shouldn't be using these specialized databases?  Would I recommend that you stand in front of a speeding freight train?  Clearly not!

There are two ways that these problems will be addressed.  One falls upon customer IT departments, while the other comes back to the database industry and the vendors, whether acquiring or acquired.  These paths will need to be followed in parallel.

IT departments need to define and adopt stringent "data copy minimization" policies.  The purist in me would like to say "elimination" rather than "minimization".  However, that's clearly impossible.  Minimization of data copies, in the real world, requires IT to evaluate the risks of yet another copy of data, the possibility of using an existing set of data for the new requirement and, if a new copy of the data is absolutely needed, whether existing analytic solutions could be migrated to this new copy of data and the existing data copies eliminated.

Meanwhile, it is incumbent upon the database industry to take a step back and look at the broader picture of data management needs in the context of emerging technologies and the explosive growth in data volumes.  The basic question that needs to be asked is: how can the enormous power and speed of these emerging technologies be crafted into solutions that equally support divergent data use cases on a single copy of data?  And, if not on a single copy, how can multiple copies of data be managed to complete consistency invisibly within the database technology?

Tough questions, perhaps, but ones that the acquirers in this industry, with their deep pockets, need to invest in.  As the database market re-converges, the vendors that solve this architectural conundrum will become the market leaders in highly consistent, pervasive and minimally duplicated data that enables IT to focus on solving real business needs rather than managing data quality.  Wouldn't that be wonderful?

Posted July 7, 2010 1:18 PM
Permalink | No Comments |
Integrating soft information (aka unstructured data or content) into the data warehouse has long been a concern for implementers in many industries.  And it was one of the issues I wanted to address head-on in my evolving Business Integrated Insight (BI2) architecture.

Last year, I was intrigued by the term "Unified Information Access" used by Attivio Inc. to describe their product that offers highly integrated access to both hard and soft data.  My impression from the marketing material was that they were describing exactly the sort of approach I envisaged to address the convergence of these two ends of the structuredness spectrum of information.  Deeper discussions with Andrew McKay, SVP at Attivio, confirmed my first impressions, and led to my writing a white paper describing a "Unified Information Store", published this week.

You can read the summary of the paper below, but the bottom line is that by building the right set of metadata (an enhanced inverted index) you can provide integrated, contextually-rich, agile and easy-to-use access to hard and soft information residing in their normal technologies - relational databases and content management systems, respectively.  To my mind, this approach is the only viable way to avoid copying vast amounts of soft information into your warehouse and still reap the benefits of combined data and content.
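For readers who like to see the idea in miniature, here is a toy Python sketch of a single inverted index built over both relational rows and free-text documents, with the originals left where they live.  It is a plain inverted index, not the enhanced one described in the paper and not Attivio's product; the source keys, field names and sample content are all hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Hypothetical hard data (rows in a relational database).
rows = {
    "crm.customer/17": {"name": "Acme Ltd", "city": "Dublin"},
    "crm.customer/42": {"name": "Globex", "city": "Rome"},
}

# Hypothetical soft information (documents in a content management system).
documents = {
    "cms/complaint-001": "Acme reported late delivery to its Dublin office",
    "cms/memo-007": "Quarterly review of Rome suppliers",
}


def build_index(rows, documents):
    """Map each token to the (source key, field) pairs where it occurs."""
    index = defaultdict(set)
    for key, record in rows.items():
        for field, value in record.items():
            for token in tokenize(str(value)):
                index[token].add((key, field))
    for key, text in documents.items():
        for token in tokenize(text):
            index[token].add((key, "body"))
    return index


def search(index, query):
    """Return the sorted source keys that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = None
    for token in tokens:
        keys = {key for key, _field in index.get(token, set())}
        hits = keys if hits is None else hits & keys
    return sorted(hits)


index = build_index(rows, documents)
print(search(index, "acme dublin"))
```

The single query returns both the customer row and the complaint document, and the index only holds pointers back to the sources rather than copies of them.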

Posted May 26, 2010 11:22 AM
Permalink | No Comments |
Teradata Magazine has just published a synopsis of a recent white paper of mine in their most recent issue.  Here's the first couple of sentences:

"The data warehouse architecture established in the 1980s has stood us in good stead, but the time has come to look at decision making--as it spans the business process spectrum--in a new light. Business today is so interlinked, its information needs so interdependent, its processes so inter­twined, and its reaction times so pressured that only a new architected approach can support it.

This new architecture must cover the entire IT support for the business. It must be fully integrated. And it must stretch beyond "simple" intelligence."

I recommend the article as very well worth a read (well, I would, wouldn't I?), but particularly stunning is the photo of the Queen Sofia Palace of the Arts in Valencia, Spain that the editor chose to illustrate the text.  Take a look!

And regarding the content, stand by for a deeper look at the Business Information Resource, due to appear in the Business Intelligence Journal, 2nd Quarter 2010 issue...




Posted May 24, 2010 9:52 AM
Permalink | 1 Comment |
I was teaching my seminar on Business Integrated Insight last Thursday, 15 April, in Rome when the ash cloud descended over Europe. I was supposed to fly to Dublin on Saturday, but by Monday morning I had decided to set out by trains, car and ferries to get home. And I did - on Thursday evening, 22 April. Four days' travel from Rome to Dublin would probably have looked good on the Victorian railways and steamers! In any case, it turned out to be a very nice trip with some built-in thinking time...

In the coverage of the unfolding chaos, the word that seemed to spring most frequently to the mouths of people responsible for managing any aspect of the situation was "unprecedented". A great word if you want to suggest that you shouldn't be blamed in any way for anything that ensued. After all, if it's unprecedented, you have no basis of information from the past to make decisions about what to do now. Or do you?

The truth of the matter is that there were probably enough precedents of most separate aspects of the event to allow reasonable judgments to be made. The problem was that no-one was able to consolidate enough of the disparate information to really make a difference. Focusing just on the issue of getting hordes of stranded passengers across Europe to every point of the compass: Which trains go where? How do they connect? How to connect from a train to a ferry, or a bus to a train? Minimize travel time or cost? Not to mention hotel rooms?

Could the airlines have minimized their regulatory compensation costs if they could work this out? For sure. Could surface travel companies maximize profits by fully utilizing spare capacity (as opposed to raising prices to exorbitant levels!)? Absolutely. Could groups of enterprising travelers get together to make the best plan to get home? Probably. So, there's lots of incentive to make it work. But none of this happened.

Don't get me wrong. I'm not complaining. I'm just pointing out that much of the underlying information needed to answer the above questions not only exists, but is often accessible on the internet. Every stranded traveler with web access spent hours checking options, trying to make online bookings (usually at severely overloaded sites) and then starting all over again as one link in the chain broke. Some succeeded, while others went and queued for hours at ticket offices.

Operational BI was probably used by some of the more advanced travel companies to track what was going on. Some even managed to schedule additional services to carry extra passengers. Others, such as the Calais-Dover ferries, just stopped taking bookings and went back to the "just turn up at the pier and we'll try to get you onboard as soon as possible" model.

But the really interesting question is this: given that all that information was out there on the web in all its various forms and gory details, how would one go about integrating it in a way that allowed it to be used in an end-to-end travel discovery and booking process?

I'm not expecting the IT industry to have a complete solution any time soon, for a wide variety of political and financial reasons. But a little thought, and none of it very new, suggests we'd need: a common model spanning the information of multiple companies, the ability to link hard and soft information together in a meaningful way, services that act in a fully plug-and-play manner with well-defined interfaces and the ability to mashup a dashboard joining the different steps of the journey together. What I really needed was Business Integrated Insight to get me home!  

Posted April 25, 2010 9:11 AM
Permalink | No Comments |
Preparing materials for a seminar really forces you to think!  I just finished the slides for my two-day class in Rome next week, and after I got over my need for a strong drink (a celebration, of course), I got to reflect on some of what I had discovered.

Perhaps the most interesting were the amazing changes in the database area that have been happening over the past couple of years.  A combination of hardware advances and software innovations has come together with a recognition that data is no longer what it once was to pose some fundamental questions about how databases should be constructed.

Let's start on the business side - always a good place to start.  Users now think that their internal IT systems should behave like a combination of Google, Facebook and Twitter.  Want an answer to the CEO's question on plummeting sales?  Just do a "search", maybe "call a friend", join it all together and voila!  We have the answer. 

From an information viewpoint, this brings up some very challenging questions about the intersection of soft (aka unstructured) information and hard (structured) data and how one ensures consistency and quality in that set.  IT's problem is no longer just combining hard data from different sources; it's about parsing and qualifying soft information as well.  This is not a truly new problem.  Data modelers have struggled with it for years.  It's the speed with which it needs to be done that causes the problem.

So, what has this got to do with new software and hardware for databases?  Well, the key point is that database thinking has suddenly moved on from strict adherence to the relational paradigm.  The relational model is an extraordinarily structured view of data.  Relational algebra is a very precise tool for querying data.  You need to have a strong understanding of both to make valid queries, but do you really want your users to think that way?  Should you necessarily store the information physically in that model?  When you free yourself of these assumptions, you can begin to think in new ways.  Store the data in columns instead of rows?  Perfect!  A mix of row- and column-oriented data, and maybe some in memory only?  Yes, can do!  And then there's mixing searching (a soft information concept) with querying (a hard data thought) to create a hybrid result.  That's easy too!
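A tiny Python sketch may help show why the physical layout matters.  The orders data and the names below are made up for illustration, and plain lists and dictionaries stand in for storage; no real database engine works this simply.

```python
# Hypothetical order records, stored row by row.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120},
    {"order_id": 2, "region": "US", "amount": 80},
    {"order_id": 3, "region": "EU", "amount": 200},
]


def to_columns(rows):
    """Pivot row-oriented records into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


columns = to_columns(rows)

# Analytic query: only the "amount" column needs to be scanned.
total_amount = sum(columns["amount"])

# Record-level lookup: the row layout hands back the whole record at once.
order_2 = next(row for row in rows if row["order_id"] == 2)

print(total_amount, order_2)
```

The same information sits behind both layouts; the column form favours the aggregate, the row form favours the single-record lookup, and a hybrid engine tries to offer both.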

And on the edges of the field, there are even more fundamental questions being asked.  Do we always need consistency in our databases?  Can we do databases without going to disk for the data?  Could we do away with physically modeling the data and just let the computer look after it?  The answers to these questions and more like them are not what you might expect if you've been around the database world for 20 years.  And with those different answers, the overall architecture of your IT systems is suddenly open to dramatic change.

Believe me, the first businesses to adopt some of these approaches are going to gain some extraordinary competitive advantages.  Watch this space!

Posted April 8, 2010 9:58 AM
Permalink | No Comments |
Business users of information have increasingly high expectations these days.  Not only do they want relevant information, irrespective of source and consistent across multiple sources, but they also want it up to the minute.  Such demands require a new approach to Enterprise IT Architecture, and nothing less!

And yet, the question of how to create such a consistent, integrated information resource almost begs a simplistic answer.  If that's what you want, you must stop creating duplicates of existing information that have to be managed to consistency in ever shorter time windows, and you must eliminate--or, at the very least, substantially reduce--existing data duplication.

The original data warehouse architecture from 1988 showed the way. It proposed a logically single data store--the Business Data Warehouse--modeled at the enterprise level as the consistent and integrated source of all information for decision making. This simplicity was ultimately lost with the emergence of the layered architecture (with multiple data marts fed from an enterprise data warehouse), due to a combination of database performance and enterprise modeling issues.

Nonetheless, the approach remains valid for the current much-expanded needs for integration. First, model all the information according to an enterprise-level model and then implement as far as possible in alignment with that model. This is the approach proposed in a new architecture, Business Integrated Insight (BI2), which for the first time gathers all the information of the enterprise, hard and soft; operational, informational and collaborative, into a single component called the Business Information Resource...

Read more in my article just published and if you're in the vicinity, come to my two-day seminar in Rome on 15-16 April :-)

Posted April 1, 2010 11:16 AM
Permalink | No Comments |
I'm presenting a two-day seminar for Technology Transfer in Rome in mid-April, entitled "BI2--From Business Intelligence to Enterprise IT Integration" and am currently researching and preparing the material.  And the more I research, the more excited I get about the prospects for the next wave of development from BI to... what?  Well, that's the real question for me!

It's my belief, and I've been writing and speaking about this for quite a while now, that the way we do BI today has reached its limits.  Business today demands ever closer to real-time information that must be consistent and meaningfully integrated across ever wider scopes.  These demands simply cannot be satisfied by our current concept of a layered, triplicated (and more) data warehouse of hard information--largely numerical data arranged in neat tables--along with some soft information thrown in as an afterthought.  The only way forward that I can see is to begin to treat all business information as a conceptually single, integrated, modelled resource with minimal duplication of data.  I've described this business information resource (BIR) to a first approximation elsewhere and my seminar will, among other things, dig deeper into the structure of the BIR and the technology needed to create and maintain it.

My current excitement stems from the growing reality of "hybrid" databases--combining the features and strengths of row-oriented and columnar relational databases.  Now, I know that academia proposed approaches to this as long as 8 years ago, but it's only in the last year that commercial databases have begun introducing it.  I wrote about Vertica's FlexStore feature, introduced in 2009, in my last post.  The latest announcement I found is of a technology preview program for Ingres VectorWise, the newest entrant in the hybrid database arena.  Add Oracle's Exadata V2, announced last year with typical modesty by Larry Ellison as the "fastest machine in the world for data warehousing, but now by far the fastest machine in the world for online transaction processing", and we can see that the approach is finally gaining market traction.

Why is this important?  Well, despite the hype, Larry hit the nail on the head.  If we finally have databases that can handle both operational and informational workloads equally well, we can begin to define an architecture that doesn't insist on copying vast quantities of data from one database to another.  That doesn't mean the death of the data warehouse any time soon, but it does mean that a much more integrated IT environment is coming your way.

Posted March 18, 2010 10:12 AM
Permalink | 2 Comments |
Columnar databases, especially those with an MPP approach, have been notching up impressive query performance figures, showing gains of 100X and more on the traditional players.  Such figures make great press releases, but they do place the emphasis on using these databases as data marts rather than the enterprise data warehouse (EDW) component of the data warehouse architecture.  Focusing on data marts and very specific business intelligence applications makes a lot of sense for new market entrants and smaller players in the DW space, allowing quick wins and easily understood sales messages.

But I have been convinced for some time now of the much greater potential such performance unleashes in the broader and more complex EDW environment.  And the vendors have been fairly quiet about this part of the market so far, maybe preferring to leave such technically and politically complex projects to the big guys.  So, it was good to see Vertica's 4.0 announcement last week beginning to address the EDW market with its emphasis on "enterprise ready" and a number of interesting new features and expansions of old functions.

Robust workload and resource management for mixed workloads is a prerequisite for an EDW.  Vertica's introduction of administrator-defined resource pools with memory-usage, priority and concurrency settings and the assignment of users to these pools is a big step in this direction.  A rework of the optimizer in support of this and other features suggests that Vertica are serious about this support.

Also introduced in V4.0 is a newly optimized single record lookup on primary keys.  While aimed at a particular financial analysis use case, this function shows that the database can do more than just crunch columns.  Added to the FlexStore feature introduced in V3.5, where newly loaded data is kept in row format in memory for some period of time, I believe we're seeing the database's growing ability to handle the sort of record-level processing often needed in EDWs.  The new time-series support in V4.0 also plays directly to EDW needs.

Time and customer experience will, of course, prove if I'm correct, but it seems to me that Vertica is beginning to test my assertion that columnar, MPP databases can be applied to EDWs.  And further that their performance characteristics offer the possibility of re-architecting the EDW / data mart divide.

Posted March 3, 2010 11:17 AM
Permalink | No Comments |