Blog: Barry Devlin http://www.b-eye-network.co.uk/blogs/devlin/ As one of the founders of data warehousing back in the mid-1980s, a question I increasingly ask myself over 25 years later is: Are our prior architectural and design decisions still relevant in the light of today's business needs and technological advances? I'll pose this and related questions in this blog as I see industry announcements and changes in the way businesses make decisions. I'd love to hear your answers and, indeed, questions in the same vein. Copyright 2010 Fri, 27 Aug 2010 07:05:48 -0700 Hurry, hurry, hurry--from operational BI to CEP Information Overload? 3 ways real-time information is changing decision management" hosted by Eric Kavanaugh, got me into a philosophical mood.  Robin Bloor of the Bloor Group and David Olson of Progress Software provided a fascinating overview of the development of the complex event processing (CEP) market and its increasing importance for business competitiveness and, perhaps, even survival.

Robin's message was twofold.  From a business viewpoint, decision making is moving from reactive to predictive, driven by competition in business to best understand upcoming market opportunities right to the edge of what can be foreseen.  In technology, the exponential growth in speed of processors and, indeed, the majority of the hardware infrastructure is driving or enabling application architectures from batch orientation, through transaction processing and into real-time event handling.  David provided some interesting examples of how CEP is being used by his customers in manufacturing, logistics and airlines to monitor business events in real time, to react earlier to changing circumstances and to drive process improvement.  Their bottom line: CEP is a major architectural transition that is rapidly becoming mainstream; if you're not on board, you risk severe competitive disadvantage.

From one point of view, the message makes sense.  It's yet another twist of the screw towards speedier decision making.  Operational BI promotes the use of near real-time data either copied into the warehouse or accessed in transaction-processing systems via federated query.  CEP goes one step further and says let's access and analyze the data as it flows through the network; we need to make decisions before we land the data on a disk, if we even land it at all.  In the financial markets, with the data volumes and reaction speeds involved, once the technology became available the approach seemed like a no-brainer.  In financial systems, CEP enables high-value applications such as fraud detection in credit card transactions.
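To make the idea of analyzing events in flight a little more concrete, here is a minimal sketch in Python. It is not how any CEP product works internally; the rule (more than three transactions on one card within 60 seconds), the function name `detect_bursts` and the sample events are all assumptions made up for illustration. The point is only that the decision is taken as each event arrives, with a small in-memory window and no stored table to query.

```python
from collections import defaultdict, deque


def detect_bursts(events, max_events=3, window_seconds=60):
    """Yield (card_id, timestamp) each time a card exceeds max_events
    transactions inside a sliding time window.

    `events` is an iterable of (timestamp_seconds, card_id) pairs that
    arrive in time order. Both thresholds are illustrative assumptions.
    """
    recent = defaultdict(deque)  # card_id -> timestamps still inside the window
    for timestamp, card_id in events:
        window = recent[card_id]
        window.append(timestamp)
        # Drop timestamps that have slid out of the window.
        while window and timestamp - window[0] > window_seconds:
            window.popleft()
        if len(window) > max_events:
            yield card_id, timestamp


# Hypothetical event stream: card "A" makes four purchases in 40 seconds.
stream = [(0, "A"), (10, "A"), (20, "B"), (30, "A"), (40, "A"), (200, "A")]
print(list(detect_bursts(stream)))  # prints [('A', 40)]
```

The only state kept is a handful of recent timestamps per card, which is what lets this style of processing run before, or instead of, landing the data on disk.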

However, looking at some of the applications presented in the webinar, operational BI has also been used effectively to solve similar business issues.  The boundary between the more traditional operational BI approach and CEP depends on the required speed of decision making and the volumes of events involved.  CEP certainly extends the high-end of pattern recognition and trend detection to higher speeds and volumes.  And in the middle range, it provides another set of implementation options beside operational BI.

So, what were my philosophical musings?  Robin presented a very interesting scale of human decision-making timescales, from months and years at one extreme to one tenth of a second at the other.  That latter number is the fastest human reaction time and, by the way, slower than a cobra's strike!  CEP and, to some extent, operational BI operate in the range of decision speeds faster than one tenth of a second: that is, entirely beneath human radar.  While it is clear that some decisions--collision avoidance on the highway, for example--naturally fall in this timescale, my concern is the implications that arise from pushing more and more decisions into this realm and, by definition, beyond human oversight.  We've already seen the consequences of this approach in the financial markets, where computer-based trading has driven wild, unpredictable and potentially dangerous swings in the markets.  Decision-making algorithms are only as good as the assumptions that have been encoded in them, which depend, in turn, on the knowledge available and the business requirements--both explicit and implicit--when they were created.  Is it really sensible to design systems that unnecessarily exclude human wisdom?

The current business mindset that competitiveness is next to godliness is, in many cases, driving decision making into tighter and tighter circles, removing wisdom, insight, intuition and basic humanity from the loop.  Are we prepared to learn any lessons from the recent financial market fluctuations?  And is it wise to arbitrarily remove more and more important decision making from human oversight just because the technology is available?  Just asking...  ]]>
http://www.b-eye-network.co.uk/blogs/devlin/archives/2010/08/hurry_hurry_hurry--from_operat.php Business Integrated Insight Fri, 27 Aug 2010 07:05:48 -0700
Doing More Data with Less Computers Pervasive Software presented at the Boulder BI Brain Trust (BBBT) last Friday, August 13.  What caught my attention was their DataRush product and technology, and particularly the technological driver behind it.  For a brief overview of the other aspects of Pervasive covered, check out Richard Hackathorn's blog on the day.

But, back to DataRush.  DataRush was originally conceived as a redesign of Pervasive's data integration tool, acquired from Data Junction in 2003.  However, it was soon recognized that the underlying function could be applied to other data-intensive tasks such as analytics.  Pervasive CTO, Mike Hoskins, described DataRush as a toolkit and engine that enables ordinary programmers to create parallel-processing applications simply and easily using data flow techniques to design them and without having to worry about the complexities of parallel-processing design, such as timing and synchronization between parallel tasks.

Now, of course, there's nothing new about parallel processing or the inherent difficulties it presents to programmers.  It's been at the heart of large-scale data warehousing, particularly through the use of MPP (massively parallel processing) systems, for a number of years.  Mike's point, however, was that parallel processing is about to go mainstream.  The technology shift enabling that has been underway for a few years now--the growing availability of multi-core processors and servers since the mid-2000s.  4-core processors are already common on desktop machines, while processors with 32 cores and more are already available for servers.  Multiply that by the number of sockets in a typical server, and you have massive parallelism in a single box--if you can use it.  The problem is that with existing applications designed for serial processing, the only benefit to be gained from such multi-core servers at present is in supporting multiple concurrent users or tasks or in what's known as "embarrassingly parallel" applications where there are no inter-task dependencies.  DataRush's claim to fame is that it moves data-intensive parallel processing from high-end, expensive and complex MPP clusters and specialist programmers to commodity, inexpensive and simple SMP multi-core servers and ordinary developers.

Of course, Pervasive is not alone in trying to tackle the issues involved in software development for parallel-processing environments.  But their approach, coming from the large-scale data integration environment, makes a lot of sense in BI.

However, to see the really significant implications, we need to see this development in the context of other technological advances.  There is the emergence of solid-state disks (SSDs) and the growing sizes and dropping costs of core memory that remove or reduce the traditional disk I/O bottleneck.  The decades-old supremacy of traditional relational databases is being challenged by a variety of different structures, some broadly relational and others distinctly not.  Add to this the explosive growth of data volumes, especially soft or "unstructured" information.  Pervasive, along with other small and medium-sized software vendors, is pushing information processing to an entirely new level.

]]>
http://www.b-eye-network.co.uk/blogs/devlin/archives/2010/08/doing_more_data_with_less_comp.php Data integration Wed, 18 Aug 2010 07:58:33 -0700
Business Integrated Insight - the continuing story "From Business Intelligence to Enterprise IT Architecture" in March and part 5 has just appeared, part 6 is under construction and I can see at least another couple to come.  So, what's it all about?

Business Integrated Insight* (BI2 - BI to the power of 2) is an architectural effort I've committed to undertake in order to extend and update the long-serving data warehouse / business intelligence architecture. Why?  It's very clear to me that the business requirements and expectations for business intelligence have expanded dramatically over the past ten years.  And that expansion has been into areas more usually thought of as operational applications and collaborative systems.  Both of these areas have also encroached into BI.  So, requirement boundaries have blurred significantly, and an architecture that was defined in the 1980s based on the then current definition of decision support could clearly do with a substantial overhaul.

One approach could be to try to address the changed needs from within the boundaries of data warehousing, as Bill Inmon, for example, has done with DW 2.0™.  However, in my opinion, this effort has too narrow a focus.  Today's business users and decision-makers come from an entirely different generation to those that were the target audience of the original data warehouse efforts in the 1980s (Devlin & Murphy, 1988).  Modern business users are tech savvy, internet-aware, social networking animals.  Technology boundaries such as operational / informational / collaborative are simply not their scene.  My belief is that any new architecture for business intelligence must, by definition, extend over the entire IT infrastructure for business.

Is this a tall order?  Well, it's certainly ambitious.  But so was data warehousing way back in the mid-1980s when relational databases were still young.  But, the technology available today has advanced by leaps and bounds since then, especially in the last few years.  Take a look at columnar / parallel databases such as Netezza, Vertica, ParAccel, Aster Data, Infobright and more, not to mention Oracle's Exadata V2.  While the hype is about query speed and large data volumes, the potential is surely for a paradigm shift (there! I've said it) in data storage architectures.  The same drive is evident in the mushrooming interest in unstructured information, which I prefer to call soft information.  I wrote recently about Attivio, who are building a bridge between the worlds of unstructured and structured information.  From beyond the database world SOA-like and Web 2.0 technologies are also fundamentally changing the way the old operational and collaborative environments are structured.

These technologies and tools, and more, will enable the leap to BI2 over the coming few years.  And if you know of particularly innovative solutions that are coming to market, I'd love to hear about them!

*For the record, I coined the term Business Integrated Insight and drew the first architectural diagram in an August 2009 white paper sponsored by Teradata

]]>
http://www.b-eye-network.co.uk/blogs/devlin/archives/2010/08/business_integrated_insight_-.php Business Integrated Insight Thu, 05 Aug 2010 12:16:04 -0700
Step Away from the Data! Hands off that Information! Infobright the other day, and one statement that caught my attention was that they try to go to the actual data as little as possible.  An interesting objective for a product that's positioned as a "high performance database for analytic applications and data marts", don't you think?

It sounds somewhat counter-intuitive until you realize that in a world of exploding data volumes that need to be analyzed, you have only two choices if you want to maintain a reasonable response time for users: (1) throw lots of hardware at the problem--parallel processing, faster storage, and more--or (2) be a lot cleverer in what you access and when.  The first approach is pretty common and, based on recent developments, quite successful.  And as we move into solid-state disks (SSD) and in-memory databases, we'll see even more gains.  But, let's play with the second option a bit.

How can we minimize access (disk I/O) to the actual data?  We can say immediately that the minimum number of times we have to touch the actual data is once!  In the case of a data warehouse or mart, that is when we load it.  In a traditional row-based RDBMS, that's when we build any indexes we need to speed access for particular queries or further processes.  With column-based databases, we often hear that indexes are no longer needed or much reduced--reducing database size, load time and ongoing maintenance costs.  And it's certainly true that columnar databases improve query response time.  And yet, we might ask (and it applies in the case of row-based databases as well) is there anything else we could do on that single and mandatory access to all the data that could help reduce later data access during analysis?

Infobright's solution is the Knowledge Grid, a set of metadata based on Rough Set theory generated at load-time and used to limit the range of actual data a query has to retrieve in order to figure out which values match the query conditions.  Each 64K items block of data (Data Pack) on disk has a set of metadata such as maximum and minimum values, sum, count, etc. for numerical items calculated for it at load-time.  At query run-time, these statistics inform the database engine that some data packs are irrelevant because no item meets the query conditions.  Other data packs contain only data that meets the query conditions, and if the statistics contain the result needed by the query, the data here need not be accessed either.  The remainder of the data packs contain some data that matches the query and will have to be accessed.  Given the right statistics, the amount of disk I/O can be significantly reduced.  Infobright also create metadata for character items at load-time and for joins at query-time.
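The three-way split of data packs described above can be sketched in a few lines of Python. This is only a toy model of the idea, not Infobright's implementation: the `DataPack` class, the `load` and `sum_greater_than` functions, the pack size of four items and the sample values are all assumptions chosen to keep the example readable (the real packs hold 64K items and live on disk).

```python
from dataclasses import dataclass
from typing import List, Tuple

PACK_SIZE = 4  # tiny on purpose; stands in for the 64K-item packs in the text


@dataclass
class DataPack:
    values: List[int]  # stands in for the actual data on disk
    minimum: int       # statistics computed once, at load time
    maximum: int
    total: int
    count: int


def load(values: List[int], pack_size: int = PACK_SIZE) -> List[DataPack]:
    """Split the values into packs and compute per-pack statistics."""
    packs = []
    for start in range(0, len(values), pack_size):
        chunk = values[start:start + pack_size]
        packs.append(DataPack(chunk, min(chunk), max(chunk), sum(chunk), len(chunk)))
    return packs


def sum_greater_than(packs: List[DataPack], threshold: int) -> Tuple[int, int]:
    """Answer SUM(value) WHERE value > threshold.

    Returns the sum and the number of packs whose data had to be read.
    """
    result = 0
    packs_read = 0
    for pack in packs:
        if pack.maximum <= threshold:
            continue              # irrelevant: no item can match, skip the data
        if pack.minimum > threshold:
            result += pack.total  # every item matches: the statistic is the answer
            continue
        packs_read += 1           # mixed pack: only now touch the actual data
        result += sum(v for v in pack.values if v > threshold)
    return result, packs_read


packs = load([1, 2, 3, 4, 10, 11, 12, 13, 3, 9, 4, 20])
print(sum_greater_than(packs, 5))  # prints (75, 1)
```

In this run, three packs are loaded but only one is actually read at query time; the other two are resolved from the load-time statistics alone.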

Generalizing from the above, we can begin to imagine other possibilities.  What if you didn't load the actual data into the database, but just left it where it was and crawled through it to create metadata of a similar nature to allow irrelevant data for a particular query to be eliminated en masse?  Of course, that sounds a bit like the indexing approach used by search engines and extended by Attivio and others to cover relational databases as well.  Of course, the problem with indexes and similar metadata is that they tend to grow in volume also, until they reach a significant percentage of the actual data size; then we're back to square one.

My mathematical skills are far too rusty (if they were ever bright and shiny enough in the first place) to know if Rough Set theory has anything to say about that issue or how it could be applied beyond the way that Infobright have implemented it, but it does seem like an interesting area for exploration as data volumes continue to explode.  Any bright PhDs out there like to give it a try?

]]>
http://www.b-eye-network.co.uk/blogs/devlin/archives/2010/07/step_away_from_the_data_hands.php Database Thu, 29 Jul 2010 14:03:45 -0700
Google buys Metaweb @LaurelEarhart is marketing director for the Smart Content Conference, among other things, including Biz Dev Maven! And there, I read "Perfect storm: #Google acquired #Metaweb" announced on July 16. Having just done a webinar with Attivio yesterday on the topic "Beyond the Data Warehouse: A Unified Information Store for Data and Content" my interest was piqued. Let me tell you why.

I suspect that very few data warehouse vendors or developers have paid much attention to Metaweb or its acquisition. As far as I can tell, it hasn't turned up on the data warehouse or BI analyst blogs either. Perhaps the reason is that Metaweb's business is in providing a semantic data storage infrastructure for the web, and Freebase, an "open, shared database of the world's knowledge". For data warehouse geeks, the former is probably a bit off-message, while the latter may sound like Wikipedia, although the mention of a shared database may raise the interest level slightly.

But, if you're thinking about what lies beyond data warehousing (as I am), and wondering how on earth we're ever going to truly integrate relevant content with the data in our warehouses, what Metaweb and now Google are doing should be of some interest. Here's a quote from Jack Menzel, director of product management at Google on his blog:

"Type [barack obama birthday] in the search box and see the answer right at the top of the page. Or search for [events in San Jose] and see a list of specific events and dates. We can offer this kind of experience because we understand facts about real people and real events out in the world. But what about [colleges on the west coast with tuition under $30,000] or [actors over 40 who have won at least one oscar]? These are hard questions, and we've acquired Metaweb because we believe working together we'll be able to provide better answers."

For me, the interesting point here is the inclusion in the hard questions of conditions that would make sense to even the most inexperienced BI user. Take either of these two hard questions and you can easily imagine the SQL statements required, provided you defined and populated the right columns in your tables. The problem is that you need to have defined the columns and tables in advance of somebody asking the questions.
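As a rough illustration of that point, here is the first hard question expressed as SQL, run through Python's built-in `sqlite3` module. Everything about the schema is an assumption: the `college` table, its `region` and `tuition_usd` columns and the three rows of sample data are invented for this sketch. The query itself is trivial; what it cannot do is answer anything for which no column was defined beforehand.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical, predefined schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE college (name TEXT, region TEXT, tuition_usd INTEGER)"
    )
    # Invented sample rows, purely for illustration.
    conn.executemany(
        "INSERT INTO college VALUES (?, ?, ?)",
        [
            ("Example State", "West Coast", 12000),
            ("Sample Institute", "West Coast", 45000),
            ("Placeholder College", "East Coast", 20000),
        ],
    )
    return conn


def colleges_under(conn: sqlite3.Connection, region: str, max_tuition: int):
    """[colleges on the west coast with tuition under $30,000] as a SQL query."""
    rows = conn.execute(
        "SELECT name FROM college "
        "WHERE region = ? AND tuition_usd < ? ORDER BY name",
        (region, max_tuition),
    )
    return [name for (name,) in rows]


print(colleges_under(build_demo_db(), "West Coast", 30000))  # prints ['Example State']
```

Ask the same database about, say, campus size and there is simply no column to query, which is exactly the gap between predefined models and ad hoc questions.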

What Metaweb on the Internet and Attivio on the intranet (and, of course, other vendors in both areas) are trying to do is to bridge the gap between data and content, so that users can ask mixed search and BI queries based on the implicit understanding that exists in the data/content stores of the semantics of the information. And, perhaps more importantly, to be able to do that in a fully ad hoc manner that doesn't require prior definition of a data model and its instantiation in columns and tables of a relational database. If you want to dig deeper, I invite you to take a look at my recent white paper.

In the meantime, my thanks to @LaurelEarhart and the wonder of synchronicity.]]>
http://www.b-eye-network.co.uk/blogs/devlin/archives/2010/07/google_buys_metaweb.php Data and content Thu, 22 Jul 2010 03:39:16 -0700
EMC acquires Greenplum - another view of what it means Curt Monash or Merv Adrian to explore these aspects of the acquisition.

However, I'd like to take the opportunity to focus our minds once again on a more fundamental question: how is IT going to manage data quality and reliability in a rapidly expanding data environment, both in terms of data volumes and places to store the data?  I'm currently describing a logical enterprise architecture, Business Integrated Insight (BI2), that focuses on this.

So, for me, what the acquisition emphasizes, like that of Sybase by SAP, is that specialized databases, with their sophisticated features and functions, are rapidly entering the mainstream of database usage.  Their ability to handle large data volumes with vast improvements in query performance has become increasingly valuable in a wide range of industries that want to analyze enormous quantities of very detailed data at relatively low cost.  How to do this?  Vendors of these systems typically have a simple answer: copy all the required data into our machine and away you go!

My concern is that IT ends up with yet another copy of the corporate data, and a very large copy at that, that must be kept current in meaning, structure and content on an ongoing basis.  Any slippage in maintaining one or more of these characteristics leads inevitably to data quality problems and eventually to erroneous decisions.  Such issues typically emerge unexpectedly, in time-constrained or high-risk situations and lead to expensive and highly visible firefighting actions by IT.  Unfortunately, such occurrences are common in BI environments, but typically relate to unmanaged spreadsheets or relatively small data marts.  We have just jumped the problem size up by a couple of orders of magnitude.

So, am I suggesting that you shouldn't be using these specialized databases?  Would I recommend that you stand in front of a speeding freight train?  Clearly not!

There are two ways that these problems will be addressed.  One falls upon customer IT departments, while the other comes back to the database industry and the vendors, whether acquiring or acquired.  These paths will need to be followed in parallel.

IT departments need to define and adopt stringent "data copy minimization" policies.  The purist in me would like to say "elimination" rather than "minimization".  However, that's clearly impossible.  Minimization of data copies, in the real world, requires IT to evaluate the risks of yet another copy of data, the possibility of using an existing set of data for the new requirement and, if a new copy of the data is absolutely needed, whether existing analytic solutions could be migrated to this new copy of data and the existing data copies eliminated.

Meanwhile, it is incumbent upon the database industry to take a step back and look at the broader picture of data management needs in the context of emerging technologies and the explosive growth in data volumes.  The basic question that needs to be asked is: how can the enormous power and speed of these emerging technologies be crafted into solutions that equally support divergent data use cases on a single copy of data?  And, if not on a single copy, how can multiple copies of data be managed to complete consistency invisibly within the database technology?

Tough questions, perhaps, but ones that the acquirers in this industry, with their deep pockets, need to invest in.  As the database market re-converges, the vendors that solve this architectural conundrum will become the market leaders in highly consistent, pervasive and minimally duplicated data that enables IT to focus on solving real business needs rather than managing data quality.  Wouldn't that be wonderful? ]]>
http://www.b-eye-network.co.uk/blogs/devlin/archives/2010/07/emc_acquires_greenplum_-_anoth.php Database Wed, 07 Jul 2010 13:18:44 -0700
Combining Data and Content 2) architecture.

Last year, I was intrigued by the term "Unified Information Access" used by Attivio Inc. to describe their product that offers highly integrated access to both hard and soft data.  My impression from the marketing material was that they were describing exactly the sort of approach I envisaged to address the convergence of these two ends of the structuredness spectrum of information.  Deeper discussions with Andrew McKay, SVP at Attivio, confirmed my first impressions, and led to my writing a white paper describing a "Unified Information Store" published this week.

You can read the summary of the paper below, but the bottom line is that by building the right set of metadata (an enhanced inverted index) you can provide integrated, contextually-rich, agile and easy-to-use access to hard and soft information residing in their normal technologies - relational databases and content management systems, respectively.  To my mind, this approach is the only viable way to avoid copying vast amounts of soft information into your warehouse and still reap the benefits of combined data and content.
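For readers unfamiliar with inverted indexes, a much-simplified sketch in Python may help. It is a plain inverted index, not the enhanced one described in the white paper or used by Attivio, and all names and sample data (`build_index`, `search`, the customer row, the email text) are assumptions for illustration. It shows how one piece of metadata can point into both a relational row and a document while the information itself stays where it is.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case a value and split it into simple alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", str(text).lower())


def build_index(rows, documents):
    """Map each token to the places it occurs, in hard and soft sources alike.

    rows:      {row_id: {column_name: value}}   (stands in for a database table)
    documents: {doc_id: text}                   (stands in for a content store)
    """
    index = defaultdict(set)
    for row_id, row in rows.items():
        for column, value in row.items():
            for token in tokenize(value):
                index[token].add(("row", row_id, column))
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(("doc", doc_id))
    return index


def search(index, term):
    """Return every location, in either kind of source, that mentions the term."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical sample data.
rows = {1: {"customer": "Acme Ltd", "status": "overdue"}}
documents = {"email-7": "Acme called about the invoice"}
index = build_index(rows, documents)
print(search(index, "acme"))  # prints [('doc', 'email-7'), ('row', 1, 'customer')]
```

A single lookup returns pointers into both worlds; only the index was built, and nothing was copied into a warehouse.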
]]>
http://www.b-eye-network.co.uk/blogs/devlin/archives/2010/05/combining_data_and_content.php Business Integrated Insight Wed, 26 May 2010 11:22:53 -0700
Elegant Architectures synopsis of a recent white paper of mine in their most recent issue.  Here's the first couple of sentences:

"The data warehouse architecture established in the 1980s has stood us in good stead, but the time has come to look at decision making--as it spans the business process spectrum--in a new light. Business today is so interlinked, its information needs so interdependent, its processes so inter­twined, and its reaction times so pressured that only a new architected approach can support it.

This new architecture must cover the entire IT support for the business. It must be fully integrated. And it must stretch beyond "simple" intelligence."

I recommend the article as very well worth a read (well, I would, wouldn't I?), but particularly stunning is the photo of the Queen Sofia Palace of the Arts in Valencia, Spain that the editor chose to illustrate the text.  Take a look!

And regarding the content, stand by for a deeper look at the Business Information Resource, due to appear in the Business Intelligence Journal, 2nd Quarter 2010 issue...



]]>
http://www.b-eye-network.co.uk/blogs/devlin/archives/2010/05/elegant_architectures.php Business Integrated Insight Mon, 24 May 2010 09:52:33 -0700
A volcanic ash cloud over the future of BI
In the coverage of the unfolding chaos, the word that seemed to spring most frequently to the mouths of people responsible for managing any aspect of the situation was "unprecedented". A great word if you want to suggest that you shouldn't be blamed in any way for anything that ensued. After all, if it's unprecedented, you have no basis of information from the past to make decisions about what to do now. Or do you?

The truth of the matter is that there were probably enough precedents of most separate aspects of the event to allow reasonable judgments to be made. The problem was that no-one was able to consolidate enough of the disparate information to really make a difference. Focusing just on the issue of getting hordes of stranded passengers across Europe to every point of the compass: Which trains go where? How do they connect? How to connect from a train to a ferry, or a bus to a train? Minimize travel time or cost? Not to mention hotel rooms?

Could the airlines have minimized their regulatory compensation costs if they could work this out? For sure. Could surface travel companies maximize profits by fully utilizing spare capacity (as opposed to raising prices to exorbitant levels!)? Absolutely. Could groups of enterprising travelers get together to make the best plan to get home? Probably. So, there's lots of incentive to make it work. But none of this happened.

Don't get me wrong. I'm not complaining. I'm just pointing out that much of the underlying information to answer the above questions not only exists, but is often accessible on the internet. Every stranded traveler with web access spent hours checking options, trying to make online bookings (usually at severely overloaded sites) and then starting all over again as one link in the chain broke. Some succeeded, while others went and queued for hours at ticket offices.

Operational BI was probably used by some of the more advanced travel companies to track what was going on. Some even managed to schedule additional services to carry extra passengers. Others, such as the Calais-Dover ferries, just stopped taking bookings and went back to the "just turn up at the pier and we'll try to get you onboard as soon as possible" model.

But the really interesting question is this: given that all that information was out there on the web in all its various forms and gory details, how would one go about integrating it in a way that allowed it to be used in an end-to-end travel discovery and booking process?

I'm not expecting the IT industry to have a complete solution any time soon, for a wide variety of political and financial reasons. But a little thought, and none of it very new, suggests we'd need: a common model spanning the information of multiple companies, the ability to link hard and soft information together in a meaningful way, services that act in a fully plug-and-play manner with well-defined interfaces and the ability to mashup a dashboard joining the different steps of the journey together. What I really needed was Business Integrated Insight to get me home!   ]]>
http://www.b-eye-network.co.uk/blogs/devlin/archives/2010/04/a_volcanic_ash_cloud_over_the.php Business Integrated Insight Sun, 25 Apr 2010 09:11:47 -0700
The recent interesting reinvention of databases two-day class in Rome next week, and after I got over my need for a strong drink (a celebration, of course), I got to reflect on some of what I had discovered.

Perhaps the most interesting were the amazing changes in the database area that have been happening over the past couple of years.  A combination of hardware advances and software innovations has come together with a recognition that data is no longer what it once was to pose some fundamental questions about how databases should be constructed.

Let's start on the business side - always a good place to start.  Users now think that their internal IT systems should behave like a combination of Google, Facebook and Twitter.  Want an answer to the CEO's question on plummeting sales?  Just do a "search", maybe "call a friend", join it all together and voila!  We have the answer. 

From an information viewpoint, this brings up some very challenging questions about the intersection of soft (aka unstructured) information and hard (structured) data and how one ensures consistency and quality in that set.  IT's problem is no longer just combining hard data from different sources; it's about parsing and qualifying soft information as well.  This is not a truly new problem.  Data modelers have struggled with it for years.  It's the speed with which it needs to be done that causes the problem.

So, what has this got to do with new software and hardware for databases?  Well, the key point is that database thinking has suddenly moved on from strict adherence to the relational paradigm.  The relational model is an extraordinarily structured view of data.  Relational algebra is a very precise tool for querying data.  You need to have a strong understanding of both to make valid queries, but do you really want your users to think that way?  Should you necessarily store the information physically in that model?  When you free yourself of these assumptions, you can begin to think in new ways.  Store the data in columns instead of rows?  Perfect!  A mix of row- and column-oriented data, and maybe some in memory only?  Yes, can do!  And then there's mixing searching (a soft information concept) with querying (a hard data thought) to create a hybrid result.  That's easy too!

And on the edges of the field, there are even more fundamental questions being asked.  Do we always need consistency in our databases?  Can we do databases without going to disk for the data?  Could we do away with physically modeling the data and just let the computer look after it?  The answers to these questions and more like them are not what you might expect if you've been around the database world for 20 years.  And with those different answers, the overall architecture of your IT systems is suddenly open to dramatic change.

Believe me, the first businesses to adopt some of these approaches are going to gain some extraordinary competitive advantages.  Watch this space!
]]>
http://www.b-eye-network.co.uk/blogs/devlin/archives/2010/04/the_recent_interesting_reinven.php http://www.b-eye-network.co.uk/blogs/devlin/archives/2010/04/the_recent_interesting_reinven.php Database Thu, 08 Apr 2010 09:58:25 -0700
From BI to Enterprise IT Architecture
And yet, the question of how to create such a consistent, integrated information resource almost begs a simplistic answer.  If that's what you want, you must stop creating duplicates of existing information that have to be managed to consistency in ever shorter time windows, and you must eliminate--or, at the very least, substantially reduce--existing data duplication.

The original data warehouse architecture from 1988 showed the way. It proposed a logically single data store--the Business Data Warehouse--modeled at the enterprise level as the consistent and integrated source of all information for decision making. This simplicity was ultimately lost with the emergence of the layered architecture (with multiple data marts fed from an enterprise data warehouse), due to a combination of database performance and enterprise modeling issues.

Nonetheless, the approach remains valid for the current much-expanded needs for integration. First, model all the information according to an enterprise-level model and then implement as far as possible in alignment with that model. This is the approach proposed in a new architecture, Business Integrated Insight (BI2), which for the first time gathers all the information of the enterprise--hard and soft; operational, informational and collaborative--into a single component called the Business Information Resource...

Read more in my article just published and if you're in the vicinity, come to my two-day seminar in Rome on 15-16 April :-)
]]>
http://www.b-eye-network.co.uk/blogs/devlin/archives/2010/04/from_bi_to_enterprise_it_archi.php http://www.b-eye-network.co.uk/blogs/devlin/archives/2010/04/from_bi_to_enterprise_it_archi.php EDW Thu, 01 Apr 2010 11:16:40 -0700
What's beyond Business Intelligence? "BI2--From Business Intelligence to Enterprise IT Integration" and am currently researching and preparing the material.  And the more I research, the more excited I get about the prospects for the next wave of development from BI to... what?  Well, that's the real question for me!

It's my belief, and I've been writing and speaking about this for quite a while now, that the way we do BI today has reached its limits.  Business today demands ever closer to real-time information that must be consistent and meaningfully integrated across ever wider scopes.  These demands simply cannot be satisfied by our current concept of a layered, triplicated (and more) data warehouse of hard information--largely numerical data arranged in neat tables--along with some soft information thrown in as an afterthought.  The only way forward that I can see is to begin to treat all business information as a conceptually single, integrated, modelled resource with minimal duplication of data.  I've described this business information resource (BIR) to a first approximation elsewhere and my seminar will, among other things, dig deeper into the structure of the BIR and the technology needed to create and maintain it.

My current excitement stems from the growing reality of "hybrid" databases--combining the features and strengths of row-oriented and columnar relational databases.  Now, I know that academia proposed approaches to this as long as 8 years ago, but it's only in the last year that commercial databases have begun to introduce it.  I wrote about Vertica's FlexStore feature, introduced in 2009, in my last post.  The latest announcement I found is of a technology preview program for Ingres VectorWise, the newest entrant in the hybrid database arena.  Add Oracle's Exadata V2, announced last year with typical modesty by Larry Ellison as the "fastest machine in the world for data warehousing, but now by far the fastest machine in the world for online transaction processing", and we can see that the approach is finally gaining market traction.

Why is this important?  Well, despite the hype, Larry hit the nail on the head.  If we finally have databases that can handle both operational and informational workloads equally well, we can begin to define an architecture that doesn't insist on copying vast quantities of data from one database to another.  That doesn't mean the death of the data warehouse any time soon, but it does mean that a much more integrated IT environment is coming your way. ]]>
http://www.b-eye-network.co.uk/blogs/devlin/archives/2010/03/whats_beyond_business_intellig.php http://www.b-eye-network.co.uk/blogs/devlin/archives/2010/03/whats_beyond_business_intellig.php Hybrid Database Thu, 18 Mar 2010 10:12:57 -0700
EDW and Columnar Databases
But I have been convinced for some time now of the much greater potential such performance unleashes in the broader and more complex EDW environment.  And the vendors have been fairly quiet about this part of the market so far, maybe preferring to leave such technically and politically more complex projects to the big guys.  So, it was good to see Vertica's 4.0 announcement last week beginning to address the EDW market with its emphasis on "enterprise ready" and a number of interesting new features and expansions of old functions.

Robust workload and resource management for mixed workloads is a prerequisite for an EDW.  Vertica's introduction of administrator-defined resource pools with memory-usage, priority and concurrency settings and the assignment of users to these pools is a big step in this direction.  A rework of the optimizer in support of this and other features suggests that Vertica are serious about this support.

Also introduced in V4.0 is a newly optimized single record lookup on primary keys.  While aimed at a particular financial analysis use case, this function shows that the database can do more than just crunch columns.  Added to the FlexStore feature introduced in V3.5, where newly loaded data is kept in row format in memory for some period of time, I believe we're seeing the database's growing ability to handle the sort of record-level processing often needed in EDWs.  The new time-series support in V4.0 also plays directly to EDW needs.

Time and customer experience will, of course, prove whether I'm correct, but it seems to me that Vertica is beginning to test my assertion that columnar, MPP databases can be applied to EDWs and, further, that their performance characteristics offer the possibility of re-architecting the EDW / data mart divide. ]]>
http://www.b-eye-network.co.uk/blogs/devlin/archives/2010/03/edw_and_columnar_databases.php http://www.b-eye-network.co.uk/blogs/devlin/archives/2010/03/edw_and_columnar_databases.php Database Wed, 03 Mar 2010 11:17:21 -0700
Lyza Commons: how to integrate BI and Enterprise 2.0 white paper on Collaborative Analytics in the first half of 2009, it came as no surprise to me that version 2.0 of Lyza had a major emphasis in the same area.  What did surprise me, however, was how far they have advanced the concepts and implementation in such a short timeframe!

Successful collaboration between decision makers requires an environment that facilitates a free-flowing but well-managed conversation about ongoing analyses as they evolve from initial ideas to full-fledged solutions to business problems.  Consider a common scenario.  The first analyst gathers data she considers relevant and creates an initial set of assumptions, data manipulations and results.  She shares this via e-mail with her peers for confirmation, and she receives suggestions for improvement, some of which she incorporates in a new version.  Her manager reviews the work personally and makes further suggestions; a new version emerges.  She also shared the intermediate solution with a second department, and the analyst there created another solution based on the original.  Meanwhile, the first analyst finds an error in her logic buried deep in cell Sheet3!AB102...

We all know the problems with multiple unmanaged copies, rework, silently propagated errors and so on in the usual spreadsheet- and e-mail-based business analysis environment.  Lyza and Lyza Commons together address these issues by creating a comprehensive tracking and auditing mechanism for every step of an analysis and providing an integrated environment for sharing and discussing work among collaborators.  Integral metadata links all copies derived from an initial analysis.  Twitter-like conversations (called Blurbs) about an analysis are linked to the referenced object, creating a comprehensive context for the conversation and the underlying analysis.  The folks at Lyzasoft have also come up with a security concept for sharing analyses, which they call Mesh Trust, that should make sense in most enterprise collaboration environments.

My bottom line?  Lyza and Lyza Commons 2.0 provide a seamless blending of analytic function, managed and controlled access to information resources and enterprise-adapted social networking around analytic results and their provenance.  This is precisely the type of function needed by businesses who want to regain control of spreadmarts that have run amok.  This is the right conceptual foundation for real, meaningful business insight and innovation going forward. ]]>
http://www.b-eye-network.co.uk/blogs/devlin/archives/2010/02/lyza_commons_how_to_integrate.php http://www.b-eye-network.co.uk/blogs/devlin/archives/2010/02/lyza_commons_how_to_integrate.php Social Networking Thu, 25 Feb 2010 14:58:11 -0700
Data Warehouses and Solid State Disks (SSD)
Over the past couple of years, we've seen dramatic improvements in database performance due to hardware and software advances such as in-memory databases, columnar storage, massively parallel processing, compression, and so on as described in my white paper from April 2009.  SSD, in one sense, is just another piece of accelerating technology.  However, add it to the existing list, and you begin to see the possibility of revisiting old assumptions about what is possible within a single database.  Here are a few ideas to play with:

  • Do you still need that Data Mart?  With so much faster performance, maybe the queries you now run in the Mart could run directly on the EDW.  Reducing data duplication has enormous benefits: partly in storage volumes, but principally in reducing maintenance of ETL to the Marts.
  • Where to do operational BI?  It was once considered necessary to install a separate ODS to support closer to real-time access to consolidated atomic data.  But with such a fast database, couldn't you just trickle feed the data and do it all in the Warehouse itself?  One less copy of data and one less set of ETL can't be all bad!
  • ETL or ELT?  Extract, transform and load has been the traditional way of loading a Warehouse, with a special engine to do the transform step.  Well, with a faster and more powerful database engine, you have the option to try extract, load and transform and let the Warehouse database do the transform work.
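
The ELT option in the last idea above can be sketched in a few lines of Python. This is a toy under stated assumptions, not a recipe for any real Warehouse: SQLite (via Python's standard `sqlite3` module) stands in for the Warehouse database, and the table names `staging_sales` and `sales_by_region`, the function `run_elt` and the sample rows are all hypothetical.

```python
import sqlite3


def run_elt(raw_rows):
    """Load raw rows untouched, then transform with SQL inside the database."""
    conn = sqlite3.connect(":memory:")  # stand-in for the Warehouse database
    conn.execute(
        "CREATE TABLE staging_sales (sale_date TEXT, region TEXT, amount TEXT)"
    )
    # Extract + Load: raw values go in as-is, with no separate transform engine.
    conn.executemany("INSERT INTO staging_sales VALUES (?, ?, ?)", raw_rows)
    # Transform: the database itself cleans and aggregates the staged data.
    conn.execute(
        """
        CREATE TABLE sales_by_region AS
        SELECT UPPER(TRIM(region)) AS region,
               SUM(CAST(amount AS REAL)) AS total_amount
        FROM staging_sales
        GROUP BY UPPER(TRIM(region))
        """
    )
    result = conn.execute(
        "SELECT region, total_amount FROM sales_by_region ORDER BY region"
    ).fetchall()
    conn.close()
    return result


# Hypothetical raw extract: untidy region codes, amounts still held as text.
raw = [
    ("2010-02-01", " emea", "100.5"),
    ("2010-02-01", "APAC ", "40"),
    ("2010-02-02", "EMEA", "19.5"),
]
summary = run_elt(raw)
```

In a traditional ETL flow, the trimming, upper-casing and type conversion would happen in a separate engine before the load. Here the same work is expressed as SQL and left to the database, which is only attractive if the database has performance to spare.
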
Although ParAccel, like all the smaller vendors, is focusing more on selling to the "bigger, faster, more complex analytics applications" market at present, I'm pretty sure that the work it is doing under the covers on query optimization, workload management, and loading and updating features will pave the way for a sea change in how we do data warehousing in the next few years.

]]>
http://www.b-eye-network.co.uk/blogs/devlin/archives/2010/02/data_warehouses_and_solid_stat.php http://www.b-eye-network.co.uk/blogs/devlin/archives/2010/02/data_warehouses_and_solid_stat.php Wed, 17 Feb 2010 14:34:45 -0700