Structuring the Unstructured by Richard Brayshaw - BeyeNETWORK UK


Structuring the Unstructured

Originally published 4 April 2007

Insurance, like banking, is a vast financial marketplace that, in no small part, underpins the economy of the world. Another thing banking and insurance have in common is that both were born of the strife and uncertainty of living and trading on Planet Earth. Whereas banking was founded because thieves and bandits were particularly keen on taking your money from you (and not necessarily distributing it to the poor), insurance came about to mitigate the risk of transporting precious marine cargoes through treacherous seas in wooden sailing ships. The insurance business model proved to be applicable to many other perils both on land and at sea, where accidents certainly do happen – often with alarming frequency. Fast forward a few hundred years and the list of perils one can insure against becomes vast, and pretty scary, too.

Reading through the contents of the Accident Types dimension in the data warehouse is a task for lovers of Stephen King novels only – radiation illness, brain injury, crushing, fire/explosion and my particular favourite, the chilling “inadequate instructions”. A quick dissection of the data using the trusty ProClarity OLAP tool determines that these particular perils are, fortunately, few and far between. The top “events” remain those old favourites – fire and flood. Curiously, “Unknown” comes into the running at number 5.
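That quick dissection amounts to ranking accident types by frequency. A minimal sketch of the same idea in Python – every accident type and figure below is invented for illustration, not drawn from the warehouse:

```python
from collections import Counter

# Hypothetical sample of claim records, one accident type per claim.
# The categories echo the article's examples; the counts are made up.
claims = [
    "Fire/Explosion", "Flood", "Fire/Explosion", "Flood", "Flood",
    "Crushing", "Brain Injury", "Unknown", "Fire/Explosion", "Flood",
    "Radiation Illness", "Unknown", "Inadequate Instructions",
]

# Tally and rank the perils, most frequent first
counts = Counter(claims)
top_events = counts.most_common(5)

for accident_type, n in top_events:
    print(f"{accident_type}: {n}")
```

An OLAP tool like ProClarity does the equivalent aggregation against the cube, of course; the point of the sketch is only that the "top events" view is a simple group-and-count over the Accident Types dimension.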

So insurance brings structure and process to an unstructured world. Risk is shared across the marketplace, putting firms on a more even footing with each other and building confidence. The need to properly structure the outside world is a theme that carries right into the heart of the data warehouse.

The data warehouse has now established itself firmly as the appropriate solution to the business intelligence requirement. The best practices for creating, maintaining and growing the data warehouse (and, in so doing, providing the necessary structure) have matured nicely, having been battle-tested on many occasions in many different industries and on many different hardware and software platforms. Kimball and Inmon methodologies are now de rigueur. A well-designed data warehouse lends itself to pretty much any day-to-day analytical request and is readily extensible to answer more esoteric requirements. A raft of bolt-on technologies is available for dashboarding, balanced scorecards, data mining, slicing and dicing and several other analytical techniques. It can be fashioned to extract data from most electronic data storage systems in the enterprise, and the methodologies for combining and storing that data in a cohesive single database are now well established. In short, if we want good, flexible business intelligence (BI), we need a data warehouse.

Ten years ago only the larger enterprises could afford a data warehouse. Since then, however, storage costs have fallen even faster than electronic data within the enterprise has grown. Hardware costs have plummeted. We no longer need monumental mainframes nestling in expensive data centres, staffed round-the-clock. Also, the extract, transform and load process no longer has to be laboriously hand-crafted from COBOL, and the database management system is available on the CDs given free with computer magazines! An exaggeration maybe, but not too far off the mark. Whatever your opinions on Microsoft, their SQL Server 2005 offerings are cheap, easy to use and provide an increasingly seamless integration with the enterprise desktop. With the addition of a few client-side components, Excel with its pivot table capability can become a pretty handy OLAP tool. Excel 2007 even more so.

It is more important and of greater utility to have integration at the delivery-end of the warehouse – where users are untrained in complex IT techniques but are comfortable with their office toolset – than it is to continue the architecture of the legacy system into the data warehouse arena and deal with the users as an afterthought. Foisting an unfamiliar and complex toolset on a user base of underwriters and brokers simply will not work – not in the insurance arena at any rate. However, leveraging the existing office toolset will prove popular with everyone, not least with those whose pens write cheques for IT projects. Training costs are minimised and take-up rates increased.

Oracle purists can – and certainly do – balk at the relative simplicity of the Microsoft database offering, but you can satisfy 95% of your BI needs right out of the box (with a little T-SQL coding for good measure).

Which brings us, finally, to structuring the unstructured and to that “missing” 5% of functionality. The enterprise desktop is not just the destination of business intelligence – it is frequently the source of the underlying data. Spreadsheets and CSV files are a rich source of plunder but are, by their very nature, subject to constant change. This makes them a poor bedfellow for the scheduled batch process. The regular ETL will fail badly if an extra column has suddenly appeared or the spreadsheet has been moved to a different folder. Currently, the Integration Services functionality in SQL Server 2005 is not clever enough to deal with these eventualities, but there are a number of third-party products which claim to fill this space.
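Those two failure modes – a layout that has drifted and a file that has wandered off – can at least be caught before the batch job runs and half-loads the warehouse. A minimal pre-flight check, sketched in Python; the expected column names and the file layout here are hypothetical, invented purely for illustration:

```python
import csv
from pathlib import Path

# Hypothetical expected layout for a broker-supplied CSV export
EXPECTED_COLUMNS = ["policy_ref", "inception_date", "premium_gbp", "accident_type"]

def validate_csv(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file looks safe to load."""
    problems = []
    source = Path(path)
    if not source.exists():
        # The spreadsheet has been moved to a different folder
        return [f"file not found: {path}"]
    with source.open(newline="") as f:
        header = next(csv.reader(f), None)
    if header is None:
        return ["file is empty"]
    # An extra column has suddenly appeared
    extra = [c for c in header if c not in EXPECTED_COLUMNS]
    # Or an expected column has vanished or been renamed
    missing = [c for c in EXPECTED_COLUMNS if c not in header]
    if extra:
        problems.append(f"unexpected columns: {extra}")
    if missing:
        problems.append(f"missing columns: {missing}")
    return problems
```

Run as the first step of the scheduled job, a check like this lets the ETL skip the file and alert its owner rather than fail badly mid-load – which is, in miniature, the gap the third-party products claim to fill.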

The most interesting of these is a product called DataDefractor by Interactive Edge of New York. In the same way that ProClarity filled the OLAP-toolset gap in the SQL Server 2005 suite, and, in so doing, was promptly acquired by Microsoft, DataDefractor purports to provide the required extension into the unstructured data space. As we speak, Microsoft technicians are no doubt poring over the tool and drafting memos to the marketing department…

My development team is currently evaluating this product. I’ll post the results in a future article. Watch this space.

There are, of course, many other problems associated with taking on spreadsheet and “desktop-sourced” data besides its lack of structure. Security and reliability are big issues. The data from a legacy system has been rigorously tested, signed-off and surrounded by layers of access security. A spreadsheet on a server must be likewise secured if we are to entrust its contents to our data warehouse and make business decisions based upon it. The business must be persuaded to tighten its procedures if it wishes to derive BI benefit from such data.


  • Richard Brayshaw
    Richard is an analyst and team leader on a business intelligence project at a leading insurance company, operating in the Lloyd’s Insurance Market in the City of London. He has almost twenty years of experience in the IT financial sector and has worked for a series of blue-chip financial organisations both in the City of London and in West Yorkshire. Richard may be contacted at richard.brayshaw@virgin.net.
