Blog: Rick Barton

Rick Barton

Hello and welcome to my blog. I am delighted to blog for the BeyeNETWORK, and I'm really looking forward to sharing some of my thoughts with you. My main focus will be data integration, although I don't plan to restrict myself just to this topic. Data integration is a very exciting space at the moment, and there is a lot to blog about. But there is also a big, wide IT and business world out there that I'd like to delve into when it seems relevant. I hope this blog will stimulate, occasionally educate and, who knows, possibly even entertain. In turn, I wish to expand my own knowledge and hope to achieve this through feedback from you, the community, so if you can spare the time, please do get involved. Rick

About the author

Rick is the director (and founder) of edexe. He has more than 20 years of IT delivery experience, ranging from complex technical development through to strategic data integration (DI) consultancy for FTSE 100 companies. With more than 10 years of DI experience spanning hands-on development, architecture, consulting and everything in between, Rick's particular passion is how to successfully adopt and leverage DI tools within an enterprise setting. Rick can be contacted directly at rick.barton@edexe.com.

July 2009 Archives

One of my favourite quotations is from Abraham Maslow, who said "It is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail". Ever since I first heard someone use this quotation it has resonated with me, especially as I have seen this behaviour first hand on many occasions and in many different problem contexts. For this blog, though, I'm going to concentrate on one specific example that can help to improve DI delivery.

DI tools are great. They are faster and easier to use than many other technologies, hence their success in the marketplace. But let's not forget what they are designed for. The clue is in the name "Data Integration", where "data" is the key word. They are great at dealing with data. But the DI tool is just one tool in the kit bag. Another, often overlooked, tool is the operating system that the DI tool resides on.

Operating systems are very good at handling files, directory structures and hardware. It is their raison d'être, and it is therefore no surprise that O/S vendors have developed their own tools to handle these resources effectively. Some of these may not be as pretty as the DI tools, but used properly they can be more effective than anything else to hand when dealing with files and file names.

So, for example, let's take a directory containing many files, all of which have been landed overnight with the wrong date (e.g. infile001_20090701, infile002_20090701 etc.). The batch process has failed to update the target system and the pressure is on from the business to get the data, so here is my solution (written in Korn shell):

for i in `ls infile*_20090701`; do j=`echo $i | cut -d"_" -f1`_20090702; mv $i $j; done

One line of shell script!  It took me about 15 minutes to write this.  (Please note that I'm a little rusty now, but any one of Emunio's employees could do this in much less time.) 
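
For what it's worth, the same rename can also be done with the shell's own globbing and parameter expansion, avoiding the need to parse the output of ls. A quick sketch, assuming ksh or bash and the same file naming as above:

# Rename the files using globbing and suffix substitution only
for i in infile*_20090701; do
    mv "$i" "${i%_20090701}_20090702"
done

Either way, the point stands: for the O/S toolkit this is a few moments' thought and a handful of keystrokes.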

In my experience this simple task would take orders of magnitude longer using a DI tool, because this small problem is not the nail to the DI tool's hammer.

My point here is that every DI developer has an additional tool in their kit bag. Those technicians who take time out to understand their O/S and to practise with the supplied tools will be much more effective than those who don't.

This also applies at the organisational level. DI managers should encourage a culture of learning and development so that practitioners have the ability to determine when, and when not, to use their tool of choice. This should also be reflected in sensible standards for DI tool use: standards that recognise the power of the O/S-supplied tools, encourage their use and, in a small way, increase the productivity of each member of the DI team.


Posted July 31, 2009 4:15 PM
Permalink | No Comments |

Following on from my last blog about what's next for DI, I have been wondering what effect current technologies could have on the data model in the near future.

There are a few technology areas in particular that I see eroding the need for a data model, namely:

Massively Parallel Processing (MPP) databases: these databases have redefined the speed at which queries can be executed, and the net effect of this change is that the database is more "forgiving" of less than perfectly modelled data.

Data virtualisation: virtualisation enables queries that access many input data sources without the need for the data to be instantiated, therefore removing the need for a formal model. Instead, tailored views are created for the particular end-user requirement. In addition, there is a class of tools that, while not quite virtualisation tools, do enable rapid access to flat file data via SQL. These are more specialised, but worthy of note nonetheless.

Dynamic warehousing: these products store the data independently of the reporting model, such that changes to the model do not require changes to the underlying tables. A good example of this is Kalido's Dynamic Information Warehouse (DIW) technology. In addition, Kalido also drives the product via a semantic business model rather than a traditional data model.

Profiling: many data profilers can infer relationships between fields in the same or even different files, thereby enabling keys to be identified across the data. Composite Software has an interesting product (Discovery) that not only enables the keys to be identified but also allows those relationships to be "fixed" so that they can then be used within queries.

One thought I have had, and one which is very interesting, is a match-up between profiling and MPP databases: an intuitive warehouse, if you will. In this approach a new data source is simply added to the database as a new table. This table is then profiled against the existing tables and the key relationships are discovered. Once the relationships are confirmed, the table is usable for querying. The MPP database is needed because this "unstructured model" would be less efficient than a traditional "designed" model, so the MPP engine's horsepower would be required to provide result sets in a timely fashion.
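
To make the profiling half of this idea a little more concrete, below is a rough sketch in shell of the kind of check such an engine would run for each candidate column pair: compare the distinct values in a column of the newly landed table with a column already in the warehouse, and flag the pair as a candidate key relationship if the overlap is high. The file names, column positions and the 80% threshold are my own illustrative assumptions rather than any particular product's behaviour.

#!/bin/ksh
# Sketch only: does a column in the new file look like a key into an
# existing table? Compare distinct values and report the overlap.

NEW_FILE=new_source.csv         # newly landed table, comma delimited
EXISTING_FILE=customer_dim.csv  # extract of an existing table
NEW_COL=1                       # candidate key column in the new file
EXISTING_COL=1                  # candidate key column in the existing table

cut -d"," -f$NEW_COL "$NEW_FILE" | sort -u > new_keys.tmp
cut -d"," -f$EXISTING_COL "$EXISTING_FILE" | sort -u > existing_keys.tmp

TOTAL=$(wc -l < new_keys.tmp)
MATCHED=$(comm -12 new_keys.tmp existing_keys.tmp | wc -l)

# Treat an overlap of 80% or more as a candidate relationship worth confirming
awk -v m="$MATCHED" -v t="$TOTAL" 'BEGIN {
    pct = (t > 0) ? 100 * m / t : 0
    verdict = (pct >= 80) ? "candidate key relationship" : "no obvious relationship"
    printf "%d of %d distinct values matched (%.1f%%): %s\n", m, t, pct, verdict
}'

rm -f new_keys.tmp existing_keys.tmp

A real product would of course run this across every column pair and persist the confirmed relationships, but the underlying principle is no more exotic than that.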

So as you can see, current technologies are allowing a relaxation of the rules around data modelling and are challenging the accepted warehouse solution stack by enabling rapid query design without the use of a model.

I don't think the data model is going anywhere just yet, but I do think its place at the very heart of the warehouse will be challenged in the coming years.


Posted July 27, 2009 8:22 PM
Permalink | No Comments |

I've been giving some thought to the next generation of DI tools and the features that I believe would set them apart from the first generation tools.

On a technical level the tools will need to add further levels of abstraction from the executable code. The first-gen tools reduced the complexity of hand-written code by using graphical boxes and lines. This removed the need for developers to code their own "structures", allowing more time for design and compressing the time and number of people required to create code. The next-gen tools must achieve a similar step change. In my view this will mean delivering tools to the business that are simple to use and that deliver executable code directly from the business rules with minimal technical intervention. One cannot expect every scenario to be catered for; however, if the majority of transformations can be instantiated as executable code by the business analysts, then this will be a significant step forward for DI.
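
As a toy illustration of what "executable directly from the business rules" could mean at its very simplest, consider rules held as plain text and applied by a generic engine rather than by hand-cut transformation code. This is a sketch of the concept only; the rule format, file names and the awk "engine" are assumptions of mine, not a description of how any current or future DI product works.

#!/bin/ksh
# rules.txt holds one business rule per line, e.g.
#   2 UPPER        -> force column 2 to upper case
#   4 DEFAULT=GB   -> default column 4 to GB when it is empty

awk -F"," -v OFS="," '
    NR == FNR {                    # first file: load the rules
        split($0, r, " ")
        rule[r[1]] = r[2]
        next
    }
    {                              # second file: apply the rules to each record
        for (col in rule) {
            if (rule[col] == "UPPER") {
                $col = toupper($col)
            } else if (rule[col] ~ /^DEFAULT=/ && $col == "") {
                split(rule[col], d, "=")
                $col = d[2]
            }
        }
        print
    }
' rules.txt input.csv

Adding or amending a rule then becomes an edit to rules.txt by the analyst, with no developer involvement at all.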

The tools must also provide test functionality to support the business users creating rules. Testing the rules at inception is vastly more efficient and cost-effective than testing at later stages. If pre-defined data quality rules are also linked in at this stage, then the business user can confirm that the rules meet agreed quality definitions prior to any code being generated.

Speaking of quality, functions such as data validation, data profiling, data quality checking and data cleansing should become an integrated part of day-to-day execution. Data sources can then be quality checked and profiled in flight. Additional configuration would determine whether processes should fail when quality thresholds are exceeded. It is well documented that quality improvements deliver value in terms of cost and reduced errors. Taking a "day by day" approach means that quality improvement need not be considered a massive undertaking.
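
As a small example of what an in-flight quality gate might look like at the batch level, here is a sketch that checks a mandatory field and fails the step when a configured threshold is breached; the file layout, the column number and the 5% threshold are illustrative assumptions on my part.

#!/bin/ksh
# Sketch of an in-flight quality gate: reject the feed if too many records
# are missing a mandatory field (column 3 is assumed mandatory here).

FEED=$1          # comma-delimited input feed
THRESHOLD=5      # maximum percentage of failing records tolerated

awk -F"," -v max="$THRESHOLD" '
    { total++; if ($3 == "") bad++ }
    END {
        pct = (total > 0) ? 100 * bad / total : 0
        printf "%d of %d records missing the mandatory field (%.1f%%)\n", bad, total, pct
        if (pct > max) exit 1            # non-zero exit signals the breach
    }
' "$FEED"

The calling batch process can then decide, feed by feed, whether a non-zero exit should fail the run or simply raise a warning, which is exactly the sort of "additional configuration" I have in mind.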

Also coupled to business usability is the concept of "macro-components". By this I mean pre-built "wire frames" that enable simple configuration, and therefore automation, of common tasks such as creating dimensions, working with headers and trailers, and validating dated input files. Driving processes through metadata is becoming more and more common, and DI tools should make this as easy as possible to achieve.
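
To show the sort of thing I mean by a metadata-driven macro-component, here is a sketch of a dated input file check where the list of expected feeds lives in a small control file rather than in code; the control file name and format are my own assumptions.

#!/bin/ksh
# Sketch of a metadata-driven check for dated input files. The expected
# feeds are listed in feeds.ctl, one prefix per line (e.g. infile001), so
# adding a new feed is a one-line metadata change rather than new code.

RUNDATE=$(date +%Y%m%d)
STATUS=0

while read FEED; do
    FILE="${FEED}_${RUNDATE}"
    if [[ -s "$FILE" ]]; then
        echo "OK      $FILE ($(wc -l < "$FILE") records)"
    else
        echo "MISSING $FILE"
        STATUS=1                         # flag the batch as incomplete
    fi
done < feeds.ctl

exit $STATUS

Creating dimensions or handling headers and trailers could be parameterised in much the same way.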

Next on the list is that DI tools will need to comply with industry standards for metadata, not just in terms of data formats but also in terms of business and transformation rules. As data legislation increases, so must the auditability of data and metadata to ensure compliance needs are met. As more organisations seek to govern and manage their metadata at an enterprise level, tools that "play nicely together" by adhering to common standards will be preferred over those that don't. I hasten to add, though, that this should apply to all products that transform data, not just DI tools; after all, databases, data mining tools and reporting tools also transform data.

In conclusion, I would like to see the next-gen DI tools adding more value with much less effort. The tool vendors need to combine business usability with enterprise functionality, simplicity with visibility. DI is central to the movement and transformation of data within an organisation, so DI tools are well positioned to provide much higher levels of centralised functionality in terms of enterprise metadata, reusable components, data quality statistics and standardised formats.


Posted July 20, 2009 4:05 PM
Permalink | 3 Comments |

I must confess to being a relative newbie when it comes to Agile development, and I put this down to the fact that the majority of our customers and potential customers insist on classic waterfall delivery cycles.

After a recent meeting with an Agile evangelist it seemed to me that the methodology has a sound foundation based upon common sense.  Many of the key messages resonated with my personal experiences, albeit in a waterfall environment.

To set the scene, Agile is an iterative delivery methodology based upon a few key tenets:

  • Mixed teams including business representatives, testers, analysts and developers to enable rapid decision making and delivery
  • Short delivery iterations to enable functionality to be delivered at regular intervals and "learning" to be fed back in (closing the loop).
  • The use of reusable (or even automated) test routines to enable shorter test cycles and therefore shorter release cycles.

In particular, I was struck by three very thought-provoking messages and concepts:

  • Minimum Marketable Feature (MMF)
  • The right to change your mind
  • The right to stop when you want to

Identifying MMFs is part of the planning and prioritisation process that ensures the most important, highest value functionality is identified. It is a means of separating "need" from "want", "must haves" from "nice to haves". In this way the delivery team will focus on features that the business can actually use within a relatively short timescale.

Agile enables fluidity in the decision-making process because it does not expect all variables to be known at the outset. It is acknowledged that it is very difficult to define inherently complex systems up front, as would be expected within waterfall. Being able to change one's mind mid-iteration cuts out the waste associated with change management: raising changes, haggling over price, approval processing, respecification and so on.

The right to stop when you want to is also very interesting. Business priorities change during projects, as do user perceptions. If core functionality is in place and usable, doesn't it make sense that the user can say "actually, this is good enough" at any point in time? After all, features that may have seemed necessary a month ago may now be an expensive luxury that isn't really required.

As I mentioned at the beginning of this post, much of Agile is common sense, which is why I like it. I like the flexibility, the focus on automation where possible, and the pragmatic acknowledgement that it is very difficult to set rules in stone at the outset, before hard experience kicks in.

To add some balance to the argument, I'm also sure that there are many new challenges peculiar to Agile that I don't yet understand; however, I will now be taking the time to find out more about this methodology.


Posted July 13, 2009 12:37 PM
Permalink | No Comments |

I've always liked to think of data warehouse environments as fast moving and agile, driving value for the business through rapid delivery of important new information, but it seems that this is often not so. In recent months I have been told of organisations where it takes three months to deliver new data feeds to the warehouse.

Personally, this feels very wrong, and I'd really like to understand how prevalent it is. I'm sure there are occasions when three months is justified, perhaps for extremely complex feeds, but on a regular basis?

It also raises the question of whether an overlong "time to market" is one of the causes of organisations having too many uncontrolled data marts or huge analytic environments full of ETL code. After all, how much competitive advantage can be gained from data that is so long overdue? Is it any surprise, therefore, that alternatives to the warehouse are being sought out?

It is absolutely correct that the business shouldn't accept these levels of delay, but short-term workarounds will not help anyone in the long term. Over time the additional complexity will simply confuse the data and reporting landscape. After all, it's hard to get a single version of the truth when everyone has their own personal favourite!

So what is the answer? I hate to oversimplify, but the answer lies with both IT and the business. I know it's an old chestnut, but it's still the truth.

Both parties need to commit to reducing the time to delivery, such that the warehouse can be agile and responsive. IT should examine processes, methodologies, capability, design and SLAs, and the business should support and participate in initiatives that raise levels of understanding and governance relating to business rules, data and metadata.

Technology can also help, and in some instances I believe that alternatives to the warehouse are the right option, but I'm keeping my powder dry on that one, for this blog at least. It's another post for another day, I'm afraid, so stay tuned and I'll elaborate in the coming weeks.


Posted July 5, 2009 7:58 PM
Permalink | No Comments |