Blog: Rick Barton
http://www.b-eye-network.co.uk/blogs/barton/

Hello and welcome to my blog. I am delighted to blog for the BeyeNETWORK, and I'm really looking forward to sharing some of my thoughts with you. My main focus will be data integration, although I don't plan to restrict myself just to this topic. Data integration is a very exciting space at the moment, and there is a lot to blog about. But there is also a big, wide IT and business world out there that I'd like to delve into when it seems relevant. I hope this blog will stimulate, occasionally educate and, who knows, possibly even entertain. In turn, I wish to expand my own knowledge and hope to achieve this through feedback from you, the community, so if you can spare the time please do get involved.

Rick

Adventures with Talend

Happy New Year to all!

It's been quite a while since my last post, some of which is down to me having to spend time getting to grips technically with my latest data integration interest: Talend, the market leader in open source data integration.

When I first began to look seriously at Talend early last year, I expected that it would not take too long to master, given that I have a long history of using data integration tools.  The fact of the matter, though, is that it took me much longer than expected, not because of the usability of the product (it is in fact simple to use), nor because it differs greatly from what I was used to, but because of its scale.  It has a huge number of connectors, components and orchestration methods, each of which can be configured in a number of ways, making it very, very flexible and very big!

Talend is also very extensible.  Being Java based, Talend allows developers to "reach into" the world of Java in order to create new code fragments and shared routines that are then exposed within the product as additional features.
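For illustration, a shared routine in Talend is essentially a class of static Java methods. The sketch below is my own made-up example (the class name, method and logic are not part of the product), showing the kind of helper that, once saved as a user routine, can be called from a tMap expression or a tJava component in much the same way as the built-in functions.

package routines;

// A made-up example of a shared routine: a static helper that, once saved as a
// user routine, can be referenced from job components, e.g.
// StringHelpers.normalisePostcode(row1.postcode) in a tMap expression.
public class StringHelpers {

    // Strip internal whitespace and upper-case a UK postcode; nulls pass through untouched
    public static String normalisePostcode(String raw) {
        if (raw == null) {
            return null;
        }
        return raw.replaceAll("\\s+", "").toUpperCase();
    }
}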

It is also possible to create new connectors and components using JavaJet, and whilst this is not required for many projects, it is a very useful means of creating re-usable code.  Testament to this is that there are at least as many downloadable community components as there are components in the main product release.

To be honest, the whole Java thing took me a while to come to terms with, and I kept asking "why do I have to learn Java in order to use a data integration tool?".  It just didn't feel right, but then eventually it dawned on me.  All DI tools have some internal, often proprietary, scripting language that has to be used for handling more complex requirements, so why not Java?  It is open, incredibly rich, well established and it has a thriving community that is constantly extending it... perfect.

Getting over that mental hurdle shed a whole new light on the Talend/Java relationship and I finally began to embrace Java as the scripting language, making the whole process of understanding what "really makes Talend tick" much, much easier.

I am now a big fan of Talend.  It is pretty easy to pick up and run with, and for many applications only a smattering of Java would be needed (Talend does provide some functions out of the box).  It has a huge range of connectors to many databases and business packages, a thriving community, and it is being extended day by day.  For many organisations it would probably integrate across their whole heterogeneous environment without any customisation required, but should customisation be needed there is the whole, vast Java language to pick and choose from, which is no bad thing at all.

I have to confess that I have only told part of the story at this point, in that I have touched only very lightly on the breadth of what Talend offers in this post.

When referring to Talend, I have been referring to the company's integration offerings (Open Studio and Integration Studio).  In the last year Talend has also added Data Quality and Master Data Management tools to its portfolio and has recently acquired Sopera to extend its reach into the middleware space.

So Talend is much more diverse than I expected, both within the integration product and beyond it, and it has all the capabilities I would expect from any serious player in the data integration market.

http://www.b-eye-network.co.uk/blogs/barton/archives/2011/01/adventures_with_talend.php Fri, 07 Jan 2011 14:51:05 +0000
Migration and Integration

Last week I attended the Data Migration Matters conference (http://datamigrationmatters.co.uk/) in London, and what I learned was that although there are differences in approach between integration and migration, there are also common factors, two of which I will cover in this post.

The first is customer involvement.  The data in an organisation is utilised by the business, defined by the business and ultimately informs business decisions, so any project, IT or otherwise, that needs to make changes to data ultimately requires business buy-in and involvement.

The second is understanding the data.  Many projects have failed because of incorrect assumptions and gaps in knowledge that only manifest themselves when a project is in full flight.  It is imperative that the source data is mapped out and understood prior to coding.

In some ways these two requirements go hand in hand.  To understand and make sense of the data, you need the business to add their experience of using it to the mix.  To involve the business, you have to be able to deliver technical information in such a way that the data can be interpreted non-technically.

This is where data profiling and quality tools come into their own.  These tools analyse the source data and present the user with a high-level view of it, enabling the user to see patterns, statistics and relationships at both file and field level.
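As a rough illustration of the kind of field-level statistics these tools produce, here is a deliberately simplified, home-made sketch in Java (the file name and column position are hypothetical, and real profilers do far more, including patterns, ranges and cross-field relationships):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

// Simplified field-level profiling: row count, blank count and distinct values
// for one column of a delimited extract.
public class FieldProfiler {

    public static void main(String[] args) throws IOException {
        Path extract = Path.of("customers.csv"); // hypothetical source extract
        int columnIndex = 2;                     // e.g. the postcode field

        long rows = 0, blanks = 0;
        Set<String> distinct = new HashSet<>();

        for (String line : Files.readAllLines(extract)) {
            rows++;
            String[] fields = line.split(",", -1);
            String value = columnIndex < fields.length ? fields[columnIndex].trim() : "";
            if (value.isEmpty()) {
                blanks++;
            } else {
                distinct.add(value);
            }
        }

        System.out.printf("rows=%d, blank=%d (%.1f%%), distinct=%d%n",
                rows, blanks, rows == 0 ? 0.0 : 100.0 * blanks / rows, distinct.size());
    }
}

Even a handful of numbers like these, shown alongside the real records behind them, is usually enough to start the conversation with the business.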

Profiling information is often a revelation for business users.  Operationally the data works, so it is deemed fit for purpose; however, when profiled it is not uncommon to see genuine problems, such as duplicate records, missing fields and records, and often just plain incorrect values.  The ability to drill down to the actual data is also imperative, in order to show that the statistics marry up to real data on real systems within the organisation.

It is often at this point, when the illusion of "perfect data" evaporates, that the business buys into the project and begins to understand why the ownership of the data and the associated business rules fall squarely within their domain.  It is surprising how showing people their own data "warts and all" can have a profound effect on their involvement in a project. 

How often have we heard the phrase "if it isn't broken, don't fix it"?  For many users the view is that their data isn't broken, so it is perhaps hard for them to understand why IT makes such a fuss during a data migration.

The truth is that for many organisations their data is, to varying degrees, somewhere between broken and fixed and it is only when it is utilised en masse, say for reporting or migration, that problems suddenly begin to appear.

http://www.b-eye-network.co.uk/blogs/barton/archives/2010/05/migration_and_integration.php Mon, 17 May 2010 09:30:05 +0000
The reducing cost of data warehousing

It's been a while since my last blog, mainly because I've been busying myself with setting up a new business, edexe (www.edexe.com).  The mission for edexe relates in many ways to this blog post: performance, price and productivity.

A recent piece of work has left me stunned by how much the cost of data warehousing has fallen in the last couple of years.

MPP (massively parallel processing) technologies have fallen in price, on both the ETL/DI front and the database front, and database automation tools are becoming more popular as a result.

Competition is hot with regard to MPP in the database/appliance market.  Teradata and Netezza have been the dominant players for the last 5-6 years; however, a host of new appliance, cluster-based and columnar databases have hit the market in the last 2-3 years.  The increased competition from the likes of Oracle Exadata, HP Neoview, Aster, Greenplum, Kognitio, Kickfire and Vertica is rapidly bringing the price of MPP database processing down to below £100k for entry-level systems.

This is bringing the MPP database well within the reach of mid-tier companies, enabling enterprise performance on a budget.

On the data integration front, expressor's parallel processing engine delivers speeds that compare with, or even beat, the most established DI vendors such as Ab Initio, Informatica and DataStage, yet it is priced below £50k for an entry-level DI product.  Talend also delivers an MPP option to the market with the MPx version of its integration suite.

Again high performance DI/ETL products are now available for the mid-tier company.

So I've discussed price and performance, what about productivity?

Well, the database products deliver significant productivity benefits over traditional databases such as Oracle and SQL Server.  The standard for MPP databases was set by Netezza, delivering MPP performance with minimal DBA activity.  Even in large organisations, Netezza DBAs may spend only one day a week maintaining the system.  Within the MPP space this is pretty much standard, with self-organising databases being the norm.

Between DI tools there is little to choose in terms of productivity; however, compared to SQL or other hand-cranked coding approaches, the productivity gains are huge (anywhere from a 50-80% reduction in coding time).

Staying with productivity, one tool that has really impressed me is BIReady.  It is a database automation solution that really does deliver on two main fronts: changes to the data model do not necessarily require changes to the data structures, since data is automatically organised in a model-independent normalised schema; and key assignments are managed within BIReady, so they need not be maintained by the ETL solution.  This is a significant productivity gain, reducing DBA activity (like the MPP databases), simplifying the ETL process and shortening development times by taking away the need for key management.  What's more, BIReady's pricing also fits comfortably into mid-tier budgets.

So there we have it: price, performance and productivity.  It is now possible to purchase a low-maintenance, high-performance, end-to-end MPP warehousing stack for under £300k, and the nature of the beast means that the effort to deliver and maintain the solution is reduced too.

High-performance data warehousing is finally within reach of mid-market companies.

http://www.b-eye-network.co.uk/blogs/barton/archives/2010/04/the_reducing_cost_of_data_ware.php Wed, 28 Apr 2010 15:04:40 +0000
Can change be eliminated?

An interesting statistic in a recent survey is that:

"A third of businesses in this country said that keeping project requirements as constant as possible, without allowing too many changes, was key to cutting costs."

I absolutely agree with this statement; however, it is quite difficult to achieve.  How many of us understand 100% what we want from a delivery months before go-live?  How many of us in the technical world can produce a design that reflects the business requirements absolutely?  How many people within the business can sign off a technical specification in the certain knowledge that it delivers 100% of their requirement?

I can tell you the answer to these questions: none of us.  Certainty is not a word that is often used in the development lifecycle, especially around the translation between business requirements and technical specification.

So is there a way that project requirements can be kept as constant as possible?  I think there are approaches that can help mitigate change, and others that allow for change, but none that can eliminate it altogether.

Technology can help, but it needs to be more business friendly.  It is widely accepted that the more the business can do in a project, the more successful that project will be, so the easier a technology is to use and understand, the more likely it is that the business will use it proactively.

Agile can help in that it is an approach that allows for change; however, adopting Agile can be difficult and does require a step change in thought processes.

Ultimately the answer rests with communication.  IT and business need to find ways of communicating requirements and potential solutions in a manner that breeds mutual understanding. 

My personal view is that effective prototyping is the key.  It is standard to produce "mock-ups" in design functions across many industries, so perhaps we in the data world should take a leaf out of their book and find ways to share technical ideas in a much more "tactile" fashion.

 
http://www.b-eye-network.co.uk/blogs/barton/archives/2009/12/can_change_be_eliminated.php Fri, 11 Dec 2009 16:31:40 +0000
X88 breaking new ground

I'm really looking forward to an X88/expressor/emunio webinar next week.  X88 is a new entrant into the data profiling market, but it does much, much more than just profiling.

 

For a start, the product will profile and automatically cross-reference all data loaded into its database, and it can handle very high volumes.  That's great as it stands; however, what it allows you to do with that data is even better!

 

Using the loaded data, X88 provides a business-friendly interface for prototyping transformation rules, so that the effect of each transformation can be viewed in real time.  This gives a visual, iterative way of understanding the effects of your business transformation and cleansing rules.

 

And it keeps on getting better...

 

Once you have completed your prototyping you can then a) produce a specification for your ETL tool detailing the transformations, and b) create a file of the transformed data.  This file can then be used to test your ETL code (if the ETL output doesn't match it, then the code is wrong!) and to test your target systems, checking the correctness of the prototype.

 

All of this is done without the need for any traditional coding, which means that timescales for business rules and timescales for testing can be aggressively reduced.

 

Add to this expressor's semantic DI capability for the ETL, which incidentally has the effect of compressing development timescales, and this solution has a significant impact along the whole development lifecycle.

 

If you want to find out more then register for the webinar here

http://www.b-eye-network.co.uk/blogs/barton/archives/2009/11/x88_breaking_new_ground.php Fri, 13 Nov 2009 14:04:01 +0000
Data Virtualisation - one way to improve your DW delivery timescales

I promised in last week's blog, and in one of my earlier posts, that I would provide detail on how a DW can be enhanced to enable faster and more effective delivery (as opposed to 3 months+), and I'm finally delivering on that promise in this entry.

Essentially the answer is to use a data virtualisation tool to provide a "fast track" data delivery mechanism to rapidly integrate new data into a virtual schema.
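To make the idea concrete, here is a toy sketch of what a virtualisation layer is doing behind the scenes: answering a "virtual view" by querying two live sources at request time and joining the results in memory, with nothing instantiated in a warehouse.  The connection URLs, credentials, table and column names are all hypothetical, and a real tool adds query planning, push-down and caching on top.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

// A "virtual" customer-spend view: resolved at query time from two sources.
// Nothing is physically created, so there is nothing to clean up afterwards.
public class VirtualCustomerSpendView {

    public static void main(String[] args) throws SQLException {
        Map<String, String> customerNames = new HashMap<>();

        // Source 1: customer reference data from the CRM database
        try (Connection crm = DriverManager.getConnection(
                "jdbc:postgresql://crm-host/crm", "report_user", "secret");
             Statement st = crm.createStatement();
             ResultSet rs = st.executeQuery("SELECT customer_id, customer_name FROM customers")) {
            while (rs.next()) {
                customerNames.put(rs.getString("customer_id"), rs.getString("customer_name"));
            }
        }

        // Source 2: spend figures from the billing database, joined in memory
        try (Connection billing = DriverManager.getConnection(
                "jdbc:postgresql://billing-host/billing", "report_user", "secret");
             Statement st = billing.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT customer_id, SUM(amount) AS spend FROM invoices GROUP BY customer_id")) {
            while (rs.next()) {
                String name = customerNames.getOrDefault(rs.getString("customer_id"), "unknown");
                System.out.printf("%s: %.2f%n", name, rs.getDouble("spend"));
            }
        }
    }
}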

So what problems will this approach solve?  I have listed some of the more obvious ones below:

  • The business needs data next week but the delivery lifecycle requires 3 months
  • The business has a one-off data feed but adding it to the warehouse is the only option
  • The business has a proliferation of uncontrolled data marts, built because they did not have time to wait for the delivery lifecycle
  • The DW is not up to date enough for the business need
  • DW and ETL designs are just plain wrong because the business rules are wrong
  • The DW and ETL process are bloated with unnecessary data (often these one-off feeds!)

All these problems lead to raised costs and loss of competitive advantage, so how can virtualisation help?

  • The tools are simple for end users to use - new views are quick to create
  • Complete subject areas can be virtualised, such that new dimensions can be added very quickly
  • Queries can be run in real time
  • Business rules and data relationships can be prototyped and understood prior to instantiation into DW schema and ETL code
  • The tables and the data do not physically exist so when you are done with them there is nothing to clean up

Virtualisation should not be seen as a replacement for a warehouse or for ETL, however.  Federated queries can impact source systems, so balancing needs against source system impact is key, which is why I framed this as a complement to the existing DW at the top of the post.

What I will say, though, is that virtualisation can give users the flexibility they need, and the DW team the breathing space to ensure that DW and ETL changes are correct and focused on genuine long-term DW additions.

For those who are interested in this method, Composite Software has recently issued a paper focused on how virtualisation can complement the DW.  It's available on their website (www.compositesoftware.com).

http://www.b-eye-network.co.uk/blogs/barton/archives/2009/10/data_virtualisation_-_one_way.php Sat, 03 Oct 2009 09:44:03 +0000
Impressive Twin Fin

I'm back off my holidays and feeling very refreshed and raring to go.  I'd like to say thanks to Phil for covering for me, and I will be following on from his theme in a later post.  (For those following: I know I have made a promise like this before, in a blog about warehouse delivery times, and haven't yet delivered, but please hang on in there.  It is on my guilt list and due soon!)

Anyway, on to the business at hand: I had the pleasure of attending Enzee Universe in London last week.  It was a fabulous affair, and the new Twin Fin appliance was unveiled.  With Twin Fin, Netezza have again raised the bar from a price/performance perspective.

 

Although I have only had a brief overview of the new architecture, the key difference is the move to commodity hardware in the form of Intel-based blades, making the technology much more approachable.  The FPGA technology is still present, thus retaining Netezza's "secret ingredient" and one of the key components that deliver the outstanding performance.

 

Netezza have also been busy building a set of "on the box" match-ups with other vendors, and one I am particularly enamoured of is the new KONA platform.  KONA is a marriage of Kalido and Netezza, and for anyone considering a new warehouse or re-architecting a legacy platform it is very much worth a look.  Kalido's ease of set-up and long-term maintenance combined with Netezza's performance is a potent mix indeed.  It is in effect a whole data warehouse in a single box, which is a very attractive prospect from a management perspective.  At the moment this offering is geared towards specific verticals; however, I would anticipate it being opened out in time.

 

So all in all a great day and once again Netezza are continuing to demonstrate the forward thinking that has made them the market leader in appliance technologies.

http://www.b-eye-network.co.uk/blogs/barton/archives/2009/09/impressive_twin_fin.php Mon, 28 Sep 2009 14:18:29 +0000
Watch out for Data Integration Silver Bullets

The market for data integration tools has never been as strong or as rich as it is today. Ten or more years ago, today's giants in the marketplace offered the classic Extract, Transform and Load capability, and they have been evolving their platforms ever since to provide the current batch of feature-rich, high-performance packages that now extend to cover data migration, data quality (profile, monitor and cleanse), web services and enterprise metadata management.

 

A recent Gartner paper (available courtesy of Syncsort here) reveals that the market for Data Management and Integration tools is nearly $1.7 billion, and is expected to grow to $2.7 billion by 2013.  This growth is driven by an awareness of the high cost of delivering data-centric projects using the manually intensive programming techniques of 3GL languages, and businesses are increasingly attracted to the productivity gains offered by Data Integration tools.  At the same time, businesses listen intently to the marketing machines and accept at face value that the data integration tool they end up buying really will be the silver bullet that solves all their data problems.

 

Wrong, wrong, wrong.

 

Choosing a data integration technology is just the first step on a journey to improving productivity and responsiveness within a business; making it work over the long term is a little more difficult.  Having worked on countless data integration projects over the last 12 years, my biggest source of frustration is when the customer has been set an unrealistic expectation about how easy it is to work with the technology.

 

Yes, DI tools are certainly an order of magnitude easier than hand-cranking code, but the architecture will not take care of itself, and the out-of-the-box settings almost always last little more than a few months before progress falters - it may even halt until things are fixed.

 

Why does this happen?

 

Most DI tools find their way into a business following a Proof of Concept project.  Proofs of concept are just that: they prove something, and the intent is to prove the concept as quickly as possible.  They don't usually produce production-ready code, and the environment isn't usually set up at this stage to support wide-scale usage.  A POC is usually impressive in terms of results, but it can also be very quick and dirty.

 

It also helps to understand the business model of the vendor, and therefore their ultimate motivation. Some vendors don't have professional services, and rely on licences and maintenance as their main income stream. This leads to the possibility that the immediate sale becomes the focus - at the expense of making it last.

 

Credit to the vendor presales teams who perform the POCs.  They are highly skilled and deliver at tremendous speed.  The problem, however, is that they make it all seem very easy... too easy.  Remember: just because the vendor does it with consummate ease and the tool is shiny, it doesn't mean that even your brightest technical people can achieve the same feat the day after returning from the training course.  Your team will need support to get to the top of that learning curve as quickly as possible - particularly around implementing a stable and scalable architecture.

 

So ask your vendor what their business model is. If they're not geared up to do professional services, you probably need to find a partner to help you with the transition of your delivery team from newbies to experts - and the sooner the partner gets involved, the fewer long term problems you're likely to see. 

 

Any integration partner worth his or her salt will start with a discovery phase, during which they audit the environment and map the business needs to a technology strategy and plan. I've been in this position many times, and in a very short time I often find that the software has been configured so poorly that it won't scale, in terms of one or more of performance, complexity or enterprise growth. What follows is a difficult conversation with the business sponsor, who still can't believe that the wonder tool has failed to deliver.

 

The moral of the tale is this: when you start evaluating a data integration tool, begin evaluating an integration partner that you can trust and work closely with. Get them in early, preferably when you're doing the proof of concept with the vendor, so they can ensure a smooth transition between the vendor and your internal team. They can help with your evaluation criteria, your architecture and governance, and help you avoid pain later in your delivery programme.

 

It's often said that there's no substitute for experience, and that's never truer than when applied to data integration projects; a few days consulting early in the project lifecycle can save tens or even hundreds of thousands of pounds down the line.

 

 

http://www.b-eye-network.co.uk/blogs/barton/archives/2009/09/watch_out_for_data_integration_silver_bullets.php Mon, 07 Sep 2009 09:05:46 +0000
Introducing Phil

I am going on holiday for a couple of weeks (well earned... and needed!), so I will not be blogging for that period.  In the meantime one of my colleagues at Emunio, Phil Watt, has volunteered to stand in for me for the fortnight.

Phil is one of our Principal Consultants and has deep experience in all things DI.

 

Thanks for covering for me Phil and I'm looking forward to reading your entries when I come back.

http://www.b-eye-network.co.uk/blogs/barton/archives/2009/08/introducing_phil.php Tue, 11 Aug 2009 18:53:48 +0000
Another tool in the kit bag

One of my favourite quotations is from Abraham Maslow, who said "It is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail".  Ever since I first heard someone use this quotation it has resonated with me, especially since I have seen examples of this behaviour first-hand on many occasions and in many different problem contexts.  For this post, though, I'm going to concentrate on one specific example that can help to improve DI delivery.

DI tools are great.  They are faster and easier to use than many other technologies, hence their success in the marketplace.  But let's not forget what they are designed for.  The clue is in the name "Data Integration", where "data" is the key word.  They are great at dealing with data.  But the DI tool is just one tool in the kit bag.  Another, often overlooked, tool is the operating system that the DI tool resides on.

Operating systems are very good at handling files, directory structures and hardware.  It is their raison d'être, and it is therefore no surprise that O/S vendors have developed their own tools to handle these resources effectively.  Some of these may not be as pretty as the DI tools, but used properly they can be more effective than any other tool to hand when dealing with files and file names.

So, for example, let's take a directory containing many files, all of which have been landed overnight with the wrong date (e.g. infile001_20090701, infile002_20090701 etc.).  The batch process has failed to update the target system and the pressure is on from the business to get the data, so here is my solution (written in Korn shell script):

# for each wrongly dated file, strip the old date suffix and rename it with the correct one
for i in `ls infile*_20090701`; do j=`echo $i | cut -d"_" -f1`_20090702; mv $i $j; done

One line of shell script!  It took me about 15 minutes to write this.  (Please note that I'm a little rusty now, but any one of Emunio's employees could do this in much less time.) 

In my experience this simple task would take orders of magnitude longer using a DI tool, because this small problem is not the nail to the DI tool hammer.

My point here is that every DI developer has an additional tool in their kit bag.  Those technicians who take time out to understand their O/S and to practise with the supplied tools will be much more effective than those who don't.

This also applies at the organisational level.  DI managers should encourage a culture of learning and development, such that practitioners have the ability to determine when and when not to use their tool of choice.  This should also be reflected in sensible standards for DI tool use; standards that recognise the power of the O/S-supplied tools and encourage their use will, in a small way, increase the productivity of each member of the DI team.

http://www.b-eye-network.co.uk/blogs/barton/archives/2009/07/another_tool_in_the_kit_bag.php Fri, 31 Jul 2009 16:15:18 +0000
What next for the data model?

Following on from my last post about what's next for DI, I have been wondering what effect current technologies could have on the data model in the near future.

There are a few technology areas in particular that I see eroding the need for a data model, namely:

Massively parallel processing (MPP) databases: these databases have redefined the speed at which queries can be executed, and the net effect of this change is that the database is more "forgiving" of less-than-perfectly modelled data.

Data virtualisation: virtualisation enables queries that access many input data sources without the data needing to be instantiated, removing the need for a formal model; instead, tailored views are created for each particular end-user requirement.  In addition there is a class of tools that, while not quite virtualisation tools, enable rapid access to flat-file data via SQL.  These are more specialised, but worthy of note.

Dynamic warehousing: these products store the data independently of the reporting model, such that changes to the model do not require changes to the underlying tables.  A good example of this is Kalido's Dynamic Information Warehouse (DIW) technology.  In addition, Kalido drives the product via a semantic business model rather than a traditional data model.

Profiling: many data profilers can infer relationships between fields in the same or even different files, thereby enabling keys to be identified across the data.  Composite Software has an interesting product (Discovery) that not only enables keys to be identified but also lets you "fix" those relationships so that they can then be used within queries.

One thought I have had, and one which I find very interesting, is a match-up between profiling and MPP databases: an intuitive warehouse, if you will.  In this scheme a new data source is simply added to the database as a new table.  That table is profiled against the existing tables, the key relationships are discovered and confirmed, and the table is then usable for querying.  The MPP database is needed because this "unstructured model" would be less efficient than a traditional "designed" model, so the MPP engine's horsepower would be required to provide result sets in a timely fashion.
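As a very rough sketch of how that profiling step might infer a key, the fragment below (my own illustration, not any product's algorithm) measures how many of a new column's distinct values are contained in a candidate key column, and treats a high containment score as a probable join:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Infer a likely key relationship by value containment between two columns.
// The column contents and the 0.9 threshold are illustrative assumptions.
public class KeyInference {

    static double containment(Set<String> candidateValues, Set<String> keyValues) {
        if (candidateValues.isEmpty()) {
            return 0.0;
        }
        long matched = candidateValues.stream().filter(keyValues::contains).count();
        return (double) matched / candidateValues.size();
    }

    public static void main(String[] args) {
        // Distinct values from a column of the newly loaded table
        Set<String> newTableCustomerIds = new HashSet<>(List.of("C001", "C002", "C003", "C999"));
        // Distinct values from the key column of an existing table
        Set<String> customerKeys = new HashSet<>(List.of("C001", "C002", "C003", "C004", "C005"));

        double score = containment(newTableCustomerIds, customerKeys);
        System.out.printf("Containment %.2f -> %s%n", score,
                score >= 0.9 ? "probable key relationship" : "no relationship inferred");
    }
}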

So as you can see, current technologies are allowing a relaxation of the rules around data modelling and are challenging the accepted warehouse solution stack by allowing rapid query design without the use of a model. 

I don't think the data model is going anywhere just yet, but I do think its place at the very heart of the warehouse will be challenged in the coming years.

http://www.b-eye-network.co.uk/blogs/barton/archives/2009/07/what_next_for_the_data_model.php Mon, 27 Jul 2009 20:22:27 +0000
What next for DI?

I've been giving some thought to the next generation of DI tools and the features that I believe would set them apart from the first-generation tools.

On a technical level, the tools will need to add additional levels of abstraction from the executable code.  The first-gen tools reduced the complexity of hand-written code by using graphical boxes and lines.  This removed the need for developers to code their own "structures", allowing more time for design and compressing the time and number of people required to create code.  The next-gen tools must achieve a similar step change.  In my view this will mean delivering tools to the business that are simple to use and that generate executable code directly from the business rules with minimal technical intervention.  One cannot expect every scenario to be catered for; however, if the majority of transformations can be instantiated as executable code by the business analysts, then this will be a significant step forward for DI.

The tools must also provide test functionality to support the business users creating rules.  Testing the rules at inception is vastly more efficient and cost effective than at later test stages.  If pre-defined data quality rules are also linked in at this stage then the business user can confirm that rules also meet agreed quality definitions prior to any code being generated.

Speaking of quality, functions such as data validation, data profiling, data quality checking and data cleansing should become an integrated part of day-to-day execution.  Data sources can then be quality checked and profiled in flight, with additional configuration determining whether processes should fail when quality thresholds are exceeded.  It is well documented that quality improvements deliver value in terms of cost and reduced errors, and taking a "day by day" approach means that quality improvement need not be treated as a massive undertaking.
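A minimal sketch of that "fail when a threshold is exceeded" idea is shown below; the record format, the validation rule and the 5% threshold are my own illustrative assumptions rather than a feature of any particular DI tool.

import java.util.List;

// In-flight validation with a configurable quality gate: count the failures,
// compare the error rate to a threshold and abort the load if it is breached.
public class QualityGate {

    public static void main(String[] args) {
        // Customer ids arriving on an imaginary feed; a blank id fails validation
        List<String> customerIds = List.of("C001", "", "C003", "C004");

        long failures = customerIds.stream().filter(String::isBlank).count();
        double errorRate = (double) failures / customerIds.size();
        double threshold = 0.05; // in a real job this would come from configuration

        if (errorRate > threshold) {
            // a real DI job would abort the load and raise an alert at this point
            throw new IllegalStateException(String.format(
                    "Quality gate breached: %.0f%% of records failed validation", errorRate * 100));
        }
        System.out.println("Quality gate passed, continuing load");
    }
}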

Also coupled to business usability is the concept of "macro-components": pre-built "wire frames" that enable simple configuration, and therefore automation, of common tasks such as creating dimensions, working with headers and trailers, validating dated input files and so on.  Driving processes through metadata is becoming more and more common, and DI tools should make this as easy as possible to achieve.
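To illustrate what "driving processes through metadata" can look like in miniature, here is a home-made sketch in which the field mapping and the transformation names live in configuration rather than in hand-written code; the field names, transformation names and in-line metadata are all hypothetical, and in practice the mapping would come from a repository or a configuration file.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// A metadata-driven mapping: the metadata says which source field feeds which
// target field and which named transformation to apply on the way through.
public class MetadataDrivenMap {

    // Registry of reusable transformations, looked up by name from the metadata
    static final Map<String, UnaryOperator<String>> TRANSFORMS = Map.of(
            "trim", s -> s == null ? null : s.trim(),
            "upper", s -> s == null ? null : s.toUpperCase(),
            "none", s -> s);

    public static void main(String[] args) {
        // Metadata: source field -> { target field, transformation name }
        Map<String, String[]> mapping = new LinkedHashMap<>();
        mapping.put("cust_name", new String[]{"CUSTOMER_NAME", "upper"});
        mapping.put("postcode", new String[]{"POSTCODE", "trim"});

        Map<String, String> sourceRow = Map.of("cust_name", "smith", "postcode", " LS1 4DL ");

        Map<String, String> targetRow = new LinkedHashMap<>();
        mapping.forEach((sourceField, rule) ->
                targetRow.put(rule[0], TRANSFORMS.get(rule[1]).apply(sourceRow.get(sourceField))));

        System.out.println(targetRow); // {CUSTOMER_NAME=SMITH, POSTCODE=LS1 4DL}
    }
}

The same pattern scales up to the "macro-component" idea: the wire frame is generic, and the metadata tells it what to do for each feed.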

Next on the list is that DI tools will need to comply with industry standards for metadata - not just in terms of data formats, but also in terms of business and transformation rules.  As data legislation increases, so must the auditability of data and metadata to ensure compliance needs are met.  As more organisations seek to govern and manage their metadata at an enterprise level, tools that "play nicely together" by adhering to common standards will be preferred over those that don't.  I hasten to add, though, that this should apply to all products that transform data, not just DI tools; after all, databases, data mining tools and reporting tools also transform data.

In conclusion I would like to see the next-gen DI tools adding more value but with much less effort.  The tool vendors need to combine business usability with enterprise functionality; simplicity with visibility.  DI is central to the movement and transformation of data within an organisation, so DI tools are well positioned to provide much higher levels of centralised functionality in terms of enterprise metadata, re-usable functionality, data quality statistics and standardised formats.

http://www.b-eye-network.co.uk/blogs/barton/archives/2009/07/what_next_for_di.php Mon, 20 Jul 2009 16:05:44 +0000
Agile - three thought-provoking features

I must confess to being a relative newbie when it comes to Agile development, and I put this down to the fact that the majority of our customers and potential customers insist on classic waterfall delivery cycles.

After a recent meeting with an Agile evangelist it seemed to me that the methodology has a sound foundation based upon common sense.  Many of the key messages resonated with my personal experiences, albeit in a waterfall environment.

To set the scene, Agile is an iterative delivery methodology based upon a few key tenets:

  • Mixed teams including business representatives, testers, analysts and developers to enable rapid decision making and delivery
  • Short delivery iterations to enable functionality to be delivered at regular intervals and "learning" to be fed back in (closing the loop).
  • The use of reusable (or even automated) test routines to enable shorter test cycles and therefore shorter release cycles.

In particular I was struck by three very thought-provoking messages and concepts:

  • Minimum Marketable Feature (MMF)
  • The right to change your mind
  • The right to stop when you want to

Identifying MMFs is part of the planning and prioritisation process that ensures the most important, highest-value functionality is identified.  It is a means of separating "need" from "want", "must haves" from "nice to haves".  In this way the delivery team focuses on features that will be usable by the business in a relatively short timescale.

Agile enables fluidity in the decision-making process because it does not expect all variables to be known at the outset.  It is acknowledged that it is very difficult to define inherently complex systems up front, as would be expected within waterfall.  Being able to change one's mind mid-iteration cuts out the waste associated with change management: raising changes, haggling over price, approval processing, respecification and so on.

The right to stop when you want to is also very interesting.  Business priorities change during projects, as do user perceptions.  If core functionality is in place and usable, doesn't it make sense that the user can say "actually, this is good enough" at any point in time?  After all, features that may have seemed necessary a month ago may now be an expensive luxury that isn't really required.

As I mentioned at the beginning of this post much of Agile is common sense, which is why I like it.  I like the flexibility and the focus on automation where possible and the pragmatic acknowledgement that it is very difficult to set rules in stone at the outset, before hard experience kicks in.

To add some balance to the argument, I'm also sure that there are many new challenges peculiar to Agile that I don't yet understand; however, I will now be taking time to find out more about this methodology.

http://www.b-eye-network.co.uk/blogs/barton/archives/2009/07/agile_-_three_thought_provokin.php Mon, 13 Jul 2009 12:37:14 +0000
Three month data feed

I've always liked to think of data warehouse environments as fast-moving and agile, driving value for the business through rapid delivery of important new information, but it seems that this is often not so.  In recent months I have been told of organisations where it takes three months to deliver new data feeds to the warehouse.

Personally this feels very wrong and I'd really like to understand how prevalent this is.  I'm sure that there are occasions when three months is correct, perhaps for extremely complex feeds, but on a regular basis?

It also raises the question of whether an overlong "time to market" is one of the causes of organisations having too many uncontrolled data marts, or huge analytic environments full of ETL code.  After all, how much competitive advantage can be gained from data that is so long overdue?  Is it any surprise, therefore, that alternatives to the warehouse are being sought out?

It is absolutely correct that the business shouldn't accept these levels of delay, but short-term workarounds will not help anyone in the long term.  Over time the additional complexity will simply confuse the data and reporting landscape.  After all, it's hard to get a single version of the truth when everyone has their own personal favourite!

So what is the answer?  I hate to oversimplify, but the answer lies with both IT and the business.  I know it's an old chestnut, but it's still the truth.

Both parties need to commit to reducing the time to delivery, such that the warehouse can be agile and responsive.  IT should examine processes, methodologies, capability, design and SLAs, and the business should support and participate in initiatives that raise levels of understanding and governance relating to business rules, data and metadata.

Technology can also help, and in some instances I believe that alternatives to the warehouse are the right option, but I'm keeping my powder dry on that one for now.  It's another post for another day, I'm afraid, so stay tuned and I'll elaborate in the coming weeks.

http://www.b-eye-network.co.uk/blogs/barton/archives/2009/07/three_month_data_feed.php Sun, 05 Jul 2009 19:58:30 +0000
Data Integration tools saving the planet?

I was driving home last week listening to the radio when an article came on discussing alternative energy sources.  A stray right-brain neuron must have fired, and I began wondering how what we do in DI impacts the environment.  After all, almost every other technology has some kind of environmental grading.  Certainly my fridge and my car do, and every company is aware of the impact of its hardware consumption, but there seems to be little consideration as to whether software can have an impact on carbon footprint.

It is generally recognised that code created (correctly) using DI tools is more performant than hand-cranked code.  This means that less hardware is used to perform the same actions.  Less hardware means a lower carbon footprint, which means more environmentally friendly.

This is even more profound when one considers the MPP technologies that can dramatically increase performance and scalability.  If one can avoid the need for new hardware, or even retire existing infrastructure, then the impact is positive not only from an environmental perspective but also from a cost perspective.

And so I have arrived at the conclusion that DI tools are at the positive end of the green scale and perhaps performance should be given a higher priority when evaluating software.

Extending this argument slightly, should software vendors be touting their green credentials?  Should there perhaps be benchmarks that could provide an "energy rating" for software?

If this did happen, then one would hope we would see a return to leaner code and more efficient programming practices, meaning that we are all less likely to hear the words "just throw some more tin at it".

http://www.b-eye-network.co.uk/blogs/barton/archives/2009/06/data_integration_tools_saving.php Tue, 30 Jun 2009 12:10:15 +0000