Blog: Rick Barton

Rick Barton

Hello and welcome to my blog. I am delighted to blog for the BeyeNETWORK, and I'm really looking forward to sharing some of my thoughts with you. My main focus will be data integration, although I don't plan to restrict myself to this topic alone. Data integration is a very exciting space at the moment, and there is a lot to blog about. But there is also a big, wide IT and business world out there that I'd like to delve into when it seems relevant. I hope this blog will stimulate, occasionally educate and, who knows, possibly even entertain. In turn, I wish to expand my own knowledge, and I hope to achieve this through feedback from you, the community. So if you can spare the time, please do get involved. Rick

About the author

Rick is the director (and founder) at edexe. He has more than 20 years of IT delivery experience, ranging from complex technical development through to strategic DI consultancy for FTSE 100 companies. With more than 10 years of DI experience from hands-on development to architecture and consulting and everything in between, Rick’s particular passion is how to successfully adopt and leverage DI tools within an enterprise setting. Rick can be contacted directly at rick.barton@edexe.com.

Happy New Year to all

It's been quite a while since my last post, some of which is due to me having to spend time getting to grips technically with my latest data integration interest: Talend, the market leader in open source data integration.

When I first began to seriously look at Talend early last year, I expected that it would not take too long to master, given that I have a long history of using data integration tools. The fact of the matter, though, is that it took me much longer than expected, not because of the usability of the product (it is in fact simple to use), nor because it is different from what I was used to (to some degree it is), but because of its scale. It has a huge number of connectors, components and orchestration methods, each of which can be configured in a number of ways, making it very, very flexible and very big!

Talend is also very extensible.  Being Java based, Talend allows developers to "reach into" the world of Java in order to create new code fragments and shared routines that are then exposed within the product as additional features.
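
By way of illustration, a shared routine in this sense is essentially a plain Java class with static methods; once added to a project it can be called from component expressions. The class and method below are my own invented example, not something that ships with Talend, and are only a sketch of the idea.

package routines;

/*
 * A minimal sketch of a Talend-style shared routine (the names are illustrative).
 * Once saved with the project's routines, the static method can be called from a
 * component expression, e.g. StringHelpers.initialCaps(row1.surname).
 */
public class StringHelpers {

    // Capitalise the first letter of each word; returns the input unchanged if null or empty.
    public static String initialCaps(String input) {
        if (input == null || input.isEmpty()) {
            return input;
        }
        StringBuilder out = new StringBuilder(input.length());
        boolean newWord = true;
        for (char c : input.toCharArray()) {
            out.append(newWord ? Character.toUpperCase(c) : Character.toLowerCase(c));
            newWord = Character.isWhitespace(c);
        }
        return out.toString();
    }
}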

It is also possible to create new connectors and components using JavaJet, and whilst this is not required for many projects, it is a very useful means of creating re-usable code. Testament to this is that there are at least as many downloadable community components as there are in the main product release.

To be honest, the whole Java thing took me a while to come to terms with, and I kept asking, "Why do I have to learn Java in order to use a data integration tool?" It just didn't feel right, but then eventually it dawned on me. All DI tools have some internal, often proprietary, scripting language that has to be used for handling more complex requirements, so why not Java? It is open, incredibly rich and well established, and it has a thriving community that is constantly extending it... perfect.

Getting over that mental hurdle shed a whole new light on the Talend/Java relationship and I finally began to embrace Java as the scripting language, making the whole process of understanding what "really makes Talend tick" much, much easier.

I am now a big fan of Talend. It is pretty easy to pick up and run with, and for many applications only a smattering of Java would be needed (Talend does provide some functions out of the box). It has a huge range of connectors to many databases and business packages, a thriving community, and it is being extended day by day. For many organisations it would probably integrate across their whole heterogeneous environment without any customisation required, but should customisation be needed there is the whole, vast Java language to pick and choose from, which is no bad thing at all.

I have to confess that I have told only part of the story at this point, in that I have touched very lightly on the breadth of what Talend offers in this blog.

When referring to Talend, I have been referring to the company's integration offerings (Open Studio and Integration Studio). In the last year Talend has also added Data Quality and Master Data Management tools to its portfolio and has recently acquired Sopera to extend its products into the middleware space.

So Talend is much more diverse than I expected, both within the integration product and beyond it, and it has all the capabilities I would expect from any serious player in the data integration market.


Posted January 7, 2011 2:51 PM

Last week I attended the Data Migration Matters conference (http://datamigrationmatters.co.uk/) in London. What I learned was that while there are differences in approach between integration and migration, there are also common factors, two of which I will cover in this blog.

The first is customer involvement. The data in an organisation is utilised by the business, defined by the business and ultimately informs business decisions, so any project, IT or otherwise, that is required to make changes to data ultimately needs business buy-in and involvement.

The second is understanding the data.  Many projects have failed because of incorrect assumptions and gaps in knowledge that only manifest themselves when a project is in full flight.  It is imperative that the source data is mapped out and understood prior to coding.

In some ways these two requirements go hand in hand.  To understand and make sense of the data, you need the business to add their experience of using it to the mix.  To involve the business, you have to be able to deliver technical information in such a way that it becomes possible to interpret the data in a non-technical way.

This is where data profiling and quality tools come into their own. These tools analyse the source data and present the user with a high-level view of it, enabling the user to see patterns, statistics and relationships at both file and field level.

Profiling information is often a revelation for business users. Operationally the data works, so it is deemed fit for purpose; however, when profiled it is not uncommon to see genuine problems with the data, such as duplicate records, missing fields and records, and often just plain incorrect values. The ability to drill down to the actual data is also imperative, in order to show that the statistics marry up to real data on real systems within the organisation.
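
To make the idea concrete, here is a rough sketch, unrelated to any particular profiling product, of the kind of field-level statistics these tools gather: row counts, blank or missing values and the number of distinct values in a column. Real tools add pattern analysis, cross-field and cross-file relationships and the drill-down to underlying records described above. The file layout and class below are invented for the example.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

// Illustrative column profiler: counts rows, blank values and distinct values for one
// column of a comma-delimited file. Real profiling tools do far more than this.
public class ColumnProfileSketch {
    public static void main(String[] args) throws Exception {
        String file = args[0];                   // e.g. customers.csv
        int column = Integer.parseInt(args[1]);  // zero-based column index
        long rows = 0, blanks = 0;
        Map<String, Long> frequencies = new HashMap<>();

        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = in.readLine()) != null) {
                rows++;
                String[] fields = line.split(",", -1);
                String value = column < fields.length ? fields[column].trim() : "";
                if (value.isEmpty()) {
                    blanks++;
                } else {
                    frequencies.merge(value, 1L, Long::sum);
                }
            }
        }
        System.out.println("rows=" + rows + " blank=" + blanks + " distinct=" + frequencies.size());
    }
}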

It is often at this point, when the illusion of "perfect data" evaporates, that the business buys into the project and begins to understand why the ownership of the data and the associated business rules fall squarely within their domain.  It is surprising how showing people their own data "warts and all" can have a profound effect on their involvement in a project. 

How often have we heard the phrase "if it isn't broken, don't fix it"? For many users their data isn't broken, so it is perhaps hard for them to understand why IT makes such a fuss during a data migration.

The truth is that for many organisations their data is, to varying degrees, somewhere between broken and fixed and it is only when it is utilised en masse, say for reporting or migration, that problems suddenly begin to appear.


Posted May 17, 2010 9:30 AM

It's been a while since my last blog, mainly because I've been busying myself with setting up a new business, edexe (www.edexe.com).  The mission for edexe relates in many ways to this blog post: performance, price and productivity.

As a result of some recent work, I am stunned by how much the cost of data warehousing has come down in the last couple of years.

MPP (massively parallel processing) technologies have fallen in price, on both the ETL/DI front and the database front, and database automation tools are becoming more popular as a result.

Competition is hot with regard to MPP in the database/appliance market. Teradata and Netezza have been the dominant players for the last 5-6 years; however, a host of new appliance, cluster-based and columnar databases have hit the market in the last 2-3 years. The increased competition from the likes of Oracle Exadata, HP Neoview, Aster, Greenplum, Kognitio, Kickfire and Vertica is rapidly bringing the price of MPP database processing down below £100k for entry level databases.

This is bringing the MPP database well within the reach of mid-tier companies, enabling enterprise performance on a budget.

On the data integration front, expressor's parallel processing engine delivers speeds that compare with, or even beat, the most established DI vendors such as Ab Initio, Informatica and DataStage, yet it remains priced below £50k for an entry level DI product. Talend also delivers an MPP option to the market with the MPx version of its integration suite.

Again, high performance DI/ETL products are now available to the mid-tier company.

So I've discussed price and performance; what about productivity?

Well, the MPP database products deliver significant productivity benefits over the traditional relational databases on which warehouses have usually been built, such as Oracle and SQL Server. The standard here was set by Netezza, delivering MPP performance with minimal DBA activity. Even in large organisations, Netezza DBAs may spend only one day a week maintaining the system. Within the MPP space this is pretty much standard, with self-organising databases being the norm.

Between DI tools there is little to choose in terms of productivity; however, when compared to SQL or other hand-cranked coding approaches, the productivity gains are huge (anywhere from a 50-80% reduction in coding time).

Staying with productivity, one tool that has really impressed me is BIReady. It is a database automation solution that really does deliver on productivity, on two main fronts: changes to the data model do not necessarily require changes to the data structures, since data is automatically organised in a model-independent normalised schema; and key assignments are managed within BIReady, so they need not be maintained by the ETL solution. This is a significant productivity gain, reducing DBA activity (like the MPP databases), simplifying the ETL process and shortening development times by taking away the need for key management. What's more, BIReady's pricing also fits comfortably into mid-tier budgets.
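
To illustrate what taking key management away from the ETL actually saves, here is a generic sketch of the bookkeeping an ETL process normally has to carry (this is not how BIReady implements it; the class is invented for the example): every business key that arrives must be matched to an existing surrogate key or allocated a new one.

import java.util.HashMap;
import java.util.Map;

// Schematic surrogate key bookkeeping of the kind an ETL process usually has to
// maintain itself. A tool that manages keys removes the need for code like this.
public class SurrogateKeyAssigner {
    private final Map<String, Long> keysByBusinessKey = new HashMap<>();
    private long nextKey = 1;

    // Return the existing surrogate key for a business key, or allocate a new one.
    public long keyFor(String businessKey) {
        return keysByBusinessKey.computeIfAbsent(businessKey, k -> nextKey++);
    }

    public static void main(String[] args) {
        SurrogateKeyAssigner assigner = new SurrogateKeyAssigner();
        System.out.println(assigner.keyFor("CUST-001")); // 1
        System.out.println(assigner.keyFor("CUST-002")); // 2
        System.out.println(assigner.keyFor("CUST-001")); // 1 again - the same key
    }
}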

So there we have it: price, performance and productivity. It is now possible to purchase low maintenance, high performance, end-to-end MPP warehousing technology for under £300k. The nature of this beast also means that the effort to deliver and maintain the solution is reduced.

High performance data warehousing is finally within reach of mid-market companies.


Posted April 28, 2010 3:04 PM

An interesting statistic in a recent survey is that:

"A third of businesses in this country said that keeping project requirements as constant as possible, without allowing too many changes, was key to cutting costs."

I absolutely agree with this statement; however, it is quite difficult to achieve. How many of us understand 100% what we want from a delivery months before go-live? How many of us in the technical world can produce a design that reflects the business requirements 100%? How many people within the business can sign off a technical specification in the certain knowledge that it delivers 100% of their requirement?

I can tell you the answer to these questions: none of us. Certainty is not a word that is often used in the development lifecycle, especially around the translation between business requirements and technical specification.

So is there a way that project requirements can be kept as constant as possible? I think there are approaches that can help mitigate change, and others that allow for it, but none that can eliminate it altogether.

Technology can help, but it needs to be more business friendly. It is widely accepted that the more the business can do in a project, the more successful that project will be, so the easier a technology is to use and understand, the more likely it is that the business will use it proactively.

Agile can help, in that it is an approach that allows for change; however, adopting Agile can be difficult and does require a step change in thought processes.

Ultimately the answer rests with communication.  IT and business need to find ways of communicating requirements and potential solutions in a manner that breeds mutual understanding. 

My personal view is that effective prototyping is the key.  It is standard to produce "mock ups" in many design functions in many industries, so perhaps we in the data world should take a leaf out of their book and find ways to share technical ideas in a much more "tactile" fashion.

 

Posted December 11, 2009 4:31 PM

I'm really looking forward to an X88/expressor/Emunio webinar next week. X88 is a new entrant into the data profiling market, but it does much, much more than just profiling.

 

For a start, the product will profile and automatically cross-reference all data loaded into its database, and it can handle very high volumes. That's great as it stands; however, what it allows you to do with that data is even better!

 

Using the loaded data, X88 provides a business-friendly interface for prototyping transformation rules, so that the effect of each transformation can be viewed in real time. This provides a visual, real-time and iterative method of understanding the effects of your business transformation and cleansing rules.

 

And it keeps on getting better...

 

Once you have completed your prototyping you can then a) produce a specification for your ETL tool detailing the transformations, and b) create a file of the transformed data. This file can then be used to test your ETL code (if it doesn't match the ETL output then the code is wrong!) and to test your target systems to check the correctness of the prototype.
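
As a simple illustration of that test step (the file names, and the assumption that both files are written in the same order, are mine), checking the ETL output against the prototype's file can be little more than a line-by-line comparison:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Objects;

// Illustrative only: compare the prototype's transformed file with the ETL output
// line by line and report the first mismatch. Assumes both files are written in the
// same order; real comparisons may need key-based matching.
public class CompareOutputs {
    public static void main(String[] args) throws Exception {
        try (BufferedReader prototype = new BufferedReader(new FileReader("prototype_output.csv"));
             BufferedReader etl = new BufferedReader(new FileReader("etl_output.csv"))) {
            long lineNo = 0;
            while (true) {
                String p = prototype.readLine();
                String e = etl.readLine();
                lineNo++;
                if (p == null && e == null) {
                    System.out.println("Outputs match (" + (lineNo - 1) + " lines)");
                    return;
                }
                if (!Objects.equals(p, e)) {
                    System.out.println("Mismatch at line " + lineNo);
                    System.out.println("  prototype: " + p);
                    System.out.println("  etl      : " + e);
                    return;
                }
            }
        }
    }
}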

 

All of this is done without the need for any traditional coding, which means that timescales for business rules and timescales for testing can be aggressively reduced.

 

Add to this expressor's semantic DI capability for the ETL, which incidentally has the effect of compressing development timescales, and this solution has a significant impact along the whole development lifecycle.

 

If you want to find out more, then register for the webinar here.


Posted November 13, 2009 2:04 PM

I promised in last week's blog, and in one of my earlier blogs, that I would provide detail on how a DW can be enhanced to enable faster and more effective delivery (as opposed to 3 months+), and I'm finally delivering on that promise in this entry.

Essentially the answer is to use a data virtualisation tool to provide a "fast track" data delivery mechanism to rapidly integrate new data into a virtual schema.

So what problems will this approach solve?  I have listed some of the more obvious ones below:

  • The business needs data next week but the delivery lifecycle requires 3 months
  • The business has a one-off data feed but adding it to the warehouse is the only option
  • The business has a proliferation of uncontrolled data marts, built because they did not have time to wait for the delivery lifecycle
  • The DW is not up to date enough for the business need
  • DW and ETL designs are just plain wrong because the business rules are wrong
  • The DW and ETL process are bloated with unnecessary data (often these one-off feeds!)

All these problems lead to raised costs and loss of competitive advantage, so how can virtualisation help?

  • The tools are simple for end users to use - new views are quick to create
  • Complete subject areas can be virtualised, so that new dimensions can be added very quickly
  • Queries can be run against the virtual views in real time (see the sketch after this list)
  • Business rules and data relationships can be prototyped and understood prior to instantiation into the DW schema and ETL code
  • The tables and the data do not physically exist, so when you are done with them there is nothing to clean up
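
As a sketch of what the "fast track" looks like from the consuming side, most virtualisation servers publish their virtual views over standard interfaces such as JDBC, so a newly virtualised subject area can be queried like any other database. The driver URL, credentials and view name below are invented purely for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Purely illustrative: query a virtual view exposed by a data virtualisation server
// over JDBC. The URL, credentials and view name are invented for this sketch.
public class VirtualViewQuery {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:virtualisation://dv-server:9401/sales_virtual";
        try (Connection conn = DriverManager.getConnection(url, "report_user", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT region, SUM(order_value) FROM v_orders_with_new_feed GROUP BY region")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " : " + rs.getBigDecimal(2));
            }
        }
    }
}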

Virtualisation should not be seen as a replacement for a warehouse or for ETL, however. Federated queries can impact source systems, so balancing business needs against source system impact is key, which is why I describe this approach as a complement to the existing DW rather than a substitute for it.

What I will say, though, is that virtualisation can give users the flexibility they need and the DW team the breathing space to ensure that DW and ETL changes are correct and focused on genuine long-term DW additions.

For those who are interested in this approach, Composite Software has recently issued a paper focused on how virtualisation can complement the DW. It's available from their website (www.compositesoftware.com).


Posted October 3, 2009 9:44 AM

I'm back off my holidays and feeling very refreshed and raring to go. I'd like to say thanks to Phil for covering for me, and I will be following on from his theme in a later blog. (For those following: I know I have made a promise like this before, in a blog about warehouse delivery times, and haven't delivered yet, but please hang on in there. It is on my guilt list and due soon!)

Anyway, on to the business at hand: I had the pleasure of attending Enzee Universe in London last week. It was a fabulous affair, and the new TwinFin appliance was unveiled. With TwinFin, Netezza has again raised the bar from the price/performance perspective.

 

Although I have only had a brief overview of the new architecture, the key difference is the move to commodity hardware in the form of Intel-based blades, making the technology much more approachable. The FPGA technology is still present, retaining Netezza's "secret ingredient" and one of the key components that delivers the outstanding performance.

 

Netezza has also been busy building a set of "on the box" match-ups with other vendors, and one I am particularly enamoured of is the new KONA platform. KONA is a marriage of Kalido and Netezza, and for anyone considering a new warehouse or re-architecting a legacy platform it is very much worth a look. Kalido's ease of set-up and long-term maintenance and Netezza's performance are a potent mix indeed. It is in effect a whole data warehouse in a single box, which is a very attractive prospect from a management perspective. At the moment this offering is geared towards specific verticals; however, I would anticipate it being opened out in time.

 

So, all in all, a great day, and once again Netezza continues to demonstrate the forward thinking that has made it the market leader in appliance technologies.


Posted September 28, 2009 2:18 PM

The market for data integration tools has never been as strong or as rich as it is today. Ten or more years ago, today's giants in the marketplace offered the classic Extract, Transform and Load capability, and they have been evolving their platforms ever since to provide the current batch of feature-rich, high performance packages that now extend to cover data migration, data quality (profile, monitor and cleanse), web services and enterprise metadata management.

 

A recent Gartner paper (available courtesy of Syncsort here) reveals that the market for Data Management and Integration tools is worth nearly $1.7 billion and is expected to grow to $2.7 billion by 2013. This growth is driven by an awareness of the high cost of delivering data-centric projects using the manually intensive programming techniques of 3GL languages, and businesses are increasingly attracted to the productivity gains offered by data integration tools. At the same time, businesses listen intently to the marketing machines and accept at face value that the data integration tool they end up buying really will be the silver bullet that solves all their data problems.

 

Wrong, wrong, wrong.

 

Choosing a data integration technology is just the first step on a journey to improving productivity and responsiveness within a business; making it work over the long term is a little more difficult. Having worked on countless data integration projects over the last 12 years, my biggest source of frustration is when the customer has been set unrealistic expectations about how easy it is to work with the technology.

 

Yes, DI tools are certainly an order of magnitude easier than hand-cranking code, but the architecture will not take care of itself, and the out-of-the-box settings almost always last little more than a few months before progress falters - it may even halt until things are fixed.

 

Why does this happen?

 

Most DI tools find their way into a business following a proof of concept project. Proofs of concept are just that: they prove something, and the intent is to prove the concept as quickly as possible. They don't usually produce production-ready code, and the environment isn't usually set up at this stage to support wide-scale usage. The result is usually impressive but can also be very quick and dirty.

 

It also helps to understand the business model of the vendor, and therefore their ultimate motivation. Some vendors don't have professional services and rely on licences and maintenance as their main income stream. This leads to the possibility that the immediate sale becomes the focus, at the expense of making it last.

 

Credit to the vendor presales teams who perform the POCs. They are highly skilled and deliver at tremendous speed. The problem, however, is that they make it all seem very easy... too easy. Remember: just because the vendor does it with consummate ease and the tool is shiny, it doesn't mean that even your brightest technical people can achieve the same feat the day after returning from the training course. Your team will need support to get to the apex of that learning curve as quickly as possible - particularly around implementing a stable and scalable architecture.

 

So ask your vendor what their business model is. If they're not geared up to do professional services, you probably need to find a partner to help you with the transition of your delivery team from newbies to experts - and the sooner the partner gets involved, the fewer long term problems you're likely to see. 

 

Any integration partner worth his or her salt will start with a discovery phase, during which they audit the environment and map the business needs to a technology strategy and plan. I have been in this position many times, and in very short order I often find that the software has been configured in a way that won't scale, whether in terms of performance, complexity or enterprise growth. What follows is a difficult conversation with the business sponsor, who still can't believe that the wonder tool has failed to deliver.

 

The moral of the tale is this: when you start evaluating a data integration tool, begin to evaluate an integration partner that you can trust and work closely with. Get them in early, preferably when you're doing the proof of concept with the vendor, so they can ensure a smooth transition between the vendor and your internal team. They can help with your evaluation criteria, your architecture and governance, and help you avoid pain later in your delivery programme.

 

It's often said that there's no substitute for experience, and that's never truer than when applied to data integration projects; a few days' consulting early in the project lifecycle can save tens or even hundreds of thousands of pounds down the line.

 

 


Posted September 7, 2009 9:05 AM

I am going on holiday for a couple of weeks (well earned... and needed!), so I will not be blogging for that period. In the meantime one of my colleagues at Emunio, Phil Watt, has volunteered to stand in for me for the fortnight.

Phil is one of our Principal Consultants and has deep experience in all things DI.

 

Thanks for covering for me, Phil; I'm looking forward to reading your entries when I come back.


Posted August 11, 2009 6:53 PM

One of my favourite quotations is from Abraham Maslow, who said, "It is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail." Ever since I first heard someone use this quotation it has resonated with me, especially as I have seen examples of this behaviour first hand on many occasions and in many different problem contexts. For this blog, though, I'm going to concentrate on one specific example that can help to improve DI delivery.

DI tools are great. They are faster and easier to use than many other technologies, hence their success in the marketplace. But let's not forget what they are designed for. The clue is in the name "Data Integration", where "data" is the key word. They are great at dealing with data. But the DI tool is just one tool in the kit bag. Another, often overlooked, tool is the operating system that the DI tool resides on.

Operating systems are very good at handling files, directory structures and hardware.  It is their raison d'etre and it is therefore no surprise that O/S vendors have developed their own tools to effectively handle these resources.  Some of these may not be as pretty as the DI tools, but used properly they can be more effective than any other tools to hand when dealing with files and file names.

So, for example, let's take a directory containing many files, all of which have been landed overnight with the wrong date (e.g. infile001_20090701, infile002_20090701 etc.). The batch process has failed to update the target system and the pressure is on from the business to get the data, so here is my solution (written in Korn shell script):

for i in `ls infile*_20090701`; do j=`echo $i | cut -d"_" -f1 `_20090702; mv $i $j;done

One line of shell script!  It took me about 15 minutes to write this.  (Please note that I'm a little rusty now, but any one of Emunio's employees could do this in much less time.) 

In my experience this simple task would take orders of magnitude longer using a DI tool, because this small problem is not the nail to the DI tool hammer.

My point here is that every DI developer has an additional tool in their kit bag. Those technicians who take time out to understand their O/S and to practise with the supplied tools will be much more effective than those who don't.

This also applies at the organisational level.  DI Managers should encourage a culture of learning and development such that the practitioners have the ability to determine when and when not to use their tool of choice.  This should also be reflected in sensible standards for DI tool use; they should recognise the power of the O/S supplied tools, encourage their use and in a small way increase the productivity of each member of the DI team.


Posted July 31, 2009 4:15 PM