Unstructured Data Processing – Why Textual ETL? by Krish Krishnan - BeyeNETWORK UK


 

Unstructured Data Processing – Why Textual ETL?

Originally published 21 November 2011

Until the last decade, organizations relied on legacy systems, enterprise applications and market data gathered by analysts to make decisions for the business. To make any and all detailed, operational, up-to-the-second decisions, the systems that were in place worked just fine. To take care of any and all detailed analysis and reporting, the data warehouse and data marts were implemented.

As time progressed, organizations needed analytics and key performance indicators (KPIs) not only on data across all of these systems, but also on data from sources beyond them, including the Internet, content management platforms and more. We discovered that we cannot simply process unstructured data (text, documents, policies, PDFs, contracts), semi-structured data (email, forms) or the contents of content management systems (SharePoint, Documentum) with processing techniques from the existing systems.

Structured ETL

Structured extract, transform and load (ETL) is used to transform data from corporate and legacy applications so that the data – once transformed into a uniform, corporate structure – can be examined and analyzed consistently. Structured ETL addresses data integration – transformation, encoding, formatting, DBMS conversions, dimensions of attributes and more.

An example of ETL processing is as follows: Data representing gender is encoded in the input data in the form of (male/female), (m/f), (x/y), and (1/0) by different applications across the enterprise. Once processed, the output for gender is converted and specified simply as (m/f). Another example is the units of measure used for data attributes in the legacy or application environment. Lengths may be measured in inches, centimeters or feet. As output of ETL, the data is converted and length is measured uniformly (for example, in centimeters).
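The two conversions above can be sketched in a few lines of Python. This is only an illustration of the idea, not code from any particular ETL tool. The function and table names are hypothetical, and the article does not say which member of the (x/y) and (1/0) pairs stands for which gender, so that mapping is an assumption.

```python
# Hypothetical lookup table. The encodings come from the example above,
# but which side of (x/y) and (1/0) means "m" or "f" is an assumption.
GENDER_MAP = {
    "male": "m", "female": "f",
    "m": "m", "f": "f",
    "x": "m", "y": "f",   # assumed order
    "1": "m", "0": "f",   # assumed order
}

# Conversion factors to the uniform output unit (centimeters).
CM_PER_UNIT = {"inches": 2.54, "centimeters": 1.0, "feet": 30.48}


def normalize_gender(raw):
    """Map any known source encoding to the corporate standard 'm' or 'f'."""
    key = str(raw).strip().lower()
    if key not in GENDER_MAP:
        raise ValueError(f"unknown gender encoding: {raw!r}")
    return GENDER_MAP[key]


def to_centimeters(value, unit):
    """Convert a length in inches, centimeters or feet to centimeters."""
    key = unit.strip().lower()
    if key not in CM_PER_UNIT:
        raise ValueError(f"unknown length unit: {unit!r}")
    return value * CM_PER_UNIT[key]
```

A call such as normalize_gender("Female") yields "f", and to_centimeters(2, "feet") yields the same length expressed in centimeters. Note that every encoding is known in advance and fits in a small table, which is exactly what makes structured ETL tractable.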

Enter Unstructured Data

Nearly all legacy data is structured. Structured data is repetitive and is defined by attributes and keys that recur over and over. But not all data is structured. There is unstructured data as well. Much textual, unstructured data is found in the corporation. In fact, it is estimated that 80% or more of the data in the corporation is in the form of unstructured text.

Textual data comes in many forms and from many places. Forms of textual data include email of different types; corporate contracts with multiple vendors, employees, customers and more; human resource files; medical records; financial reports; and corporate memos.

How will you read any or all of this data in a given circumstance? Trying to read and analyze textual data without first integrating the text is simply an exercise in futility. There are many reasons why raw text must be integrated before it is useful for analysis. While standardization of data is wonderful in the structured world, as you start looking into the unstructured world, you will quickly realize the challenges that exist in standardizing that data.

Technology advances in the last five years have given us platforms such as Hadoop, NoSQL, MapReduce and Ruby. These platforms have been engineered to meet needs that existing infrastructure handles poorly, such as elastic scalability, compute on demand, self-tuning and redundancy. They have created a very robust infrastructure for Internet workload demands and paved the way for Facebook, MySpace, Twitter, Groupon and many other new business ventures that create and process large volumes of data on a daily or hourly basis. One can argue that, using the MapReduce platform, we can solve the unstructured data integration problem. While this is true, it brings enormous problems with it, including:
  • You are going to be relying on application programming to solve the data problem.

  • The amount of programmatic code needed to solve the data processing and text mining in the unstructured world will require an army of developers.

  • You will need to integrate taxonomies and rules for processing text or semi-structured data, which will bring about its own complexities.

  • You will need to maintain the custom code.
In a nutshell, any unstructured data processing effort that relies on writing code, in any programming language, will create a dependency on IT and will rob business users of the ability to navigate the data according to their own contextual analysis. Additionally, writing code will not solve the real problem. Business users will be able to analyze data in one dimension, as they would any structured data, but they won’t be able to interrogate the data with English-like business rules. Thus, this approach will end up as an attempt, albeit a good one, to mitigate the risks.
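To see what this dependency looks like in practice, consider a minimal sketch of a hand-coded, map/reduce-style job written in plain Python rather than against Hadoop itself. Everything in it is hypothetical: the taxonomy, the function names and the document ids are invented for illustration only.

```python
import re
from collections import defaultdict

# Hypothetical taxonomy, hard-coded by a developer: term -> category.
TAXONOMY = {"chlorine": "toxin", "arsenic": "toxin", "atropine": "antidote"}


def map_phase(doc_id, text):
    """Emit a (category, doc_id) pair for each taxonomy term in the text."""
    for word in re.findall(r"[a-z]+", text.lower()):
        category = TAXONOMY.get(word)
        if category is not None:
            yield category, doc_id


def reduce_phase(pairs):
    """Group document ids by category, without duplicates."""
    grouped = defaultdict(set)
    for category, doc_id in pairs:
        grouped[category].add(doc_id)
    return {category: sorted(ids) for category, ids in grouped.items()}


def run_job(documents):
    """Run the map phase over every document, then reduce the results."""
    pairs = []
    for doc_id, text in documents.items():
        pairs.extend(map_phase(doc_id, text))
    return reduce_phase(pairs)
```

Even in this toy version, the taxonomy and the matching rule live inside the program. Adding a term, or changing what counts as a match, means a code change and a redeployment by a developer.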

The reason for this sentiment is that processing any kind of data in the enterprise is a process that is defined and owned by the business, as they own the lifecycle of data for the enterprise. When it comes to processing unstructured data, the only people in any enterprise who can own and define the rules for this unstructured data are the business users. But business users cannot write ETL or Hadoop code. This is where you will need textual ETL.

Textual ETL, as the name suggests, is a processing technique to solve the problems of unstructured data processing; but unlike other software or rules engines, it is a multi-step process that guides a business user to define the rules for processing any form of unstructured data. Let me explain this with an example of an emerging rules engine, Forest Rim Textual ETL™.

Toxic chemicals – Toxic chemicals can affect you anywhere and any day. Nobody can accurately anticipate and prepare for toxic chemical attacks. Imagine a cloud-based app that can provide vital information on basic toxins and antidotes as well as potential combinations of toxins and their antidotes. Such a thing is possible when you use textual ETL. You can process all types of text on toxins, including images and videos. Then with enriched metadata and the availability of a taxonomy, this app can be run from your smartphone or tablet anywhere in the world, providing potentially life-saving information.

When you want to create this app, you need a few things:
  • Lightweight interface

  • Ability to access data on demand

  • Ability to parse and navigate the data

  • Quick performance

  • User scalability and concurrency

In summary, you need to be able to create Google-like behavior, but highly subject oriented, integrated, time variant and non-volatile.

This is where a product like Textual ETL™ is useful. It allows you to define your business rules in plain English and process your documents through the engine. If you add more rules as you discover insights, you can reprocess documents any number of times. The engine has a built-in machine learning capability that will capture rules and enable you to process data over and over. The output from the engine is a highly usable metadata-based set of information that is ready for consumption with the associated contexts. On top of this, you can simply add a search appliance and you are ready to start exploration. This is something that you cannot get done in a small time frame if you take the usual route of coding. With a textual ETL product, you will have satisfied all the conditions, and yet have a very flexible and scalable architecture.
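The general idea of keeping rules outside the program, and reprocessing documents as rules are added, can be illustrated with a small Python sketch. It is not the interface of Forest Rim Textual ETL or of any other product, and it leaves out the plain-English rule definition and machine learning described above. The rule format, the apply_rules function and the sample memo are all hypothetical.

```python
import re


def apply_rules(documents, rules):
    """Return one metadata record per rule match, with surrounding context."""
    records = []
    for doc_id, text in documents.items():
        for rule in rules:
            for match in re.finditer(rule["pattern"], text, flags=re.IGNORECASE):
                # Keep up to 20 characters on either side as context.
                start = max(0, match.start() - 20)
                end = min(len(text), match.end() + 20)
                records.append({
                    "doc": doc_id,
                    "tag": rule["tag"],
                    "term": match.group(0),
                    "context": text[start:end],
                })
    return records


# Hypothetical sample document and rule set.
documents = {"memo-1": "Chlorine gas was released; atropine was not required."}
rules = [{"tag": "toxin", "pattern": r"\bchlorine\b"}]
first_pass = apply_rules(documents, rules)

# A new insight becomes a new rule; the same documents are reprocessed.
rules.append({"tag": "antidote", "pattern": r"\batropine\b"})
second_pass = apply_rules(documents, rules)
```

Here the rules are data rather than code, so extending them does not require touching the processing function, and each output record carries the context in which the term appeared.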

This is why you need textual ETL processing, and this is where unstructured data or big data processing succeeds or fails. With this approach, you enable the business user to own the creation of the business rules that interrogate this data and to process it multiple times for each rule condition and context, which will resolve content disambiguation. This process will also resolve the ownership question of whether IT or business is the responsible owner for unstructured data processing in the enterprise.

Remember that processing unstructured data is a solution architecture that will use a myriad of technologies, but a product like Textual ETL™ will simplify the processing portion of that solution architecture and will put the power of defining the business rules for data interrogation in the hands of the business users.


  • Krish Krishnan
    Krish Krishnan is a worldwide-recognized expert in the strategy, architecture, and implementation of high-performance data warehousing solutions and big data. He is a visionary data warehouse thought leader and is ranked as one of the top data warehouse consultants in the world. As an independent analyst, Krish regularly speaks at leading industry conferences and user groups. He has written prolifically in trade publications and eBooks, contributing over 150 articles, viewpoints, and case studies on big data, business intelligence, data warehousing, data warehouse appliances, and high-performance architectures. He co-authored Building the Unstructured Data Warehouse with Bill Inmon in 2011, and Morgan Kaufmann will publish his first independent writing project, Data Warehousing in the Age of Big Data, in August 2013.

    With over 21 years of professional experience, Krish has solved complex solution architecture problems for global Fortune 1000 clients, and has designed and tuned some of the world’s largest data warehouses and business intelligence platforms. He is currently promoting the next generation of data warehousing, focusing on big data, semantic technologies, crowdsourcing, analytics, and platform engineering.

    Krish is the president of Sixth Sense Advisors Inc., a Chicago-based company providing independent analyst, management consulting, strategy and innovation advisory and technology consulting services in big data, data warehousing, and business intelligence. He serves as a technology advisor to several companies, and is actively sought after by investors to assess startup companies in data management and associated emerging technology areas. He publishes with the BeyeNETWORK.com where he leads the Data Warehouse Appliances and Architecture Expert Channel.

