Database Technology for the Web: Part 3 – Hadoop

Originally published 25 August 2009

This month, in my continuing series of articles on Web database technologies, I take a look at Hadoop, an open source Apache project that provides a Java software framework for building distributed applications involving petabytes of data stored on large clusters of commodity hardware. 

Hadoop Origins

Hadoop was created by Doug Cutting, who recently left Yahoo to join Cloudera (a company that provides commercial support and professional training for Hadoop). The Wikipedia Hadoop entry notes that Cutting named the project after his child’s stuffed elephant! His blog traces the history of Hadoop back five years, to when he was working on the Apache Nutch open source web search engine with Mike Cafarella from the University of Washington. Inspired by Google’s technical papers on the distributed Google File System (GFS) and the MapReduce distributed data processing model, Cutting and Cafarella built a distributed file system that could run Nutch in parallel on twenty machines with near-linear scaling.

Cutting then joined Yahoo to work with a team of developers to extend the Hadoop distributed file system to support billions of web pages. The Yahoo team moved the Nutch distributed computing code into the Apache Hadoop project, added new features to Hadoop, and improved its scalability, performance and reliability. Today, Hadoop is used to generate and support Yahoo’s web search index and holds the record for the big data sort benchmark by sorting 100 terabytes of information in 173 minutes. Facebook also uses Hadoop to support its 2.5 petabyte system, which ingests some 15 terabytes of data per day.

Yahoo continues to be a large contributor to the Apache Hadoop project. IBM and Google also have a major initiative to use Hadoop to support university courses in distributed computer programming.

What is Apache Hadoop?

The Apache Hadoop project consists of several subprojects:
  • HDFS is a distributed file system that stores and replicates large files across multiple machine nodes without the need for RAID storage. Nodes can communicate with each other to move and rebalance data and to keep a high level of data replication and availability.

  • MapReduce is a software framework for distributing the processing of large data files across multiple machine clusters. For more information on MapReduce, see my BeyeNETWORK article "Database Technology for the Web: Part 1 – The MapReduce Debate."

  • HBase is a column-oriented, distributed database modeled after Google's BigTable storage system and is written in Java. It runs on top of HDFS.

  • Hive provides tools for batch-style query, analysis and summarization of large Hadoop data files. It provides a query language called QL that is based on SQL and that also allows MapReduce developers to plug in custom map and reduce programs to do more sophisticated analysis. It was developed at Facebook and contributed to the Apache project.

  • Pig is a platform that provides an alternative to MapReduce for analyzing large Hadoop data files. It consists of a high-level language for expressing data analysis programs and a parallel processing framework for running those programs.

  • Avro is a data serialization system that can be integrated with dynamic scripting languages.

  • ZooKeeper is a centralized service for maintaining and coordinating configuration, naming and other types of information in a Hadoop distributed processing environment.

  • Chukwa collects and analyzes data about Hadoop operations.

  • Hadoop Common is a set of common utilities that support other Hadoop subprojects.
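
To make the MapReduce programming model described above more concrete, here is a minimal single-process sketch in Python of the classic word-count job. It does not use Hadoop or any Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are my own illustrative choices, and the grouping step simply stands in for the shuffle that the framework would carry out across a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework on a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over an in-memory list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    docs = ["hadoop stores data", "hadoop processes data in parallel"]
    print(word_count(docs))
```

On a real cluster the map and reduce functions would run in parallel on many nodes against files held in HDFS; only the two user-written functions would look broadly similar to this sketch.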

Why Hadoop?

A web search on the pluses and minuses of Hadoop returns a large number of entries with widely differing viewpoints, depending largely on whether the writer comes from a traditional enterprise IT environment involving relational DBMS products or from an environment with a strong presence of web open source technology. Two more balanced perspectives are "SQL and Hadoop" by Michael Wexler of Yahoo and "The Commoditization of Massive Data Analysis" by Joseph Hellerstein, a professor of computer science at UC Berkeley. I have reproduced a table from the latter article below. The table provides a good starting point for positioning Hadoop with respect to relational DBMS technology.

Relational Databases | MapReduce (Hadoop)
Multipurpose: useful for analysis and data update, batch and interactive tasks | Designed for large clusters: 1000+ computers
High data integrity via ACID transactions | Very high availability, keeping long jobs running efficiently even when individual computers break or slow down
Lots of compatible tools (e.g., for loading, management, reporting, data visualization and mining) | Data is accessed in "native format" from a filesystem – no need to transform data into tables at load time
Support for SQL, the most widely-used language for data analysis | No special query language; programmers use familiar languages like Java, Python, and Perl
Automatic SQL query optimization, which can radically improve performance | Programmers retain control over performance, rather than counting on a query optimizer
Integration of SQL with familiar programming languages via connectivity protocols, mapping layers and user-defined functions | The open-source Hadoop implementation is funded by corporate donors, and will mature over time as Linux and Apache did

Source: Joseph Hellerstein, UC Berkeley

Leaving the programming aspects aside, which I have already covered in my earlier BeyeNETWORK article on MapReduce, three other key Hadoop considerations are integrity, availability and performance.

In general, relational DBMSs provide high transaction integrity and support complex data relationships, whereas Hadoop provides high availability through software, which enables the use of low-cost commodity hardware. Of course, integrity and availability features are moving targets. The same can be said for performance. For those interested in benchmark data, the following reports provide useful performance data:

Hadoop will outperform a relational DBMS when processing huge amounts of data in batch on very large computer configurations using mass parallelism. This brute force approach is especially applicable in situations where it is difficult for the relational optimizer to exploit the more advanced data management techniques offered by relational DBMS products. An example here is in the processing of large unstructured data files. For more traditional query and interactive processing against structured data, however, relational DBMS products will usually outperform Hadoop. 

Query performance is not the only consideration. The cost of extracting data and loading it into a data store must also be taken into account. Extracting data from several hundred log files, for example, is a time-consuming process. The log data may also be generated in such volumes that traditional relational DBMS transformation and load programs cannot keep up. So while Hadoop query times may be slower in some cases, overall access to the data may be faster because the data can be processed in its native format without a separate transformation and load step. Hadoop can also be used to preprocess and filter data before it is loaded into a relational DBMS.
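
The preprocess-and-filter pattern can be illustrated with a small Python sketch. This is a toy, not a Hadoop job: the log layout and the names RAW_LOG, parse_and_filter and load are invented for the example, and an in-memory SQLite database stands in for the relational DBMS. The point is only that malformed and uninteresting records are dropped before the load, so the database receives a much smaller, cleaner data set.

```python
import sqlite3

# Hypothetical raw log lines in their "native format"; this layout is invented.
RAW_LOG = [
    "2009-08-25 10:00:01 GET /index.html 200",
    "2009-08-25 10:00:02 GET /missing.html 404",
    "corrupted line",
    "2009-08-25 10:00:03 POST /login 500",
]


def parse_and_filter(lines):
    """Keep only well-formed error records (HTTP status 400 or above)."""
    for line in lines:
        parts = line.split()
        if len(parts) != 5 or not parts[4].isdigit():
            continue  # drop malformed lines before they reach the database
        day, time, method, path, status = parts
        if int(status) >= 400:
            yield day, time, method, path, int(status)


def load(rows):
    """Load the reduced data set into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE errors "
        "(day TEXT, time TEXT, method TEXT, path TEXT, status INTEGER)"
    )
    conn.executemany("INSERT INTO errors VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


if __name__ == "__main__":
    db = load(parse_and_filter(RAW_LOG))
    print(db.execute("SELECT path, status FROM errors").fetchall())
```

In a real deployment the filtering step would run as a parallel batch job over files in the distributed file system, and only its output would be passed to the relational load utility.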

Who Uses Hadoop?

A wide range of companies are offering Hadoop software solutions and deploying Hadoop applications. A list of some of the key ones can be found on the Apache Hadoop website. Examples include Amazon, Adobe, AOL, Cloudera, Facebook, Fox Audience Network, Google, IBM, New York Times and Yahoo. Several companies are also offering hybrid relational DBMS and Hadoop solutions. Examples include Aster Data, Greenplum and Vertica. A really interesting open source hybrid is HadoopDB developed at Yale University. HadoopDB includes PostgreSQL, Hadoop and Hive, and an interface that accepts queries in MapReduce or SQL. It generates query plans that are processed partly in Hadoop and partly in PostgreSQL instances spread across many nodes in a shared-nothing cluster of machines.
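
The general idea of splitting a query between per-node SQL engines and a MapReduce-style merge can be sketched as follows. This is not HadoopDB's implementation or API: in-memory SQLite databases stand in for the PostgreSQL instances, the visits table and the functions make_node, local_query and merge are hypothetical, and everything runs in one process. Each "node" answers the SQL part of the plan against its own partition, and a reduce-like step combines the partial results.

```python
import sqlite3
from collections import Counter


def make_node(rows):
    """Create one 'node': a private database holding one partition of the data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (page TEXT, hits INTEGER)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
    return conn


def local_query(conn):
    """SQL part of the plan: each node aggregates only its own partition."""
    return conn.execute(
        "SELECT page, SUM(hits) FROM visits GROUP BY page"
    ).fetchall()


def merge(partials):
    """MapReduce-style part of the plan: combine the per-node partial sums."""
    totals = Counter()
    for rows in partials:
        for page, hits in rows:
            totals[page] += hits
    return dict(totals)


# Two shared-nothing partitions of the same logical table.
nodes = [
    make_node([("/home", 3), ("/about", 1)]),
    make_node([("/home", 2), ("/docs", 5)]),
]
result = merge(local_query(node) for node in nodes)
print(result)
```

Pushing the GROUP BY down to each node lets the local database engine do the work it is good at, while the merge step only has to handle the much smaller partial results.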

Where Next?

Hadoop and associated technologies such as MapReduce are getting a lot of attention and are gaining in popularity. For many enterprise applications, however, Hadoop is not an alternative to a relational DBMS. It is useful for certain specialized data-intensive applications that relational DBMSs may not be able to handle, such as the batch processing of huge amounts of web, event, log or scientific data.

Given the above, where is the marketplace heading? I think Joseph Hellerstein summarizes it very well in his blog: “The technical advantages of Hadoop are not intrinsically hard to replicate in a relational database engine; the main challenge will be to manage the expectations of database users when playing tricks like trading off data integrity for availability on certain subsets of the database. Greenplum and Aster will undoubtedly push to stay one step ahead of the bigger database companies, and it would not surprise me to see product announcements on this topic from the more established database vendors within the year.” In general, I agree with this prediction, but of course some organizations may still prefer to use free open source offerings. It is also important to realize that Hadoop involves more than just supporting MapReduce.


  • Colin White

    Colin White is the founder of BI Research and president of DataBase Associates Inc. As an analyst, educator and writer, he is well known for his in-depth knowledge of data management, information integration, and business intelligence technologies and how they can be used for building the smart and agile business. With many years of IT experience, he has consulted for dozens of companies throughout the world and is a frequent speaker at leading IT events. Colin has written numerous articles and papers on deploying new and evolving information technologies for business benefit and is a regular contributor to several leading print- and web-based industry journals. For ten years he was the conference chair of the Shared Insights Portals, Content Management, and Collaboration conference. He was also the conference director of the DB/EXPO trade show and conference.

