Cloud Journal

 

 



New Big Data Project Drill Joins Apache Incubator To Make Hadoop Faster


Written by  Sudheer Raju | 22 August 2012

Just as Google's MapReduce inspired Apache to come up with the open-source Hadoop project, Google's Dremel paper has inspired MapR and Apache to add a new project, Drill, to the Apache Incubator. Drill is a distributed system for interactive analysis of large-scale datasets. It is being designed to process nested data efficiently, with a design goal of scaling to 10,000 servers or more and processing petabytes of data in seconds.

It is worth noting that, as explained by Google in the original paper, Dremel complements MapReduce-based computing. Dremel is not intended as a replacement for MapReduce and is often used in conjunction with it to analyze outputs of MapReduce pipelines or rapidly prototype larger computations.

Drill will integrate closely with Apache Hadoop, with the data living in Hadoop. That is, Drill will support Hadoop FileSystem implementations and HBase. Hadoop-related data formats (e.g., Apache Avro, RCFile) will be supported, and MapReduce-based tools will be provided to produce column-based formats. Drill tables can be registered in HCatalog. Finally, Hive is being considered as the basis of the DrQL implementation.

The Project Drill architecture consists of four key components, three of which are described here:

  • Query languages: This layer is responsible for parsing the user's query and constructing an execution plan. Initial support is for the SQL-like language used by Dremel and Google BigQuery, and the layer is intended to extend to other languages and programming models, such as the Mongo Query Language, Cascading or Plume.
  • Execution engine: This layer is responsible for executing the physical plan. It provides the scalability and fault tolerance needed to efficiently query petabytes of data on 10,000 servers.
  • Nested data formats: This layer is responsible for supporting various data formats. The initial goal is to support the column-based format used by Dremel. Drill is designed to support schema-based formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and schema-less formats such as JSON, BSON or YAML.
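
To make the idea of a column-based layout for nested data more concrete, here is a minimal Python sketch. It is an illustration only, not Drill's or Dremel's actual format: Dremel's encoding also records repetition and definition levels so that records can be reassembled, which this sketch leaves out. The function names (flatten, to_columns, count_by_value) and the sample records are hypothetical.

```python
from collections import defaultdict


def flatten(record, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf in a nested record."""
    if isinstance(record, dict):
        for key, value in record.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(record, list):
        # Repeated fields share one path; an empty list contributes nothing.
        for item in record:
            yield from flatten(item, prefix)
    else:
        yield prefix, record


def to_columns(records):
    """Store each leaf path as its own column of (row_id, value) pairs."""
    columns = defaultdict(list)
    for row_id, record in enumerate(records):
        for path, value in flatten(record):
            columns[path].append((row_id, value))
    return dict(columns)


def count_by_value(columns, path):
    """Count occurrences of each value, reading only the one column needed."""
    counts = defaultdict(int)
    for _row_id, value in columns.get(path, []):
        counts[value] += 1
    return dict(counts)


# Hypothetical sample data: nested user info plus a repeated "clicks" field.
records = [
    {"user": {"name": "ana", "country": "BR"},
     "clicks": [{"page": "home"}, {"page": "docs"}]},
    {"user": {"name": "bo", "country": "US"},
     "clicks": [{"page": "home"}]},
    {"user": {"name": "cy", "country": "BR"},
     "clicks": []},
]

columns = to_columns(records)
print(sorted(columns))
print(count_by_value(columns, "user.country"))
print(count_by_value(columns, "clicks.page"))
```

The point of the layout is that a query touching only user.country never reads the name or clicks data, which is one reason column-based formats suit interactive analysis over very large datasets.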

Championed by none other than Ted Dunning of Hadoop fame, the project's initial committers are employees of MapR Technologies, Drawn to Scale and Concurrent Inc. MapR, a Hadoop distributor that offers its own commercial version of Hadoop, is the leading player in the inception of Drill. “We’ve spent quite a few months talking to lots of organisations and potential users of Drill and to our customer base as well,” said Shiran, who is a founding member of the Drill project. “We wanted to put this out there as an open-source project, rather than just keep it within MapR for our use alone.”


Sudheer Raju


Founder of ToolsJournal, a technology journal on software tools and services. Sudheer has overall accountability for the website's product development and is responsible for Sales and Marketing. With a flair for writing, Sudheer also writes for ToolsJournal across all journal categories.

