Cloud Journal



Facebook Open-Sources Its Version of the Hadoop Big Data Platform

Written by  Sudheer Raju | 12 November 2012

Ever since Google demonstrated the concept of MapReduce, the open source world has been on a roll, producing a variety of big data solutions. At the heart of much of this innovation is Apache Hadoop. Facebook has now open-sourced Corona, its version of the MapReduce platform: a new scheduling framework that separates cluster resource management from job coordination, gives each job a dedicated job tracker, and uses push-based rather than pull-based scheduling.
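To illustrate the difference between pull-based and push-based scheduling, here is a minimal Python sketch. It assumes that in a pull model a worker receives new work only at its next periodic heartbeat, while in a push model the scheduler hands out work as soon as a slot frees up. The function names and the 3-second heartbeat interval are hypothetical, chosen only for illustration; this is not Corona's or Hadoop's actual code.

```python
import math

# Hypothetical heartbeat interval in seconds (an assumption for illustration).
HEARTBEAT_INTERVAL = 3.0


def pull_wait(slot_freed_at: float, interval: float = HEARTBEAT_INTERVAL) -> float:
    """Idle time when work is handed out only at the next periodic heartbeat."""
    next_heartbeat = math.ceil(slot_freed_at / interval) * interval
    return next_heartbeat - slot_freed_at


def push_wait(slot_freed_at: float) -> float:
    """Idle time when the scheduler pushes work the moment a slot frees up.

    Network and bookkeeping delays are ignored in this toy model.
    """
    return 0.0


# Times (in seconds) at which a slot becomes free.
freed_times = [0.5, 1.0, 2.5, 4.0]
pull_idle = [pull_wait(t) for t in freed_times]
push_idle = [push_wait(t) for t in freed_times]

print("pull idle seconds:", pull_idle)
print("push idle seconds:", push_idle)
```

In this toy model a freed slot can sit idle for up to one full heartbeat interval under pull scheduling, which is the kind of idle time a push design aims to remove.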

Facebook said that it had used the MapReduce implementation from Apache Hadoop as the foundation of its big data infrastructure, and that it served the company well for several years. But by early 2011, as data volumes grew, the limitations of Hadoop MapReduce came to light. The most notable were limited scalability, suboptimal cluster utilization, high latency for small jobs, and the inability to upgrade without downtime.

Apache Hadoop has since announced its next-generation MapReduce, known as "YARN"; however, Facebook required much more than YARN to meet its mammoth data analysis needs. YARN has not yet been declared ready for general use.

With a new cluster manager, a dedicated job tracker, push-based job scheduling, and a finer segregation of roles for each resource, Corona offered an alternative at Facebook, capable of processing over 100 PB of data within Facebook's single largest cluster. Corona still supports Hadoop MapReduce jobs, and Facebook claims, based on early test results and production data samples, a clear improvement over Hadoop MapReduce, as follows.

  • 17% improvement in average time to refill a slot (the time a map or reduce slot remains idle on a task tracker)
  • Scheduling unfairness dropped from 14.3% in Hadoop MapReduce to 3.6% in Corona
  • Average job latency cut in half (from 50 seconds to 25 seconds)
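As a quick check on the arithmetic behind these figures, the short Python sketch below computes the relative change for the two bullets that give before and after values. The helper name `relative_reduction` is hypothetical; the inputs are the numbers reported above.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a value shrank going from `before` to `after`."""
    return (before - after) / before


# Job latency: 50 seconds with Hadoop MapReduce, 25 seconds with Corona.
latency_cut = relative_reduction(50, 25)

# Scheduling figure: 14.3% with Hadoop MapReduce, 3.6% with Corona.
scheduling_cut = relative_reduction(14.3, 3.6)

print(f"latency reduced by {latency_cut:.0%}")
print(f"scheduling figure reduced by {scheduling_cut:.1%}")
```

The latency figure works out to exactly half, and the scheduling figure to a relative reduction of roughly three quarters.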

Although Corona has been open-sourced, it currently works only with Facebook's version of Hadoop. Still, making Corona's source available on GitHub is a step in the right direction toward involving the developer community at this stage. It is just a matter of time before Facebook's version of Hadoop is replaced by the Apache version used across enterprises. While I would not yet call it a war between Apache YARN and Facebook Corona, Corona is just another alternative, at least until Apache launches its next major version of Hadoop.

Sudheer Raju


Founder of ToolsJournal, a technology journal on software tools and services. Sudheer has overall accountability for the website's product development and is responsible for sales and marketing. With a flair for writing, Sudheer writes for ToolsJournal across all journal categories.
