Facebook said that it used the MapReduce implementation from Apache Hadoop as the foundation of its big data infrastructure, and that this served the company well for several years. By early 2011, however, growing data volumes had exposed the limitations of Hadoop MapReduce. The notable ones were limited scalability, suboptimal cluster utilization, high latency for small jobs, and the inability to upgrade without downtime.
Apache Hadoop has since announced its next-generation MapReduce, known as "YARN"; however, Facebook required much more than YARN to meet its mammoth data analysis needs. YARN has also not yet been declared ready for public use.
With a new cluster manager, a dedicated job tracker, push-based job scheduling, and a finer segregation of roles for each resource, Corona offered an alternative at Facebook, providing the capability to process over 100 PB of data within Facebook's single largest cluster. Although Corona still supports Hadoop MapReduce, Facebook claims, based on early reports from its tests and production data samples, that it shows clear improvements over MapReduce:
- 17% reduction in average time to refill a slot (the time a map or reduce slot remains idle on a task tracker)
- Resource scheduling unfairness dropped from 14.3% in Hadoop MapReduce to 3.6% in Corona
- Average job latency cut in half (25 seconds, down from the original 50 seconds)
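To see why push-based scheduling can shorten the time a slot sits idle, here is a minimal Python sketch. It assumes that in the classic pull model a freed slot is only refilled on the task tracker's next periodic heartbeat, while in the push model the scheduler sends new work as soon as the slot frees up. The function names, the heartbeat interval, and the sample times are hypothetical and purely illustrative; they are not Corona's code or Facebook's measurements.

```python
def pull_idle_time(freed_at: int, heartbeat_interval: int) -> int:
    """Seconds a freed slot waits when work is only assigned on heartbeats.

    Assumes heartbeats fire at t = 0, interval, 2 * interval, ...
    """
    # Round freed_at up to the next heartbeat tick (integer ceiling division).
    next_heartbeat = -(-freed_at // heartbeat_interval) * heartbeat_interval
    return next_heartbeat - freed_at


def push_idle_time(freed_at: int, dispatch_delay: int = 0) -> int:
    """Seconds a freed slot waits when the scheduler pushes work right away.

    The wait does not depend on when the slot was freed, only on the
    (hypothetical) delay of sending the next task.
    """
    return dispatch_delay


def average(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


if __name__ == "__main__":
    freed = [1, 4, 8, 9]   # illustrative times (seconds) at which slots free up
    interval = 3           # illustrative heartbeat interval in seconds
    pull = average(pull_idle_time(t, interval) for t in freed)
    push = average(push_idle_time(t) for t in freed)
    print(f"pull: {pull:.2f}s idle on average, push: {push:.2f}s")
```

In this toy model the pull-based idle time depends on where in the heartbeat cycle the slot happens to free up, whereas the push-based idle time is bounded by the dispatch delay alone. A real cluster has many more sources of delay, so the sketch only illustrates the direction of the effect, not its size.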
Although Corona is open source, it currently works only with Facebook's version of Hadoop. Still, making Corona's source available on GitHub is a step in the right direction toward involving the developer community at this stage. It is just a matter of time before Facebook's version of Hadoop is replaced by the Apache version used across enterprises. While I would not say it is a war between Apache YARN and Facebook Corona yet, Corona is just another alternative, at least until Apache launches its next major version of Hadoop.