In the blog post, Sriram Krishnan writes that the company's Hadoop-based data warehouse is rapidly growing beyond petabyte scale, and that Netflix must constantly improve and evolve its homegrown architecture to keep up with that growth. He also reveals the company's in-house Hadoop Platform as a Service (PaaS), called Genie.
To get a sense of just how much data Netflix typically works with, let's take a look at some specifics. Netflix has more than 25 million users. About 30 million titles are played every single day, and the system records every time someone rewinds, skips forward or pauses a movie. Over 2 billion hours of streaming video were consumed during the last 3 months of 2011 alone, and over 3 million searches are performed every day. On top of that come device tracking information, geo-location data, metadata from third-party sources, and social media statistics from popular social networks such as Facebook and Twitter.
Netflix processes all of this data, and more, both for business analytics and to build end-user services. This feedback loop helps the customers as well as the company. Hadoop provides the storage capacity and serves as the processing engine for most of these computations. For the company, Hadoop is more than just a platform. Netflix already manages over 500 clusters of Elastic MapReduce instances on Amazon's web services platform, and also uses them to experiment with new services and features.
The company has an interesting Hadoop architecture. Sriram Krishnan explains that Netflix uses Amazon S3 for storage rather than the Hadoop Distributed File System (HDFS). This lets its clusters run independently of one another while sharing the same data sets. HDFS is used only at specific points where it is necessary, because retrieving data from S3 is comparatively slow.
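The storage split described above can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the bucket name, the namenode address, and the `dataset_uri` helper are hypothetical and do not come from Netflix's post. The idea is simply that persistent data sets resolve to a shared S3 location, while intermediate data can be kept on a cluster's own HDFS.

```python
# Hypothetical bucket and namenode names, used only for illustration.
S3_WAREHOUSE = "s3://example-warehouse"
HDFS_SCRATCH = "hdfs://namenode:8020/tmp"


def dataset_uri(name, intermediate=False):
    """Return a storage location for a data set.

    Persistent data goes to shared S3 so that any cluster can read it;
    intermediate data goes to cluster-local HDFS, where reads are faster.
    """
    base = HDFS_SCRATCH if intermediate else S3_WAREHOUSE
    return f"{base}/{name}"


# Two independent clusters resolve the same shared location for persistent data.
for cluster in ("cluster-a", "cluster-b"):
    print(cluster, "reads", dataset_uri("plays"))
```

Because both clusters point at the same S3 path, either one could be shut down or resized without moving the data, which is the benefit the post attributes to this design.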
Netflix has built its own Platform as a Service layer on top of Amazon Elastic MapReduce. The service is called Genie. It provides a layer of abstraction that lets engineers submit individual Hadoop, Hive and Pig jobs through a REST API without having to understand the underlying infrastructure. Genie also offers the company a range of resource management features. Krishnan further explains that the company could not find an existing solution that met its needs, and so it built its own.
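To give a feel for what submitting a job through a REST API might look like, here is a minimal Python sketch. It is not Genie's actual API: the endpoint URL, the JSON field names, and the `build_job_request` helper are all assumptions made up for illustration. The sketch only builds the HTTP request and does not send it.

```python
import json
import urllib.request

# Hypothetical endpoint, for illustration only; not taken from Genie's real API.
GENIE_URL = "http://genie.example.com/jobs"

SUPPORTED_JOB_TYPES = {"hadoop", "hive", "pig"}


def build_job_request(job_type, script, user):
    """Build (but do not send) an HTTP POST describing a single job.

    The payload field names are assumptions chosen for this sketch.
    """
    if job_type not in SUPPORTED_JOB_TYPES:
        raise ValueError(f"unsupported job type: {job_type}")
    payload = {"jobType": job_type, "script": script, "user": user}
    return urllib.request.Request(
        GENIE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_job_request("hive", "SELECT COUNT(*) FROM plays;", "analyst")
print(request.get_method(), request.full_url)
```

The point of the abstraction is visible in what the caller does not supply: there is no cluster address, node count, or Hadoop configuration in the request, only a description of the job itself.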
Netflix has clearly become a great example of the intersection of cloud computing and big data. The company is keen on building its own big data facilities to harness the power of cloud computing. In the coming years, it will be interesting to see how Netflix scales its operations as it goes global. The company did, however, go through a short outage on Christmas Eve last year, which was subsequently repaired just in time.