The most relevant technical strategy is, of course, parallelism. While we as an industry have spent a lot of effort on parallelizing computation, data parallelism remains a challenge and is the focal point of most current projects. Additionally, it is becoming apparent that in many cases compute grids are bottlenecking on data access. As a result, the pattern of moving compute tasks to the data, rather than moving large amounts of data over the network, is becoming more and more prevalent.
Several technical approaches combine these strategies, parallelizing both data management and computation, while bringing compute tasks close to the data:
Engineered machines integrate software and hardware mechanisms, combining data and compute parallelization with partitioning, compression and a high-bandwidth backplane to provide very high throughput data processing capabilities.
Oracle Exadata is the earliest engineered machine offered by Oracle, designed to optimize and accelerate the Oracle RAC database. It is able to delegate query and analytics execution to the nodes that hold the data, thus radically minimizing data movement. It does this by replacing the traditional database SAN with intelligent storage nodes that can do much more than simple I/O operations, use the optimized ASM filesystem, and are connected to the RAC compute nodes with a high-throughput Infiniband fabric.
Whether using engineered machines or not, the concept of performing analytics right on the Oracle RDBMS is a very powerful one, again following the philosophy of moving computation to the data rather than the other way around. Whether it’s OLAP, predictive, or statistical analytics, Oracle 11gR2 is capable of doing a lot of computation right where the data is stored, since it integrates the RAC data parallelism mechanisms with analytical engines such as R, so that analytical tasks are parallelized along the same principles.
The combination of high throughput analytics with engineered machines has enabled several financial firms to dramatically reduce the time it takes to run analytical workloads. Whether it’s EOD batch processing, on-demand risk calculation, or pricing and valuations, firms are able to do a lot more in much less time, directly affecting the business by enabling continuous, on-demand data processing.
Unlike compute grids, data grids focus on the challenge of data parallelism by distributing in-memory data objects across a large, horizontally scaled cluster.
Oracle Coherence (formerly known as Tangosol) was built from the ground up as a highly scalable and performant data grid, which uses its own efficient protocol for inter-node communication and object serialization. A unique feature of Coherence is that it also provides the ability to ship compute tasks to the nodes holding the data in memory, rather than sending data to compute nodes as most compute grids do. Again, this is based on the principle that it’s cheaper to ship a compute task than it is to move large amounts of data across the wire.
Several firms have been using Coherence to aggregate market data as well as positions data across desks and geographies. And some go even further by continuously executing certain analytics right on the nodes where this data is being held, achieving a real-time view of exposures, P&L, and other calculated metrics.
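To make the "ship the task, not the data" idea concrete, here is a minimal sketch in plain Python. It is a toy model and not the Coherence API: the `Node`, `ToyGrid`, `exposure_by_desk`, and `merge_totals` names, as well as the sample positions, are all made up for illustration. Each node runs the task against its own in-memory slice and returns only a small partial result, which is then merged.

```python
from collections import defaultdict
from zlib import crc32


class Node:
    """One hypothetical grid member holding a slice of the data in memory."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def run(self, task):
        # The task executes against local data; only its result leaves the node.
        return task(self.store)


class ToyGrid:
    """Toy partitioned grid (illustrative only, not a real product API)."""

    def __init__(self, node_count):
        self.nodes = [Node(f"node-{i}") for i in range(node_count)]

    def _owner(self, key):
        # Stable hash so the same key always lands on the same node.
        return self.nodes[crc32(key.encode("utf-8")) % len(self.nodes)]

    def put(self, key, value):
        self._owner(key).store[key] = value

    def aggregate(self, task, combine):
        # Send the task to every node, then merge the partial results.
        partials = [node.run(task) for node in self.nodes]
        return combine(partials)


def exposure_by_desk(local_store):
    """Task shipped to each node: sum notional per desk over local positions."""
    totals = defaultdict(float)
    for position in local_store.values():
        totals[position["desk"]] += position["notional"]
    return dict(totals)


def merge_totals(partials):
    """Combine the per-node partial sums into one grid-wide view."""
    merged = defaultdict(float)
    for partial in partials:
        for desk, amount in partial.items():
            merged[desk] += amount
    return dict(merged)


grid = ToyGrid(node_count=4)
grid.put("pos-1", {"desk": "rates", "notional": 100.0})
grid.put("pos-2", {"desk": "fx", "notional": 75.0})
grid.put("pos-3", {"desk": "rates", "notional": 50.0})

exposure = grid.aggregate(exposure_by_desk, merge_totals)
print(exposure)
```

The point of the design is that what crosses the "network" here is one small dictionary per node, regardless of how many positions each node holds.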
The concept of schema-less data management has been gaining momentum in recent years. At its core is the notion that developers can be more productive by circumventing the need for complex schema design during the development lifecycle of data-intensive applications, especially when the data lends itself to key-value modeling (e.g. time series data). NoSQL DB is Oracle’s key-value store implementation, which utilizes the Berkeley DB storage engine, a mature and robust technology.
Despite being based on different principles than a data grid, NoSQL DB follows a similar philosophy, distributing the data horizontally across many nodes to achieve high performance and scalability. It also enables the execution of compute tasks close to the data in order to minimize data movement over the network. The main difference is that NoSQL DB utilizes the filesystem to store high volumes of data and models it as key-value pairs, rather than in an object or relational manner, and therefore lends itself to storing key-value maps rather than complex object graphs the way a data grid would. It excels in very fast lookups of keyed data, such as customer profiles, which are prominent in deep customer analytics and targeting (see use case section).
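As a rough illustration of schema-less, file-backed key-value storage, the sketch below hashes each key to a shard directory and stores the value as a JSON file. This is a toy under stated assumptions and has nothing to do with the actual NoSQL DB API or the Berkeley DB on-disk format: the `ToyKeyValueStore` class, the shard layout, and the sample customer profiles are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class ToyKeyValueStore:
    """Hypothetical file-backed key-value store, sharded by key hash."""

    def __init__(self, root, shard_count=4):
        self.root = Path(root)
        self.shard_count = shard_count
        for shard in range(shard_count):
            (self.root / f"shard-{shard}").mkdir(parents=True, exist_ok=True)

    def _path(self, key):
        # The key alone determines the shard and file, so a lookup
        # touches exactly one location.
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        shard = int(digest, 16) % self.shard_count
        return self.root / f"shard-{shard}" / f"{digest}.json"

    def put(self, key, value):
        # Values are opaque JSON documents; no schema is enforced.
        self._path(key).write_text(json.dumps(value), encoding="utf-8")

    def get(self, key, default=None):
        path = self._path(key)
        if not path.exists():
            return default
        return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as root:
    store = ToyKeyValueStore(root)
    # Two profiles with different fields: nothing had to be declared up front.
    store.put("customer/1001", {"segment": "retail", "products": ["card", "loan"]})
    store.put("customer/1002", {"segment": "private", "risk_score": 0.2})
    profile = store.get("customer/1001")
    missing = store.get("customer/9999")

print(profile)
```

Note that the two stored profiles carry different attributes, which is the productivity argument for key-value modeling: the shape of the value can evolve without a schema migration.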
It is important to keep in mind that despite the name, NoSQL DB is not antithetical to an RDBMS. In fact they both become much more powerful when combined with traditional data warehousing and business intelligence tools. These technologies should therefore be viewed as a continuum rather than in dialectic opposition.
Hadoop, a related technology that is quickly becoming a de facto standard within the IT world, is a complete open-source stack for storing and analyzing massive amounts of data, with several distributions available. Just like the technologies mentioned before, the Hadoop framework achieves its massive scalability by sending compute tasks to the nodes storing the data, in this case within the HDFS filesystem (rather than in memory in the case of Coherence, or in ASM in the case of Exadata). The filesystem itself is also architected to offer high scalability and availability.
Some organizations have started looking at Hadoop as an alternative to the more traditional compute grid, which uses shared storage between compute nodes. This typical architecture provides great compute performance, but can sometimes choke on data access. Hadoop, just like Coherence, takes the opposite approach by shipping compute tasks to the data, therefore achieving better performance and scalability when processing very large volumes of data. The main difference between them is the form in which the data is stored: objects in memory in the case of Coherence, files on the HDFS filesystem in the case of Hadoop.
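The map, shuffle, and reduce phases that Hadoop runs over file blocks can be sketched in a few lines of plain Python. This is a single-process toy, not Hadoop code: the `blocks_by_node` layout, the trade-log lines, and the function names are invented for illustration, and the "scheduling" is just a loop that processes each block under the node that holds it.

```python
from collections import defaultdict

# Hypothetical layout: each "data node" holds some blocks of a trade log.
blocks_by_node = {
    "node-a": [["IBM 100", "ORCL 250"], ["IBM 50"]],
    "node-b": [["ORCL 25", "MSFT 10"]],
}


def map_block(lines):
    """Map task: parse one local block into (symbol, quantity) pairs."""
    for line in lines:
        symbol, quantity = line.split()
        yield symbol, int(quantity)


def shuffle(pairs):
    """Group mapped values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_group(symbol, quantities):
    """Reduce task: total traded quantity for one symbol."""
    return symbol, sum(quantities)


def run_job(blocks_by_node):
    mapped = []
    for node, blocks in blocks_by_node.items():
        # In a real cluster each map task would run on the node that already
        # stores the block; here the loop merely mirrors that assignment.
        for block in blocks:
            mapped.extend(map_block(block))
    grouped = shuffle(mapped)
    return dict(reduce_group(symbol, values) for symbol, values in grouped.items())


volume = run_job(blocks_by_node)
print(volume)
```

Only the compact (symbol, quantity) pairs move between phases; the raw log lines never leave the node that stores them, which is the property that lets this style of processing scale with data volume.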
Oracle has partnered with Cloudera, the leader in Apache Hadoop-based distributions, which has the enterprise quality and expertise to offer mature, reliable, and innovative products. The Cloudera Hadoop distribution is a key component of another Oracle engineered machine:
Big Data Appliance
The Oracle Big Data Appliance is an engineered machine optimized to run Big Data applications using Hadoop, NoSQL and R. It offers unsurpassed performance, as well as pre-integrated and pre-configured application and hardware platform components which are all engineered to work together (which is no small feat in the case of Hadoop). It uses an Infiniband backbone to connect the cluster nodes together, enabling unsurpassed MapReduce performance without the need to set up and administer an Infiniband network.
Infiniband is also used to communicate with other engineered machines, enabling high-speed data integration between structured and unstructured sources and speed-of-thought analytics across them.
This end-to-end continuum of Big Data management technologies is a key point to keep in mind, and one I will expand upon in following posts.
[Image Credits: Conservation Magazine]