Cloud Journal



Getting The Most Out Of Cloud Big Data With HDInsight And Azure Vault Storage

Written by Angelo Racoma | 06 February 2013

Cloud storage is not just about storing data in a shareable configuration. For one, there is not much point to cloud computing if the data is not big; if the data set were small enough, it would be easier to install network attached storage (NAS) or a storage area network (SAN). Cloud storage is also about using that data with software as a service (SaaS) and having it accessible on demand.

The solution set becomes even more complex when big data on the cloud is in use. Microsoft's HDInsight, using Azure Vault Storage, is a set of tools and a strategy for computing big data on the cloud. It promises to be a viable approach that allows for multiple users, instances and processes without bringing the whole infrastructure to its knees.

Big data computation is the next step up after data warehousing and data mining. It is no longer a simple matter of accumulating data and finding relevant information in it, but of computing relevant and timely information from the whole data set. Among the many problems with early data mining efforts was that a single job might need the entire computing power of the cluster to finish. This is especially true when it comes to real-time data.

Distributed computing strategy

HDInsight provides a solution based on a distributed computing strategy, which is designed to keep all processes running. Like all big data computing, it relies on a number of tools working together. Two distinct and disparate file systems are used for completely different functions. Azure Vault Storage (ASV) provides HDInsight with highly scalable, highly available, shareable long-term storage. On the other hand, HDInsight uses the Hadoop Distributed File System (HDFS) on its Hadoop clusters, which is optimized for Map/Reduce (M/R) computational tasks on the data.
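To make the Map/Reduce model concrete, here is a minimal sketch of the classic word-count pattern in plain Python. It is not HDInsight code; it only mimics the three phases a Hadoop job distributes across a cluster: a map phase emitting key/value pairs, a shuffle/sort grouping them by key, and a reduce phase aggregating each group.

```python
# Minimal Map/Reduce word-count sketch (illustrative only; a real
# Hadoop job distributes these phases across cluster nodes).
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reducer: after the shuffle/sort step, sum the counts per word."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

documents = ["big data on the cloud", "the cloud stores big data"]
counts = dict(reduce_phase(map_phase(documents)))
print(counts["big"], counts["the"])  # -> 2 2
```

Because each mapper and each reducer works independently on its own slice of the data, the same logic scales out across many nodes, which is what lets HDInsight keep multiple jobs running at once.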

Instead of keeping data on the Hadoop clusters where the computations are done, HDInsight keeps it in ASV; the compute nodes read their input from ASV and write results back to it. Once the computations are done, the HDInsight clusters are dropped while the data stays in ASV, ready for another task. Due to the bursty nature of the computational problems HDInsight was designed to solve, it makes sense to create and drop Hadoop clusters as the need arises, freeing up resources.
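As a sketch of how this decoupling looked in practice: HDInsight of that era let Hadoop jobs address blob storage directly through an asv:// URI scheme instead of hdfs:// paths. The container and account names below are placeholders, and the exact URI layout shown is an assumption based on the early HDInsight conventions (later releases moved to other schemes).

```
# List job input kept in Azure blob storage rather than on the cluster's HDFS
# (asv:// scheme and names are illustrative of early HDInsight).
hadoop fs -ls asv://mycontainer@myaccount.blob.core.windows.net/input/

# Run a stock word-count job reading from and writing to ASV; the Hadoop
# cluster can be dropped afterwards without losing input or output.
hadoop jar hadoop-examples.jar wordcount \
    asv://mycontainer@myaccount.blob.core.windows.net/input/ \
    asv://mycontainer@myaccount.blob.core.windows.net/output/
```

Because both input and output live in ASV, nothing of value is lost when the transient cluster is torn down.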

Since HDFS runs on Java, the clusters have only a small footprint. Additionally, there is no need to create the storage volumes using low-level tools, which would be time consuming for the amount of data involved. It is also possible to run these solutions on Windows Azure virtual machines. That strategy is unsupported, but it is a viable alternative if the servers are running on non-Windows operating systems. Typically, the solution calls for an HDInsight Server, which may or may not be running under Azure.

Using Azure Vault Storage in HDInsight, with the help of Hadoop clusters, can prove to be a much simpler solution compared to the competition. It is also robust, with components designed for high scalability and capable of providing high availability services.

Angelo Racoma

J. Angelo Racoma is a journalist and community manager with a keen eye for emerging standards and technologies. Angelo writes for ToolsJournal covering Technology and Startups. Besides ToolsJournal, he covers startups, Android and Google at Android Authority, the APAC tech scene for Tech Wire Asia, and enterprise news at CMSWire.
