The solution set becomes even more complex when big data on the cloud is in use. Microsoft's HDInsight, using Azure Storage Vault (ASV), is a set of tools and a strategy for computing big data on the cloud. It promises to be a viable approach that allows for multiple users, instances and processes without bringing the whole infrastructure to its knees.
Big data computation is the next step up after data warehousing and data mining. It is no longer simply a matter of accumulating data and finding relevant information in it, but also of computing relevant and timely information from the whole data set. One of the many problems with early data mining efforts was that a single job might need the entire computing power of the cluster to finish. This was especially true for real-time data.
Distributed computing strategy
HDInsight provides a solution based on a distributed computing strategy designed to keep all processes running. Like all big data computing, it relies on many tools. Two distinct file systems serve completely different functions. HDInsight uses Azure Storage Vault (ASV) for long-term, shareable storage because it is highly scalable and highly available. On the Hadoop clusters themselves, HDInsight uses the Hadoop Distributed File System (HDFS), which is optimized for Map/Reduce (M/R) computational tasks on the data.
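To make the Map/Reduce idea concrete, here is a minimal single-process sketch of a word count in plain Python. It is not HDInsight or Hadoop code; the function names (map_phase, shuffle, reduce_phase, run_job) are illustrative assumptions, and a real cluster would spread the map and reduce calls across many nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group emitted values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values collected for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence over an iterable of text records."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    sample = ["big data on the cloud", "data on the cluster"]
    print(run_job(sample))
```

Because each map call looks at one record and each reduce call looks at one key, the work can be split across machines without the pieces needing to coordinate, which is what makes the model suit a cluster.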
Instead of keeping data permanently on the Hadoop clusters where the computations are done, HDInsight keeps the data in ASV and runs the computations on cluster compute nodes that read from it. Once the computations are done, the HDInsight clusters are dropped, and the data in ASV remains available for another task. Because of the nature of the computational problems HDInsight was designed to solve, Hadoop clusters need to be created and dropped as the need arises, freeing up resources.
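The create-and-drop pattern can be sketched as a toy simulation, assuming the data lives in a long-term store while the cluster exists only for the duration of a job. PersistentStore, Cluster and ephemeral_cluster are hypothetical names invented for this illustration; they do not correspond to any HDInsight or Azure API.

```python
from contextlib import contextmanager


class PersistentStore:
    """Hypothetical stand-in for long-term storage such as ASV; it outlives any cluster."""

    def __init__(self):
        self._blobs = {}

    def put(self, name, data):
        self._blobs[name] = data

    def get(self, name):
        return self._blobs[name]


class Cluster:
    """Hypothetical compute cluster that reads from and writes to the store."""

    def __init__(self, store, nodes):
        self.store = store
        self.nodes = nodes
        self.active = True

    def run(self, input_name, output_name, func):
        if not self.active:
            raise RuntimeError("cluster has been dropped")
        # Read the input from long-term storage, compute, and write the result back.
        self.store.put(output_name, func(self.store.get(input_name)))


@contextmanager
def ephemeral_cluster(store, nodes):
    """Create a cluster for one job and drop it afterwards, even if the job fails."""
    cluster = Cluster(store, nodes)
    try:
        yield cluster
    finally:
        cluster.active = False  # dropping the cluster frees its resources


if __name__ == "__main__":
    store = PersistentStore()
    store.put("raw", [3, 1, 2])
    with ephemeral_cluster(store, nodes=4) as cluster:
        cluster.run("raw", "sorted", sorted)
    print(store.get("sorted"))  # results remain in the store after the cluster is gone
```

The point of the design is that nothing of lasting value lives on the cluster itself, so dropping it loses no data.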
Since HDFS runs on Java, the clusters have only a small footprint. Additionally, there is no need to create the storage volumes using low-level tools, which would be time consuming for the amount of data involved. It is also possible to run these solutions on Windows Azure virtual machines. That strategy is unsupported, but it is a viable alternative if the servers are running non-Windows operating systems. Typically, the solution calls for an HDInsight Server, which may or may not be running under Azure.
Using Azure Storage Vault in HDInsight, with the help of Hadoop clusters, can prove to be a much simpler solution than the competition. It is also robust, with components designed to be highly scalable and capable of high-availability services.