Sunday, June 23, 2013

Windows Apps and Hadoop - Interoperability Options

Windows offers the largest and richest set of enterprise-wide end-user business tools. These tools can be used for analysis, visualization, and slicing and dicing the underlying data. As more enterprises embrace big data and Hadoop, the need for interoperability between the two is growing. Hadoop, however, comes from the Linux stable, so it should not surprise anyone that getting the two to interoperate can be daunting. This post presents the basics of your interoperability options.

The diagram below depicts the relationship between various elements of the technology stacks for Windows, Hadoop and deployment cloud architectures.
[Diagram: Windows Hadoop Interop]

Quick takeaways

  • If you have only Windows/.NET developers and administrators, go with the HDInsight product on the Windows Azure cloud.
  • If you have Linux admins but Windows/.NET developers, you can either run a private Linux cloud or use a service like Rackspace to host your Hadoop cluster.
  • There is no one-size-fits-all approach here. You will need to look at the skills available in your company and decide.
The assumption here is that all of your business analysts and data scientists are familiar with Windows tools.

Source: Microsoft.com

Interoperability Options

APIs and Drivers

  • Microsoft .NET SDK for Hadoop. This kit provides .NET API access to aspects of HDInsight, including HDFS, HCatalog, Oozie and Ambari, along with some PowerShell scripts for cluster management. There are also libraries for MapReduce and LINQ to Hive. The latter is particularly interesting because it builds on LINQ, the established way for .NET developers to query most data sources, to deliver the capabilities of Hadoop.
  • Microsoft Hive ODBC Driver. This driver lets Excel connect to Hive.
  • Microsoft .NET Map Reduce API for Hadoop. Microsoft recently released this .NET API for working with the Map/Reduce functionality of Hadoop Streaming.
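To make the Hadoop Streaming item more concrete: Streaming runs any executable as a mapper or reducer, feeding it records on standard input and reading tab-separated key/value lines from standard output. The sketch below shows that contract with a word count. It is written in Python rather than .NET purely for illustration, it is not the Microsoft API, and the function names and the "map"/"reduce" command-line argument are assumptions made for this example.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record for every word seen."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts for each run of identical keys.

    Streaming hands the reducer its input sorted by key, so all
    records for one word arrive consecutively.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Hypothetical convention for this sketch: run the script with
# "map" or "reduce" as its only argument to pick the step.
if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

Because both steps only read lines and write lines, you can test them locally by piping a text file through the mapper, a sort, and the reducer before submitting anything to a cluster.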

Data Explorer Plugin for Excel

This is the easiest way to slice, dice and visualize data sitting in a Hadoop cluster. Please see my earlier post on the Data Explorer Plugin.

Support for Third Party Tools

Many businesses find that they need third-party software that does not support the Hadoop ecosystem natively. This is where tools such as Talend and Pentaho come in. With these tools you can import the data into a database that your favorite tool supports; failing that, you can always fall back on comma-separated value (CSV) files.
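As a rough illustration of the CSV fallback, the sketch below converts tab-delimited text into CSV using Python's standard csv module. It assumes the data has already been exported from the cluster as tab-delimited text; the function name and the sample rows are made up for this example and do not come from Talend or Pentaho.

```python
import csv
import io


def tsv_to_csv(tsv_text):
    """Convert tab-delimited text to CSV, quoting fields where needed."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    # Use "\n" line endings so the result is easy to compare and diff.
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


# Hypothetical sample export: a header row plus one data row.
sample = "id\tname\n1\tSmith, J\n"
print(tsv_to_csv(sample), end="")
```

The csv writer quotes any field that contains a comma, which avoids the most common problem when such files are later opened in Excel or loaded by another tool.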