Thursday, August 22, 2013

Comparison table of interactive analysis offerings for big data

This is a comparison chart of ad-hoc query tools for interactive analysis of large data sets. They are all good and serve their particular needs well. That said, Apache Drill lets a user query many different types of large data sources directly, and so removes the need for expensive and error-prone ETL (Extract-Transform-Load) of the data.

Please also see my earlier post on NoSQL database comparison.

| | Apache Drill | Apache Hive | BigQuery | CitusDB | Hadapt | HAWQ | Impala | Phoenix |
|---|---|---|---|---|---|---|---|---|
| Owner | Community | Community | Google | CitusData | Hadapt | Greenplum | Cloudera | Salesforce |
| Low-latency | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes |
| Operational mode | On-premise | On-premise | Hosted, SaaS offering | On-premise | On-premise | Part of Pivotal HD appliance | On-premise | On-premise |
| Data shapes | Nested, tabular | Nested, tabular | Nested, tabular | Nested, tabular | Tabular | Tabular | Tabular | Tabular |
| Data sources | Extensible, incl. HDFS, HBase, Cassandra, MongoDB, RDBMS, etc. | HDFS, HBase | N/A | PostgreSQL, MongoDB, HDFS | HDFS/RDBMS | HDFS, HBase | HDFS, HBase | HDFS, HBase |
| Hadoop dependent | No | Yes | No | No | Yes | No | Yes | Yes |
| Schema | Optional | Required | Required | Required | Required | Required | Required | Required |
| License | Apache 2.0 | Apache 2.0 | ToS/SLA | Commercial | Commercial | Commercial | Apache 2.0/Open Source | Proprietary |
| Source code | Open | Open | Closed | Closed | Closed | Closed | Open | Open |
| Query languages | Extensible, incl. SQL 2003, MongoQL, DSL, etc. | HiveQL | SQL subset | SQL | SQL subset | SQL subset | SQL/HiveQL subset | SQL subset |
| Columnar storage | Yes | Possible | Yes | No | No | Yes | Yes | No |
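
If you want to shortlist tools against your own requirements, the chart can be encoded as data and filtered. The following is only a rough sketch: it transcribes three rows of the chart (low-latency, Hadoop dependent, source code) by hand, and the names `TOOLS` and `shortlist` are made up for illustration rather than taken from any of the products.

```python
# Three rows of the comparison chart, transcribed by hand.
# Each tuple is (low_latency, hadoop_dependent, source_open).
# TOOLS and shortlist are hypothetical names used only for this sketch.
TOOLS = {
    "Apache Drill": (True, False, True),
    "Apache Hive": (False, True, True),
    "BigQuery": (True, False, False),
    "CitusDB": (True, False, False),
    "Hadapt": (True, True, False),
    "HAWQ": (True, False, False),
    "Impala": (True, True, True),
    "Phoenix": (True, True, True),
}

FIELDS = ("low_latency", "hadoop_dependent", "source_open")


def shortlist(tools, **required):
    """Return the sorted names of tools matching every required attribute."""
    matches = []
    for name, values in tools.items():
        attrs = dict(zip(FIELDS, values))
        if all(attrs[key] == wanted for key, wanted in required.items()):
            matches.append(name)
    return sorted(matches)


if __name__ == "__main__":
    # Low-latency, open-source tools that do not depend on Hadoop.
    print(shortlist(TOOLS, low_latency=True, hadoop_dependent=False,
                    source_open=True))
```

With the values as they appear in the chart, the only tool that is low-latency, open source, and independent of Hadoop is Apache Drill. Dropping the Hadoop condition adds Impala and Phoenix.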