Wednesday, August 7, 2013

Is Hadoop usage in its adolescence phase?

"An analysis of Hadoop usage in scientific workloads" by Kai Ren, YongChul Kwon, Magdalena Balazinska, and Bill Howe attempts to shed some light on the usage maturity of the Hadoop ecosystem.

The authors analyzed Hadoop workloads from three different research clusters from a user-centric perspective. Their stated goal was to better understand data scientists’ use of the system and how well that use matches its design.

Some of the findings reported by the authors:
  • saw under-use of Hadoop features, extensions, and tools
  • saw significant diversity in resource usage and application styles, including some interactive and iterative workloads, motivating new tools in the ecosystem
  • found significant opportunities for optimizations of these workloads
  • found job customization and configuration are used in a narrow scope, suggesting the future pursuit of automatic tuning systems
  • noted a lack of good debugging tools
Based on the above findings, the authors conclude that the use of Hadoop for academic research is still in its adolescence. They suggest two directions for future work: easing that use, especially for sophisticated applications, and improving the system to tolerate workload diversity.

The study analyzed a large number of workloads to reach these conclusions. From the paper:

"Our analysis is based on Hadoop workloads collected over periods of five to 20 months in three different clusters. Our traces comprise a total of more than 100,000 Hadoop jobs. The clusters that we study come from academic institutions. Our data scientists are 113 domain experts from various disciplines..."
 
Even though the paper focused on the use of Hadoop by the research community, I suspect the same conclusions hold true for commercial usage (at least based on my experience). I would love to hear your comments and views.


Image Credit: By Petrovsky; digitally edited by W.[w.]] [CC-BY-3.0], via Wikimedia Commons