Some of the findings reported by the authors:
- saw under-use of Hadoop features, extensions, and tools
- saw significant diversity in resource usage and application styles, including some interactive and iterative workloads, motivating new tools in the ecosystem
- found significant opportunities for optimizations of these workloads
- found job customization and configuration are used in a narrow scope, suggesting the future pursuit of automatic tuning systems
- lack of good debugging tools
The study analyzed a lot of workloads to reach the conclusions. From the paper:
"Our analysis is based on Hadoop workloads collected over
periods of five to 20 months in three different clusters. Our
traces comprise a total of more than 100,000 Hadoop jobs.
The clusters that we study come from academic institutions.
Our data scientists are 113 domain experts from various
Even though the paper focused on the use of Hadoop by research community, I suspect the same conclusions hold true for commercial usage (at least based on my experience). Will love to hear your comments/views.
Image Credit: By Petrovsky; digitally edited by W.[w.]] [CC-BY-3.0], via Wikimedia Commons