"An analysis of Hadoop usage in scientific workloads" by Kai Ren, YongChul Kwon, Magdalena Balazinska, Bill Howe attempts to shed some light on usage maturity of the Hadoop ecosystem.
The authors analyzed Hadoop workloads from three different research clusters from a user-centric perspective. The goal they set out was to to better understand data scientists’ use of the system and how well the use of the system matches its design.
Some of the findings reported by the authors:
The study analyzed a lot of workloads to reach the conclusions. From the paper:
"Our analysis is based on Hadoop workloads collected over
periods of five to 20 months in three different clusters. Our
traces comprise a total of more than 100,000 Hadoop jobs.
The clusters that we study come from academic institutions.
Our data scientists are 113 domain experts from various
disciplines..."
Even though the paper focused on the use of Hadoop by research community, I suspect the same conclusions hold true for commercial usage (at least based on my experience). Will love to hear your comments/views.
Image Credit: By Petrovsky; digitally edited by W.[w.]] [CC-BY-3.0], via Wikimedia Commons
The authors analyzed Hadoop workloads from three different research clusters from a user-centric perspective. The goal they set out was to to better understand data scientists’ use of the system and how well the use of the system matches its design.
Some of the findings reported by the authors:
- saw under-use of Hadoop features, extensions, and tools
- saw significant diversity in resource usage and application styles, including some interactive and iterative workloads, motivating new tools in the ecosystem
- found significant opportunities for optimizations of these workloads
- found job customization and configuration are used in a narrow scope, suggesting the future pursuit of automatic tuning systems
- lack of good debugging tools
The study analyzed a lot of workloads to reach the conclusions. From the paper:
"Our analysis is based on Hadoop workloads collected over
periods of five to 20 months in three different clusters. Our
traces comprise a total of more than 100,000 Hadoop jobs.
The clusters that we study come from academic institutions.
Our data scientists are 113 domain experts from various
disciplines..."
Even though the paper focused on the use of Hadoop by research community, I suspect the same conclusions hold true for commercial usage (at least based on my experience). Will love to hear your comments/views.
Image Credit: By Petrovsky; digitally edited by W.[w.]] [CC-BY-3.0], via Wikimedia Commons