Role of Apache Kafka in the Hadoop ecosystem
Figure: Kafka deployment model (source: kafka.apache.org)
Kafka is designed to serve as plumbing for a low-latency, very-high-throughput pipeline that handles all messaging, tracking, logging, and metrics data. This unified pipeline can feed data to Hadoop clusters as well as to a diverse set of real-time stream processing applications.
It is sort of like ActiveMQ and other JMS-based message-queuing implementations, but there are many differences, including:
- Acknowledgment is not required for every message, which enables very high throughput at the cost of weaker delivery guarantees
- Written in Scala
- No per-message IDs; messages are addressed by their offset in the log
- Consumers pull messages from the brokers (pull model) rather than having messages pushed to them
- Well suited to performance-stats collection, log aggregation, and real-time event processing
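The pull model is worth a closer look: in Kafka, the broker keeps an append-only log and no per-consumer delivery state; each consumer tracks its own offset and pulls batches at its own pace. Below is a minimal, stdlib-only Python sketch of that idea. The `Log` and `Consumer` classes are illustrative names, not Kafka's actual client API.

```python
# Illustrative sketch of Kafka's pull model (NOT the real Kafka API):
# the broker side is an append-only log, and each consumer tracks its
# own offset and pulls batches, instead of the broker pushing messages.

class Log:
    """Append-only message log, like a single broker partition."""
    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)

    def read(self, offset, max_messages):
        # The broker keeps no per-consumer state; it just serves a range.
        return self.messages[offset:offset + max_messages]


class Consumer:
    """Owns its position, so it can fall behind or re-read safely."""
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_messages=2):
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch


log = Log()
for i in range(5):
    log.append(f"event-{i}")

consumer = Consumer(log)
print(consumer.poll())  # ['event-0', 'event-1']
print(consumer.poll())  # ['event-2', 'event-3']
```

Because consumption state lives in the consumer rather than the broker, a slow consumer never forces the broker to buffer per-client queues, which is part of how Kafka keeps throughput high.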
From the Apache Kafka site:
"Apache Kafka is a distributed publish-subscribe messaging system. It is designed to support the following
- Persistent messaging with O(1) disk structures that provide constant time performance even with many TB of stored messages.
- High-throughput: even with very modest hardware Kafka can support hundreds of thousands of messages per second.
- Explicit support for partitioning messages over Kafka servers and distributing consumption over a cluster of consumer machines while maintaining per-partition ordering semantics.
- Support for parallel data load into Hadoop."
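The per-partition ordering guarantee mentioned above follows from how keyed messages are assigned to partitions: all messages with the same key hash to the same partition, so they are appended and consumed in order, even though there is no global order across partitions. Here is a stdlib-only sketch of that assignment; the function and variable names are assumptions for illustration, not Kafka's API.

```python
# Illustrative sketch (assumed names, not Kafka's API) of keyed
# partitioning: a stable hash of the key picks the partition, so all
# messages for one key land in the same partition and stay in order.
import zlib

NUM_PARTITIONS = 3

def partition_for(key):
    # Stable hash so the same key always maps to the same partition.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

partitions = {p: [] for p in range(NUM_PARTITIONS)}
events = [("user-1", "login"), ("user-2", "login"),
          ("user-1", "click"), ("user-2", "logout"),
          ("user-1", "logout")]

for key, value in events:
    partitions[partition_for(key)].append((key, value))

# All of user-1's events sit in one partition, in the order produced.
p = partition_for("user-1")
user1_events = [v for k, v in partitions[p] if k == "user-1"]
print(user1_events)  # ['login', 'click', 'logout']
```

This is also why consumption scales across a cluster of consumer machines: each partition can be handled by a different consumer in parallel while the per-key ordering within each partition is preserved.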