These days, massively scalable pub/sub messaging is virtually synonymous with Apache Kafka. Kafka remains the rock-solid, open-source, go-to choice for distributed streaming applications, whether you’re adding something like Apache Storm or Apache Spark for processing or using Kafka’s own processing tools. But Kafka isn’t the only game in town.
Developed at Yahoo and now an Apache Software Foundation project, Apache Pulsar is gunning for the messaging crown that Apache Kafka has worn for many years. Pulsar offers the potential for higher throughput and lower latency than Kafka in many situations, along with a compatible API that lets developers switch from Kafka to Pulsar with relative ease.
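As a rough sketch of what that switch can look like, the snippet below uses Pulsar’s Kafka compatibility wrapper. It assumes the pulsar-client-kafka artifact has replaced the standard kafka-clients dependency on the classpath, and that a Pulsar broker is listening on its default local port; the topic name is a hypothetical stand-in. The producer code itself is stock Kafka API, and only the connection string changes:

```java
import java.util.Properties;
// With Pulsar's Kafka wrapper on the classpath, these imports resolve
// to Pulsar-backed implementations of the familiar Kafka classes.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PulsarViaKafkaApi {
    public static void main(String[] args) {
        Properties props = new Properties();
        // A Pulsar service URL replaces the Kafka broker list;
        // localhost:6650 (Pulsar's default port) is an assumption here.
        props.put("bootstrap.servers", "pulsar://localhost:6650");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Unchanged Kafka producer code, now publishing to Pulsar.
            producer.send(new ProducerRecord<>("my-topic", "key", "value"));
        }
    }
}
```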
How should one choose between the venerable stalwart Apache Kafka and the upstart Apache Pulsar? Let’s look at their core open source offerings and what their maintainers’ enterprise editions bring to the table.
Apache Kafka
Developed by LinkedIn and released as open source back in 2011, Apache Kafka has spread far and wide, pretty much becoming the default choice for many when thinking about adding a service bus or pub/sub system to an architecture. Since Apache Kafka’s debut, the Kafka ecosystem has grown considerably, adding the Schema Registry to enforce schemas in Apache Kafka messaging, Kafka Connect for easy streaming from other data sources such as databases into Kafka, Kafka Streams for distributed stream processing, and most recently KSQL for performing SQL-like querying over Kafka topics. (A topic in Kafka is the name for a particular channel.)
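For a sense of how lightweight the core API is, here is a minimal sketch of publishing to a topic with the standard Java client. It assumes a broker running on localhost:9092 and a hypothetical topic named page-views; in a production setup, the Schema Registry and an Avro serializer would typically replace the plain string serializers shown here:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumes a Kafka broker on the default local port.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one record to the hypothetical "page-views" topic.
            producer.send(new ProducerRecord<>("page-views", "user-42", "/index.html"));
        }
    }
}
```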
The standard use case for many real-time pipelines built over the past few years has been to push data into Apache Kafka and then use a stream processor such as Apache Storm or Apache Spark to pull in that data, perform some processing, and publish the output to another topic for downstream consumption. With Kafka Streams and KSQL, all of your data pipeline needs can be handled without leaving the Apache Kafka project at all, though of course you can still use an external service to process your data if required.
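A minimal Kafka Streams topology shows the shape of such an all-Kafka pipeline: it reads from one topic, transforms each record, and writes the result to another topic, with no external processing cluster involved. The topic names and the uppercasing transform below are hypothetical stand-ins for a real workload:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercasePipeline {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-pipeline");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Consume from one topic, transform each value, publish to another.
        KStream<String, String> source = builder.stream("raw-events");
        source.mapValues(value -> value.toUpperCase())
              .to("processed-events");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```

KSQL can express roughly the same pipeline as a single CREATE STREAM ... AS SELECT statement, with no JVM application to build and deploy at all.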