Author: Pulkit Bindal
Basics of Kafka
Apache Kafka is an open-source distributed streaming platform designed to handle real-time data feeds. It is a high-throughput, low-latency system that can process and deliver large amounts of data quickly and efficiently.
At its core, Kafka is a distributed messaging system that allows you to send and receive data between different applications or systems. It uses a publish-subscribe model where data is produced by publishers and consumed by subscribers. Kafka is known for its scalability and fault tolerance which makes it a popular choice for handling large-scale data streams. It can be used for a variety of use cases including real-time data processing, data integration, and data analysis.
Overall, Kafka is a powerful and flexible tool for handling real-time data streams. Whether you are building a data processing pipeline, integrating multiple systems, or performing real-time data analysis, Kafka can help you achieve your goals efficiently and effectively.
Key Components of Kafka
Here are some key components of Kafka. Depending on the use case, we may need to use some or all of these components to build a Kafka-based data processing pipeline.
1. Topics: Kafka organizes data into topics which are identified by a name. Topics are further divided into partitions which allow for parallel processing of data.
2. Producers: Producers are applications that write data to Kafka topics. They can send data to specific partitions or let Kafka handle partitioning.
3. Consumers: Consumers are applications that read data from Kafka topics. They can read data from one or more partitions and can be configured to start reading data from a specific offset.
4. Brokers: Kafka brokers are the servers that make up the Kafka cluster. They are responsible for storing and serving data as well as handling administrative tasks like creating and deleting topics.
5. Connectors: Kafka Connect is a framework for building and running connectors that move data between Kafka and other systems. Connectors can be used to integrate Kafka with databases, file systems, and other messaging systems.
6. Streams: Kafka Streams is a library for building real-time stream processing applications. It allows you to process data as it is received, rather than storing and processing it later.
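7. ZooKeeper: In ZooKeeper-based deployments such as the one used in this post, Kafka relies on Apache ZooKeeper for maintaining cluster metadata and configuration, providing distributed synchronization, and electing the controller broker, which in turn assigns leaders for partitions. ZooKeeper is a distributed coordination service that helps Kafka operate as a distributed system. Newer Kafka releases can also run without ZooKeeper by using the built-in KRaft mode.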
8. Cluster: A Kafka cluster is a group of Kafka brokers working together to serve Kafka topics. A cluster typically consists of multiple brokers spread across multiple machines. For each partition, one broker acts as the leader and other brokers can hold replicas. Kafka’s distributed nature allows it to scale horizontally, handling large volumes of data across many machines.
9. Consumer Groups: Kafka allows multiple consumers to read from the same topic, and it also lets you group consumers together into a consumer group. Each consumer group can have multiple consumers, and each consumer within a group reads from its own subset of the partitions. This allows for parallel processing of data within a group, while also ensuring that each message is processed by only one consumer within that group.
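To make the relationship between topics, partitions, and consumer groups more concrete, here is a small Python sketch. It does not talk to Kafka at all. The function names choose_partition and assign_partitions are hypothetical, CRC32 is only a stand-in for the hash that Kafka's default partitioner actually uses (murmur2), and the round-robin assignment is a simplification of what Kafka's real assignors do.

```python
import zlib
from typing import Dict, List


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition number.

    Assumption: CRC32 is a stand-in hash for illustration only;
    Kafka's default partitioner uses a different hash (murmur2).
    """
    return zlib.crc32(key) % num_partitions


def assign_partitions(partitions: List[int], consumers: List[str]) -> Dict[str, List[int]]:
    """Spread the partitions of a topic over the consumers of one group.

    Every partition gets exactly one owner inside the group, which is
    why a message is processed by only one consumer per group.
    """
    assignment: Dict[str, List[int]] = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


if __name__ == "__main__":
    partitions = list(range(6))  # a topic with 6 partitions
    group = ["consumer-a", "consumer-b", "consumer-c"]

    # Records with the same key always land in the same partition.
    print(choose_partition(b"user-42", len(partitions)))

    # Each consumer in the group owns a disjoint subset of partitions.
    print(assign_partitions(partitions, group))
    # {'consumer-a': [0, 3], 'consumer-b': [1, 4], 'consumer-c': [2, 5]}
```

The point of the sketch is that parallelism within a group is bounded by the partition count: with six partitions, a seventh consumer in the same group would have nothing to read.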
Why we need a Kafka Cluster
A Kafka cluster is a tool that helps businesses handle large amounts of data that are generated from various sources such as social media platforms, financial systems, and IoT devices. This tool enables companies to process and analyze data in real time, making it easier to make quick decisions and respond to events as they happen. Kafka’s distributed architecture ensures that the system remains reliable and available even if one or more nodes fail. This makes it suitable for handling mission-critical applications that require high availability and scalability. Therefore, businesses that need to process large amounts of data quickly and efficiently should consider using a Kafka cluster.
How the problem arises
Here is a practical use case for running a Kafka cluster locally using Docker:
Suppose you are a developer working on a project that requires processing large volumes of real-time data. You have decided to use Kafka as your messaging system, but you don’t want to install and configure a full-blown Kafka cluster by hand on your local machine. To solve this problem, you can use Docker to run a small Kafka cluster locally. This allows you to test and develop your Kafka-based application without having to set up and manage a complex cluster.
Here is a step-by-step tutorial for using Docker and the docker-compose.yml file shown below to set up a Kafka cluster locally:
- Install and start Docker on your computer if you haven’t already.
- Create a new directory on your computer and save the docker-compose.yml file below in it.
Complete Source Code: KAFKA-BROKER
version: '3'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.0.1
    container_name: zookeeper
    ports:
      - "2181:2181"
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
  broker:
    image: confluentinc/cp-kafka:7.0.1
    container_name: broker
    depends_on:
      - zookeeper
    ports:
      - "29092:29092"
      - "9092:9092"
      - "9101:9101"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: 'zookeeper:2181'
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://broker:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
      KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0
      KAFKA_JMX_PORT: 9101
      KAFKA_JMX_HOSTNAME: localhost
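The KAFKA_ADVERTISED_LISTENERS value in this file is the setting that most often trips people up, so here is a minimal Python sketch that simply splits the string into its parts. The helper name parse_advertised_listeners is hypothetical and the parsing is simplified for this one value. It is meant to show that the file advertises two addresses: broker:29092 for clients running in other containers on the same Compose network, and localhost:9092 for clients running directly on your machine.

```python
from typing import Dict, Tuple


def parse_advertised_listeners(value: str) -> Dict[str, Tuple[str, int]]:
    """Turn 'NAME://host:port,NAME2://host2:port2' into {NAME: (host, port)}."""
    listeners: Dict[str, Tuple[str, int]] = {}
    for entry in value.split(","):
        name, address = entry.strip().split("://", 1)
        host, port = address.rsplit(":", 1)
        listeners[name] = (host, int(port))
    return listeners


# Value copied from the docker-compose.yml file in this post.
ADVERTISED = "PLAINTEXT://broker:29092,PLAINTEXT_HOST://localhost:9092"

listeners = parse_advertised_listeners(ADVERTISED)

# Address for applications running on the host machine.
host_name, host_port = listeners["PLAINTEXT_HOST"]
print(f"from the host:      {host_name}:{host_port}")

# Address for applications running in containers on the same Compose network.
net_name, net_port = listeners["PLAINTEXT"]
print(f"from the containers: {net_name}:{net_port}")
```

An application started on your laptop would therefore use localhost:9092 as its bootstrap server, while a service added to the same Compose file would use broker:29092.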
- Open a terminal window and change to the directory where you saved the docker-compose.yml file.
- Start the Kafka cluster by using the following command:
docker compose up -d
With the -d flag, the ZooKeeper and Kafka broker containers launch in detached mode, so the terminal stays free for further commands.
- Run the following command to see if the Kafka cluster is up and running:
docker compose ps
The ps command displays the status of the ZooKeeper and Kafka broker containers.
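As an additional check from the host side, the following Python sketch tries to open a TCP connection to the ports published in the docker-compose.yml file (2181 for ZooKeeper and 9092 for the broker). The function name is_port_open is hypothetical, and a successful connection only shows that the port accepts connections, not that the broker has finished starting up.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Ports taken from the "ports" sections of the docker-compose.yml file.
    for name, port in (("ZooKeeper", 2181), ("Kafka broker", 9092)):
        state = "reachable" if is_port_open("localhost", port) else "not reachable"
        print(f"{name} on localhost:{port}: {state}")
```

If either port is reported as not reachable shortly after startup, it may be worth waiting a few seconds and running the check again before looking at the container logs.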
You are done now! With Docker, you have successfully set up a Kafka cluster locally, and you can now test your Kafka applications against it.
In this blog, we described how to set up a Kafka cluster locally using Docker and a docker-compose.yml file. Without having to install and manage a full Kafka deployment by hand on our local PC, we can quickly start a local Kafka cluster using Docker and test our Kafka applications. With this configuration, it is simple to begin developing real-time data pipelines and streaming apps.