Learning Apache Kafka - My Journey into Event Streaming
I recently spent time learning Apache Kafka and wanted to share my notes and key takeaways. If you're working with distributed systems or need to handle real-time data streams, Kafka is a game-changer.
What is Apache Kafka?
Kafka is an event streaming platform that acts like the central nervous system for real-time data flows. Think of it as a highly scalable message broker built to handle massive volumes of data with low latency.
Real-World Use Cases
Here's where I've seen Kafka shine:
- Financial Services: Real-time transaction processing in stock exchanges and banks
- Logistics: Tracking vehicles and shipments in real-time
- IoT: Capturing and analyzing sensor data from thousands of devices
- E-commerce: Reacting to customer interactions as they happen
- Healthcare: Monitoring patient data for timely treatment
- Microservices: Building event-driven architectures
How Kafka Works
Kafka has two main components:
Servers: Kafka runs as a cluster of servers. Some of them form the storage layer (the brokers), while others run Kafka Connect to continuously import and export data; replication across brokers ensures high availability.
Clients: Allow distributed applications to read, write, and process events at scale.
Key Concepts
- Event: A record of something that happened (key, value, timestamp, and optional metadata)
- Producers: Applications that publish events to Kafka
- Consumers: Applications that subscribe to and process events
- Topics: Organized storage for events, partitioned for scalability
- Consumer Groups: Multiple consumers working together to process messages in parallel
- Partitions: Allow topics to scale horizontally across multiple brokers
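To make these concepts concrete, here's a small TypeScript sketch (the names and shapes are illustrative, not from any real client library) of an event record and key-based partition routing. Real Kafka clients hash keys with murmur2; a simple deterministic hash is enough to show the idea that the same key always lands in the same partition:

```typescript
// An event record: key, value, timestamp, optional headers (illustrative shape)
interface EventRecord {
  key: string;
  value: string;
  timestamp: number;
  headers?: Record<string, string>;
}

// Simplified key-based partitioner: hash the key, mod by partition count.
function partitionFor(key: string, numPartitions: number): number {
  let hash = 0;
  for (const ch of key) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // unsigned 32-bit rolling hash
  }
  return hash % numPartitions;
}

const event: EventRecord = {
  key: 'order-42',
  value: JSON.stringify({ status: 'created' }),
  timestamp: Date.now(),
};

// The same key is always routed to the same partition,
// which is what preserves per-key ordering.
const p1 = partitionFor(event.key, 3);
const p2 = partitionFor(event.key, 3);
console.log(p1 === p2); // true
```

This is why the choice of key matters: all events for `order-42` stay in one partition and are consumed in order.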
Setting Up Kafka with Docker
Instead of dealing with complex installations, I used Docker to get Kafka running quickly. Here's my setup:
version: '2'
services:
  zookeeper:
    image: wurstmeister/zookeeper:latest
    ports:
      - "2181:2181"
  kafka:
    image: wurstmeister/kafka:latest
    ports:
      - "9092:9092"
    expose:
      - "9093"
    environment:
      KAFKA_ADVERTISED_LISTENERS: INSIDE://kafka:9093,OUTSIDE://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: INSIDE:PLAINTEXT,OUTSIDE:PLAINTEXT
      KAFKA_LISTENERS: INSIDE://0.0.0.0:9093,OUTSIDE://0.0.0.0:9092
      KAFKA_INTER_BROKER_LISTENER_NAME: INSIDE
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_CREATE_TOPICS: "tomyum-topic:1:1"
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
To start Kafka:
docker-compose up -d
To check status:
docker ps
To stop:
docker-compose down
Using KafkaJS with TypeScript
For my Node.js projects, I used KafkaJS - a modern JavaScript client for Kafka with excellent TypeScript support.
Setting Up
npm install kafkajs
npm install --save-dev typescript
Note that KafkaJS ships with its own TypeScript definitions, so no separate @types package is needed.
Connecting to Kafka
import { Kafka } from 'kafkajs';

const kafka = new Kafka({
  clientId: 'my-app',
  brokers: ['localhost:9092']
});

const consumer = kafka.consumer({ groupId: 'test-group' });
const consumer = kafka.consumer({ groupId: 'test-group' });
Consuming Messages
Here's a basic consumer that reads messages from a topic:
const run = async () => {
  await consumer.connect();
  await consumer.subscribe({ topic: 'test-topic', fromBeginning: true });
  await consumer.run({
    eachMessage: async ({ topic, partition, message }) => {
      console.log({
        partition,
        offset: message.offset,
        value: message.value?.toString(), // value can be null (e.g. tombstone records)
      });
    },
  });
};

run().catch(console.error);
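One gotcha: `message.value` arrives as a `Buffer` and can be `null`. Here's a small helper I wrote for decoding JSON payloads safely — my own utility, not part of KafkaJS:

```typescript
// Safely decode a message value that may be null or not valid JSON.
// Returns the parsed object, or null if there is nothing usable to parse.
function decodeJson<T>(value: Buffer | null): T | null {
  if (value === null) return null; // tombstone / empty payload
  try {
    return JSON.parse(value.toString('utf8')) as T;
  } catch {
    return null; // not JSON; caller decides how to handle raw bytes
  }
}

interface OrderEvent {
  orderId: string;
  amount: number;
}

const raw = Buffer.from(JSON.stringify({ orderId: 'a-1', amount: 9.99 }));
const order = decodeJson<OrderEvent>(raw);
console.log(order?.orderId); // 'a-1'
```

Inside `eachMessage` you'd call `decodeJson<OrderEvent>(message.value)` and skip or dead-letter anything that comes back `null`.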
Error Handling
KafkaJS provides robust error handling mechanisms:
consumer.on(consumer.events.CRASH, async event => {
  console.error('Consumer crashed', event);
});

consumer.on(consumer.events.DISCONNECT, async () => {
  console.log('Consumer disconnected');
});

consumer.on(consumer.events.CONNECT, async () => {
  console.log('Consumer connected');
});
Graceful Shutdown
Always handle graceful shutdown to prevent data loss:
process.on('SIGINT', async () => {
  try {
    await consumer.disconnect();
  } finally {
    process.exit(0);
  }
});
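If you wire up both SIGINT and SIGTERM (and maybe a crash handler too), `disconnect` can end up being called more than once. A small guard I found useful — my own pattern, not a KafkaJS feature — makes the cleanup run exactly once:

```typescript
// Wrap an async cleanup function so repeated or concurrent calls
// all share a single invocation.
function once<T>(fn: () => Promise<T>): () => Promise<T> {
  let result: Promise<T> | null = null;
  return () => {
    if (result === null) {
      result = fn();
    }
    return result;
  };
}

// Usage sketch:
// const shutdown = once(() => consumer.disconnect());
// ['SIGINT', 'SIGTERM'].forEach(sig =>
//   process.on(sig, async () => { await shutdown(); process.exit(0); })
// );

let calls = 0;
const shutdown = once(async () => { calls += 1; });
shutdown();
shutdown();
console.log(calls); // 1
```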
TypeScript Type Safety
One thing I love about KafkaJS is the built-in TypeScript support. You can define types for your messages:
interface KafkaMessage {
  key?: string;
  value: string;
  headers?: Record<string, unknown>;
}

const run = async () => {
  await consumer.connect();
  await consumer.subscribe({ topic: 'test-topic', fromBeginning: true });
  await consumer.run({
    eachMessage: async ({ topic, partition, message }) => {
      const kafkaMessage: KafkaMessage = {
        key: message.key?.toString(),
        value: message.value?.toString() ?? '',
        headers: message.headers,
      };
      console.log(kafkaMessage);
    },
  });
};
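Taking this a step further, compile-time types only help if the bytes on the wire actually match them. A generic parser with a runtime type guard keeps the types honest — again, this is my own pattern, not something KafkaJS provides:

```typescript
// Minimal shape of what eachMessage hands you (a subset of KafkaJS's type).
interface RawMessage {
  key: Buffer | null;
  value: Buffer | null;
}

// Parse a raw message and validate it at runtime before trusting the type.
function parseMessage<T>(
  message: RawMessage,
  isValid: (x: unknown) => x is T
): T | null {
  if (message.value === null) return null;
  let parsed: unknown;
  try {
    parsed = JSON.parse(message.value.toString('utf8'));
  } catch {
    return null;
  }
  return isValid(parsed) ? parsed : null;
}

interface UserEvent {
  userId: string;
  action: string;
}

const isUserEvent = (x: unknown): x is UserEvent =>
  typeof x === 'object' && x !== null &&
  typeof (x as UserEvent).userId === 'string' &&
  typeof (x as UserEvent).action === 'string';

const msg: RawMessage = {
  key: Buffer.from('user-1'),
  value: Buffer.from(JSON.stringify({ userId: 'user-1', action: 'login' })),
};
console.log(parseMessage(msg, isUserEvent)?.action); // 'login'
```

Malformed or unexpected payloads come back as `null` instead of blowing up the consumer.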
Consumer Groups and Partitions
This is where Kafka gets really powerful. Understanding consumer groups and partitions is crucial for scaling your applications.
Consumer Groups
If you want each consumer to process all messages independently, assign different group IDs:
// Consumer 1
const consumer1 = kafka.consumer({ groupId: 'consumer1-group' });
// Consumer 2
const consumer2 = kafka.consumer({ groupId: 'consumer2-group' });
Both consumers will receive all messages independently.
Partitions for Parallel Processing
If you want consumers in the same group to work in parallel, ensure your topic has enough partitions:
// Both consumers share the same group ID
const consumer1 = kafka.consumer({ groupId: 'consumer-group' });
const consumer2 = kafka.consumer({ groupId: 'consumer-group' });
Kafka will assign different partitions to each consumer, allowing parallel processing while maintaining message order within each partition.
Important: In a consumer group, only one consumer instance processes a particular partition at a time. This ensures message ordering and prevents duplication.
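Here's a hedged sketch of what the group coordinator's assignment roughly looks like — a simplified round-robin-style spread; the real Kafka rebalance protocol is considerably more involved:

```typescript
// Simplified round-robin assignment: spread partitions as evenly as
// possible across the consumers in a group, in order.
function assignPartitions(
  partitions: number[],
  consumers: string[]
): Map<string, number[]> {
  const assignment = new Map<string, number[]>(
    consumers.map((c): [string, number[]] => [c, []])
  );
  partitions.forEach((partition, i) => {
    const consumer = consumers[i % consumers.length];
    assignment.get(consumer)!.push(partition);
  });
  return assignment;
}

// 3 partitions, 2 consumers in the same group:
const result = assignPartitions([0, 1, 2], ['consumer-1', 'consumer-2']);
console.log(result.get('consumer-1')); // [ 0, 2 ]
console.log(result.get('consumer-2')); // [ 1 ]
```

Note the corollary: with more consumers than partitions, the extra consumers sit idle, which is why partition count caps your parallelism.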
Kafka Streams
Kafka Streams is a powerful library for building real-time, event-driven applications. It allows you to process data streams with transformations, aggregations, and joins.
Key Concepts
- Stream: An unbounded, continuously updating sequence of records
- Table: A snapshot of the latest state (like a database table)
- KStream: Represents an unbounded stream of data
- KTable: Represents a changelog stream with updates to keys
- Topology: The logical representation of your stream processing application
Common Operations
- Transformation: Mapping, filtering, grouping, and aggregating data
- Joining: Combining streams or a stream with a table based on keys
- Windowing: Grouping records into time-based windows for processing
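Kafka Streams itself is a Java library, but the windowing idea translates to any language. Here's a hedged TypeScript sketch of a tumbling (fixed-size, non-overlapping) window count — a toy in-memory version of what a windowed aggregation does:

```typescript
interface TimedEvent {
  key: string;
  timestampMs: number;
}

// Tumbling windows: fixed-size, non-overlapping buckets keyed by
// (windowStart, key). Returns a count per window per key.
function countByTumblingWindow(
  events: TimedEvent[],
  windowSizeMs: number
): Map<string, number> {
  const counts = new Map<string, number>();
  for (const e of events) {
    const windowStart = Math.floor(e.timestampMs / windowSizeMs) * windowSizeMs;
    const bucket = `${windowStart}:${e.key}`;
    counts.set(bucket, (counts.get(bucket) ?? 0) + 1);
  }
  return counts;
}

// Two events from 'u1' in the first 10s window, one in the next:
const events: TimedEvent[] = [
  { key: 'u1', timestampMs: 1_000 },
  { key: 'u1', timestampMs: 9_000 },
  { key: 'u1', timestampMs: 12_000 },
];
const counts = countByTumblingWindow(events, 10_000);
console.log(counts.get('0:u1'));     // 2
console.log(counts.get('10000:u1')); // 1
```

A real streams application maintains these buckets incrementally as events arrive and handles late data; the batch version above just shows the bucketing logic.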
What I Learned
Working with Kafka taught me several important lessons:
- Scalability comes from design: Partitions and consumer groups give you horizontal scalability without code changes.
- Event-driven thinking: Instead of thinking about request-response, think about events flowing through your system.
- Message ordering matters: Kafka guarantees ordering within a partition, but not across partitions. Design your partition keys carefully.
- Monitoring is crucial: With distributed systems, you need good observability into your consumers and producers.
- Start simple: Kafka can be complex, but starting with basic producers and consumers helps you understand the fundamentals before diving into streams and advanced features.
Next Steps
If you're interested in learning Kafka, I'd recommend:
- Start with Docker to avoid installation headaches
- Build a simple producer-consumer application
- Experiment with consumer groups and partitions
- Try Kafka Streams for real-time processing
- Check out the official Kafka documentation
You can find all my code examples and setup instructions in my GitHub repository.
Kafka has been an eye-opener for building scalable, event-driven systems. Whether you're building microservices, processing IoT data, or handling real-time analytics, it's worth adding to your toolkit.