Learning Apache Kafka - My Journey into Event Streaming
I recently spent time learning Apache Kafka and wanted to share my notes and key takeaways. If you're working with distributed systems or need to handle real-time data streams, Kafka is a game-changer.
What is Apache Kafka?
Kafka is an event streaming platform that acts like the central nervous system for real-time data flows. Think of it as a highly scalable message broker built to handle massive volumes of data with low latency.
Real-World Use Cases
Here's where I've seen Kafka shine:
- Financial Services: Real-time transaction processing in stock exchanges and banks
- Logistics: Tracking vehicles and shipments in real-time
- IoT: Capturing and analyzing sensor data from thousands of devices
- E-commerce: Reacting to customer interactions as they happen
- Healthcare: Monitoring patient data for timely treatment
- Microservices: Building event-driven architectures
How Kafka Works
Kafka has two main components:
Servers: Kafka runs as a cluster of servers. Some of them form the storage layer (the brokers), while others run Kafka Connect to continuously import and export data; replication across brokers ensures high availability.
Clients: Allow distributed applications to read, write, and process events at scale.
Key Concepts
- Event: A record of something that happened (key, value, timestamp, and optional metadata)
- Producers: Applications that publish events to Kafka
- Consumers: Applications that subscribe to and process events
- Topics: Organized storage for events, partitioned for scalability
- Consumer Groups: Multiple consumers working together to process messages in parallel
- Partitions: Allow topics to scale horizontally across multiple brokers
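To make these concepts concrete, here's a small TypeScript sketch (the names and shapes are illustrative, not from any real client library) of an event record and key-based partition routing. Real Kafka clients hash keys with murmur2; a simple deterministic hash is enough to show the idea that the same key always lands in the same partition:

```typescript
// An event record: key, value, timestamp, optional headers (illustrative shape)
interface EventRecord {
  key: string;
  value: string;
  timestamp: number;
  headers?: Record<string, string>;
}

// Simplified key-based partitioner: hash the key, mod by partition count.
function partitionFor(key: string, numPartitions: number): number {
  let hash = 0;
  for (const ch of key) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // unsigned 32-bit rolling hash
  }
  return hash % numPartitions;
}

const event: EventRecord = {
  key: 'order-42',
  value: JSON.stringify({ status: 'created' }),
  timestamp: Date.now(),
};

// The same key is always routed to the same partition,
// which is what preserves per-key ordering.
const p1 = partitionFor(event.key, 3);
const p2 = partitionFor(event.key, 3);
console.log(p1 === p2); // true
```

This is why the choice of key matters: all events for `order-42` stay in one partition and are consumed in order.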
Setting Up Kafka with Docker
Instead of dealing with complex installations, I used Docker to get Kafka running quickly. Here's my setup:
version: '2'
services:
  zookeeper:
    image: wurstmeister/zookeeper:latest
    ports:
      - "2181:2181"
  kafka:
    image: wurstmeister/kafka:latest
    ports:
      - "9092:9092"
    expose:
      - "9093"
    environment:
      KAFKA_ADVERTISED_LISTENERS: INSIDE://kafka:9093,OUTSIDE://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: INSIDE:PLAINTEXT,OUTSIDE:PLAINTEXT
      KAFKA_LISTENERS: INSIDE://0.0.0.0:9093,OUTSIDE://0.0.0.0:9092
      KAFKA_INTER_BROKER_LISTENER_NAME: INSIDE
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_CREATE_TOPICS: "tomyum-topic:1:1"
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
To start Kafka:
docker-compose up -d
To check status:
docker ps
To stop:
docker-compose down
Using KafkaJS with TypeScript
For my Node.js projects, I used KafkaJS - a modern JavaScript client for Kafka with excellent TypeScript support.
Setting Up
npm install kafkajs
npm install --save-dev typescript
Note that KafkaJS ships with its own TypeScript definitions, so no separate @types package is needed.
Connecting to Kafka
import { Kafka } from 'kafkajs';

const kafka = new Kafka({
  clientId: 'my-app',
  brokers: ['localhost:9092']
});

const consumer = kafka.consumer({ groupId: 'test-group' });
const consumer = kafka.consumer({ groupId: 'test-group' });
Consuming Messages
Here's a basic consumer that reads messages from a topic:
const run = async () => {
  await consumer.connect();
  await consumer.subscribe({ topic: 'test-topic', fromBeginning: true });
  await consumer.run({
    eachMessage: async ({ topic, partition, message }) => {
      console.log({
        partition,
        offset: message.offset,
        value: message.value?.toString(), // value can be null (e.g. tombstone records)
      });
    },
  });
};

run().catch(console.error);
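One gotcha: `message.value` arrives as a `Buffer` and can be `null`. Here's a small helper I wrote for decoding JSON payloads safely — my own utility, not part of KafkaJS:

```typescript
// Safely decode a message value that may be null or not valid JSON.
// Returns the parsed object, or null if there is nothing usable to parse.
function decodeJson<T>(value: Buffer | null): T | null {
  if (value === null) return null; // tombstone / empty payload
  try {
    return JSON.parse(value.toString('utf8')) as T;
  } catch {
    return null; // not JSON; caller decides how to handle raw bytes
  }
}

interface OrderEvent {
  orderId: string;
  amount: number;
}

const raw = Buffer.from(JSON.stringify({ orderId: 'a-1', amount: 9.99 }));
const order = decodeJson<OrderEvent>(raw);
console.log(order?.orderId); // 'a-1'
```

Inside `eachMessage` you'd call `decodeJson<OrderEvent>(message.value)` and skip or dead-letter anything that comes back `null`.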
Error Handling
KafkaJS provides robust error handling mechanisms:
consumer.on(consumer.events.CRASH, async event => {
  console.error('Consumer crashed', event);
});

consumer.on(consumer.events.DISCONNECT, async () => {
  console.log('Consumer disconnected');
});

consumer.on(consumer.events.CONNECT, async () => {
  console.log('Consumer connected');
});
Graceful Shutdown
Always handle graceful shutdown to prevent data loss:
process.on('SIGINT', async () => {
  try {
    await consumer.disconnect();
  } finally {
    process.exit(0);
  }
});
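If you wire up both SIGINT and SIGTERM (and maybe a crash handler too), `disconnect` can end up being called more than once. A small guard I found useful — my own pattern, not a KafkaJS feature — makes the cleanup run exactly once:

```typescript
// Wrap an async cleanup function so repeated or concurrent calls
// all share a single invocation.
function once<T>(fn: () => Promise<T>): () => Promise<T> {
  let result: Promise<T> | null = null;
  return () => {
    if (result === null) {
      result = fn();
    }
    return result;
  };
}

// Usage sketch:
// const shutdown = once(() => consumer.disconnect());
// ['SIGINT', 'SIGTERM'].forEach(sig =>
//   process.on(sig, async () => { await shutdown(); process.exit(0); })
// );

let calls = 0;
const shutdown = once(async () => { calls += 1; });
shutdown();
shutdown();
console.log(calls); // 1
```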
TypeScript Type Safety
One thing I love about KafkaJS is the built-in TypeScript support. You can define types for your messages:
interface KafkaMessage {
  key?: string;
  value: string;
  headers?: Record<string, unknown>;
}

const run = async () => {
  await consumer.connect();
  await consumer.subscribe({ topic: 'test-topic', fromBeginning: true });
  await consumer.run({
    eachMessage: async ({ topic, partition, message }) => {
      const kafkaMessage: KafkaMessage = {
        key: message.key?.toString(),
        value: message.value?.toString() ?? '',
        headers: message.headers,
      };
      console.log(kafkaMessage);
    },
  });
};
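Taking this a step further, compile-time types only help if the bytes on the wire actually match them. A generic parser with a runtime type guard keeps the types honest — again, this is my own pattern, not something KafkaJS provides:

```typescript
// Minimal shape of what eachMessage hands you (a subset of KafkaJS's type).
interface RawMessage {
  key: Buffer | null;
  value: Buffer | null;
}

// Parse a raw message and validate it at runtime before trusting the type.
function parseMessage<T>(
  message: RawMessage,
  isValid: (x: unknown) => x is T
): T | null {
  if (message.value === null) return null;
  let parsed: unknown;
  try {
    parsed = JSON.parse(message.value.toString('utf8'));
  } catch {
    return null;
  }
  return isValid(parsed) ? parsed : null;
}

interface UserEvent {
  userId: string;
  action: string;
}

const isUserEvent = (x: unknown): x is UserEvent =>
  typeof x === 'object' && x !== null &&
  typeof (x as UserEvent).userId === 'string' &&
  typeof (x as UserEvent).action === 'string';

const msg: RawMessage = {
  key: Buffer.from('user-1'),
  value: Buffer.from(JSON.stringify({ userId: 'user-1', action: 'login' })),
};
console.log(parseMessage(msg, isUserEvent)?.action); // 'login'
```

Malformed or unexpected payloads come back as `null` instead of blowing up the consumer.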
Consumer Groups and Partitions
This is where Kafka gets really powerful. Understanding consumer groups and partitions is crucial for scaling your applications.
Consumer Groups
If you want each consumer to process all messages independently, assign different group IDs:
// Consumer 1
const consumer1 = kafka.consumer({ groupId: 'consumer1-group' });
// Consumer 2
const consumer2 = kafka.consumer({ groupId: 'consumer2-group' });
Both consumers will receive all messages independently.
Partitions for Parallel Processing
If you want consumers in the same group to work in parallel, ensure your topic has enough partitions:
// Both consumers share the same group ID
const consumer1 = kafka.consumer({ groupId: 'consumer-group' });
const consumer2 = kafka.consumer({ groupId: 'consumer-group' });
Kafka will assign different partitions to each consumer, allowing parallel processing while maintaining message order within each partition.
Important: In a consumer group, only one consumer instance processes a particular partition at a time. This ensures message ordering and prevents duplication.
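Here's a hedged sketch of what the group coordinator's assignment roughly looks like — a simplified round-robin-style spread; the real Kafka rebalance protocol is considerably more involved:

```typescript
// Simplified round-robin assignment: spread partitions as evenly as
// possible across the consumers in a group, in order.
function assignPartitions(
  partitions: number[],
  consumers: string[]
): Map<string, number[]> {
  const assignment = new Map<string, number[]>(
    consumers.map((c): [string, number[]] => [c, []])
  );
  partitions.forEach((partition, i) => {
    const consumer = consumers[i % consumers.length];
    assignment.get(consumer)!.push(partition);
  });
  return assignment;
}

// 3 partitions, 2 consumers in the same group:
const result = assignPartitions([0, 1, 2], ['consumer-1', 'consumer-2']);
console.log(result.get('consumer-1')); // [ 0, 2 ]
console.log(result.get('consumer-2')); // [ 1 ]
```

Note the corollary: with more consumers than partitions, the extra consumers sit idle, which is why partition count caps your parallelism.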
Kafka Streams
Kafka Streams is a powerful library for building real-time, event-driven applications. It allows you to process data streams with transformations, aggregations, and joins.
Key Concepts
- Stream: An unbounded, continuously updating sequence of records
- Table: A snapshot of the latest state (like a database table)
- KStream: Represents an unbounded stream of data
- KTable: Represents a changelog stream with updates to keys
- Topology: The logical representation of your stream processing application
Common Operations
- Transformation: Mapping, filtering, grouping, and aggregating data
- Joining: Combining streams or a stream with a table based on keys
- Windowing: Grouping records into time-based windows for processing
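Kafka Streams itself is a Java library, but the windowing idea translates to any language. Here's a hedged TypeScript sketch of a tumbling (fixed-size, non-overlapping) window count — a toy in-memory version of what a windowed aggregation does:

```typescript
interface TimedEvent {
  key: string;
  timestampMs: number;
}

// Tumbling windows: fixed-size, non-overlapping buckets keyed by
// (windowStart, key). Returns a count per window per key.
function countByTumblingWindow(
  events: TimedEvent[],
  windowSizeMs: number
): Map<string, number> {
  const counts = new Map<string, number>();
  for (const e of events) {
    const windowStart = Math.floor(e.timestampMs / windowSizeMs) * windowSizeMs;
    const bucket = `${windowStart}:${e.key}`;
    counts.set(bucket, (counts.get(bucket) ?? 0) + 1);
  }
  return counts;
}

// Two events from 'u1' in the first 10s window, one in the next:
const events: TimedEvent[] = [
  { key: 'u1', timestampMs: 1_000 },
  { key: 'u1', timestampMs: 9_000 },
  { key: 'u1', timestampMs: 12_000 },
];
const counts = countByTumblingWindow(events, 10_000);
console.log(counts.get('0:u1'));     // 2
console.log(counts.get('10000:u1')); // 1
```

A real streams application maintains these buckets incrementally as events arrive and handles late data; the batch version above just shows the bucketing logic.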
What I Learned
Working with Kafka taught me several important lessons:
- Scalability comes from design: Partitions and consumer groups give you horizontal scalability without code changes.
- Event-driven thinking: Instead of thinking about request-response, think about events flowing through your system.
- Message ordering matters: Kafka guarantees ordering within a partition, but not across partitions. Design your partition keys carefully.
- Monitoring is crucial: With distributed systems, you need good observability into your consumers and producers.
- Start simple: Kafka can be complex, but starting with basic producers and consumers helps you understand the fundamentals before diving into streams and advanced features.
Next Steps
If you're interested in learning Kafka, I'd recommend:
- Start with Docker to avoid installation headaches
- Build a simple producer-consumer application
- Experiment with consumer groups and partitions
- Try Kafka Streams for real-time processing
- Check out the official Kafka documentation
You can find all my code examples and setup instructions in my GitHub repository.
Kafka has been an eye-opener for building scalable, event-driven systems. Whether you're building microservices, processing IoT data, or handling real-time analytics, it's worth adding to your toolkit.