Log segments

What are Kafka Log Segments?#

Kafka log segments are the mechanism that lets Kafka manage and store vast amounts of streaming data efficiently. By breaking a partition's log into smaller, configurable segments, Kafka achieves high performance, manageability, and robust data retention policies.

All messages published by producers to a Kafka topic are stored in Kafka logs.

These logs are where messages physically reside on disk, and they play a vital role in enabling communication between producers and consumers via the Kafka cluster.

Traditionally, one might imagine all messages for a topic's partition being stored in a single, ever-growing log file. However, Kafka takes a more efficient approach:

  1. Instead of creating one single, large log file for a particular partition, Kafka creates several smaller files to store all messages.
  2. These small, individual files within a partition on a server are called segments.

Why Segments?#

Imagine a very large book that keeps growing infinitely. If you needed to find a specific page, or if the book became corrupted, managing one massive file would be incredibly difficult and inefficient.

Kafka segments address this by:

  1. Managing Large Volumes of Data: By breaking down a single massive log into smaller, manageable segments, Kafka can handle terabytes or petabytes of data more effectively.
  2. Efficient Retention Policies: Older segments can be deleted or archived wholesale without affecting the active segment where new messages are being appended (see the retention settings sketched after this list).
  3. Improved Recovery: In case of corruption or failure, smaller segments are faster to recover or replicate.
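
The retention behavior in point 2 is driven by broker-level (or per-topic) settings. A minimal sketch of the relevant properties in server.properties, with illustrative values rather than recommendations:

# Delete segments whose newest message is older than 7 days
log.retention.hours=168
# Optional size cap per partition; -1 disables the size-based limit
log.retention.bytes=-1
# "delete" discards old segments; "compact" keeps only the latest value per key
log.cleanup.policy=delete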

How Are New Segments Created?#

Messages are continuously appended to the currently active log segment in a given partition. Kafka is configured with a maximum size limit for each log segment file. Once the current segment file reaches this configured size (in bytes), Kafka automatically creates a new, empty log segment file for subsequent messages. This ensures that no single log file becomes excessively large.
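
This size limit is the broker property log.segment.bytes (default 1 GiB), which individual topics can override with segment.bytes; Kafka can also roll segments by age (log.roll.hours), independent of size. A minimal sketch of the broker-level setting in server.properties:

# Roll a new segment once the active one reaches 1 GiB (the default)
log.segment.bytes=1073741824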

Why Segmentation#

Kafka doesn't write all messages into a single, ever-growing log file. This would become unwieldy and inefficient for operations like deletion or replication. Instead, Kafka divides its log files into multiple smaller segments.

This segmentation is controlled by the log.segment.bytes property. When the active segment reaches the configured size limit (e.g., 2000 bytes, as set in a demo), Kafka closes the current segment and starts a new one. Each segment has its own .log, .index, and .timeindex files. A per-topic version of this setting is shown below.
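
As a hedged example, the 2000-byte demo limit could be applied to a single topic with the kafka-configs tool (the topic name my-topic and the bootstrap address are assumptions for illustration):

kafka-configs.bat --bootstrap-server localhost:9092 --entity-type topics --entity-name my-topic --alter --add-config segment.bytes=2000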

File Naming Convention#

A key pattern to observe in Kafka's segmented logs is how the files are named.

Each .log file, its corresponding .index file, and its .timeindex file within a partition directory share a common name prefix. This prefix is the offset of the first message contained in that segment, known as the segment's base offset.

Examples:

00000000000000000000.log indicates the segment starts at offset 0.
00000000000000000027.log indicates the segment starts at offset 27.
00000000000000000090.log indicates the segment starts at offset 90.

This naming convention is crucial for quickly identifying which segment contains a particular message.
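
Putting it together, a partition directory for a hypothetical topic my-topic (partition 0) holding these three segments would contain triples of files sharing each base offset (the listing is illustrative; real directories also hold a few bookkeeping files):

my-topic-0/
    00000000000000000000.log
    00000000000000000000.index
    00000000000000000000.timeindex
    00000000000000000027.log
    00000000000000000027.index
    00000000000000000027.timeindex
    00000000000000000090.log
    00000000000000000090.index
    00000000000000000090.timeindex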

How Lookup Works with Multiple Segments#

When a consumer requests a message by offset in a multi-segment environment, Kafka follows a three-step process (a code sketch of the idea follows the list):

  1. Locate the Segment File (by filename): The Kafka broker first determines which log segment contains the requested offset. It does this by checking the file names in the partition directory. Since file names indicate the starting offset of each segment, the broker can quickly identify the correct .log file without opening any files. For example, if offset 100 is requested, the broker knows it must be in the 00000000000000000090.log file because messages start from offset 90 in this segment, and the next segment starts from offset 109.
  2. Lookup in the Segment's .index File: Once the correct log segment (.log file) is identified, the broker goes to its corresponding .index file (e.g., 00000000000000000090.index). It performs a binary search within this sparse index to find the largest indexed offset that is less than or equal to the requested offset, along with its byte position in the log file.
  3. Scan within the Segment's .log File: Finally, with the approximate byte position from the index file, the broker navigates to that position within the actual .log file and starts scanning from there to find the exact message(s) requested.
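
To illustrate steps 1 and 2, here is a minimal Python sketch of the lookup idea: one binary search over the segments' base offsets (recovered from the file names), then a second over that segment's sparse index entries. Kafka's .index entries store offsets relative to the segment's base offset, and the sketch mirrors that; the numbers are illustrative and this is not Kafka's actual implementation.

import bisect

# Base offsets recovered from the segment file names, kept sorted.
base_offsets = [0, 27, 90]  # 00000000000000000000.log, ...27.log, ...90.log

def find_segment(target_offset):
    # Largest base offset <= target picks the segment file (step 1).
    i = bisect.bisect_right(base_offsets, target_offset) - 1
    return f"{base_offsets[i]:020d}.log"

# Sparse index for the 90-segment: (relative offset, byte position) pairs.
index_entries = [(0, 0), (5, 4096), (12, 8192)]

def find_position(target_offset, base_offset):
    # Largest indexed offset <= target gives the byte position to scan from (step 2).
    rel = target_offset - base_offset
    keys = [entry[0] for entry in index_entries]
    i = bisect.bisect_right(keys, rel) - 1
    return index_entries[i][1]

print(find_segment(100))       # -> 00000000000000000090.log
print(find_position(100, 90))  # -> 4096; step 3 scans forward from this byte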

This multi-step approach ensures that even with hundreds or thousands of gigabytes of messages, Kafka can locate any message with minimal disk I/O and latency.

Dumping and Reading Contents of Log, Index, or TimeIndex Files#

The DumpLogSegments tool (invoked via kafka-run-class) lets you view the structured content of these binary files:

kafka-run-class.bat kafka.tools.DumpLogSegments --files [path_to_log_file.log] --print-data-log

Example for .log file:
kafka-run-class.bat kafka.tools.DumpLogSegments --files C:\kafka\kafka-logs\my-topic-0\00000000000000000000.log --print-data-log

Example for .index file:
kafka-run-class.bat kafka.tools.DumpLogSegments --files C:\kafka\kafka-logs\my-topic-0\00000000000000000000.index --print-data-log

Example for .timeindex file:
kafka-run-class.bat kafka.tools.DumpLogSegments --files C:\kafka\kafka-logs\my-topic-0\00000000000000000000.timeindex --print-data-log
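
On Linux or macOS, the same class is invoked through kafka-run-class.sh; assuming the default log directory of /tmp/kafka-logs, an equivalent command would be:

kafka-run-class.sh kafka.tools.DumpLogSegments --files /tmp/kafka-logs/my-topic-0/00000000000000000000.log --print-data-log

Newer Kafka releases also ship a kafka-dump-log.sh (kafka-dump-log.bat) wrapper for the same tool.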