Index & Timeindex

When you set up a Kafka cluster, all the Kafka and ZooKeeper logs are stored in a dedicated kafka-logs folder. Within this folder, each topic and its partitions have their own directories.

For instance, a topic named my-topic with one partition (partition 0) will have a folder named my-topic-0.

Inside these partition-specific folders, Kafka stores all its messages in log files. These messages are appended sequentially to the log files, meaning new messages are always added to the end. This append-only design is fundamental to Kafka's performance.

However, alongside these main log files (often with a .log extension), you'll notice other files with .index and .timeindex extensions. These are not where the actual message data resides, but they play a critical role in making message retrieval highly efficient.

The Challenge of Large Log Files#

Imagine a scenario where a Kafka log file grows to be enormously large as more and more messages are produced. If a consumer requests messages starting from a specific offset (e.g., offset 1500), the Kafka broker would have to scan the entire log file from the beginning until it finds the requested offset. This full-file scanning is highly inefficient, especially with high message throughput and retention. It's analogous to searching for a specific record in a traditional database without any indexes – you'd have to read every single row until you find what you're looking for. To avoid such inefficiencies, databases use indexing. Kafka implements a similar concept for its log files.

The Role of `.index` Files (Offset Index)#

Purpose and Structure: The .index files are Kafka's solution to the large log file scanning problem. Their primary purpose is to help the Kafka broker quickly find the exact position (byte offset) of a message for a given offset within a log file.

An .index file contains a mapping of offset to position (byte offset) within the corresponding .log file.

Example Content of an .index file:

offset: 831 position: 17165
offset: 925 position: 19165
offset: 1480 position: 30165
offset: 1587 position: 32165

This indicates that the message with offset 831 is located at byte position 17165 in the actual log file.

How it Works

When a consumer requests messages from a specific offset (e.g., offset 1500).
The Kafka broker first consults the .index file.
Since the offsets in the index file are stored in sorted (ascending) order, the broker can perform a binary search on the offset values.
This binary search quickly identifies the range where the requested offset should be. For offset 1500, the broker would find that it falls between offset 1480 and offset 1587.
Knowing the byte position for offset 1480 (which is 30165), the broker knows it only needs to scan the actual log file from that approximate position onwards, drastically reducing the search area instead of starting from the beginning of the file.
Configuration: log.index.interval.bytes Kafka doesn't write an entry into the .index file for every single message. Instead, it writes entries periodically based on the accumulated data size.

The default configuration property that controls this behavior is log.index.interval.bytes. Its default value is 4096 bytes. This means that roughly every 4096 bytes of new data accumulated in the Kafka topic, a new offset and its corresponding position will be written to the .index file.

Relative Offsets for Efficiency

To save space and improve efficiency, the .index file stores relative offsets instead of absolute offsets.

Every log segment (which we'll discuss next) has a base offset – the starting offset of messages within that segment. In the .index file, the offsets are stored as the difference (or "shift") from this base offset. For example, if a segment starts at offset 90 and a message has an absolute offset of 108, the index file would store 18 (108 - 90) as the offset value. When the broker performs a lookup, it adds this relative offset to the segment's base offset to get the actual absolute offset. However, when you use tools like kafka-run-class.bat to inspect an index file, the tool performs this calculation in the backend and displays the absolute offsets for readability.

The Role of `.timeindex` Files (Time Index)#

Purpose and Structure: The .timeindex files complement the .index files by allowing Kafka to efficiently locate messages based on their timestamp. This is particularly useful for business requirements where consumers need messages published after a certain point in time.

A .timeindex file contains a mapping of timestamp to offset.

Example Content of a .timeindex file:

timestamp: 1678886400000 offset: 925
timestamp: 1678886401000 offset: 1587

How it Works

When a consumer requests messages published after a specific timestamp.
The broker consults the .timeindex file.
Similar to the offset index, timestamps in the .timeindex are sorted, allowing for a binary search to quickly find the approximate offset corresponding to the requested timestamp.
Once an approximate offset is found from the .timeindex (e.g., offset 925 for a timestamp).
The broker then uses this offset to perform a lookup in the .index file (offset index).
The .index file provides the exact byte position in the log file where that offset begins.
Finally, the broker starts scanning the actual log file from that byte position, checking the message timestamps (which are part of the message payload) to ensure they meet the time requirement.

Index & Timeindex

The Challenge of Large Log Files#

The Role of .index Files (Offset Index)#

The Role of .timeindex Files (Time Index)#

The Role of `.index` Files (Offset Index)#

The Role of `.timeindex` Files (Time Index)#