Content-Defined Chunking (CDC)
This article was created with the assistance of AI.
Why Use Chunking?
Chunking isolates changes: when a file is modified, only the chunks that actually changed need to be stored or transferred again, rather than the whole file.
How to Chunk?
Fixed-Length Chunking
In fixed-length chunking, a file is split into chunks of a fixed size. For example, suppose the file content is `abcdefg`. Dividing it into four-byte chunks gives `abcd|efg`. If a character is inserted at the beginning, making the content `0abcdefg`, the chunks become `0abc|defg`. Both chunks now differ completely from the previous ones, so when syncing the modification to a network file system, both chunks must be re-uploaded.
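To make the example concrete, here is a minimal sketch of fixed-length chunking (the four-byte chunk size and the sample strings come from the example above; `fixed_chunks` is just an illustrative helper, not part of any particular library):

```cpp
#include <iostream>
#include <string>
#include <vector>

// Split the data into chunks of a fixed size (4 bytes in the example above).
std::vector<std::string> fixed_chunks(const std::string& data, size_t size) {
    std::vector<std::string> chunks;
    for (size_t i = 0; i < data.size(); i += size)
        chunks.push_back(data.substr(i, size));
    return chunks;
}

int main() {
    for (const auto& c : fixed_chunks("abcdefg", 4))  std::cout << c << '|';
    std::cout << '\n';                                 // abcd|efg|
    for (const auto& c : fixed_chunks("0abcdefg", 4)) std::cout << c << '|';
    std::cout << '\n';                                 // 0abc|defg|  -- every chunk shifted
}
```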
Content-Defined Variable-Length Chunking
If chunk boundaries are instead defined by the content itself, say by using `d` as a breakpoint, the modified file is split as `0abcd|efg`. Only the first chunk differs from the previous set (`abcd|efg`), which makes syncing far more efficient than with fixed-length chunking.
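A matching sketch of breakpoint-based chunking, using the literal character `d` as the breakpoint purely to mirror the example (a real CDC system derives breakpoints from a hash of the content, as discussed below):

```cpp
#include <iostream>
#include <string>
#include <vector>

// Cut a chunk after every occurrence of the breakpoint character.
std::vector<std::string> content_chunks(const std::string& data, char breakpoint) {
    std::vector<std::string> chunks;
    std::string current;
    for (char ch : data) {
        current += ch;
        if (ch == breakpoint) {
            chunks.push_back(current);
            current.clear();
        }
    }
    if (!current.empty())
        chunks.push_back(current);
    return chunks;
}

int main() {
    for (const auto& c : content_chunks("abcdefg", 'd'))  std::cout << c << '|';
    std::cout << '\n';                                     // abcd|efg|
    for (const auto& c : content_chunks("0abcdefg", 'd')) std::cout << c << '|';
    std::cout << '\n';                                     // 0abcd|efg|  -- only the first chunk changed
}
```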
Problem
Although unlikely for typical data, degenerate inputs can produce many very short chunks. For instance, if the file content is `dd...d`, it is split into `d|d|...|d`, with each chunk holding a single character. That explodes the number of chunks, wastes space on per-chunk overhead, and defeats the original intent of chunking, so we clearly cannot always use the same fixed content as the breakpoint.
To address this, we need a way to select breakpoints that behaves randomly with respect to the content, so that chunks have a controllable average size while breakpoints still depend only on the local data.
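A common way to do this (a sketch of the general technique, not this article's specific implementation) is to hash a small sliding window at every byte position and declare a breakpoint whenever the low bits of the hash equal a fixed value; with a k-bit mask, breakpoints occur on average once every 2^k bytes:

```cpp
#include <cstdint>

// Declare a breakpoint when the low 13 bits of the window hash match a fixed
// value, giving an expected chunk length of about 2^13 = 8 KiB.
// (The mask width and target value here are illustrative choices.)
constexpr uint64_t kMask   = (1u << 13) - 1;
constexpr uint64_t kTarget = 0;

bool is_breakpoint(uint64_t window_hash) {
    return (window_hash & kMask) == kTarget;
}
```

Practical implementations also enforce minimum and maximum chunk sizes to bound the degenerate cases above. The rest of this article focuses on the hash that feeds this decision.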
Hashing
Hashing maps an input of any length to a fixed-length output with the following properties:
- Fast in the forward direction
- Difficult to reverse
- Sensitive to input changes
- Avoids collisions
- Supports rolling computation: a rolling hash can be updated incrementally as a window slides over the input
Objective: Optimizing String Matching with Hashing
To match strings by hash value: given a pattern string of length $m$, take each length-$m$ substring of the text, compute its hash, and compare it with the pattern's hash. With high probability, equal hashes mean the substring actually equals the pattern. However, recomputing every substring's hash from scratch is inefficient ($O(nm)$ for a text of length $n$), so an optimization is needed.
Using a rolling hash, a sliding window of length $m$ obtains each substring's hash from the previous one by removing the contribution of the character leaving the window and adding the contribution of the character entering it, which significantly reduces the computational complexity.
Rolling Hash Formula
Let the substring in the window before sliding be $s_i s_{i+1} \cdots s_{i+m-1}$. With a base $b$ and a modulus $p$ (a prime number, or an irreducible polynomial in the Rabin fingerprint below), the hash is:

$$H_i = \left(s_i\, b^{m-1} + s_{i+1}\, b^{m-2} + \cdots + s_{i+m-1}\right) \bmod p$$

After sliding, the hash for $s_{i+1} s_{i+2} \cdots s_{i+m}$ is:

$$H_{i+1} = \left(s_{i+1}\, b^{m-1} + s_{i+2}\, b^{m-2} + \cdots + s_{i+m}\right) \bmod p$$

Thus, the recurrence relation is:

$$H_{i+1} = \left(\left(H_i - s_i\, b^{m-1}\right) \cdot b + s_{i+m}\right) \bmod p$$

This yields each subsequent hash in $O(1)$, so hashing all length-$m$ substrings of a length-$n$ text costs $O(n)$ rather than $O(nm)$.
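As a quick numeric check with illustrative parameters $b = 10$ and $p = 13$ (treating decimal digits as characters): in the text `31415`, the first window `3141` hashes to $3141 \bmod 13 = 8$. Sliding to `1415`, the recurrence gives $\big((8 - 3 \cdot 10^{3}) \cdot 10 + 5\big) \bmod 13 = 11$, which matches computing $1415 \bmod 13 = 11$ directly.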
Rabin-Karp Algorithm
The Rabin-Karp algorithm implements string matching using a rolling hash. It relies on an efficient hash function, classically the Rabin fingerprint.
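A minimal Rabin-Karp sketch built on the recurrence above (the base $b = 256$ and modulus $p = 10^9 + 7$ are illustrative choices; each hash match is verified with a direct comparison to rule out collisions):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Return the start indices of all occurrences of pattern in text.
std::vector<size_t> rabin_karp(const std::string& text, const std::string& pattern) {
    const uint64_t b = 256, p = 1000000007ULL;
    const size_t n = text.size(), m = pattern.size();
    std::vector<size_t> matches;
    if (m == 0 || n < m) return matches;

    uint64_t hp = 0, ht = 0, bm1 = 1;                       // bm1 = b^(m-1) mod p
    for (size_t i = 0; i + 1 < m; ++i) bm1 = bm1 * b % p;
    for (size_t i = 0; i < m; ++i) {
        hp = (hp * b + (unsigned char)pattern[i]) % p;      // hash of the pattern
        ht = (ht * b + (unsigned char)text[i]) % p;         // hash of the first window
    }
    for (size_t i = 0; ; ++i) {
        // Equal hashes almost certainly mean a match; verify to be safe.
        if (ht == hp && text.compare(i, m, pattern) == 0) matches.push_back(i);
        if (i + m >= n) break;
        ht = (ht + p - (unsigned char)text[i] * bm1 % p) % p;  // remove text[i]
        ht = (ht * b + (unsigned char)text[i + m]) % p;        // add text[i + m]
    }
    return matches;
}
```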
Rabin Fingerprint
The Rabin fingerprint is a polynomial hash over the finite field $GF(2)$: a bit string $b_1 b_2 \cdots b_n$ is read as the polynomial $f(x) = b_1 x^{n-1} + b_2 x^{n-2} + \cdots + b_n$ with coefficients in $GF(2)$; for example, $x^3 + x + 1$ is represented as `1011` in binary. The fingerprint of the data is $f(x) \bmod P(x)$ for a fixed irreducible polynomial $P(x)$.
Addition and subtraction in $GF(2)$ are both XOR operations, which simplifies computation by avoiding any carry handling. Multiplication and division, however, require $O(k)$ shift-and-XOR steps (where $k$ is the polynomial's degree).
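For example, adding $x^3 + x + 1$ and $x^2 + x$ over $GF(2)$ gives $x^3 + x^2 + 1$, i.e., `1011 ^ 0110 = 1101`; the two $x$ terms cancel because $1 + 1 = 0$ in $GF(2)$.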
Implementation Example
For a polynomial of degree 63:
uint64_t poly = 0xbfe6b8a5bf378d83LL;
Appending one byte $m$ to the data updates the fingerprint with the recurrence relation:

$$f_{\text{new}}(x) = \big(f_{\text{old}}(x) \cdot x^{8} + m(x)\big) \bmod P(x)$$

Key optimizations include (a sketch follows the list):
- Multiplication by $x^{8}$: this is just a left shift by 8 bits; the terms that overflow past the degree of $P(x)$ are handled by precomputed reductions of $x^{\deg P}, x^{\deg P + 1}, \ldots$ modulo $P(x)$.
- Efficient modulo operations: precompute a 256-entry table, indexed by the byte shifted out of the top, of its value reduced modulo $P(x)$, so each reduction becomes a single table lookup and an XOR.
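Here is an un-optimized sketch of that append step, written with an explicit bit-by-bit reduction so the recurrence stays visible; the table-driven version described above replaces the loop with one 256-entry lookup. (`append_byte` and `mulx` are illustrative names, and the code assumes the fingerprint passed in is already reduced, i.e., has degree at most 62.)

```cpp
#include <cstdint>

// The degree-63 polynomial P(x) from the text (bit 63 = x^63, ..., bit 0 = x^0).
const uint64_t poly = 0xbfe6b8a5bf378d83ULL;

// f <- (f * x) mod P(x): shift left one bit, subtract (XOR) P if the degree hits 63.
static uint64_t mulx(uint64_t f) {
    f <<= 1;
    if (f >> 63)
        f ^= poly;
    return f;
}

// f <- (f * x^8 + m) mod P(x): multiply by x eight times, then add the new byte.
uint64_t append_byte(uint64_t f, uint8_t m) {
    for (int i = 0; i < 8; ++i)
        f = mulx(f);
    return f ^ m;   // m has degree < 8, so no further reduction is needed
}
```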
Code for Finding Polynomial Degree
Below is a C++ implementation that finds the degree of a polynomial, which is equivalent to finding the position of the most significant set bit of its binary representation:
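// byteMSB is assumed to be a precomputed 256-entry lookup table, defined elsewhere
// in the class, that gives the position of the most significant set bit of a byte.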
uint32_t RabinChecksum::find_last_set(uint32_t value) {
    if (value & 0xffff0000) {
        if (value & 0xff000000)
            return 24 + byteMSB[value >> 24];
        else
            return 16 + byteMSB[value >> 16];
    } else {
        if (value & 0x0000ff00)
            return 8 + byteMSB[value >> 8];
        else
            return byteMSB[value];
    }
}

uint32_t RabinChecksum::find_last_set(uint64_t v) {
    uint32_t h = v >> 32;
    if (h)
        return 32 + find_last_set(h);
    else
        return find_last_set((uint32_t)v);
}
Using find_last_set to obtain polynomial degrees, polynomial multiplication (and the modular reduction it needs) can be implemented efficiently.
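As a sketch of how this fits together (polymod and polymmult are written here as additional RabinChecksum methods for illustration; they are not part of the code above, and find_last_set is assumed to return a 1-based bit position, so find_last_set(v) - 1 is the degree of v):

```cpp
// Remainder of the polynomial value modulo the polynomial d over GF(2).
uint64_t RabinChecksum::polymod(uint64_t value, uint64_t d) {
    int k = (int)find_last_set(d) - 1;               // degree of the divisor
    while (value != 0) {
        int deg = (int)find_last_set(value) - 1;     // degree of the current remainder
        if (deg < k)
            break;
        value ^= d << (deg - k);                     // cancel the remainder's leading term
    }
    return value;
}

// (x * y) mod d over GF(2), built from shift-and-XOR steps.
uint64_t RabinChecksum::polymmult(uint64_t x, uint64_t y, uint64_t d) {
    uint64_t result = 0;
    x = polymod(x, d);                               // keep x reduced so x << 1 never overflows
    while (y) {
        if (y & 1)
            result ^= x;                             // add the current multiple of x
        y >>= 1;
        x = polymod(x << 1, d);                      // multiply x by the indeterminate and reduce
    }
    return result;
}
```

In table-driven implementations, helpers like these are typically run once at start-up to fill the 256-entry reduction table used by the per-byte update.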