Paired Chunk
(com.betterdiff.api.pairing.PairedChunk)
PairedChunk is a Chunk with two added attributes:
- Identification - A number (for example, a hash) assigned to the original Chunk.
If different Chunks have the same identification, their content is considered
similar.
- Weight - The priority of a PairedChunk relative to other PairedChunks. A PairedChunk
with a greater weight has a higher priority in the Alignment phase.
Identification should be based on the Chunk's content, but how it is computed is left to the implementation.
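The two attributes can be pictured with a minimal sketch. The types below are hypothetical stand-ins, not the actual com.betterdiff.api classes: the record names, the identify method (a hash of the normalised content) and the use of the content length as weight are all assumptions made only for illustration.

```java
import java.util.Locale;

public class PairedChunkSketch {
    /** Hypothetical stand-in for a Chunk; the real type may carry more than text. */
    public record Chunk(String content) {}

    /** Hypothetical stand-in for PairedChunk: a Chunk plus identification and weight. */
    public record PairedChunk(Chunk originalChunk, int identification, int weight) {}

    /** Assumed identification: hash of the content after normalising case and whitespace. */
    public static int identify(Chunk chunk) {
        String normalised = chunk.content().trim()
                .replaceAll("\\s+", " ")
                .toLowerCase(Locale.ROOT);
        return normalised.hashCode();
    }

    /** Two PairedChunks are treated as similar when their identifications match. */
    public static boolean similar(PairedChunk a, PairedChunk b) {
        return a.identification() == b.identification();
    }

    /** Positive when a has the higher priority for Alignment (greater weight wins). */
    public static int comparePriority(PairedChunk a, PairedChunk b) {
        return Integer.compare(a.weight(), b.weight());
    }

    public static void main(String[] args) {
        Chunk first = new Chunk("This is a dog.");
        Chunk second = new Chunk("this  is a DOG.");
        // Weight is assumed to be the content length here; any priority measure would do.
        PairedChunk p1 = new PairedChunk(first, identify(first), first.content().length());
        PairedChunk p2 = new PairedChunk(second, identify(second), second.content().length());
        System.out.println("similar: " + similar(p1, p2));
        System.out.println("priority comparison: " + comparePriority(p1, p2));
    }
}
```

A plain hash only groups Chunks whose normalised content is identical. An implementation that should also group near-duplicates would need a similarity-aware identification instead.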
Pairing
(com.betterdiff.api.pairing.Pairing)
Pairing is one of the core Phases. During this phase, the Chunks identified in the Preparation phase
are given an identification number that marks their similarity and a weight that marks their priority for the Alignment phase.
The algorithm used to assign identification numbers and weights to Chunks
is left to the specific implementation of the Pairing processor. Generally, however,
this phase consists of the following steps:
- Definition of the string similarity algorithm - This step decides which algorithm
is used to generate identification numbers from the similarity of the Chunks'
content.
- Definition of the dimensionality reduction algorithm - Chunks are stored as
a 2-dimensional array regardless of the number of Witnesses, and so are PairedChunks.
The first step, however, generally produces an n-dimensional array, where
n is the number of Witnesses. A dimensionality reduction algorithm should be used
to transform that result into a 2-dimensional array.
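For two Witnesses, the reduction step can be sketched as follows. This is only an illustration under stated assumptions: the similarity step is assumed to yield a matrix sim[i][j] of scores between chunk i of the first Witness and chunk j of the second, the 2-dimensional result is assumed to be indexed as [witness][chunk position], and the reduce method, the threshold and the greedy best-match rule are hypothetical choices rather than anything prescribed by the Pairing API.

```java
import java.util.Arrays;

public class ReductionSketch {
    /**
     * Collapses a 2-Witness similarity matrix into identification numbers.
     * Assumes a non-empty, rectangular matrix.
     * Returns ids[0] for the first Witness and ids[1] for the second.
     */
    public static int[][] reduce(double[][] sim, double threshold) {
        int rows = sim.length;
        int cols = sim[0].length;
        int[][] ids = new int[2][];
        ids[0] = new int[rows];
        ids[1] = new int[cols];

        // Every chunk of the first Witness starts with its own identification.
        int nextId = 1;
        for (int i = 0; i < rows; i++) {
            ids[0][i] = nextId++;
        }

        // Each chunk of the second Witness reuses the identification of its best
        // match above the threshold, or receives a fresh identification.
        for (int j = 0; j < cols; j++) {
            int best = -1;
            for (int i = 0; i < rows; i++) {
                if (sim[i][j] >= threshold && (best < 0 || sim[i][j] > sim[best][j])) {
                    best = i;
                }
            }
            ids[1][j] = best >= 0 ? ids[0][best] : nextId++;
        }
        return ids;
    }

    public static void main(String[] args) {
        double[][] sim = {
            {0.9, 0.1},
            {0.2, 0.8}
        };
        System.out.println(Arrays.deepToString(reduce(sim, 0.5)));
    }
}
```

With more than two Witnesses the input would have one axis per Witness, and the same idea (reuse the identification of the best match, otherwise allocate a new one) would have to be applied across all of them.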
Example:
The Preparation phase identified the following Chunks:
- Chunk 1: This is a long sentence.
- Chunk 2: This is a dog.
- Chunk 3: This sentence is long.
- Chunk 4: This is a cat.
Based on the pairing implementation, the following similar Chunks were identified:
- Chunk 1 and Chunk 3
- Chunk 2 and Chunk 4
Each Chunk was assigned a weight based on its length in characters (spaces and
punctuation included). Therefore, the following PairedChunks were created:
- PairedChunk 1: originalChunk = Chunk 1, id = 1, weight = 24
- PairedChunk 2: originalChunk = Chunk 2, id = 2, weight = 14
- PairedChunk 3: originalChunk = Chunk 3, id = 1, weight = 22
- PairedChunk 4: originalChunk = Chunk 4, id = 2, weight = 14
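The example above can be reproduced with a short sketch. The pairing implementation in the example is not specified, so the code substitutes an assumed one: word-set Jaccard similarity with a threshold of 0.55, where each Chunk takes the id of the first earlier Chunk it is similar to. The PairedChunk record and the pair method are hypothetical names, not the library's API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class PairingExample {
    /** Hypothetical stand-in for PairedChunk; a record cannot be modified once created. */
    public record PairedChunk(String originalChunk, int id, int weight) {}

    /** Lower-cased set of words in the text, ignoring punctuation. */
    static Set<String> words(String text) {
        Set<String> result = new HashSet<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!word.isEmpty()) {
                result.add(word);
            }
        }
        return result;
    }

    /** Jaccard similarity: shared words divided by all distinct words. */
    public static double jaccard(String a, String b) {
        Set<String> common = new HashSet<>(words(a));
        common.retainAll(words(b));
        Set<String> all = new HashSet<>(words(a));
        all.addAll(words(b));
        return all.isEmpty() ? 1.0 : (double) common.size() / all.size();
    }

    /** Assigns ids by similarity to earlier chunks and weights by character length. */
    public static List<PairedChunk> pair(List<String> chunks, double threshold) {
        List<PairedChunk> paired = new ArrayList<>();
        int nextId = 1;
        for (String chunk : chunks) {
            int id = -1;
            for (PairedChunk earlier : paired) {
                if (jaccard(earlier.originalChunk(), chunk) >= threshold) {
                    id = earlier.id();
                    break;
                }
            }
            if (id < 0) {
                id = nextId++;
            }
            paired.add(new PairedChunk(chunk, id, chunk.length()));
        }
        // Unmodifiable copy, so the result stays fixed once the phase is done.
        return List.copyOf(paired);
    }

    public static void main(String[] args) {
        List<String> chunks = List.of(
                "This is a long sentence.",
                "This is a dog.",
                "This sentence is long.",
                "This is a cat.");
        pair(chunks, 0.55).forEach(System.out::println);
    }
}
```

Under these assumptions the sketch yields the ids 1, 2, 1, 2 and the weights 24, 14, 22, 14, matching the PairedChunks listed above. A different similarity measure or threshold could group the same Chunks differently.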
PairedChunks created during this phase should not be modified after the phase has finished.