Paired Chunk

(com.betterdiff.api.pairing.PairedChunk)

A PairedChunk is a Chunk with two additional attributes:

  • Identification - A number (typically a hash) derived from the original Chunk. Chunks that share the same identification are considered to have similar content.
  • Weight - The priority of this PairedChunk relative to other PairedChunks. A PairedChunk with a greater weight has a higher priority in the Alignment phase.

The identification should be based on the Chunk's content, but how it is computed is left to the implementation.
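To make the two attributes concrete, here is a minimal sketch of such a data structure. It is not the library's actual class: the names PairedChunkSketch, Chunk, PairedChunk, isSimilarTo and hasPriorityOver are hypothetical, integer types are assumed for both attributes, and records require Java 16 or later.

```java
import java.util.Objects;

// Hypothetical sketch, not the real com.betterdiff.api.pairing.PairedChunk.
final class PairedChunkSketch {

    /** Stand-in for the library's Chunk type; only the text content is modeled. */
    record Chunk(String content) {
        Chunk {
            Objects.requireNonNull(content, "content");
        }
    }

    /** Immutable pairing result: the original chunk plus its identification and weight. */
    record PairedChunk(Chunk originalChunk, int identification, int weight) {
        PairedChunk {
            Objects.requireNonNull(originalChunk, "originalChunk");
        }

        /** Chunks with the same identification are considered similar. */
        boolean isSimilarTo(PairedChunk other) {
            return identification == other.identification;
        }

        /** A greater weight means a higher priority in the Alignment phase. */
        boolean hasPriorityOver(PairedChunk other) {
            return weight > other.weight;
        }
    }

    public static void main(String[] args) {
        PairedChunk a = new PairedChunk(new Chunk("This is a long sentence."), 1, 24);
        PairedChunk b = new PairedChunk(new Chunk("This sentence is long."), 1, 22);
        System.out.println(a.isSimilarTo(b));      // same identification
        System.out.println(a.hasPriorityOver(b));  // greater weight
    }
}
```

Using a record keeps the sketch immutable, which fits a value that is produced once by one phase and only read by later phases.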

Pairing

(com.betterdiff.api.pairing.Pairing)

Pairing is one of the core Phases. During this phase, each Chunk identified in the Preparation phase is given an identification number that marks its similarity to other Chunks and a weight that sets its priority for the Alignment phase.

The algorithm used to assign identification numbers and weights to Chunks is left to the specific implementation of the Pairing processor. In general, however, this phase consists of the following steps:

  • Definition of the string similarity algorithm - Decide which algorithm will be used to generate identification numbers based on the similarity of the Chunks' content.
  • Definition of the dimensionality reduction algorithm - PairedChunks are stored as a 2-dimensional array, just like Chunks, regardless of the number of Witnesses. The first step, however, generally produces an n-dimensional array, where n is the number of Witnesses. A dimensionality reduction algorithm should be used to transform that result into a 2-dimensional array.
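One possible reading of the two steps above is sketched below. It is an illustration under several assumptions, not the library's algorithm: chunks are modeled as plain strings in a String[witness][position] array, the similarity test is a caller-supplied predicate, and the reduction is done with a union-find structure that collapses all pairwise matches into a single identification per chunk. The names PairingStepsSketch, assignIds and find are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiPredicate;

// Hypothetical sketch: pairwise similarity followed by a reduction to one id per chunk.
final class PairingStepsSketch {

    /**
     * chunks[w][i] is the i-th chunk of witness w. The result has the same
     * 2-dimensional shape and holds one identification per chunk.
     */
    static int[][] assignIds(String[][] chunks, BiPredicate<String, String> similar) {
        // Flatten (witness, position) so chunks can be compared across all witnesses.
        int total = 0;
        for (String[] witness : chunks) {
            total += witness.length;
        }
        String[] flat = new String[total];
        int k = 0;
        for (String[] witness : chunks) {
            for (String chunk : witness) {
                flat[k++] = chunk;
            }
        }

        // Step 1: test every pair. Step 2: merge matching pairs into groups.
        int[] parent = new int[total];
        for (int i = 0; i < total; i++) {
            parent[i] = i;
        }
        for (int i = 0; i < total; i++) {
            for (int j = i + 1; j < total; j++) {
                if (similar.test(flat[i], flat[j])) {
                    parent[find(parent, j)] = find(parent, i);
                }
            }
        }

        // Number the groups 1, 2, 3, ... in order of first appearance.
        Map<Integer, Integer> idOfRoot = new HashMap<>();
        int[][] ids = new int[chunks.length][];
        k = 0;
        for (int w = 0; w < chunks.length; w++) {
            ids[w] = new int[chunks[w].length];
            for (int i = 0; i < chunks[w].length; i++) {
                int root = find(parent, k++);
                Integer id = idOfRoot.get(root);
                if (id == null) {
                    id = idOfRoot.size() + 1;
                    idOfRoot.put(root, id);
                }
                ids[w][i] = id;
            }
        }
        return ids;
    }

    /** Returns the group representative of x, shortening the path as it goes. */
    static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }
}
```

Note that merging groups makes similarity transitive: if A matches B and B matches C, all three receive the same identification even when A and C do not match directly. An implementation that needs stricter grouping would choose a different reduction.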

Example:

The Preparation phase identified the following Chunks:

  • Chunk 1: This is a long sentence.
  • Chunk 2: This is a dog.
  • Chunk 3: This sentence is long.
  • Chunk 4: This is a cat.

The Pairing implementation then identified the following groups of similar Chunks:

  • Chunk 1 and Chunk 3
  • Chunk 2 and Chunk 4

Each Chunk's weight was assigned based on its length in characters. Therefore, the following PairedChunks were created:

  • PairedChunk 1: originalChunk = Chunk 1, id = 1, weight = 24
  • PairedChunk 2: originalChunk = Chunk 2, id = 2, weight = 14
  • PairedChunk 3: originalChunk = Chunk 3, id = 1, weight = 22
  • PairedChunk 4: originalChunk = Chunk 4, id = 2, weight = 14
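The example above does not say which similarity algorithm was used. As a rough sketch, the same ids and weights can be reproduced with Jaccard similarity over word sets and a threshold of 0.55. Both of those choices, and the names PairingExampleSketch, words, jaccard and pair, are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch that reproduces the example's ids and weights.
final class PairingExampleSketch {

    /** Hypothetical result type; the field names follow the example. */
    record PairedChunk(String originalChunk, int id, int weight) {}

    /** Lower-cased set of words in the text, with punctuation removed. */
    static Set<String> words(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9 ]", "");
        return new HashSet<>(Arrays.asList(cleaned.trim().split("\\s+")));
    }

    /** Jaccard similarity: shared words divided by all distinct words. */
    static double jaccard(String a, String b) {
        Set<String> intersection = new HashSet<>(words(a));
        intersection.retainAll(words(b));
        Set<String> union = new HashSet<>(words(a));
        union.addAll(words(b));
        return (double) intersection.size() / union.size();
    }

    /** Gives each chunk the id of the first earlier group it resembles, else a new id. */
    static List<PairedChunk> pair(List<String> chunks, double threshold) {
        List<PairedChunk> result = new ArrayList<>();
        List<String> representatives = new ArrayList<>(); // first chunk seen for each id
        for (String chunk : chunks) {
            int id = 0;
            for (int r = 0; r < representatives.size() && id == 0; r++) {
                if (jaccard(chunk, representatives.get(r)) >= threshold) {
                    id = r + 1;
                }
            }
            if (id == 0) {
                representatives.add(chunk);
                id = representatives.size();
            }
            // Weight is the chunk length in characters, as in the example.
            result.add(new PairedChunk(chunk, id, chunk.length()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> chunks = List.of(
                "This is a long sentence.",
                "This is a dog.",
                "This sentence is long.",
                "This is a cat.");
        pair(chunks, 0.55).forEach(System.out::println);
    }
}
```

With these inputs, Chunk 1 and Chunk 3 share 4 of 5 distinct words (0.8) and Chunk 2 and Chunk 4 share 3 of 5 (0.6), while every other pair scores 0.5 or lower, so the threshold separates the two groups.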

PairedChunks created during this phase should not be modified after the phase has finished.