Preparation

(com.betterdiff.api.preparation.Preparation)

Preparation is one of the core phases. During this phase, the content of Witnesses in an Evidence is analyzed and split into Chunks.

While the algorithm used for creating Chunks is left to the specific implementation of PreparationProcessor, this phase generally consists of the following steps:

  • Definition of Chunk - Decide what is considered the smallest unit across all Witnesses.
  • Splitting Witnesses into Chunks - Analyze the content of all Witnesses and split it into Chunks according to that definition.

Example 1:

The Witness (ordinalNumber = 1 in the Evidence) contains a sentence

This is a dog

The smallest unit of a Witness was defined as a word, ignoring the whitespace characters between words. Therefore the Chunks for such a Witness should be as follows (indices are 1-based and inclusive):

  • Chunk (1): ordinalNumber = 1, startIndex = 1, endIndex = 4, yAxis = 1
  • Chunk (2): ordinalNumber = 1, startIndex = 6, endIndex = 7, yAxis = 2
  • Chunk (3): ordinalNumber = 1, startIndex = 9, endIndex = 9, yAxis = 3
  • Chunk (4): ordinalNumber = 1, startIndex = 11, endIndex = 13, yAxis = 4
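The word-only definition can be sketched with a regular expression. This is a minimal illustration, not the BetterDiff implementation: the class WordChunkSketch, the Chunk record and the split method are hypothetical names, and the record only mirrors the four fields listed in the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; none of these types belong to the BetterDiff API.
final class WordChunkSketch {

    // Illustrative stand-in for a Chunk, limited to the fields shown in the example.
    record Chunk(int ordinalNumber, int startIndex, int endIndex, int yAxis) {}

    // A word is any uninterrupted run of non-whitespace characters.
    private static final Pattern WORD = Pattern.compile("\\S+");

    static List<Chunk> split(int ordinalNumber, String content) {
        List<Chunk> chunks = new ArrayList<>();
        Matcher matcher = WORD.matcher(content);
        int yAxis = 1;
        while (matcher.find()) {
            // Matcher offsets are 0-based with an exclusive end. The example uses
            // 1-based inclusive indices, so the start shifts by one and the
            // exclusive end can be used unchanged.
            chunks.add(new Chunk(ordinalNumber, matcher.start() + 1, matcher.end(), yAxis++));
        }
        return chunks;
    }

    public static void main(String[] args) {
        split(1, "This is a dog").forEach(System.out::println);
    }
}
```

For the Witness above, the sketch yields four Chunks with the same indices as the list in this example.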

Example 2:

The Witness (ordinalNumber = 1 in the Evidence) contains a sentence

This is a dog

The smallest unit of a Witness was defined as a word together with any trailing whitespace characters. Therefore the Chunks for such a Witness should be as follows:

  • Chunk (1): ordinalNumber = 1, startIndex = 1, endIndex = 5, yAxis = 1
  • Chunk (2): ordinalNumber = 1, startIndex = 6, endIndex = 8, yAxis = 2
  • Chunk (3): ordinalNumber = 1, startIndex = 9, endIndex = 10, yAxis = 3
  • Chunk (4): ordinalNumber = 1, startIndex = 11, endIndex = 13, yAxis = 4
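Because only the definition of a Chunk changed, the two steps of the phase can be kept apart by passing the definition in as a parameter. The sketch below assumes the definition can be expressed as a regular expression, which will not hold for every PreparationProcessor; PatternChunkSketch, Chunk and split are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; none of these types belong to the BetterDiff API.
final class PatternChunkSketch {

    // Illustrative stand-in for a Chunk, limited to the fields shown in the example.
    record Chunk(int ordinalNumber, int startIndex, int endIndex, int yAxis) {}

    // Step 1 (definition of Chunk) is the pattern argument;
    // step 2 (splitting) is the loop over its matches.
    static List<Chunk> split(int ordinalNumber, String content, Pattern chunkDefinition) {
        List<Chunk> chunks = new ArrayList<>();
        Matcher matcher = chunkDefinition.matcher(content);
        while (matcher.find()) {
            if (matcher.end() == matcher.start()) {
                continue; // ignore empty matches, they cannot form a Chunk
            }
            // Convert 0-based, end-exclusive offsets to 1-based inclusive indices.
            chunks.add(new Chunk(ordinalNumber, matcher.start() + 1, matcher.end(), chunks.size() + 1));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // A word followed by its trailing whitespace, if there is any.
        Pattern wordWithTrailingWhitespace = Pattern.compile("\\S+\\s*");
        split(1, "This is a dog", wordWithTrailingWhitespace).forEach(System.out::println);
    }
}
```

Note that with this particular pattern, whitespace at the very start of a Witness would not belong to any Chunk; whether that is acceptable depends on the chosen definition.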

Example 3:

The Witness (ordinalNumber = 1 in the Evidence) contains a sentence

This is a dog

The smallest unit of a Witness was defined as a word, and any uninterrupted sequence of whitespace characters is considered a separate Chunk. Therefore the Chunks for such a Witness should be as follows:

  • Chunk (1): ordinalNumber = 1, startIndex = 1, endIndex = 4, yAxis = 1
  • Chunk (2): ordinalNumber = 1, startIndex = 5, endIndex = 5, yAxis = 2
  • Chunk (3): ordinalNumber = 1, startIndex = 6, endIndex = 7, yAxis = 3
  • Chunk (4): ordinalNumber = 1, startIndex = 8, endIndex = 8, yAxis = 4
  • Chunk (5): ordinalNumber = 1, startIndex = 9, endIndex = 9, yAxis = 5
  • Chunk (6): ordinalNumber = 1, startIndex = 10, endIndex = 10, yAxis = 6
  • Chunk (7): ordinalNumber = 1, startIndex = 11, endIndex = 13, yAxis = 7
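When every character has to belong to exactly one Chunk, a single pass that closes a Chunk whenever the character class changes is enough. The following is a sketch under that assumption; RunChunkSketch, Chunk and split are hypothetical names, and Character.isWhitespace is used as one possible notion of a whitespace character.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these types belong to the BetterDiff API.
final class RunChunkSketch {

    // Illustrative stand-in for a Chunk, limited to the fields shown in the example.
    record Chunk(int ordinalNumber, int startIndex, int endIndex, int yAxis) {}

    static List<Chunk> split(int ordinalNumber, String content) {
        List<Chunk> chunks = new ArrayList<>();
        int runStart = 0; // 0-based offset where the current run began
        for (int i = 1; i <= content.length(); i++) {
            boolean atEnd = i == content.length();
            // A run ends at the end of the content or where whitespace and
            // non-whitespace characters meet.
            if (atEnd || Character.isWhitespace(content.charAt(i))
                    != Character.isWhitespace(content.charAt(i - 1))) {
                // 1-based inclusive indices: first offset + 1, and i as the inclusive end.
                chunks.add(new Chunk(ordinalNumber, runStart + 1, i, chunks.size() + 1));
                runStart = i;
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        split(1, "This is a dog").forEach(System.out::println);
    }
}
```

For the Witness above this produces seven Chunks, alternating between words and single-space runs, with yAxis running from 1 to 7.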

Example 4:

The Witness (ordinalNumber = 1 in the Evidence) contains a sentence

There   are   many   spaces

The smallest unit of a Witness was defined as a word together with any leading and trailing whitespace characters. Therefore the Chunks for such a Witness should be as follows:

  • Chunk (1): ordinalNumber = 1, startIndex = 1, endIndex = 8, yAxis = 1
  • Chunk (2): ordinalNumber = 1, startIndex = 6, endIndex = 14, yAxis = 2
  • Chunk (3): ordinalNumber = 1, startIndex = 12, endIndex = 21, yAxis = 3
  • Chunk (4): ordinalNumber = 1, startIndex = 19, endIndex = 27, yAxis = 4

In this case, neighbouring Chunks overlap: each run of whitespace between two words belongs to both the preceding and the following Chunk. While this is allowed in a general BetterDiff process, it should be addressed properly when constructing an Output.
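A sketch of how such Chunks could be produced, and how code that constructs an Output might detect the overlap, is given below. It is only an illustration: OverlapChunkSketch, Chunk, split and overlaps are hypothetical names, and the overlap test assumes 1-based inclusive indices as used in the examples.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; none of these types belong to the BetterDiff API.
final class OverlapChunkSketch {

    // Illustrative stand-in for a Chunk, limited to the fields shown in the example.
    record Chunk(int ordinalNumber, int startIndex, int endIndex, int yAxis) {}

    private static final Pattern WORD = Pattern.compile("\\S+");

    // Each word is widened to include the whitespace on both of its sides.
    static List<Chunk> split(int ordinalNumber, String content) {
        List<Chunk> chunks = new ArrayList<>();
        Matcher matcher = WORD.matcher(content);
        while (matcher.find()) {
            int start = matcher.start(); // 0-based, inclusive
            int end = matcher.end();     // 0-based, exclusive
            while (start > 0 && Character.isWhitespace(content.charAt(start - 1))) {
                start--;
            }
            while (end < content.length() && Character.isWhitespace(content.charAt(end))) {
                end++;
            }
            chunks.add(new Chunk(ordinalNumber, start + 1, end, chunks.size() + 1));
        }
        return chunks;
    }

    // Two Chunks of the same Witness overlap when their inclusive ranges intersect.
    static boolean overlaps(Chunk a, Chunk b) {
        return a.ordinalNumber() == b.ordinalNumber()
                && a.startIndex() <= b.endIndex()
                && b.startIndex() <= a.endIndex();
    }

    public static void main(String[] args) {
        String gap = " ".repeat(3); // three spaces between the words, as in the example
        String content = "There" + gap + "are" + gap + "many" + gap + "spaces";
        List<Chunk> chunks = split(1, content);
        chunks.forEach(System.out::println);
        for (int i = 1; i < chunks.size(); i++) {
            System.out.println("Chunk " + i + " overlaps Chunk " + (i + 1) + ": "
                    + overlaps(chunks.get(i - 1), chunks.get(i)));
        }
    }
}
```

How an overlap is resolved, for example by attributing the shared whitespace to only one of the Chunks when rendering, is a decision for the Output construction and is not prescribed here.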

Chunks created during this phase should not be modified once the phase has finished.
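One way an implementation might enforce this rule is to hand later phases an unmodifiable snapshot of immutable Chunk values. The sketch below assumes Chunks can be modelled as Java records; FrozenChunksSketch, Chunk and PreparedChunks are hypothetical names and not part of the BetterDiff API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these types belong to the BetterDiff API.
final class FrozenChunksSketch {

    // Records have no setters, so a Chunk cannot change after it is created.
    record Chunk(int ordinalNumber, int startIndex, int endIndex, int yAxis) {}

    // Hypothetical holder for the result of the Preparation phase.
    record PreparedChunks(List<Chunk> chunks) {
        PreparedChunks {
            // List.copyOf returns an unmodifiable copy, so later changes to the
            // processor's working list do not leak into the result.
            chunks = List.copyOf(chunks);
        }
    }

    public static void main(String[] args) {
        List<Chunk> working = new ArrayList<>();
        working.add(new Chunk(1, 1, 4, 1));
        PreparedChunks prepared = new PreparedChunks(working);

        working.add(new Chunk(1, 6, 7, 2)); // does not affect the snapshot
        System.out.println(prepared.chunks().size()); // prints 1

        try {
            prepared.chunks().add(new Chunk(1, 9, 9, 3));
        } catch (UnsupportedOperationException e) {
            System.out.println("Chunks are read-only after Preparation");
        }
    }
}
```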