About the project

Better Diff is a modular, extendable, and scalable Java framework that finds differences between two or more text files and provides a list of modifications (additions, deletions, transpositions, partial / full mutations) for each pair of files, based on the full alignment. It naturally supports both baseless and hyparchetype textual criticism, as well as sequence alignment with an unlimited number of sub-alignment levels (verses, lines, words, letters, etc.).

The output (apart from the framework itself) takes the form of special commands, so it can easily be used by non-Java clients, which makes tasks such as assembling a critical edition easy to achieve.

It can also be used for sequence alignment of two or more nucleic acid or protein sequences. However, the alignment currently guarantees neither a local nor a global optimum.

				betterdiff ~ $ java -jar betterdiff-lyrics-client.jar -S -f file1.txt -f file2.txt -f file3.txt --verses-weight 80 --lines-weight 60
			  

Download: Libraries (Java 16+)

  • all

Update 2022-05-08:

  • com.betterdiff.core.* (2.0.0) (jar)
  • com.betterdiff.lyrics.* (2.0.0) (jar)
  • com.betterdiff.core.utils.* (2.0.0) (jar)

Documentation

Architecture

Core (com.betterdiff.core.*)
Utils (com.betterdiff.core.utils.*, com.betterdiff.<extension_name>.utils.*)
Extensions (com.betterdiff.<extension_name>.*)
Clients (com.betterdiff.<extension_name>.client.*)

The Core module provides the API for all phases, fields, steps, and other elements of the whole process. Both Extensions and Utils should always use this API so that they remain compatible with every other extension or utility package.

It also provides an extendable framework for the Preparation and Pairing phases, and a full implementation of the Alignment and Identification phases.

The Core module does not depend on any other module within this framework.

Extensions are modules that extend the Core module for a specific purpose. They may or may not depend on other extensions, and may or may not extend another extension.

They should always use the API from the Core module, even for classes that do not extend original classes from the Core module. Otherwise they won't be generally compatible with other extensions or with Utils modules.

Extensions shouldn't provide any client-specific code and shouldn't implement a main method, so they cannot be run on their own.

Utils are modules that operate on the output of the Core module - be it Chunks, PairedChunks, or AlignedChunks - or on the Protocol itself.

They should provide text-based operations on the original texts and return the result in machine-readable form.

Utils should not depend on Extensions or on any other Utils module, but they may provide context-related operations for specific Extensions. This way they stay compatible with every other Utils module and may be used by alien Extensions where applicable. They may, however, depend on the Core Utils module if needed.

Utils shouldn't provide any client-specific code and shouldn't implement a main method, so they cannot be run on their own.

Clients are front-end applications that provide functionality for users. They can also act as middlemen for non-Java applications that want to use this framework. They can be small front-end layers that provide easy access to Utils methods, or full-blown desktop applications with their own architecture, multi-level modularity, extensions, etc.

Clients are not expected to be compatible with one another, nor are they expected to be extendable or reusable.

Phases

Preparation -> Pairing -> Alignment -> Identification

The whole process consists of four phases. When files are compared, all of these phases should be processed in this order at least once. For some levels (see examples) some phases can be processed multiple times, but always in this order, because the output of Preparation is used by Pairing, the output of Pairing by Alignment, and the output of Alignment by Identification.

The Preparation phase is where chunks are identified. A chunk is a meaningful part of a text that will be aligned with other chunks. For example, in lyrics a chunk can be a whole verse, a line, or a word. For protein sequences, a chunk can be a single amino acid, a sub-sequence of amino acids, or any other part of the whole sequence that should be aligned with other sequences.

The Pairing phase is where chunks are compared with each other. Chunks that have the same or similar content should receive the same id. What counts as the same or similar is left to the implementation.
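
One possible reading of "the same or similar" is a longest common subsequence (LCS) ratio compared against a threshold. The sketch below is only an illustration of that idea. The class LcsSimilaritySketch, its method names, and the percentage formula are assumptions made here, and it is not the implementation of LCSPairing from the lyrics extension.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of similarity-based pairing; not the library's LCSPairing. */
final class LcsSimilaritySketch {

    /** Length of the longest common subsequence of a and b (classic dynamic programming). */
    static int lcsLength(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                dp[i][j] = a.charAt(i - 1) == b.charAt(j - 1)
                        ? dp[i - 1][j - 1] + 1
                        : Math.max(dp[i - 1][j], dp[i][j - 1]);
            }
        }
        return dp[a.length()][b.length()];
    }

    /** Similarity in percent, relative to the longer chunk (an assumed formula). */
    static int similarityPercent(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        return longer == 0 ? 100 : 100 * lcsLength(a, b) / longer;
    }

    /** Gives each chunk the id of the first earlier chunk it is similar to, or a new id. */
    static List<Integer> assignIds(List<String> chunks, int thresholdPercent) {
        List<Integer> ids = new ArrayList<>();
        int nextId = 1;
        for (int i = 0; i < chunks.size(); i++) {
            int id = 0; // 0 means "no similar earlier chunk found yet"
            for (int j = 0; j < i && id == 0; j++) {
                if (similarityPercent(chunks.get(i), chunks.get(j)) >= thresholdPercent) {
                    id = ids.get(j);
                }
            }
            ids.add(id == 0 ? nextId++ : id);
        }
        return ids;
    }

    public static void main(String[] args) {
        // "CBB" shares the subsequence "BB" with "BBB" (66 percent), so both get id 2.
        System.out.println(assignIds(List.of("AAA", "BBB", "FFF", "CBB"), 50)); // [1, 2, 3, 2]
    }
}
```

With a threshold of 50, "BBB" and "CBB" end up with the same id while "AAA" and "FFF" keep ids of their own. A stricter threshold would keep "CBB" separate.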

Alignment is the phase where chunks are moved to their final positions. The Core module uses the Falling algorithm (c) for this, but it can be extended or replaced by other algorithms that provide more accurate results (for example Smith-Waterman for nucleic acid sequences) or a different context-related alignment for alien Extensions.

Identification is the phase where mutations are identified. The choice of Alignment implementation shouldn't affect this phase, but different modes (for example hyparchetype textual criticism) can alter the output.

Elements

Elements are data structures that are calculated inside phases and used for communication between them.

Falling language (Protocol)

The Protocol is a sequence of commands that leads from the original texts to the final alignment, including the identification of mutations. It can be used to reproduce the result without recalculating all phases. It can also be applied to different text sources with the same structure to reproduce the desired result, so it can formally act as a template.

text <ordinal_number>
Request a text on the input.
<ordinal_number> - Ordinal number of the input text.

chunk <ordinal_number>,[<start_index>,<end_index>] -> [<x_axis>,<y_axis>]
Identify a chunk of text bounded by the start and end index in the given text and put the chunk in the alignment matrix.
<ordinal_number> - Ordinal number of the input text where the chunk has been identified.
<start_index> - Ordinal number of the character in the given text where the chunk starts. This character is included in the chunk.
<end_index> - Ordinal number of the character in the given text where the chunk ends. This character is included in the chunk.
<x_axis> - X position of the chunk in the alignment matrix.
<y_axis> - Y position of the chunk in the alignment matrix.

Notes.
Every whitespace character, including a line break (\n, \r\n, \r), is counted as 1 character.
Start index and end index are 1-based and both inclusive, unlike routines such as Java's String.substring, which is 0-based and excludes the end index.

Example:
This is a dog.
If we split the sentence into words we get 4 chunks:
chunk 1-4 (This)
chunk 6-7 (is)
chunk 9-9 (a)
chunk 11-13 (dog)
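
The index convention from the example can be reproduced with a few lines of Java. This is a minimal sketch: the class WordChunkIndexSketch and its methods are hypothetical names, and treating a word as a run of letters or digits is an assumption made for the illustration, not the rule used by any Preparation implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the 1-based, inclusive chunk indexes. */
final class WordChunkIndexSketch {

    /** Returns {start, end} pairs (1-based, both inclusive) for each run of letters or digits. */
    static List<int[]> wordChunks(String text) {
        List<int[]> chunks = new ArrayList<>();
        Matcher m = Pattern.compile("[\\p{L}\\p{N}]+").matcher(text);
        while (m.find()) {
            // Matcher.start() is 0-based and Matcher.end() is exclusive, so
            // start + 1 and end are the 1-based inclusive bounds.
            chunks.add(new int[] {m.start() + 1, m.end()});
        }
        return chunks;
    }

    /** Converts protocol indexes back to the chunk text using String.substring. */
    static String chunkText(String text, int startIndex, int endIndex) {
        return text.substring(startIndex - 1, endIndex);
    }

    public static void main(String[] args) {
        String text = "This is a dog.";
        for (int[] c : wordChunks(text)) {
            System.out.println("chunk " + c[0] + "-" + c[1] + " (" + chunkText(text, c[0], c[1]) + ")");
        }
    }
}
```

Running main prints the same four chunks as listed in the example. The chunkText helper shows the only conversion needed when working with substring: subtract 1 from the start index and pass the end index unchanged.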

match [<x_axis>,<y_axis>] -> <id>
Assign an ID to the specified position in the alignment matrix.
<x_axis> - X position of the chunk in the alignment matrix.
<y_axis> - Y position of the chunk in the alignment matrix.
<id> - Identification number of the chunk.

move <shift_size>,[<x_axis>,<y_axis>]
Move positions in the alignment matrix down by the given shift size. Only positions in the column specified by x_axis are moved, starting at the row specified by y_axis and continuing with every position below it.
<shift_size> - Total size of the performed shift.
<x_axis> - X position of the starting chunk in the alignment matrix.
<y_axis> - Y position of the starting chunk in the alignment matrix.

Example:
[Chunk 1] [Chunk 2]
[Chunk 3] [Chunk 4]
[Chunk 5] [Chunk 6]

Command: move 2,[1,2]

Result:
[Chunk 1] [Chunk 2]
[(empty)] [Chunk 4]
[(empty)] [Chunk 6]
[Chunk 3] [(empty)]
[Chunk 5] [(empty)]
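
The effect of move can be replayed on a small in-memory matrix. The sketch below assumes a matrix stored as a list of columns with null standing for an empty position. The class MoveCommandSketch and this representation are hypothetical and are not how the Core module stores its alignment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the move command; not the Core implementation. */
final class MoveCommandSketch {

    /**
     * Applies "move shift,[x,y]" to a matrix stored as a list of columns.
     * x and y are 1-based; null marks an empty position.
     */
    static void move(List<List<String>> columns, int shift, int x, int y) {
        List<String> column = columns.get(x - 1);
        for (int i = 0; i < shift; i++) {
            column.add(y - 1, null); // pushes position y and everything below it down by one
        }
        // Pad shorter columns with empty positions so the matrix stays rectangular.
        int height = columns.stream().mapToInt(List::size).max().orElse(0);
        for (List<String> c : columns) {
            while (c.size() < height) {
                c.add(null);
            }
        }
    }

    public static void main(String[] args) {
        List<List<String>> columns = new ArrayList<>();
        columns.add(new ArrayList<>(Arrays.asList("Chunk 1", "Chunk 3", "Chunk 5")));
        columns.add(new ArrayList<>(Arrays.asList("Chunk 2", "Chunk 4", "Chunk 6")));

        move(columns, 2, 1, 2); // the command from the example: move 2,[1,2]

        for (int row = 0; row < columns.get(0).size(); row++) {
            StringBuilder line = new StringBuilder();
            for (List<String> c : columns) {
                String chunk = c.get(row);
                line.append('[').append(chunk == null ? "(empty)" : chunk).append("] ");
            }
            System.out.println(line.toString().trim());
        }
    }
}
```

Running main prints the same five rows as the Result above: the first column gains two empty positions below Chunk 1, and the second column is padded at the bottom.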

pick <shift_size>,[<x_axis>,<y_axis>]
Move a single position down by the given shift size.
<shift_size> - Total size of the performed shift.
<x_axis> - X position of the chunk in the alignment matrix.
<y_axis> - Y position of the chunk in the alignment matrix.

Example:
[Chunk 1] [Chunk 2]
[(empty)] [Chunk 4]
[Chunk 5] [Chunk 6]

Command: pick 1,[1,1]

Result:
[(empty)] [Chunk 2]
[Chunk 1] [Chunk 4]
[Chunk 5] [Chunk 6]
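
Unlike move, pick relocates exactly one position and leaves everything else in the column where it is. The sketch below replays the example on a matrix stored as a list of columns, with null standing for an empty position. The class PickCommandSketch is hypothetical, and rejecting a non-empty target is an assumption of this sketch, since the behaviour in that case is not specified here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the pick command; not the Core implementation. */
final class PickCommandSketch {

    /**
     * Applies "pick shift,[x,y]" to a matrix stored as a list of columns.
     * x and y are 1-based; null marks an empty position.
     */
    static void pick(List<List<String>> columns, int shift, int x, int y) {
        List<String> column = columns.get(x - 1);
        int from = y - 1;
        int to = from + shift;
        // Grow every column so that the target row exists and the matrix stays rectangular.
        for (List<String> c : columns) {
            while (c.size() <= to) {
                c.add(null);
            }
        }
        if (column.get(to) != null) {
            // Assumption of this sketch: a chunk can only be picked into an empty position.
            throw new IllegalStateException("Target position is not empty");
        }
        column.set(to, column.get(from));
        column.set(from, null);
    }

    public static void main(String[] args) {
        List<List<String>> columns = new ArrayList<>();
        columns.add(new ArrayList<>(Arrays.asList("Chunk 1", null, "Chunk 5")));
        columns.add(new ArrayList<>(Arrays.asList("Chunk 2", "Chunk 4", "Chunk 6")));

        pick(columns, 1, 1, 1); // the command from the example: pick 1,[1,1]

        System.out.println(columns.get(0)); // [null, Chunk 1, Chunk 5]
        System.out.println(columns.get(1)); // [Chunk 2, Chunk 4, Chunk 6]
    }
}
```

Chunk 1 drops into the empty position below it, while Chunk 5 and the whole second column stay untouched, which matches the Result above.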

fin [<x_axis>,<y_axis>]
Mark the given position as finished. It means that the position reached its final alignment and doesn't have to be aligned anymore.
<x_axis> - X position of the chunk in the alignment matrix.
<y_axis> - Y position of the chunk in the alignment matrix.

local
Switch the scope of subsequent alignment commands to the next level of detail. Be aware that there is no way to go back up, so the alignment on the current level must be finished before going down a level.

row <sub_row_detail>
Change the row being aligned on the current level to the given row. The scope must be local first.
<sub_row_detail> - Row detail of the current level.

Example:
〚[Chunk 1] 〚[Chunk 2]
[Chunk 3]〛 [Chunk 4]〛
〚[Chunk 5] 〚[Chunk 6]
[Chunk 7]〛 [Chunk 8]〛

Commands:
move 1,[1,1]
local
row 2
move 1,[1,1]

Result:
〚[(empty)] 〚[Chunk 2]
[(empty)]〛 [Chunk 4]〛
〚[(empty)] 〚[Chunk 6]
[Chunk 1] [Chunk 8]
[Chunk 3]〛 [(empty)]〛
〚[Chunk 5] 〚[(empty)]
[Chunk 7]〛 [(empty)]〛

mut <mutation_type>,[<original_x_axis>,<original_y_axis>] x [<target_x_axis>,<target_y_axis>]
Mark a mutation between two chunks.
<mutation_type> - Mutation type. The following mutations exist:
= - Equality
PM - Partial mutation
FM - Full mutation
T - Transposition
A - Addition
D - Deletion
Note. In a baseless comparison the mutations are symmetrical, so only one mutation of each pair is listed and the symmetrical one is omitted.
<original_x_axis> - X position of the original chunk in the alignment matrix.
<original_y_axis> - Y position of the original chunk in the alignment matrix.
<target_x_axis> - X position of the mutated chunk in the alignment matrix.
<target_y_axis> - Y position of the mutated chunk in the alignment matrix.
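
Because the Protocol is plain text, a client can read it with a small parser. The following sketch parses a single mut command with a regular expression. The class MutCommandSketch, the Mut record, and its field names are hypothetical, and the tolerated whitespace around "x" is an assumption. The parser classes shipped with the framework may look quite different.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of parsing one mut command; not part of the framework. */
final class MutCommandSketch {

    /** Parsed form of "mut <type>,[<ox>,<oy>] x [<tx>,<ty>]". */
    record Mut(String type, int originalX, int originalY, int targetX, int targetY) {}

    // One group for the mutation type and four groups for the matrix coordinates.
    private static final Pattern MUT = Pattern.compile(
            "mut\\s+(=|PM|FM|T|A|D),\\[(\\d+),(\\d+)\\]\\s*x\\s*\\[(\\d+),(\\d+)\\]");

    /** Parses a mut command line or throws if the line has a different shape. */
    static Mut parse(String line) {
        Matcher m = MUT.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a mut command: " + line);
        }
        return new Mut(
                m.group(1),
                Integer.parseInt(m.group(2)), Integer.parseInt(m.group(3)),
                Integer.parseInt(m.group(4)), Integer.parseInt(m.group(5)));
    }

    public static void main(String[] args) {
        Mut mutation = parse("mut PM,[1,2] x [3,2]");
        System.out.println(mutation.type() + " from column " + mutation.originalX()
                + " to column " + mutation.targetX() + " in row " + mutation.originalY());
    }
}
```

The same pattern-per-command approach extends to the other commands, since they all consist of a keyword followed by numbers in a fixed layout. The record requires Java 16 or later, which matches the Java version the libraries target.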

How To

Example

		  
	module SimpleClient {
		requires com.betterdiff.core;
		requires com.betterdiff.core.utils;
		requires com.betterdiff.lyrics;
	}
	
	/*******************************************************/

	package com.betterdiff.example.simpleClient;

	import java.util.HashMap;
	import java.util.List;
	import java.util.Map;
	import java.util.logging.Level;
	import java.util.stream.Collectors;

	import com.betterdiff.core.Callback;
	import com.betterdiff.core.alignment.Alignment;
	import com.betterdiff.core.identification.Identification;
	import com.betterdiff.core.pairing.Pairing;
	import com.betterdiff.core.preparation.Preparation;
	import com.betterdiff.core.protocol.MutationType;
	import com.betterdiff.core.protocol.command.Mutation;
	import com.betterdiff.core.utils.output.BetterDiffOutput;
	import com.betterdiff.core.utils.printer.simpleConsolePrinter.OutputFilter;
	import com.betterdiff.core.utils.printer.simpleConsolePrinter.PrinterMode;
	import com.betterdiff.core.utils.printer.simpleConsolePrinter.SimpleConsolePrinter;
	import com.betterdiff.core.utils.printer.simpleConsolePrinter.SimpleConsolePrinterParameters;
	import com.betterdiff.lyrics.pairing.LCSPairing;
	import com.betterdiff.lyrics.preparation.LinesPreparation;

	public class SimpleClient {

		// Static nested class: it needs no reference to an enclosing SimpleClient instance.
		private static class EmptyCallback extends Callback {
			public EmptyCallback() {
				super();
			}
			
			@Override
			protected void log(Level level, String message) {
				// nothing - you can add some logging here
			}
		}
		
		public static void main(String[] args) {
			// Input strings as if they were read from external source (file / stream / user input...).
			String file1 = "AAA\nBBB\nFFF";
			String file2 = "BBB\nFFF";
			String file3 = "AAA\nCBB\nFFF";
			
			// Create empty callback. Callback is useful for debugging or logging.
			Callback callback = new EmptyCallback();
			
			// Preparation phase
			Preparation preparation = new LinesPreparation(callback);
			preparation.addText(file1);
			preparation.addText(file2);
			preparation.addText(file3);
			preparation.process();
			
			// Pairing phase
			Pairing pairing = new LCSPairing(preparation, callback, 50);
			pairing.process();
			
			// Alignment phase
			Alignment alignment = new Alignment(pairing, callback);
			alignment.process();
			
			// Identification phase
			Identification identification = new Identification(alignment, callback);
			identification.process();
			
			// Prepare starting indexes for output purpose
			Map<Integer, Integer> startIndex = new HashMap<>();
			startIndex.put(1, 1);
			startIndex.put(2, 1);
			startIndex.put(3, 1);
			
			// Register all elements to the output object
			BetterDiffOutput betterDiffOutput = new BetterDiffOutput(callback);
			betterDiffOutput.registerAlignment(alignment, startIndex, 0, 0);
			// Filter out equalities as they are not important in this example
			List<Mutation> mutationsToRegister = identification.getMutations().stream().filter(e ->
				e.getMutationType() != MutationType.EQUALITY
			).collect(Collectors.toList());
			betterDiffOutput.registerMutations(mutationsToRegister, alignment.getAlignedChunks());
			
			// Create a printer and print out the result
			SimpleConsolePrinterParameters parameters = new SimpleConsolePrinterParameters(
					PrinterMode.SIDE_BY_SIDE,
					SimpleConsolePrinter.DEFAULT_OUTPUT_DIMENSIONS,
					false,
					OutputFilter.MAIN_ALIGNMENT);
			SimpleConsolePrinter spc = new SimpleConsolePrinter(parameters, callback);
			spc.process(betterDiffOutput);
		}
	}
	
	/*******************************************************/
	
	Output:
	
	1|AAA    [2+] |               | 1|AAA    [2+]
	2|BBB    [3o] | 1|BBB    [3o] | 2|CBB [1o,2o]
	3|FFF         | 2|FFF         | 3|FFF        
		  
		  

JavaDoc

com.betterdiff.core.*
com.betterdiff.lyrics.*
com.betterdiff.core.utils.*

License, Contact

Ladislav Asenbrener, troomar@gmail.com
License: CC BY-NC-ND 3.0
https://creativecommons.org/licenses/by-nc-nd/3.0/legalcode
https://creativecommons.org/licenses/by-nc-nd/3.0/