A library for calculating edit distances over generic sequences. It also provides classes that make it easier to work with big datasets. Supported edit distances: Levenshtein, Hamming, Jaro, and Jaro-Winkler. Supported lengths: longest common substring and longest common subsequence. Bonus: conditional entropy.
Add the JitPack repository after the build tag:
<repositories>
<repository>
<id>jitpack.io</id>
<url>https://jitpack.io</url>
</repository>
</repositories>
And add this to dependencies:
<dependency>
<groupId>com.github.ondrej-nemec</groupId>
<artifactId>metrics</artifactId>
<version>v4.0-beta</version>
</dependency>
Note: in this text, the generic type 'S' is always the type of the elements you compare. If you want to compare Strings (words), you pass them as a List.
Each distance can be calculated in two ways: quick and info. Quick mode provides only the distance; info mode gives you more information about the calculation.
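Since the methods below take lists rather than Strings, a word has to be turned into a sequence first. The following is a minimal sketch, assuming the element type for words is Character; the class SequenceSketch and its toSequence helper are hypothetical and not part of the library.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of the library.
final class SequenceSketch {

    // Splits a word into a list of its characters so it can be passed
    // wherever a List<S> is expected (here S is assumed to be Character).
    static List<Character> toSequence(String word) {
        return word.chars()
                .mapToObj(c -> (char) c)
                .collect(Collectors.toList());
    }
}
```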
Each distance in quick mode implements the DistanceQuick interface, which has this method:
Number calculate(List<S> sequenceFrom, List<S> sequenceTo);
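To illustrate what a quick-mode calculation does, here is a self-contained sketch of a Levenshtein distance with the same method shape as above. It does not use the library; the class name LevenshteinQuickSketch and the two-row implementation are assumptions made for illustration only.

```java
import java.util.List;

// Illustrative stand-in, not the library's implementation.
final class LevenshteinQuickSketch {

    // Returns only the distance, as quick mode is described to do.
    static <S> Number calculate(List<S> sequenceFrom, List<S> sequenceTo) {
        int m = sequenceTo.size();
        int[] previous = new int[m + 1];
        for (int j = 0; j <= m; j++) {
            previous[j] = j; // cost of building a prefix of sequenceTo from nothing
        }
        for (int i = 1; i <= sequenceFrom.size(); i++) {
            int[] current = new int[m + 1];
            current[0] = i; // cost of deleting the first i elements
            for (int j = 1; j <= m; j++) {
                int cost = sequenceFrom.get(i - 1).equals(sequenceTo.get(j - 1)) ? 0 : 1;
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                int substitution = previous[j - 1] + cost;
                current[j] = Math.min(Math.min(deletion, insertion), substitution);
            }
            previous = current;
        }
        return previous[m];
    }
}
```

Only two rows of the table are kept, which is enough when nothing but the final number is needed.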
DistanceInfo provides the distance, a description, and the 'final sequences', which are the given sequences edited to the same length.
In this mode, metrics have this method:
DistanceResult<S, T> calculate(List<S> sequenceFrom, List<S> sequenceTo);
'T' is the type of the structure you can get from the result. This type is defined by each metric, so you only need to define 'S'.
DistanceResult is a data object. From this class you can get information about the previous distance calculation in info mode.
//returns the edited first sequence
List<S> getFinalSequenceFrom();
//returns the edited second sequence
List<S> getFinalSequenceTo();
//returns a description of the calculation
String getDescription();
//returns a String representing each operation used
String getOperations();
//returns the calculated distance
Number getDistance();
//returns the structure used in the calculation
T getStructure();
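As a rough illustration of what an info-mode result could carry, the sketch below computes a Levenshtein table, then walks it backwards to build two sequences of equal length and a string of operations. Everything here is an assumption: the class AlignmentSketch, the use of null as a gap element, and the operation letters (M match, S substitution, D deletion, I insertion) are hypothetical and may differ from what the library returns.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical info-mode sketch, not the library's DistanceResult.
final class AlignmentSketch<S> {
    final List<S> finalSequenceFrom = new ArrayList<>();
    final List<S> finalSequenceTo = new ArrayList<>();
    final String operations;
    final int distance;
    final int[][] structure; // the full dynamic-programming table

    AlignmentSketch(List<S> from, List<S> to) {
        int n = from.size();
        int m = to.size();
        structure = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) structure[i][0] = i;
        for (int j = 0; j <= m; j++) structure[0][j] = j;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = from.get(i - 1).equals(to.get(j - 1)) ? 0 : 1;
                structure[i][j] = Math.min(
                        Math.min(structure[i - 1][j] + 1, structure[i][j - 1] + 1),
                        structure[i - 1][j - 1] + cost);
            }
        }
        distance = structure[n][m];

        // Walk back from the bottom-right corner, recording one step at a time.
        StringBuilder ops = new StringBuilder();
        int i = n;
        int j = m;
        while (i > 0 || j > 0) {
            boolean same = i > 0 && j > 0 && from.get(i - 1).equals(to.get(j - 1));
            if (i > 0 && j > 0 && structure[i][j] == structure[i - 1][j - 1] + (same ? 0 : 1)) {
                finalSequenceFrom.add(from.get(i - 1));
                finalSequenceTo.add(to.get(j - 1));
                ops.append(same ? 'M' : 'S');
                i--;
                j--;
            } else if (i > 0 && structure[i][j] == structure[i - 1][j] + 1) {
                finalSequenceFrom.add(from.get(i - 1));
                finalSequenceTo.add(null); // gap in the second sequence
                ops.append('D');
                i--;
            } else {
                finalSequenceFrom.add(null); // gap in the first sequence
                finalSequenceTo.add(to.get(j - 1));
                ops.append('I');
                j--;
            }
        }
        // The walk ran backwards, so restore left-to-right order.
        Collections.reverse(finalSequenceFrom);
        Collections.reverse(finalSequenceTo);
        operations = ops.reverse().toString();
    }
}
```

Both final sequences always end up with the same size, because every step of the walk appends exactly one element (or one gap) to each of them.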
Each length can also be calculated in two ways: info and quick.
Classes in the length package implement the LengthQuick interface, which provides:
Number calculate(List<S> sequenceFrom, List<S> sequenceTo);
LengthInfo provides the length, a description, and more information.
In this mode, metrics have this method:
LengthResult<S, T> calculate(List<S> sequenceFrom, List<S> sequenceTo);
'T' is the type of the structure you can get from the result. This type is defined by each metric, so you only need to define 'S'.
LengthResult is a data object. From this class you can get information about the previous length calculation in info mode.
//returns a description of the calculation
String getDescription();
//returns the calculated length
Number getLength();
//returns the structure used in the calculation
T getStructure();
//returns all common substrings/subsequences that have the calculated length
Collection<List<S>> getSubs();
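The relationship between the calculated length and the collection of common parts can be sketched for the longest common substring as follows. This is an independent illustration, not the library's code; the class LongestCommonSubstringSketch and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative stand-in, not the library's implementation.
final class LongestCommonSubstringSketch {

    // Returns every distinct common substring that has the maximal length.
    static <S> Set<List<S>> subs(List<S> from, List<S> to) {
        // table[i][j] = length of the common run ending at from[i-1] and to[j-1]
        int[][] table = new int[from.size() + 1][to.size() + 1];
        int best = 0;
        Set<List<S>> result = new LinkedHashSet<>();
        for (int i = 1; i <= from.size(); i++) {
            for (int j = 1; j <= to.size(); j++) {
                if (!from.get(i - 1).equals(to.get(j - 1))) {
                    continue; // run is broken, table[i][j] stays 0
                }
                table[i][j] = table[i - 1][j - 1] + 1;
                if (table[i][j] > best) {
                    best = table[i][j];
                    result.clear(); // shorter candidates are no longer maximal
                }
                if (table[i][j] == best) {
                    result.add(new ArrayList<>(from.subList(i - best, i)));
                }
            }
        }
        return result;
    }

    // The length is simply the size of any returned substring, or 0 if none.
    static <S> int length(List<S> from, List<S> to) {
        Set<List<S>> found = subs(from, to);
        return found.isEmpty() ? 0 : found.iterator().next().size();
    }
}
```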
For the calculation, use the Entropy class. Remember that the sizes of both lists, and the sizes of first.get(i) and second.get(i), must be the same; otherwise a RuntimeException is thrown. This class is immutable, so everything is calculated when you create a new Entropy.
//creates a new Entropy and calculates everything
new Entropy(List<List<S>> first, List<List<S>> second);
//returns the entropy of the first sequence
public Double getEntropyFrom();
//returns the entropy of the second sequence
public Double getEntropyTo();
//returns every 'S' from the first sequence with its number of occurrences
public List<Tuple2<S, Integer>> getFonemsFromWithCount();
//returns every 'S' from the second sequence with its number of occurrences
public List<Tuple2<S, Integer>> getFonemsToWithCount();
//returns every pair of 'S' (one from the first and one from the second sequence) with its number of occurrences
public List<Tuple3<S, S, Integer>> getFonemsTwinsWithCounts();
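The size checks and the counting described above can be sketched as follows. This is not the library's Entropy class: the class EntropySketch, its field names, the base-2 Shannon formula, and the definition of the conditional value as H(from, to) - H(from) are all assumptions made for illustration, and the library may compute its values differently.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; everything is computed once, in the constructor.
final class EntropySketch<S> {
    final double entropyFrom;
    final double entropyTo;
    final double conditionalToGivenFrom; // assumed definition: H(from, to) - H(from)

    EntropySketch(List<List<S>> first, List<List<S>> second) {
        if (first.size() != second.size()) {
            throw new RuntimeException("Both lists must have the same size");
        }
        Map<S, Integer> fromCounts = new HashMap<>();
        Map<S, Integer> toCounts = new HashMap<>();
        Map<List<S>, Integer> pairCounts = new HashMap<>();
        int total = 0;
        for (int i = 0; i < first.size(); i++) {
            if (first.get(i).size() != second.get(i).size()) {
                throw new RuntimeException("Sequences at index " + i + " differ in size");
            }
            for (int k = 0; k < first.get(i).size(); k++) {
                S a = first.get(i).get(k);
                S b = second.get(i).get(k);
                fromCounts.merge(a, 1, Integer::sum);
                toCounts.merge(b, 1, Integer::sum);
                pairCounts.merge(Arrays.asList(a, b), 1, Integer::sum); // aligned pair
                total++;
            }
        }
        entropyFrom = shannon(fromCounts.values(), total);
        entropyTo = shannon(toCounts.values(), total);
        conditionalToGivenFrom = shannon(pairCounts.values(), total) - entropyFrom;
    }

    // Shannon entropy in bits of the distribution given by the counts.
    private static double shannon(Collection<Integer> counts, int total) {
        double h = 0.0;
        for (int count : counts) {
            double p = (double) count / total;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }
}
```

Under these assumptions, a conditional value of zero means that each element of the first sequence always appears opposite the same element of the second one.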