Large docs refactor
Signed-off-by: solonovamax <solonovamax@12oclockpoint.com>
solonovamax committed Oct 3, 2023
1 parent 8615617 commit 45f8471
Showing 7 changed files with 407 additions and 318 deletions.
236 changes: 236 additions & 0 deletions kt-string-similarity/dokka/includes/edit.md
@@ -0,0 +1,236 @@
# Package ca.solostudios.stringsimilarity.edit

This package contains the edit-based string measure implementations.

## Algorithms

### [Levenshtein][ca.solostudios.stringsimilarity.edit.Levenshtein]

The [Levenshtein][ca.solostudios.stringsimilarity.edit.Levenshtein] distance between two words is the minimum number of
single-character edits (insertions, deletions, or substitutions) required to change one word into the other.

It is a metric string distance. This class implements the dynamic programming approach,
which has a space requirement of \\(O(m \\times n)\\) and a computation cost of \\(O(m \\times n)\\),
where \\(m\\) and \\(n\\) are the lengths of the two strings.

#### Example

```kotlin
val levenshtein = Levenshtein()

println(levenshtein.distance("My string", "My \$tring")) // prints 1.0
```
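
The dynamic programming approach mentioned above can be sketched as follows. This is a minimal illustration of the classic full-table recurrence, not the library's actual implementation; the function name `levenshteinSketch` is hypothetical and it returns an `Int` rather than the `Double` the library prints.

```kotlin
// Illustrative sketch only; `levenshteinSketch` is a hypothetical name, not a library function.
fun levenshteinSketch(a: String, b: String): Int {
    // table[i][j] = edits needed to turn the first i chars of a into the first j chars of b
    val table = Array(a.length + 1) { IntArray(b.length + 1) }
    for (i in 0..a.length) table[i][0] = i // delete all i characters
    for (j in 0..b.length) table[0][j] = j // insert all j characters

    for (i in 1..a.length) {
        for (j in 1..b.length) {
            val substitutionCost = if (a[i - 1] == b[j - 1]) 0 else 1
            table[i][j] = minOf(
                table[i - 1][j] + 1,                   // deletion
                table[i][j - 1] + 1,                   // insertion
                table[i - 1][j - 1] + substitutionCost // substitution or match
            )
        }
    }
    return table[a.length][b.length]
}
```

With this sketch, `levenshteinSketch("kitten", "sitting")` evaluates to 3 (two substitutions and one insertion). The table has \\((m + 1) \\times (n + 1)\\) cells and each cell is filled once, which is where the quadratic space and time costs come from.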

### [Normalized Levenshtein][ca.solostudios.stringsimilarity.edit.NormalizedLevenshtein]

This is computed as the [Levenshtein distance][ca.solostudios.stringsimilarity.edit.Levenshtein]
normalized to be in the range \\(&#91;0.0, 1.0&#93;\\).

It is a metric string distance. This class implements the dynamic programming approach,
which has a space requirement \\(O(m \\times n)\\), and computation cost \\(O(m \\times n)\\).

#### Example

```kotlin
val normLevenshtein = NormalizedLevenshtein()

println(normLevenshtein.distance("My string", "My \$tring")) // prints 0.10526315789473684
```
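
The text does not state the normalization formula. The printed value above equals \\(2/19\\), which is what \\(2d / (m + n + d)\\) gives for \\(d = 1\\) and \\(m = n = 9\\), and the same expression reproduces the other normalized example values in this file. The sketch below assumes that formula; it is inferred from the examples rather than taken from the library source, and `normalizeSketch` is a hypothetical helper.

```kotlin
// Assumed normalization, inferred from the example outputs: 2d / (m + n + d).
// `normalizeSketch` is a hypothetical helper, not part of the library.
fun normalizeSketch(distance: Double, firstLength: Int, secondLength: Int): Double {
    // Two empty strings are identical; this also avoids dividing by zero.
    if (firstLength == 0 && secondLength == 0) return 0.0
    return 2.0 * distance / (firstLength + secondLength + distance)
}
```

Because an edit distance never exceeds \\(m + n\\), this expression stays within the stated range.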

### [Damerau-Levenshtein][ca.solostudios.stringsimilarity.edit.DamerauLevenshtein]

Similar to the [Levenshtein distance][ca.solostudios.stringsimilarity.edit.Levenshtein],
the [Damerau-Levenshtein distance][ca.solostudios.stringsimilarity.edit.DamerauLevenshtein] with transposition
(also sometimes called unrestricted Damerau-Levenshtein distance) is the minimum number of operations needed to transform
one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character,
or a **transposition of two adjacent characters**.

It is a metric string distance. This class implements the dynamic programming approach,
which has a space requirement \\(O(m \\times n)\\), and computation cost \\(O(m \\times n)\\).

This is not to be confused with the optimal string alignment distance, which is a restricted variant in which no substring
can be edited more than once.

#### Example

```kotlin
val damerau = DamerauLevenshtein()

// 1 transposition
println(damerau.distance("ABCDEF", "ABDCEF")) // prints 1.0

// 2 transpositions
println(damerau.distance("ABCDEF", "BACDFE")) // prints 2.0

// 1 deletion or insertion
println(damerau.distance("ABCDEF", "ABCDE")) // prints 1.0
println(damerau.distance("ABCDEF", "BCDEF")) // prints 1.0
println(damerau.distance("ABCDEF", "ABCGDEF")) // prints 1.0

// All different
println(damerau.distance("ABCDEF", "POIU")) // prints 6.0

// Transpose
println(damerau.distance("CA", "ABC")) // prints 2.0
```
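
For readers curious how unrestricted transpositions fit into the dynamic programming table, here is a minimal sketch of the textbook algorithm. It is an assumption-laden illustration, not the library's code: `damerauLevenshteinSketch` is a hypothetical name and the result is an `Int`.

```kotlin
// Illustrative sketch of the unrestricted Damerau-Levenshtein recurrence.
// `damerauLevenshteinSketch` is a hypothetical name, not a library function.
fun damerauLevenshteinSketch(a: String, b: String): Int {
    val maxDistance = a.length + b.length
    // The table is shifted by one in both dimensions: row 0 and column 0 hold a sentinel.
    val d = Array(a.length + 2) { IntArray(b.length + 2) }
    d[0][0] = maxDistance
    for (i in 0..a.length) { d[i + 1][0] = maxDistance; d[i + 1][1] = i }
    for (j in 0..b.length) { d[0][j + 1] = maxDistance; d[1][j + 1] = j }

    // lastRowOf[c] = last 1-based position in a where character c was seen (0 if never)
    val lastRowOf = HashMap<Char, Int>()

    for (i in 1..a.length) {
        var lastMatchColumn = 0 // last 1-based position in b that matched a[i - 1] in this row
        for (j in 1..b.length) {
            val k = lastRowOf[b[j - 1]] ?: 0
            val l = lastMatchColumn
            val cost = if (a[i - 1] == b[j - 1]) 0 else 1
            if (cost == 0) lastMatchColumn = j
            d[i + 1][j + 1] = minOf(
                minOf(
                    d[i][j] + cost,  // substitution or match
                    d[i + 1][j] + 1, // insertion
                    d[i][j + 1] + 1  // deletion
                ),
                // transposition, plus the edits between the two swapped characters
                d[k][l] + (i - k - 1) + 1 + (j - l - 1)
            )
        }
        lastRowOf[a[i - 1]] = i
    }
    return d[a.length + 1][b.length + 1]
}
```

The last term of the minimum is what lets `"CA"` become `"ABC"` in two operations: swap to `"AC"`, then insert `"B"` between the swapped characters.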

### [Normalized Damerau-Levenshtein][ca.solostudios.stringsimilarity.edit.NormalizedDamerauLevenshtein]

This is computed as the [Damerau-Levenshtein distance][ca.solostudios.stringsimilarity.edit.DamerauLevenshtein]
normalized to be in the range \\(&#91;0.0, 1.0&#93;\\).

It is a metric string distance. This class implements the dynamic programming approach,
which has a space requirement \\(O(m \\times n)\\), and computation cost \\(O(m \\times n)\\).

#### Example

```kotlin
val damerau = NormalizedDamerauLevenshtein()

// 1 transposition
println(damerau.distance("ABCDEF", "ABDCEF")) // prints 0.15384615384615385

// 2 transpositions
println(damerau.distance("ABCDEF", "BACDFE")) // prints 0.2857142857142857

// 1 deletion or insertion
println(damerau.distance("ABCDEF", "ABCDE")) // prints 0.16666666666666666
println(damerau.distance("ABCDEF", "BCDEF")) // prints 0.16666666666666666
println(damerau.distance("ABCDEF", "ABCGDEF")) // prints 0.14285714285714285

// All different
println(damerau.distance("ABCDEF", "POIU")) // prints 0.75

// Transpose
println(damerau.distance("CA", "ABC")) // prints 0.5714285714285714
```

### [Optimal String Alignment][ca.solostudios.stringsimilarity.edit.OptimalStringAlignment]

The [Optimal String Alignment distance][ca.solostudios.stringsimilarity.edit.OptimalStringAlignment] variant
of [Damerau-Levenshtein distance][ca.solostudios.stringsimilarity.edit.DamerauLevenshtein]
(sometimes called the restricted edit distance) computes the number of edit operations needed
to make the strings equal under the condition that **no substring is edited more than once**,
whereas the true [Damerau-Levenshtein distance][ca.solostudios.stringsimilarity.edit.DamerauLevenshtein]
presents no such restriction.
The difference from the algorithm for the [Levenshtein distance][ca.solostudios.stringsimilarity.edit.Levenshtein] is the
addition of one recurrence for the transposition operations.

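Unlike the unrestricted Damerau-Levenshtein distance, it is not a true metric, because the restriction means the triangle
inequality does not always hold. This class implements the dynamic programming approach,
which has a space requirement \\(O(m \\times n)\\), and computation cost \\(O(m \\times n)\\).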

#### Example

```kotlin
val osa = OptimalStringAlignment()

// 1 transposition
println(osa.distance("ABCDEF", "ABDCEF")) // prints 1.0

// 2 transpositions
println(osa.distance("ABCDEF", "BACDFE")) // prints 2.0

// 1 deletion or insertion
println(osa.distance("ABCDEF", "ABCDE")) // prints 1.0
println(osa.distance("ABCDEF", "BCDEF")) // prints 1.0
println(osa.distance("ABCDEF", "ABCGDEF")) // prints 1.0

// All different
println(osa.distance("ABCDEF", "POIU")) // prints 6.0

// A transposed pair cannot be edited again, so this costs more than Damerau-Levenshtein (2.0)
println(osa.distance("CA", "ABC")) // prints 3.0
```
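
The "one extra recurrence" described above can be sketched like this. It is a minimal illustration under the usual textbook formulation, not the library's implementation; `optimalStringAlignmentSketch` is a hypothetical name and the result is an `Int`.

```kotlin
// Illustrative sketch: the Levenshtein table plus one extra case for adjacent transpositions.
// `optimalStringAlignmentSketch` is a hypothetical name, not a library function.
fun optimalStringAlignmentSketch(a: String, b: String): Int {
    val d = Array(a.length + 1) { IntArray(b.length + 1) }
    for (i in 0..a.length) d[i][0] = i
    for (j in 0..b.length) d[0][j] = j

    for (i in 1..a.length) {
        for (j in 1..b.length) {
            val cost = if (a[i - 1] == b[j - 1]) 0 else 1
            var best = minOf(
                d[i - 1][j] + 1,       // deletion
                d[i][j - 1] + 1,       // insertion
                d[i - 1][j - 1] + cost // substitution or match
            )
            // Extra recurrence: the last two characters of both prefixes are swapped.
            if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1]) {
                best = minOf(best, d[i - 2][j - 2] + 1)
            }
            d[i][j] = best
        }
    }
    return d[a.length][b.length]
}
```

Under this sketch, `"CA"` to `"AC"` costs 1 and `"AC"` to `"ABC"` costs 1, while `"CA"` to `"ABC"` costs 3, so the direct route is more expensive than going through `"AC"`.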

### [Normalized Optimal String Alignment][ca.solostudios.stringsimilarity.edit.NormalizedOptimalStringAlignment]

This is computed as the [Optimal String Alignment][ca.solostudios.stringsimilarity.edit.OptimalStringAlignment]
normalized to be in the range \\(&#91;0.0, 1.0&#93;\\).

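Because the underlying distance does not satisfy the triangle inequality, the normalized value should not be assumed to be
a metric either. This class implements the dynamic programming approach,
which has a space requirement \\(O(m \\times n)\\), and computation cost \\(O(m \\times n)\\).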

#### Example

```kotlin
val osa = NormalizedOptimalStringAlignment()

// 1 transposition
println(osa.distance("ABCDEF", "ABDCEF")) // prints 0.15384615384615385

// 2 transpositions
println(osa.distance("ABCDEF", "BACDFE")) // prints 0.2857142857142857

// 1 deletion or insertion
println(osa.distance("ABCDEF", "ABCDE")) // prints 0.16666666666666666
println(osa.distance("ABCDEF", "BCDEF")) // prints 0.16666666666666666
println(osa.distance("ABCDEF", "ABCGDEF")) // prints 0.14285714285714285

// All different
println(osa.distance("ABCDEF", "POIU")) // prints 0.75

// Transpose
println(osa.distance("CA", "ABC")) // prints 0.75
```

### [Longest Common Subsequence][ca.solostudios.stringsimilarity.edit.LCS]

The [Longest Common Subsequence][ca.solostudios.stringsimilarity.edit.LCS] (LCS) problem consists of finding the longest
subsequence common to two (or more) sequences.
It differs from problems of finding common substrings: unlike substrings, subsequences are not required to
occupy consecutive positions within the original sequences.

It is used by the diff utility, by Git for reconciling multiple changes, etc.

The [LCS distance][ca.solostudios.stringsimilarity.edit.LCS] is equivalent
to the [Levenshtein distance][ca.solostudios.stringsimilarity.edit.Levenshtein] when only insertions and deletions are
allowed (no substitution), or when the cost of a substitution is double the cost of an insertion or deletion.

It is a metric string distance. This class implements the dynamic programming approach,
which has a space requirement \\(O(m \\times n)\\), and computation cost \\(O(m \\times n)\\)[@ft-a].

#### Example

```kotlin
val lcs = LCS()

println(lcs.distance("AGCAT", "GAC")) // prints 4.0

println(lcs.distance("AGCAT", "AGCT")) // prints 1.0
```
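
Since only insertions and deletions are allowed, the distance follows directly from the length of the longest common subsequence: every character outside the common subsequence is either deleted from the first string or inserted from the second, giving \\(m + n - 2 \\times \\text{lcs}\\). The sketch below illustrates this; the function names are hypothetical and this is not the library's implementation.

```kotlin
// Illustrative sketch; `lcsLengthSketch` and `lcsDistanceSketch` are hypothetical names.
fun lcsLengthSketch(a: String, b: String): Int {
    // lengths[i][j] = LCS length of the first i chars of a and the first j chars of b
    val lengths = Array(a.length + 1) { IntArray(b.length + 1) }
    for (i in 1..a.length) {
        for (j in 1..b.length) {
            lengths[i][j] = if (a[i - 1] == b[j - 1]) {
                lengths[i - 1][j - 1] + 1 // extend the common subsequence
            } else {
                maxOf(lengths[i - 1][j], lengths[i][j - 1]) // drop a char from one side
            }
        }
    }
    return lengths[a.length][b.length]
}

// Characters outside the common subsequence are deleted (from a) or inserted (from b).
fun lcsDistanceSketch(a: String, b: String): Int =
    a.length + b.length - 2 * lcsLengthSketch(a, b)
```

For `"AGCAT"` and `"GAC"` the longest common subsequence has length 2, so the sketch yields \\(5 + 3 - 2 \\times 2 = 4\\), matching the value printed above.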

### [Normalized Longest Common Subsequence][ca.solostudios.stringsimilarity.edit.NormalizedLCS]

This is computed as the [Longest Common Subsequence][ca.solostudios.stringsimilarity.edit.LCS]
normalized to be in the range \\(&#91;0.0, 1.0&#93;\\).

It is a metric string distance. This class implements the dynamic programming approach,
which has a space requirement \\(O(m \\times n)\\), and computation cost \\(O(m \\times n)\\)[@ft-a].

#### Example

```kotlin
val normalizedLCS = NormalizedLCS()

println(normalizedLCS.distance("ABCDEFG", "ABCDEFHJKL")) // prints 0.45454545454545453

println(normalizedLCS.distance("ABDEF", "ABDIF")) // prints 0.3333333333333333
```

<h2 class="footnotes-header">Notes</h2>
<div class="footnotes">
<ol>
<li id="footnote-a">

K.S. Larsen proposed an algorithm that computes the length of the LCS in time
\\(O(\\log(m) \\times \\log(n))\\).[@ref-1] However, the algorithm has a memory requirement of \\(O(m \\times n^2)\\) and was thus not
implemented here.
</li>
</ol>
</div>

<h2 class="references-header">References</h2>
<div class="references">
<ol>
<li id="reference-1">

Larsen, K. S. (1992-10). Length of maximal common subsequences. DAIMI Report
Series, 21(426).
<https://doi.org/10.7146/dpb.v21i426.6740><sup>[&#91;sci-hub&#93;](https://sci-hub.st/10.7146/dpb.v21i426.6740)</sup>
</li>
</ol>
</div>
51 changes: 51 additions & 0 deletions kt-string-similarity/dokka/includes/interfaces.md
@@ -0,0 +1,51 @@
# Package ca.solostudios.stringsimilarity.interfaces

This package contains all the interfaces for string measures.

## Normalized, metric, similarity and distance

Although the topic might seem simple, a lot of different algorithms exist to measure text similarity or distance.
Therefore, the library defines some interfaces to categorize them.

### (Normalized) Similarity and Distance

- [StringSimilarity][ca.solostudios.stringsimilarity.interfaces.StringSimilarity]: Implementing algorithms define a
  similarity between strings (0 means strings are completely different).
- [NormalizedStringSimilarity][ca.solostudios.stringsimilarity.interfaces.NormalizedStringSimilarity]: This interface
  extends [StringSimilarity][ca.solostudios.stringsimilarity.interfaces.StringSimilarity].
  Implementing algorithms compute a similarity that has been normalized based on the number of operations performed.
  This means that for non-weighted implementations, the result will always be in the range \\(&#91;0, 1&#93;\\).
[Jaro-Winkler][ca.solostudios.stringsimilarity.JaroWinkler] is an example of this.
- [StringDistance][ca.solostudios.stringsimilarity.interfaces.StringDistance]: Implementing algorithms define a distance
between strings (0 means strings are identical), like [Levenshtein][ca.solostudios.stringsimilarity.edit.Levenshtein] for example.
The maximum distance value depends on the algorithm.
- [NormalizedStringDistance][ca.solostudios.stringsimilarity.interfaces.NormalizedStringDistance]: This interface
extends [StringDistance][ca.solostudios.stringsimilarity.interfaces.StringDistance].
Implementing algorithms compute a distance that has been normalized based on the number of operations performed.
  This means that for non-weighted implementations, the result will always be in the range \\(&#91;0, 1&#93;\\).
[NormalizedLevenshtein][ca.solostudios.stringsimilarity.edit.NormalizedLevenshtein] is an example of this.

Generally, algorithms that
implement [NormalizedStringSimilarity][ca.solostudios.stringsimilarity.interfaces.NormalizedStringSimilarity]
also implement [NormalizedStringDistance][ca.solostudios.stringsimilarity.interfaces.NormalizedStringDistance].
This is because the similarity can be computed as \\(1 - \\text{distance}\\),
and the distance can be computed as \\(1 - \\text{similarity}\\).

> Note: This is only applicable if the result is *always* between 0 and 1.
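
As a small illustration of the conversion described above, the sketch below derives one value from the other and rejects inputs outside the normalized range. Both function names are hypothetical helpers and are not part of the library's interfaces.

```kotlin
// Hypothetical helpers, not part of the library: convert between a normalized
// distance and a normalized similarity.
fun similarityFromDistance(distance: Double): Double {
    require(distance in 0.0..1.0) { "distance must be normalized to [0, 1]" }
    return 1.0 - distance
}

fun distanceFromSimilarity(similarity: Double): Double {
    require(similarity in 0.0..1.0) { "similarity must be normalized to [0, 1]" }
    return 1.0 - similarity
}
```

The `require` calls make the precondition explicit: for an unbounded distance such as plain Levenshtein, `1 - distance` would go negative and would not be a meaningful similarity.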

### Metric Distances

The [MetricStringDistance][ca.solostudios.stringsimilarity.interfaces.MetricStringDistance]
interface indicates that the implementing class is a metric distance,
which means that it satisfies the required axioms to be considered metric.
Read [MetricStringDistance][ca.solostudios.stringsimilarity.interfaces.MetricStringDistance] for more information.

Many nearest-neighbor search algorithms and indexing structures rely on the triangle inequality.
You can check "Similarity Search, The Metric Space Approach" by Zezula et al. for a survey.
These cannot be used with non-metric similarity measures.
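
The triangle inequality requires \\(d(x, z) \\le d(x, y) + d(y, z)\\) for every choice of strings. A quick way to look for counterexamples is to test all triples from a sample set, as in the sketch below. The function name is hypothetical, and passing this check on a sample does not prove a measure is metric; it can only disprove it.

```kotlin
// Hypothetical helper: searches a sample set for a violation of the triangle inequality.
// Returning true only means no counterexample was found among the samples.
fun satisfiesTriangleInequality(
    distance: (String, String) -> Double,
    samples: List<String>
): Boolean {
    val tolerance = 1e-9 // guards against floating-point rounding
    for (x in samples) {
        for (y in samples) {
            for (z in samples) {
                if (distance(x, z) > distance(x, y) + distance(y, z) + tolerance) {
                    return false
                }
            }
        }
    }
    return true
}
```

For example, the discrete distance (0.0 for equal strings, 1.0 otherwise) passes on any sample, whereas the squared difference of string lengths fails on `listOf("", "a", "ab")` because \\(4 > 1 + 1\\).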

### Edit Measures

The edit measure interfaces indicate when a specific algorithm is edit-based.
See the `edit` package for all implementors.
14 changes: 7 additions & 7 deletions kt-string-similarity/dokka/includes/kt-string-similarity.md
@@ -26,14 +26,14 @@ The "cost" columns gives an estimation of the computational/memory costs to comp

| Name | Distance | Similarity | Normalized | Metric | Memory cost | Execution cost |
|--------------------------------------------|:--------:|:----------:|:----------:|:------:|----------------------|------------------------------------|
| Levenshtein || ||| \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a] |
| Damerau-Levenshtein[@ft-c] || ||| \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a] |
| Optimal String Alignment[@ft-c] || ||| \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a] |
| Longest Common Subsequence || ||| \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a][@ft-b] |
| Levenshtein || ||| \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a] |
| Damerau-Levenshtein[@ft-c] || ||| \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a] |
| Optimal String Alignment[@ft-c] || ||| \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a] |
| Longest Common Subsequence || ||| \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a][@ft-b] |
| Normalized Levenshtein ||||| \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a] |
| Normalized Damerau-Levenshtein[@ft-c] || ||| \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a] |
| Normalized Optimal String Alignment[@ft-c] || ||| \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a] |
| Normalized Longest Common Subsequence || ||| \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a][@ft-b] |
| Normalized Damerau-Levenshtein[@ft-c] || ||| \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a] |
| Normalized Optimal String Alignment[@ft-c] || ||| \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a] |
| Normalized Longest Common Subsequence || ||| \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a][@ft-b] |
| Cosine similarity ||||| \\(O(m + n)\\) | \\(O(m + n)\\) |
| Jaccard index ||||| \\(O(m + n)\\) | \\(O(m + n)\\) |
| Jaro-Winkler ||||| \\(O(m + n)\\) | \\(O(m \\times n)\\) |