From da9bb56dd468a9336793e3ba8db3d199bf1b3091 Mon Sep 17 00:00:00 2001
From: solonovamax
Date: Fri, 29 Sep 2023 18:41:18 -0400
Subject: [PATCH] Update references in documentation, plus minor docs changes

- Format all references in APA
- Add references section to all relevant documentation
  - Includes links and DOIs for references where possible
- Rework some wording
- Add @see annotations to distance and similarity interfaces

Signed-off-by: solonovamax
---
 .../dokka/includes/kt-string-similarity.md | 155 ++++++++++++------
 .../ca/solostudios/stringsimilarity/Cosine.kt | 9 +-
 .../solostudios/stringsimilarity/Jaccard.kt | 7 +
 .../stringsimilarity/JaroWinkler.kt | 7 +
 .../ca/solostudios/stringsimilarity/NGram.kt | 15 +-
 .../ca/solostudios/stringsimilarity/QGram.kt | 13 +-
 .../stringsimilarity/RatcliffObershelp.kt | 7 +-
 .../stringsimilarity/ShingleBased.kt | 15 +-
 .../ca/solostudios/stringsimilarity/Sift4.kt | 8 +-
 .../stringsimilarity/SorensenDice.kt | 10 +-
 .../edit/DamerauLevenshtein.kt | 9 +-
 .../solostudios/stringsimilarity/edit/LCS.kt | 2 +-
 .../stringsimilarity/edit/Levenshtein.kt | 6 +-
 .../edit/NormalizedDamerauLevenshtein.kt | 13 +-
 .../stringsimilarity/edit/NormalizedLCS.kt | 13 +-
 .../edit/NormalizedLevenshtein.kt | 14 +-
 .../edit/NormalizedOptimalStringAlignment.kt | 13 +-
 .../edit/OptimalStringAlignment.kt | 9 +-
 .../interfaces/MetricStringDistance.kt | 12 +-
 .../interfaces/NormalizedStringDistance.kt | 15 +-
 .../interfaces/NormalizedStringEditMeasure.kt | 12 +-
 .../interfaces/NormalizedStringSimilarity.kt | 15 +-
 .../interfaces/StringDistance.kt | 2 +
 .../interfaces/StringEditMeasure.kt | 2 +-
 .../interfaces/StringSimilarity.kt | 2 +
 25 files changed, 270 insertions(+), 115 deletions(-)

diff --git a/kt-string-similarity/dokka/includes/kt-string-similarity.md b/kt-string-similarity/dokka/includes/kt-string-similarity.md
index fd53ac8..052f011 100644
--- a/kt-string-similarity/dokka/includes/kt-string-similarity.md
+++ b/kt-string-similarity/dokka/includes/kt-string-similarity.md
@@ -4,14 +4,15 @@ Kotlin String Similarity is a Kotlin Multiplatform library for measuring and com

 Kotlin String Similarity implements various string similarity and distance measures.
 It contains over a dozen algorithms, including, but not limited to,
-[Levenshtein][ca.solostudios.stringsimilarity.Levenshtein] distance (and siblings),
+[Levenshtein][ca.solostudios.stringsimilarity.edit.Levenshtein] distance (and siblings),
 [Jaro-Winkler][ca.solostudios.stringsimilarity.JaroWinkler],
-[Longest Common Subsequence][ca.solostudios.stringsimilarity.LongestCommonSubsequence],
+[Longest Common Subsequence][ca.solostudios.stringsimilarity.edit.LCS],
 [Cosine similarity][ca.solostudios.stringsimilarity.Cosine],
 and many others.
 Check the summary table below for the complete list.

-This is project contains a port of tdebatty's
-[java-string-similarity](https://github.com/tdebatty/java-string-similarity) to Kotlin Multiplatform.
+This project was initially a port of tdebatty's
+[java-string-similarity](https://github.com/tdebatty/java-string-similarity) to Kotlin Multiplatform,
+but is now expanding upon it.

## Including @@ -20,28 +21,35 @@ You can include ${project.module} in your project by adding the following: ### Maven ```xml - - ${project.group} - ${project.module} - ${project.version} - + + + ${project.group} + ${project.module} + ${project.version} + + ``` ### Gradle Groovy DSL -```groovy -implementation '${project.group}:${project.module}:${project.version}' +```gradle +dependencies { + implementation '${project.group}:${project.module}:${project.version}' +} ``` ### Gradle Kotlin DSL ```kotlin -implementation("${project.group}:${project.module}:${project.version}") +dependencies { + implementation("${project.group}:${project.module}:${project.version}") +} ``` ### Gradle Version Catalog ```toml +[libraries] ${project.module} = { group = "${project.group}", name = "${project.module}", version = "${project.version}" } ``` @@ -51,42 +59,87 @@ The main characteristics of each implemented algorithm are presented below. The "cost" column gives an estimation of the computational cost to compute the similarity between two strings of length \\(m\\) and \\(n\\) respectively. 
-| Name | Similarity support | Normalized | Metric | Type | Cost | Typical usage | -|--------------------------------------|--------------------|------------|--------|---------|-------------------------------------|----------------------------------| -| Levenshtein | ☐ | ☐ | ☒ | | \\(O(m \\times n)\\) 1 | | -| Normalized Levenshtein | ☒ | ☒ | ☐ | | \\(O(m \\times n)\\) 1 | | -| Weighted Levenshtein | ☐ | ☐ | ☐ | | \\(O(m \\times n)\\) 1 | OCR | -| Damerau-Levenshtein3 | ☐ | ☐ | ☒ | | \\(O(m \\times n)\\) 1 | | -| Optimal String Alignment3 | ☐ | ☐ | ☐ | | \\(O(m \\times n)\\) 1 | | -| Jaro-Winkler | ☒ | ☒ | ☐ | | \\(O(m \\times n)\\) | typo correction | -| Longest Common Subsequence | ☐ | ☐ | ☐ | | \\(O(m \\times n)\\) 1,2 | diff utility, GIT reconciliation | -| Metric Longest Common Subsequence | ☐ | ☒ | ☒ | | \\(O(m \\times n)\\) 1,2 | | -| N-Gram | ☐ | ☒ | ☐ | | \\(O(m \\times n)\\) | | -| Q-Gram | ☐ | ☐ | ☐ | Profile | \\(O(m+n)\\) | | -| Cosine similarity | ☒ | ☒ | ☐ | Profile | \\(O(m+n)\\) | | -| Jaccard index | ☒ | ☒ | ☒ | Set | \\(O(m+n)\\) | | -| Sorensen-Dice coefficient | ☒ | ☒ | ☐ | Set | \\(O(m+n)\\) | | -| Ratcliff-Obershelp | ☒ | ☒ | ☐ | | ? | | - -1. In this library, Levenshtein edit distance, LCS distance and their sibblings are computed using the dynamic programming method, which - has a cost \\(O(m \\times n)\\). - For Levenshtein distance, the algorithm is sometimes called Wagner-Fischer algorithm ("The string-to-string correction problem", 1974). - The original algorithm uses a matrix of size m x n to store the Levenshtein distance between string prefixes. - - If the alphabet is finite, it is possible to use the method of four russians (Arlazarov et al. "On economic construction of the - transitive - closure of a directed graph", 1970) to speedup computation. - This was published by Masek in 1980 ("A Faster Algorithm Computing String Edit Distances"). - This method splits the matrix in blocks of size \\(t \\times t\\). 
- Each possible block is precomputed to produce a lookup table. - This lookup table can then be used to compute the string similarity (or distance) in \\(O(\\frac{nm}{t})\\). - Usually, \\(t\\) is chosen as \\(log(m)\\) if \\(m > n\\). - The resulting computation cost is thus \\(O(\\frac{mn}{log(m)})\\). - This method has not been implemented (yet). - -2. In "Length of Maximal Common Subsequences", K.S. Larsen proposed an algorithm that computes the length of LCS in time - \\(O(log(m) \\times log(n))\\). But the algorithm has a memory requirement \\(O(m \\times n^2)\\) and was thus not implemented here. - -3. There are two variants of Damerau-Levenshtein string distance: Damerau-Levenshtein with adjacent transpositions (also sometimes called - unrestricted Damerau–Levenshtein distance) and Optimal String Alignment (also sometimes called restricted edit distance). - For Optimal String Alignment, no substring can be edited more than once. +| Name | Distance | Similarity | Normalized | Metric | Memory cost | Execution cost | Typical usage | +|--------------------------------------------|:--------:|:----------:|:----------:|:------:|----------------------|------------------------------------|-----------------| +| Levenshtein | ☒ | ☐ | ☐ | ☒ | \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a] | | +| Damerau-Levenshtein[@ft-c] | ☒ | ☐ | ☐ | ☒ | \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a] | | +| Optimal String Alignment[@ft-c] | ☒ | ☐ | ☐ | ☒ | \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a] | | +| Longest Common Subsequence | ☒ | ☐ | ☐ | ☒ | \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a][@ft-b] | diff, git | +| Normalized Levenshtein | ☒ | ☒ | ☒ | ☒ | \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a] | | +| Normalized Damerau-Levenshtein[@ft-c] | ☒ | ☐ | ☒ | ☒ | \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a] | | +| Normalized Optimal String Alignment[@ft-c] | ☒ | ☐ | ☒ | ☒ | \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a] | | +| Normalized Longest 
Common Subsequence | ☒ | ☐ | ☒ | ☒ | \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a][@ft-b] | | +| Cosine similarity | ☒ | ☒ | ☒ | ☐ | \\(O(m + n)\\) | \\(O(m + n)\\) | | +| Jaccard index | ☒ | ☒ | ☒ | ☒ | \\(O(m + n)\\) | \\(O(m + n)\\) | | +| Jaro-Winkler | ☒ | ☒ | ☒ | ☐ | \\(O(m + n)\\) | \\(O(m \\times n)\\) | typo correction | +| N-Gram | ☒ | ☐ | ☒ | ☐ | | \\(O(m \\times n)\\) | | +| Q-Gram | ☒ | ☐ | ☐ | ☐ | | \\(O(m + n)\\) | | +| Ratcliff-Obershelp | ☒ | ☒ | ☒ | ☐ | \\(O(m + n)\\) | \\(O(n^3)\\) | | +| Sorensen-Dice coefficient | ☒ | ☒ | ☒ | ☐ | | \\(O(m + n)\\) | | +| Sift 4 | ☒ | ☐ | ☐ | ☐ | \\(O(m + n)\\) | \\(O(m + n)\\) | | + +

Notes

+
+
    +
  1. +
+In this library, Levenshtein edit distance, LCS distance and their siblings are computed using the dynamic
+programming method, which has a cost of \\(O(m \\times n)\\).
+For Levenshtein distance, the algorithm is sometimes called the Wagner-Fischer algorithm.[@ref-1]
+The original algorithm uses a matrix of size \\(m \\times n\\) to store the Levenshtein distance between string
+prefixes.
+
+If the alphabet is finite, it is possible to use the "Four-Russians" technique[@ref-2] to speed up computation,
+as shown by Masek and Paterson.[@ref-3]
+This method splits the matrix into blocks of size \\(t \\times t\\).
+Each possible block is precomputed to produce a lookup table.
+This lookup table can then be used to compute the string similarity (or distance) in \\(O(\\frac{n \\times m}{t})\\).
+Usually, \\(t\\) is chosen as \\(\\log(m)\\) if \\(m > n\\).
+The resulting computation cost is thus \\(O(\\frac{m \\times n}{\\log(m)})\\).
+This method has not been implemented (yet).
+
  2. +
  3. +
+K.S. Larsen proposed an algorithm that computes the length of the LCS in time
+\\(O(\\log(m) \\times \\log(n))\\).[@ref-4] However, the algorithm has a memory requirement of \\(O(m \\times n^2)\\) and was thus not
+implemented here.
+
  4. +
  5. +
+There are two variants of Damerau-Levenshtein string distance: Damerau-Levenshtein with adjacent transpositions
+(also sometimes called unrestricted Damerau–Levenshtein distance) and Optimal String Alignment (also sometimes called
+restricted edit distance). For Optimal String Alignment, no substring can be edited more than once.
+
  6. +
+
+ +
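The dynamic programming method described in the first note can be sketched as below. This is a minimal illustration, not the library's implementation: the function name `levenshteinSketch` is hypothetical, and it assumes unit costs for insertion, deletion and substitution, whereas the library's classes accept configurable weights. It keeps the full \\((m + 1) \\times (n + 1)\\) table, which is where the \\(O(m \\times n)\\) time and memory costs in the summary table come from.

```kotlin
// Illustrative sketch only; the name and the unit edit costs are assumptions,
// not the library's API.
fun levenshteinSketch(first: String, second: String): Int {
    val m = first.length
    val n = second.length

    // table[i][j] is the distance between the first i characters of `first`
    // and the first j characters of `second`.
    val table = Array(m + 1) { IntArray(n + 1) }

    // Turning a prefix into the empty string takes one deletion per character,
    // and building a prefix from the empty string takes one insertion per character.
    for (i in 0..m) table[i][0] = i
    for (j in 0..n) table[0][j] = j

    for (i in 1..m) {
        for (j in 1..n) {
            val substitutionCost = if (first[i - 1] == second[j - 1]) 0 else 1
            table[i][j] = minOf(
                table[i - 1][j] + 1,                   // deletion
                table[i][j - 1] + 1,                   // insertion
                table[i - 1][j - 1] + substitutionCost // match or substitution
            )
        }
    }
    return table[m][n]
}

fun main() {
    println(levenshteinSketch("kitten", "sitting")) // prints 3
}
```

Only the previous row is needed to fill the current one, so a variant that keeps two rows would reduce memory to \\(O(\\min(m, n))\\); the full table is shown here because it matches the description in the note.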

References

+
+
    +
  1. +
+Wagner, R. A., & Fischer, M. J. (1974-01). The string-to-string correction problem.
+*Journal of the ACM*, *21*(1), 168–173.
+[[sci-hub]](https://sci-hub.st/10.1145/321796.321811)
+
  2. +
  3. +
+Arlazarov, V. L., Dinitz, Y. A., Kronrod, M. A., & Faradzhev, I. (1970).
+On economical construction of the transitive closure of a directed graph.
+*Soviet Mathematics Doklady*, *194*(3), 487–488.
+
  4. +
  5. +
+Masek, W. J., & Paterson, M. S. (1980-02). A faster algorithm computing string
+edit distances. *Journal of Computer and System Sciences*, *20*(1), 18–31.
+[[sci-hub]](https://sci-hub.st/10.1016/0022-0000(80)90002-1)
+
  6. +
  7. +
+Larsen, K. S. (1992-10). Length of maximal common subsequences. *DAIMI Report
+Series*, *21*(426).
+[[sci-hub]](https://sci-hub.st/10.7146/dpb.v21i426.6740)
+
  8. +
+
+ diff --git a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/Cosine.kt b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/Cosine.kt index 219a05c..d4bcacc 100644 --- a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/Cosine.kt +++ b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/Cosine.kt @@ -33,10 +33,11 @@ import ca.solostudios.stringsimilarity.util.minMaxOf import kotlin.math.sqrt /** - * Implements Soft Cosine Similarity between strings. The strings are first - * transformed in vectors of occurrences of k-shingles (sequences of k - * characters). In this n-dimensional space, the similarity between the two - * strings is the Cosine of their respective vectors. + * Implements Soft Cosine Similarity between strings. + * + * The strings are first transformed in vectors of occurrences of k-shingles + * (sequences of k characters). In this n-dimensional space, the similarity + * between the two strings is the Cosine of their respective vectors. * * The Cosine similarity between strings \(X\) and \(Y\) is * the Cosine of the angle between the two strings as vectors. It is computed as: diff --git a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/Jaccard.kt b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/Jaccard.kt index 1712386..b867734 100644 --- a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/Jaccard.kt +++ b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/Jaccard.kt @@ -32,6 +32,8 @@ import ca.solostudios.stringsimilarity.interfaces.NormalizedStringDistance import ca.solostudios.stringsimilarity.interfaces.NormalizedStringSimilarity /** + * Implements the Jaccard index, also known as the Jaccard similarity coefficient (Jaccard, 1912). 
+ * * Each input string is converted into a set of n-grams, the Jaccard index is * then computed as \(\frac{\lVert V_1 \cap V_2 \rVert}{\lVert V_1 \cup V_2 \rVert}\). * Like Q-Gram distance, the input strings \(X\) and \(Y\) are first converted into sets of @@ -41,6 +43,11 @@ import ca.solostudios.stringsimilarity.interfaces.NormalizedStringSimilarity * The distance is computed as * \(1 - similarity(X, Y)\). * + * #### References + * Jaccard, P. (1912-02). The distribution of the flora in the alpine zone. + * *New Phytologist*, *11*(2), 37–50. + * [[sci-hub]](https://sci-hub.st/10.1111/j.1469-8137.1912.tb05611.x) + * * @see MetricStringDistance * @see NormalizedStringDistance * @see NormalizedStringSimilarity diff --git a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/JaroWinkler.kt b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/JaroWinkler.kt index ba86989..8b38fec 100644 --- a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/JaroWinkler.kt +++ b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/JaroWinkler.kt @@ -35,6 +35,8 @@ import kotlin.math.max import kotlin.math.min /** + * Implements the Jaro-Winkler distance (Winkler, 1990) between strings. + * * The Jaro–Winkler distance is designed and best suited for short * strings such as person names, and to detect typos; it is (roughly) a * variation of Damerau-Levenshtein, where the substitution of 2 close @@ -47,6 +49,11 @@ import kotlin.math.min * The distance is computed as * \(1 - similarity(X, Y)\). * + * #### References + * Winkler, W. E. (1990). String comparator metrics and enhanced decision rules + * in the fellegi-sunter model of record linkage. *Proceedings of the Survey + * Research Methods Section*, 354-359. + * * @param threshold The threshold value used for adding the Winkler bonus. 
* * @see NormalizedStringDistance diff --git a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/NGram.kt b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/NGram.kt index 0c4d49b..63679aa 100644 --- a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/NGram.kt +++ b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/NGram.kt @@ -35,15 +35,20 @@ import ca.solostudios.stringsimilarity.util.min import ca.solostudios.stringsimilarity.util.minMaxByLength /** - * N-Gram Similarity as defined by Kondrak, "N-Gram Similarity and Distance", - * String Processing and Information Retrieval, Lecture Notes in Computer - * Science Volume 3772, 2005, pp 115-126. + * Implements the N-Gram Similarity (Kondrak, 2005) between strings. * - * The algorithm uses affixing with special character '\0' to increase the + * The algorithm uses affixing with special character `'\0'` to increase the * weight of first characters. The normalization is achieved by dividing the * total similarity score the original length of the longest word. * - * [N-Gram Similarity and Distance](http://webdocs.cs.ualberta.ca/~kondrak/papers/spire05.pdf) + * The similarity is computed as + * \(1 - distance(X, Y)\). + * + * #### References + * Kondrak, G. (2005-11-02). N-gram similarity and distance. In String processing + * and information retrieval, lecture notes in computer science (Pages 115-126). + * Springer Berlin Heidelberg. 
+ * [[sci-hub]](https://sci-hub.st/10.1007/11575832_13) * * @see NormalizedStringDistance * @see NormalizedStringSimilarity diff --git a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/QGram.kt b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/QGram.kt index 5357695..e0b7ffd 100644 --- a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/QGram.kt +++ b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/QGram.kt @@ -32,10 +32,8 @@ import ca.solostudios.stringsimilarity.interfaces.StringDistance import kotlin.math.abs /** - * Q-gram distance, as defined by - * Esko Ukkonen. Bo, "Approximate string-matching with q-grams and maximal matches", in Theoretical Computer Science, - * vol. 92, no. 1, pp. 191-211, Elsevier BV, Jan. 1992, pp. 191–211, doi: 10.1016/0304-3975(92)90143-4. - * [[sci-hub]](https://sci-hub.st/https://doi.org/10.1016/0304-3975(92)90143-4) + * Implements the Q-gram distance (Ukkonen, 1992) between strings. + * * The distance between two strings is defined as * the number of occurrences of different q-grams in each string: * \(\sum_{i=1}^n \lVert \vec{v1_i} - \vec{v2_i} \rVert\). @@ -47,9 +45,14 @@ import kotlin.math.abs * resulting in \(distance(X, Y) = 0\) where \(X \neq Y\). * However, it does respect the other 3 axioms. * + * #### References + * Ukkonen, E. (1992-01). Approximate string matching with q-grams and maximal + * matches. *Theoretical Computer Science*, *92*(1), 191–211. + * [[sci-hub]](https://sci-hub.st/10.1016/0304-3975(92)90143-4) + * * @param q The length of each q-gram. 
* - * @throws IllegalArgumentException if \(k \leqslant 0\) + * @throws IllegalArgumentException if \(q \leqslant 0\) * * @author Thibault Debatty, solonovamax */ diff --git a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/RatcliffObershelp.kt b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/RatcliffObershelp.kt index 9240fa4..86ad8eb 100644 --- a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/RatcliffObershelp.kt +++ b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/RatcliffObershelp.kt @@ -31,7 +31,7 @@ import ca.solostudios.stringsimilarity.interfaces.NormalizedStringDistance import ca.solostudios.stringsimilarity.interfaces.NormalizedStringSimilarity /** - * Implements Ratcliff/Obershelp pattern recognition, also known as Gestalt pattern matching, + * Implements Ratcliff/Obershelp pattern recognition (Ratcliff & Metzener, 1988), also known as Gestalt pattern matching, * similarity between strings. * * The similarity is defined as @@ -41,6 +41,11 @@ import ca.solostudios.stringsimilarity.interfaces.NormalizedStringSimilarity * The distance is computed as * \(1 - similarity(X, Y)\). * + * #### References + * Ratcliff, J., & Metzener, D. E. (1988-07-01). Pattern matching: The gestalt ap- + * proach. *Dr. Dobb’s Journal*, *13*(7), 46. 
https://www.drdobbs.com/database/ + * pattern-matching-the-gestalt-approach/184407970?pgno=5 + * * @author [Ligi](https://github.com/dxpux), solonovamax, Ported to java from .net by denmase */ public class RatcliffObershelp : NormalizedStringSimilarity, NormalizedStringDistance { diff --git a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/ShingleBased.kt b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/ShingleBased.kt index 24d75b3..c450b7a 100644 --- a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/ShingleBased.kt +++ b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/ShingleBased.kt @@ -45,6 +45,11 @@ package ca.solostudios.stringsimilarity * documents like e-mails, \(k = 5\) is a recommended value. For large documents, * such as research articles, \(k = 9\) is considered a safe choice. * + * #### References + * Ukkonen, E. (1992-01). Approximate string matching with q-grams and maximal + * matches. *Theoretical Computer Science*, *92*(1), 191–211. + * [[sci-hub]](https://sci-hub.st/10.1016/0304-3975(92)90143-4) + * * @param k The length of k-shingles. * * @throws IllegalArgumentException if \(k \leqslant 0\) @@ -57,12 +62,12 @@ public abstract class ShingleBased(public val k: Int = DEFAULT_K) { } /** - * Compute and return the profile of s, as defined by Ukkonen "Approximate - * string-matching with q-grams and maximal matches". - * https://www.cs.helsinki.fi/u/ukkonen/TCS92.pdf The profile is the number + * Compute and return the profile of s, as defined by Ukkonen (Ukkonen 1992). + * The profile is the number * of occurrences of k-shingles, and is used to compute q-gram similarity, - * Jaccard index, etc. Pay attention: the memory requirement of the profile - * can be up to k * size of the string + * Jaccard index, etc. 
+ * Pay attention: the memory requirement of the profile + * can be up to \(k \times \text{size of the string}\) * * @param string * @return the profile of this string, as an unmodifiable Map diff --git a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/Sift4.kt b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/Sift4.kt index d8354d1..c8a7c79 100644 --- a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/Sift4.kt +++ b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/Sift4.kt @@ -34,14 +34,16 @@ import kotlin.math.max import kotlin.math.min /** - * Sift4 distance, as defined by - * Manda, Costin, [Siderite]. "Super Fast and Accurate String Distance Algorithm: Sift4." Siderite’s Blog, - * 10 Nov. 2014, https://siderite.dev/blog/super-fast-and-accurate-string-distance.html. + * Implements the Sift4 distance (Siderite, 2014) between strings. * * Note: this algorithm is asymmetric. This means that * \(distance(X, Y) \not\equiv distance(Y, X)\). * This is one of the artifacts of the linear nature of the algorithm. * + * #### References + * Costin [Siderite], M. (2014-11-10). Super fast and accurate string distance + * algorithm: Sift4. 
+ * * @author Thibault Debatty, solonovamax */ @ExperimentalStringMeasure diff --git a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/SorensenDice.kt b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/SorensenDice.kt index dae58b3..655b26d 100644 --- a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/SorensenDice.kt +++ b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/SorensenDice.kt @@ -31,8 +31,8 @@ import ca.solostudios.stringsimilarity.interfaces.NormalizedStringDistance import ca.solostudios.stringsimilarity.interfaces.NormalizedStringSimilarity /** - * Sørensen-Dice coefficient, aka Sørensen index, Dice's coefficient or - * Czekanowski's binary (non-quantitative) index. + * Implements the Sørensen-Dice coefficient, also known as Sørensen index (Sørensen, 1948), Dice's coefficient, or + * Czekanowski's binary (non-quantitative) index between strings. * * The strings are first converted to boolean sets of k-shingles (sequences * of k characters), then the similarity is computed as @@ -44,6 +44,12 @@ import ca.solostudios.stringsimilarity.interfaces.NormalizedStringSimilarity * The distance is computed as * \(1 - similarity(X, Y)\). * + * #### References + * Sørensen, T. J. (1948). A method of establishing group of equal amplitude in plant + * sociobiology based on similarity of species content and its application to + * analyses of the vegetation on danish commons. 
+ * *Kongelige Danske Videnskabernes Selskab.* + * * @author Thibault Debatty, solonovamax */ public class SorensenDice(k: Int = DEFAULT_K) : ShingleBased(k), NormalizedStringDistance, NormalizedStringSimilarity { diff --git a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/edit/DamerauLevenshtein.kt b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/edit/DamerauLevenshtein.kt index dec134f..fa87b8b 100644 --- a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/edit/DamerauLevenshtein.kt +++ b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/edit/DamerauLevenshtein.kt @@ -35,8 +35,8 @@ import ca.solostudios.stringsimilarity.util.min import kotlin.math.min /** - * Implementation of Damerau-Levenshtein distance with transposition (also - * sometimes calls unrestricted Damerau-Levenshtein distance). + * Implements the Damerau-Levenshtein distance (Damerau, 1964) with transposition + * (also sometimes calls unrestricted Damerau-Levenshtein distance). * It is the minimum number of operations needed to transform one string into * the other, where an operation is defined as an insertion, deletion, or * substitution of a single character, or a transposition of two adjacent @@ -52,6 +52,11 @@ import kotlin.math.min * **Note: Because this class currently implements the dynamic programming approach, * it has a space requirement \(O(m \times n)\)** * + * #### References + * Damerau, F. J. (1964-03). A technique for computer detection and correction of + * spelling errors. *Communications of the ACM*, *7*(3), 171-176. + * [[sci-hub]](https://sci-hub.st/10.1145/363958.363994) + * * @param insertionWeight The weight of an insertion. Represented as \(w_i\). Must be in the range \([0, 1 \times 10^{10} ]\). * @param deletionWeight The weight of a deletion. Represented as \(w_d\). Must be in the range \([0, 1 \times 10^{10} ]\). 
* @param substitutionWeight The weight of a substitution. Represented as \(w_s\). Must be in the range \([0, 1 \times 10^{10} ]\). diff --git a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/edit/LCS.kt b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/edit/LCS.kt index 80631fc..4348efc 100644 --- a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/edit/LCS.kt +++ b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/edit/LCS.kt @@ -35,7 +35,7 @@ import ca.solostudios.stringsimilarity.interfaces.StringSimilarity import kotlin.math.min /** - * The Longest Common Subsequence (LCS) problem consists in finding the longest + * Implements the Longest Common Subsequence (LCS) problem consists in finding the longest * subsequence common to two (or more) sequences. It differs from problems of * finding common substrings: unlike substrings, subsequences are not required * to occupy consecutive positions within the original sequences. diff --git a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/edit/Levenshtein.kt b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/edit/Levenshtein.kt index 57f67a5..db09032 100644 --- a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/edit/Levenshtein.kt +++ b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/edit/Levenshtein.kt @@ -35,7 +35,7 @@ import ca.solostudios.stringsimilarity.interfaces.StringSimilarity import ca.solostudios.stringsimilarity.util.min /** - * The Levenshtein distance, or edit distance, between two words is the + * Implements the Levenshtein distance (Levenshtein, 1966), or edit distance, between two words is the * minimum number of single-character edits (insertions, deletions, or * substitutions) required to change one word into the other. 
* @@ -56,6 +56,10 @@ import ca.solostudios.stringsimilarity.util.min * **Note: Because this class currently implements the dynamic programming approach, * it has a space requirement \(O(m \times n)\)** * + * #### References + * Levenshtein, V. I. (1966-02). Binary codes capable of correcting deletions, + * insertions and reversals. *Soviet Physics Doklady*, *10*, 707-710. + * * @param insertionWeight The weight of an insertion. Represented as \(w_i\). Must be in the range \([0, 1 \times 10^{10} ]\). * @param deletionWeight The weight of a deletion. Represented as \(w_d\). Must be in the range \([0, 1 \times 10^{10} ]\). * @param substitutionWeight The weight of a substitution. Represented as \(w_s\). Must be in the range \([0, 1 \times 10^{10} ]\). diff --git a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/edit/NormalizedDamerauLevenshtein.kt b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/edit/NormalizedDamerauLevenshtein.kt index 6068ae8..d4947dc 100644 --- a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/edit/NormalizedDamerauLevenshtein.kt +++ b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/edit/NormalizedDamerauLevenshtein.kt @@ -37,11 +37,8 @@ import ca.solostudios.stringsimilarity.interfaces.StringEditMeasure import ca.solostudios.stringsimilarity.interfaces.StringSimilarity /** - * A normalized metric based the [Damerau Levenshtein][DamerauLevenshtein] distance, as defined by - * L. Yujian and L. Bo, "A Normalized Levenshtein Distance Metric", - * in IEEE Transactions on Pattern Analysis and Machine Intelligence, - * vol. 29, no. 6, pp. 1091-1095, June 2007, doi: 10.1109/TPAMI.2007.1078. - * [[sci-hub]](https://sci-hub.st/https://ieeexplore.ieee.org/document/4160958) + * Implements a normalized metric based the [Damerau Levenshtein][DamerauLevenshtein] + * distance (Yujian & Bo, 2007). 
* * The normalized Damerau Levenshtein distance between Strings \(X\) and \(Y\) is: * \(\frac{2 \times distance_{damerau levenshtein}(X, Y)}{w_d \lvert X \rvert + w_i \lvert Y \rvert + distance_{damerau levenshtein}(X, Y)}\). @@ -53,6 +50,12 @@ import ca.solostudios.stringsimilarity.interfaces.StringSimilarity * which implements the dynamic programming approach, * it has a space requirement \(O(m \times n)\)** * + * #### References + * Yujian, L., & Bo, L. (2007-06). A normalized levenshtein distance metric. + * IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6), + * 1091-1095. + * [[sci-hub]](https://sci-hub.st/10.1109/tpami.2007.1078) + * * @param insertionWeight The weight of an insertion. Represented as \(w_i\). Must be in the range \([0, 1 \times 10^{10} ]\). * @param deletionWeight The weight of a deletion. Represented as \(w_d\). Must be in the range \([0, 1 \times 10^{10} ]\). * @param substitutionWeight The weight of a substitution. Represented as \(w_s\). Must be in the range \([0, 1 \times 10^{10} ]\). diff --git a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/edit/NormalizedLCS.kt b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/edit/NormalizedLCS.kt index f294be9..1773fbe 100644 --- a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/edit/NormalizedLCS.kt +++ b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/edit/NormalizedLCS.kt @@ -36,11 +36,8 @@ import ca.solostudios.stringsimilarity.interfaces.StringEditMeasure import ca.solostudios.stringsimilarity.interfaces.StringSimilarity /** - * A normalized metric based on the [Longest Common Subsequence][LCS] distance, as defined by - * L. Yujian and L. Bo, "A Normalized Levenshtein Distance Metric", - * in IEEE Transactions on Pattern Analysis and Machine Intelligence, - * vol. 29, no. 6, pp. 1091-1095, June 2007, doi: 10.1109/TPAMI.2007.1078. 
- * [[sci-hub]](https://sci-hub.st/https://ieeexplore.ieee.org/document/4160958) + * Implements a normalized metric based on the [Longest Common Subsequence][LCS] + * distance (Yujian & Bo, 2007). * * The normalized LCS distance between Strings \(X\) and \(Y\) is: * \(\frac{2 \times distance_{LCS}(X, Y)}{\lvert X \rvert + \lvert Y \rvert + distance_{LCS}(X, Y)}\), @@ -53,6 +50,12 @@ import ca.solostudios.stringsimilarity.interfaces.StringSimilarity * which implements the dynamic programming approach, * it has a space requirement \(O(m \times n)\)** * + * #### References + * Yujian, L., & Bo, L. (2007-06). A normalized levenshtein distance metric. + * IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6), + * 1091-1095. + * [[sci-hub]](https://sci-hub.st/10.1109/tpami.2007.1078) + * * @param insertionWeight The weight of an insertion. Represented as \(w_i\). Must be in the range \([0, 1 \times 10^{10} ]\). * @param deletionWeight The weight of a deletion. Represented as \(w_d\). Must be in the range \([0, 1 \times 10^{10} ]\). * diff --git a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/edit/NormalizedLevenshtein.kt b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/edit/NormalizedLevenshtein.kt index 1b8279c..4615df1 100644 --- a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/edit/NormalizedLevenshtein.kt +++ b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/edit/NormalizedLevenshtein.kt @@ -36,11 +36,8 @@ import ca.solostudios.stringsimilarity.interfaces.StringEditMeasure import ca.solostudios.stringsimilarity.interfaces.StringSimilarity /** - * A normalized metric based the [Levenshtein] distance, as defined by - * L. Yujian and L. Bo, "A Normalized Levenshtein Distance Metric", - * in IEEE Transactions on Pattern Analysis and Machine Intelligence, - * vol. 29, no. 6, pp. 1091-1095, June 2007, doi: 10.1109/TPAMI.2007.1078. 
- * [[sci-hub]](https://sci-hub.st/https://ieeexplore.ieee.org/document/4160958) + * Implements a normalized metric based on the [Levenshtein] + * distance (Yujian & Bo, 2007). * * The normalized Levenshtein distance between Strings \(X\) and \(Y\) is: * \(\frac{2 \times distance_{levenshtein}(X, Y)}{w_d \lvert X \rvert + w_i \lvert Y \rvert + distance_{levenshtein}(X, Y)}\). @@ -52,6 +49,13 @@ import ca.solostudios.stringsimilarity.interfaces.StringSimilarity * which implements the dynamic programming approach, * it has a space requirement \(O(m \times n)\)** * + * + * #### References + * Yujian, L., & Bo, L. (2007-06). A normalized levenshtein distance metric. + * IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6), + * 1091-1095. + * [[sci-hub]](https://sci-hub.st/10.1109/tpami.2007.1078) + * * @param insertionWeight The weight of an insertion. Represented as \(w_i\). Must be in the range \([0, 1 \times 10^{10} ]\). * @param deletionWeight The weight of a deletion. Represented as \(w_d\). Must be in the range \([0, 1 \times 10^{10} ]\). * @param substitutionWeight The weight of a substitution. Represented as \(w_s\). Must be in the range \([0, 1 \times 10^{10} ]\). diff --git a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/edit/NormalizedOptimalStringAlignment.kt b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/edit/NormalizedOptimalStringAlignment.kt index 4dffbc1..57cc49c 100644 --- a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/edit/NormalizedOptimalStringAlignment.kt +++ b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/edit/NormalizedOptimalStringAlignment.kt @@ -37,11 +37,8 @@ import ca.solostudios.stringsimilarity.interfaces.StringEditMeasure import ca.solostudios.stringsimilarity.interfaces.StringSimilarity /** - * A normalized metric based the [Optimal String Alignment][OptimalStringAlignment] distance, as defined by - * L.
Yujian and L. Bo, "A Normalized Levenshtein Distance Metric", - * in IEEE Transactions on Pattern Analysis and Machine Intelligence, - * vol. 29, no. 6, pp. 1091-1095, June 2007, doi: 10.1109/TPAMI.2007.1078. - * [[sci-hub]](https://sci-hub.st/https://ieeexplore.ieee.org/document/4160958) + * Implements a normalized metric based on the [Optimal String Alignment][OptimalStringAlignment] + * distance (Yujian & Bo, 2007). * * The normalized Optimal String Alignment distance between Strings \(X\) and \(Y\) is: * \(\frac{2 \times distance_{OSA}(X, Y)}{w_d \lvert X \rvert + w_i \lvert Y \rvert + distance_{OSA}(X, Y)}\). @@ -53,6 +50,12 @@ import ca.solostudios.stringsimilarity.interfaces.StringSimilarity * which implements the dynamic programming approach, * it has a space requirement \(O(m \times n)\)** * + * #### References + * Yujian, L., & Bo, L. (2007-06). A normalized levenshtein distance metric. + * IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6), + * 1091-1095. + * [[sci-hub]](https://sci-hub.st/10.1109/tpami.2007.1078) + * * @param insertionWeight The weight of an insertion. Represented as \(w_i\). Must be in the range \([0, 1 \times 10^{10} ]\). * @param deletionWeight The weight of a deletion. Represented as \(w_d\). Must be in the range \([0, 1 \times 10^{10} ]\). * @param substitutionWeight The weight of a substitution. Represented as \(w_s\). Must be in the range \([0, 1 \times 10^{10} ]\). 
diff --git a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/edit/OptimalStringAlignment.kt b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/edit/OptimalStringAlignment.kt index 5b0ab87..3e5abb4 100644 --- a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/edit/OptimalStringAlignment.kt +++ b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/edit/OptimalStringAlignment.kt @@ -35,8 +35,8 @@ import ca.solostudios.stringsimilarity.util.min import kotlin.math.min /** - * Implementation of the the Optimal String Alignment (sometimes called the - * restricted edit distance) variant of the Damerau-Levenshtein distance. + * Implements the Optimal String Alignment algorithm, sometimes called the + * restricted edit distance variant of the Damerau-Levenshtein distance (Damerau, 1964). * * The difference between the two algorithms consists in that the Optimal String * Alignment algorithm computes the number of edit operations needed to make the @@ -51,6 +51,11 @@ import kotlin.math.min * The similarity is computed as * \(\frac{w_d \lvert X \rvert + w_i \lvert Y \rvert - distance(X, Y)}{2}\). * + * #### References + * Damerau, F. J. (1964-03). A technique for computer detection and correction of + * spelling errors. *Communications of the ACM*, *7*(3), 171-176. + * [[sci-hub]](https://sci-hub.st/10.1145/363958.363994) + * * @param insertionWeight The weight of an insertion. Represented as \(w_i\). Must be in the range \([0, 1 \times 10^{10} ]\). * @param deletionWeight The weight of a deletion. Represented as \(w_d\). Must be in the range \([0, 1 \times 10^{10} ]\). * @param substitutionWeight The weight of a substitution. Represented as \(w_s\). Must be in the range \([0, 1 \times 10^{10} ]\). 
diff --git a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/interfaces/MetricStringDistance.kt b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/interfaces/MetricStringDistance.kt index d1e00d2..109d2d8 100644 --- a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/interfaces/MetricStringDistance.kt +++ b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/interfaces/MetricStringDistance.kt @@ -48,14 +48,18 @@ package ca.solostudios.stringsimilarity.interfaces * equal to or less than the sum of the lengths of the other two sides. * (eg. \(d(x, z) \leqslant d(x, y) + d(y, z)\)) * - > This is a natural property of both physical and metaphorical notions of distance: - * you can arrive at z from x by taking a detour through y, - * but this will not make your journey any faster than the shortest path. + * > you can arrive at z from x by taking a detour through y, + * > but this will not make your journey any faster than the shortest path. + * > ("Metric space", 2023) * - * [Wikipedia](https://en.wikipedia.org/wiki/Metric_space#Definition) - * [[archive.org]](https://web.archive.org/web/20230709193203/https://en.wikipedia.org/wiki/Metric_space#Definition) * * Where \(d(x, y)\) is the distance between the strings \(x\) and \(y\). * + * #### References + * Wikipedia contributors. (2023-09-19). Metric space — Wikipedia, the free encyclopedia. 
+ * Retrieved 2023-09-29, from + * [[archive.org]](https://web.archive.org/web/20230709193203/https://en.wikipedia.org/wiki/Metric_space#Definition) + * * @author Thibault Debatty, solonovamax */ public interface MetricStringDistance : StringDistance { diff --git a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/interfaces/NormalizedStringDistance.kt b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/interfaces/NormalizedStringDistance.kt index ff28064..1405f83 100644 --- a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/interfaces/NormalizedStringDistance.kt +++ b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/interfaces/NormalizedStringDistance.kt @@ -35,16 +35,25 @@ package ca.solostudios.stringsimilarity.interfaces * - `1` indicates that neither string have anything in common. * - If two strings are identical, then it should always return `0`. * - * As stated in - * [Computation of Normalized Edit Distance and Applications](https://www.csie.ntu.edu.tw/~b93076/Computation%20of%20Normalized%20Edit%20Distance%20and%20Applications.pdf) - * [[archive.org]](https://web.archive.org/web/20220303061601/https://www.csie.ntu.edu.tw/~b93076/Computation%20of%20Normalized%20Edit%20Distance%20and%20Applications.pdf), + * The normalized similarity of any normalized string measure can always be computed as + * \(1 - distance(X, Y)\). * + * As stated in "Computation of Normalized Edit Distance and Applications", * > Given two strings \(x\) and \(y\) over a finite alphabet, * > the normalized edit distance between \(x\) and \(y\), \(d(x,y)\) * > is defined as the minimum of \(W(p)/L(p)\), * > here \(p\) is an editing path between \(x\) and \(y\), \(W(p)\) * > is the sum of the weights of the elementary edit operations of \(p\), * > and \(L(p)\) is the number of these operations (length of \(p\)). 
+ * > (Marzal & Vidal, 1993) + * + * #### References + * Marzal, A., & Vidal, E. (1993-09). Computation of normalized edit distance and + * applications. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, + * *15*(9), 926–932. + * [[sci-hub]](https://sci-hub.st/10.1109/34.232078) + * + * @see NormalizedStringSimilarity * * @author Thibault Debatty, solonovamax */ diff --git a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/interfaces/NormalizedStringEditMeasure.kt b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/interfaces/NormalizedStringEditMeasure.kt index 023cc42..c939b9e 100644 --- a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/interfaces/NormalizedStringEditMeasure.kt +++ b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/interfaces/NormalizedStringEditMeasure.kt @@ -32,9 +32,17 @@ package ca.solostudios.stringsimilarity.interfaces * Normalized string edit measure returns a similarity or distance, * relative to the number of edits that must be performed to a string, * which is then normalized according to the function. + * It is normalized according to "A normalized levenshtein distance metric." + * (Yujian & Bo, 2007) * - * The normalized edit distance between Strings \(X\) and \(Y\) is: - * \(\frac{2 \times distance_{edit}(X, Y)}{w_d \lvert X \rvert + w_i \lvert Y \rvert + distance_{levenshtein}(X, Y)}\). + * The normalized edit distance between strings \(X\) and \(Y\) is: + * \(\frac{2 \times distance(X, Y)}{w_d \lvert X \rvert + w_i \lvert Y \rvert + distance(X, Y)}\). + * + * #### References + * Yujian, L., & Bo, L. (2007-06). A normalized levenshtein distance metric. + * IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6), + * 1091-1095. 
+ * [[sci-hub]](https://sci-hub.st/10.1109/tpami.2007.1078) * * @see StringEditMeasure * @see NormalizedStringSimilarity diff --git a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/interfaces/NormalizedStringSimilarity.kt b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/interfaces/NormalizedStringSimilarity.kt index 42ee9f6..6af9cec 100644 --- a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/interfaces/NormalizedStringSimilarity.kt +++ b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/interfaces/NormalizedStringSimilarity.kt @@ -35,16 +35,25 @@ package ca.solostudios.stringsimilarity.interfaces * - `1` indicates that both strings are equivalent. Equivalent strings are not necessarily identical. * - If two strings are identical, then it should always return `1`. * - * As stated in - * [Computation of Normalized Edit Distance and Applications](https://www.csie.ntu.edu.tw/~b93076/Computation%20of%20Normalized%20Edit%20Distance%20and%20Applications.pdf) - * [[archive.org]](https://web.archive.org/web/20220303061601/https://www.csie.ntu.edu.tw/~b93076/Computation%20of%20Normalized%20Edit%20Distance%20and%20Applications.pdf), + * The normalized similarity of any normalized string measure can always be computed as + * \(1 - distance(X, Y)\). * + * As stated in "Computation of Normalized Edit Distance and Applications", * > Given two strings \(x\) and \(y\) over a finite alphabet, * > the normalized edit distance between \(x\) and \(y\), \(d(x,y)\) * > is defined as the minimum of \(W(p)/L(p)\), * > here \(p\) is an editing path between \(x\) and \(y\), \(W(p)\) * > is the sum of the weights of the elementary edit operations of \(p\), * > and \(L(p)\) is the number of these operations (length of \(p\)). + * > (Marzal & Vidal, 1993) + * + * #### References + * Marzal, A., & Vidal, E. (1993-09).
Computation of normalized edit distance and + * applications. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, + * *15*(9), 926–932. + * [[sci-hub]](https://sci-hub.st/10.1109/34.232078) + * + * @see NormalizedStringDistance * * @author Thibault Debatty, solonovamax */ diff --git a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/interfaces/StringDistance.kt b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/interfaces/StringDistance.kt index 54b34fc..62fdf2a 100644 --- a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/interfaces/StringDistance.kt +++ b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/interfaces/StringDistance.kt @@ -34,6 +34,8 @@ package ca.solostudios.stringsimilarity.interfaces * - `0` indicates that both strings are *equivalent*. Equivalent strings are not necessarily identical. * - If two strings are identical, then it should always return `0`. * + * @see StringSimilarity + * * @author Thibault Debatty, solonovamax */ public interface StringDistance { diff --git a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/interfaces/StringEditMeasure.kt b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/interfaces/StringEditMeasure.kt index e8a627e..effcfb0 100644 --- a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/interfaces/StringEditMeasure.kt +++ b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/interfaces/StringEditMeasure.kt @@ -32,7 +32,7 @@ package ca.solostudios.stringsimilarity.interfaces * String edit measure returns a similarity or distance, * relative to the number of edits that must be performed to a string. * - * The similarity is computed as + * The similarity between strings \(X\) and \(Y\) is: * \(\frac{w_d \lvert X \rvert + w_i \lvert Y \rvert - distance(X, Y)}{2}\). 
* * @see MetricStringDistance diff --git a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/interfaces/StringSimilarity.kt b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/interfaces/StringSimilarity.kt index cd55980..c7a8288 100644 --- a/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/interfaces/StringSimilarity.kt +++ b/kt-string-similarity/src/commonMain/kotlin/ca/solostudios/stringsimilarity/interfaces/StringSimilarity.kt @@ -35,6 +35,8 @@ package ca.solostudios.stringsimilarity.interfaces * - `0` indicates that neither string have anything in common. * - If two strings are identical and non-empty, then it should never return `0`. * + * @see StringDistance + * * @author Thibault Debatty, solonovamax */ public interface StringSimilarity {