Skip to content

Latest commit

 

History

History
78 lines (70 loc) · 2.97 KB

50_Character_folding.asciidoc

File metadata and controls

78 lines (70 loc) · 2.97 KB

Unicode Character Folding

In the same way as the lowercase token filter is a good starting point for many languages but falls short when exposed to the entire tower of Babel, so the asciifolding token filter requires a more effective Unicode character-folding counterpart for dealing with the many languages of the world.

The icu_folding token filter (provided by the icu plug-in) does the same job as the asciifolding filter, but extends the transformation to scripts that are not ASCII-based, such as Greek, Hebrew, Han, conversion of numbers in other scripts into their Latin equivalents, plus various other numeric, symbolic, and punctuation transformations.

The icu_folding token filter applies Unicode normalization and case folding from nfkc_cf automatically, so the icu_normalizer is not required:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_folder": {
          "tokenizer": "icu_tokenizer",
          "filter":  [ "icu_folding" ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=my_folder
١٢٣٤٥ (1)
  1. The Arabic numerals ١٢٣٤٥ are folded to their Latin equivalent: 12345.

If there are particular characters that you would like to protect from folding, you can use a UnicodeSet (much like a character class in regular expressions) to specify which Unicode characters may be folded. For instance, to exclude the Swedish letters å, ä, ö, Å, Ä, and Ö from folding, you would specify a character class representing all Unicode characters, except for those letters: [^åäöÅÄÖ] (^ means everything except).

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "swedish_folding": { (1)
          "type": "icu_folding",
          "unicodeSetFilter": "[^åäöÅÄÖ]"
        }
      },
      "analyzer": {
        "swedish_analyzer": { (2)
          "tokenizer": "icu_tokenizer",
          "filter":  [ "swedish_folding", "lowercase" ]
        }
      }
    }
  }
}
  1. The swedish_folding token filter customizes the icu_folding token filter to exclude Swedish letters, both uppercase and lowercase.

  2. The swedish analyzer first tokenizes words, then folds each token by using the swedish_folding filter, and then lowercases each token in case it includes some of the uppercase excluded letters: Å, Ä, or Ö.