From 9bc25c99037f16a838d558203fee4c020012758c Mon Sep 17 00:00:00 2001 From: Anton Rubin Date: Mon, 30 Sep 2024 14:38:53 +0100 Subject: [PATCH 1/3] adding reverse token filter docs #8274 Signed-off-by: Anton Rubin --- _analyzers/token-filters/index.md | 2 +- _analyzers/token-filters/reverse.md | 77 +++++++++++++++++++++++++++++ 2 files changed, 78 insertions(+), 1 deletion(-) create mode 100644 _analyzers/token-filters/reverse.md diff --git a/_analyzers/token-filters/index.md b/_analyzers/token-filters/index.md index a9b621d5ab..7a73a27b22 100644 --- a/_analyzers/token-filters/index.md +++ b/_analyzers/token-filters/index.md @@ -49,7 +49,7 @@ Normalization | `arabic_normalization`: [ArabicNormalizer](https://lucene.apache `porter_stem` | [PorterStemFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/en/PorterStemFilter.html) | Uses the [Porter stemming algorithm](https://tartarus.org/martin/PorterStemmer/) to perform algorithmic stemming for the English language. `predicate_token_filter` | N/A | Removes tokens that don’t match the specified predicate script. Supports inline Painless scripts only. `remove_duplicates` | [RemoveDuplicatesTokenFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/miscellaneous/RemoveDuplicatesTokenFilter.html) | Removes duplicate tokens that are in the same position. -`reverse` | [ReverseStringFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/reverse/ReverseStringFilter.html) | Reverses the string corresponding to each token in the token stream. For example, the token `dog` becomes `god`. +[`reverse`]({{site.url}}{{site.baseurl}}/analyzers/token-filters/reverse/) | [ReverseStringFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/reverse/ReverseStringFilter.html) | Reverses the string corresponding to each token in the token stream. For example, the token `dog` becomes `god`. 
`shingle` | [ShingleFilter](https://lucene.apache.org/core/9_10_0/analysis/common/org/apache/lucene/analysis/shingle/ShingleFilter.html) | Generates shingles of lengths between `min_shingle_size` and `max_shingle_size` for tokens in the token stream. Shingles are similar to n-grams but apply to words instead of letters. For example, two-word shingles added to the list of unigrams [`contribute`, `to`, `opensearch`] are [`contribute to`, `to opensearch`]. `snowball` | N/A | Stems words using a [Snowball-generated stemmer](https://snowballstem.org/). You can use the `snowball` token filter with the following languages in the `language` field: `Arabic`, `Armenian`, `Basque`, `Catalan`, `Danish`, `Dutch`, `English`, `Estonian`, `Finnish`, `French`, `German`, `German2`, `Hungarian`, `Irish`, `Italian`, `Kp`, `Lithuanian`, `Lovins`, `Norwegian`, `Porter`, `Portuguese`, `Romanian`, `Russian`, `Spanish`, `Swedish`, `Turkish`. `stemmer` | N/A | Provides algorithmic stemming for the following languages in the `language` field: `arabic`, `armenian`, `basque`, `bengali`, `brazilian`, `bulgarian`, `catalan`, `czech`, `danish`, `dutch`, `dutch_kp`, `english`, `light_english`, `lovins`, `minimal_english`, `porter2`, `possessive_english`, `estonian`, `finnish`, `light_finnish`, `french`, `light_french`, `minimal_french`, `galician`, `minimal_galician`, `german`, `german2`, `light_german`, `minimal_german`, `greek`, `hindi`, `hungarian`, `light_hungarian`, `indonesian`, `irish`, `italian`, `light_italian`, `latvian`, `Lithuanian`, `norwegian`, `light_norwegian`, `minimal_norwegian`, `light_nynorsk`, `minimal_nynorsk`, `portuguese`, `light_portuguese`, `minimal_portuguese`, `portuguese_rslp`, `romanian`, `russian`, `light_russian`, `sorani`, `spanish`, `light_spanish`, `swedish`, `light_swedish`, `turkish`. 
diff --git a/_analyzers/token-filters/reverse.md b/_analyzers/token-filters/reverse.md new file mode 100644 index 0000000000..9ac212e6bc --- /dev/null +++ b/_analyzers/token-filters/reverse.md @@ -0,0 +1,77 @@ +--- +layout: default +title: Reverse +parent: Token filters +nav_order: 360 +--- + +# Reverse token filter + +The `reverse` token filter reverses the order of the characters in each token. + +## Example + +The following example request creates a new index named `my-reverse-index` and configures an analyzer with `reverse`: + +```json +PUT /my-reverse-index +{ + "settings": { + "analysis": { + "filter": { + "reverse_filter": { + "type": "reverse" + } + }, + "analyzer": { + "my_reverse_analyzer": { + "type": "custom", + "tokenizer": "standard", + "filter": [ + "lowercase", + "reverse_filter" + ] + } + } + } + } +} + +``` +{% include copy-curl.html %} + +## Generated tokens + +Use the following request to examine the tokens generated using the analyzer: + +```json +GET /my-reverse-index/_analyze +{ + "analyzer": "my_reverse_analyzer", + "text": "hello world" +} +``` +{% include copy-curl.html %} + +The response contains the generated tokens: + +```json +{ + "tokens": [ + { + "token": "olleh", + "start_offset": 0, + "end_offset": 5, + "type": "<ALPHANUM>", + "position": 0 + }, + { + "token": "dlrow", + "start_offset": 6, + "end_offset": 11, + "type": "<ALPHANUM>", + "position": 1 + } + ] +} +``` \ No newline at end of file From b4f2bc91383c1c4a0f01a37c41c8e3d619ae6bfc Mon Sep 17 00:00:00 2001 From: Fanit Kolchina Date: Mon, 18 Nov 2024 16:01:18 -0500 Subject: [PATCH 2/3] Doc review Signed-off-by: Fanit Kolchina --- _analyzers/token-filters/reverse.md | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/_analyzers/token-filters/reverse.md b/_analyzers/token-filters/reverse.md index 9ac212e6bc..51975cd4c4 100644 --- a/_analyzers/token-filters/reverse.md +++ b/_analyzers/token-filters/reverse.md @@ -7,11 +7,22 @@ nav_order: 360 # Reverse token filter -The
`reverse` token filter reverses the order of the characters in each token. +The `reverse` token filter reverses the order of the characters in each token, making suffix information accessible at the start of the reversed tokens during analysis. + +This is useful for suffix-based searches: + +The `reverse` token filter in Elasticsearch is useful when you need to perform suffix-based searches. This includes scenarios such as: + +- **Suffix matching**: Searching for words based on their endings, like identifying words that end with specific patterns (for example, `-tion` or `-ing`). +- **File extension searches**: Searching for files by their extensions, such as `.txt` or `.jpg`. +- **Custom sorting or ranking**: By reversing tokens, you can implement unique sorting or ranking logic based on suffixes. +- **Autocomplete for suffixes**: Implementing autocomplete suggestions that focus on suffixes rather than prefixes. + +The `reverse` token filter works by reversing the order of characters in each token, making suffix information accessible at the start of the reversed tokens during analysis. 
## Example -The following example request creates a new index named `my-reverse-index` and configures an analyzer with `reverse`: +The following example request creates a new index named `my-reverse-index` and configures an analyzer with a `reverse` filter: ```json PUT /my-reverse-index @@ -36,7 +47,6 @@ PUT /my-reverse-index } } } - ``` {% include copy-curl.html %} From d0e076daf6dcdb4f21773b12720ae4e3352c66ea Mon Sep 17 00:00:00 2001 From: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> Date: Mon, 2 Dec 2024 11:18:43 -0500 Subject: [PATCH 3/3] Apply suggestions from code review Co-authored-by: Nathan Bower Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com> --- _analyzers/token-filters/reverse.md | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/_analyzers/token-filters/reverse.md b/_analyzers/token-filters/reverse.md index 51975cd4c4..dc48f07e77 100644 --- a/_analyzers/token-filters/reverse.md +++ b/_analyzers/token-filters/reverse.md @@ -7,18 +7,17 @@ nav_order: 360 # Reverse token filter -The `reverse` token filter reverses the order of the characters in each token, making suffix information accessible at the start of the reversed tokens during analysis. +The `reverse` token filter reverses the order of the characters in each token, making suffix information accessible at the beginning of the reversed tokens during analysis. This is useful for suffix-based searches: -The `reverse` token filter in Elasticsearch is useful when you need to perform suffix-based searches. This includes scenarios such as: +The `reverse` token filter is useful when you need to perform suffix-based searches, such as in the following scenarios: -- **Suffix matching**: Searching for words based on their endings, like identifying words that end with specific patterns (for example, `-tion` or `-ing`). 
+- **Suffix matching**: Searching for words based on their suffixes, such as identifying words with a specific ending (for example, `-tion` or `-ing`). - **File extension searches**: Searching for files by their extensions, such as `.txt` or `.jpg`. - **Custom sorting or ranking**: By reversing tokens, you can implement unique sorting or ranking logic based on suffixes. -- **Autocomplete for suffixes**: Implementing autocomplete suggestions that focus on suffixes rather than prefixes. +- **Autocomplete for suffixes**: Implementing autocomplete suggestions that use suffixes rather than prefixes. -The `reverse` token filter works by reversing the order of characters in each token, making suffix information accessible at the start of the reversed tokens during analysis. ## Example