
# Edge n-gram tokenizer

The `edge_ngram` tokenizer generates partial word tokens, or _n-grams_, starting from the beginning of each word. It splits the text based on specified characters and produces tokens within a defined minimum and maximum length range. This tokenizer is particularly useful for implementing search-as-you-type functionality.

Edge n-grams are ideal for autocomplete searches where the order of the words may vary, such as when searching for product names or addresses. For more information, see [Autocomplete]({{site.url}}{{site.baseurl}}/search-plugins/searching-data/autocomplete/). However, for text with a fixed order, like movie or song titles, the completion suggester may be more accurate.

By default, the `edge_ngram` tokenizer produces tokens with a minimum length of `1` and a maximum length of `2`. For example, analyzing the text `OpenSearch` with the default configuration produces only the `O` and `Op` n-grams. These short n-grams are rarely sufficient for meaningful searches, so you'll usually need to configure the tokenizer with longer gram lengths.
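
You can observe this default behavior by using the `_analyze` API. For example, the following request tokenizes `OpenSearch` with the unconfigured `edge_ngram` tokenizer and returns only the tokens `O` and `Op`:

```json
POST /_analyze
{
  "tokenizer": "edge_ngram",
  "text": "OpenSearch"
}
```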

## Example usage

The following example request creates a new index named `edge_n_gram_index` and configures an analyzer with an `edge_ngram` tokenizer. The tokenizer produces tokens 3--6 characters in length, considering both letters and symbols to be valid token characters:

```json
PUT /edge_n_gram_index
```

The response to a subsequent `_analyze` request contains the generated tokens.
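
A fuller configuration consistent with this description might look like the following sketch; the analyzer and tokenizer names are illustrative:

```json
PUT /edge_n_gram_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_edge_analyzer": {
          "tokenizer": "my_edge_tokenizer"
        }
      },
      "tokenizer": {
        "my_edge_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 6,
          "token_chars": ["letter", "symbol"]
        }
      }
    }
  }
}
```

With this configuration, analyzing the text `OpenSearch` using `my_edge_analyzer` would be expected to return the tokens `Ope`, `Open`, `OpenS`, and `OpenSe`.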

## Parameters

| Parameter | Required/Optional | Data type | Description |
|:-------|:--------|:------|:---|
| `min_gram` | Optional | Integer | The minimum token length. Default is `1`. |
| `max_gram` | Optional | Integer | The maximum token length. Default is `2`. |
| `custom_token_chars` | Optional | String | Defines custom characters to be treated as part of a token (for example, `+-_`). |
| `token_chars` | Optional | Array of strings | Defines character classes to include in tokens. Tokens are split on characters not included in these classes. Default includes all characters. Available classes include: <br> - `letter`: Alphabetic characters (for example, `a` or `ç`) <br> - `digit`: Numeric characters (for example, `3` or `7`) <br> - `punctuation`: Punctuation symbols (for example, `!` or `?`) <br> - `symbol`: Other symbols (for example, `$`) <br> - `whitespace`: Space or newline characters <br> - `custom`: Allows you to specify custom characters in the `custom_token_chars` setting. |
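
The `custom` class and the `custom_token_chars` setting are useful for values such as product codes that mix letters, digits, and separators. The following sketch, in which the index, tokenizer, and analyzer names are illustrative, keeps `+`, `-`, and `_` as token characters:

```json
PUT /product_code_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "code_edge_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 8,
          "token_chars": ["letter", "digit", "custom"],
          "custom_token_chars": "+-_"
        }
      },
      "analyzer": {
        "code_autocomplete_analyzer": {
          "tokenizer": "code_edge_tokenizer"
        }
      }
    }
  }
}
```

With this tokenizer, a value such as `AB-12_C` is treated as a single word, so its edge n-grams (`AB`, `AB-`, `AB-1`, and so on) retain the separators.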


## `max_gram` parameter limitations

The `max_gram` parameter sets the maximum length of tokens generated by the tokenizer. When a search query exceeds this length, it may fail to match any terms in the index.

For example, if `max_gram` is set to `4`, the query `explore` would be tokenized as `expl` during indexing. As a result, a search for the full term `explore` will not match the indexed token `expl`.

To address this limitation, you can apply a `truncate` token filter to shorten search terms to the maximum token length. However, this approach presents trade-offs. Truncating `explore` to `expl` might lead to matches with unrelated terms like `explosion` or `explicit`, reducing search precision.
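
The following sketch shows this approach applied at search time, assuming a `max_gram` value of `4` as in the preceding example; the filter and analyzer names are illustrative:

```json
PUT /truncate_example_index
{
  "settings": {
    "analysis": {
      "filter": {
        "truncate_to_max_gram": {
          "type": "truncate",
          "length": 4
        }
      },
      "analyzer": {
        "truncating_search_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "truncate_to_max_gram"]
        }
      }
    }
  }
}
```

Used as a search analyzer, this configuration truncates `explore` to `expl` so that it matches the indexed edge n-grams, with the precision trade-off described previously.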

We recommend carefully balancing the `max_gram` value to ensure efficient tokenization while minimizing irrelevant matches. If precision is critical, consider alternative strategies, such as adjusting query analyzers or fine-tuning filters.

## Best practices

We recommend using the `edge_ngram` tokenizer only at indexing time to ensure partial word tokens are stored. At search time, a simpler analyzer should be used to match full user queries.

## Configuring search-as-you-type functionality

To implement search-as-you-type functionality, use the `edge_ngram` tokenizer during indexing and an analyzer that performs minimal processing at search time. The following example demonstrates this approach.

Create an index with an `edge_ngram` tokenizer:

```json
PUT /my-autocomplete-index
```

Index a sample document:

```json
PUT my-autocomplete-index/_doc/1?refresh
```

This configuration ensures that the `edge_ngram` tokenizer breaks terms like Laptop into tokens such as `La`, `Lap`, and `Lapt`, allowing partial matches during search. At search time, the standard tokenizer simplifies queries while ensuring matches are case-insensitive because of the lowercase filter.
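
A sketch of an index configuration matching this description follows. The field name `product`, the analyzer and tokenizer names, and the `min_gram` and `max_gram` values of `2` and `10` are assumptions inferred from the token examples rather than taken from the original request:

```json
PUT /my-autocomplete-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete_index_analyzer": {
          "tokenizer": "autocomplete_tokenizer",
          "filter": ["lowercase"]
        },
        "autocomplete_search_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "autocomplete_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "product": {
        "type": "text",
        "analyzer": "autocomplete_index_analyzer",
        "search_analyzer": "autocomplete_search_analyzer"
      }
    }
  }
}
```

Under these assumptions, the sample document indexed above could be as simple as `{"product": "Laptop Pro"}`.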

Searches for `laptop Pr` or `lap pr` now retrieve the relevant document based on partial matches:

```json
GET my-autocomplete-index/_search
```
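
A sketch of such a query, using the illustrative `product` field from the configuration sketch above, might be the following:

```json
GET my-autocomplete-index/_search
{
  "query": {
    "match": {
      "product": "laptop Pr"
    }
  }
}
```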
