[DOC] Tokenizers - Classic #8357

Merged

kolchfa-aws merged 15 commits into opensearch-project:main from leanneeliatra:tokenizer-classic

Jan 3, 2025

Contributor

leanneeliatra commented Sep 24, 2024

Description

Adds documentation for the `classic` tokenizer to the Analyzers section.

Issues Resolved

Part of #1483 addressed in this PR.

Version

All

Frontend features

n/a

Checklist

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developer Certificate of Origin.
For more information on following the Developer Certificate of Origin and signing off your commits, please check here.

leanneeliatra added 2 commits

September 24, 2024 10:48


          adding in classic tokenizer page

ceff6a6

Signed-off-by: leanne.laceybyrne@eliatra.com <leanne.laceybyrne@eliatra.com>


          removing unneeded whitespace

495a8d8

Signed-off-by: leanne.laceybyrne@eliatra.com <leanne.laceybyrne@eliatra.com>

github-actions bot commented Sep 24, 2024

Thank you for submitting your PR. The PR states are In progress (or Draft) -> Tech review -> Doc review -> Editorial review -> Merged.

Before you submit your PR for doc review, make sure the content is technically accurate. If you need help finding a tech reviewer, tag a maintainer.

When you're ready for doc review, tag the assignee of this PR. The doc reviewer may push edits to the PR directly or leave comments and editorial suggestions for you to address (let us know in a comment if you have a preference). The doc reviewer will arrange for an editorial review.

github-actions bot assigned kolchfa-aws

leanneeliatra marked this pull request as ready for review

September 24, 2024 09:53

leanneeliatra requested review from kolchfa-aws, Naarcha-AWS, vagimeli, AMoo-Miki, natebower, dlvenable, stephen-crawford and epugh as code owners

September 24, 2024 09:53

kolchfa-aws assigned vagimeli and unassigned kolchfa-aws

vagimeli added the 3 - Tech review, Needs SME, and analyzers labels

Contributor

vagimeli commented Sep 24, 2024

@udabhas Will you review this PR for technical accuracy, or have a peer review it? Thank you.


          tokenizers does now have children

e80f5ea

Signed-off-by: leanne.laceybyrne@eliatra.com <leanne.laceybyrne@eliatra.com>

vagimeli added the Content gap label

leanneeliatra and others added 5 commits

October 4, 2024 14:19


          Merge branch 'main' into tokenizer-classic

e998f91


          Merge branch 'main' into tokenizer-classic

d7de779


          doc: small update for page numbers

f5021f3

Signed-off-by: leanne.laceybyrne@eliatra.com <leanne.laceybyrne@eliatra.com>


          Merge branch 'main' into tokenizer-classic

caa0811


          format: updates to layout and formatting of page

Signed-off-by: leanne.laceybyrne@eliatra.com <leanne.laceybyrne@eliatra.com>

udabhas reviewed

_analyzers/tokenizers/classic-tokenizer.md Outdated

+              By analyzing the text "Send an email to john.doe@example.com or call 555-1234!", we can see the punctuation has been removed, while email and phone number
+              ```
+               "Send", "an", "email", "to", "john.doe", "example.com", "or", "call", "555-1234"

udabhas Dec 9, 2024

nit: I was wondering if it would be better to show the entire output, as there are different token types:
{"<ALPHANUM>", "<APOSTROPHE>", "<ACRONYM>", "<COMPANY>", "<EMAIL>", "<HOST>", "<NUM>", "<CJ>", "<ACRONYM_DEP>"}

{
  "tokens": [
    {
      "token": "Send",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "an",
      "start_offset": 5,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "email",
      "start_offset": 8,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "to",
      "start_offset": 14,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "john.doe@example.com",
      "start_offset": 17,
      "end_offset": 37,
      "type": "<EMAIL>",
      "position": 4
    },
    {
      "token": "or",
      "start_offset": 38,
      "end_offset": 40,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "call",
      "start_offset": 41,
      "end_offset": 45,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "555-1234",
      "start_offset": 46,
      "end_offset": 54,
      "type": "<NUM>",
      "position": 7
    }
  ]
}
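
For readers who want to reproduce output like the above, here is a minimal Python sketch. It assumes the text is sent to the `_analyze` API with the `classic` tokenizer; the helper names `build_analyze_request` and `tokens_by_type` are hypothetical and not part of this PR, and the sample response is a trimmed copy of the output quoted in the comment above rather than a live result.

```python
import json


def build_analyze_request(text):
    """Build a request body for the _analyze API using the classic tokenizer."""
    return {"tokenizer": "classic", "text": text}


def tokens_by_type(response):
    """Group token strings from an _analyze-style response by their token type."""
    grouped = {}
    for tok in response.get("tokens", []):
        grouped.setdefault(tok["type"], []).append(tok["token"])
    return grouped


# Trimmed copy of the response quoted in the review comment (not fetched live).
sample_response = {
    "tokens": [
        {"token": "Send", "type": "<ALPHANUM>", "position": 0},
        {"token": "john.doe@example.com", "type": "<EMAIL>", "position": 4},
        {"token": "555-1234", "type": "<NUM>", "position": 7},
    ]
}

if __name__ == "__main__":
    body = build_analyze_request(
        "Send an email to john.doe@example.com or call 555-1234!"
    )
    print(json.dumps(body, indent=2))
    print(tokens_by_type(sample_response))
```

Grouping by `type` makes it easy to see at a glance which inputs were recognized as `<EMAIL>` or `<NUM>` rather than plain `<ALPHANUM>` tokens.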

kolchfa-aws added 3 commits

January 2, 2025 13:30


          Doc review

5b8506c

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>


          Change example

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>


          Small rewrite

2f76758

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

kolchfa-aws assigned kolchfa-aws and unassigned vagimeli

natebower requested changes

Collaborator

natebower left a comment

@kolchfa-aws Please see my comments and changes and tag me for approval when complete. Thanks!

_analyzers/tokenizers/classic.md

+              The `classic` tokenizer parses text as follows:
+              - **Punctuation**: Splits text at most punctuation marks and removes punctuation characters. Dots that aren't followed by spaces are treated as part of the token.
+              - **Hyphens**: Splits words at hyphens, except when a number is present. When a number is present in a token, the token is not split and is treated like a product number.

Collaborator

natebower Jan 3, 2025

Like a "product number" specifically?

Collaborator

kolchfa-aws Jan 3, 2025

Yes
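
The hyphen rule in the diff above can be illustrated with a rough Python sketch. This is only a toy model of the rule as worded in the documentation, assuming whitespace-separated words; it is not the tokenizer's actual grammar and has not been checked against OpenSearch output. The function names are hypothetical.

```python
def split_on_hyphens(word):
    """Toy model of the documented rule: keep a hyphenated word whole if it
    contains a digit (product-number style); otherwise split at hyphens."""
    if any(ch.isdigit() for ch in word):
        return [word]
    return [part for part in word.split("-") if part]


def tokenize_words(text):
    """Apply the toy hyphen rule to each whitespace-separated word."""
    tokens = []
    for word in text.split():
        tokens.extend(split_on_hyphens(word))
    return tokens


if __name__ == "__main__":
    # Output of this sketch only, not of OpenSearch: ['wi', 'fi', 'PAL-1234']
    print(tokenize_words("wi-fi PAL-1234"))
```

To confirm real behavior, the same input can be run through the `_analyze` API with the `classic` tokenizer.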

udabhas approved these changes

udabhas left a comment

Looks good to me!

kolchfa-aws reviewed


          Apply suggestions from code review

8179ffa

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

Collaborator

kolchfa-aws commented Jan 3, 2025

@natebower Comments addressed - please review again. Thanks!


          Merge branch 'main' into tokenizer-classic

ac12e56

natebower reviewed

natebower approved these changes

Collaborator

natebower left a comment

@kolchfa-aws LGTM with one minor deletion. Thanks!

kolchfa-aws and others added 2 commits

January 3, 2025 11:58


          Update _analyzers/tokenizers/classic.md

94434a5

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>


          Merge branch 'main' into tokenizer-classic

6b54b56

kolchfa-aws merged commit cdf2e30 into opensearch-project:main

5 checks passed

kolchfa-aws added the backport 2.18 label

opensearch-trigger-bot bot pushed a commit that referenced this pull request


          [DOC] Tokenizers - Classic (#8357)

4bb58f4

* adding in classic tokenizer page

Signed-off-by: leanne.laceybyrne@eliatra.com <leanne.laceybyrne@eliatra.com>

* removing unneeded whitespace

Signed-off-by: leanne.laceybyrne@eliatra.com <leanne.laceybyrne@eliatra.com>

* tokenizers does now have children

Signed-off-by: leanne.laceybyrne@eliatra.com <leanne.laceybyrne@eliatra.com>

* doc: small update for page numbers

Signed-off-by: leanne.laceybyrne@eliatra.com <leanne.laceybyrne@eliatra.com>

* format: updates to layout and formatting of page

Signed-off-by: leanne.laceybyrne@eliatra.com <leanne.laceybyrne@eliatra.com>

* Doc review

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Change example

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Small rewrite

Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

* Update _analyzers/tokenizers/classic.md

Co-authored-by: Nathan Bower <nbower@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

---------

Signed-off-by: leanne.laceybyrne@eliatra.com <leanne.laceybyrne@eliatra.com>
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Fanit Kolchina <kolchfa@amazon.com>
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
(cherry picked from commit cdf2e30)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

opensearch-trigger-bot bot mentioned this pull request

[Backport 2.18] [DOC] Tokenizers - Classic #9012

Merged

github-actions bot pushed a commit that referenced this pull request


          [DOC] Tokenizers - Classic (#8357) (#9012)

ab35095

Reviewers

kolchfa-aws left review comments

natebower approved these changes

udabhas approved these changes

Awaiting requested review from Naarcha-AWS (code owner)

Awaiting requested review from vagimeli

Awaiting requested review from AMoo-Miki (code owner)

Awaiting requested review from dlvenable (code owner)

Awaiting requested review from stephen-crawford

Awaiting requested review from epugh (code owner)

Labels

3 - Tech review, analyzers, backport 2.18, Content gap, Needs SME