Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add check for unicode normalization in config #44

Merged
merged 2 commits into from
Apr 24, 2024
Merged

Add check for unicode normalization in config #44

merged 2 commits into from
Apr 24, 2024

Conversation

newsch
Copy link
Collaborator

@newsch newsch commented Apr 21, 2024

This ensures that the config sections match Wikipedia's Unicode
normalization. We could also normalize every section in every article to
handle an edge case, but I don't think that's worth it yet.

This ensures that the config sections match Wikipedia's Unicode
normalization. We could also normalize every section in every article to
handle an edge case, but I don't think that's worth it yet.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
Copy link
Member

@biodranik biodranik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can config values be always normalized after the loading, and each wiki section be normalized before comparison? Or is there a comparison function that automatically handle normalization?

@newsch
Copy link
Collaborator Author

newsch commented Apr 22, 2024

Can config values be always normalized after the loading, and each wiki section be normalized before comparison?

The wiki section headers are already normalized to NFC by Wikipedia as explained here, with this exception (which I think would be unlikely in our use case):

MediaWiki doesn't apply any normalization to its output, for example cafe<nowiki/>́ becomes "café" (shows U+0065 U+0301 in a row, without precomposed characters like U+00E9 appearing).

We could normalize the config values after loading and either normalize each header (to double-check) or not.

In this case, since normalization behavior is sometimes unexpected and most keyboards/websites are already normalized, I think it's better to make the change explicit. If someone copies the header from Wikipedia or types it out, it should already be normalized.

Or is there a comparison function that automatically handle normalization?

I'm not aware of a library that handles case-sensitive unicode normalization comparison. We could make a small wrapper around the unicode-normalization iterators with .zip() and .all().

We could also do case-insensitive comparisons, there are:

regex also supports case-insensitive Unicode comparison, building a single regex automata from all fixed string for a language would probably be the best-performing solution. It doesn't support normalization though.


Regardless, before choosing any of those methods I think we should aggregate all the headers of the articles and see if any are:

  • not normalized
  • mixed case

Then we could compare a couple different methods and see if they give different results.

This PR just makes sure we don't have any characters in the wrong form for the current bytewise comparison.

Copy link
Member

@biodranik biodranik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the details!
Ok, let's apply the current approach and only fix/react in the future if some bugs are discovered.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
@newsch newsch merged commit cd03fed into main Apr 24, 2024
1 check passed
@newsch newsch deleted the normalization branch April 24, 2024 14:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants