Add check for unicode normalization in config #44

newsch · 2024-04-21T16:42:58Z

This ensures that the config sections match Wikipedia's Unicode
normalization. We could also normalize every section in every article to
handle an edge case, but I don't think that's worth it yet.

This ensures that the config sections match Wikipedia's Unicode normalization. We could also normalize every section in every article to handle an edge case, but I don't think that's worth it yet. Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>

biodranik

Can config values be always normalized after the loading, and each wiki section be normalized before comparison? Or is there a comparison function that automatically handle normalization?

newsch · 2024-04-22T16:00:41Z

Can config values be always normalized after the loading, and each wiki section be normalized before comparison?

The wiki section headers are already normalized to NFC by Wikipedia as explained here, with this exception (which I think would be unlikely in our use case):

MediaWiki doesn't apply any normalization to its output, for example cafe<nowiki/>́ becomes "café" (shows U+0065 U+0301 in a row, without precomposed characters like U+00E9 appearing).

We could normalize the config values after loading and either normalize each header (to double-check) or not.

In this case, since normalization behavior is sometimes unexpected and most keyboards/websites are already normalized, I think it's better to make the change explicit. If someone copies the header from Wikipedia or types it out, it should already be normalized.

Or is there a comparison function that automatically handle normalization?

I'm not aware of a library that handles case-sensitive unicode normalization comparison. We could make a small wrapper around the unicode-normalization iterators with .zip() and .all().

We could also do case-insensitive comparisons, there are:

regex also supports case-insensitive Unicode comparison, building a single regex automata from all fixed string for a language would probably be the best-performing solution. It doesn't support normalization though.

Regardless, before choosing any of those methods I think we should aggregate all the headers of the articles and see if any are:

not normalized
mixed case

Then we could compare a couple different methods and see if they give different results.

This PR just makes sure we don't have any characters in the wrong form for the current bytewise comparison.

biodranik

Thanks for the details!
Ok, let's apply the current approach and only fix/react in the future if some bugs are discovered.

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>

biodranik reviewed Apr 21, 2024

View reviewed changes

biodranik approved these changes Apr 22, 2024

View reviewed changes

Add caveats to section removal doc

62a16d2

Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>

newsch merged commit cd03fed into main Apr 24, 2024
1 check passed

newsch deleted the normalization branch April 24, 2024 14:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add check for unicode normalization in config #44

Add check for unicode normalization in config #44

newsch commented Apr 21, 2024

biodranik left a comment

newsch commented Apr 22, 2024 •

edited

Loading

biodranik left a comment

Add check for unicode normalization in config #44

Add check for unicode normalization in config #44

Conversation

newsch commented Apr 21, 2024

biodranik left a comment

Choose a reason for hiding this comment

newsch commented Apr 22, 2024 • edited Loading

biodranik left a comment

Choose a reason for hiding this comment

newsch commented Apr 22, 2024 •

edited

Loading