This repository has been archived by the owner on Nov 30, 2023. It is now read-only.

Add polyphones map to override tones for common polyphonic hanzi #210

Open: chambm wants to merge 6 commits into master

chambm commented Aug 20, 2022

This workaround should address most of the problems mentioned in #173. It's certainly not perfect, but it should mean having to correct tones in simple sentences a lot less often. A further update could add two-syllable polyphones; that should only require adding them to the polyphones.tsv file.

I guess I could have put the polyphones into the SQLite file, but that seemed like overkill. I didn't actually run the tests. I developed this entirely by editing the add-in's Python files and seeing what happened when I ran Anki. :)

noinkling commented Aug 21, 2022

This is unnecessary. The Unihan database provides the most common reading (according to their sources) for individual characters in the kMandarin field, and that data is already included in the SQLite database here, in the hanzi table (the easiest way to view it is with a client like DBeaver). I checked a selection of the characters from your table against it, and the readings agree.

As I stated here, really all that needs to be done is to make it so that individual characters don't use the word/dictionary table (cidian).
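To illustrate the idea, here is a minimal sketch of that short-circuit. Only the table names `hanzi` and `cidian` come from this thread; the column names, the `lookup_reading` helper, and the sample rows are assumptions made up for the example, and the add-in's real schema may well differ.

```python
import sqlite3

# Hypothetical, simplified schema: only the table names `hanzi` and
# `cidian` come from this thread. Column names and rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hanzi (cp TEXT PRIMARY KEY, kMandarin TEXT);
    CREATE TABLE cidian (word TEXT, pinyin TEXT);
    INSERT INTO hanzi VALUES ('好', 'hǎo');
    INSERT INTO cidian VALUES ('好', 'hào');
    INSERT INTO cidian VALUES ('你好', 'nǐ hǎo');
""")


def lookup_reading(text):
    """Return a reading for `text`, or None if nothing is found."""
    if len(text) == 1:
        # Single character: try the per-character table first and
        # bypass the word dictionary if it has an entry.
        row = conn.execute(
            "SELECT kMandarin FROM hanzi WHERE cp = ?", (text,)
        ).fetchone()
        if row is not None:
            return row[0]
    # Multi-character words (and characters missing from `hanzi`)
    # fall back to the word dictionary.
    row = conn.execute(
        "SELECT pinyin FROM cidian WHERE word = ? ORDER BY rowid LIMIT 1",
        (text,),
    ).fetchone()
    return row[0] if row else None
```

With the sample rows, a lone 好 takes its reading from `hanzi` even though `cidian` has a different entry for it, while 你好 still goes through `cidian`.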

chambm commented Aug 21, 2022

Bugger, I didn't notice that; I was focused on the cidian table. They do agree 98% of the time, but there are cases like this:

hanzi zidian laopo
de dì de5
chà chā cha1
báo bó bao2
dòu dǒu dou3
shí shì shi2

According to the Unihan docs, the second reading in each of these pairs is the Taiwanese variant. I'm not sure I buy that. 地 is the 46th most frequent hanzi in the giga-zh list, almost certainly because it's a grammar particle, and when it's used as a particle I think it's always pronounced de5. My (mainland) wife says so anyway, and this indicates that Taiwanese speakers also say de:
https://forvo.com/word/%E5%9C%B0%EF%BC%88de%EF%BC%89/#zh

It's obviously important we get 地 right when it's used as a particle...how should we handle that?
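For reference, a small sketch of how a two-value reading could be split by region, following the Unihan convention that the first value is the zh-Hans (CN) preference and the second the zh-Hant (TW) preference. The `split_kmandarin` helper is hypothetical, and the space-separated input format is an assumption about how the value is stored.

```python
def split_kmandarin(value):
    """Split a kMandarin value into (zh-Hans reading, zh-Hant reading).

    Hypothetical helper: assumes readings are separated by whitespace.
    """
    readings = value.split()
    if not readings:
        raise ValueError("empty kMandarin value")
    # A single value applies to both regions. With two values, the first
    # is the zh-Hans (CN) preference and the second the zh-Hant (TW) one.
    return readings[0], readings[-1]
```

For a pair written like the rows above, `split_kmandarin("de dì")` separates the two readings, and a single-valued field yields the same reading for both regions.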

noinkling commented Aug 21, 2022

Yeah, my guess is that the Unihan people didn't give any consideration to the different meanings in the case of 地 (or if they did, they decided it was out of scope). Rather, the Taiwanese source(s) they used said that di4 (i.e. the noun meaning) is more common for whatever reason, so it was simply included. Keep in mind that the readings given are almost certainly also based on how often the character appears in compounds, whereas for our purposes we're mostly only considering standalone meanings, since compounds should (ideally) be covered by the word dictionary.

Ultimately, automated Chinese word segmentation and transliteration are hard problems (this is an active area of research) because the right answer so often depends on contextual meaning. Whatever approach is taken is going to be flawed in some way; it's always going to be "best effort".

Anyway, these kinds of discrepancies are probably best solved by adding "manual override" data: basically what you were already doing, but only for characters where we think the kMandarin readings aren't ideal. For what it's worth, given my limited knowledge, I think I agree with your wife about the more likely readings for 差 and 斗.
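Roughly, the override layer could look like the sketch below. The two-column TSV layout, the `load_overrides` and `pick_reading` helpers, and the `kmandarin` dict standing in for the hanzi table are all assumptions for illustration; the real polyphones.tsv may be laid out differently.

```python
import csv
import io

# Hypothetical two-column layout (character, reading); the real
# polyphones.tsv may be laid out differently.
SAMPLE_TSV = "地\tde5\n"


def load_overrides(tsv_file):
    """Read character -> reading overrides from a tab-separated file."""
    overrides = {}
    for row in csv.reader(tsv_file, delimiter="\t"):
        # Skip blank, short, or comment rows.
        if len(row) >= 2 and not row[0].startswith("#"):
            overrides[row[0]] = row[1]
    return overrides


def pick_reading(char, overrides, kmandarin):
    """Prefer a manual override; otherwise fall back to kMandarin."""
    if char in overrides:
        return overrides[char]
    return kmandarin.get(char)


overrides = load_overrides(io.StringIO(SAMPLE_TSV))
kmandarin = {"地": "dì", "好": "hǎo"}  # stand-in for the hanzi table
```

Here 地 resolves to the override, while 好 has no override and keeps its kMandarin reading, so the override file only needs the handful of characters where the default isn't ideal.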

chambm commented Aug 21, 2022

OK, here's the subset of polyphones.tsv where my wife's pick differs from kMandarin (or where kMandarin is a pair). I'm going to ask her to double-check these, since it's a small set. So basically I'll update polyphones to this (including any later adjustments), and then add short-circuiting so that single-character words use the hanzi table's kMandarin instead of cidian?

There are also about 250 two-character words in cidian with ambiguous pronunciations. I'll try to add those as well, because I'm sure some of those meanings are much more commonly used than others.

chambm added 3 commits August 21, 2022 20:34
The removed polyphones will get the correct pinyin from the hanzi table.
Fix Ruby/Bopomofo test (maybe due to using hanzi transcription)
Move polyphones.tsv to the right location
chambm commented Aug 22, 2022

Surprisingly, it took me a while to find even a few two-character polyphones that didn't already transcribe the way my wife picked. So some of the two-character polyphones in the override table are possibly redundant, but that could change if SQLite's query optimization changes or an ORDER BY clause gets added to the query (as it probably should be).
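To show why the ORDER BY matters: without one, which of several matching rows comes back first is not guaranteed. Below is a sketch of a deterministic pick. The `freq` column, its values, and the `best_reading` helper are invented for the example; I'm not claiming cidian has a ranking column like this.

```python
import sqlite3

# Hypothetical schema: the `freq` ranking column and its values are
# made up for illustration and are not part of the real cidian table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cidian (word TEXT, pinyin TEXT, freq INTEGER);
    INSERT INTO cidian VALUES ('大夫', 'dà fū', 1);
    INSERT INTO cidian VALUES ('大夫', 'dài fu', 5);
""")


def best_reading(word):
    """Return the highest-ranked reading for `word`, or None."""
    row = conn.execute(
        # Explicit ordering (with pinyin as a tie-breaker) makes the
        # chosen row independent of insertion order or query planning.
        "SELECT pinyin FROM cidian WHERE word = ? "
        "ORDER BY freq DESC, pinyin LIMIT 1",
        (word,),
    ).fetchone()
    return row[0] if row else None
```

With an explicit ordering like this, the override table would only be needed where the ranking itself picks the wrong reading.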

I actually ran the tests before those last commits. They all pass for me.
