Add polyphones map to override tones for common polyphonic hanzi #210

chambm · 2022-08-20T23:15:19Z

This workaround should address most problems mentioned in #173. It's certainly not perfect, but it will mean having to correct tones in simple sentences a lot less often. A further update could add two-syllable polyphones. That should only require adding them to the polyphones.tsv file.
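A minimal sketch of how such an override map could work, assuming polyphones.tsv is a two-column, tab-separated file (hanzi, then numbered pinyin). The function names `load_polyphones` and `transcribe_with_override`, the file layout, and the sample rows are illustrative assumptions, not the add-on's actual code.

```python
import csv
import io


def load_polyphones(fileobj):
    """Read tab-separated (hanzi, pinyin) rows into a dict of overrides."""
    overrides = {}
    for row in csv.reader(fileobj, delimiter="\t"):
        # Skip blank, malformed, or comment rows.
        if len(row) < 2 or row[0].startswith("#"):
            continue
        overrides[row[0]] = row[1]
    return overrides


def transcribe_with_override(word, overrides, fallback):
    """Return the override if one exists, else defer to the normal lookup."""
    if word in overrides:
        return overrides[word]
    return fallback(word)


# Hypothetical file contents, standing in for polyphones.tsv.
sample = io.StringIO("地\tde5\n差\tcha1\n")
overrides = load_polyphones(sample)

# The lambdas stand in for the existing dictionary lookup.
result_hit = transcribe_with_override("地", overrides, lambda w: "di4")
result_miss = transcribe_with_override("好", overrides, lambda w: "hao3")
```

Because the map is keyed on the whole word, two-syllable entries would need no code change under this layout, only extra rows in the file.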

I guess I could have put the polyphones into the SQLite file, but that seemed like overkill. I didn't actually run the tests. I developed this entirely by editing the add-on's Python files and seeing what happened when I ran Anki. :)

noinkling · 2022-08-21T01:16:19Z

This is unnecessary: the Unihan database provides the most common reading (according to their sources) for individual characters in the kMandarin field, and this data is already included in the SQLite database here, in the hanzi table (the easiest way to view it is with a client such as DBeaver). I checked a selection of the characters from your table and they agree on the reading.

As I stated here, really all that needs to be done is to make it so that individual characters don't use the word/dictionary table (cidian).
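A rough sketch of that short-circuit, using an in-memory SQLite database. The column names (`cp`, `kMandarin`, `simplified`, `pinyin`), the sample rows, and the `lookup` helper are assumptions for illustration; the real schema may differ. Taking the first value when kMandarin holds a pair is also an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE hanzi (cp TEXT PRIMARY KEY, kMandarin TEXT);
    CREATE TABLE cidian (simplified TEXT, pinyin TEXT);
    INSERT INTO hanzi VALUES ('好', 'hǎo');
    INSERT INTO cidian VALUES ('好', 'hao4');
    INSERT INTO cidian VALUES ('好', 'hao3');
    INSERT INTO cidian VALUES ('你好', 'ni3 hao3');
    """
)


def lookup(conn, word):
    """Single characters use the hanzi table; longer words use cidian."""
    if len(word) == 1:
        row = conn.execute(
            "SELECT kMandarin FROM hanzi WHERE cp = ?", (word,)
        ).fetchone()
        if row and row[0]:
            # Assumption: if the field holds several readings, take the first.
            return row[0].split()[0]
    # Multi-character words, or characters missing from hanzi, use cidian.
    row = conn.execute(
        "SELECT pinyin FROM cidian WHERE simplified = ?", (word,)
    ).fetchone()
    return row[0] if row else None
```

With these sample rows, `lookup(conn, "好")` comes from the hanzi table, so the two competing cidian rows for 好 never get consulted, while `lookup(conn, "你好")` still goes through cidian.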

chambm · 2022-08-21T04:23:42Z

Bugger, I didn't notice that. I was focused on the cidian table. They do agree 98%. There are cases like this:

hanzi	zidian	laopo
地	de dì	de5
差	chà chā	cha1
薄	báo bó	bao2
斗	dòu dǒu	dou3
识	shí shì	shi2

According to the Unihan docs the latter of these pairs are Taiwanese variations. I'm not sure I buy that. The 地 hanzi is the 46th most frequent in the giga-zh list, almost certainly because it's a grammar particle. When used as a particle, I think it's always pronounced de5. My (mainland) wife says so anyway, and this indicates Taiwanese also say de:
https://forvo.com/word/%E5%9C%B0%EF%BC%88de%EF%BC%89/#zh

It's obviously important we get 地 right when it's used as a particle. How should we handle that?

noinkling · 2022-08-21T06:12:05Z

Yeah, my guess is the Unihan people didn't give any consideration to the different meanings in the case of 地 (or, if they did, decided it wasn't their problem, i.e. "out of scope"). Rather, the Taiwanese source(s) they used said that di4 (i.e. the noun meaning) is more common for whatever reason, so it was simply included. Keep in mind that the readings given are almost certainly based on how often the character appears in compounds too, whereas for our purposes we're mostly only considering standalone meanings, since compounds should (ideally) be covered by the word dictionary.

Ultimately, automated Chinese word segmentation and transliteration is a hard problem (it's an active area of research) because it so often depends on contextual meaning. Whatever approach is taken is going to be flawed in some way; it's always going to be "best effort".

Anyway these kinds of discrepancies are probably best solved by adding "manual override" data, so basically what you were already doing, but only for characters where we think the kMandarin readings aren't ideal. I think I agree with your wife about the more likely readings for 差 and 斗, for what it's worth given my limited knowledge.

chambm · 2022-08-21T17:00:44Z

OK, here's the subset of polyphones.tsv where my wife's pick differs from kMandarin (or kMandarin is a pair). I'm going to ask her to double-check these since it's a small set. So basically I'll update polyphones to this (including any adjustments later), and then add short-circuiting for single character words to use hanzi's kMandarin instead of cidian?
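Roughly, the pruning described above could look like the sketch below. It assumes both sources have already been normalized to numbered pinyin and that a kMandarin pair is stored as space-separated readings; `prune_overrides` and the sample data are hypothetical.

```python
def prune_overrides(overrides, k_mandarin):
    """Keep an override only if kMandarin is ambiguous or disagrees with it."""
    kept = {}
    for hanzi, pick in overrides.items():
        readings = k_mandarin.get(hanzi, "").split()
        # Redundant only when kMandarin has exactly one reading and it matches.
        if len(readings) != 1 or readings[0] != pick:
            kept[hanzi] = pick
    return kept


# Illustrative data only, not the real tables.
overrides = {"地": "de5", "差": "cha1", "好": "hao3"}
k_mandarin = {"地": "de5 di4", "差": "cha4 cha1", "好": "hao3"}

kept = prune_overrides(overrides, k_mandarin)
```

Here 好 is dropped because kMandarin already gives the same single reading, while 地 and 差 stay because kMandarin lists a pair.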

There are also about 250 two-character words in cidian with ambiguous pronunciations. I'll try to add those in as well because I'm sure some of those meanings are much more commonly used than others.

chambm · 2022-08-22T14:30:55Z

Surprisingly, it took me a while to find a few two-character polyphones that didn't already transcribe the same way as my wife picked. So there are some possibly redundant two-character polyphones in the override table, but that could change if SQLite query optimization changes or an ORDER BY clause gets added (as it probably should be).
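To illustrate the ORDER BY point: SQLite does not guarantee row order for a SELECT without ORDER BY, so which reading comes back first for an ambiguous word can depend on the query plan. A small sketch with a hypothetical two-column cidian table and illustrative rows follows; ordering by `pinyin` is an arbitrary but stable tie-break chosen for this example, not the add-on's actual query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cidian (simplified TEXT, pinyin TEXT)")
conn.executemany(
    "INSERT INTO cidian VALUES (?, ?)",
    [("朝阳", "zhao1 yang2"), ("朝阳", "chao2 yang2")],
)


def first_reading(conn, word):
    """Return one reading for word, chosen by an explicit, stable ordering."""
    row = conn.execute(
        "SELECT pinyin FROM cidian WHERE simplified = ? "
        "ORDER BY pinyin, rowid LIMIT 1",
        (word,),
    ).fetchone()
    return row[0] if row else None


reading = first_reading(conn, "朝阳")
```

With an explicit ordering, an override is only needed when the stable default is not the preferred reading, so the set of redundant entries would stop depending on the optimizer.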

I actually ran the tests before those last commits. They all pass for me.

chambm added 3 commits August 20, 2022 19:03

Add polyphone overrides to Dictionary

c78287d

Add polyphones.tsv

7213983

Add tests for polyphones

c6cab73

chambm added 3 commits August 21, 2022 20:34

Add hanzi lookup for single character words

dd0a757

Remove redundant polyphone overrides

c703bd6

The removed polyphones will get the correct pinyin from the hanzi table.

Add two-character polyphones and tests

991dfe4

Fix Ruby/Bopomofo test (maybe due to using hanzi transcription) Move polyphones.tsv to the right location
