LangModels refs error #36
Comments
Hi @cxjwin! Thanks for the report. As explained in the README, this repository is not the upstream anymore (it is only kept for historical reasons, I'd say). We are now hosted at Freedesktop. Could you post any further bug reports at: https://bugs.freedesktop.org/enter_bug.cgi?product=uchardet ? That being said, I'm not really sure what I am looking at. All the language models have been moved to […] If you have any patch to provide, please do so on the project's new bug tracker at Freedesktop. |
Hi @Jehan, I have a problem compiling uchardet for iOS. Can you help me? |
Hello @marinofaggiana, Maybe I can, but I have no iOS machine, so any help on my side can only be generic. I see a bunch of files under […] Finally, as explained in my previous comment, this is not the upstream repository anymore. It means that this is not the latest uchardet code, and also that this is not the place to deal with bugs. I only answer exceptionally, but I won't do it every time.
|
Thanks @Jehan, my issue is not a build problem (I think): there are no errors, but build-mac/ is old and the language models are now in a new directory (LangModel). The issue is with detection. For example, I have this text file with Italian words: I installed uchardet on my Mac OS X with brew and tested the file: ` uchardet Command Line Tool Authors: BYVoid, Jehan MacBook-Pro:000 marinofaggiana$ With iOS, the detection returns: ... this is the issue ... |
I don't understand. Where does this come from? Are you saying that comes from uchardet too? Is "detect" a command of iOS maybe? You'll have to give me a bunch more details for me to understand. :-) |
OK. No, I used an Objective-C wrapper around the static library (.a):
|
I am not an Objective-C expert (to say the least) but from what I read, it looks like it should work. I can think of 2 things. First, are you sure Objective-C does not re-encode the data before it reaches uchardet, by any chance? I would try to dump the data and make sure it is byte for byte the same as it is in the file. Second, I see you reuse the same detector by keeping it around and running uchardet_reset(). Most use cases I have seen create a new detector every time, so who knows, maybe the barely used uchardet_reset() is broken. If that is the case and you detected the encoding of various files before, it may have interfered. Could you try to delete and recreate a new detector after every detection and see if it helps? |
Could we see the data in bytes mode to make sure that's UTF-8? :-) |
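For readers who want to run this byte-level check themselves, here is a minimal sketch in C. It does not use uchardet at all: `is_valid_utf8` and `hex_dump` are hypothetical helper names introduced only for illustration, and the sample bytes in the usage note are an assumed example, not the file from this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if buf[0..len) is well-formed UTF-8,
 * 0 otherwise. Rejects overlong forms, surrogates (U+D800..U+DFFF)
 * and code points above U+10FFFF. */
int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    assert(buf != NULL || len == 0);
    while (i < len) {
        unsigned char c = buf[i];
        size_t extra;      /* continuation bytes expected after the lead byte */
        unsigned long cp;  /* code point being decoded */
        unsigned long min; /* smallest code point allowed for this length */

        if (c < 0x80) { i++; continue; } /* plain ASCII */
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return 0; /* stray continuation byte or invalid lead byte */

        if (len - i - 1 < extra) return 0; /* truncated sequence */
        for (size_t k = 1; k <= extra; k++) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80) return 0; /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        i += extra + 1;
    }
    return 1;
}

/* Hypothetical helper: writes the bytes to out as space-separated hex,
 * so the buffer can be compared byte for byte with the file on disk. */
void hex_dump(FILE *out, const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        fprintf(out, "%02X%s", buf[i], (i + 1 < len) ? " " : "\n");
}
```

As an assumed example, the Italian word "perché" is `70 65 72 63 68 C3 A9` in UTF-8 and `70 65 72 63 68 E9` in ISO-8859-1. The first passes `is_valid_utf8` and the second does not, because the lone `E9` byte announces a multi-byte sequence that never arrives. Calling `hex_dump` on the buffer just before it is handed to the detector shows whether the wrapper re-encoded anything.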
OK, it looks like valid UTF-8. But I just understood the problem. It's not your program. Actually, I realize that the development code of uchardet detects this data as ISO-8859-1, which is wrong. Are you using the latest git code for your development while using stable 0.0.6 for the uchardet tool, by any chance? :-) |
Good question ... I used https://cgit.freedesktop.org/uchardet/uchardet/ to get a copy ... is that 0.0.6? |
Well by default master is the development code. Keep the same code, but checkout the commit for v0.0.6 release:
Then you'll have the code used for 0.0.6. Alternatively use the snapshot in a tarball: https://www.freedesktop.org/software/uchardet/releases/ (but that should be the same code if you checkout the right tag). |
OK @Jehan, with https://www.freedesktop.org/software/uchardet/releases/ the detection is fine: UTF-8. Thanks! For the future, if you want a test on iOS, we are here with our project: |
Nice to know that you use uchardet in Nextcloud (only the iOS app?). I have a lot of stuff I'd like to discuss for Nextcloud (not related to character detection). Probably some time later. :-)
I opened a bug report related to the file you gave, which is now detected as ISO-8859-1 because of a new language support. Though I'm not sure I have much of a solution for now. There are 2 problems here:
1/ With very short texts (like here, just 2 words), a system based on language statistics will be a lot less efficient. For longer texts (even just a few more words with a complete sentence), the encoding detection will become a lot more accurate (and in particular, any slight confidence which currently makes the system believe it may be another language would likely disappear with more words).
2/ UTF-8 detection is not language aware currently. If it were, and knew of Italian letter-usage statistics, this should definitely raise the confidence for UTF-8.
This second point is something I plan to work on someday. The first point is inherent to the uchardet algorithm (the smaller the input data, the harder it is to map results to generic language statistics).
Bug report: https://bugs.freedesktop.org/show_bug.cgi?id=102292
P.S.: the text was Italian, right? |
Android too. Happy to discuss whenever you want :-)
yes, of course !
Yes, Italian |
Oh there is actually another solution which I am planning to work on at some point: language hints. |
A question @Jehan, can the
|
It can return "" when no charset was found with high enough confidence. Otherwise it returns a charset name. It will never return NULL. |
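For callers, the practical consequence is that an empty string has to be treated as "unknown". The sketch below shows one way to do that in C. `charset_or_fallback` is a hypothetical helper, not part of uchardet, and it is assumed here that the string passed in comes from the library's charset getter (presumably `uchardet_get_charset()`, though the question above does not name the function). The NULL check is purely defensive, since the reply says NULL is never returned.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of uchardet): maps the detector's
 * "no confident result" value, the empty string, to a caller-chosen
 * fallback. A NULL input is handled the same way, defensively. */
const char *charset_or_fallback(const char *detected, const char *fallback)
{
    assert(fallback != NULL);
    if (detected == NULL || detected[0] == '\0')
        return fallback; /* nothing detected with enough confidence */
    return detected;     /* a real charset name such as "UTF-8" */
}
```

A caller would wrap the detector's result, for example `charset_or_fallback(name, "UTF-8")`, before passing it to a conversion routine. Which fallback is appropriate depends on the application; UTF-8 is only an assumption here.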
Very well, thanks for your help ! |
LangModels refs error in build-mac/uchardet.xcodeproj