LangModels refs error #36

cxjwin · 2017-02-13T14:48:31Z

LangModels refs error in build-mac/uchardet.xcodeproj

Jehan · 2017-02-13T16:17:52Z

Hi @cxjwin ! Thanks for the report.

As explained in the README, this repository is not the upstream anymore (only kept for historical reasons, I'd say). We are now hosted at Freedesktop. Could you post any further bug reports at: https://bugs.freedesktop.org/enter_bug.cgi?product=uchardet

This being said, I'm not really sure what I am looking at. All the language models have been moved to src/LangModels/. What do these red files mean?
Or is it something related to the macOS build (I assume that's what is under build-mac/) and an output of whatever development GUI you use on this platform? If so, I know nothing about the platform, and I don't have a macOS machine or tools. But I will gladly accept any patch fixing whatever needs to be fixed there. :-)

If you have any patch to provide, please do so on the new bug tracker for the project at Freedesktop's.
Thanks!

marinofaggiana · 2017-08-18T07:41:51Z

Hi @Jehan, I have a problem compiling uchardet for iOS. Can you help me?

Jehan · 2017-08-18T09:20:23Z

Hello @marinofaggiana,

Maybe I can, but I have no iOS machine, so any help on my side can only be generic. I see a bunch of files under build-mac/ in our repository, but I have no idea what they are or how they work (which is why I never touched them, so I am not surprised if something is broken).
So if you explain the problem to me in detail, with error messages and any hints you have, maybe we can fix this together.
Ideally, if you are able to fix it and provide a patch, even better. ;-)

Finally, as explained in my previous comment, this is not the upstream repository anymore. It means that this is not the latest code of uchardet, and also that this is not the place to deal with bugs. I am answering as an exception, but I won't do it every time.
Uchardet is now a Freedesktop project.

Web page: https://www.freedesktop.org/wiki/Software/uchardet/
Report bugs at: https://bugs.freedesktop.org/enter_bug.cgi?product=uchardet
See currently open bugs at: https://bugs.freedesktop.org/describecomponents.cgi?product=uchardet
Latest development code: https://cgit.freedesktop.org/uchardet/uchardet/

marinofaggiana · 2017-08-18T09:29:16Z

Thanks @Jehan. My issue is not a build problem (I think): there are no errors, although build-mac/ is old now that the language models are in a new directory (LangModels). The issue is with detection. For example, I have this txt file with Italian words:

utf8.txt

I installed uchardet on my Mac OS X with brew and tested the file:

```
MacBook-Pro:000 marinofaggiana$ uchardet -v

uchardet Command Line Tool
Version 0.0.6

Authors: BYVoid, Jehan
Bug Report: https://bugs.freedesktop.org/enter_bug.cgi?product=uchardet

MacBook-Pro:000 marinofaggiana$
MacBook-Pro:000 marinofaggiana$
MacBook-Pro:000 marinofaggiana$
MacBook-Pro:000 marinofaggiana$ uchardet utf8.txt
UTF-8
MacBook-Pro:000 marinofaggiana$
```
Response: UTF-8, which is correct.

The issue is on iOS, where detection returns:
encodingName __NSCFString * @"ISO-8859-1" 0x00006080006209e0

... this is the issue ...

Jehan · 2017-08-18T09:36:07Z

> The issue is on iOS, where detection returns:
> encodingName __NSCFString * @"ISO-8859-1" 0x00006080006209e0

I don't understand. Where does this come from? Are you saying that comes from uchardet too? Is "detect" a command of iOS maybe?
If this is the former, you'll have to tell me more (what is the difference between the 2 calls?). If this is the latter, then… well that's why uchardet exists (because most other tools make a lot of detection errors).

You'll have to give me a bunch more details for me to understand. :-)

marinofaggiana · 2017-08-18T09:41:27Z

OK, no. I used an Objective-C wrapper around the static library (.a):

@interface NCUchardet ()
{
   uchardet_t _detector;
}
@end

@implementation NCUchardet

+ (NCUchardet *)sharedNUCharDet {
    static NCUchardet *nuCharDet;
    @synchronized(self) {
        if (!nuCharDet) {
            nuCharDet = [NCUchardet new];
        }
        return nuCharDet;
    }
}

- (id)init
{
    self = [super init];
    
    if (self) {
        _detector = uchardet_new();
    }
    
    return self;
}

- (void)dealloc
{
    uchardet_delete(_detector);
}

- (NSString *)encodingStringDetectWithData:(NSData *)data
{
    uchardet_handle_data(_detector, [data bytes], [data length]);
    uchardet_data_end(_detector);
    
    const char *charset = uchardet_get_charset(_detector);
    NSString *encoding = [NSString stringWithCString:charset encoding:NSASCIIStringEncoding];
    
    uchardet_reset(_detector);
    
    return encoding;
}

@end

Jehan · 2017-08-18T10:00:00Z

I am not an Objective-C expert (to say the least), but from what I read, it looks like it should work. I can think of 2 things. First, are you sure Objective-C does not re-encode the data before it reaches uchardet, by any chance? I would try to dump the data and make sure it is byte for byte the same as it is in the file.

The second thing is that I see you reuse the same detector by keeping it around and running uchardet_reset(). Most use cases I have seen create a new detector every time, so who knows, maybe the barely used uchardet_reset() is broken. If that is the case and you detected the encoding of various files before, it may have interfered. Could you try to delete and recreate a new detector after every detection and see if it helps?
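
A byte-for-byte comparison like the one suggested above can be done with a small hex dump helper. The following is only a minimal sketch in plain C; `hex_dump` is a hypothetical helper name, not part of uchardet or of the wrapper shown earlier.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: format `len` bytes of `data` as space-separated
 * lowercase hex pairs into `out`. Returns the number of input bytes that
 * were formatted (fewer than `len` if `out` is too small). `out` is always
 * NUL-terminated when out_size > 0. */
size_t hex_dump(const unsigned char *data, size_t len,
                char *out, size_t out_size)
{
    size_t written = 0;  /* characters written to `out` so far */
    size_t i;

    if (out_size == 0)
        return 0;
    out[0] = '\0';

    for (i = 0; i < len; i++) {
        /* "xx" for the first byte, " xx" afterwards, plus the trailing NUL */
        size_t need = (i == 0 ? 2 : 3) + 1;
        if (out_size - written < need)
            break;
        written += (size_t)snprintf(out + written, out_size - written,
                                    i == 0 ? "%02x" : " %02x",
                                    (unsigned)data[i]);
    }
    return i;
}
```

In the wrapper above, the arguments would presumably be `[data bytes]` and `[data length]`; the resulting string can then be compared with a hex dump of the file on disk.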

marinofaggiana · 2017-08-18T10:13:44Z

I have removed the singleton library, but this is not the issue :

First dump:

Jehan · 2017-08-18T10:23:28Z

Could we see the data in bytes mode to make sure that's UTF-8? :-)
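
One way to check the bytes programmatically is a structural UTF-8 validation. The following is a minimal sketch in plain C, independent of uchardet; `is_valid_utf8` is a hypothetical helper name. Note that passing this check only shows the bytes are well-formed UTF-8; it is not by itself a full encoding detection.

```c
#include <stddef.h>

/* Hypothetical helper: return 1 if the buffer is well-formed UTF-8,
 * 0 otherwise. Rejects truncated sequences, overlong forms, surrogates
 * and code points above U+10FFFF. */
int is_valid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char c = s[i];
        size_t extra;       /* continuation bytes expected after the lead */
        unsigned long cp;   /* decoded code point */
        unsigned long min;  /* smallest code point allowed for this length */

        if (c < 0x80) {                     /* plain ASCII byte */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            extra = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            extra = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            extra = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;       /* stray continuation or invalid lead byte */
        }

        if (n - i <= extra)
            return 0;       /* sequence is cut off at the end of the buffer */

        for (size_t k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;   /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;       /* overlong, out of range, or surrogate */

        i += extra + 1;
    }
    return 1;
}
```

As an illustration (the sample word is an assumption, not the content of the attached file), the bytes `70 65 72 63 68 c3 a9` ("perché" in UTF-8) pass this check, while `70 65 72 63 68 e9` (the same word in ISO-8859-1) does not.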

marinofaggiana · 2017-08-18T10:33:51Z

marinofaggiana · 2017-08-18T10:41:11Z

Jehan · 2017-08-18T11:35:37Z

OK, it looks like fine UTF-8. But I just understood the problem. It's not your program.

Actually, I realize that the development code of uchardet returns ISO-8859-1 for this data, which is wrong. Are you using the latest git code for your development, while using stable 0.0.6 for the uchardet tool, by any chance? :-)

marinofaggiana · 2017-08-18T11:42:20Z

Good question... I used:

https://cgit.freedesktop.org/uchardet/uchardet/

for a copy... is that 0.0.6?

Jehan · 2017-08-18T11:45:25Z

Well, by default master is the development code. Keep the same code, but check out the commit for the v0.0.6 release:

git checkout v0.0.6

Then you'll have the code used for 0.0.6. Alternatively, use the snapshot in a tarball: https://www.freedesktop.org/software/uchardet/releases/ (but that should be the same code if you check out the right tag).

marinofaggiana · 2017-08-18T12:07:33Z

OK @Jehan, with the code from https://www.freedesktop.org/software/uchardet/releases/ the detection is correct: UTF-8. Thanks! In the future, if you want a test on iOS, we are here with our project:

https://github.com/nextcloud/ios

Jehan · 2017-08-18T12:23:55Z

Nice to know that you use uchardet in Nextcloud (only the iOS app?). I have a lot of stuff I'd like to discuss for Nextcloud (not related to character detection). Probably some time later. :-)

I opened a bug report related to the file you gave, which is now detected as ISO-8859-1 because of newly added language support. Though I'm not sure I have much of a solution for now. There are 2 problems here:

1/ With very short texts (like here, just 2 words), a system based on language statistics will be a lot less effective. For longer texts (even just a few more words with a complete sentence), the encoding detection will become a lot more accurate (and in particular, any slight confidence which currently makes the system believe it may be another language would likely disappear with more words).

2/ UTF-8 detection is not language-aware currently. If it were, and knew of Italian letter-usage statistics, this should definitely raise the confidence for UTF-8.

This second point is something I plan to work on someday. The first point is inherent to uchardet's algorithm (the smaller the input data, the harder it is to map results to generic language statistics).

Bug report: https://bugs.freedesktop.org/show_bug.cgi?id=102292

P.S.: the text was Italian, right?

marinofaggiana · 2017-08-18T12:27:48Z

> Nice to know that you use uchardet in Nextcloud (only the iOS app?). I have a lot of stuff I'd like to discuss for Nextcloud (not related to character detection). Probably some time later. :-)

Android too. We can discuss whenever you want :-)

> 1/ With very short texts (like here, just 2 words) ....

Yes, of course!

> P.S.: the text was Italian, right?

Yes, Italian

Jehan · 2017-08-18T12:32:18Z

Oh there is actually another solution which I am planning to work on at some point: language hints.
I want to provide ways to hint a detector towards a list of languages, with either a hard hint (the file owner says "that's definitely Italian", which will make encoding detection much easier, even for short texts) or soft hints (for instance, a piece of software could keep a list of languages commonly read by the user and give higher weight to these languages; this won't prevent detection of other languages and encodings, yet it gives better confidence for the user's preferred languages, which are statistically more likely to appear again).
But those are all future wishes. I don't know when I'll be able to make the time for language hinting.
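
To make the soft-hint idea concrete, here is a purely illustrative sketch in plain C. Everything in it is an assumption: uchardet has no such API in this thread, and the `candidate` structure, the `pick_with_soft_hints` function, the language codes and the `boost` factor are all hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical candidate record; not a uchardet type. */
struct candidate {
    const char *charset;    /* e.g. "UTF-8" */
    const char *language;   /* e.g. "it" */
    double confidence;      /* raw detector confidence, 0.0 to 1.0 */
};

/* Return the index of the highest-scoring candidate after multiplying the
 * confidence of hinted languages by `boost` (expected to be > 1.0).
 * Returns n when there are no candidates. */
size_t pick_with_soft_hints(const struct candidate *cands, size_t n,
                            const char *const *hints, size_t n_hints,
                            double boost)
{
    size_t best = n;
    double best_score = -1.0;

    for (size_t i = 0; i < n; i++) {
        double score = cands[i].confidence;

        for (size_t h = 0; h < n_hints; h++) {
            if (strcmp(cands[i].language, hints[h]) == 0) {
                score *= boost;  /* soft hint: prefer, never exclude others */
                break;
            }
        }
        if (score > best_score) {
            best_score = score;
            best = i;
        }
    }
    return best;
}
```

A hard hint, in this picture, would instead drop every candidate whose language is not in the hint list before picking the best one.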

marinofaggiana · 2017-08-18T15:57:40Z

A question @Jehan: can the call

const char *charset = uchardet_get_charset(_detector);

return NULL, "", or nil?

Jehan · 2017-08-18T22:34:50Z

It can return "" when no charset was found with high enough confidence. Otherwise a charset name. It won't ever return NULL.
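
A caller can treat the empty string as "unknown" and substitute a default. The following is a minimal sketch in plain C; `charset_or_fallback` is a hypothetical helper name, and the NULL check is purely defensive, since NULL is not expected according to the answer above.

```c
#include <stddef.h>

/* Hypothetical helper: return `detected` unless it is empty (no charset
 * found with enough confidence) or NULL (defensive check only), in which
 * case return `fallback`. */
const char *charset_or_fallback(const char *detected, const char *fallback)
{
    if (detected == NULL || detected[0] == '\0')
        return fallback;
    return detected;
}
```

In the wrapper above, this would be applied to the result of `uchardet_get_charset(_detector)` before building the `NSString`; which fallback to use (for example "UTF-8") is an application choice.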

marinofaggiana · 2017-08-19T08:30:46Z

Very well, thanks for your help !

marinofaggiana mentioned this issue Aug 18, 2017

Preview of non-UTF-8 text is not working nextcloud/ios#351

Closed
