LangModels refs error #36

Open
cxjwin opened this issue Feb 13, 2017 · 21 comments

@cxjwin

cxjwin commented Feb 13, 2017

The LangModels file references in build-mac/uchardet.xcodeproj are broken:

2017-02-13 10 49 40

@Jehan
Collaborator

Jehan commented Feb 13, 2017

Hi @cxjwin! Thanks for the report.

As explained in the README, this repository is no longer the upstream (it is only kept for historical reasons, I'd say). We are now hosted at Freedesktop. Could you post any further bug reports at: https://bugs.freedesktop.org/enter_bug.cgi?product=uchardet ?

That being said, I'm not really sure what I am looking at. All the language models have been moved to src/LangModels/. What do these red files mean?
Or is it something related to the macOS build (I assume that's what is under build-mac/) and the output of whatever development GUI you use on this platform? If so, I know nothing about the platform and don't have a macOS machine or tools. But I will gladly accept any patch fixing whatever needs to be fixed there. :-)

If you have a patch to provide, please do so on the project's new bug tracker at Freedesktop.
Thanks!

@marinofaggiana

Hi @Jehan, I have a problem compiling uchardet for iOS. Can you help me?

@Jehan
Collaborator

Jehan commented Aug 18, 2017

Hello @marinofaggiana,

Maybe I can, but I have no iOS machine, so any help on my side can only be generic. I see a bunch of files under build-mac/ in our repository, but I have no idea what they are or how they work (which is why I never touched them, so I am not surprised if something is broken).
So if you explain the problem to me in detail, with the error messages and any hints you have, maybe we can fix this together.
Ideally, if you are able to fix it and provide a patch, that is even better. ;-)

Finally, as explained in my previous comment, this is not the upstream repository anymore. That means this is not the latest uchardet code, and also that this is not the place to deal with bugs. I am answering as an exception, but I won't do it every time.
Uchardet is now a Freedesktop project.

@marinofaggiana

Thanks @Jehan. My issue is not with the build (I think): there are no errors, although build-mac/ is outdated now that the language models are in a new directory (LangModels). The issue is with detection. For example, I have this text file containing Italian words:

utf8.txt

I installed uchardet on my Mac (OS X) with brew and tested the file:

`
MacBook-Pro:000 marinofaggiana$ uchardet -v

uchardet Command Line Tool
Version 0.0.6

Authors: BYVoid, Jehan
Bug Report: https://bugs.freedesktop.org/enter_bug.cgi?product=uchardet

MacBook-Pro:000 marinofaggiana$ uchardet utf8.txt
UTF-8
`
Response: UTF-8, which is correct.

The issue is on iOS, where detection returns:
encodingName __NSCFString * @"ISO-8859-1" 0x00006080006209e0

That is the issue.

@Jehan
Collaborator

Jehan commented Aug 18, 2017

The issue is on iOS, where detection returns:
encodingName __NSCFString * @"ISO-8859-1" 0x00006080006209e0

I don't understand. Where does this come from? Are you saying it comes from uchardet too? Or is "detect" an iOS command, maybe?
If it is the former, you'll have to tell me more (what is the difference between the 2 calls?). If it is the latter, then… well, that's why uchardet exists (because most other tools make a lot of detection errors).

You'll have to give me a bunch more details for me to understand. :-)

@marinofaggiana

OK. No, I used an Objective-C wrapper around the static library (.a):

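#import <Foundation/Foundation.h>
#import "uchardet.h"     // uchardet C API; adjust the path to your build
#import "NCUchardet.h"   // assumed name of the wrapper's public header

@interface NCUchardet ()
{
   uchardet_t _detector;
}
@end

@implementation NCUchardet

// Shared instance, created lazily on first use.
+ (NCUchardet *)sharedNUCharDet {
    static NCUchardet *nuCharDet;
    @synchronized(self) {
        if (!nuCharDet) {
            nuCharDet = [NCUchardet new];
        }
        return nuCharDet;
    }
}

- (instancetype)init
{
    self = [super init];
    
    if (self) {
        _detector = uchardet_new();
    }
    
    return self;
}

- (void)dealloc
{
    // Release the underlying C detector.
    uchardet_delete(_detector);
}

- (NSString *)encodingStringDetectWithData:(NSData *)data
{
    // Feed all the bytes, then signal the end of the input.
    uchardet_handle_data(_detector, [data bytes], [data length]);
    uchardet_data_end(_detector);
    
    // Copy the charset name into an NSString before resetting the detector.
    const char *charset = uchardet_get_charset(_detector);
    NSString *encoding = [NSString stringWithCString:charset encoding:NSASCIIStringEncoding];
    
    // Clear the detector state so it can be reused by the next call.
    uchardet_reset(_detector);
    
    return encoding;
}

@end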

@Jehan
Collaborator

Jehan commented Aug 18, 2017

I am not an Objective-C expert (to say the least), but from what I read, it looks like it should work. I can think of 2 things. First, are you sure Objective-C does not re-encode the data before it reaches uchardet, by any chance? I would try dumping the data to make sure it is byte for byte the same as in the file.

Second, I see you reuse the same detector by keeping it around and calling uchardet_reset(). Most use cases I have seen create a new detector every time, so who knows, maybe the barely used uchardet_reset() is broken. If that is the case and you detected the encoding of various files before, it may have interfered. Could you try deleting and recreating the detector after every detection and see if it helps?
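
For the byte-for-byte check, one option is to print the buffer as hex right before it is handed to uchardet. The following is only a minimal sketch in plain C (callable from Objective-C); `format_hex` is a hypothetical helper name, not part of uchardet or of the wrapper above.

```c
#include <stddef.h>
#include <stdio.h>

/* Writes `len` bytes from `data` into `out` as space-separated, two-digit
 * uppercase hex (for example "63 69 C3 A0"). Returns the number of
 * characters written, excluding the terminating NUL, or -1 if `out` is
 * too small. An empty input yields an empty string. */
int format_hex(const unsigned char *data, size_t len, char *out, size_t out_size)
{
    size_t pos = 0;

    if (out_size == 0)
        return -1;
    out[0] = '\0';

    for (size_t i = 0; i < len; i++) {
        /* Two hex digits per byte, plus a separating space after the first. */
        size_t needed = (i == 0) ? 2 : 3;
        if (pos + needed + 1 > out_size)
            return -1;
        pos += (size_t)snprintf(out + pos, out_size - pos,
                                (i == 0) ? "%02X" : " %02X",
                                (unsigned)data[i]);
    }
    return (int)pos;
}
```

In the wrapper, this could be called with `[data bytes]` and `[data length]`, and the result compared with the output of a tool such as `hexdump -C utf8.txt` on the original file.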

@marinofaggiana

I have removed the singleton, but that is not the issue.

First dump:

schermata 2017-08-18 alle 12 11 12

@Jehan
Collaborator

Jehan commented Aug 18, 2017

Could we see the data in bytes mode to make sure that's UTF-8? :-)
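
Besides reading the bytes by eye, well-formedness can also be checked in code. This is a minimal sketch, assuming only the standard UTF-8 encoding rules; `is_valid_utf8` is a hypothetical helper and has nothing to do with how uchardet itself scores UTF-8. It only tells whether the bytes could be UTF-8 at all.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if `data` is entirely well-formed UTF-8: no truncated
 * sequences, no stray continuation bytes, no overlong forms, no UTF-16
 * surrogates and nothing above U+10FFFF. Pure ASCII and empty input
 * also count as valid. */
bool is_valid_utf8(const unsigned char *data, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b = data[i];
        size_t extra;       /* continuation bytes expected after the lead */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point allowed for this length */

        if (b < 0x80)                { extra = 0; cp = b;        min = 0; }
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min = 0x10000; }
        else return false;  /* stray continuation byte or invalid lead byte */

        if (extra > len - i - 1)
            return false;   /* sequence is cut off by the end of the data */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = data[i + k];
            if ((c & 0xC0) != 0x80)
                return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                       /* overlong form */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   /* surrogate */
        if (cp > 0x10FFFF) return false;                  /* out of range */
        i += extra + 1;
    }
    return true;
}
```

For example, the bytes `63 69 74 74 C3 A0` pass, while the same word saved as ISO-8859-1 (`63 69 74 74 E0`) fails, because `E0` announces a three-byte sequence that never arrives.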

@marinofaggiana

schermata 2017-08-18 alle 12 33 34

@marinofaggiana

schermata 2017-08-18 alle 12 38 56

@Jehan
Collaborator

Jehan commented Aug 18, 2017

OK, it looks like fine UTF-8. But I just understood the problem. It's not your program.

Actually, I realize that the development code of uchardet detects this data as ISO-8859-1, which is wrong. Are you using the latest git code for your development while using stable 0.0.6 for the uchardet tool, by any chance? :-)

@marinofaggiana

Good question... I copied the code from:

https://cgit.freedesktop.org/uchardet/uchardet/

Is that 0.0.6?

@Jehan
Collaborator

Jehan commented Aug 18, 2017

Well, by default master is the development code. Keep the same clone, but check out the tag for the v0.0.6 release:

git checkout v0.0.6

Then you'll have the code used for 0.0.6. Alternatively, use the release tarball: https://www.freedesktop.org/software/uchardet/releases/ (it should be the same code if you check out the right tag).

@marinofaggiana

OK @Jehan, with the code from https://www.freedesktop.org/software/uchardet/releases/ the detection is correct (UTF-8), thanks. For the future, if you want something tested on iOS, we are here with our project:

https://github.com/nextcloud/ios

@Jehan
Collaborator

Jehan commented Aug 18, 2017

Nice to know that you use uchardet in Nextcloud (only the iOS app?). I have a lot of stuff I'd like to discuss for Nextcloud (not related to character detection). Probably some time later. :-)

I opened a bug report about the file you gave, which is now detected as ISO-8859-1 because of newly added language support. Though I'm not sure I have much of a solution for now. There are 2 problems here:

1/ With very short texts (like here, just 2 words), a system based on language statistics is a lot less reliable. With longer texts (even just a few more words forming a complete sentence), encoding detection becomes a lot more accurate; in particular, the slight confidence that currently makes the system believe the text may be in another language would likely disappear with more words.

2/ UTF-8 detection is currently not language-aware. If it were, and knew Italian letter-usage statistics, that would definitely raise the confidence for UTF-8.

I plan to work on the second point someday. The first point is inherent to the uchardet algorithm (the smaller the input data, the harder it is to map results to generic language statistics).

Bug report: https://bugs.freedesktop.org/show_bug.cgi?id=102292

P.S.: the text was Italian, right?

@marinofaggiana

marinofaggiana commented Aug 18, 2017

Nice to know that you use uchardet in Nextcloud (only the iOS app?). I have a lot of stuff I'd like to discuss for Nextcloud (not related to character detection). Probably some time later. :-)

Android too. Happy to discuss whenever you want :-)

1/ With very short texts (like here, just 2 words) ....

Yes, of course!

P.S.: the text was Italian, right?

Yes, Italian

@Jehan
Collaborator

Jehan commented Aug 18, 2017

Oh, there is actually another solution which I am planning to work on at some point: language hints.
I want to provide ways to hint a detector towards a list of languages: either a hard hint (the file owner says "that's definitely Italian", which would make encoding detection much easier, even for short texts), or soft hints (for instance, software could keep a list of languages commonly read by the user and give higher weight to these languages; this would not prevent detection of other languages and encodings, yet would give better confidence for the user's preferred languages, which are statistically more likely to appear again).
But these are all future wishes. I don't know when I'll be able to make time for language hinting.
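
To make the hard/soft distinction concrete, here is a purely illustrative sketch in C. Nothing in it is uchardet API: the `candidate` struct, `pick_candidate`, the `boost` factor and the language codes are all hypothetical, and the real design may end up looking quite different.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical candidate record; uchardet does not expose this type. */
typedef struct {
    const char *charset;    /* e.g. "UTF-8" */
    const char *language;   /* e.g. "it" */
    double confidence;      /* raw score assigned by the detector */
} candidate;

/* Returns 1 if `language` appears in `hints`, 0 otherwise. */
static int is_hinted(const char *language,
                     const char *const *hints, size_t n_hints)
{
    for (size_t i = 0; i < n_hints; i++)
        if (strcmp(language, hints[i]) == 0)
            return 1;
    return 0;
}

/* Picks the candidate with the highest adjusted score, or NULL if none
 * qualifies. With hard != 0, candidates outside the hinted languages are
 * skipped entirely. Otherwise hinted languages get their confidence
 * multiplied by `boost`, and the other candidates still compete with
 * their raw confidence. */
const candidate *pick_candidate(const candidate *cands, size_t n,
                                const char *const *hints, size_t n_hints,
                                int hard, double boost)
{
    const candidate *best = NULL;
    double best_score = 0.0;

    for (size_t i = 0; i < n; i++) {
        int hinted = is_hinted(cands[i].language, hints, n_hints);
        double score;

        if (hard && !hinted)
            continue;
        score = cands[i].confidence * ((hinted && !hard) ? boost : 1.0);
        if (best == NULL || score > best_score) {
            best = &cands[i];
            best_score = score;
        }
    }
    return best;
}
```

With made-up scores of 0.55 for an ISO-8859-1 candidate and 0.50 for a UTF-8 Italian one, a soft hint for "it" with a boost of 1.5 would flip the result to UTF-8, while an unhinted language with a clearly higher score could still win.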

@marinofaggiana

A question, @Jehan: can

const char *charset = uchardet_get_charset(_detector);

return NULL, "" or nil?

@Jehan
Collaborator

Jehan commented Aug 18, 2017

It can return "" when no charset was found with high enough confidence. Otherwise it returns a charset name. It will never return NULL.
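
On the caller's side, the empty-string case could be handled along these lines. This is a minimal sketch: `charset_or_fallback` is a hypothetical helper, the choice of fallback is up to the application, and the NULL check is purely defensive since NULL is not expected.

```c
#include <stddef.h>
#include <stdio.h>

/* Copies the detected charset name into `out`, or `fallback` when the
 * detector reported "" (no charset found with enough confidence).
 * Returns 1 if the detection result was used, 0 if the fallback was.
 * The NULL check is defensive only. */
int charset_or_fallback(const char *detected, const char *fallback,
                        char *out, size_t out_size)
{
    int found = (detected != NULL && detected[0] != '\0');

    /* snprintf truncates safely and always NUL-terminates when out_size > 0. */
    snprintf(out, out_size, "%s", found ? detected : fallback);
    return found;
}
```

A caller would pass the result of `uchardet_get_charset()` as `detected`. Copying the name into a caller-owned buffer also avoids holding on to the returned pointer after the detector is reset or deleted, assuming that pointer belongs to the detector.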

@marinofaggiana

Very well, thanks for your help !
