This repository has been archived by the owner on Jan 20, 2023. It is now read-only.

Multibyte characters ignored on OS X #1

Open
cpsdqs opened this issue Sep 18, 2016 · 14 comments

@cpsdqs

cpsdqs commented Sep 18, 2016

Characters with large code points are ignored for some reason, possibly because they use multiple bytes.
Screenshot

@miestasmia
Owner

I'm not sure what you're trying to write but afaik u+F09F does not exist.

@cpsdqs
Author

cpsdqs commented Sep 19, 2016

Each of the first six characters that aren't a space uses four bytes, not two: e.g. F0 9F 90 A1, which is u+1F421 (blowfish).
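The byte arithmetic in the comment above can be checked by hand. The following is a minimal sketch, not code from unilookup; the function name `decode_four_byte_utf8` is hypothetical. It assumes the input is exactly one four-byte UTF-8 sequence and shows how F0 9F 90 A1 yields 1F421. Reading only the first two bytes as one 16-bit value would give F09F, which is presumably where the earlier u+F09F came from.

```python
def decode_four_byte_utf8(raw):
    """Decode exactly one four-byte UTF-8 sequence into its code point.

    Hypothetical helper for illustration; not part of unilookup.
    """
    # bytearray yields integers on both Python 2 and Python 3.
    b0, b1, b2, b3 = bytearray(raw)

    # A four-byte sequence starts with the bit pattern 11110xxx.
    if b0 >> 3 != 0b11110:
        raise ValueError("not a four-byte lead byte: 0x%02X" % b0)

    # Each continuation byte has the bit pattern 10xxxxxx.
    for byte in (b1, b2, b3):
        if byte >> 6 != 0b10:
            raise ValueError("not a continuation byte: 0x%02X" % byte)

    # 3 payload bits from the lead byte, 6 from each continuation byte.
    return (
        ((b0 & 0x07) << 18)
        | ((b1 & 0x3F) << 12)
        | ((b2 & 0x3F) << 6)
        | (b3 & 0x3F)
    )


print("u+%04X" % decode_four_byte_utf8(b"\xF0\x9F\x90\xA1"))  # u+1F421
```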

@miestasmia
Owner

miestasmia commented Sep 19, 2016

Sorry, my bad. I'm afraid I can't reproduce it.

image

@miestasmia
Owner

@cpsdqs Can you please confirm if this is still an issue in the latest version? Some of the internals have been changed.

@cpsdqs
Author

cpsdqs commented Sep 24, 2017

sure is
screenshot

@miestasmia miestasmia reopened this Sep 24, 2017
@miestasmia
Owner

miestasmia commented Sep 24, 2017

I still cannot reproduce this, which leads me to think it's a Mac-only issue. Can you try running this on a Linux machine by any chance?
perdygui

@miestasmia miestasmia self-assigned this Sep 24, 2017
@miestasmia
Owner

miestasmia commented Sep 24, 2017

Additionally, could you try running export PYTHONIOENCODING=utf-8 before running unilookup on OS X?

@cpsdqs
Author

cpsdqs commented Sep 24, 2017

Doesn't work either ¯\_(ツ)_/¯
['D83C', 'DF29', '000A'] vs. ['1F329', '000A']
seems to be an issue with python itself
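The two lists above differ in exactly the way a UTF-16 surrogate pair differs from the code point it encodes: D83C DF29 is the surrogate pair for 1F329. One plausible explanation, which the thread does not confirm, is a "narrow" Python 2 build that stores unicode strings as 16-bit units. The sketch below shows how such a pair can be merged back into one code point; `merge_surrogates` is a hypothetical name, not a unilookup function.

```python
def merge_surrogates(units):
    """Combine UTF-16 surrogate pairs in a list of code units.

    Hypothetical helper for illustration. Unpaired surrogates are
    passed through unchanged rather than rejected.
    """
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            low = units[i + 1]
            # 10 bits from the high surrogate, 10 from the low one,
            # offset by 0x10000 (the first code point above the BMP).
            code_points.append(
                0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            )
            i += 2
        else:
            code_points.append(unit)
            i += 1
    return code_points


units = [int(h, 16) for h in ['D83C', 'DF29', '000A']]
print(["%04X" % cp for cp in merge_surrogates(units)])  # ['1F329', '000A']
```

On Python 2, `sys.maxunicode == 0xFFFF` indicates a narrow build, which would be one way to test the assumption above.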

@miestasmia
Owner

Could you try manually setting the input string instead of reading from stdin so we can try to figure out where the issue arises?

@cpsdqs
Author

cpsdqs commented Sep 24, 2017

replacing sys.stdin on line 31 with ['🌩'] and adding # coding=utf-8 on line 2 (because it doesn't even run without: SyntaxError: Non-ASCII character '\xf0' in file ./unilookup on line 31, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details) doesn't really work
screenshot

@miestasmia
Owner

Hm, can you try adding print char in the second for loop (at :33)?

@cpsdqs
Author

cpsdqs commented Sep 24, 2017

prints:

���
���

(all of them u+FFFD) … (side note: unilookup works fine for echo '���' | unilookup)

@miestasmia
Owner

Okay so that seems to be where it breaks, because I'm getting the individual characters on Linux. I'll look into it

@miestasmia
Owner

From what I've been able to find this is an issue with how Python 2 (on some platforms) handles Unicode. The solution here would be to read the byte stream, manually determine the byte length of each character (as done here), and then split the byte stream accordingly. This will require quite a bit of refactoring, but I'll try to get it done soonish.
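The approach described above could look roughly like the sketch below. The code linked as "here" is not available in this thread, so this is not that implementation and not unilookup's actual fix; `utf8_sequence_length` and `split_utf8` are hypothetical names. It assumes the input is UTF-8 and, to stay short, does not validate continuation bytes.

```python
def utf8_sequence_length(lead):
    """Return how many bytes a UTF-8 sequence occupies, given its lead byte."""
    if lead < 0x80:            # 0xxxxxxx: ASCII
        return 1
    if lead >> 5 == 0b110:     # 110xxxxx
        return 2
    if lead >> 4 == 0b1110:    # 1110xxxx
        return 3
    if lead >> 3 == 0b11110:   # 11110xxx
        return 4
    raise ValueError("not a UTF-8 lead byte: 0x%02X" % lead)


def split_utf8(raw):
    """Split a UTF-8 byte string into one byte string per character."""
    # bytearray yields integers on both Python 2 and Python 3.
    data = bytearray(raw)
    chars = []
    pos = 0
    while pos < len(data):
        length = utf8_sequence_length(data[pos])
        chunk = data[pos:pos + length]
        if len(chunk) != length:
            raise ValueError("truncated UTF-8 sequence at byte %d" % pos)
        chars.append(bytes(chunk))
        pos += length
    return chars


# Blowfish (four bytes), a space, the letter a, and a newline.
print(split_utf8(b"\xF0\x9F\x90\xA1 a\n"))
```

Because the splitting works on raw bytes, the result does not depend on how the interpreter stores unicode strings internally, which is the point of the proposed refactoring.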

@miestasmia miestasmia changed the title from "Characters are ignored" to "Multibyte characters ignored on OS X" Oct 4, 2017