Invalid UTF-8 (surrogates) sent to libzim #382

Open
rgaudin opened this issue Sep 3, 2024 · 1 comment
Labels
bug Something isn't working
rgaudin commented Sep 3, 2024

In this zimit run (logs gone) of https://www.meds.cl/, it seems the string added as content is not valid UTF-8 (it contains the lone surrogate U+DBFF in this case), which causes libzim to refuse it.

This SO answer suggests this could be encoded JSON.
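To illustrate how JSON could produce such a string, here is a minimal sketch using only the standard library. The payload is made up for illustration (it is not taken from the crawled site); it shows that Python's `json` module accepts an unpaired `\udbff` escape and hands back a `str` that cannot be encoded to UTF-8.

```python
import json

# Hypothetical payload: a JSON string literal with an unpaired high-surrogate escape.
raw = '"caf\\udbff"'

# json.loads accepts the lone escape and returns a str containing U+DBFF.
decoded = json.loads(raw)


def can_encode_utf8(text: str) -> bool:
    """Return True if text can be encoded to UTF-8 with the default strict handler."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


# A properly paired escape sequence is combined into one code point and encodes fine.
paired = json.loads('"\\ud83d\\ude00"')
```

Here `can_encode_utf8(decoded)` is false while `can_encode_utf8(paired)` is true, so only unpaired escapes are a problem.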

Traceback (most recent call last):
  File "/usr/bin/zimit", line 8, in <module>
    sys.exit(zimit.zimit())
             ^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/zimit/zimit.py", line 695, in zimit
    run(sys.argv[1:])
  File "/app/zimit/lib/python3.12/site-packages/zimit/zimit.py", line 616, in run
    return warc2zim(warc2zim_args)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/main.py", line 168, in main
    return converter.run()
           ^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/converter.py", line 384, in run
    self.add_items_for_warc_record(record)
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/converter.py", line 965, in add_items_for_warc_record
    raise exc
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/converter.py", line 962, in add_items_for_warc_record
    self.creator.add_item(payload_item)
  File "/app/zimit/lib/python3.12/site-packages/zimscraperlib/zim/creator.py", line 468, in add_item
    raise exc
  File "/app/zimit/lib/python3.12/site-packages/zimscraperlib/zim/creator.py", line 465, in add_item
    super().add_item(item)
  File "libzim/libzim.pyx", line 358, in libzim._Creator.add_item
RuntimeError: Traceback (most recent call last):
  File "libzim/libzim.pyx", line 121, in libzim.contentprovider_cy_call_fct
  File "libzim/libzim.pyx", line 85, in libzim.call_method
  File "/app/zimit/lib/python3.12/site-packages/zimscraperlib/zim/items.py", line 130, in get_contentprovider
    return StringProvider(content=content, ref=self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/zimscraperlib/zim/providers.py", line 36, in __init__
    super().__init__(content)
  File "libzim/libzim.pyx", line 469, in libzim.StringProvider.__init__
UnicodeEncodeError: 'utf-8' codec can't encode character '\udbff' in position 493623: surrogates not allowed

I believe this should be sorted out before the content is passed to libzim, but I haven't dug into the issue.
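One possible way to sort it out upstream, as a rough sketch only: replace any lone surrogate with U+FFFD before the string reaches the content provider. The function name `replace_lone_surrogates` is hypothetical and not part of warc2zim, zimscraperlib or python-libzim, and whether replacing (rather than dropping or rejecting) is the right policy is an open question.

```python
import re

# In a Python str, astral characters are single code points, so any code point
# in the range U+D800..U+DFFF is by definition an unpaired surrogate.
_LONE_SURROGATE = re.compile("[\ud800-\udfff]")


def replace_lone_surrogates(text: str, replacement: str = "\ufffd") -> str:
    """Return text with every lone surrogate replaced (hypothetical helper)."""
    return _LONE_SURROGATE.sub(replacement, text)


cleaned = replace_lone_surrogates("before\udbffafter")
# The cleaned string now encodes without raising UnicodeEncodeError.
encoded = cleaned.encode("utf-8")
```

Valid content, including characters outside the BMP, passes through unchanged.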

@rgaudin rgaudin added the bug Something isn't working label Sep 3, 2024
benoit74 commented Sep 3, 2024

To be precise, I believe it is not libzim itself but python-libzim that refuses the string, because Python cannot encode it to UTF-8:

Python 3.12.1 (main, Jan  9 2024, 15:41:00) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> print('\uDBFF')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udbff' in position 0: surrogates not allowed

To be analyzed, but it needs to be sorted out one way or the other.

A ZIM must contain only UTF-8, so there is no reason to pass it a Unicode surrogate, which is invalid in UTF-8.
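As a starting point for the analysis, a small diagnostic along these lines could report which character fails and what surrounds it, since the original logs are gone. This is a sketch under assumptions: `describe_encode_failure` is a hypothetical helper, and it relies only on the standard `start` and `end` attributes of `UnicodeEncodeError`.

```python
from typing import Optional


def describe_encode_failure(text: str, context: int = 20) -> Optional[str]:
    """Return a short description of the first UTF-8 encoding failure, or None."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError as exc:
        # exc.start / exc.end delimit the offending slice within text.
        begin = max(exc.start - context, 0)
        snippet = text[begin : exc.end + context]
        code_point = ord(text[exc.start])
        # repr() escapes the surrogate, so the message itself is safe to log.
        return f"U+{code_point:04X} at position {exc.start}: {snippet!r}"
    return None


report = describe_encode_failure("ab\udbffcd")
```

Logging this for the failing record would show whether the surrogate sits inside a JSON blob, inline script, or regular markup.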

@benoit74 benoit74 added this to the 2.2.0 milestone Sep 3, 2024