In this zimit run (logs gone) of https://www.meds.cl/, it seems the string added as content cannot be encoded as valid UTF-8 (it contains the lone surrogate U+DBFF in this case), so libzim refuses it.
Traceback (most recent call last):
  File "/usr/bin/zimit", line 8, in <module>
    sys.exit(zimit.zimit())
             ^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/zimit/zimit.py", line 695, in zimit
    run(sys.argv[1:])
  File "/app/zimit/lib/python3.12/site-packages/zimit/zimit.py", line 616, in run
    return warc2zim(warc2zim_args)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/main.py", line 168, in main
    return converter.run()
           ^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/converter.py", line 384, in run
    self.add_items_for_warc_record(record)
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/converter.py", line 965, in add_items_for_warc_record
    raise exc
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/converter.py", line 962, in add_items_for_warc_record
    self.creator.add_item(payload_item)
  File "/app/zimit/lib/python3.12/site-packages/zimscraperlib/zim/creator.py", line 468, in add_item
    raise exc
  File "/app/zimit/lib/python3.12/site-packages/zimscraperlib/zim/creator.py", line 465, in add_item
    super().add_item(item)
  File "libzim/libzim.pyx", line 358, in libzim._Creator.add_item
RuntimeError: Traceback (most recent call last):
  File "libzim/libzim.pyx", line 121, in libzim.contentprovider_cy_call_fct
  File "libzim/libzim.pyx", line 85, in libzim.call_method
  File "/app/zimit/lib/python3.12/site-packages/zimscraperlib/zim/items.py", line 130, in get_contentprovider
    return StringProvider(content=content, ref=self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/zimscraperlib/zim/providers.py", line 36, in __init__
    super().__init__(content)
  File "libzim/libzim.pyx", line 469, in libzim.StringProvider.__init__
UnicodeEncodeError: 'utf-8' codec can't encode character '\udbff' in position 493623: surrogates not allowed
I believe this should be sorted out before the content is passed to libzim, but I haven't dug into the issue.
To be precise, I believe it is not libzim itself but python-libzim that refuses to encode the string; plain Python raises the same error:
Python 3.12.1 (main, Jan 9 2024, 15:41:00) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> print('\uDBFF')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udbff' in position 0: surrogates not allowed
To be analyzed, but it needs to be sorted out one way or the other.
A ZIM must contain only UTF-8, so there is no reason to pass it a lone Unicode surrogate, which cannot be encoded as valid UTF-8.
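One way this could be handled before the content reaches libzim is sketched below. It is only an illustration, not the fix adopted in warc2zim or zimscraperlib: the helper names `is_utf8_encodable` and `strip_lone_surrogates` are hypothetical, and replacing the offending code points with U+FFFD is an assumption about what behavior would be acceptable.

```python
import re

# Code points in the surrogate range U+D800..U+DFFF. In a Python 3 str these
# can only be lone surrogates, since real astral characters are stored as a
# single code point.
_SURROGATES = re.compile(r"[\ud800-\udfff]")


def is_utf8_encodable(text: str) -> bool:
    """Return True if text can be encoded as strict UTF-8."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


def strip_lone_surrogates(text: str, replacement: str = "\ufffd") -> str:
    """Replace every lone surrogate with `replacement` (U+FFFD by default)."""
    return _SURROGATES.sub(replacement, text)


if __name__ == "__main__":
    content = "before \udbff after"
    print(is_utf8_encodable(content))  # the raw string fails strict encoding
    cleaned = strip_lone_surrogates(content)
    print(is_utf8_encodable(cleaned))  # the cleaned string encodes fine
```

Checking with `is_utf8_encodable` first would let the caller log the affected entry before altering it, which may matter when the surrogate sits inside a script or JSON payload rather than visible text.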
This SO answer suggests this could be encoded JSON.
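To illustrate how encoded JSON could produce such a string, here is a small sketch. It is not confirmed that this is what happens on the site in question; the payload is made up, and `decode_json_text` is a hypothetical helper. The point is only that JSON text which is pure ASCII on the wire can contain an escaped, unpaired surrogate, and Python's `json` module decodes it into a `str` that can no longer be encoded as UTF-8.

```python
import json


def decode_json_text(raw: str) -> str:
    """Decode a JSON document of the form {"text": ...} and return the text."""
    return json.loads(raw)["text"]


# Hypothetical payload: valid ASCII bytes, but the \udbff escape is a high
# surrogate with no low surrogate following it.
raw = '{"text": "\\udbff"}'
decoded = decode_json_text(raw)

# The decoded str holds a single code point, the lone surrogate U+DBFF.
print(hex(ord(decoded)))

try:
    decoded.encode("utf-8")
except UnicodeEncodeError as exc:
    # Same failure as in the traceback: surrogates not allowed
    print(exc.reason)
```

If this is indeed the cause, the original escaped form is still representable in UTF-8, so re-serializing with `json.dumps` (which escapes non-ASCII by default) instead of writing the decoded string directly might be another way around the problem.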