Getting the rough sequence of text and images on a page: How-to? #3612
The sequence of the blocks you get from The dictionaries of the second list have a key As per the image data contained in the second list, please look at this. The value of the Or, if you want to be lazy, use "json" instead of "dict" for extraction. Then that encoding has already been done for you.
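For what it's worth, here is a minimal sketch of the "dict" route, assuming the layout that PyMuPDF's `page.get_text("dict")` returns: a `"blocks"` list in which text blocks have `"type": 0` with `"lines"` and `"spans"`, and image blocks have `"type": 1` with the raw bytes under `"image"`. The helper name `blocks_to_sequence` and the sample data are made up for illustration. Joining all spans of a line back together is my own suggestion for keeping something like "8th October" in one piece; it will not necessarily reproduce the "blocks" text exactly.

```python
import base64
import json


def blocks_to_sequence(page_dict):
    """Turn a get_text("dict")-style result into an ordered text/image list.

    Hypothetical helper: it only assumes the documented dict layout
    (block "type" 0 = text, 1 = image; image bytes under "image").
    """
    sequence = []
    for block in page_dict.get("blocks", []):
        if block.get("type") == 1:
            # Image block: base64-encode the raw bytes so they fit into JSON.
            sequence.append({
                "type": "image",
                "bbox": list(block["bbox"]),
                "ext": block.get("ext"),
                "data": base64.b64encode(block["image"]).decode("ascii"),
            })
        else:
            # Text block: glue the spans of each line together, one line per row.
            lines = [
                "".join(span["text"] for span in line["spans"])
                for line in block.get("lines", [])
            ]
            sequence.append({
                "type": "text",
                "bbox": list(block["bbox"]),
                "text": "\n".join(lines),
            })
    return sequence


# Hand-made sample in the assumed layout (not real extraction output).
sample = {
    "blocks": [
        {
            "type": 0,
            "bbox": (40.0, 40.0, 200.0, 60.0),
            "lines": [
                {"spans": [{"text": "8"}, {"text": "th"}, {"text": " October"}]},
            ],
        },
        {
            "type": 1,
            "bbox": (498.9, 40.1, 576.7, 78.6),
            "ext": "png",
            "image": b"\x89PNG-placeholder-bytes",
        },
    ]
}

result = blocks_to_sequence(sample)
print(json.dumps(result, indent=2))
```

With a real document you would pass `page.get_text("dict")` to the helper instead of `sample`, where `page` is a PyMuPDF page object. Since image blocks carry their own bytes there, no lookup by ID in the images list is needed.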
I want to get the order of text and images on a page. I'm very happy with what I get from that:
The problem is that the images in the list look like this:
(498.94989013671875, 40.149940490722656, 576.7498779296875, 78.64190673828125, '<image: Indexed(255,DeviceRGB), width: 175, height: 86, bpc: 8>', 1, 1)
And the image data has IDs (I believe it's '1978' in this case):
(1978, 1983, 175, 86, 8, 'Indexed', '', 'Im0', 'FlateDecode', 0)
At the end I want a JSON list of text/image/text/image in the sequence of the blocks, but with base64 data of the images. I'm not clear how I should identify the image in the images list from the blocks list entry, as there is no id?

Yes, I could use "dict", but then I get e.g. "8th October" separated into "8", "th", "October" and lose the nice layout I get from blocks.

Any idea?