Remove image if page.get_image_info() returns multiple images at its location #3631

paulgekeler · 2024-06-27T08:06:39Z

Hi, thank you for maintaining this project. It really is exceptional.
I am trying to replace images in PDFs with different ones I have stored locally. Thanks to this GitHub discussion https://github.com/pymupdf/PyMuPDF/discussions/924#discussioncomment-7249686 I have no problem doing this as follows:

# iterating over pages here
# then get images
img_list = page.get_image_info(xrefs=True)
for img_index, img in enumerate(img_list, start=1):
    # img is a dict
    bbox = img['bbox']
    ra = page.add_redact_annot(bbox)
    page.apply_redactions(images=pymupdf.PDF_REDACT_IMAGE_REMOVE)
    # alt_manager is my own class returning new replacement images
    _, img_name = alt_manager.next_text_img_pair
    total_img_path = Path(alt_imgs_path, img_name)
    page.insert_image(bbox, filename=total_img_path)
# local function saving page using pymupdf Matrix and page.get_pixmap() -> works fine
save_page_to_png(page, page_name_img)

However, I have encountered a PDF page (see below) where get_image_info() returns multiple images for what appears as a single image on the page:

img_list = [
    {'number': 2, 'bbox': (...), 'transform': (...), 'width': 1, 'height': 224, 'colorspace': 1, 'cs-name': 'Indexed(255,ICCBased(RGB,sRGB IEC61966-2.1))', 'xres': 96, 'yres': 96, 'bpc': 8, 'size': 1081, 'digest': b'\xa5^\xda}\x12\x1e\xff\x84\x0cl\xf0\x1c3D\x9f\xac', 'xref': 111},
    {'number': 4, 'bbox': (...), 'transform': (...), 'width': 1, 'height': 224, 'colorspace': 1, 'cs-name': 'Indexed(255,ICCBased(RGB,sRGB IEC61966-2.1))', 'xres': 96, 'yres': 96, 'bpc': 8, 'size': 1081, 'digest': b'\x7f\x94\xe2=\xeeUiC!Z\xa4\xd7\x8e\x8d\xd0}', 'xref': 112},
    {'number': 6, 'bbox': (...), 'transform': (...), 'width': 1, 'height': 224, 'colorspace': 1, 'cs-name': 'Indexed(255,ICCBased(RGB,sRGB IEC61966-2.1))', 'xres': 96, 'yres': 96, 'bpc': 8, 'size': 1081, 'digest': b'\x8c\xa6\x94}6\xe4\xe0T\x84X\xf9\x94\xb7\xb4\xe2\x94', 'xref': 113},
    {'number': 8, 'bbox': (...), 'transform': (...), 'width': 1, 'height': 224, 'colorspace': 1, 'cs-name': 'Indexed(255,ICCBased(RGB,sRGB IEC61966-2.1))', 'xres': 96, 'yres': 96, 'bpc': 8, 'size': 1081, 'digest': b'\x97\xca\x8d%,\xe2\xb7\x10e\x1du\xcaI\\\xd4\xce', 'xref': 114},
    {'number': 10, 'bbox': (...), 'transform': (...), 'width': 1, 'height': 224, 'colorspace': 1, 'cs-name': 'Indexed(255,ICCBased(RGB,sRGB IEC61966-2.1))', 'xres': 96, 'yres': 96, 'bpc': 8, 'size': 1081, 'digest': b'\xed\xed\x8d\xd8\xae\xbd\x15\xa7B\xeb\xa3\x93\xa8\x03\x1c\xa0', 'xref': 115},
    {'number': 12, 'bbox': (...), 'transform': (...), 'width': 1, 'height': 224, 'colorspace': 1, 'cs-name': 'Indexed(255,ICCBased(RGB,sRGB IEC61966-2.1))', 'xres': 96, 'yres': 96, 'bpc': 8, 'size': 1081, 'digest': b'\x7f\x94\xe2=\xeeUiC!Z\xa4\xd7\x8e\x8d\xd0}', 'xref': 112},
]

I would like to replace only one of them, i.e. the actual one visible on the page. I am not familiar with how PDFs are assembled or whether images can be composed of multiple parts. Maybe that's what I'm missing.
I have tried to synchronise the source by calling page.clean_contents() beforehand, but that doesn't help.
Is there a way to recognise whether the images returned by get_image_info() actually belong to the same image? (I could do something tedious like checking if bounding boxes are close enough, but that seems prone to errors.) I know the returned images have different xrefs, so maybe they are different.

Thank you for some needed insight.
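
For reference, the bounding-box check mentioned in the question could look roughly like the sketch below. It only illustrates the idea and makes no claim about how the page is really built: the helper names `rects_touch` and `group_by_proximity` and the tolerance `tol` are made up for this example, and each entry is assumed to be a dict with a `bbox` of the form `(x0, y0, x1, y1)`, as returned by `get_image_info()`.

```python
def rects_touch(a, b, tol=1.0):
    """True if rectangles a and b (x0, y0, x1, y1) overlap or lie within tol points."""
    return (a[0] <= b[2] + tol and b[0] <= a[2] + tol
            and a[1] <= b[3] + tol and b[1] <= a[3] + tol)


def group_by_proximity(img_list, tol=1.0):
    """Cluster image dicts whose bboxes touch, directly or via other members."""
    groups = []  # each group is a list of image dicts
    for img in img_list:
        hits, rest = [], []
        for group in groups:
            touches = any(rects_touch(img["bbox"], m["bbox"], tol) for m in group)
            (hits if touches else rest).append(group)
        # Merge the new image with every group it touches.
        merged = [img] + [m for group in hits for m in group]
        groups = rest + [merged]
    return groups
```

A page where this yields one large group of narrow strips (e.g. width 1, height 224) would hint that the visible picture is assembled from several image objects, but the tolerance is a guess and needs tuning per document.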

JorjMcKie (Maintainer) · 2024-06-27T08:28:03Z

Without the PDF itself, there is no way to provide definitive advice, but you have a couple of options here:

1. Instead of the image removal option PDF_REDACT_IMAGE_REMOVE, you could use PDF_REDACT_IMAGE_REMOVE_UNLESS_INVISIBLE, which should remove only visible images (never checked it out myself).
2. Do not use redactions at all, but "delete" images by xref via page.delete_image(xref). This does not really delete anything: the image is replaced by a tiny transparent (invisible) image (1x1 pixels).
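
The second option could be sketched as follows. This is only an illustration: the helper name `delete_unique_xrefs` is made up, `page` is assumed to be a PyMuPDF page (anything with a `delete_image(xref)` method works here), and `img_list` is assumed to come from `page.get_image_info(xrefs=True)`. Each xref is handled once, since the listing in the question shows xref 112 twice, and entries with an xref of 0 are skipped on the assumption that they carry no usable xref.

```python
def delete_unique_xrefs(page, img_list):
    """Call page.delete_image() once per distinct xref; return the xrefs handled."""
    done = []
    for img in img_list:
        xref = img.get("xref", 0)
        if xref <= 0 or xref in done:
            continue  # no usable xref, or this image object was already handled
        page.delete_image(xref)
        done.append(xref)
    return done
```

For the first option, the only change to the code in the question would be passing `images=pymupdf.PDF_REDACT_IMAGE_REMOVE_UNLESS_INVISIBLE` to `page.apply_redactions()`. Because `delete_image()` replaces the image object itself, other pages that show the same xref would presumably be affected as well.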

1 reply

paulgekeler Jun 27, 2024
Author

Thanks, I'll give it a shot deleting via xrefs and then inserting.
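
The "then inserting" step might look like the sketch below, under the loud assumption that all entries passed in together make up the one visible picture. The helper names `union_bbox` and `insert_replacement` are made up for illustration. `page` is assumed to be a PyMuPDF page whose images have already been deleted by xref, and `filename` is the path of the replacement image.

```python
def union_bbox(img_list):
    """Smallest rectangle (x0, y0, x1, y1) covering the bbox of every entry."""
    boxes = [img["bbox"] for img in img_list]
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def insert_replacement(page, img_list, filename):
    """Insert one replacement image covering the area of all given entries."""
    rect = union_bbox(img_list)
    # A plain 4-tuple is passed as the rectangle, as in the original snippet.
    page.insert_image(rect, filename=str(filename))
    return rect
```

Inserting once into the combined rectangle avoids placing a separate copy of the replacement into every narrow strip. Note that `insert_image()` keeps the aspect ratio of the new image by default, so it may not fill the rectangle completely.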
