Unable to get all images from a page #2714

mahdyshabeeb · 2023-10-03T19:45:50Z

mahdyshabeeb
Oct 3, 2023

When using the page.get_images() method, only some of the images on the given page are returned. Is there a parameter I need to add to get all the images, or is there a different method I can use?

JorjMcKie · 2023-10-03T20:41:12Z

JorjMcKie
Oct 3, 2023
Maintainer

A typical "Discussions" post. Let me convert it first.

0 replies

JorjMcKie · 2023-10-03T20:51:41Z

JorjMcKie
Oct 3, 2023
Maintainer

There may be several reasons:

"Inline" images exist. They have no xref and exist only inside the page appearance source (/Contents objects). You can still extract them via page.get_text("dict") (or "rawdict").
Vector graphics exist. The more complex ones may look like images, but they aren't, and they also have no xref. You can extract those via page.get_drawings().
Some PDF creators make annotations with an image instead of a fill color (e.g. buttons). These images do not appear in page.get_images() although they do have an xref. We do not yet support locating them.
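
A minimal sketch of how one might check which of these cases applies to a given page. The helper name image_inventory is hypothetical; it only relies on the page methods named above and assumes that image blocks in the "dict" output carry "type" == 1, as used later in this thread.

```python
def image_inventory(page):
    """Count the kinds of image-like content found on one page."""
    # Images listed in the page's object definition; item[0] is the xref.
    listed_xrefs = {item[0] for item in page.get_images()}
    # Image blocks that the text extractor reports (includes inline images).
    blocks = page.get_text("dict")["blocks"]
    image_blocks = [b for b in blocks if b["type"] == 1]
    # Vector graphics: these have no xref and are not images at all.
    drawings = page.get_drawings()
    return {
        "listed": len(listed_xrefs),
        "displayed": len(image_blocks),
        "drawings": len(drawings),
    }
```

If "displayed" is larger than "listed", inline images are a likely cause. If something that looks like a picture shows up in neither count, it is probably among the drawings.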

1 reply

mahdyshabeeb Oct 4, 2023
Author

You are right. The images I was missing are vector graphics, which is why they were not returned.

However, for some reason, the method page.get_text("dict") does not return all the images returned by page.get_images(). On the other hand, using "html" instead of "dict" returns all the images.

Now I have 3 questions (please let me know if I should open a new discussion):

Is it possible to convert these drawings to an image or images? I applied the code snippet provided in the documentation (with minor modifications), which made a new PDF file out of them, but my question is whether I can decode them to get their binary representation (which could, for example, be displayed by PIL.Image.open()).
Is there a known reason why page.get_text("dict") does not return all images?
Is there a way to get the positions of the images using the page.get_images() method or should I use page.get_text() instead?

JorjMcKie · 2023-10-04T09:23:59Z

JorjMcKie
Oct 4, 2023
Maintainer

Your questions:

However, for some reason, the method page.get_text("dict") does not return all the images returned by page.get_images(). On the other hand, using "html" instead of "dict" returns all the images.

This sounds a bit weird. get_images() (same as doc.get_page_images()) is just a report of what the PDF object definition of the page has to say. Depending on how the PDF was made, this list may sometimes reference all images of the full PDF, not only the ones that this page displays.
There is a function, page.clean_contents(), which cleans up such situations.
page.get_image_info() reports meta information of each image actually displayed (xref-ed or not). This should be the same as the respective subset of page.get_text("dict").
If html output returns more than these, then it may only be caused by images occurring in annotations as described above.
Otherwise please send me the example.
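
To see the difference between "listed" and "actually displayed" on a concrete page, something like the following sketch may help. The function name is hypothetical. It assumes that page.get_image_info(xrefs=True) adds an "xref" key to each entry, with 0 meaning that no xref could be determined (for example for inline images).

```python
def compare_listed_and_displayed(page):
    """Compare the page's image list with the images it really shows."""
    # xrefs mentioned in the page object definition
    listed = {item[0] for item in page.get_images()}
    # meta information of each displayed image, including its xref if known
    infos = page.get_image_info(xrefs=True)
    displayed = {info["xref"] for info in infos if info["xref"] > 0}
    return {
        # listed by the page definition but never shown on this page
        "listed_not_displayed": sorted(listed - displayed),
        # shown, but without a detectable xref (e.g. inline images)
        "without_xref": sum(1 for info in infos if info["xref"] == 0),
    }
```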

Image positions on the page:
Use page.get_image_rects(xref). This returns the bbox plus optionally a matrix - for each occurrence of the same image on that page (which does happen). So this is a list.
The optional matrix contains information about what was happening to the original image to fit it into the bbox: scaling, rotation, flipping, etc. Please see documentation for details.
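
As a rough illustration of the above, the sketch below collects all placements of every listed image on a page. The helper name image_placements is made up for this example. It assumes that passing transform=True makes get_image_rects() return (bbox, matrix) pairs instead of plain rectangles.

```python
def image_placements(page):
    """Map each image xref to the list of its placements on the page."""
    placements = {}
    for item in page.get_images():
        xref = item[0]  # first entry of each item is the image xref
        # One (bbox, matrix) pair per occurrence of the image on this page.
        placements[xref] = page.get_image_rects(xref, transform=True)
    return placements
```

An image that is listed but not displayed by the page should simply come back with an empty list.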

Recovering (extracting) Images:
There are several options:

Take the binary returned by get_text("[raw]dict"). Please note that unfortunately there is no way to access transparency here, or even to know whether the original image was transparent. But independently of that, you also get the appropriate file extension (like "png"). You can save the image via pathlib.Path(f"image.{img['ext']}").write_bytes(img["image"]). Or open it as a PIL Image: define a file-like object first using fp = io.BytesIO(img["image"]), then pil_img = PIL.Image.open(fp). Or make a fitz.Pixmap from it: pix = fitz.Pixmap(img["image"]).
If you have the image xref, you can do img = doc.extract_image(xref), which returns a dictionary with image meta information and the binary. Process this the same way as before. Here, you do have access to transparency information: there is an "smask" key in the dictionary. If this is positive, then it is the xref of the transparency object (PDF separates this unfortunately). Please see the documentation of the method to learn how the full original image can be recovered in this case.
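
A hedged sketch of the second option, including the transparency case. The function names are hypothetical. Combining image and mask via fitz.Pixmap(pix, mask) follows the approach described in the PyMuPDF documentation as far as I know; please check the documentation for the exact preconditions before relying on it.

```python
def extract_with_mask(doc, xref):
    """Return (image_bytes, extension, mask_bytes or None) for an image xref."""
    img = doc.extract_image(xref)
    smask = img.get("smask", 0)
    # A positive "smask" is the xref of the separate transparency object.
    mask_bytes = doc.extract_image(smask)["image"] if smask > 0 else None
    return img["image"], img["ext"], mask_bytes


def to_pixmap(image_bytes, mask_bytes=None):
    """Build a Pixmap, adding the mask as alpha channel when one is given."""
    import fitz  # PyMuPDF, imported here so the helper above works without it

    pix = fitz.Pixmap(image_bytes)
    if mask_bytes is not None:
        mask = fitz.Pixmap(mask_bytes)
        pix = fitz.Pixmap(pix, mask)  # assumption: copy of pix with mask applied
    return pix
```

Usage could look like data, ext, mask = extract_with_mask(doc, xref) followed by to_pixmap(data, mask).save("image.png"), where doc is an opened document.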

18 replies

JorjMcKie Nov 7, 2023
Maintainer

Sorry, should have thought of the infinite rectangle in the beginning.

benmagos Nov 7, 2023

Many thanks for the context!

mahdyshabeeb Nov 8, 2023
Author

Thanks a lot Jorj! This is exactly what I was looking for.

maxjeblick May 24, 2024

So, page.get_image_info() is roughly equivalent to (ignoring potential different return types)

blocks = page.get_text("dict", sort=True, clip=fitz.INFINITE_RECT())["blocks"]
image_blocks = [block for block in blocks if block["type"] == 1]

?

JorjMcKie May 24, 2024
Maintainer

@maxjeblick Correct! Another difference is that get_image_info does not return the image binaries ...

enlacroix · 2024-09-13T12:31:38Z

enlacroix
Sep 13, 2024

How to extract inline images via page.get_image_info() or page dict?

4 replies

JorjMcKie Sep 13, 2024
Maintainer

How to extract inline images via page.get_image_info() or page dict?

This only works via page.get_image_info() because that method inspects the display instructions (the /Contents) of the page.
The Page dictionary is informational only in comparison. It only enumerates images that have an xref (inliners don't). Also, there is no guarantee that all images enumerated here are really displayed by the page.
Plus, there may be images having an xref that are displayed but are not in the page object definition. Example: images in annotations.

enlacroix Sep 13, 2024

Thank you for such a quick answer. I read the whole discussion and understood the specifics of inline images.
But how to save inline images, that's my question.
For images with xref I can write:

  import io
  from PIL import Image

  # pdf_file is the opened document, page is one of its pages
  for img in page.get_images():
      xref = img[0]  # first entry of each item is the image xref
      # bboxes of every occurrence of this image on the page
      positions = page.get_image_rects(xref)
      base_image = pdf_file.extract_image(xref)
      image_bytes = base_image["image"]
      image_ext = base_image["ext"]
      extracted_image = Image.open(io.BytesIO(image_bytes)).convert("RGB")

But I see that not all images were extracted. If I use get_image_info(), I see that this method reports the correct number of images on the page.
To sum up, how do I save inline images?

JorjMcKie Sep 13, 2024
Maintainer

All image items of a page are also contained in page.get_text("dict") and have item["image"] and item["ext"]. So you can do pathlib.Path(f"my_image.{item['ext']}").write_bytes(item["image"]).
This should save the image as a file with the appropriate extension.
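
Put together, a small sketch of this could look as follows. The function name and the file naming scheme are made up for illustration. Image blocks are selected via "type" == 1, as in the snippet earlier in this thread.

```python
import pathlib


def save_page_images(page, out_dir, prefix="img"):
    """Write every image block of a page to out_dir and return the paths."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    blocks = page.get_text("dict")["blocks"]
    image_blocks = [b for b in blocks if b["type"] == 1]
    saved = []
    for i, block in enumerate(image_blocks):
        # "ext" is the suggested file extension, "image" the binary content.
        target = out / f"{prefix}-{i}.{block['ext']}"
        target.write_bytes(block["image"])
        saved.append(target)
    return saved
```

Since this goes through the text extraction output, it should also cover inline images, which have no xref.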

Sorry - I gave the wrong info about the get_image_info(): this method only provides meta-information, not the image binary itself.

JorjMcKie Sep 13, 2024
Maintainer

If an image has transparency (inliners don't), then there currently is no way to save the image as a transparent image, because we currently haven't implemented access to this so-called "mask" information.
