Extract Text from Annotations and corresponding highlighted Text #1671

M-M99 · 2022-04-11T05:06:13Z

Hello,

can somebody explain how I can get the text that is highlighted in a PDF?

I know that I can get the annotation text by looping over page.annots() and calling annot.get_text().
But how can I get the corresponding text that is highlighted in the PDF?

Thanks!

Answered by JorjMcKie

JorjMcKie · 2022-04-11T10:41:24Z

JorjMcKie
Apr 11, 2022
Maintainer

There is one thing you must keep in mind:
Annotations are not part of the page's contents. Imagine them like dust on a nice painting on the wall. The items shown in the painting are not aware of any dust that may cover them. And like dust, annotations can be wiped out without changing the page itself.
You get the idea.
Accordingly, an annotation may cover just anything: text, drawings, images ... or nothing.
Each annotation has an associated rectangle, annot.rect, which can be used to find out what lies underneath it. For example, to get any covered text: text = page.get_text(clip=annot.rect).

But of course, highlight annotations (like their friends: underlines, strike-throughs, squiggles) are mostly used to mark text. So you might want to do this:

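import fitz  # PyMuPDF
# "page" is a fitz.Page: print the page text under each highlight or underline
for annot in page.annots(types=(fitz.PDF_ANNOT_HIGHLIGHT, fitz.PDF_ANNOT_UNDERLINE)):
    print(page.get_text(clip=annot.rect))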
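
A highlight that runs over several lines has an annot.rect that is the bounding box of the whole mark, so clipping to it can also return neighbouring text that was not marked. One way around this, sketched below, is to read annot.vertices, which for text markup annotations holds four points per marked text span, and to clip to each of those quads in turn. The helper names quad_rects and highlighted_text are made up for this example, and the result can still include characters that merely touch a quad.

```python
def quad_rects(vertices):
    """Turn a flat list of points (4 per quad) into (x0, y0, x1, y1) tuples."""
    rects = []
    for i in range(0, len(vertices) - 3, 4):
        quad = vertices[i:i + 4]
        xs = [p[0] for p in quad]
        ys = [p[1] for p in quad]
        rects.append((min(xs), min(ys), max(xs), max(ys)))
    return rects


def highlighted_text(page, annot):
    """Collect the page text under each quad of a text markup annotation."""
    # Fall back to the overall rectangle if the annotation has no vertices.
    rects = quad_rects(annot.vertices or []) or [tuple(annot.rect)]
    parts = [page.get_text("text", clip=r).strip() for r in rects]
    return " ".join(p for p in parts if p)
```

With a real document, highlighted_text(page, annot) would be called inside the same kind of loop over page.annots(...) as above.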

The annotation method annot.get_text() only delivers text that is stored as part of certain annotation types: free text and stamps. As mentioned above, this text is not part of the page, but part of the annotation.

4 replies

M-M99 Apr 11, 2022
Author

Hello JorjMcKie,
thanks for your detailed answer!
Helped me a lot!

tristone13th May 15, 2022

Hello, sometimes we add annotation text to highlight annotations (maybe we can call it a 'comment' on an annotation). Could you show me how to extract that text? Thanks!

JorjMcKie May 15, 2022
Maintainer

Hello, sometimes we add annotation text to highlight annotations (maybe we can call it a 'comment' on an annotation). Could you show me how to extract that text? Thanks!

You mean popup annotations? If an annotation has a popup, the popup text is also stored in the annotation's info["content"].
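
To pair each comment with the text it refers to, something like the following sketch could work. It assumes page is a PyMuPDF page object; the function name comments_with_text is invented for this example, and clipping to annot.rect simply returns whatever page text lies inside that rectangle.

```python
def comments_with_text(page):
    """Return (annotation type, comment, covered text) for annotations with a comment."""
    results = []
    for annot in page.annots():
        comment = annot.info.get("content", "")
        if not comment:
            continue  # no comment stored on this annotation
        covered = page.get_text("text", clip=annot.rect).strip()
        results.append((annot.type[1], comment, covered))
    return results
```

Annotations without any comment text are skipped, so the list only contains marks that actually carry a note.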

tristone13th May 15, 2022

It really helped, many thanks!

JorjMcKie · 2022-04-11T10:43:27Z

JorjMcKie
Apr 11, 2022
Maintainer

BTW: your post is a typical Discussions item, not an issue.

0 replies

sunic23 · 2022-09-23T09:20:35Z

sunic23
Sep 23, 2022

Hi Jorj,
I am trying to create a FreeText callout annotation in PyMuPDF 1.20.1, but I cannot get it working, and the documentation of the annotation type is not quite clear to me:
A number and one or two strings describing the annotation type, like [2, ‘FreeText’, ‘FreeTextCallout’]. The second string entry is optional and may be empty. See the appendix Annotation Types for a list of possible values and their meanings.

Could you please provide details on how to create a FreeText callout annotation?

1 reply

JorjMcKie Sep 23, 2022
Maintainer

If a callout annotation already exists, PyMuPDF is able to report it. This is what you have found in the documentation.

But there currently is no support to create a callout annotation, sorry.
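
Reporting an existing callout could look roughly like this. The sketch relies only on annot.type having the shape given in the documentation excerpt quoted in the question; the helper name find_callouts is hypothetical.

```python
def find_callouts(page):
    """Return the annotations whose type list carries the 'FreeTextCallout' entry."""
    callouts = []
    for annot in page.annots():
        kind = annot.type  # e.g. [2, 'FreeText', 'FreeTextCallout']
        if len(kind) > 2 and kind[2] == "FreeTextCallout":
            callouts.append(annot)
    return callouts
```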

tofDou · 2024-05-29T10:01:08Z

tofDou
May 29, 2024

I found something strange when extracting comments from a PDF (made with Acrobat).
annot.info["content"] is "empty" (in fact, it holds nothing relevant) once the state (Accepted, Canceled, Closed) of the comment has been modified. Even if the state is reset to "none", the comment text is no longer available in the content entry.

0 replies
