The extracted text gets reversed after interacting with a rect #3590

vinniec2 · 2024-06-17T13:23:20Z


I tried two ways to extract a single piece of text from a PDF page that contains tables:

1) removing the bounding boxes of the tables with the apply_redactions() method and then calling get_text() to get the text.
2) using get_textbox() on a rectangle covering the area outside the tables.

Both methods work, but they return the text in reverse order, from the bottom row up!

If I instead use get_text() without apply_redactions(), the text is returned in the right order, but it includes the unwanted table text.

Do I have to reorder the text myself with splitlines() and reverse()?

Answered by JorjMcKie


JorjMcKie · 2024-06-17T15:47:39Z

Maintainer

Why don't you simply extract via page.get_text(sort=True)?
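A minimal sketch of this suggestion, assuming a recent PyMuPDF release (importable as `pymupdf`; older releases use `fitz`). The helper names `extract_sorted` and `demo` and the `path` argument are hypothetical, chosen only for illustration. With `sort=True`, `get_text()` orders its output by vertical and then horizontal position, which in many cases is enough to restore a natural reading order.

```python
def extract_sorted(page):
    """Return the page's plain text, sorted top-to-bottom, then left-to-right.

    `page` is expected to be a PyMuPDF Page object.
    """
    # sort=True asks PyMuPDF to order the output by position on the page
    # instead of the order in which the text is stored in the PDF.
    return page.get_text("text", sort=True)


def demo(path):
    """Print the sorted text of every page (hypothetical helper)."""
    import pymupdf  # PyMuPDF; imported here so the helper above stays dependency-free

    with pymupdf.open(path) as doc:
        for page in doc:
            print(extract_sorted(page))
```

Calling `demo("some.pdf")` on your own file should print each page in reading order, provided the layout is simple enough for positional sorting to work.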

1 reply

vinniec2 Jun 17, 2024
Author

It seems to work for page.get_text(), but not for page.get_textbox(rect), which does not seem to have a sort parameter.

However, I looked at the parameters of get_text(), and it also has a clip parameter, which seems to do exactly what get_textbox() does while also supporting sort.
So can I use page.get_text(clip=rect, sort=True) instead of get_textbox(rect) without any problems?
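The call asked about here can be sketched as follows. This is an illustration under assumptions, not a confirmed drop-in replacement: `extract_region`, `demo`, `path`, and the default rectangle coordinates are hypothetical, and the result may differ from `get_textbox(rect)` in whitespace or line-break details.

```python
def extract_region(page, rect):
    """Return the text inside `rect` on `page`, in positional reading order.

    `page` is expected to be a PyMuPDF Page; `rect` can be a Rect or an
    (x0, y0, x1, y1) sequence.
    """
    # clip restricts extraction to the rectangle; sort=True orders the
    # result top-to-bottom, then left-to-right.
    return page.get_text("text", clip=rect, sort=True)


def demo(path, rect=(50, 50, 300, 200)):
    """Print the sorted text of one region on the first page (hypothetical helper)."""
    import pymupdf  # PyMuPDF; older releases are imported as `fitz`

    with pymupdf.open(path) as doc:
        page = doc[0]
        print(extract_region(page, pymupdf.Rect(rect)))
```

The rectangle is given in page coordinates, so the default above only makes sense as a placeholder; pass the area outside the tables for your own document.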

vinniec2 · 2024-06-22T08:40:03Z

Author

I have been using page.get_text(clip=rect, sort=True) and it works as intended!
Thanks

0 replies
