page.find_tables how to extract words in table cell #3768

wangqiangJN · 2024-08-07T09:25:42Z

Is your feature request related to a problem? Please describe.
PyMuPDF version 1.24.9

1. I want to parse a PDF that contains tables as well as other text.
2. page.find_tables extracts the text of each cell, but when a cell contains several words they come back as a single string. For example, for a cell containing
price:520 people:bob
I expect the result split into words:
price
520
people
bob
but the current result is:
price:520 people:bob
3. So I want to split the cell content into words and get the bbox of each word. Is there a function or another way to do this?

JorjMcKie · 2024-08-07T16:32:44Z

Maintainer

This feature is already there:
Simply extract the text using each table cell as the clip rectangle, like this: cell_words = page.get_text("words", clip=cell).

A minor issue may arise when the table has very narrow cell borders. The table finder may then assign content to a cell that technically is not completely inside it. General text extraction is very strict and discards everything that is not completely inside the clip rectangle.
If this happens, just enlarge the cell a bit in each direction when you use it as a clip, i.e. use clip=pymupdf.Rect(cell)+(-3,-3,3,3).
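
Putting the two hints above together, a minimal sketch could look like the following. The helper names padded_clip and words_by_cell, the 3-point default margin, and the file name "input.pdf" are illustrative assumptions, not part of PyMuPDF. The sketch relies only on table.cells holding (x0, y0, x1, y1) bboxes and on page.get_text("words", clip=...) accepting a rect-like tuple.

```python
def padded_clip(cell, margin=3.0):
    """Return the cell bbox (x0, y0, x1, y1) enlarged by `margin` points on every side."""
    x0, y0, x1, y1 = cell
    return (x0 - margin, y0 - margin, x1 + margin, y1 + margin)


def words_by_cell(page, table, margin=3.0):
    """Map each cell bbox of `table` to the words found inside it.

    Each word is a tuple that starts with (x0, y0, x1, y1, text, ...),
    as returned by page.get_text("words").
    """
    result = {}
    for cell in table.cells:
        if cell is None:  # defensive: skip entries without a bbox
            continue
        clip = padded_clip(cell, margin)
        result[tuple(cell)] = page.get_text("words", clip=clip)
    return result


# Usage sketch (needs PyMuPDF and a real file; "input.pdf" is a placeholder):
#   import pymupdf
#   page = pymupdf.open("input.pdf")[0]
#   for table in page.find_tables().tables:
#       for cell, words in words_by_cell(page, table).items():
#           for x0, y0, x1, y1, text, *_ in words:
#               print(cell, text, (x0, y0, x1, y1))
```

Note that "words" extraction splits on whitespace, so a cell containing price:520 people:bob would most likely yield the two words price:520 and people:bob, each with its own bbox. Splitting further at the colon would need an extra step on the text, and the bbox would then still cover the whole unsplit word.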
