How to extract pdf page text line by line? #3552
Replies: 2 comments
-
This is not a bug, but there is a way to get correct results. Please continue in the Discussions tab.
-
We are in the process of providing a new line extraction method. As you correctly write, the current approach does not solve the problem when text has been stored in some crazy sequence. The new method will work a lot better and, for instance, will also extract text correctly from multi-column pages.
The new line extraction method is already available in our new package pymupdf4llm, as follows:
import pymupdf
from pymupdf4llm.helpers.get_text_lines import get_text_lines
doc = pymupdf.open("UMNwriteup.pdf")
page = doc[0]  # first page of the document
text = get_text_lines(page)
print(text)
This will produce the following output:
So in this example, the intended page layout is undetectable: 2 columns? A 2-column table? The There also is a parameter
In addition to
-
I am trying to extract pdf text line by line.
I have tried the following on the attached file UMNwriteup.pdf:
Option 1
page.get_text('text').split("\n")
but that results in some lines being broken up into chunks (because the spacing between words in one sentence is too large, a newline character is inserted).
Option 2
page.get_text('blocks')
That is more towards what I'm looking for, but some chunks (multi-line sentences) are intelligently grouped together.
Option 3
This results in output similar to option 2.
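Independent of the three options above, here is a minimal sketch of how the lines that MuPDF itself detects can be read from the nested structure returned by page.get_text("dict") (blocks, then lines, then spans). The helper name lines_from_page_dict is hypothetical, and the function only assumes the documented dictionary layout. Since it relies on MuPDF's own line segmentation, it may show the same splits at large word gaps as Option 1.

```python
from typing import Any, Dict, List


def lines_from_page_dict(page_dict: Dict[str, Any]) -> List[str]:
    """Flatten a page.get_text("dict") result into one string per detected line.

    Hypothetical helper: it only walks the blocks -> lines -> spans nesting.
    """
    lines: List[str] = []
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            text = "".join(span["text"] for span in line["spans"])
            if text.strip():  # ignore lines that contain only whitespace
                lines.append(text)
    return lines


# Usage with PyMuPDF (not executed here):
#   import pymupdf
#   page = pymupdf.open("UMNwriteup.pdf")[0]
#   for line in lines_from_page_dict(page.get_text("dict")):
#       print(line)
```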
So how do I extract text line by line, without any chunking / blocks behind the scenes?
If I could stop newline characters from being inserted between two words that are separated by blank space (even though they are at the same bbox height), that should solve this for me.
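The idea of keeping words at the same bbox height on one line can be sketched as follows. This is not the pymupdf4llm implementation, only an illustration: the function name group_words_into_lines and the tolerance value are assumptions, and the input is expected in the tuple layout returned by page.get_text("words"), i.e. (x0, y0, x1, y1, word, block_no, line_no, word_no).

```python
from typing import List, Sequence, Tuple

# Layout of one item from page.get_text("words"):
# (x0, y0, x1, y1, word, block_no, line_no, word_no)
Word = Tuple[float, float, float, float, str, int, int, int]


def group_words_into_lines(words: Sequence[Word], tolerance: float = 3.0) -> List[str]:
    """Join words whose bottom edges (y1) lie within `tolerance` points.

    Hypothetical helper; `tolerance` is an assumed value to tune per document.
    """
    # Order top to bottom by bottom edge, then left to right.
    ordered = sorted(words, key=lambda w: (w[3], w[0]))
    lines: List[List[Word]] = []
    for word in ordered:
        # Compare against the first word of the current line.
        if lines and abs(word[3] - lines[-1][0][3]) <= tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, restore reading order by the left edge (x0).
    return [" ".join(w[4] for w in sorted(line, key=lambda w: w[0])) for line in lines]


# Usage with PyMuPDF (not executed here):
#   import pymupdf
#   page = pymupdf.open("UMNwriteup.pdf")[0]
#   for line in group_words_into_lines(page.get_text("words")):
#       print(line)
```

Because the grouping looks only at vertical position, a wide gap between two words no longer splits a line. The same property means that text from neighboring columns at the same height is merged into one line, so this sketch does not address multi-column pages.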
Hi @JorjMcKie Thanks for any help.