How to extract pdf page text line by line? #3552

mikejokic · 2024-06-05T04:37:31Z

I am trying to extract pdf text line by line.

I have tried

import fitz
doc = fitz.open("UMNwriteup.pdf")
page = doc.load_page(0)

Option 1
page.get_text('text').split("\n")

but that results in some lines being broken up into chunks (because the spacing between words in one sentence is too large, a newline character is inserted).

Option 2
page.get_text('blocks')

That is more towards what I'm looking for, but some chunks (multi-line sentences) are intelligently grouped together.

Option 3


dictionary_elements = page.get_text('dict')
for block in dictionary_elements['blocks']:
    line_text = ''
    for line in block['lines']:
        for span in line['spans']:
            line_text += ' ' + span['text']

This results in output similar to option 2.
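For comparison, here is a minimal sketch of Option 3 that collects text per line instead of per block. The helper name lines_from_dict and the sample dictionary are hypothetical; the sample only mimics the blocks / lines / spans nesting of page.get_text('dict'). As far as I understand, the lines in 'dict' are the same ones that Option 1 separates with newlines, so this alone probably would not merge the chunks that sit at the same height.

```python
def lines_from_dict(page_dict):
    """Return one string per 'dict' line, joining the texts of its spans."""
    result = []
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so fall back to an empty list.
        for line in block.get("lines", []):
            result.append("".join(span["text"] for span in line["spans"]))
    return result


# Hypothetical stand-in for page.get_text('dict'), for illustration only.
sample = {
    "blocks": [
        {"type": 0, "lines": [
            {"spans": [{"text": "Patient Name: "}]},
            {"spans": [{"text": "Rogers, "}, {"text": "Pamela"}]},
        ]},
        {"type": 1},  # image block without "lines"
    ]
}
print(lines_from_dict(sample))
```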

So how do I extract text line by line, without any chunking / blocks behind the scenes?

If I could stop newline characters from being inserted between two words that are separated by blank space (even though they are at the same bbox height), that should solve this for me.

Hi @JorjMcKie Thanks for any help.

JorjMcKie · 2024-06-05T09:21:32Z

Maintainer

This is no bug. But there is a way to get correct results. Please continue in the Discussions tab.

JorjMcKie · 2024-06-05T10:18:13Z

Maintainer

We are in the process of providing a new Page method or a variant of page.get_text(sort=True). This method will walk through the page's text spans and re-compose text lines based on the span coordinates.

As you correctly write, the current approach does not solve the problem when text has been stored in some crazy sequence.

The new method will work a lot better and, for instance, also extract text correctly for multi-column pages.

I have to add, however, that there may still be situations that go beyond even these advanced capabilities. For example, the sequence of single characters may have been completely scrambled. Or the page layout may be ambiguous, in that it is simply undetectable whether we have a 2-column page or a table with two columns, etc.
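To illustrate the general idea of re-composing lines from span coordinates, here is a minimal sketch. It is not the pymupdf4llm implementation: the function name group_spans_into_lines, the tolerance value and the sample spans are assumptions, and spans are taken to be dictionaries with "bbox" (x0, y0, x1, y1) and "text" keys as in the 'dict' output.

```python
def group_spans_into_lines(spans, tolerance=3.0):
    """Group spans with nearly equal bottom coordinates, ordered left to right."""
    lines = []  # each entry is a list of spans forming one line
    # Walk the spans top to bottom (by bottom edge), then left to right.
    for span in sorted(spans, key=lambda s: (s["bbox"][3], s["bbox"][0])):
        # Compare against the bottom edge of the first span in the current line.
        if lines and abs(span["bbox"][3] - lines[-1][0]["bbox"][3]) <= tolerance:
            lines[-1].append(span)
        else:
            lines.append([span])
    return [
        " ".join(s["text"] for s in sorted(line, key=lambda s: s["bbox"][0]))
        for line in lines
    ]


# Hypothetical spans stored out of reading order, for illustration only.
spans = [
    {"bbox": (300, 100, 380, 112), "text": "Rogers, Pamela"},
    {"bbox": (72, 101, 150, 113), "text": "Patient Name:"},
    {"bbox": (72, 120, 100, 132), "text": "Date:"},
    {"bbox": (300, 120, 340, 132), "text": "6/2/04"},
]
for line in group_spans_into_lines(spans):
    print(line)
```

A single tolerance on the bottom edge is the simplest possible criterion; mixed font sizes or superscripts would likely need a comparison of vertical overlap instead.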

The new line extraction method is already available in our new package pymupdf4llm as follows:

import pymupdf
from pymupdf4llm.helpers.get_text_lines import get_text_lines

doc = pymupdf.open("UMNwriteup.pdf")
page = doc[0]
text = get_text_lines(page)
print(text)

This will produce the following output:

Sample Written History and Physical Examination 

History and Physical Examination	Comments
Patient Name: 	Rogers, Pamela 
Date:	6/2/04 
Referral Source:	Emergency Department 
Data Source: 	Patient 
Chief Complaint & ID:  Ms. Rogers is a 56 y/o WF 	Define the reason for the patient’s visit as who has been 
having chest pains for the last week. 	specifically as possible.  
History of Present Illness
This is the first admission for this 56 year old woman, 	Convey the acute or chronic nature of the problem and 
who states she was in her usual state of good health until   	establish a chronology. 
one week prior to admission.  At that time she noticed the 
abrupt onset (over a few seconds to a minute) of chest pain 	onset 
...

So in this example, the intended page layout is undetectable: 2 columns? 2 column table?

get_text_lines at least puts text pieces / spans on one line if their coordinates suggest this. Large inter-span distances are reflected by tab characters "\t", so there is some indication.
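Because the large gaps show up as tab characters, the extracted string can be split into cells afterwards. The following is only a sketch of such post-processing: split_cells is a hypothetical helper, and the sample string is a shortened copy of the output shown above.

```python
def split_cells(text, sep="\t"):
    """Split extracted text into lines, and each line into cells at `sep`."""
    return [
        [cell.strip() for cell in line.split(sep)]
        for line in text.splitlines()
        if line.strip()  # skip empty lines
    ]


# Shortened sample of the extracted text, for illustration only.
sample = (
    "Patient Name: \tRogers, Pamela \n"
    "Date:\t6/2/04 \n"
    "History of Present Illness\n"
)
for cells in split_cells(sample):
    print(cells)
```

Lines without a tab come back as a single cell, which may help to tell headings apart from two-part lines.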

There also is a parameter sep= in the method which will separate text spans by a desired string. For example:

text2 = get_text_lines(page, sep="|")
print(text2)

Sample Written History and Physical Examination 

History and Physical Examination|Comments
Patient Name: |Rogers, Pamela 
Date:|6/2/04 
Referral Source:|Emergency Department 
Data Source: |Patient 
Chief Complaint & ID:  Ms. Rogers is a 56 y/o WF |Define the reason for the patient’s visit as who has been 
having chest pains for the last week. |specifically as possible.  
History of Present Illness
This is the first admission for this 56 year old woman, |Convey the acute or chronic nature of the problem and

In addition to get_text_lines(), there is also a lower-level method, pymupdf4llm.helpers.get_text_lines.get_raw_lines.
Method get_raw_lines() accepts a TextPage of the page (not the page itself!) and delivers a list of the "raw" text lines. This is a list of tuples (rect, spans), one tuple per line.
The rect is the re-computed line bbox and spans is a list of the (re-ordered) text spans occurring in that line.
If you want, you can use this method to better re-construct the intended page layout.
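As a rough sketch of what such a reconstruction could look like, the function below splits each raw line into a left and a right part at a fixed x coordinate. Everything here is an assumption for illustration: the helper split_raw_lines_at, the boundary value, the use of plain tuples in place of rectangles, and spans being dictionaries with "bbox" and "text" keys. The sample data only imitates the (rect, spans) shape described above and does not come from an actual get_raw_lines() call.

```python
def split_raw_lines_at(raw_lines, x_split):
    """Return (left, right) lists of line strings, split at x coordinate x_split."""
    left, right = [], []
    for _rect, spans in raw_lines:
        # A span belongs to the side on which its left edge lies.
        left_text = "".join(s["text"] for s in spans if s["bbox"][0] < x_split)
        right_text = "".join(s["text"] for s in spans if s["bbox"][0] >= x_split)
        if left_text.strip():
            left.append(left_text.strip())
        if right_text.strip():
            right.append(right_text.strip())
    return left, right


# Hypothetical data in the (rect, spans) shape, for illustration only.
raw_lines = [
    ((72, 100, 380, 113), [
        {"bbox": (72, 100, 150, 113), "text": "Patient Name: "},
        {"bbox": (300, 100, 380, 113), "text": "Rogers, Pamela"},
    ]),
    ((72, 120, 200, 132), [
        {"bbox": (72, 120, 200, 132), "text": "History of Present Illness"},
    ]),
]
left, right = split_raw_lines_at(raw_lines, x_split=250)
print(left)
print(right)
```

Whether a fixed boundary is appropriate depends on the page; for a real 2-column layout the boundary would have to be derived from the span positions.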
