Match text extracted with unstructured to parsing from PyMuPDF #3707

JanEricNitschke · 2024-07-19T17:23:56Z


I am parsing PDFs with unstructured for use in an application. Later, I would like to highlight some of the extracted text passages in the actual PDF. I have seen that PyMuPDF is able to do that. However, I am running into cases where PyMuPDF's search functionality does not find the extracted text again. This appears to be related to line breaks and quotation marks. I have seen that the search function in PyMuPDF accepts some flags, but I have not found documentation for them or for how they relate to what unstructured does. Does anyone have experience with this and know which settings or tricks can be used to align the two and make this work?

Cheers

JorjMcKie · 2024-07-19T18:02:59Z

Maintainer

Do you have two example files: one with the unstructured text and one with a PDF page where you expect search hits?


JanEricNitschke Jul 19, 2024
Author

It is only ever about the same page. I parse the page with unstructured.io (there is some post-processing that doesn't affect the portions I tested), and later I want to find a portion of that parsed text again in the original PDF so I can add a highlight.

Unfortunately I am on the train and on my phone at the moment, but I will check whether I can give you an example over the weekend or on Monday.

JorjMcKie Jul 19, 2024
Maintainer

All right, no problem!

JanEricNitschke Jul 21, 2024
Author

Hi,

I have now taken this PDF and run the following code:

"""Parse the pdf."""

from pathlib import Path

from pymupdf import Document
from pymupdf.utils import search_for
from tqdm import tqdm
from unstructured.partition.pdf import partition_pdf


def highlight_text(doc: Document, text: str) -> bool:
    """Highlight the text in the document."""
    found = False
    for page in doc:
        ### SEARCH
        text_instances = search_for(page, text)

        if text_instances:
            found = True

        ### HIGHLIGHT
        for inst in text_instances:
            highlight = page.add_highlight_annot(inst)
            highlight.update()
    return found


def main() -> None:
    """Check the pdf."""
    data_dir = Path(__file__).parent / "data"
    data_dir.mkdir(exist_ok=True)
    pdf_path = data_dir / "TUD_Satzung_Sicherung-gute-wissenschaftliche-Praxis-2022.pdf"
    elements = partition_pdf(
        str(pdf_path), languages=["de"], strategy="hi_res", infer_table_structure=True
    )
    doc = Document(pdf_path)
    found = 0
    not_found = 0
    for element in tqdm(elements):
        text_found = highlight_text(doc, str(element))
        if text_found:
            found += 1
        else:
            not_found += 1
    print(f"Found: {found}")
    print(f"Not Found: {not_found}")
    doc.save(data_dir / "highlighted.pdf")


if __name__ == "__main__":
    main()

and got the following output:

Found: 106
Not Found: 106

and pdf:
highlighted.pdf

What can I do to reduce the number of not_found text passages?
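
One way to narrow down why a passage is not found is to normalize the unstructured string and the page text in the same way and then compare them. The sketch below is only an illustration built on the standard library: the helper name normalize_for_matching and the selection of quote characters are assumptions, not something taken from unstructured or PyMuPDF. It folds ligatures such as "ﬃ" via Unicode NFKC, maps typographic quotation marks to ASCII, joins words hyphenated across a line break, and collapses whitespace.

```python
import re
import unicodedata

# Typographic quotes mapped to ASCII (an illustrative, incomplete selection).
_QUOTE_MAP = str.maketrans({
    "\u201e": '"',  # German low opening double quote
    "\u201c": '"',  # left double quote
    "\u201d": '"',  # right double quote
    "\u00ab": '"',  # left guillemet
    "\u00bb": '"',  # right guillemet
    "\u201a": "'",  # low single quote
    "\u2018": "'",  # left single quote
    "\u2019": "'",  # right single quote
})


def normalize_for_matching(text: str) -> str:
    """Return a simplified form of text for comparing two extractions."""
    # NFKC replaces compatibility characters, e.g. the ligature U+FB03 by "ffi".
    text = unicodedata.normalize("NFKC", text)
    # Replace typographic quotation marks by their ASCII counterparts.
    text = text.translate(_QUOTE_MAP)
    # Join words split by a hyphen at a line break. This also joins genuine
    # hyphenated compounds that happen to break at a line end.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse line breaks and runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```

If normalize_for_matching(str(element)) is contained in normalize_for_matching(page.get_text()) while the raw search finds nothing, the mismatch is probably down to one of these differences. The normalized string is meant for diagnosis only; it will not necessarily match when passed to the search itself.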

JorjMcKie · 2024-07-22T11:45:33Z

Maintainer

I do not see my answer from the weekend 🤷‍♂️ - don't know what happened.

You can call page.search_for(...) directly; there is no need to import it separately from utils.
I cannot install all the packages you are using, so I don't know what kind of string this unstructured animal will deliver.
Chances are that the problems come from deviating behavior with respect to de-hyphenation (standard in PyMuPDF), line breaks, and ligatures.
Ligatures are extra glyphs such as "ﬃ" for certain character combinations. In this case, unstructured may or may not deliver "ffi" (dissolution into components).
To cope with that, experiment with the text extraction flags, like so: page.search_for(text, flags=0). Bit positions in this integer switch certain behavior details on or off. The value 0 means no de-hyphenation, ligature dissolution, no generation of extra spaces, and a few other things.
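
The experiment with flags could be automated roughly as follows. This is a minimal sketch, not a recommended set of values: the helper names search_with_flag_fallback and default_flag_candidates are hypothetical, and the list of candidate combinations is an assumption chosen for illustration. The helper only relies on page.search_for(text, flags=...) and tries each candidate until one produces hits.

```python
from typing import Any, Iterable, List, Optional, Tuple


def search_with_flag_fallback(
    page: Any, text: str, flag_candidates: Iterable[Optional[int]]
) -> Tuple[List[Any], Optional[int]]:
    """Try each flags value in turn and return (hits, flags_that_worked).

    A candidate of None means "call search_for without flags", i.e. use the
    library defaults. If nothing matches, ([], None) is returned.
    """
    for flags in flag_candidates:
        if flags is None:
            hits = page.search_for(text)
        else:
            hits = page.search_for(text, flags=flags)
        if hits:
            return list(hits), flags
    return [], None


def default_flag_candidates() -> List[Optional[int]]:
    """An illustrative candidate list (assumption, not an official recipe)."""
    import pymupdf  # imported lazily so the helper above works without it

    return [
        None,  # library defaults
        0,  # all optional behavior switched off
        pymupdf.TEXT_DEHYPHENATE,
        pymupdf.TEXT_PRESERVE_LIGATURES,
        pymupdf.TEXT_PRESERVE_WHITESPACE,
        pymupdf.TEXT_DEHYPHENATE | pymupdf.TEXT_PRESERVE_WHITESPACE,
    ]
```

With PyMuPDF installed, a call might look like hits, used = search_with_flag_fallback(page, str(element), default_flag_candidates()). Logging the value of used per element shows which switch makes a difference for a given document.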


JanEricNitschke Jul 22, 2024
Author

Thanks!

The import was because PyCharm was screaming at me :D

Do you have good documentation of what each value of flags does in detail?

I will check this out and report back.

I was also thinking about a slightly different approach: for example, not trying to match the full string but only the first and last three words, and then using the start and stop options for highlighting. Then a mismatch somewhere in the middle wouldn't cause issues.
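
The first-and-last-words idea could be sketched like this. The function names anchor_phrases and highlight_between_anchors are hypothetical, and the rule for pairing a start hit with an end hit (take the first start hit, then the first end hit that finishes at or after it in top-to-bottom order) is an assumption that will not hold for multi-column pages or for anchor phrases that repeat on the page. It relies on page.search_for returning rectangles with x1, y1, tl and br attributes and on the start and stop parameters of page.add_highlight_annot.

```python
from typing import Any, Optional, Tuple


def anchor_phrases(text: str, n_words: int = 3) -> Tuple[str, str]:
    """Return the first and last n_words words of text as two phrases."""
    words = text.split()
    head = " ".join(words[:n_words])
    tail = " ".join(words[-n_words:])
    return head, tail


def highlight_between_anchors(
    page: Any, text: str, n_words: int = 3
) -> Optional[Any]:
    """Highlight from the first anchor phrase to the last one on this page.

    Returns the annotation, or None if either anchor cannot be located.
    """
    head, tail = anchor_phrases(text, n_words)
    if not head:
        return None
    head_hits = page.search_for(head)
    tail_hits = page.search_for(tail)
    if not head_hits or not tail_hits:
        return None
    first = head_hits[0]
    # Keep only end-anchor hits that finish at or after the start anchor,
    # comparing bottom edge first and right edge second.
    later = [r for r in tail_hits if (r.y1, r.x1) >= (first.y1, first.x1)]
    if not later:
        return None
    last = later[0]
    # Mark everything from the top-left of the start anchor to the
    # bottom-right of the end anchor.
    return page.add_highlight_annot(start=first.tl, stop=last.br)
```

For passages of three words or fewer, both anchors are the same phrase and the highlight covers just that hit. Because the middle of the passage is never compared, hyphenation or ligature differences there no longer matter, at the cost of possibly pairing the wrong anchors when short phrases repeat.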

JorjMcKie Jul 22, 2024
Maintainer

Looking into the documentation is always a good idea (😎): https://pymupdf.readthedocs.io/en/latest/app1.html#text-extraction-flags-defaults.

That last idea of yours is a good one BTW!

rhlarora84 · 2024-09-12T09:18:13Z


@JanEricNitschke out of curiosity, do you see noticeable parsing differences between unstructured and PyMuPDF? I'm curious why you don't use the same library for extraction and annotation.


JanEricNitschke Sep 12, 2024
Author

I don't actually know why unstructured was originally chosen over PyMuPDF.

But when I started, a lot of post-processing had already been built on top of it, and it was deemed too much work to change.

So I needed a way to get the existing parsing to work with the highlighting, for which we needed PyMuPDF.
