Match text extracted with unstructured to parsing from PyMuPDF #3707
Unanswered
JanEricNitschke
asked this question in
Q&A
Replies: 3 comments 6 replies
-
Do you have two example files: one with the unstructured text and one PDF page where you expect search hits?
3 replies
-
I do not see my answer from the weekend 🤷‍♂️ - don't know what happened.
2 replies
-
@JanEricNitschke out of curiosity, do you see noticeable parsing differences between unstructured and pymupdf? Curious why you don't use the same library for extraction and annotation.
1 reply
-
I am parsing PDFs with unstructured for use in an application. Later I would like to highlight some of the extracted text passages in the actual PDF. I have seen that pymupdf can do that. However, I am running into cases where the pymupdf search does not find the extracted text again. This appears to be related to line breaks and quotation marks. I have seen that the pymupdf search allows setting some flags, but I have not found documentation for them or for how they relate to what unstructured does. Does anyone have experience with this and know which settings or tricks can align the two and make this work?
Cheers
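One possible workaround, sketched here under assumptions rather than taken from the thread: instead of passing the raw passage to `page.search_for`, normalize both sides (fold typographic quotes and ligatures, drop all whitespace) and match against the word list from `page.get_text("words")`. The helper names `squash`, `find_word_span` and `highlight_passage` are hypothetical, the quote table is deliberately incomplete, and the sketch assumes pymupdf returns the words in the same reading order that unstructured used. It does not handle words hyphenated across lines.

```python
import re
import unicodedata

# Illustrative (incomplete) table: typographic quotes and dashes -> ASCII.
_CHAR_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'", "\u201a": "'",
    "\u201c": '"', "\u201d": '"', "\u201e": '"',
    "\u2013": "-", "\u2014": "-",
})


def squash(text: str) -> str:
    """Fold ligatures and quotes, then drop all whitespace (incl. line breaks)."""
    text = unicodedata.normalize("NFKC", text).translate(_CHAR_MAP)
    return re.sub(r"\s+", "", text)


def find_word_span(words, passage):
    """Return (first, last) indices into `words` that cover `passage`, or None."""
    chunks = []
    owners = []  # owners[i] = index of the word that contributed character i
    for idx, word in enumerate(words):
        key = squash(word)
        chunks.append(key)
        owners.extend([idx] * len(key))
    haystack = "".join(chunks)
    needle = squash(passage)
    if not needle:
        return None
    pos = haystack.find(needle)  # first occurrence only
    if pos < 0:
        return None
    return owners[pos], owners[pos + len(needle) - 1]


def highlight_passage(page, passage) -> bool:
    """Highlight `passage` on a PyMuPDF page; True if a match was found."""
    # Imported lazily so the helpers above run without PyMuPDF installed.
    # Older PyMuPDF releases expose the module as `fitz` instead.
    import pymupdf

    # Each entry is (x0, y0, x1, y1, word, block_no, line_no, word_no).
    entries = page.get_text("words")
    span = find_word_span([entry[4] for entry in entries], passage)
    if span is None:
        return False
    first, last = span
    rects = [pymupdf.Rect(entry[:4]) for entry in entries[first:last + 1]]
    page.add_highlight_annot(rects)  # whole words are highlighted
    return True
```

With a document opened via `pymupdf.open(...)`, calling `highlight_passage(page, element_text)` for each page and then saving the document would be the intended use. Because matching is done per word, a passage that starts or ends mid-word highlights the full word, and only the first occurrence on the page is marked.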