-
MuPDF uses Tesseract-OCR version 4.0 code internally. We have no dealings with pytesseract, so it may or may not rely on that or another version of Tesseract, and we also do not know which parameters that Python package passes to Tesseract (especially PSM). The font Tesseract uses internally is a glyph-less font, and it is also named like that. It is a mono-spaced font, which implies that the bounding boxes of single characters cannot be trusted. Or more clearly: they are wrong. If the original page e.g. has the following words written with the same font and font size |
-
Is your feature request related to a problem? Please describe.
I recently began an OCR data-cleaning project called RemarkableOCR, built on top of PyTesseract and specialized for books, journal articles, and newspapers. This is the first time I am exploring PDF data structures, and I would like to use PyMuPDF's native PDF data extraction rather than convert PDFs to images and OCR the images. This question concerns the guarantees of PDF data blocks, which you are familiar with and I am not.
Describe the solution you'd like
fitz.open(pdf_filename)[0].get_text(option="rawdict")
returns a collection of blocks, each of which contains a collection of lines; lines contain spans, and spans contain individual characters with their precise bounding boxes. This data is "easy enough" to parse into word-sized bounding boxes that replicate the PyTesseract data output. My question is how you would approach this problem. The blocks can be sorted into left-to-right, top-down order, but there are layouts in which that does not easily yield a structured reading order. Can you provide your insight into the definitions of 'block' and 'line' in relation to how those are defined by PyTesseract, and in which edge cases they align and do not align? Thank you for entering into this discussion with me.
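To make the "easy enough" parsing concrete, here is a minimal sketch of how the characters might be grouped into word-sized boxes. It assumes the "rawdict" layout described above (blocks with a "type" and "lines", lines with "spans", spans with "chars", each char carrying "c" and "bbox"). The names `words_from_rawdict` and `_union`, and the output keys, are hypothetical and chosen only for illustration. Words are split on whitespace only, which is an assumption and may not match how PyTesseract segments words.

```python
def _union(a, b):
    """Return the smallest rectangle (x0, y0, x1, y1) covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def words_from_rawdict(page_dict):
    """Group the characters of a "rawdict"-style page dictionary into words.

    Returns a list of dicts with the (hypothetical) keys
    "text", "bbox", "block" and "line".
    """
    words = []
    for block_no, block in enumerate(page_dict.get("blocks", [])):
        if block.get("type", 0) != 0:  # only text blocks carry lines
            continue
        for line_no, line in enumerate(block.get("lines", [])):
            text, box = "", None
            for span in line.get("spans", []):
                for ch in span.get("chars", []):
                    if ch["c"].isspace():
                        # Whitespace closes the word collected so far.
                        if text:
                            words.append({"text": text, "bbox": box,
                                          "block": block_no, "line": line_no})
                        text, box = "", None
                    else:
                        # Extend the current word and grow its bounding box.
                        bbox = tuple(ch["bbox"])
                        text += ch["c"]
                        box = bbox if box is None else _union(box, bbox)
            if text:  # a line end also closes the current word
                words.append({"text": text, "bbox": box,
                              "block": block_no, "line": line_no})
    return words
```

With PyMuPDF installed, the input would presumably be the dictionary returned by the call shown above, i.e. `words_from_rawdict(fitz.open(pdf_filename)[0].get_text("rawdict"))`. The sketch lets a word continue across span boundaries, so a word whose letters change font mid-word stays in one piece. PyMuPDF also offers `get_text("words")`, which returns word rectangles directly and may make a hand-written grouping unnecessary.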