Detecting bold fonts in a pdf #3779

theshahshow · 2024-08-13T22:17:42Z

To check whether the text inside a span is bold, I usually check whether 'bold' appears in span['font'] (the font name).

But in the attached PDF, bdh_single.pdf, all the fonts are "SourceSansPro-Regular". When looking at the PDF, you can clearly see that some text (the headings) is bold. In such cases, how do I find out whether some text is bold or not?

For font names, I look at the output of this code:

import fitz  # PyMuPDF

doc = fitz.open("bdh_single.pdf")
page = doc[0]

blocks = page.get_text("dict")["blocks"]
for block in blocks:
    if block["type"] == 0:  # 0 = text block, 1 = image block
        for line in block["lines"]:
            for span in line["spans"]:
                font_name = span["font"]
                print(font_name, repr(span["text"]))
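
For reference, the name check described above can be wrapped in a small helper. This is only a sketch: the helper name and the sample spans are made up, and the extra test of span["flags"] (bit 4, value 16, marks a bold font in PyMuPDF's text dictionaries) is an addition that the post does not mention. Both checks only report what the font says about itself.

```python
# Bit 4 (value 16) of span["flags"] marks a bold font in PyMuPDF's text
# dictionaries; it is named locally so this sketch has no dependencies.
BOLD_FLAG = 1 << 4


def span_claims_bold(span):
    """Return True if the span's font name or flags say the font is bold.

    `span` is a dict shaped like the spans from page.get_text("dict").
    The result only reflects what the font reports about itself.
    """
    name = span.get("font", "").lower()
    flags = span.get("flags", 0)
    return "bold" in name or bool(flags & BOLD_FLAG)


# Hypothetical spans; the values are invented for illustration.
real_bold = {"font": "SourceSansPro-Bold", "flags": 16, "text": "Title"}
fake_bold = {"font": "SourceSansPro-Regular", "flags": 0, "text": "Heading"}

print(span_claims_bold(real_bold))
print(span_claims_bold(fake_bold))
```

A heading that only looks bold, like the ones in bdh_single.pdf, would fall into the second case and go undetected by this kind of check.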

JorjMcKie · 2024-08-13T22:37:10Z

Maintainer

Of course, PyMuPDF has no way of visually determining what a glyph looks like to the human eye: it depends on what the font itself tells us about its properties.
If the font says "I am not bold", we are bound to believe it - no matter how fat the characters appear to be.

Some PDF creators use tricks to achieve text effects without embedding another font.
These tricks include writing the same text twice with a small offset (to the right), or drawing characters with a border around the interior, and more of that sort.

5 replies

JorjMcKie Aug 13, 2024
Maintainer

Just had another look:
This PDF uses the PDF command 2 Tr, which means "set text rendering mode 2", i.e. fill the glyphs and then stroke their outlines.

So if the fill and stroke colors are the same (e.g. black), then - with a suitable line width - characters written like that appear bold, while the same font is still used.
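
One rough way to confirm that a page uses this command is to scan its content stream for Tr operators. The sketch below is a naive regular-expression scan, not a real PDF parser: the function name and the sample stream are made up for illustration (the sample is hand-written, not taken from bdh_single.pdf), and the pattern could also match text that merely looks like an operator inside a string literal.

```python
import re

# Matches an integer operand followed by the Tr operator, e.g. b"2 Tr".
TR_PATTERN = re.compile(rb"(?<![\w.])(\d+)\s+Tr(?![A-Za-z*])")


def find_render_modes(content: bytes) -> list[int]:
    """Return the operands of all Tr operators found in a content stream."""
    return [int(m.group(1)) for m in TR_PATTERN.finditer(content)]


# Hand-written content stream: a heading drawn with "2 Tr" (fill, then
# stroke) using black for both fill (0 g) and stroke (0 G), then body
# text switched back to plain fill with "0 Tr".
sample = b"BT /F1 12 Tf 2 Tr 0.5 w 0 g 0 G 72 700 Td (Heading) Tj 0 Tr (Body) Tj ET"

print(find_render_modes(sample))  # [2, 0]
```

With PyMuPDF, the bytes returned by page.read_contents() could be passed to this function. Text drawn inside Form XObjects lives in separate streams and would not be seen this way.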

Answer selected by theshahshow

theshahshow Aug 14, 2024
Author

Thanks for the quick reply!
I get your explanation. How do I find the text rendering mode for a span of text? Is the text rendering mode set only once, or can it differ throughout the PDF? Similarly, is there a way to find the fill and stroke color of a piece of text?

theshahshow Aug 14, 2024
Author

OK, so going through earlier discussions, I found out how to get the fill and stroke colors by using page.get_texttrace(). Just wondering whether any other solution for this exists.

JorjMcKie Aug 14, 2024
Maintainer

You answered most of your questions yourself 👍!
The Tr command can be set for every piece of text, so multiple values are possible per page. Value 3 is often used by OCR engines to store invisible (but extractable and searchable) text on scanned pages.

Only get_texttrace() delivers all those text details - it is not based on a TextPage object but built upon more low-level MuPDF code.
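
A minimal sketch of how the get_texttrace() output might be used to spot this kind of fake bold. It rests on assumptions that should be checked against the PyMuPDF documentation for the installed version: that each trace entry has "type", "color" and "chars" keys, that type 0 means filled text and type 1 stroked text, and that text written with 2 Tr shows up as two entries, one of each type. All function names are made up, and the line width is not checked.

```python
from collections import defaultdict

# Assumed span type codes in get_texttrace() output; verify them against
# the PyMuPDF documentation before relying on them.
FILL_TEXT = 0
STROKE_TEXT = 1


def span_key(span):
    """Identify a piece of text by its characters and rounded start point."""
    chars = span["chars"]  # assumed items: (unicode, glyph, origin, bbox)
    text = "".join(chr(c[0]) for c in chars)
    x, y = chars[0][2]
    return text, round(x), round(y)


def find_fill_and_stroke_text(trace):
    """Return texts drawn both filled and stroked in the same color."""
    seen = defaultdict(dict)  # key -> {span type: color}
    for span in trace:
        if not span["chars"]:
            continue  # nothing to identify the span by
        seen[span_key(span)][span["type"]] = tuple(span["color"])
    return [
        text
        for (text, _x, _y), colors in seen.items()
        if FILL_TEXT in colors
        and STROKE_TEXT in colors
        and colors[FILL_TEXT] == colors[STROKE_TEXT]
    ]


def trace_first_page(path):
    """Load page 0 of `path` and return its text trace (needs PyMuPDF)."""
    import fitz  # imported lazily so the helpers above work without it

    with fitz.open(path) as doc:
        return doc[0].get_texttrace()
```

Usage would look like find_fill_and_stroke_text(trace_first_page("bdh_single.pdf")). Matching on rounded start points is a crude heuristic, so treat the result as a hint and not as proof that the text is bold.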

theshahshow Aug 14, 2024
Author

Got it, thanks!
