Calculate font size
Tesseract's hOCR output provides the x_size (see x-height) of the recognized text for each line, together with information about ascenders and descenders. It is also possible to activate hocr_font_info to get some information about the font size as well. However, that font size is rounded to an integer, which is not always what one wants.
The font size calculation is simple and needs no information beyond the x_size parameter that the hOCR file gives for each line, so it can be redone from the hOCR file without any rounding, cf. this simple hack:
$ perl -ne 'print("$1 ", $2*72/600, "\n") if /^.*id=.([^ ]*). .*x_size ([0-9.]*);.*$/' h7.html
line_1_1 8.62807344
line_1_2 7.08
line_1_3 6.36
line_1_4 6.36
line_1_5 6.36
line_1_6 6.35710104
line_1_7 6.48
line_1_8 6.36
line_1_9 6.24
line_1_10 6.36
...
The factor 72 is the number of points per inch, and 600 is the image resolution in dpi. If your image has a different resolution, substitute it for the "600" above.
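
The same conversion can be sketched in Python for cases where a reusable function is handier than the one-liner. This is only a minimal sketch, not part of Tesseract: the function name font_sizes, the regular expression, and the sample hOCR line are assumptions made for illustration, and the regex presumes that each element's id attribute appears before its title attribute, as in the output shown above.

```python
import re

# Captures the element id and the x_size value from its title attribute.
# Assumption: the id attribute precedes the title attribute in the same tag.
LINE_RE = re.compile(r"""id=['"]([^'"]+)['"][^>]*\bx_size\s+([0-9.]+)""")


def font_sizes(hocr_text, dpi=600):
    """Return (line_id, size_in_points) pairs for every element with an x_size."""
    # x_size is in pixels; pixels / dpi gives inches, and one inch is 72 points.
    return [(line_id, float(x_size) * 72 / dpi)
            for line_id, x_size in LINE_RE.findall(hocr_text)]


# Hypothetical hOCR line, made up for this example.
sample = (
    "<span class='ocr_line' id='line_1_2' "
    "title=\"bbox 10 20 300 80; baseline 0 -5; x_size 59; "
    "x_descenders 10; x_ascenders 12\">text</span>"
)

for line_id, size in font_sizes(sample):
    print(line_id, size)
```

To process a real file, read its contents and pass them in together with the resolution of the scanned image, for example font_sizes(open("h7.html").read(), dpi=300) for a 300 dpi scan.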
Source: http://stackoverflow.com/questions/43531282/getting-exact-font-size-in-hocr-output/