Correct Text box is not picked up #3828
-
Description of the bug
I am using PyMuPDF to read a PDF and translate its text into another language within the same PDF. To do this, I redact the original text and then add a layer with the translated text on top. I want to retain the original font color, size and font family. This works overall, but for some blocks of text it does not. It worked before, but running the same program now gives wrong results.
The original file I'm using for testing:
The French translation before this was:
No code was changed in between. I also tried updating to the latest version; I previously used an older one, but the issue persists.
This is how I am reading the blocks:
This is on AWS Lambda.

How to reproduce the bug
Just translate the file. It happens every time. It was not happening yesterday.

PyMuPDF version
1.24.8

Operating system
Linux

Python version
3.8
Replies: 8 comments
-
Please explain what you mean by the above:
As a preliminary comment, overlaps across images / vector graphics like these:
happen when rectangles passed to
As you see, the circle with the exclamation mark symbol overlaps the text block rectangle. But maybe I did not understand your complaint correctly.
-
Thanks for replying. Yes, I understand that the rectangle is not being computed correctly, so the text block gets placed in the wrong position, overlaps other content, and picks up the wrong block's styling. I don't understand how to change my logic here, though. Maybe it's not a bug, but do you know how I can change this? I shared my code for finding the rectangle for each text block.
-
Thanks for confirming that my assumption was correct and that we do not have a bug.
-
The problem is that the highlighted lines "respect" the exclamation mark and are all shorter than the rest in this block.
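A minimal sketch of the geometry described here, in plain Python with no PyMuPDF dependency. Rectangles are (x0, y0, x1, y1) tuples in the same layout as the "bbox" values PyMuPDF reports. The coordinates, the icon position, and the helper names `union` and `intersects` are made up for illustration and are not taken from the file in this thread.

```python
from typing import Iterable, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def union(rects: Iterable[Rect]) -> Rect:
    """Smallest rectangle that contains all given rectangles."""
    rects = list(rects)
    return (
        min(r[0] for r in rects),
        min(r[1] for r in rects),
        max(r[2] for r in rects),
        max(r[3] for r in rects),
    )


def intersects(a: Rect, b: Rect) -> bool:
    """True if the two rectangles share an area larger than zero."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


# Hypothetical block: the first two lines stop short of an icon,
# the last two lines run the full width underneath it.
line_rects = [
    (50, 100, 290, 112),
    (50, 114, 290, 126),
    (50, 128, 340, 140),
    (50, 142, 340, 154),
]
icon = (300, 98, 332, 126)  # hypothetical exclamation-mark symbol

block_rect = union(line_rects)

print(block_rect)
print("block overlaps icon:", intersects(block_rect, icon))
print("any line overlaps icon:", any(intersects(r, icon) for r in line_rects))
```

The block rectangle is the union of its lines, so it reaches over the icon even though no single line does. Writing text back per line (or per span) rectangle avoids that overlap.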
-
Hi Jorj, Thanks for the helpful clues. I now assign each text span within a line its own styling properties instead of applying one style per block. That fixed this issue. I am still figuring out how best to handle font resizing when the translated string is longer than the original. The package does rescale the text, but the result sometimes looks odd. Any insights or advice you have on this would be appreciated. Thanks again for all the help.
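The per-span approach might look roughly like the sketch below. It is not the poster's code, which is not shown in this thread. It assumes a dictionary shaped like the output of `page.get_text("dict")` (blocks, then lines, then spans with "text", "bbox", "font", "size" and an integer sRGB "color"), and uses a hand-built sample so it runs without PyMuPDF. The names `srgb_int_to_rgb` and `iter_span_styles` are hypothetical.

```python
from typing import Iterator, Tuple


def srgb_int_to_rgb(color: int) -> Tuple[float, float, float]:
    """Convert an integer sRGB value (0xRRGGBB) to floats in 0..1."""
    r = (color >> 16) & 0xFF
    g = (color >> 8) & 0xFF
    b = color & 0xFF
    return (r / 255, g / 255, b / 255)


def iter_span_styles(text_dict: dict) -> Iterator[dict]:
    """Yield one style record per non-empty span instead of one per block."""
    for block in text_dict.get("blocks", []):
        if block.get("type", 0) != 0:  # type 0 is text, skip image blocks
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                if not span["text"].strip():
                    continue  # ignore whitespace-only spans
                yield {
                    "text": span["text"],
                    "bbox": tuple(span["bbox"]),
                    "font": span["font"],
                    "size": span["size"],
                    "color": srgb_int_to_rgb(span["color"]),
                }


# Hand-built sample (assumed values): one block with two differently styled spans.
sample = {
    "blocks": [
        {
            "type": 0,
            "lines": [
                {"spans": [
                    {"text": "Warning:", "bbox": (50, 100, 110, 112),
                     "font": "Helvetica-Bold", "size": 11.0, "color": 0xFF0000},
                    {"text": " read this first", "bbox": (110, 100, 200, 112),
                     "font": "Helvetica", "size": 11.0, "color": 0x000000},
                ]},
            ],
        },
        {"type": 1},  # image block, ignored
    ]
}

styles = list(iter_span_styles(sample))
for style in styles:
    print(style["text"], style["font"], style["size"], style["color"])
```

Each record carries its own rectangle and style, so a bold red span and a regular black span in the same block no longer share one block-level style.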
-
Hi, There is one more thing. I'm not sure whether this is a bug, but it seems like one. I am translating this PDF, and when I apply redactions the image gets removed. If a page contains only an image, the image stays, but on pages with both images and text the image is removed. In this PDF, the image on the last page remains but all others are removed.
This is the result:
and this is the file with redactions:
I figured it was due to text overlapping with images, but doing this:
did not work either. So how can I tackle this, and is this a bug? If so, I'll open the respective issue. Thanks
-
Getting perfect behavior with this snippet:

    import pymupdf

    pymupdf.version
    # ('1.24.10', '1.24.9', '20240902000001')

    doc = pymupdf.open("redacted.pdf")
    for page in doc:
        # 0 for images and graphics leaves them untouched; text is still removed
        page.apply_redactions(images=0, graphics=0, text=0)
    doc.ez_save("y.pdf")
-
Hi Jorj, I was using version 1.24.8. Upgrading to the latest version solved the problem. Thank you so much for responding so quickly and solving my issues.
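When checking which release is installed, version strings such as the first element of `pymupdf.version` are safer to compare as integer tuples than as plain strings. The sketch below uses only the two version numbers mentioned in this thread; the helper `parse_version` is hypothetical, and the thread does not say which release between them changed the behavior.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version string like '1.24.10' into (1, 24, 10)."""
    return tuple(int(part) for part in version.split("."))


reported = parse_version("1.24.8")   # version where the image was removed
working = parse_version("1.24.10")   # version used in the working snippet

# Plain string comparison orders these two versions the wrong way round.
string_compare_is_misleading = "1.24.10" < "1.24.8"

print(working > reported)
print(string_compare_is_misleading)
```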