Correct Text box is not picked up #3828

Mominadar · 2024-08-29T15:19:22Z

Mominadar
Aug 29, 2024

Description of the bug

I am using PyMuPDF to read a PDF, translate its text into another language, and write the translation back into the same PDF. To do this, I redact the original text and then insert the translated text in its place. I want to retain the original font color, size, and font family. All of this works, except for some blocks of text.

It worked before, but now, running the same program, it's not working properly. The original file I'm using for testing:
original.pdf

The French translation before this was:

(Uploading a screenshot since the file was over 200 MB)
and this is the one I see now:
french-1-translation (6).pdf

No code was changed during this. I also tried updating to the latest version. Previously I used an older one, but the issue persists.

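This is how I am reading the blocks:

    import logging
    from concurrent.futures import ThreadPoolExecutor, as_completed

    import fitz  # PyMuPDF

    logger = logging.getLogger(__name__)

    # The functions below are methods of a class that is not shown here.
    def process_block(self, block):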
        block_text = ""
        block_bbox = None
        font_properties = None
        if "lines" in block:
            for line in block["lines"]:
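                for span in line["spans"]:
                    # Note: this is overwritten for every span, so the block
                    # ends up with the style of its last span only.
                    font_properties = {
                        "size": "%g" % span["size"],
                        "color": "#%06x" % span["color"],
                        "font-family": "%s" % span["font"],
                    }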
                    if span["text"].strip():
                        bbox = span["bbox"]
                        
                        if isinstance(bbox, (list, tuple)) and len(bbox) == 4:
                            bbox = [float(coord) for coord in bbox]
                            if block_bbox is None:
                                block_bbox = bbox
                            else:
                                block_bbox[0] = min(block_bbox[0], bbox[0])
                                block_bbox[1] = min(block_bbox[1], bbox[1])
                                block_bbox[2] = max(block_bbox[2], bbox[2])
                                block_bbox[3] = max(block_bbox[3], bbox[3])
                            block_text += span["text"] + " "
                            
                        # Debug logging for the spans that ended up misplaced.
                        span_text = span["text"].lower()
                        if any(key in span_text for key in ("scarring", "fetal loss", "death", "global spread")):
                            logger.info(f"text: {span['text']} bbox {block_bbox}")
                            
        if block_text.strip():
            return {
                "text": block_text.strip(),
                "bbox": block_bbox,
                "font_properties": font_properties
            }
        return None

    def extract_text_blocks(self, doc):
        pages_blocks = []

        for page_num in range(len(doc)):
            page = doc.load_page(page_num)
            blocks = page.get_text("dict")["blocks"]
    
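            # Process the blocks of this page in parallel.
            with ThreadPoolExecutor(max_workers=10) as executor:
                future_to_block = {executor.submit(self.process_block, block): block for block in blocks}

                # as_completed() yields futures in completion order,
                # not in the order the blocks appear on the page.
                page_blocks = []
                for future in as_completed(future_to_block):
                    block_result = future.result()
                    if block_result:
                        page_blocks.append(block_result)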

            pages_blocks.append(page_blocks)
            
        return pages_blocks
    
    def translate(self, file, source_language, target_language):
            doc = fitz.open(stream=file, filetype="pdf")
            pages_blocks = self.extract_text_blocks(doc)

            for page_num, page_blocks in enumerate(pages_blocks):
                page = doc.load_page(page_num)
                for block in page_blocks:
                    text = block["text"]
                    bbox = block["bbox"]
                    font_properties = block["font_properties"]
                    if isinstance(bbox, (list, tuple)) and len(bbox) == 4:
                        try:
                            rect = fitz.Rect(bbox)
                            translated_text = self.bedrock_client.translate_text(text, source_language, target_language)
                            if translated_text:
                                page.add_redact_annot(rect, text="")
                                page.apply_redactions()
                                html = f'''
                                <div style="font-size:{font_properties['size']}px; font-family:{font_properties["font-family"]}; color:{font_properties["color"]}">
                                    {translated_text}
                                </div>
                                '''                        
                                page.insert_htmlbox(rect, html)
                            else:
                                logger.error(f"Skipping block due to translation error: '{text}'")
                        except Exception as e:
                            logger.error(f"Error processing block: '{text}' with bbox: {bbox}")
                            logger.error(e)
                            raise
                    else:
                        logger.error(f"Invalid bbox: {bbox}")

This is on an AWS Lambda.

How to reproduce the bug

Just translate the file. It happens every time. It was not happening yesterday.

PyMuPDF version

1.24.8

Operating system

Linux

Python version

3.8

Answered by JorjMcKie

Sep 6, 2024

JorjMcKie · 2024-08-30T09:36:15Z

JorjMcKie
Aug 30, 2024
Maintainer

It worked before but now running the same program, its not working fine.

Please explain what you mean by the above:

If the same program was running fine with the same input file, but now isn't anymore, then something else must have changed: different PyMuPDF version, different Python, ...
Otherwise: so what has changed?

As a preliminary comment, overlaps across images / vector graphics like these:

happen when the rectangles passed to .insert_htmlbox() were dimensioned incorrectly, or were computed with insufficient information.
For the first example, here is the text block marked in red. It has been computed correctly!

As you see, the circle with the exclamation mark symbol overlaps the text block rectangle.
There is no error of the package in this case. What is missing is more logic in your program that detects situations like these and computes the required output rectangle differently.
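The kind of detection logic mentioned above could look roughly like the following. This is only a minimal sketch, not something from the package: `intersects` and `find_obstacles` are hypothetical helper names, and rectangles are assumed to be plain `(x0, y0, x1, y1)` tuples. The obstacle rectangles might be collected from `page.get_image_info()` (the `"bbox"` entries) and `page.get_drawings()` (the `"rect"` entries).

```python
def intersects(a, b):
    """True if rectangles a and b, given as (x0, y0, x1, y1), share a non-empty area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def find_obstacles(text_bbox, obstacle_bboxes):
    """Return the obstacle rectangles (images, graphics) that overlap the text rectangle."""
    return [box for box in obstacle_bboxes if intersects(text_bbox, box)]


# Hypothetical coordinates: a text block and two symbols on the same page.
text_bbox = (50, 100, 500, 200)
obstacles = [(480, 110, 520, 150), (600, 100, 650, 150)]
print(find_obstacles(text_bbox, obstacles))
```

If the returned list is not empty, the program knows that the block rectangle needs special treatment before it is passed to `.insert_htmlbox()`, for example by shrinking it or by splitting the block.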

But maybe I did not understand your complaint correctly.

Mominadar · 2024-08-30T11:01:46Z

Mominadar
Aug 30, 2024
Author

Thanks for replying. Yes, I understand that the rectangle is not being computed correctly, so the text block is placed in the wrong position, overlaps other content, and picks up the wrong block's styling. I don't understand how to change my logic here, though. Maybe it's not a bug, but do you know how I can change this? I shared my code for finding the rectangle of each text block.

JorjMcKie · 2024-08-30T11:07:04Z

JorjMcKie
Aug 30, 2024
Maintainer

Thanks for confirming that my assumption was correct and that we do not have a bug.
This changes your post to a Discussions item. So I will transfer to this category.

JorjMcKie · 2024-08-30T13:16:26Z

JorjMcKie
Aug 30, 2024
Maintainer

Looking at this example:

The problem is that the highlighted lines "respect" the exclamation mark and are all shorter than the rest in this block.
MuPDF (which is responsible for identifying what to regard as a block) found no (or not enough) reason to start a new block for the text starting with "People may be at risk...".
There is, however, a slightly larger distance between this line and the marked lines than the inter-line distance within the marked lines.
You may want to exploit this observation by taking the minimum vertical line distance and creating a sub-block from consecutive lines whose distance does not exceed this value. For this sub-block, set the width to a smaller value, e.g. the maximum line width.
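A rough sketch of this idea, working on one block of `page.get_text("dict")["blocks"]`, is shown here. The helper names and the `tolerance` parameter are hypothetical, and the sketch assumes that the lines of a block are ordered top to bottom with one line per row. Distances are measured between the top edges of consecutive lines.

```python
def split_block_by_line_gap(block, tolerance=0.5):
    """Split a text block into groups of lines (sub-blocks).

    A new group starts wherever the vertical distance between two
    consecutive lines exceeds the smallest such distance in the block
    by more than `tolerance` (hypothetical parameter, in points).
    """
    lines = block.get("lines", [])
    if len(lines) < 2:
        return [lines] if lines else []

    # Distance between the top edges of consecutive lines.
    tops = [line["bbox"][1] for line in lines]
    gaps = [b - a for a, b in zip(tops, tops[1:])]
    min_gap = min(gaps)

    groups = [[lines[0]]]
    for line, gap in zip(lines[1:], gaps):
        if gap > min_gap + tolerance:
            groups.append([])  # larger gap: start a new sub-block
        groups[-1].append(line)
    return groups


def union_bbox(lines):
    """Smallest rectangle (x0, y0, x1, y1) covering all given lines."""
    boxes = [line["bbox"] for line in lines]
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )
```

Calling `union_bbox()` on each group gives a rectangle that is only as wide as the longest line of that sub-block, so shorter lines next to a symbol no longer inherit the full block width.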

Mominadar · 2024-09-06T08:50:47Z

Mominadar
Sep 6, 2024
Author

Hi Jorj,

Thanks for the helpful clues. I assigned each text span its own styling properties instead of assigning them per block, and that fixed the issue. I am still figuring out how best to handle font resizing when the translated string is longer than the original. The package does rescale the text, but the result sometimes looks odd. Any insights or advice would be appreciated. Thanks again for all the help.
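The per-span approach might look something like this. It is a sketch reconstructed from the description, not the actual code that was used, and `span_styles` is a hypothetical name. It takes one block of `page.get_text("dict")["blocks"]` and returns one entry per non-empty span, each with its own rectangle and style.

```python
def span_styles(block):
    """Return one entry per non-empty span, each carrying its own style."""
    entries = []
    for line in block.get("lines", []):
        for span in line["spans"]:
            if not span["text"].strip():
                continue  # skip whitespace-only spans
            entries.append({
                "text": span["text"],
                "bbox": tuple(float(c) for c in span["bbox"]),
                "font_properties": {
                    "size": "%g" % span["size"],
                    "color": "#%06x" % span["color"],
                    "font-family": span["font"],
                },
            })
    return entries
```

Each entry can then be redacted and re-inserted on its own, so a span never picks up the font size or color of a neighboring span.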

Mominadar · 2024-09-06T11:08:16Z

Mominadar
Sep 6, 2024
Author

Hi,

There is one more thing. I'm not sure if this is a bug, but it seems like one. I am translating this PDF, and when I apply the redactions, the images get removed. If a page contains only an image, the image stays, but on pages with both images and text, the image is removed.
orginal.pdf

In this pdf, the image on the last page remains but all others are removed. This is the result:
result-final.pdf

and this is the file with redactions:
redacted.pdf

I figured it was due to text overlapping with images, but doing this:

    page.apply_redactions(images=0, graphics=0, text=0)

did not work either. I also tried all other combinations of parameters for this function. So how can I tackle this, and is this a bug? If it is a bug, I'll open the respective issue. Thanks.

JorjMcKie · 2024-09-06T14:28:59Z

JorjMcKie
Sep 6, 2024
Maintainer

Getting perfect behavior with this snippet:

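    import pymupdf

    print(pymupdf.version)  # ('1.24.10', '1.24.9', '20240902000001')

    doc = pymupdf.open("redacted.pdf")
    for page in doc:
        # images=0 / graphics=0: do not touch images or vector graphics.
        # text=0: remove the text under the redactions (the default).
        page.apply_redactions(images=0, graphics=0, text=0)
    doc.ez_save("y.pdf")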

y.pdf

Mominadar · 2024-09-09T09:02:56Z

Mominadar
Sep 9, 2024
Author

Hi Jorj,

I was using the version 1.24.8. Upgrading to the latest version did solve the problem. Thank you so much for responding fast and solving my issues.
