Feature upgrade apache pdfbox #4450

Status: Open (4 commits to merge into master)

buchen (Member) commented on Jan 3, 2025

First draft for #4449

It switches to PDFBox 3.0.3 and, in case of errors, falls back to PDFBox 1.8.
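The fallback described above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `TextExtractor`, `Result`, and `FallbackExtractor` are hypothetical names, and the two extractors are plain stand-ins for the PDFBox 3.0.3 and PDFBox 1.8 code paths, so no PDFBox API is called here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical sketch: try the newer extractor first, fall back to the legacy one on errors. */
final class FallbackExtractor {

    /** Stand-in for "extract text from a PDF"; real extraction code may throw IOException. */
    interface TextExtractor {
        String extract(byte[] pdf) throws IOException;
    }

    /** Extracted text plus a flag telling whether the legacy path produced it. */
    static final class Result {
        final String text;
        final boolean usedFallback;

        Result(String text, boolean usedFallback) {
            this.text = text;
            this.usedFallback = usedFallback;
        }
    }

    static Result extract(byte[] pdf, TextExtractor primary, TextExtractor legacy) {
        try {
            // Assumed to wrap the PDFBox 3.0.3 path.
            return new Result(primary.extract(pdf), false);
        } catch (IOException | RuntimeException e) {
            // The newer path failed: leave a log message and retry with the legacy path.
            System.err.println("Primary extraction failed, falling back: " + e.getMessage());
            try {
                // Assumed to wrap the PDFBox 1.8 path.
                return new Result(legacy.extract(pdf), true);
            } catch (IOException e2) {
                throw new UncheckedIOException(e2);
            }
        }
    }
}
```

Returning the `usedFallback` flag alongside the text keeps the information about which path ran available to the caller, rather than only in the log.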

What I am still thinking about: how to learn whether and where the PDFBox versions produce different results.

Right now, I am printing a log message. But users will most likely not notice it, let alone inform us.

I wonder if we should create both versions of the extracted text and, if they differ, put both into the debug text. Then we would at least see which documents are extracted differently.
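One way to picture the idea of putting both extractions into the debug text is the sketch below. It assumes both texts have already been extracted; the class name `DebugTextBuilder` and the marker lines are invented for illustration and are not taken from this PR.

```java
/** Hypothetical sketch: include both extractions in the debug text only when they differ. */
final class DebugTextBuilder {

    static String build(String textNew, String textLegacy) {
        if (textNew.equals(textLegacy)) {
            // Identical output: the debug text stays as it is today.
            return textNew;
        }
        // Different output: make the difference visible and keep both variants.
        return "PDFBox versions produced different text\n"
                + "----- PDFBox 3 -----\n" + textNew + "\n"
                + "----- PDFBox 1.8 -----\n" + textLegacy + "\n";
    }
}
```

With this shape, a debug text that a user attaches to a bug report would reveal a difference without the user having to look at the log.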

buchen requested a review from Nirus2000 on January 3, 2025, 11:06
ZfT2 (Contributor) commented on Jan 4, 2025

Maybe just provide a separate "beta" version of PP with (only) PDFBox 3, post it in the forums, and let the community test it?
If no (bigger) failures are reported after some time, upgrade it in the main release as well.

Advantage: we will (hopefully) only have to deal with one PDFBox version in the future.

buchen force-pushed the feature_upgrade_apache_pdfbox branch from 05ea668 to aeea8dd on January 6, 2025, 10:27
buchen (Member, Author) commented on Jan 6, 2025

> Advantage: we will (hopefully) only have to deal with one PDFBox version in the future.

Yes. That is the goal.

I have now made two more changes to help us learn about differences.

The extracted debug text now indicates whether the two PDFBox versions produced different text. That way we at least learn that there are differences.

[Screenshot, 2025-01-06 at 11:21:28]

I also print the diff to the log file, in case we want to better understand what the differences are. The differences seen so far are not material (better extraction of special characters, additional line breaks in addresses, etc.).

[Screenshot, 2025-01-06 at 11:10:08]
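For readers who want a feel for what such a diff involves, here is a small line-based sketch using a longest-common-subsequence table. It is an assumption-laden illustration: `LineDiff` is a hypothetical class, and the PR may well use a different diff algorithm or an existing library.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: line-based diff of two extracted texts via longest common subsequence. */
final class LineDiff {

    /** Returns removed lines prefixed with "- " and added lines prefixed with "+ ". */
    static List<String> diff(String oldText, String newText) {
        String[] a = oldText.split("\\R", -1);
        String[] b = newText.split("\\R", -1);

        // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
        int[][] lcs = new int[a.length + 1][b.length + 1];
        for (int i = a.length - 1; i >= 0; i--) {
            for (int j = b.length - 1; j >= 0; j--) {
                lcs[i][j] = a[i].equals(b[j])
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }

        // Walk both texts, emitting only the lines that are not part of the common subsequence.
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i].equals(b[j])) {
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                out.add("- " + a[i++]);
            } else {
                out.add("+ " + b[j++]);
            }
        }
        while (i < a.length) {
            out.add("- " + a[i++]);
        }
        while (j < b.length) {
            out.add("+ " + b[j++]);
        }
        return out;
    }
}
```

An empty result means both versions extracted the same lines; otherwise each entry can be written to the log as one line. The table is quadratic in the number of lines, which is acceptable for short documents but is one reason a real implementation might prefer a dedicated diff library.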
