Deduping library of 47,000 photos // SUGGESTION: More granular exposed filtering options AND throttle #41
One-Hoopy-Frood started this conversation in Ideas
Replies: 1 comment
-
For problem A: take a backup, then `docker compose down ; docker compose up --detach`, or you could clone the GitHub repository in a different directory and run the docker compose stack again with your .env values. For problem B: I do agree that more granular filters or a higher comparison percentage should be used.
-
Hi Matt, first, thanks for taking this on and creating a util that seems to be working for many.
Background:
I've been doing photography since the early '90s and have a library (on Google Photos) of 47,000 photos.
At several points, I was given libraries with many known duplicates.
(Windows 11, Docker Desktop for Windows, i7 4-core/8-thread, 32 GB RAM)
Everything runs fine and system resources are not overwhelmed; usage usually hovers at about 0.33 while I watch the web/Docker page.
Problems:
A) I think there may be a throttling issue: an INITIAL 'scrape' to collect the Google Photos with [RefreshMediaItems] seems to succeed without issue, but re-running the library update (where it attempts to refresh all images) fails with a lot of log entries like this:
I'm also wondering whether timeouts are causing some of what's going on in B) below, but I'm not sure.
B) The system appears to search for duplicates using what looks like several matching strategies combined (my guess at that combination, in code form, follows the list):
(OpenCV Visual similarity)
OR
(Matching Filenames)
(Matching dimensions)
(Matching filesizes)
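To make that concrete, here is a minimal sketch of how I read that combination. This is purely a guess from observed behaviour, not the project's actual matching code, and the attribute names and threshold are invented:

```python
def looks_like_duplicate(a, b, visual_similarity):
    """Pure speculation -- NOT the project's real code. My reading of the
    strategy list above: a high enough visual score on its own is a match,
    or else agreement on the metadata fields (I can't tell from the output
    exactly how those three are combined, which is part of the problem).
    If the API returns no usable file size, that check can't rule anything
    out -- see the Storage Saver question further down."""
    metadata_match = (
        a.filename == b.filename
        and (a.width, a.height) == (b.width, b.height)
        and (a.filesize is None or b.filesize is None
             or a.filesize == b.filesize)
    )
    return visual_similarity >= 0.997 or metadata_match  # threshold is a guess
```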
At the top of my results, I get a TON of CLEARLY 4-5x duplicate images with >99.7% similarity. These are all actual duplicates (I think).
At the bottom, I get some really unhelpful results where the system appears to find the following combinations as "matches":
(see uploaded images)
I was wondering if this is a side-effect of "Storage Saver", which might not technically "expose" a file size to the API inquiries? I downloaded a copy of the bottom two photos and compared their sizes on disk and they are clearly different.
(see image)
The result is that the bottom 10% of a 4,200-item list of "matches" are pairs that are very clearly not similar. For someone with a very small library, "filename matches" may be useful.
Suggestions/Ideas:
Limit request rate to API Limit
If too many requests are going through the API and it's timing out on large libraries, is there a way for the system to track the number of "requests made" and compare that against the "request rate limit" so the system stays below that rate? (A rough sketch of the kind of throttle I mean is below.)
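Not knowing the internals, here is a minimal sketch of the kind of client-side throttle I'm imagining, in Python. The quota numbers, class name, and the `refresh_media_item` call are placeholders, not anything from the project or from Google's published limits:

```python
import time
from collections import deque

class RequestThrottle:
    """Blocks so that no more than `max_requests` calls are issued in any
    rolling `per_seconds` window. The defaults are placeholders, not
    Google's actual quota."""

    def __init__(self, max_requests=250, per_seconds=60):
        self.max_requests = max_requests
        self.per_seconds = per_seconds
        self.timestamps = deque()

    def wait(self):
        now = time.monotonic()
        # Drop timestamps that have aged out of the rolling window.
        while self.timestamps and now - self.timestamps[0] >= self.per_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_requests:
            # Sleep just long enough for the oldest request to age out.
            time.sleep(self.per_seconds - (now - self.timestamps[0]))
            self.timestamps.popleft()
        self.timestamps.append(time.monotonic())

# Hypothetical use inside the refresh loop:
# throttle = RequestThrottle()
# for item in media_items:
#     throttle.wait()
#     refresh_media_item(item)  # stand-in for the real API call
```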
Exposed Filters
Exposed "similarity" filters at the "Process Duplicates" phase AND report in batches? I was thinking something like:
Matching Strategy
Minimum Visual Similarity Threshold: [NN.N]
[CHECKBOX] Search for files with identical size (optional)
[CHECKBOX] Find files with the same filename (optional)
Limit output to first [NN] matches (suggested for large libraries)
^^ The above would be a simple way to expose what I think are the current matching strategies so the user can tweak them to suit. For example:
I might first want to make a pass with ONLY VISUAL SIMILARITY at 99.8%, since I know those images will almost certainly be actual duplicates.
Then I might want to do another pass with a LOWER VISUAL SIMILARITY (75%) but only considering files with "matching filenames". And since I have a HUGE library and want to VISUALLY check ALL reported matches, I could limit the output to a more manageable set before deleting, to make sure it's "doing what I expect it to do".
I can think of more COMPLICATED things for the UX, but the above list would get me what I need by allowing me to AND-combine the above requirements (a rough sketch of those two passes is below).
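Here is a minimal sketch of what I mean, with the two example passes above. The `Candidate` fields and everything else here are invented for illustration; they aren't the project's actual data structures:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Candidate:
    """One flagged pair of photos -- fields invented for illustration."""
    visual_similarity: float  # percent, 0-100
    filename_a: str
    filename_b: str
    filesize_a: int
    filesize_b: int

@dataclass
class DuplicateFilter:
    """Hypothetical "Process Duplicates" filter in which every enabled
    criterion must hold (AND), instead of any one of them being enough."""
    min_visual_similarity: float = 0.0   # percent, 0-100
    require_same_filesize: bool = False
    require_same_filename: bool = False
    max_results: Optional[int] = None    # cap output for huge libraries

    def matches(self, c: Candidate) -> bool:
        if c.visual_similarity < self.min_visual_similarity:
            return False
        if self.require_same_filesize and c.filesize_a != c.filesize_b:
            return False
        if self.require_same_filename and c.filename_a != c.filename_b:
            return False
        return True

    def apply(self, candidates: List[Candidate]) -> List[Candidate]:
        kept = [c for c in candidates if self.matches(c)]
        return kept[:self.max_results] if self.max_results else kept

# Pass 1: visual similarity only, very strict -- near-certain duplicates.
pass_one = DuplicateFilter(min_visual_similarity=99.8)

# Pass 2: looser similarity, but only where the filenames also match,
# capped so a huge library still yields a batch I can eyeball.
pass_two = DuplicateFilter(min_visual_similarity=75.0,
                           require_same_filename=True,
                           max_results=200)
```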