Deduping library of 47,000 photos // SUGGESTION: More granular exposed filtering options AND throttle #41
One-Hoopy-Frood started this conversation in Ideas
Replies: 1 comment
-
For problem A: take a backup, then `docker compose down ; docker compose up --detach`, or you could clone the GitHub repository in a different directory and run the docker compose stack again with your .env values. For problem B: I do agree that more granular filters or a higher comparison percentage should be used.
-
Hi Matt, first, thanks for taking this on and creating a util that seems to be working for many.
Background:
I've been doing photography since the early '90s and have a library (on Google Photos) of 47,000 photos.
At several points, I was given libraries with many known duplicates.
(Windows 11, Docker Desktop for Windows, i7 4-core/8-thread, 32 GB RAM)
Everything runs fine and system resources are not overwhelmed; usage usually hovers at about 0.33 while I watch the web/Docker page.
Problems:
A) I think there may be a throttling issue: an INITIAL 'scrape' to collect the Google Photos with [RefreshMediaItems] seems to succeed without issue, but re-running the library update (where it attempts to refresh all images) fails with a lot of log entries like this:
I'm also wondering whether timeouts are causing some of what's going on in B) below, but I'm not sure.
B) The system appears to search for duplicates using what looks like several matching strategies combined (my guess at that combination, in code form, follows the list):
(OpenCV Visual similarity)
OR
(Matching Filenames)
(Matching dimensions)
(Matching filesizes)
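To make that concrete, here is a minimal sketch of how I read that combination. This is purely a guess from observed behaviour, not the project's actual matching code, and the attribute names and threshold are invented:

```python
def looks_like_duplicate(a, b, visual_similarity):
    """Pure speculation -- NOT the project's real code. My reading of the
    strategy list above: a high enough visual score on its own is a match,
    or else agreement on the metadata fields (I can't tell from the output
    exactly how those three are combined, which is part of the problem).
    If the API returns no usable file size, that check can't rule anything
    out -- see the Storage Saver question further down."""
    metadata_match = (
        a.filename == b.filename
        and (a.width, a.height) == (b.width, b.height)
        and (a.filesize is None or b.filesize is None
             or a.filesize == b.filesize)
    )
    return visual_similarity >= 0.997 or metadata_match  # threshold is a guess
```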
At the top of my results, I get a TON of CLEARLY 4-5x duplicate images with >99.7% similarity. These are all actual duplicates (I think).
At the bottom, I get some really unhelpful results where the system appears to find the following combinations as "matches":
(see uploaded images)
I was wondering if this is a side-effect of "Storage Saver", which might not technically "expose" a file size to the API inquiries? I downloaded a copy of the bottom two photos and compared their sizes on disk and they are clearly different.
(see image)
The result is that the bottom 10% of a 4,200-item list of "matches" are pairs that are very clearly not similar. For someone with a very small library, "filename matches" may be useful.
Suggestions/Ideas:
Limit request rate to API Limit
If too many requests are going through the API and it's timing out on large libraries, is there a way for the system to track the number of "requests made" and compare that against the "request rate limit" so the system stays below that rate? (A rough sketch of the kind of throttle I mean is below.)
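Not knowing the internals, here is a minimal sketch of the kind of client-side throttle I'm imagining, in Python. The quota numbers, class name, and the `refresh_media_item` call are placeholders, not anything from the project or from Google's published limits:

```python
import time
from collections import deque

class RequestThrottle:
    """Blocks so that no more than `max_requests` calls are issued in any
    rolling `per_seconds` window. The defaults are placeholders, not
    Google's actual quota."""

    def __init__(self, max_requests=250, per_seconds=60):
        self.max_requests = max_requests
        self.per_seconds = per_seconds
        self.timestamps = deque()

    def wait(self):
        now = time.monotonic()
        # Drop timestamps that have aged out of the rolling window.
        while self.timestamps and now - self.timestamps[0] >= self.per_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.max_requests:
            # Sleep just long enough for the oldest request to age out.
            time.sleep(self.per_seconds - (now - self.timestamps[0]))
            self.timestamps.popleft()
        self.timestamps.append(time.monotonic())

# Hypothetical use inside the refresh loop:
# throttle = RequestThrottle()
# for item in media_items:
#     throttle.wait()
#     refresh_media_item(item)  # stand-in for the real API call
```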
Exposed Filters
Exposed "similarity" filters at the "Process Duplicates" phase AND report in batches? I was thinking something like:
Matching Strategy
Minimum Visual Similarity Threshold: [NN.N]
[CHECKBOX] Search for files with identical size (optional)
[CHECKBOX] Find files with the same filename (optional)
Limit output to first [NN] matches (suggested for large libraries)
^^ The above would be a simple way to expose what I think are the current matching strategies so the user can tweak them to suit. For example:
I might first want to make a pass with ONLY VISUAL SIMILARITY at 99.8%, since I know those images will almost certainly be actual duplicates.
Then I might want to do another pass with a LOWER VISUAL SIMILARITY (75%) but only considering files with "matching filenames". And since I have a HUGE library and want to VISUALLY check ALL reported matches, I could limit the output to a more manageable set before deleting, to make sure it's "doing what I expect it to do".
I can think of more COMPLICATED things for the UX, but the above list would get me what I need by allowing me to AND-combine the above requirements (a rough sketch of those two passes is below).
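Here is a minimal sketch of what I mean, with the two example passes above. The `Candidate` fields and everything else here are invented for illustration; they aren't the project's actual data structures:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Candidate:
    """One flagged pair of photos -- fields invented for illustration."""
    visual_similarity: float  # percent, 0-100
    filename_a: str
    filename_b: str
    filesize_a: int
    filesize_b: int

@dataclass
class DuplicateFilter:
    """Hypothetical "Process Duplicates" filter in which every enabled
    criterion must hold (AND), instead of any one of them being enough."""
    min_visual_similarity: float = 0.0   # percent, 0-100
    require_same_filesize: bool = False
    require_same_filename: bool = False
    max_results: Optional[int] = None    # cap output for huge libraries

    def matches(self, c: Candidate) -> bool:
        if c.visual_similarity < self.min_visual_similarity:
            return False
        if self.require_same_filesize and c.filesize_a != c.filesize_b:
            return False
        if self.require_same_filename and c.filename_a != c.filename_b:
            return False
        return True

    def apply(self, candidates: List[Candidate]) -> List[Candidate]:
        kept = [c for c in candidates if self.matches(c)]
        return kept[:self.max_results] if self.max_results else kept

# Pass 1: visual similarity only, very strict -- near-certain duplicates.
pass_one = DuplicateFilter(min_visual_similarity=99.8)

# Pass 2: looser similarity, but only where the filenames also match,
# capped so a huge library still yields a batch I can eyeball.
pass_two = DuplicateFilter(min_visual_similarity=75.0,
                           require_same_filename=True,
                           max_results=200)
```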