Mandatory metadata are not all set #427

Open
rgaudin opened this issue Dec 19, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@rgaudin
Member

rgaudin commented Dec 19, 2024

Here's the log of a zimit run that eventually died with a ValueError: Mandatory metadata are not all set.

Should warc2zim fail on missing mandatory metadata before starting the creator?

[zimit::2024-12-18 22:41:11,580] INFO:----------
[zimit::2024-12-18 22:41:11,580] INFO:Testing warc2zim args
[zimit::2024-12-18 22:41:11,580] INFO:Running: warc2zim --name=www.gradphotonetwork.com_43c54d45 --zim-file=www.gradphotonetwork.com_43c54d45.zim --publisher=openZIM --scraper-suffix zimit 2.1.6 --output /output --url https://www.gradphotonetwork.com/QPPlus/Proofs.aspx
[warc2zim::2024-12-18 22:41:11,587] INFO:Arguments valid, no inputs to process. Exiting with return code 100
[zimit::2024-12-18 22:41:11,588] INFO:Writing progress to /output/task_progress.json
[zimit::2024-12-18 22:41:11,595] INFO:
[zimit::2024-12-18 22:41:11,595] INFO:----------
[zimit::2024-12-18 22:41:11,596] INFO:Output to tempdir: /output/.tmp86x0bre6 - will keep
[zimit::2024-12-18 22:41:11,596] INFO:Running browsertrix-crawler crawl: crawl --failOnFailedSeed --waitUntil load --depth -1 --timeout 90 --behaviors autoplay,autofetch,siteSpecific --behaviorTimeout 90 --sizeLimit 4294967296 --diskUtilization 90 --timeLimit 7200 --url https://www.gradphotonetwork.com/QPPlus/Proofs.aspx --userAgentSuffix zimit.kiwix.org+ contact+zimfarm@kiwix.org --mobileDevice Pixel 2 --cwd /output/.tmp86x0bre6 --statsFilename /output/crawl.json
{"timestamp":"2024-12-18T22:41:14.605Z","logLevel":"info","context":"general","message":"Browsertrix-Crawler 1.3.5 (with warcio.js 2.3.1)","details":{}}
{"timestamp":"2024-12-18T22:41:14.610Z","logLevel":"info","context":"general","message":"Seeds","details":[{"url":"https://www.gradphotonetwork.com/QPPlus/Proofs.aspx","scopeType":"prefix","include":["/^https?:\\/\\/www\\.gradphotonetwork\\.com\\/QPPlus\\//"],"exclude":[],"allowHash":false,"depth":-1,"sitemap":null,"auth":null,"_authEncoded":null,"maxExtraHops":0,"maxDepth":1000000}]}
{"timestamp":"2024-12-18T22:41:14.611Z","logLevel":"info","context":"general","message":"Behavior Options","details":{"message":"{\"autoplay\":true,\"autofetch\":true,\"siteSpecific\":true,\"log\":\"__bx_log\",\"startEarly\":true}"}}
{"timestamp":"2024-12-18T22:41:16.457Z","logLevel":"info","context":"worker","message":"Creating 1 workers","details":{}}
{"timestamp":"2024-12-18T22:41:16.460Z","logLevel":"info","context":"worker","message":"Worker starting","details":{"workerid":0}}
{"timestamp":"2024-12-18T22:41:18.155Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://www.gradphotonetwork.com/QPPlus/Proofs.aspx"}}
{"timestamp":"2024-12-18T22:41:18.160Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":0,"total":1,"pending":1,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2024-12-18T22:41:16.468Z\",\"extraHops\":0,\"url\":\"https:\\/\\/www.gradphotonetwork.com\\/QPPlus\\/Proofs.aspx\",\"added\":\"2024-12-18T22:41:14.809Z\",\"depth\":0}"]}}
{"timestamp":"2024-12-18T22:41:18.998Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://www.gradphotonetwork.com/QPPlus/Proofs.aspx","workerid":0}}
{"timestamp":"2024-12-18T22:41:19.576Z","logLevel":"warn","context":"recorder","message":"Skipping URL from unknown frame","details":{"url":"https://www.gradphotonetwork.com/QPPlus/Proofs.aspx","frameId":"E7C86EE06B63209F678D88B88B9128C1"}}
{"timestamp":"2024-12-18T22:41:20.678Z","logLevel":"warn","context":"recorder","message":"Skipping URL from unknown frame","details":{"url":"https://www.gradphotonetwork.com/QPPlus/Default.aspx?Action=Expire&Source=Proofs&p2&c","frameId":"E7C86EE06B63209F678D88B88B9128C1"}}
{"timestamp":"2024-12-18T22:41:21.541Z","logLevel":"warn","context":"recorder","message":"Skipping URL from unknown frame","details":{"url":"https://www.gradphotonetwork.com/QPPlus/m/Default.aspx?Action=Expire&Source=Proofs&p2&c","frameId":"E7C86EE06B63209F678D88B88B9128C1"}}
{"timestamp":"2024-12-18T22:41:22.623Z","logLevel":"info","context":"general","message":"Seed page redirected, adding redirected seed","details":{"origUrl":"https://www.gradphotonetwork.com/QPPlus/Proofs.aspx","newUrl":"https://www.gradphotonetwork.com/QPPlus/m/Default.aspx?Action=Expire&Source=Proofs&p2&c","seedId":1}}
{"timestamp":"2024-12-18T22:41:23.693Z","logLevel":"warn","context":"general","message":"Invalid Page - URL must start with http:// or https://","details":{"url":"mailto:quicpics@candid.com","page":"https://www.gradphotonetwork.com/QPPlus/Proofs.aspx","workerid":0}}
{"timestamp":"2024-12-18T22:41:23.740Z","logLevel":"info","context":"behavior","message":"Running behaviors","details":{"frames":1,"frameUrls":["https://www.gradphotonetwork.com/QPPlus/m/Default.aspx?Action=Expire&Source=Proofs&p2&c"],"page":"https://www.gradphotonetwork.com/QPPlus/Proofs.aspx","workerid":0}}
{"timestamp":"2024-12-18T22:41:23.741Z","logLevel":"info","context":"behavior","message":"Run Script Started","details":{"frameUrl":"https://www.gradphotonetwork.com/QPPlus/m/Default.aspx?Action=Expire&Source=Proofs&p2&c","page":"https://www.gradphotonetwork.com/QPPlus/Proofs.aspx","workerid":0}}
{"timestamp":"2024-12-18T22:41:24.269Z","logLevel":"info","context":"behavior","message":"Run Script Finished","details":{"frameUrl":"https://www.gradphotonetwork.com/QPPlus/m/Default.aspx?Action=Expire&Source=Proofs&p2&c","page":"https://www.gradphotonetwork.com/QPPlus/Proofs.aspx","workerid":0}}
{"timestamp":"2024-12-18T22:41:24.270Z","logLevel":"info","context":"behavior","message":"Behaviors finished","details":{"finished":1,"page":"https://www.gradphotonetwork.com/QPPlus/Proofs.aspx","workerid":0}}
{"timestamp":"2024-12-18T22:41:25.276Z","logLevel":"info","context":"pageStatus","message":"Page Finished","details":{"loadState":4,"page":"https://www.gradphotonetwork.com/QPPlus/Proofs.aspx","workerid":0}}
{"timestamp":"2024-12-18T22:41:25.303Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://www.gradphotonetwork.com/QPPlus/m/Helper.ashx?Action=VisitFullSite"}}
{"timestamp":"2024-12-18T22:41:25.306Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":1,"total":2,"pending":1,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":1,\"started\":\"2024-12-18T22:41:25.302Z\",\"extraHops\":0,\"url\":\"https:\\/\\/www.gradphotonetwork.com\\/QPPlus\\/m\\/Helper.ashx?Action=VisitFullSite\",\"added\":\"2024-12-18T22:41:23.684Z\",\"depth\":1}"]}}
{"timestamp":"2024-12-18T22:41:25.720Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://www.gradphotonetwork.com/QPPlus/m/Helper.ashx?Action=VisitFullSite","workerid":0}}
{"timestamp":"2024-12-18T22:41:27.793Z","logLevel":"warn","context":"general","message":"Invalid Page - URL must start with http:// or https://","details":{"url":"mailto:quicpics@candid.com","page":"https://www.gradphotonetwork.com/QPPlus/m/Helper.ashx?Action=VisitFullSite","workerid":0}}
{"timestamp":"2024-12-18T22:41:27.814Z","logLevel":"info","context":"behavior","message":"Running behaviors","details":{"frames":1,"frameUrls":["https://www.gradphotonetwork.com/QPPlus/m/Default.aspx?Action=Expire&Source=Images.aspx&c"],"page":"https://www.gradphotonetwork.com/QPPlus/m/Helper.ashx?Action=VisitFullSite","workerid":0}}
{"timestamp":"2024-12-18T22:41:27.814Z","logLevel":"info","context":"behavior","message":"Run Script Started","details":{"frameUrl":"https://www.gradphotonetwork.com/QPPlus/m/Default.aspx?Action=Expire&Source=Images.aspx&c","page":"https://www.gradphotonetwork.com/QPPlus/m/Helper.ashx?Action=VisitFullSite","workerid":0}}
{"timestamp":"2024-12-18T22:41:28.337Z","logLevel":"info","context":"behavior","message":"Run Script Finished","details":{"frameUrl":"https://www.gradphotonetwork.com/QPPlus/m/Default.aspx?Action=Expire&Source=Images.aspx&c","page":"https://www.gradphotonetwork.com/QPPlus/m/Helper.ashx?Action=VisitFullSite","workerid":0}}
{"timestamp":"2024-12-18T22:41:28.337Z","logLevel":"info","context":"behavior","message":"Behaviors finished","details":{"finished":1,"page":"https://www.gradphotonetwork.com/QPPlus/m/Helper.ashx?Action=VisitFullSite","workerid":0}}
{"timestamp":"2024-12-18T22:41:29.341Z","logLevel":"info","context":"pageStatus","message":"Page Finished","details":{"loadState":4,"page":"https://www.gradphotonetwork.com/QPPlus/m/Helper.ashx?Action=VisitFullSite","workerid":0}}
{"timestamp":"2024-12-18T22:41:29.359Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://www.gradphotonetwork.com/QPPlus/m/Default.aspx?Action=Expire&Source=Images.aspx&c"}}
{"timestamp":"2024-12-18T22:41:29.362Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":2,"total":3,"pending":1,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":1,\"started\":\"2024-12-18T22:41:29.359Z\",\"extraHops\":0,\"url\":\"https:\\/\\/www.gradphotonetwork.com\\/QPPlus\\/m\\/Default.aspx?Action=Expire&Source=Images.aspx&c\",\"added\":\"2024-12-18T22:41:27.789Z\",\"depth\":2}"]}}
{"timestamp":"2024-12-18T22:41:29.506Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://www.gradphotonetwork.com/QPPlus/m/Default.aspx?Action=Expire&Source=Images.aspx&c","workerid":0}}
{"timestamp":"2024-12-18T22:41:30.999Z","logLevel":"warn","context":"general","message":"Invalid Page - URL must start with http:// or https://","details":{"url":"mailto:quicpics@candid.com","page":"https://www.gradphotonetwork.com/QPPlus/m/Default.aspx?Action=Expire&Source=Images.aspx&c","workerid":0}}
{"timestamp":"2024-12-18T22:41:31.020Z","logLevel":"info","context":"behavior","message":"Running behaviors","details":{"frames":1,"frameUrls":["https://www.gradphotonetwork.com/QPPlus/m/Default.aspx?Action=Expire&Source=Images.aspx&c"],"page":"https://www.gradphotonetwork.com/QPPlus/m/Default.aspx?Action=Expire&Source=Images.aspx&c","workerid":0}}
{"timestamp":"2024-12-18T22:41:31.021Z","logLevel":"info","context":"behavior","message":"Run Script Started","details":{"frameUrl":"https://www.gradphotonetwork.com/QPPlus/m/Default.aspx?Action=Expire&Source=Images.aspx&c","page":"https://www.gradphotonetwork.com/QPPlus/m/Default.aspx?Action=Expire&Source=Images.aspx&c","workerid":0}}
{"timestamp":"2024-12-18T22:41:31.543Z","logLevel":"info","context":"behavior","message":"Run Script Finished","details":{"frameUrl":"https://www.gradphotonetwork.com/QPPlus/m/Default.aspx?Action=Expire&Source=Images.aspx&c","page":"https://www.gradphotonetwork.com/QPPlus/m/Default.aspx?Action=Expire&Source=Images.aspx&c","workerid":0}}
{"timestamp":"2024-12-18T22:41:31.543Z","logLevel":"info","context":"behavior","message":"Behaviors finished","details":{"finished":1,"page":"https://www.gradphotonetwork.com/QPPlus/m/Default.aspx?Action=Expire&Source=Images.aspx&c","workerid":0}}
{"timestamp":"2024-12-18T22:41:32.547Z","logLevel":"info","context":"pageStatus","message":"Page Finished","details":{"loadState":4,"page":"https://www.gradphotonetwork.com/QPPlus/m/Default.aspx?Action=Expire&Source=Images.aspx&c","workerid":0}}
{"timestamp":"2024-12-18T22:41:32.569Z","logLevel":"info","context":"worker","message":"Worker done, all tasks complete","details":{"workerid":0}}
{"timestamp":"2024-12-18T22:41:32.888Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":3,"total":3,"pending":0,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":[]}}
{"timestamp":"2024-12-18T22:41:32.891Z","logLevel":"info","context":"general","message":"Crawling done","details":{}}
{"timestamp":"2024-12-18T22:41:32.894Z","logLevel":"info","context":"general","message":"Exiting, Crawl status: done","details":{}}
[zimit::2024-12-18 22:41:32,928] INFO:
[zimit::2024-12-18 22:41:32,928] INFO:----------
[zimit::2024-12-18 22:41:32,929] INFO:Processing WARC files in/at /output/.tmp86x0bre6/collections/crawl-20241218224114579/archive
[zimit::2024-12-18 22:41:32,929] INFO:Calling warc2zim with these args: ['--name=www.gradphotonetwork.com_43c54d45', '--zim-file=www.gradphotonetwork.com_43c54d45.zim', '--publisher=openZIM', '--scraper-suffix', 'zimit 2.1.6', '--output', '/output', '--url', 'https://www.gradphotonetwork.com/QPPlus/Proofs.aspx', '-v', '--progress-file', '/output/warc2zim.json', '/output/.tmp86x0bre6/collections/crawl-20241218224114579/archive']
[warc2zim::2024-12-18 22:41:32,935] DEBUG:Attempting to confirm output is writable in directory /output
[warc2zim::2024-12-18 22:41:32,936] DEBUG:Output is writable. Temporary file used for test: /output/tmpb5_i4efn
[warc2zim::2024-12-18 22:41:32,937] DEBUG:Confirming ZIM file can be created using name: www.gradphotonetwork.com_43c54d45.zim
[warc2zim::2024-12-18 22:41:32,938] DEBUG:1 WARC files found
[warc2zim::2024-12-18 22:41:32,944] WARNING:HTTP 302 occurred on main page; replacing ZimPath(www.gradphotonetwork.com/QPPlus/Proofs.aspx) with ZimPath(www.gradphotonetwork.com/QPPlus/Default.aspx?Action=Expire&Source=Proofs&p2&c)
[warc2zim::2024-12-18 22:41:32,946] WARNING:HTTP 302 occurred on main page; replacing ZimPath(www.gradphotonetwork.com/QPPlus/Default.aspx?Action=Expire&Source=Proofs&p2&c) with ZimPath(www.gradphotonetwork.com/QPPlus/m/Default.aspx?Action=Expire&Source=Proofs&p2&c)
[warc2zim::2024-12-18 22:41:33,210] DEBUG:Title: 

[warc2zim::2024-12-18 22:41:33,210] DEBUG:Language: 
[warc2zim::2024-12-18 22:41:33,210] DEBUG:Favicons to consider: https://www.gradphotonetwork.com/favicon.ico
[warc2zim::2024-12-18 22:41:33,259] INFO:Expecting 17 ZIM entries to files
[warc2zim::2024-12-18 22:41:33,259] DEBUG:Preparing 5 redirections
[warc2zim::2024-12-18 22:41:33,259] DEBUG:0 redirections will be ignored
[warc2zim::2024-12-18 22:41:33,259] INFO:Expecting 22 ZIM entries including redirects
[warc2zim::2024-12-18 22:41:33,260] WARNING:No valid ZIM language, fallbacking to `eng`.
[warc2zim::2024-12-18 22:41:33,472] DEBUG:Found favicon in WARC at https://www.gradphotonetwork.com/favicon.ico which is 16x16
Traceback (most recent call last):
  File "/usr/bin/zimit", line 8, in <module>
    sys.exit(zimit.zimit())
             ^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/zimit/zimit.py", line 688, in zimit
    run(sys.argv[1:])
  File "/app/zimit/lib/python3.12/site-packages/zimit/zimit.py", line 609, in run
    return warc2zim(warc2zim_args)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/main.py", line 168, in main
    return converter.run()
           ^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/warc2zim/converter.py", line 359, in run
    ).start()
      ^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/zimscraperlib/zim/creator.py", line 212, in start
    raise ValueError("Mandatory metadata are not all set.")
ValueError: Mandatory metadata are not all set.
[zimit::2024-12-18 22:41:33,504] INFO:
[zimit::2024-12-18 22:41:33,504] INFO:
[zimit::2024-12-18 22:41:33,504] INFO:SIGINT/SIGTERM received, stopping zimit
[zimit::2024-12-18 22:41:33,504] INFO:
[zimit::2024-12-18 22:41:33,504] INFO:
@rgaudin rgaudin added the enhancement New feature or request label Dec 19, 2024
@benoit74
Collaborator

It should indeed ... but it is probably not feasible, because we do not (yet) have all metadata.

I suspect the problem is something like the title which we source from the main webpage.

Since we source a lot of things from the main page (favicon, title, description, ...), one idea could be to modify zimit to run two crawls: a first crawl with depth 0 to capture only the main page, then warc2zim to confirm everything is fine on this first page, and only then the real crawl.
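A minimal sketch of the two-crawl idea, assuming the crawler and the metadata check are passed in as plain callables. `crawl_args`, `two_phase_crawl`, `run_crawl` and `check_main_page` are hypothetical names for illustration, not zimit functions; the only crawler flags used (`--depth`, `--url`, `--cwd`) are the ones visible in the log above.

```python
from collections.abc import Callable, Sequence


def crawl_args(url: str, cwd: str, depth: int) -> list[str]:
    """Build a crawler argument list using flags seen in the log above."""
    return ["crawl", "--depth", str(depth), "--url", url, "--cwd", cwd]


def two_phase_crawl(
    url: str,
    probe_cwd: str,
    full_cwd: str,
    run_crawl: Callable[[Sequence[str]], None],
    check_main_page: Callable[[str], list[str]],
) -> None:
    """Crawl only the main page first; run the real crawl only if it passes.

    `run_crawl` and `check_main_page` are hypothetical hooks: the first runs
    the crawler with the given arguments, the second returns the names of the
    mandatory metadata that could not be derived from the probe crawl.
    """
    # Phase 1: depth 0 captures the seed page only, so it is cheap.
    run_crawl(crawl_args(url, probe_cwd, depth=0))
    missing = check_main_page(probe_cwd)
    if missing:
        raise ValueError(f"Missing mandatory metadata: {', '.join(missing)}")
    # Phase 2: the real crawl (-1 is the depth value used in the log above).
    run_crawl(crawl_args(url, full_cwd, depth=-1))
```

With this shape, a missing title stops the run after a few seconds of probing instead of after the full crawl.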

Another idea could be to provide fallback values for these mandatory metadata so that the ZIM is always valid; this is simpler, and we already do it for the illustration, but it is clearly a bit dirty when it comes to title and description.
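The fallback idea could be sketched as below. This is an illustration only: `with_fallbacks` and the fallback strings are hypothetical, and nothing here reflects how warc2zim handles the illustration today. Values are stripped before testing because the log above shows a title made of whitespace only.

```python
def with_fallbacks(
    metadata: dict[str, str], fallbacks: dict[str, str]
) -> tuple[dict[str, str], list[str]]:
    """Return a copy of `metadata` with blank values replaced by fallbacks.

    The second item lists the keys that were replaced, so the caller can log
    a warning instead of silently shipping placeholder metadata.
    """
    filled = dict(metadata)
    replaced: list[str] = []
    for key, default in fallbacks.items():
        # A value that is absent, empty or whitespace-only counts as missing.
        if not filled.get(key, "").strip():
            filled[key] = default
            replaced.append(key)
    return filled, replaced


# Hypothetical usage: the title found on the main page was just a newline.
found = {"Title": "\n", "Description": "Photo proofs"}
defaults = {"Title": "Untitled website", "Description": "No description"}
metadata, replaced = with_fallbacks(found, defaults)
```

Returning the replaced keys keeps the "dirty" part visible: the run succeeds, but the log can still say which metadata are placeholders.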

@rgaudin
Member Author

rgaudin commented Dec 20, 2024

Yes, that's a possibility; I was only suggesting that when we figure out there's a missing title, for instance, we fail with a clear message.
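A clear failure could look like the sketch below. The function names are hypothetical, and the key list is an assumption based on the openZIM metadata conventions (the binary illustration is left out for brevity); it is not copied from scraperlib.

```python
# Assumed list of mandatory text metadata, for illustration only.
MANDATORY_KEYS = (
    "Name",
    "Title",
    "Creator",
    "Publisher",
    "Date",
    "Description",
    "Language",
)


def missing_metadata(
    metadata: dict[str, str], mandatory: tuple[str, ...] = MANDATORY_KEYS
) -> list[str]:
    """Return the mandatory keys that are absent, empty or whitespace-only."""
    return [key for key in mandatory if not metadata.get(key, "").strip()]


def ensure_metadata(metadata: dict[str, str]) -> None:
    """Raise a ValueError that names every missing key."""
    missing = missing_metadata(metadata)
    if missing:
        raise ValueError(
            "Mandatory metadata are not all set. Missing: " + ", ".join(missing)
        )
```

Calling `ensure_metadata` as soon as the main page has been parsed would surface the problem before the creator is started.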

@benoit74
Collaborator

Displaying the list of missing metadata is already fixed in scraperlib 5.0.0 (it is just a matter of releasing it, then).

https://github.com/openzim/python-scraperlib/blob/b7b5e46df88ced341be283ebfedfce4148f36fa2/src/zimscraperlib/zim/creator.py#L218-L222

I still consider that failing a scrape which might have taken hours or even days of crawling, just because a title or description is missing, is quite disappointing for the user.

@rgaudin
Member Author

rgaudin commented Dec 20, 2024

I still consider that failing a scrape which might have taken hours or even days of crawling, just because a title or description is missing, is quite disappointing for the user.

Yes. This has been discussed. You're welcome to revive the discussion if you want.

I'd propose at least a flag for zimit SaaS to replace them with default values.

For our ZIMs, it's a different issue:

  • we don't want to publish with incorrect metadata
  • but we should test ZIMs first
  • and we have zimrecreate to fix
  • but we are not very rigorous
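
The proposed flag could be wired roughly as follows. This is a sketch under stated assumptions: `--metadata-fallbacks` is a hypothetical option name, not an existing zimit or warc2zim flag, and `resolve_title` is a hypothetical helper that only handles the title.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser exposing only the hypothetical opt-in flag."""
    parser = argparse.ArgumentParser(prog="example-scraper")
    parser.add_argument(
        "--metadata-fallbacks",
        action="store_true",
        help="replace missing mandatory metadata with default values",
    )
    return parser


def resolve_title(title: str, name: str, *, allow_fallback: bool) -> str:
    """Return a usable title, or fail unless fallbacks were requested."""
    cleaned = title.strip()
    if cleaned:
        return cleaned
    if allow_fallback:
        # Lenient mode (e.g. SaaS): fall back to the ZIM name.
        return name
    # Strict mode (default): never publish with placeholder metadata.
    raise ValueError("Title is missing and --metadata-fallbacks is not set")
```

Keeping strict mode as the default matches the first bullet above, while a service run can opt in to the lenient behavior.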
