2.1.0

benoit74 released this 09 Aug 07:43
76c50ed

Added

  • New fuzzy rules for cheatography.com (#342), der-postillon.com (#330) and iranwire.com (#363)
  • Properly rewrite redirect target URL when present in HTML tag (#237)
  • New --encoding-aliases argument to pass encoding/charset aliases (#331)
  • Add support for SVG favicon (#148)
  • Automatically index PDF content and use PDF title (#289 and #290)
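
As an illustration of what an encoding alias does, here is a minimal Python sketch. It is not warc2zim's implementation: the `alias=encoding` comma-separated value format and both function names are assumptions made for this example. It maps a charset name that Python does not recognise onto one it does, then resolves the codec with the standard `codecs.lookup`.

```python
import codecs


def parse_encoding_aliases(raw: str) -> dict[str, str]:
    """Parse a string such as 'alias1=encoding1,alias2=encoding2'.

    The value format is an assumption for this sketch, not a documented one.
    """
    aliases: dict[str, str] = {}
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue
        alias, sep, target = item.partition("=")
        alias, target = alias.strip().lower(), target.strip().lower()
        if not sep or not alias or not target:
            raise ValueError(f"invalid alias entry: {item!r}")
        # Fail early if the target is not a codec Python knows about.
        codecs.lookup(target)
        aliases[alias] = target
    return aliases


def resolve_encoding(declared: str, aliases: dict[str, str]) -> str:
    """Return the canonical Python codec name for a declared charset."""
    name = declared.strip().lower()
    return codecs.lookup(aliases.get(name, name)).name
```

With such a map, a document declaring an otherwise unknown charset name can still be decoded, provided the alias points to a codec Python supports.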

Changed

  • Upgrade to python-scraperlib 4.0.0
  • Generate fuzzy rules tests in Python and JavaScript (#284)
  • Refactor HTML rewriter class to make it more open to change and expressive (#305)
  • Detect charset in document header only for HTML documents (#331)
  • Use software property from warcinfo record to set ZIM Scraper metadata (#357)
  • Store ContentDate as metadata, based on WARC-Date (#358)
  • Remove domain-specific rules (#328)
  • Revisit retrieve_illustration logic to prefer best favicons (#352 and #369)
  • Upgrade dependencies (zimscraperlib 4.0.0, wombat.js 3.7.12 and others) (#376)

Fixed

  • Handle case where the redirect target is bad / unsupported (#332 and #356)
  • Fix WARC file handling order to follow creation order (#366)
  • Remove consecutive slashes in URLs, both in Python and JS (#365)
  • Ignore non-HTTP(S) WARC records (#351)
  • Fix vimeo_cdn_fix fuzzy rule for proper operation in JavaScript (#348)
  • Fix performance issue linked to new "extensible" HTML rewriting rules (#370)
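
For the slash clean-up in #365, one plausible reading is that runs of slashes in the URL path are collapsed to a single slash. The Python sketch below shows that idea only; `collapse_slashes` is a hypothetical helper, not the function used in warc2zim, and the actual rule may differ (for instance in how the query string is treated).

```python
import re
from urllib.parse import urlsplit, urlunsplit


def collapse_slashes(url: str) -> str:
    """Collapse runs of '/' in the path, leaving scheme, host and query as-is."""
    parts = urlsplit(url)
    # Only the path is rewritten, so the '//' after the scheme is untouched.
    path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit(parts._replace(path=path))
```

Restricting the substitution to the path component keeps `https://` intact and avoids altering slashes that appear inside query parameters.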