This is just a list of the tools we’re using and creating.
We’ve been collaborating via Slack and GitHub. If you’re not already familiar with these tools and have time to check them out, please do so!
The EOT project has a simple bookmarklet for nominating “seeds” for their web crawler; you can read about it here. For our event we derived a Chrome extension from it, and we would recommend its use to future groups.
We expect to use the scrapy Python library and maybe its GUI companion portia for this work.
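If we do go the Scrapy route, a crawler can be as small as a single spider class. Here is a minimal sketch, purely illustrative (the spider name and seed URL are placeholders, not anything we’ve settled on):

    import scrapy

    class SeedSpider(scrapy.Spider):
        # "seeds" and the start URL below are placeholders for illustration only.
        name = "seeds"
        start_urls = ["https://www.example.gov/"]

        def parse(self, response):
            # Record the URL and HTTP status of every page we fetch.
            yield {"url": response.url, "status": response.status}
            # Queue up the links on the page so the crawl grows beyond the seeds.
            for href in response.css("a::attr(href)").extract():
                yield scrapy.Request(response.urljoin(href), callback=self.parse)

A file like that can be run without setting up a full project, e.g. scrapy runspider seed_spider.py -o pages.json, which is handy for quick experiments at the event.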
The Internet Archive (IA) uses Heritrix for its crawl, but it is a little finicky to set up and somewhat complex to learn, so we are not using it at our event. Instead, we will have to code our own. There are basically two technical specifications:
- Any tool we use should respect robots.txt.
- Any archives we produce should be in WARC format, the ISO standard for web archives used by the Internet Archive.
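For anyone who hasn’t handled WARC files before: a WARC is just a sequence of records (requests, responses, metadata) with their headers, and it can be inspected programmatically. Here is a small sketch using the warcio library maintained by the Webrecorder folks (warcio isn’t something we’ve committed to, and the filename is a placeholder):

    from warcio.archiveiterator import ArchiveIterator

    # "example.warc.gz" is a placeholder; any WARC our tools produce should work.
    with open("example.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            # Response records carry the captured page plus the original URL.
            if record.rec_type == "response":
                print(record.rec_headers.get_header("WARC-Target-URI"))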
One simple solution seems to be WARC Middleware, which can accept a list of URLs in a file. As I understand it, Scrapy obeys robots.txt if the setting ROBOTSTXT_OBEY = True is included in the project’s settings.py file.
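Concretely, that would mean something like the following in settings.py (ROBOTSTXT_OBEY and DOWNLOAD_DELAY come from the Scrapy docs; the project name and delay value are placeholders):

    # settings.py -- sketch of the settings relevant to our two requirements.
    BOT_NAME = "datarescue"  # placeholder project name

    # Have Scrapy fetch and honor each site's robots.txt before crawling it.
    ROBOTSTXT_OBEY = True

    # Be gentle with the servers we're archiving (seconds between requests).
    DOWNLOAD_DELAY = 1.0

    # The WARC Middleware mentioned above would also be registered via
    # DOWNLOADER_MIDDLEWARES; see its README for the exact entry to add.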
Another suggestion we’ve gotten is grab-site, which came to us via the maintainers of Webrecorder.
You can learn more about the WARC format here:
We have an offer of help from the maintainer of IPFS. Here’s some background on the protocol.