This is just a list of the tools we’re using and creating.
We’ve been collaborating via Slack and GitHub. If you’re not already familiar with these tools and have time to check them out, please do so!
The EOT project has a simple bookmarklet for nominating “seeds” for their web crawler; you can read about it here. For our event we derived a Chrome extension from it, and we would recommend its use to future groups.
We expect to use the scrapy Python library and maybe its GUI companion portia for this work.
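If we do go the Scrapy route, a crawler can be as small as a single spider class. Here is a minimal sketch, purely illustrative (the spider name and seed URL are placeholders, not anything we’ve settled on):

    import scrapy

    class SeedSpider(scrapy.Spider):
        # "seeds" and the start URL below are placeholders for illustration only.
        name = "seeds"
        start_urls = ["https://www.example.gov/"]

        def parse(self, response):
            # Record the URL and HTTP status of every page we fetch.
            yield {"url": response.url, "status": response.status}
            # Queue up the links on the page so the crawl grows beyond the seeds.
            for href in response.css("a::attr(href)").extract():
                yield scrapy.Request(response.urljoin(href), callback=self.parse)

A file like that can be run without setting up a full project, e.g. scrapy runspider seed_spider.py -o pages.json, which is handy for quick experiments at the event.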
The Internet Archive (IA) uses Heritrix for its crawl, but it is a little finicky to set up and somewhat complex to learn, so we are not using it at our event. Instead, we will have to code our own. There are basically two technical specifications:
- Any tool we use should respect robots.txt.
- Any archives we produce should be in WARC format, the ISO standard for web archives used by the Internet Archive.
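For anyone who hasn’t handled WARC files before: a WARC is just a sequence of records (requests, responses, metadata) with their headers, and it can be inspected programmatically. Here is a small sketch using the warcio library maintained by the Webrecorder folks (warcio isn’t something we’ve committed to, and the filename is a placeholder):

    from warcio.archiveiterator import ArchiveIterator

    # "example.warc.gz" is a placeholder; any WARC our tools produce should work.
    with open("example.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            # Response records carry the captured page plus the original URL.
            if record.rec_type == "response":
                print(record.rec_headers.get_header("WARC-Target-URI"))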
One simple solution seems to be WARC Middleware, which can accept a list of URLs in a file. As I understand it, Scrapy obeys robots.txt if the setting ROBOTSTXT_OBEY = True is included in the project’s settings.py file.
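Concretely, that would mean something like the following in settings.py (ROBOTSTXT_OBEY and DOWNLOAD_DELAY come from the Scrapy docs; the project name and delay value are placeholders):

    # settings.py -- sketch of the settings relevant to our two requirements.
    BOT_NAME = "datarescue"  # placeholder project name

    # Have Scrapy fetch and honor each site's robots.txt before crawling it.
    ROBOTSTXT_OBEY = True

    # Be gentle with the servers we're archiving (seconds between requests).
    DOWNLOAD_DELAY = 1.0

    # The WARC Middleware mentioned above would also be registered via
    # DOWNLOADER_MIDDLEWARES; see its README for the exact entry to add.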
Another suggestion we’ve gotten is grab-site, which came to us via the maintainers of Webrecorder.
You can learn more about the WARC format here:
We have an offer of help from the maintainer of IPFS. Here’s some background on the protocol.