This repository has been archived by the owner on Aug 10, 2018. It is now read-only.

Tools

This is a list of the tools we’re using and creating.

Communication and Collaboration

We’ve been collaborating via Slack and GitHub. If you’re not already familiar with these tools and have time to check them out, please do so!

Chrome Extension

The EOT project has a simple bookmarklet for nominating “seeds” for their web crawler; you can read about it here. For our event we derived a Chrome Extension from it, and we recommend it to future groups.

Scrapers

We expect to use the Scrapy Python library, and possibly its GUI companion Portia, for this work.

Crawlers

The Internet Archive (IA) uses Heritrix for its crawls, but it is finicky to set up and somewhat complex to learn, so we are not using it at our event. Instead, we will have to code our own crawler. There are two basic technical requirements:
  • Any tool we use should respect robots.txt.
  • Any archives we produce should be in WARC format, the ISO standard for web archives used by the Internet Archive.
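
The first requirement above can be checked with Python’s standard library alone. The following is a minimal sketch, not the crawler we will actually run: the function name `is_allowed`, the user agent `example-crawler`, and the sample robots.txt rules are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, used only to demonstrate the check.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for url in (
        "https://example.org/data/report.html",
        "https://example.org/private/notes.html",
    ):
        print(url, is_allowed(EXAMPLE_ROBOTS, "example-crawler", url))
```

In a real crawl the robots.txt text would be fetched from each site before any of its pages are requested; `RobotFileParser` can also do that itself via `set_url()` and `read()`.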

One simple solution seems to be WARC Middleware, which can accept a list of URLs in a file. As I understand it, Scrapy obeys robots.txt if the setting ROBOTSTXT_OBEY = True is included in the project’s settings.py file.
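
As a rough sketch of the Scrapy side, a settings fragment and a seed-list reader might look like the following. Only `ROBOTSTXT_OBEY` is a real Scrapy setting; the helper `load_seed_urls` and the one-URL-per-line file format are assumptions for illustration, and nothing here configures WARC Middleware itself.

```python
# Fragment for a Scrapy project's settings.py.
# ROBOTSTXT_OBEY is a standard Scrapy setting; when True, Scrapy's
# robots.txt middleware skips requests that the site disallows.
ROBOTSTXT_OBEY = True


def load_seed_urls(path):
    """Hypothetical helper: read seed URLs from a text file, one per line.

    Blank lines and lines starting with '#' are skipped. The file layout
    is an assumption, not something WARC Middleware prescribes.
    """
    urls = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line and not line.startswith("#"):
                urls.append(line)
    return urls
```

The list returned by `load_seed_urls` could then be assigned to a spider’s `start_urls` attribute, which is where Scrapy spiders conventionally keep their starting points.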

Another suggestion we’ve gotten is grab-site, which came to us via the maintainers of webrecorder.

You can learn more about the WARC format here:

Server Architecture

We have an offer of help from the maintainer of IPFS. Here’s some background on the protocol.