Our nomination tool asks the Internet Archive webcrawler, Heritrix, to start deeper crawls from URLs that are buried 2-3 clicks down in a website’s link hierarchy. In this document we explain a little bit about what Heritrix can do, why it needs our help, and how to identify documents and datasets that Heritrix can’t reach.
A webcrawler like Heritrix reads web pages and searches for links. When it finds a link, it follows it to a new page, where it once again searches for links, and then follows those, and so on, and so on… As you can imagine, this rapidly leads to a very large number of links! The Internet Archive therefore imposes limits on Heritrix: after a certain number of “hops”, it will stop collecting links and move on to the next “seed” in its list. This allows it to cover a lot of “territory” – moving relatively rapidly through the huge .gov domain – but it means that there are “hidden depths” which it doesn’t see.
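If you like to see things in code, here is a rough, purely illustrative sketch of what a hop-limited crawl looks like. It’s written in Python for readability (Heritrix itself is a Java application), and the hop limit and helper functions are made up for the example – they aren’t the Internet Archive’s real settings.

```python
# A minimal, illustrative hop-limited crawler (not Heritrix itself).
# It follows links breadth-first from a seed and stops queuing new links
# once a page is more than MAX_HOPS clicks away from that seed.
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

MAX_HOPS = 3  # illustrative limit, not the Internet Archive's real setting

def extract_links(base_url, html):
    """Very crude link extraction: only href="..." attributes in the HTML."""
    return [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)]

def crawl(seed):
    queue = deque([(seed, 0)])        # (url, number of hops from the seed)
    seen = {seed}
    while queue:
        url, hops = queue.popleft()
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except Exception:
            continue                  # skip pages we can't fetch
        print(f"archived ({hops} hops): {url}")
        if hops >= MAX_HOPS:
            continue                  # too deep: don't follow this page's links
        for link in extract_links(url, html):
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append((link, hops + 1))

# crawl("https://example.gov/")      # hypothetical seed
```

The important part is the `hops >= MAX_HOPS` check: once a page sits too many clicks from the seed, its links are simply never queued. That is exactly the “hidden depths” problem described above, and it’s why nominating a buried URL as its own seed makes such a difference.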
When we nominate seeds, we draw Heritrix’s attention to some of this “deep web” – identifying it as especially important, and worth crawling. This is important work, and we’re grateful to the volunteers who help us do it.
Heritrix is a clever program, but it runs in a “command-line” environment, and is fully automated. That means that there are certain things it can’t do. On the other hand, what it can do, it does very well.
Heritrix can:
- browse any “http://” or “https://” link that it finds. This includes zip files, excel files, PDFs, etc. – as long as the link you see is really a link to the file, and not an “obfuscated” link that instead points to a page that includes a complex environment (see below). A quick way to check whether a link really points at a file is sketched just below.
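If you want to double-check whether a link really points straight at a file, one handy (and entirely optional) trick is to ask the server for just the headers and look at the Content-Type. This little Python sketch isn’t part of Heritrix or the nomination tool – it’s only an illustration, and the example URL is a placeholder.

```python
# Sketch: check whether a URL points directly at a document file or at an
# HTML page. A HEAD request fetches only headers, so nothing large downloads.
import urllib.request

def looks_like_direct_file(url):
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        content_type = resp.headers.get("Content-Type", "")
    # A real file link usually reports a document type (PDF, ZIP, Excel, ...);
    # an "obfuscated" link usually returns text/html for a viewer page instead.
    return not content_type.startswith("text/html")

# print(looks_like_direct_file("https://example.gov/report.pdf"))  # placeholder URL
```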
But here are some things that Heritrix can’t do:
- it can’t click “Go!” or “Submit” on a web form – so search forms tend to stop it pretty much dead in its tracks, unless there’s also a “browse” link that leads to the full list of searchable resources
- It doesn’t have a full browser environment that can do computing work for it. Many sophisticated webpages ask the browser to do a lot of computational work, and Heritrix can’t do that very well.
- It also can’t follow links that don’t use the Hypertext Transfer Protocol (HTTP). This mostly affects resources that aren’t designed for direct browsing. There are at least two important sub-categories here:
- [UPDATE <2016-12-22 Thu>: IA has advised that EOT is now doing an FTP crawl. FTP sites can now be seeded directly using the Nomination Tool]
- Databases themselves: Lots of the data on the web is stored in so-called relational databases. These databases are almost never available directly on the web. Instead, a database server returns particular bits of information in response to a request from a webserver, or from the user or browser. Those requests travel over paths that Heritrix can’t follow. So, recovering the database itself will require either a lot of technical hoop-jumping, or other methods, like sneakernet.
- Web Applications: Lots of the most user-friendly sites display data in a “web application” embedded in a web page. Often these applications use maps or other fancy interfaces to display government data in a user-friendly form. Since these applications load “dynamically” – in the browser environment, written in JavaScript, Java, or Flash – Heritrix won’t see them (the sketch just after this list shows the problem). If the application goes offline, the fancy interface will be useless. The Internet Archive will still have a copy of the web page, but it will just display an empty box in the middle where the fancy map is supposed to be.
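Here is a toy illustration of why that matters. A crawler that only extracts URLs from the HTML it downloads will find the static href and src attributes, but it never sees the data endpoint that the page’s JavaScript builds at runtime. The page in this example is invented for illustration – it isn’t from any real site.

```python
# Toy illustration: a link-extracting crawler only "sees" URLs written
# directly into the HTML, not URLs that JavaScript constructs at runtime.
import re

page = """
<html><body>
  <a href="/about.html">About</a>
  <div id="map"></div>
  <script>
    // The browser builds this request when the page runs;
    // a crawler that doesn't execute JavaScript never discovers it.
    fetch("/api/data?layer=" + chooseLayer()).then(drawMap);
  </script>
</body></html>
"""

# Crude extraction of the kind of URLs a non-JavaScript crawler can find:
static_urls = re.findall(r'(?:href|src)="([^"]+)"', page)
print(static_urls)   # ['/about.html'] -- the /api/data endpoint is invisible
```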
So one important task you can help with is identifying these un-crawlable database interfaces and alerting us (the project, not the webcrawler) to their location. Once we know about them, we can try to “reverse engineer” the interface – look for clues that identify the underlying data – and try to preserve it somehow, either by coding tricks, or through some other preservation avenue (such as an FOIA request or a manual download at a government library).
[for more information on these categories, see James Jacobs’ excellent blog post]
Let’s look at one example of a difficult-to-crawl site. The EPA’s publication search form leads to results pages like this one. If you open one of the actual results in a new tab, you’ll be able to follow along with the rest of this section.
This is quite a pretty-looking page, in some ways, but there are two red flags that suggest that the important data won’t be preserved by an IA webcrawl.
- first, the pretty picture in the center of the page is not an actual PDF, but just an image with your search terms highlighted in yellow. If you look at the actual HTML, it looks like this:
<img src="/Exe/tiff2png.cgi/P100OW3T.PNG?-i+-r+85+-g+15+-h+5,0,7,7,31,7,10,0,7,13,3,7,14,28,7,20,20,7,21,45,7,25,36,7,28,0,7,32,28,7,34,43,7,37,78,7,42,45,7,42,62,7,43,64,7,45,68,7,51,81,7,54,55,7,56,74,7,59,70,7+D%3A%5CZYFILES%5CINDEX%20DATA%5C11THRU15%5CTIFF%5C00001231%5CP100OW3T.TIF" style="max-width:none;margin-bottom:5px">
So by capturing the page content, we’ll only have archived an image of a single page of the document, rather than the document itself!
- second, while there are “links” of sorts to the PDF and text documents, they aren’t “real” links. Instead, they are JavaScript commands which open new browser windows and dynamically generate the URLs for the requested documents.
You can verify this in two ways. First, use “Inspect Element” on the “PDF” link, and you’ll see that it looks like this:
<a href="#" title="Download this document as a PDF" alt="Download this document as a PDF" onclick="ZyShowPDF('PDF',event)">PDF</a>
Heritrix can’t capture links of this kind.
The other way to confirm this is by clicking on the link and paying attention while the PDF loads in the new window. Instead of loading right away, the page first gives you a fancy loading screen. This means that some kind of communication is happening between your browser and the server while you wait. Heritrix can’t talk to the server like your browser can! So the capture will fail and the really important document – the actual resource! – won’t be archived.
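If you’re comfortable with a little scripting, you can even scan a results page for these obfuscated links automatically: look for anchors whose href is just “#” (or a javascript: URL) but which carry an onclick handler – exactly the kind of link Heritrix can’t follow. This is only a sketch, not part of the project’s tooling; it assumes the third-party requests and beautifulsoup4 packages are installed, and the example URL is a placeholder.

```python
# Sketch: flag anchor tags that look like JavaScript-only links, such as
#   <a href="#" onclick="ZyShowPDF('PDF',event)">PDF</a>
# Assumes the third-party requests and beautifulsoup4 packages are installed.
import requests
from bs4 import BeautifulSoup

def find_obfuscated_links(url):
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    flagged = []
    for a in soup.find_all("a"):
        href = (a.get("href") or "").strip()
        if (href in ("", "#") or href.startswith("javascript:")) and a.get("onclick"):
            flagged.append((a.get_text(strip=True), a["onclick"]))
    return flagged

# for text, onclick in find_obfuscated_links("https://example.gov/results"):  # placeholder URL
#     print(text, "->", onclick)   # candidates to report to the tech team
```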
In these cases, follow the instructions above to notify the tech team that you’ve found something the webcrawler can’t archive!
For today’s archiv-a-thon, we are using [edit this to say either the nomination tool, or a chrome extension, or something else.]. Then add some more detailed instructions about how to use it.
This is an important job! When you find these, please:
- add them to this spreadsheet [add link]
- let the coding group know what you’ve found, so they can add it to their task queue
The org-mode source for this document can be found on GitHub.