
CDX Server requirements

Graham Seaman edited this page Mar 1, 2016 · 10 revisions

As part of making the CDX-Server the default index engine for OpenWayback, we need to clean up and formally define the API for the CDX-Server. This document is meant as a workspace for defining that API.

The CDX-Server API, as it is today, is characterized by a relatively close link to how the underlying CDX format is implemented. Functionality varies depending on whether you are using traditional flat CDX files or compressed zipnum clusters. One of the benefits of having a CDX Server is that it separates the API from the underlying implementation. This would make it relatively easy to implement indexes based on other technologies in the future. As a consequence, we should avoid implementing features just because they are easy to do with a certain format if there is no real need for them. The same feature might be hard to implement on other technologies.

The API should also avoid giving the user conflicting options. For example, the current API allows the match type to be indicated both with a parameter and with a wildcard. It is therefore possible to set matchType=prefix and at the same time use a wildcard indicating matchType=domain.
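One way to handle such a conflict is to reject the request outright. The following is a minimal sketch of that policy, not the current implementation. It assumes the wildcard conventions are a leading `*.` for a domain match and a trailing `*` for a prefix match; the function name `resolve_match_type` and the default of `exact` are hypothetical.

```python
from typing import Optional, Tuple


def resolve_match_type(url: str, match_type: Optional[str] = None) -> Tuple[str, str]:
    """Return (url without wildcard, effective match type).

    Assumed conventions: "*.example.com" implies domain matching and
    "example.com/path*" implies prefix matching.
    """
    implied = None
    if url.startswith("*."):
        url, implied = url[2:], "domain"
    elif url.endswith("*"):
        url, implied = url[:-1], "prefix"

    # Reject requests where the parameter and the wildcard disagree.
    if match_type and implied and match_type != implied:
        raise ValueError(
            f"matchType={match_type} conflicts with wildcard implying {implied}"
        )
    return url, match_type or implied or "exact"
```

An alternative design is to drop one of the two mechanisms from the API entirely, which removes the need for this check.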

The following is a list of use-cases seen from the perspective of a user. Many of the use-cases are described as expectations of the OpenWayback GUI, but they are meant to help in understanding the CDX-Server's role. For each use-case we need to understand what functionality the CDX-Server is required to support. CDX-Server functionality with no supporting use-case should not be implemented in OpenWayback 3.0.0.

This is a work in progress. Edits and comments are highly appreciated.

Use-cases

1. The user has a link to a particular version of a document

This case could be a user referencing a document from a thesis. It is important that the capture referenced is exactly the one the user used when writing the thesis. In this case the user should get the capture that exactly matches both the url and timestamp.

The digest also needs to be considered to actually guarantee that the user gets the same version. In addition, you need to know that all embedded resources are also the same versions the user originally saw. Achieving all this might be hard or impossible.

2. The user selects one particular capture in the calendar

Similar to the above, but it might be acceptable to return a capture close in time if the requested capture is missing, i.e. the requirement for getting the same version is slightly loosened.

3. Get the best matching page when following a link

The user is looking at a page and wants to follow a link by clicking it. The user then expects to be brought to the capture of the new page that is closest in time.

4. Get the best match for embedded resources

Similar to the above, but the user is not involved. This applies to loading embedded images and similar resources.
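Use-cases 1 to 4 differ mainly in whether an exact timestamp match is required or the closest capture is acceptable. The sketch below illustrates the two lookups on a list of 14-digit timestamps for a single url. The function names are hypothetical, and a real index would seek into sorted data rather than scan a list.

```python
from datetime import datetime
from typing import List, Optional

TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"  # 14-digit timestamp, e.g. 20150301000000


def exact_capture(timestamps: List[str], requested: str) -> Optional[str]:
    """Use-case 1: return the capture only if the timestamp matches exactly."""
    return requested if requested in timestamps else None


def closest_capture(timestamps: List[str], requested: str) -> str:
    """Use-cases 2-4: return the capture nearest in time to the request."""
    target = datetime.strptime(requested, TIMESTAMP_FORMAT)
    return min(
        timestamps,
        key=lambda ts: abs(datetime.strptime(ts, TIMESTAMP_FORMAT) - target),
    )
```

For example, with captures on 1 January, 1 March and 1 December 2015, a request for 15 February 2015 has no exact match, and the closest lookup selects the 1 March capture.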

5. User requests/searches for an exact url without any timestamp, expecting to get a summary of captures for the url over time

The summary of captures might be presented in different ways, for example a list or a calendar.

6. User looks up a domain expecting a summary of captures over time

7. User searches with a truncated path expecting the results to show up as matching paths regardless of time

8. User searches with a truncated path expecting the results to show up as matching paths regardless of time and subdomain

9. User navigates back and forth in the calendar

Requires the ability to request a date range.

10. User wants to see when content of a page has changed

This requires consulting the digest of the captures for a page. This could be done in the CDX-Server if only the captures with a change are needed. Otherwise it is probably best solved by the consumer of the CDX-Server API, for example OpenWayback.
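Returning only the captures with a change can be sketched as collapsing consecutive captures that share a digest. This is an illustration under the assumption that captures arrive as (timestamp, digest) pairs sorted by timestamp; the function name is hypothetical.

```python
from typing import Iterable, List, Tuple


def changed_captures(captures: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep the first capture of each run of identical digests.

    `captures` is assumed to be sorted by timestamp for a single url.
    """
    result = []
    previous_digest = None
    for timestamp, digest in captures:
        if digest != previous_digest:
            result.append((timestamp, digest))
            previous_digest = digest
    return result
```

Note that content which changes and later reverts to an earlier digest is reported as two changes, since only adjacent captures are compared.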

11. User requests/searches for an exact url with a partial timestamp, expecting to get a summary of captures for the url over time
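A partial timestamp can be treated as shorthand for a date range, which ties this use-case to the date range requirement of use-case 9. The sketch below expands a partial timestamp into inclusive lower and upper 14-digit bounds. The function name and the accepted lengths (year, month, day, hour, minute, second) are assumptions.

```python
import calendar
from typing import Tuple

LOWER_TEMPLATE = "00000101000000"  # earliest month, day, hour, minute, second


def timestamp_range(partial: str) -> Tuple[str, str]:
    """Expand e.g. "2015" or "201602" into (from, to) 14-digit timestamps."""
    if not partial.isdigit() or len(partial) not in (4, 6, 8, 10, 12, 14):
        raise ValueError(f"unsupported partial timestamp: {partial!r}")

    lower = partial + LOWER_TEMPLATE[len(partial):]

    if len(partial) == 4:
        upper = partial + "1231235959"
    elif len(partial) == 6:
        # The last day depends on the month and on leap years.
        last_day = calendar.monthrange(int(partial[:4]), int(partial[4:6]))[1]
        upper = partial + f"{last_day:02d}235959"
    else:
        upper = partial + "235959"[len(partial) - 8:]
    return lower, upper
```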

12. Get a random page within a partial timestamp

Possibly add a "go to random page" feature. This could potentially require a lot of searching through the CDX files, since they are sorted first on url and then on timestamp. If the requirement is loosened to getting a random page regardless of time, then it is simple.

13. Get number of snapshots taken for a date range

Used by the calendar view.
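As an illustration of what the calendar view needs, the sketch below counts captures per day within an inclusive date range. It assumes 14-digit timestamps, which compare correctly as plain strings; the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable


def captures_per_day(timestamps: Iterable[str], start: str, end: str) -> Counter:
    """Count captures per day (YYYYMMDD) with start <= timestamp <= end."""
    return Counter(ts[:8] for ts in timestamps if start <= ts <= end)
```

Whether such counts should be computed by the CDX-Server or by its consumer is the same trade-off as for the digest comparison in use-case 10.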

14. Bulk/batch requests

The ability to get large portions of the CDX data for use by processing tools like MapReduce. The data needs to be returned in chunks. It is preferable if the chunks can be requested in parallel from different processing nodes.
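One possible shape for this is page-based access, where a client first asks how many pages exist and then fetches pages independently. The sketch below simulates that with an in-memory list standing in for the index; the function names, the page size parameter and the use of threads in place of separate processing nodes are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def page_count(total_lines: int, page_size: int) -> int:
    """Number of pages needed to cover the index."""
    return (total_lines + page_size - 1) // page_size


def fetch_page(index_lines: Sequence[str], page: int, page_size: int) -> List[str]:
    """Return one chunk; each page can be requested independently."""
    start = page * page_size
    return list(index_lines[start:start + page_size])


def fetch_all(index_lines: Sequence[str], page_size: int, workers: int = 4) -> List[List[str]]:
    """Fetch every page in parallel; results come back in page order."""
    pages = range(page_count(len(index_lines), page_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: fetch_page(index_lines, p, page_size), pages))
```

Because a page is identified only by its number, no state has to be shared between the nodes that request them.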

15. Look up a url with a specific scheme

The current CDX Server seems to strip away the scheme part of the url (i.e. http://example.com -> example.com) when looking up a url. Is there a need to sometimes be more strict? Let's say you have http://example.com/foo.html and ftp://example.com/foo.html with different content. Is this a real-world problem?
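The question can be made concrete with a small sketch of a lookup key that either drops or keeps the scheme. This is not the canonicalization the CDX Server actually performs; the function name and the `keep_scheme` option are hypothetical.

```python
from urllib.parse import urlsplit


def lookup_key(url: str, keep_scheme: bool = False) -> str:
    """Build a simplified lookup key, optionally keeping the url scheme."""
    parts = urlsplit(url)
    key = parts.netloc.lower() + parts.path
    if parts.query:
        key += "?" + parts.query
    return f"{parts.scheme}://{key}" if keep_scheme else key
```

With the scheme dropped, http://example.com/foo.html and ftp://example.com/foo.html produce the same key and cannot be told apart in a lookup. With `keep_scheme=True` they remain distinct.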

16. (W)ARC file management

The following are not use cases for a Wayback machine, but for a system which provides access to raw (W)ARC files for export to researchers. Is this an appropriate use for the CDX server?

a. Identify the (W)ARC files which contain a particular domain/subdomain/full url (additionally: specify date range), returning a count of the relevant domains etc for each (W)ARC file (to be used to determine (W)ARC files for export)

b. Given a (W)ARC file identifier, list the URLs it holds which match a set of criteria (domain/subdomain/date etc) (to be used to export (W)ARC file extracts)

SURTing a CDX file

The Library of Congress has implemented a cdxserver and surt-ordered a cdx file using the following script:

    grep -v -P '^(dns|filedesc)' final_index.cdx | java -jar ia-hadoop-tools-1.0-SNAPSHOT-jar-with-dependencies.jar cdx-convert > surt.cdx
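The grep step of that script drops every line beginning with dns or filedesc before the conversion tool runs. For readers less familiar with grep, a Python sketch of the same filter follows. It covers only the filtering, not the conversion done by cdx-convert; the function name is hypothetical and the sample lines in the usage note are made up for illustration.

```python
import re
from typing import Iterable, List

# Same pattern as in the grep command: lines starting with "dns" or "filedesc".
SKIP_PATTERN = re.compile(r"^(dns|filedesc)")


def drop_dns_and_filedesc(lines: Iterable[str]) -> List[str]:
    """Keep only lines that grep -v -P '^(dns|filedesc)' would keep."""
    return [line for line in lines if not SKIP_PATTERN.match(line)]
```

Given three lines starting with `dns:example.com`, `filedesc://example.warc.gz` and `http://example.com/`, only the last one is kept.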