MohammedElsayyed edited this page Apr 2, 2014 · 19 revisions

The OpenWayback project is under active development, so this documentation may not be up to date.

Configuring Wayback Machine

Configuring CDX

Edit wayback.xml as follows:

Comment out:

   <import resource="BDBCollection.xml"/>
   <property name="collection" ref="localbdbcollection" />

Uncomment:

   <import resource="CDXCollection.xml"/>
   <property name="collection" ref="localcdxcollection" />

Configuring Flat CDX Files

Edit CDXCollection.xml as follows:

   <bean class="org.archive.wayback.resourceindex.cdx.CDXIndex">
         <property name="path" value="/<PATH-TO-CDXFILE>" />
   </bean>

Configuring ZipNumCluster

Note: This configuration is known to work with the source at git commit bc7513ddcc10c11f6989a8c14b07b398010a00f7.

Edit CDXCollection.xml as follows:

    <property name="resourceIndex">
      <bean class="org.archive.wayback.resourceindex.LocalResourceIndex">
        <property name="canonicalizer" ref="waybackCanonicalizer" />
        <property name="source">

        <bean class="org.archive.wayback.resourceindex.ZipNumClusterSearchResultSource">
                <property name="cluster">
                        <bean class="org.archive.format.gzip.zipnum.ZipNumCluster">
                                <property name="summaryFile" value="/<PATH-TO-SUMMARYFILE>"/>
                        </bean>
                </property>
                <property name="params">
                        <bean class="org.archive.format.gzip.zipnum.ZipNumParams"/>
                </property>
        </bean>

        </property>
        <property name="maxRecords" value="100000" />
        <property name="dedupeRecords" value="true" />    
      </bean>
    </property>

Building a Summary File

Each line of the summary file consists of four tab-separated fields:

  1. The first line of each block in the cluster: this itself consists of several whitespace-separated fields, e.g. URL, timestamp, etc.
  2. PartId: the cluster ID, or the name of the compressed CDX file without the .gz extension, because the code appends .gz to the partId. For example, 1.cdx refers to the file 1.cdx.gz.
  3. Offset: the offset of the gzip member
  4. Length: the length of the gzip member
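The four fields above can be illustrated with a short Python sketch. This is not the tool used by OpenWayback to build a cluster; it is only a minimal illustration that assumes each block is written as its own gzip member, with offsets and lengths counted in bytes of the compressed file. The function name `build_part`, the `lines_per_block` parameter and the sample CDX lines are hypothetical.

```python
import gzip
import io


def build_part(cdx_lines, part_id, lines_per_block=3000):
    """Compress sorted CDX lines block by block.

    Returns (compressed_bytes, summary_lines). Each block becomes one
    gzip member; each summary line has the four tab-separated fields:
    first line of the block, partId, member offset, member length.
    """
    out = io.BytesIO()
    summary = []
    for start in range(0, len(cdx_lines), lines_per_block):
        block = cdx_lines[start:start + lines_per_block]
        text = "".join(line + "\n" for line in block)
        member = gzip.compress(text.encode("utf-8"))
        offset = out.tell()  # byte offset where this gzip member starts
        out.write(member)
        summary.append(
            "\t".join([block[0], part_id, str(offset), str(len(member))])
        )
    return out.getvalue(), summary


if __name__ == "__main__":
    # Hypothetical, already-sorted CDX lines (fields shortened for clarity).
    lines = ["a.com/ 20140101000000 ...", "b.com/ 20140101000000 ..."]
    data, summary = build_part(lines, "1.cdx", lines_per_block=1)
    with open("1.cdx.gz", "wb") as f:  # partId plus the .gz extension
        f.write(data)
    with open("summary.txt", "w") as f:
        f.write("\n".join(summary) + "\n")
```

Because every block is a complete gzip member, a reader can seek to the offset, read the given number of bytes, and decompress that block on its own.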

Configuring a remote ResourceStore

Edit the resourceStore property in CDXCollection.xml as follows:

    <property name="resourceStore">
      <bean class="org.archive.wayback.resourcestore.LocationDBResourceStore">
        <property name="db">
          <bean class="org.archive.wayback.resourcestore.locationdb.FlatFileResourceFileLocationDB">
            <property name="path" value="${wayback.basedir}/path-index.txt" />
          </bean>
        </property>
      </bean>
    </property>

Edit wayback.xml as follows:

Comment out:

  <bean id="resourcefilelocationdb" class="org.archive.wayback.resourcestore.locationdb.BDBResourceFileLocationDB">
    <property name="bdbPath" value="${wayback.basedir}/file-db/db/" />
    <property name="bdbName" value="DB1" />
    <property name="logPath" value="${wayback.basedir}/file-db/db.log" />
  </bean>

Uncomment:

  <bean id="resourcefilelocationdb" class="org.archive.wayback.resourcestore.locationdb.FlatFileResourceFileLocationDB">
    <property name="path" value="${wayback.basedir}/path-index.txt" />
  </bean>

Then restart Tomcat to apply the changes:

  invoke-rc.d tomcat6 restart

Building a path-index File

Each line of path-index.txt has two fields: the file name, then its PATH. If there is x.arc.gz under /a, then path-index.txt will be as follows:

x /a/x.arc.gz

Or

x.arc.gz /a/x.arc.gz

The first field must match the file name exactly as it appears in the CDX file (the 9th or 10th field).

If the ARC file is on a remote host, you can write http://(Remote-ResourceStore-Hostname)/ before the PATH (second column).
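As an illustration, the lines of path-index.txt could be generated with a small Python sketch like the one below. It rests on several assumptions not stated on this page: that the two fields are separated by a tab, that the first field is the full file name (the second variant shown above), and that sorting the lines is harmless. The function name `path_index_lines` and the host name archive.example.org are hypothetical.

```python
import posixpath


def path_index_lines(paths, url_prefix=""):
    """Build path-index lines of the form NAME<TAB>PATH.

    paths: absolute paths of ARC files, e.g. "/a/x.arc.gz".
    url_prefix: optional "http://host/" prefix for a remote store.
    """
    lines = []
    for path in paths:
        name = posixpath.basename(path)  # e.g. "x.arc.gz"
        # Drop the prefix's trailing slash so the result has no "//".
        location = url_prefix.rstrip("/") + path
        lines.append(name + "\t" + location)
    return sorted(lines)  # sorted as a precaution (assumption)


if __name__ == "__main__":
    for line in path_index_lines(["/a/x.arc.gz"]):
        print(line)
    # Remote variant with a hypothetical host name:
    for line in path_index_lines(["/a/x.arc.gz"], "http://archive.example.org/"):
        print(line)
```

Redirecting the output to ${wayback.basedir}/path-index.txt would produce a file in the shape described above; check the first field against your CDX file before relying on it.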