How to configure
The OpenWayback project is under active development, so this documentation may not be up to date.
Edit wayback.xml as follows:
Comment out:
<import resource="BDBCollection.xml"/>
<property name="collection" ref="localbdbcollection" />
Uncomment:
<import resource="CDXCollection.xml"/>
<property name="collection" ref="localcdxcollection" />
Edit CDXCollection.xml as follows:
<bean class="org.archive.wayback.resourceindex.cdx.CDXIndex">
  <property name="path" value="/<PATH-TO-CDXFILE>" />
</bean>
Note: This configuration is known to work with git checkout bc7513ddcc10c11f6989a8c14b07b398010a00f7
To use a ZipNum cluster as the index, edit the resourceIndex property in CDXCollection.xml as follows:
<property name="resourceIndex">
  <bean class="org.archive.wayback.resourceindex.LocalResourceIndex">
    <property name="canonicalizer" ref="waybackCanonicalizer" />
    <property name="source">
      <bean class="org.archive.wayback.resourceindex.ZipNumClusterSearchResultSource">
        <property name="cluster">
          <bean class="org.archive.format.gzip.zipnum.ZipNumCluster">
            <property name="summaryFile" value="/<PATH-TO-SUMMARYFILE>"/>
          </bean>
        </property>
        <property name="params">
          <bean class="org.archive.format.gzip.zipnum.ZipNumParams"/>
        </property>
      </bean>
    </property>
    <property name="maxRecords" value="100000" />
    <property name="dedupeRecords" value="true" />
  </bean>
</property>
The summary file consists of 4 tab-separated fields:
- First line of each block in the cluster: this line itself consists of several whitespace-separated fields, e.g. URL, timestamp, etc.
- PartId: the cluster ID, or the name of the compressed CDX file without the .gz extension, because the code appends .gz to the PartId (e.g. 1.cdx for the file 1.cdx.gz)
- Offset: gzip member offset
- Length: gzip member length
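The four fields above are enough to locate and read one block of the cluster. The following is a minimal sketch of that lookup, not OpenWayback's own implementation: the names `SummaryEntry`, `parse_summary_line` and `read_block` are hypothetical, and it assumes that Offset and Length are byte counts and that the part files sit in a single directory.

```python
import gzip
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class SummaryEntry:
    first_line: str  # first CDX line of the block (URL, timestamp, ...)
    part_id: str     # e.g. "1.cdx"; the file on disk is "1.cdx.gz"
    offset: int      # offset of the gzip member (assumed to be in bytes)
    length: int      # length of the gzip member (assumed to be in bytes)


def parse_summary_line(line: str) -> SummaryEntry:
    """Split one summary line into its four tab-separated fields."""
    first_line, part_id, offset, length = line.rstrip("\n").split("\t")
    return SummaryEntry(first_line, part_id, int(offset), int(length))


def read_block(cluster_dir: Path, entry: SummaryEntry) -> List[str]:
    """Return the CDX lines of the block that a summary entry points to."""
    # ".gz" is appended to the PartId to get the real file name.
    part_file = cluster_dir / (entry.part_id + ".gz")
    with open(part_file, "rb") as fh:
        fh.seek(entry.offset)
        member = fh.read(entry.length)  # exactly one gzip member
    return gzip.decompress(member).decode("utf-8").splitlines()
```

Because each block is an independent gzip member, only `length` bytes have to be read and decompressed, whatever the size of the part file.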
Edit the resourceStore property in CDXCollection.xml as follows:
<property name="resourceStore">
  <bean class="org.archive.wayback.resourcestore.LocationDBResourceStore">
    <property name="db">
      <bean class="org.archive.wayback.resourcestore.locationdb.FlatFileResourceFileLocationDB">
        <property name="path" value="${wayback.basedir}/path-index.txt" />
      </bean>
    </property>
  </bean>
</property>
Edit wayback.xml as follows:
Comment out:
<bean id="resourcefilelocationdb" class="org.archive.wayback.resourcestore.locationdb.BDBResourceFileLocationDB">
  <property name="bdbPath" value="${wayback.basedir}/file-db/db/" />
  <property name="bdbName" value="DB1" />
  <property name="logPath" value="${wayback.basedir}/file-db/db.log" />
</bean>
Uncomment:
<bean id="resourcefilelocationdb" class="org.archive.wayback.resourcestore.locationdb.FlatFileResourceFileLocationDB">
  <property name="path" value="${wayback.basedir}/path-index.txt" />
</bean>
Restart Tomcat to apply the changes:
invoke-rc.d tomcat6 restart
Each line of path-index.txt has two fields: the file name and its path. For example, if x.arc.gz is located under /a, then path-index.txt will be as follows:
x /a/x.arc.gz
Or
x.arc.gz /a/x.arc.gz
The first field must match the file name exactly as it appears in the CDX file (9th or 10th field).
If the ARC file is on a remote host, you can write http://(Remote-ResourceStore-Hostname)/ before the PATH (second column).
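For a large collection, path-index.txt can be generated rather than written by hand. The sketch below walks a local directory and emits one line per archive file, using the full file name (e.g. x.arc.gz) as the first field. It rests on several assumptions that are not stated on this page: the two fields are separated by a tab, the lines should be sorted, and files are recognised by their .arc/.warc extensions. The function names `build_path_index` and `write_path_index` are hypothetical.

```python
import os
from typing import List

# Assumed extensions of archive files; adjust to match your collection.
ARCHIVE_SUFFIXES = (".arc.gz", ".warc.gz", ".arc", ".warc")


def build_path_index(archive_dir: str) -> List[str]:
    """Return sorted 'name<TAB>path' lines for archive files under archive_dir."""
    lines = []
    for root, _dirs, files in os.walk(archive_dir):
        for name in files:
            if name.endswith(ARCHIVE_SUFFIXES):
                path = os.path.abspath(os.path.join(root, name))
                # Tab as the field separator is an assumption.
                lines.append(name + "\t" + path)
    # Sorting is a precaution, in case the index is searched as a sorted file.
    return sorted(lines)


def write_path_index(archive_dir: str, index_path: str) -> None:
    """Write the generated lines to index_path, one entry per line."""
    with open(index_path, "w", encoding="utf-8", newline="\n") as out:
        for line in build_path_index(archive_dir):
            out.write(line + "\n")
```

A call such as `write_path_index("/a", "path-index.txt")` would then produce a line for /a/x.arc.gz whose first field is x.arc.gz. Check that this matches the file name recorded in your CDX file before relying on the output.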
Copyright © 2005-2022 [tonazol](http://netpreserve.org/). CC-BY. https://github.com/iipc/openwayback.wiki.git