How to configure
This document outlines the core configuration elements needed to get OpenWayback up and running, including building your CDX index.
For advanced options, see Advanced configuration.
As described in How to install, OpenWayback typically runs inside a Tomcat web server deployed on a Linux or Unix system. The following documentation assumes this is the case and that the reader is generally familiar with both Tomcat and the bash command shell commonly found on Linux and Unix operating systems.
OpenWayback uses a Spring XML configuration file. This file is called wayback.xml and can be found in the WEB-INF folder of the webapp (typically, this would be $CATALINA_HOME/webapps/ROOT/WEB-INF/wayback.xml).
The file contains multiple configuration options, with various parts commented out.
The default configuration that comes with OpenWayback uses a Berkeley DB (BDB) database to store information about where to find your ARC and/or WARC files and an index of their content. It is also configured to automatically populate these indexes.
To get started with OpenWayback you only need to edit a few of the properties specified right near the top of the WEB-INF/wayback.xml file:
<bean class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
  <property name="properties">
    <value>
      <!-- Customize these basic placeholders. -->
      wayback.basedir.default=/tmp/openwayback
      wayback.url.scheme.default=http
      wayback.url.host.default=localhost
      wayback.url.port.default=8080
      <!-- Environment variables (if present) override defaults. No need to customize these. -->
      wayback.basedir=#{ systemEnvironment['WAYBACK_BASEDIR'] ?: '${wayback.basedir.default}' }
      wayback.url.scheme=#{ systemEnvironment['WAYBACK_URL_SCHEME'] ?: '${wayback.url.scheme.default}' }
      wayback.url.host=#{ systemEnvironment['WAYBACK_URL_HOST'] ?: '${wayback.url.host.default}' }
      wayback.url.port=#{ systemEnvironment['WAYBACK_URL_PORT'] ?: '${wayback.url.port.default}' }
      <!-- No need to edit this setting unless deploying in a non-ROOT context
           or using a load balancer, in which case configure it with the frontend URL prefix. -->
      wayback.url.prefix.default=${wayback.url.scheme}://${wayback.url.host}:${wayback.url.port}
      <!-- Environment variable (if present) overrides default. No need to customize this. -->
      wayback.url.prefix=#{ systemEnvironment['WAYBACK_URL_PREFIX'] ?: '${wayback.url.prefix.default}' }
      <!-- Customize or add additional placeholders if needed to use elsewhere.
           Check BDBCollection.xml for following placeholders being used. -->
      wayback.archivedir.1=${wayback.basedir}/files1/
      wayback.archivedir.2=${wayback.basedir}/files2/
    </value>
  </property>
</bean>
- Directories: These are settings that direct the BDB indexing.
  - wayback.basedir.default should point at a directory where OpenWayback can store its internal state and keep temporary files.
  - wayback.archivedir.1 and wayback.archivedir.2 point at two directories where OpenWayback can find ARC and/or WARC files. By default they are relative to the basedir, but you may specify any fully qualified path. Do note that it may take OpenWayback a very long time to process a large collection of ARC/WARC files. For larger collections, you should read on about how to use CDX indexes. If you need to specify more than two directories, you will need to edit the BDBCollection.xml configuration file. Also, by changing <property name="recurse" value="false" /> to true in the BDBCollection.xml file, you can make OpenWayback search these directories recursively for WARC files.
- Web access: These are more general settings telling OpenWayback how the web server hosting it is configured.
  - wayback.url.scheme.default: Whether access to the OpenWayback instance is via http or https. For https you will also need to configure the Tomcat web server accordingly. By default http is selected and this is usually the right choice.
  - wayback.url.host.default: The host name used to access OpenWayback. Typically the host name of the server running OpenWayback. If you leave it at the default, localhost, OpenWayback will only behave normally if accessed from the machine hosting it.
  - wayback.url.port.default: The port on which Tomcat is listening. By default this is 8080, which is also the default port for Tomcat.
  - wayback.url.prefix.default: Assembles the scheme, host name and port into a URL prefix. You do not need to edit this setting unless you are deploying OpenWayback in a non-ROOT context or using a load balancer, in which case you configure it with the frontend URL prefix.
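The placeholder block in wayback.xml reads the environment variables WAYBACK_BASEDIR, WAYBACK_URL_SCHEME, WAYBACK_URL_HOST, WAYBACK_URL_PORT and WAYBACK_URL_PREFIX before falling back to the defaults, so the same settings can be supplied without editing the file. The following is a minimal sketch that assumes Tomcat is started through catalina.sh, which sources $CATALINA_BASE/bin/setenv.sh if that file exists. The directory and host name are placeholders chosen for illustration only.

```shell
# Sketch of $CATALINA_BASE/bin/setenv.sh (sourced by catalina.sh at startup).
# All values are illustrative placeholders; adjust them to your own deployment.
export WAYBACK_BASEDIR=/srv/openwayback
export WAYBACK_URL_SCHEME=http
export WAYBACK_URL_HOST=wayback.example.org
export WAYBACK_URL_PORT=8080
```

The variables must be present in the environment of the Tomcat process, so Tomcat has to be restarted after changing them.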
Setting these correctly is enough to get a small collection up and running in OpenWayback. When you restart Tomcat, OpenWayback will automatically index all the ARC and WARC files under wayback.archivedir.1 and wayback.archivedir.2 (this can take some time!). You can then view them in a browser at http://localhost:8080/wayback (or whatever you configured the scheme, host name and port to be).
This approach is only suitable for very small collections. For larger collections, we recommend the use of CDX indexes.
A CDX file is a simple text file which contains a list of all the URLs in your collections, one URL per line.
A CDX index is a sorted CDX file.
OpenWayback ships with several command line tools in addition to the WAR file. If you untar the OpenWayback distribution into $WAYBACK_HOME, these can all be found under $WAYBACK_HOME/bin.
To generate a CDX file of the contents of a single ARC or WARC file, you simply invoke:
$WAYBACK_HOME/bin/cdx-indexer <ARCHIVE-FILE> <CDX-FILE>
Here $WAYBACK_HOME refers to the directory created when you untarred the distribution tarball as discussed in the installation instructions.
This will generate a CDX called CDX-FILE for the contents of ARCHIVE-FILE.
The cdx-indexer does have a few options for configuring the exact nature of the CDX file. However, the default options are typically correct if you are using the cdx-indexer from the same OpenWayback distribution as the WAR web application.
Each invocation generates an unsorted CDX file for one archive. You will need to generate a CDX file for each and every ARC and WARC file you have, either manually or (and this is recommended) via a script. This process is largely limited by I/O, and running multiple cdx-indexer instances in parallel can be useful, especially if the CDXs are spread over many physical HDDs. The exact number of instances to use depends entirely on your hardware.
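One possible way to script this is sketched below. It assumes GNU findutils (find -print0 and xargs -0 -r -P) and that all archives are gzip-compressed and live under one directory. The function name index_all and the variables CDX_INDEXER, INDEXER and OUT_DIR are made up for this example and are not part of OpenWayback. Only the two-argument cdx-indexer invocation shown above is used.

```shell
# Sketch: build one CDX file per ARC/WARC file, running several indexers at once.
# index_all, CDX_INDEXER, INDEXER and OUT_DIR are hypothetical names for this example.
index_all() {
  local archive_dir="$1" cdx_dir="$2" jobs="${3:-4}"
  # Default to the indexer shipped with the distribution; override via CDX_INDEXER.
  local indexer="${CDX_INDEXER:-$WAYBACK_HOME/bin/cdx-indexer}"
  mkdir -p "$cdx_dir" || return 1
  # NUL-delimited names survive spaces; -P runs up to $jobs indexers in parallel,
  # and -r skips the run entirely when no archives are found.
  find "$archive_dir" -type f \( -name '*.arc.gz' -o -name '*.warc.gz' \) -print0 |
    INDEXER="$indexer" OUT_DIR="$cdx_dir" xargs -0 -r -n 1 -P "$jobs" \
      sh -c '"$INDEXER" "$1" "$OUT_DIR/$(basename "$1").cdx"' _
}
```

For example, index_all /data/archives /data/cdx 4 would write /data/cdx/NAME.warc.gz.cdx for each archive found, four at a time (both paths are placeholders). Because each output is named after its archive, this relies on archive filenames being unique.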
Then you need to merge and sort the resultant files to get a CDX index file. As you'll see, OpenWayback can handle multiple CDX index files. However, we do recommend merging them until each file is at least 10 GB assuming the filesystem allows large files. There is no limit to the size of CDX index files other than those imposed by the filesystem being used.
To merge and sort CDX files, the sort command found on most *nix systems is usually used.
IMPORTANT If you are using the sort command, you must set the environment variable LC_ALL=C. This makes sort compare lines byte by byte, which matches how OpenWayback expects CDX indexes to be sorted.
This is done by executing the following (or including it in your scripts):
export LC_ALL=C;
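As a minimal sketch of the merge-and-sort step, the function below combines any number of unsorted CDX files into a single sorted index. The name merge_cdx is made up for this example. It relies only on the standard sort options -o (output file) and on LC_ALL=C being set for the sort process.

```shell
# Sketch: merge any number of unsorted CDX files into one sorted CDX index.
# merge_cdx is a hypothetical helper name used only in this example.
merge_cdx() {
  local out="$1"
  shift
  # LC_ALL=C forces plain byte-order comparison; -o names the output file.
  LC_ALL=C sort -o "$out" "$@"
}
```

A call such as merge_cdx /data/index.cdx /data/cdx/*.cdx (placeholder paths) would then produce one index from all per-archive CDX files. For very large inputs, sort needs enough temporary disk space to hold intermediate runs.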
Once you've built your CDX files, sorted and merged them into a small number of CDX index files, you need to configure OpenWayback to use them.
In the wayback.xml configuration file, you'll find a block that looks like the following:
<import resource="BDBCollection.xml"/>
<!--
<import resource="CDXCollection.xml"/>
<import resource="RemoteCollection.xml"/>
<import resource="NutchCollection.xml"/>
-->
Start by removing or commenting out the BDBCollection.xml line and uncommenting the CDXCollection.xml line.
For example:
<import resource="CDXCollection.xml"/>
<!--
<import resource="BDBCollection.xml"/>
<import resource="RemoteCollection.xml"/>
<import resource="NutchCollection.xml"/>
-->
You then need to configure your access point to use the CDX collection. Further down the file you'll find:
<property name="collection" ref="localbdbcollection" />
<!--
<property name="collection" ref="localcdxcollection" />
-->
Simply comment out or remove the reference to localbdbcollection and uncomment the reference to localcdxcollection.
OpenWayback will now expect to find a single CDX index at ${wayback.basedir}/cdx-index/index.cdx. If you only have one CDX index file you can simply leave it at that.
Alternatively, you can edit the CDXCollection.xml configuration file. There you will need to remove the simple CDXIndex resource index and enable the CompositeSearchResultSource that is commented out.
<bean class="org.archive.wayback.resourceindex.CompositeSearchResultSource">
  <property name="CDXSources">
    <list>
      <value>${wayback.basedir}/cdx-index/index-1.cdx</value>
      <value>${wayback.basedir}/cdx-index/index-2.cdx</value>
    </list>
  </property>
</bean>
As you'll note, you can specify multiple CDX indexes using it. The value line of the list can be repeated as often as is needed.
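Whether one index file or several are configured, each of them has to be sorted in LC_ALL=C order. The sketch below is one way to check this before restarting Tomcat. The function name check_cdx_sorted is made up for this example. It uses the standard sort -c option, which checks order without writing any output file.

```shell
# Sketch: report every CDX index file that is not in LC_ALL=C sort order.
# check_cdx_sorted is a hypothetical helper name used only in this example.
# Returns non-zero if at least one file fails the check.
check_cdx_sorted() {
  local status=0 f
  for f in "$@"; do
    # sort -c exits non-zero on the first out-of-order line; its own message is discarded.
    if ! LC_ALL=C sort -c "$f" 2>/dev/null; then
      echo "not sorted: $f"
      status=1
    fi
  done
  return "$status"
}
```

It could be run against every file listed in CDXSources, for example check_cdx_sorted /path/to/index-1.cdx /path/to/index-2.cdx (placeholder paths).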
CDX files only contain the name of the ARC and/or WARC file that contains the URL. OpenWayback uses what are called ResourceFileLocationDB objects to resolve ARC and WARC filenames to actual locations.
There are several different implementations of this available in OpenWayback (such as the BDBResourceFileLocationDB that is used by default). But when using a CDX index, we recommend that the FlatFileResourceFileLocationDB be used.
There is a default configuration for this provided near the top of the wayback.xml configuration file. Simply remove or comment out the BDBResourceFileLocationDB and then uncomment the following:
<bean id="resourcefilelocationdb" class="org.archive.wayback.resourcestore.locationdb.FlatFileResourceFileLocationDB">
  <property name="path" value="${wayback.basedir}/path-index.txt" />
</bean>
As you can see, it reuses the wayback.basedir setting and expects to find one file under that path, path-index.txt.
This file simply contains a sorted list of all your ARC and WARC filenames, each followed by a tab character and then the full path (including the filename again). This path can be either a filesystem path or a URL. You are allowed to mix filesystem paths and URLs in the same path-index file.
Do note that OpenWayback assumes that the ARC/WARC filenames are unique.
Example of how this file might look:
ARCNAME001.arc.gz	/data/ARCS01/ARCNAME001.arc.gz
ARCNAME002.arc.gz	/data/ARCS02/ARCNAME002.arc.gz
WARCNAME001.warc.gz	http://example.com/WARCS01/WARCNAME001.warc.gz
WARCNAME002.warc.gz	http://example.com/WARCS02/WARCNAME002.warc.gz
The first field should be as it is in the CDX index (9th or 10th field).
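Since replay depends on every filename in the CDX index having an entry in the path index, a quick cross-check can be useful. The following is a sketch only. The function name missing_archives is made up, the CDX file is assumed to be space-separated, header lines are assumed to start with " CDX", and the column that holds the filename is passed in as a parameter because it depends on the CDX format in use.

```shell
# Sketch: list archive filenames that appear in a CDX index but not in the path index.
# missing_archives is a hypothetical helper name; field is the 1-based CDX column
# that holds the archive filename.
missing_archives() {
  local cdx="$1" path_index="$2" field="$3"
  awk -v f="$field" '
    # First file (path index): remember the name before the first tab.
    FILENAME == ARGV[1] { split($0, parts, "\t"); known[parts[1]] = 1; next }
    # Assumption: CDX header lines start with " CDX"; skip them.
    /^ CDX/ { next }
    # Second file (CDX index): report each unknown archive name once.
    !($f in known) && !seen[$f]++ { print $f }
  ' "$path_index" "$cdx"
}
```

For example, missing_archives index.cdx path-index.txt 10 would print nothing if every archive named in the tenth column has a path-index entry.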
OpenWayback does not provide any special tools for generating this file. It should, however, be fairly straightforward to do so with *nix tools.
The following script is an example of how this might be accomplished assuming all ARC/WARC files share a root directory:
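#!/bin/bash
# Build a path-index file: "<filename><TAB><full path>" for every ARC/WARC file.
# Arguments: ARCHIVE_BASE_DIR TARGET_FILE
# Pass an absolute ARCHIVE_BASE_DIR so that the recorded paths are absolute too.
ARCHIVE_BASE_DIR=$1
TARGET_FILE=$2
tempfile="$TARGET_FILE.tmp"
# Start from an empty temp file so leftovers from an earlier run are not kept
: > "$tempfile"
# Find all ARC/WARC files; NUL-delimited so unusual filenames are handled safely
while IFS= read -r -d '' file; do
  archive=$(basename "$file")
  printf '%s\t%s\n' "$archive" "$file" >> "$tempfile"
done < <(find "$ARCHIVE_BASE_DIR" -type f -regex '.*\.w?arc\.gz$' -print0)
# Now sort the file
export LC_ALL=C
sort "$tempfile" > "$TARGET_FILE"
rm "$tempfile"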
Copyright © 2005-2022 [tonazol](http://netpreserve.org/). CC-BY. https://github.com/iipc/openwayback.wiki.git