Metadata parsing and routing for the Prior Art Archive
The file parser is deployed as an Elastic Beanstalk application on AWS (called FileParser). This platform was chosen because it lets us deploy with docker images and auto-scale behind a load balancer.
This means the code in this repository gets built as a docker image and pushed to Docker Hub at https://hub.docker.com/r/priorartarchive/file-parser. Deploying to AWS then amounts to uploading a Dockerrun.aws.json configuration file (described in the Elastic Beanstalk documentation) that tells Elastic Beanstalk to pull priorartarchive/file-parser (along with a sibling container from logicalspark/docker-tikaserver).
To get a Dockerrun.aws.json to upload to Elastic Beanstalk, copy Dockerrun.aws.sample.json and fill in the environment variables:
- HOSTNAME is either priorartarchive.org or dev.priorartarchive.org.
- IPFS_HOST is the DNS address of an https IPFS API route (e.g. if you can curl https://your.host/api/v0/id, then IPFS_HOST=your.host). For now, we use api.underlay.store for both dev and prod.
- DATABASE_URL is the fully-qualified postgres URI (including the username:password@ at the beginning).
- AWS_REGION is us-east-1.
- AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY must belong to credentials that have the AmazonS3FullAccess, AWSLambdaExecute, and AWSLambdaRole permission policies.
- CONFIGURATION_ID is the name of the S3 notification handler that is generating the events. The handlers on both the assets.priorartarchive.org and assets.dev.priorartarchive.org buckets are named NewFile.
In addition, edit the "image": "priorartarchive/priorart-file-parser" line to include the tag of the docker image that you want to use: for now there's only a dev tag, but there will be a prod tag once v2 goes live.
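The copy-and-fill step above could be scripted. The following is a minimal sketch, not the repository's own tooling: it assumes the sample file follows the multi-container Dockerrun.aws.json layout (a containerDefinitions list whose entries carry an image string and an environment list of name/value pairs), which is not confirmed here. The function name fill_dockerrun and the image_prefix parameter are hypothetical.

```python
import copy

# Variable names taken from the list above.
REQUIRED_VARS = (
    "HOSTNAME", "IPFS_HOST", "DATABASE_URL", "AWS_REGION",
    "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "CONFIGURATION_ID",
)
ALLOWED_HOSTNAMES = ("priorartarchive.org", "dev.priorartarchive.org")


def fill_dockerrun(sample, env, tag, image_prefix="priorartarchive/"):
    """Return a copy of `sample` with the env vars and image tag filled in.

    Assumes (hypothetically) the multi-container layout:
    {"containerDefinitions": [{"image": ..., "environment": [...]}, ...]}
    """
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))
    if env["HOSTNAME"] not in ALLOWED_HOSTNAMES:
        raise ValueError("unexpected HOSTNAME: " + env["HOSTNAME"])

    config = copy.deepcopy(sample)  # never mutate the caller's sample
    for container in config.get("containerDefinitions", []):
        image = container.get("image", "")
        if not image.startswith(image_prefix):
            continue  # leave the sibling Tika container untouched
        # Drop any tag already present, then pin the requested one.
        container["image"] = image.split(":", 1)[0] + ":" + tag
        # Merge by name so unrelated variables in the sample survive.
        merged = {e["name"]: e["value"] for e in container.get("environment", [])}
        merged.update({name: env[name] for name in REQUIRED_VARS})
        container["environment"] = [
            {"name": name, "value": value} for name, value in merged.items()
        ]
    return config
```

In use, the sample would be read with json.load, passed through fill_dockerrun together with a dict of the seven values and a tag such as "dev", and the result written out as Dockerrun.aws.json with json.dump. Failing early on a missing variable or an unexpected HOSTNAME is a design choice of this sketch, meant to catch mistakes before the upload.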
The file parser requires access to a remote IPFS node via HTTP API. This is dev-api.underlay.store and api.underlay.store for the dev and prod deployments, respectively.
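Before deploying, it can help to confirm that the chosen IPFS host actually answers on its API route. The sketch below mirrors the curl https://your.host/api/v0/id check using only the Python standard library. The function names are hypothetical, and the method parameter is an assumption: the curl check is a plain GET, but recent IPFS daemons accept only POST on the HTTP API, so the verb may need changing depending on what sits behind the host.

```python
import json
from urllib.parse import urlunsplit
from urllib.request import Request, urlopen


def ipfs_id_url(ipfs_host):
    """Build the https URL of the `id` endpoint for a given IPFS_HOST."""
    return urlunsplit(("https", ipfs_host, "/api/v0/id", "", ""))


def check_ipfs(ipfs_host, method="GET", timeout=10):
    """Call the `id` endpoint and return the decoded JSON response.

    Raises urllib.error.URLError (or HTTPError) if the host is unreachable
    or rejects the request; pass method="POST" for daemons that require it.
    """
    request = Request(ipfs_id_url(ipfs_host), method=method)
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling check_ipfs with the value intended for IPFS_HOST should return the node's identity document if the route is usable; any exception suggests the host or the HTTP verb needs another look.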
Local changes should be committed to the dev branch. When you're ready to deploy a dev version, build a local image and push it to the Docker Hub repo:
docker build -t priorartarchive/file-parser:dev .
docker push priorartarchive/file-parser:dev
Then head over to Elastic Beanstalk and upload a Dockerrun.aws.json file (containing URIs for the development database and elasticsearch, and referencing the priorartarchive/priorart-file-parser:dev image) to the file-parser-dev environment of the FileParser application.
Changes to master should only come via pull requests from dev. When you're ready to deploy a prod version, build a local image and push it to the Docker Hub repo:
docker build -t priorartarchive/file-parser:prod .
docker push priorartarchive/file-parser:prod
Then head over to Elastic Beanstalk and upload a Dockerrun.aws.json file (containing URIs for the production database and elasticsearch, and referencing the priorartarchive/file-parser:prod image) to the tika-server-env environment of the TikaServer application.
/Dockerrun.aws.sample.json
- The -spawnChild flag is described in the Tika server documentation: "This starts tika-server in a child process, and if there's an OOM, a timeout or other catastrophic problem with the child process, the parent process will kill and/or restart the child process."
- The -JXmx1g flag sets the max heap for the spawned child process at 1GB.
- The -JXms256m flag sets the initial heap for the spawned child process at 256MB.
In static/ there are two JSON-LD documents: tika-reference.json (aka dweb:/ipfs/bafybeib7p2ibhncu736bewg4jw7o2j4msl72xgrd2ducrqwg5leugasx5u) and tika-provenance.json (aka dweb:/ipfs/bafybeiej4oe7qb5jhighp74mmy3st7fakznynjv62lti762bf4xqcdhmxq). These contain "background" knowledge about Tika that is referenced in the provenance of the assertions we generate.
Specifically, we attribute the resulting transcript and metadata documents to dweb:/ipfs/bafybeib7p2ibhncu736bewg4jw7o2j4msl72xgrd2ducrqwg5leugasx5u#_:c14n74 (the prov:SoftwareAgent that is the Tika software application), with the prov:qualifiedAssociation that the software agent had a prov:Role of dweb:/ipfs/bafybeib7p2ibhncu736bewg4jw7o2j4msl72xgrd2ducrqwg5leugasx5u#_:c14n5 (for metadata) or dweb:/ipfs/bafybeib7p2ibhncu736bewg4jw7o2j4msl72xgrd2ducrqwg5leugasx5u#_:c14n54 (for text extraction). These "roles" correspond to REST API endpoints that are structured as schema.org EntryPoints and derived from the HTML API docs that the Tika server serves from GET "/" by default. These URIs are admittedly, frighteningly unwieldy: in the future you'll be able to paste them into the Underlay Playground to get explorable visualizations (both from the source document and from subsequent published references). These sorts of references are a low-level representation that should rarely be seen; it's our job to build better tools for referencing them.
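Since every one of these references is the tika-reference.json address plus a blank-node fragment, they can be assembled rather than copied by hand. This is a small illustrative sketch: the CID and the c14n labels are the ones quoted above, while the helper names node_uri and tika_attribution and the "metadata"/"text" keys are hypothetical and not taken from the repository.

```python
# CID of tika-reference.json, as quoted in this README.
TIKA_REFERENCE_CID = "bafybeib7p2ibhncu736bewg4jw7o2j4msl72xgrd2ducrqwg5leugasx5u"

# Canonicalized blank-node labels inside tika-reference.json.
AGENT_LABEL = "c14n74"  # the prov:SoftwareAgent for Tika
ROLE_LABELS = {
    "metadata": "c14n5",  # prov:Role for the metadata endpoint
    "text": "c14n54",     # prov:Role for the text extraction endpoint
}


def node_uri(cid, label):
    """Address a blank node inside an IPFS-hosted document."""
    return "dweb:/ipfs/{}#_:{}".format(cid, label)


def tika_attribution(kind):
    """Return the agent and role URIs for a "metadata" or "text" assertion."""
    return {
        "agent": node_uri(TIKA_REFERENCE_CID, AGENT_LABEL),
        "role": node_uri(TIKA_REFERENCE_CID, ROLE_LABELS[kind]),
    }
```

Keeping the CID in a single constant means that, should tika-reference.json ever have to change, only one value needs updating for all three URIs.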
dweb:/ipfs/bafybeib7p2ibhncu736bewg4jw7o2j4msl72xgrd2ducrqwg5leugasx5u (aka tika-reference.json) is pinned to the cluster and should be considered stable, to be changed only when absolutely necessary. tika-provenance.json contains provenance about tika-reference.json (via explicit reference to dweb:/ipfs/bafybeib7p2ibhncu736bewg4jw7o2j4msl72xgrd2ducrqwg5leugasx5u as a digital document), citing the HTML API reference (that Tika itself generates!) as its source. In the (near) future we should sign this document (with some public KFG key) and publish it as well, but it's not necessary to get the Prior Art Archive working (unlike tika-reference.json, whose hash we need to use in our assertions).
tika-context.json is copied from and documented at this Gist.