Skip to content
forked from hyl/node-warc

Parse And Create Web ARChive (WARC) files with node.js

License

Notifications You must be signed in to change notification settings

mirrorweb/node-warc

 
 

Repository files navigation

node-warc

Parse Web Archive (WARC) files or create WARC files using Electron or chrome-remote-interface

Run npm install node-warc or yarn add node-warc to ge started

npm Package

API

Full API documentation available at n0tan3rd.github.io/node-warc

Example usage

Example 1: Both .warc and .warc.gz

const AutoWARCParser = require('node-warc')

const parser = new AutoWARCParser('<path-to-warcfile>')
parser.on('record', record => { console.log(record) })
parser.on('done', finalRecord => { console.log(finalRecord) })
parser.on('error', error => { console.error(error) })
parser.start()

Example 2: Only .warc.gz

const WARCGzParser = require('node-warc').WARCGzParser

const parser = new WARCGzParser('<path-to-warcfile>')
parser.on('record', record => { console.log(record) })
parser.on('done', finalRecord => { console.log(finalRecord) })
parser.on('error', error => { console.error(error) })
parser.start()

Example 3: Only .warc

const WARCParser = require('node-warc').WARCParser

const parser = new WARCParser('<path-to-warcfile>')
parser.on('record', record => { console.log(record) })
parser.on('done', finalRecord => { console.log(finalRecord) })
parser.on('error', error => { console.error(error) })
parser.start()

Benchmark

UN-GZIPPED

  • 145.9MB (8,026 records) took 2s. Max node process usage 22 MiB
  • 268MB (852 records) took 2s. Max node process usage 77 MiB
  • 2GB (76,980 records) took 21s. Max node process usage 100 MiB
  • 4.8GB (185,662 records) took 1m. Max node process usage 144.3 MiB

GZIPPED

  • 7.7MB (1,269 records) took 297ms. Max node process memory usage 7.1 MiB
  • 819.1MB (34,253 records) took 16s. Max node process memory usage 190.3 MiB
  • 2.3GB (68,020 records) took 45s. Max node process memory usage 197.6 MiB
  • 5.3GB (269,464 records) took 4m. Max node process memory usage 198.2 MiB

JavaScript Style Guide

Environment

  • NODEWARC_WRITE_GZIPPED - writes gzipped records to WARC outputs.

About

Parse And Create Web ARChive (WARC) files with node.js

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • JavaScript 100.0%