Harvesting data at the `<html>` mine... Here comes Jason the Miner, a versatile Web scraper for Node.js.
- Composable: via a modular architecture based on pluggable processors. The output of one processor feeds the input of the next one. There are 3 types of processors:
- loaders: to fetch the data (via HTTP requests, by reading text files, etc.)
- parsers: to parse the data (HTML by default) & extract the relevant parts according to a predefined schema
- transformers: to transform and/or output the results (to a CSV file, via email, etc.)
- Configurable: each processor can be chosen & configured independently
- Extensible: you can register your own custom processors
- CLI-friendly: Jason the Miner works well with pipes & redirections
- Promise-based API
- MIT-licensed
npm install -g jason-the-miner
If you don't want to install it, you can use it directly with npx:
npx jason-the-miner -c my-config.json
Clone the project:
$ git clone https://github.com/mawrkus/jason-the-miner.git
$ cd jason-the-miner
$ npm install
$ npm run demos
Then have a look at the demos folder; you'll find examples of scraping:
- Simple GitHub search results (JSON, CSV, Markdown output)
- More complex GitHub search results (including following links & paginating issues)
- Goodreads books, following links to Amazon to grab their product ID
- Google search results for finding mobile apps in various blogs, etc.
- IMDb images gallery links with pagination
- Mixcloud stats, templating them & sending them by mail
- Mixcloud SPA scraping, by controlling a headless Chrome browser with Puppeteer
- Avatars, and how to download them
- A CSV file, to bulk-insert data into Elasticsearch
Scraping the most popular JavaScript scrapers from GitHub:
// github-config.json
{
"load": {
"http": {
"url": "https://github.com/search",
"params": {
"q": "scraper",
"l": "JavaScript",
"type": "Repositories",
"s": "stars",
"o": "desc"
}
}
},
"parse": {
"html": {
"repos": [".repo-list .repo-list-item h3 > a"]
}
},
"transform": {
"json-file": {
"path": "./github-repos.json"
}
}
}
jason-the-miner -c github-config.json
Alternatively, with pipes & redirections:
// github-config.json
{
"parse": {
"html": {
"repos": [".repo-list .repo-list-item h3 > a"]
}
}
}
curl "https://github.com/search?q=scraper&l=JavaScript&type=Repositories&s=stars&o=desc" | jason-the-miner -c github-config.json > github-repos.json
const JasonTheMiner = require('jason-the-miner');
const jason = new JasonTheMiner();
const load = {
http: {
url: "https://github.com/search",
params: {
q: "scraper",
l: "JavaScript",
type: "Repositories",
s: "stars",
o: "desc"
}
}
};
const parse = {
html: {
"repos": [".repo-list .repo-list-item h3 > a"]
}
};
jason.harvest({ load, parse }).then(results => console.log(results));
{
"load": {
"[loader name]": {
// loader options
}
},
"parse": {
"[parser name]": {
// parser options
}
},
"transform": {
"[transformer name]": {
// transformer options
}
}
}
Jason the Miner comes with 4 built-in loaders:
Name | Description | Options |
---|---|---|
`http` | Uses axios as HTTP client | All axios request options + `[_concurrency=1]` (to limit the number of concurrent requests when following/paginating) & `[_cache]` (to cache responses on the file system) |
`file` | Reads the content of a file | `path`, `[stream=false]`, `[encoding="utf8"]` & `[_concurrency=1]` (to limit the number of concurrent requests when paginating) |
`csv-file` | Uses csv-parse to read a CSV file | All csv-parse options in a `csv` object + `path` + `[encoding="utf8"]` |
`stdin` | Reads the content from the standard input | `[encoding="utf8"]` |
For example, an HTTP load config with responses cached in the "tests/http-cache" folder:
...
"load": {
"http": {
"baseURL": "https://github.com",
"url": "/search?l=JavaScript&o=desc&q=scraper&s=stars&type=Repositories",
"_concurrency": 2,
"_cache": {
"_folder": "tests/http-cache"
}
}
}
...
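Or, for instance, a `file` load config that reads a previously saved HTML page from disk (the path below is purely illustrative):
...
"load": {
  "file": {
    "path": "./saved-pages/github-search.html",
    "encoding": "utf8"
  }
}
...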
Check the demos folder for more examples.
Currently, Jason the Miner comes with 2 built-in parsers:
Name | Description | Options |
---|---|---|
`html` | Parses HTML, built with Cheerio | A parse schema |
`csv` | Parses CSV, built with csv-parse | All csv-parse options |
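For instance, a minimal `csv` parse config passes csv-parse options directly (the column handling and delimiter below are assumptions about your input):
...
"parse": {
  "csv": {
    "columns": true,
    "delimiter": ","
  }
}
...
The `html` parser is driven by a parse schema: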
...
"html": {
// Single value
"repo": ".repo-list .repo-list-item h3 > a"
// Collection of values
"repos": [".repo-list .repo-list-item h3 > a"]
// Single object
"repo": {
"name": ".repo-list .repo-list-item h3 > a",
"description": ".repo-list .repo-list-item div:first-child"
}
// Single object, providing a root selector _$
"repo": {
"_$": ".repo-list .repo-list-item",
"name": "h3 > a",
"description": "div:first-child"
}
// Collection of objects
"repos": [{
"_$": ".repo-list .repo-list-item",
"name": "h3 > a",
"description": "div:first-child"
}]
// Following
"repos": [{
"_$": ".repo-list .repo-list-item",
"name": "h3 > a",
"description": "div:first-child",
"_follow": {
"_link": "h3 > a",
"stats": {
"_$": ".pagehead-actions",
"watchers": "li:nth-child(1) a.social-count",
"stars": "li:nth-child(2) a.social-count",
"forks": "li:nth-child(3) a.social-count"
}
}
}]
// Paginating
"repos": [{
"_$": ".repo-list .repo-list-item",
"name": "h3 > a",
"description": "div:first-child",
"_paginate": {
"_link": ".pagination > a[rel='next']",
"_depth": 1
}
}]
}
...
Full flavour
...
"html": {
"title": "title | trim",
"metas": {
"lang": "html < attr(lang)",
"content-type": "meta[http-equiv='Content-Type'] < attr(content)"
},
"stylesheets": ["link[rel='stylesheet'] < attr(href)"],
"repos": [{
"_$": ".repo-list .repo-list-item ? text(crawler)",
"_slice": "0,3",
"name": "h3 > a",
"last-update": "relative-time < attr(datetime)",
"_follow": {
"_link": "h3 > a",
"description": "meta[property='og:description'] < attr(content) | trim",
"url": "link[rel='canonical'] < attr(href)",
"stats": {
"_$": ".pagehead-actions",
"watchers": "li:nth-child(1) a.social-count | trim",
"stars": "li:nth-child(2) a.social-count | trim",
"forks": "li:nth-child(3) a.social-count | trim"
},
"_follow": {
"_link": ".js-repo-nav span[itemprop='itemListElement']:nth-child(2) > a",
"open-issues": [{
"_$": ".js-navigation-container li > div > div:nth-child(3)",
"desc": "a:first-child | trim",
"opened": "relative-time < attr(datetime)"
}],
"_paginate": {
"_link": "a[rel='next']",
"_slice": "0,1",
"_depth": 2
}
}
}
}]
}
...
As you can see, a schema is a plain object that recursively defines:
- the names of the values/collection of values that you want to extract: "title" (single value), "metas" (object), "stylesheets" (collection of values), "repos" (collection of objects)
- how to extract them: `[selector] ? [matcher] < [extractor] | [filter]` (check "Parse helpers" below)
Additional instructions can be passed to the parser:
- `_$` acts as a root selector: further parsing will happen in the context of the element identified by this selector
- `_slice` limits the number of elements to parse, like `Array.prototype.slice(begin[, end])`
- `_follow` tells Jason to follow a single link (fetch new data) & to continue scraping after the new data is received
- `_paginate` tells Jason to paginate (fetch & scrape new data) & to merge the new values into the current context; multiple links can be selected here to scrape multiple pages in parallel
The following syntax specifies how to extract a value:
[property name]: [selector] ? [matcher] < [extractor] | [filter]
For instance:
...
"repos": [".repo-list-item h3 > a ? text(crawler) < attr(title) | trim"]
...
This will extract a "repos" array of values from the links identified by the ".repo-list-item h3 > a" selector, matching only the ones containing the text "crawler". The values will be retrieved from the "title" attribute of each link and will be trimmed.
Matchers:
- `text(regexString)`
- `html(regexString)`
- `attr(attributeName,regexString)`
- `slice(begin,end)`
They are used to test an element in order to decide whether to include/discard it from parsing. If not specified, Jason includes every element.
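For example, the `attr` matcher can keep only the links whose "href" attribute starts with "https" (the selector and property name here are illustrative, not from the demos):
...
"external-links": ["a ? attr(href,^https) < attr(href)"]
...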
Extractors:
- `text([optionalStaticText])` (by default)
- `html()`
- `attr(attributeName)`
- `regex(regexString)`
- `date(inputFormat,outputFormat)` (parses a date with moment)
- `uuid()` (generates a uuid v1 with uuid)
- `count()` (counts the number of elements matching the selector, needs an array schema definition)
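For instance, combining the `uuid()` and `attr()` extractors (the selectors are borrowed from the GitHub examples above; the "id" property is purely illustrative):
...
"repos": [{
  "_$": ".repo-list .repo-list-item",
  "id": "< uuid()",
  "name": "h3 > a",
  "last-update": "relative-time < attr(datetime)"
}]
...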
Filters:
- `trim`
- `single-space`
- `lowercase`
- `uppercase`
- `json-parse` (to parse JSON, like JSON-LD)
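For instance, the `json-parse` filter can turn embedded JSON-LD into a structured value (the property name is illustrative; depending on the page and the extractor implementation, `html()` or `text()` grabs the raw script content):
...
"json-ld": "script[type='application/ld+json'] < html() | json-parse"
...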
Jason the Miner comes with 5 built-in transformers:
Name | Description | Options |
---|---|---|
`stdout` | Writes the results to stdout | `[encoding="utf8"]` |
`json-file` | Writes the results to a JSON file | `path` & `[encoding="utf8"]` |
`csv-file` | Writes the results to a CSV file using csv-stringify | `csv`: same as csv-stringify options + `path`, `[encoding='utf8']` and `[append=false]` (whether to append the results to an existing file or not) |
`download-file` | Downloads files to a given folder using axios | `[baseURL]`, `[parseKey]`, `[folder='.']`, `[namePattern='{name}']`, `[maxSizeInMb=1]` & `[concurrency=1]` |
`email` | Sends the results by email using nodemailer | Same as nodemailer, split between the `smtp` and `message` options |
Jason supports a single transformer or an array of transformers:
{
...
"transform": [{
"json-file": {
"path": "./github-repos.json"
}
}, {
"csv-file": {
"path": "./github-repos.csv"
}
}]
}
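For example, a sketch of a `download-file` transform for saving scraped avatar images (the folder is illustrative, and `parseKey` is assumed to name the parsed collection holding the file URLs):
...
"transform": {
  "download-file": {
    "parseKey": "avatars",
    "folder": "./downloads",
    "concurrency": 4
  }
}
...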
Scraping parameters can be defined in a CSV file (loaded via the `bulk` config) and applied to configure the other processors:
{
"bulk": {
"csv-file": {
"path": "./github-search-queries.csv",
"csv": {
"columns": true,
"delimiter": ","
}
}
},
"load": {
"http": {
"baseURL": "https://github.com",
"url": "/search?l={language}&o=desc&q={query}&s=stars&type=Repositories",
"_concurrency": 2
}
},
"parse": {
"html": {
"title": "< text(Best {language} repos)",
"repos": [".repo-list .repo-list-item h3 > a"]
}
},
"transform": {
"json-file": {
"path": "./github-repos-{language}.json"
}
}
}
`github-search-queries.csv`:
language,query
JavaScript,scraper
Python,scraper
`fallbacks` defines which processor to use when not explicitly configured (or missing in the config file):
- `load`: 'identity'
- `parse`: 'identity'
- `transform`: 'identity'
- `bulk`: null
The fallbacks change when using the CLI (see `bin/jason-the-miner.js`):
- `load`: 'stdin'
- `parse`: 'html'
- `transform`: 'stdout'
- `bulk`: null
Loads a config from a JSON or JS file.
jason.loadConfig('./harvest-me.json');
Launches the harvesting process:
jason
.loadConfig('./config.json')
.then(() => jason.harvest())
.catch(error => console.error(error));
You can pass custom options to temporarily override the current config:
jason
.loadConfig('./config.json')
.then(() => jason.harvest({
load: {
http: {
url: "https://github.com/search?q=scraper&l=Python&type=Repositories"
}
}
}))
.catch(error => console.error(error));
To permanently override the current config, you can modify Jason's `config` property:
const allResults = [];
jason
.loadConfig('./harvest-me.json')
.then(() => jason.harvest())
.then((results) => {
allResults.push(results);
jason.config.load.http.url = 'https://github.com/search?q=scraper&l=Python&type=Repositories';
return jason.harvest();
})
.then((results) => {
allResults.push(results);
})
.catch(error => console.error(error));
Registers a parse helper in one of the 3 categories: `match`, `extract` or `filter`. `helper` must be a function.
const url = require('url');
jason.registerHelper({
category: 'filter',
name: 'remove-query-params',
helper: (href = '') => {
if (!href || href === '#') {
return href;
}
const { protocol, host, pathname } = url.parse(href);
return `${protocol}//${host}${pathname}`;
}
});
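Once registered, the helper can be referenced in a parse schema by its name, like any built-in filter (the selector below is borrowed from the GitHub examples above):
...
"url": "link[rel='canonical'] < attr(href) | remove-query-params"
...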
Registers a new processor in one of the 3 categories: `load`, `parse` or `transform`. `processor` must be a class implementing the `run()` method:
jason.registerProcessor({
category: 'transform',
name: 'template',
processor: Templater
});
class Templater {
constructor({ config }) {
// automatically receives its config
}
/**
* @param {*} results
* @return {Promise.<*>}
*/
run({ results }) {
// must be implemented & must return a promise.
}
}
jason.config.transform = {
template: {
"templatePath": "my-template.tpl",
"outputPath": "my-page.html"
}
};
Be aware that loaders must also implement the `getConfig()` and `buildLoadOptions({ link })` methods.
Have a look at the source code for more info.
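For illustration only, a minimal custom loader skeleton might look like this (the method bodies are assumptions based on the interface described above, not the actual implementation):
class MyLoader {
  constructor({ config }) {
    // automatically receives its config
    this._config = config;
  }
  getConfig() {
    return this._config;
  }
  buildLoadOptions({ link }) {
    // assumed: builds the options used to load a followed/paginated link
    return { ...this._config, url: link };
  }
  run() {
    // must return a promise resolving to the data to be parsed
    return Promise.resolve('<html></html>');
  }
}
jason.registerProcessor({
  category: 'load',
  name: 'my-loader',
  processor: MyLoader
});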
$ git clone https://github.com/mawrkus/jason-the-miner.git
$ cd jason-the-miner
$ npm install
$ npm run test
- Web Scraping With Node.js: https://www.smashingmagazine.com/2015/04/web-scraping-with-nodejs/
- X-ray, The next web scraper. See through the noise: https://github.com/lapwinglabs/x-ray
- Simple, lightweight & expressive web scraping with Node.js: https://github.com/eeshi/node-scrapy
- Node.js Scraping Libraries: http://blog.webkid.io/nodejs-scraping-libraries/
- https://www.scrapesentry.com/scraping-wiki/web-scraping-legal-or-illegal/
- http://blog.icreon.us/web-scraping-and-you-a-legal-primer-for-one-of-its-most-useful-tools/
- Web scraping o rastreo de webs y legalidad: https://www.youtube.com/watch?v=EJzugD0l0Bw
- Scraper API blog: https://www.scraperapi.com/blog/
Please take these guidelines into consideration when scraping:
- The content being scraped is not copyright protected.
- The act of scraping does not burden the services of the site being scraped.
- The scraper does not violate the Terms of Use of the site being scraped.
- The scraper does not gather sensitive user information.
- The scraped content adheres to fair use standards.