From aa4bc75168349754643694c423d99836463938da Mon Sep 17 00:00:00 2001
From: Evan Lloyd New-Schmidt
Date: Fri, 15 Mar 2024 12:40:44 -0400
Subject: [PATCH] Reorganize README for readability

Signed-off-by: Evan Lloyd New-Schmidt
---
 README.md | 156 ++++++++++++++++++++++++++----------------------------
 1 file changed, 74 insertions(+), 82 deletions(-)

diff --git a/README.md b/README.md
index 1184142..3b0adfd 100644
--- a/README.md
+++ b/README.md
@@ -3,59 +3,25 @@
 _Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._
 
 Extracted articles are identified by Wikipedia article titles in url or text form (language-specific), and [Wikidata QIDs](https://www.wikidata.org/wiki/Wikidata:Glossary#QID) (language-agnostic).
-OpenStreetMap commonly stores these as [`wikipedia*=`](https://wiki.openstreetmap.org/wiki/Key:wikipedia) and [`wikidata=`](https://wiki.openstreetmap.org/wiki/Key:wikidata) tags on objects.
+OpenStreetMap (OSM) commonly stores these as [`wikipedia*=`](https://wiki.openstreetmap.org/wiki/Key:wikipedia) and [`wikidata=`](https://wiki.openstreetmap.org/wiki/Key:wikidata) tags on objects.
 
 ## Configuring
 
-[`article_processing_config.json`](article_processing_config.json) should be updated when adding a new language.
+[`article_processing_config.json`](article_processing_config.json) is _compiled with the program_ and should be updated when adding a new language.
 It defines article sections that are not important for users and should be removed from the extracted HTML.
-
-## Downloading Dumps
-
-[Enterprise HTML dumps, updated twice a month, are publicly accessible](https://dumps.wikimedia.org/other/enterprise_html/). Please note that each language's dump is tens of gigabytes in size.
-
-Wikimedia requests no more than 2 concurrent downloads, which the included [`download.sh`](./download.sh) script respects:
-> If you are reading this on Wikimedia servers, please note that we have rate limited downloaders and we are capping the number of per-ip connections to 2.
-> This will help to ensure that everyone can access the files with reasonable download times.
-> Clients that try to evade these limits may be blocked.
-> Our mirror sites do not have this cap.
-
-See [the list of available mirrors](https://dumps.wikimedia.org/mirrors.html) for other options. Note that most of them do not include the enterprise dumps; check to see that the `other/enterprise_html/runs/` path includes subdirectories with files. The following two mirrors are known to include the enterprise html dumps as of August 2023:
-- (US) https://dumps.wikimedia.your.org
-- (Sweden) https://mirror.accum.se/mirror/wikimedia.org
-
-For the wikiparser you'll want the ["NS0"](https://en.wikipedia.org/wiki/Wikipedia:Namespace) "ENTERPRISE-HTML" `.json.tar.gz` files.
-
-They are gzipped tar files containing a single file of newline-delimited JSON matching the [Wikimedia Enterprise API schema](https://enterprise.wikimedia.com/docs/data-dictionary/).
-
-The included [`download.sh`](./download.sh) script handles downloading the latest set of dumps in specific languages.
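+
+For a concrete picture of what this means, a hypothetical entry could look roughly like the following (illustration only — the real file and its tests define the exact schema):
+```jsonc
+// Illustrative sketch, not the actual schema: each language lists section titles to strip.
+{
+  "sections_to_remove": {
+    "en": ["References", "External links"],
+    "de": ["Weblinks", "Einzelnachweise"]
+  }
+}
+```
+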
-It maintains a directory with the following layout:
-```
-/
-├── latest -> 20230701/
-├── 20230701/
-│   ├── dewiki-NS0-20230701-ENTERPRISE-HTML.json.tar.gz
-│   ├── enwiki-NS0-20230701-ENTERPRISE-HTML.json.tar.gz
-│   ├── eswiki-NS0-20230701-ENTERPRISE-HTML.json.tar.gz
-│   ...
-├── 20230620/
-│   ├── dewiki-NS0-20230620-ENTERPRISE-HTML.json.tar.gz
-│   ├── enwiki-NS0-20230620-ENTERPRISE-HTML.json.tar.gz
-│   ├── eswiki-NS0-20230620-ENTERPRISE-HTML.json.tar.gz
-│   ...
-...
-```
+There are some tests for basic validation of the file; run them with `cargo test`.
 
 ## Usage
 
-To use with the map generator, see the [`run.sh` script](run.sh) and its own help documentation.
-It handles extracting the tags, using multiple dumps, and re-running to convert titles to QIDs and extract them across languages.
+> [!NOTE]
+> In production, wikiparser is run with the maps generator, which is somewhat involved to set up. See [Usage with Maps Generator](#usage-with-maps-generator) for more info.
 
-To run the wikiparser manually or for development, see below.
+To run the wikiparser for development and testing, see below.
 
 First, install [the rust language tools](https://www.rust-lang.org/)
 
-For best performance, use `--release` when building or running.
+> [!IMPORTANT]
+> For best performance, use `-r`/`--release` with `cargo build`/`run`.
 
 You can run the program from within this directory using `cargo run --release --`.
 
@@ -64,7 +30,7 @@ Alternatively, build it with `cargo build --release`, which places the binary in
 Run the program with the `--help` flag to see all supported arguments.
 
 ```
-$ cargo run --release -- --help
+$ cargo run -- --help
 A set of tools to extract articles from Wikipedia Enterprise HTML dumps selected by OpenStreetMap tags.
 
 Usage: om-wikiparser
@@ -84,55 +50,81 @@ Options:
   -h, --help
           Print help (see a summary with '-h')
 
   -V, --version
           Print version
 ```
 
-Each command has its own additional help:
+> [!NOTE]
+> Each subcommand has additional help.
 
+The main work is done in the `get-articles` subcommand.
+It takes as inputs:
+- A [Wikipedia Enterprise JSON dump](#downloading-wikipedia-dumps), decompressed and connected to `stdin`.
+- A directory to write the extracted articles to, as a CLI argument.
+- Any number of filters for the articles:
+  - Use `--osm-tags` if you have an [OSM .pbf file](#downloading-openstreetmap-osm-files) and can use the `get-tags` subcommand or the `osmconvert` tool.
+  - Use `--wikidata-qids` or `--wikipedia-urls` if you have a group of urls or QIDs from another source.
+
+To test a single language in a specific map region, first get the matching tags for the region with `get-tags`:
+```sh
+cargo run -r -- get-tags $REGION_EXTRACT.pbf > region-tags.tsv
 ```
-$ cargo run -- get-articles --help
-Extract, filter, and simplify article HTML from Wikipedia Enterprise HTML dumps.
+
+Then write the articles to a directory with `get-articles`:
+```sh
+tar xzOf $dump | cargo run -r -- get-articles --osm-tags region-tags.tsv $OUTPUT_DIR
+```
 
-Expects an uncompressed dump (newline-delimited JSON) connected to stdin.
+## Downloading OpenStreetMap (OSM) files
 
-Usage: om-wikiparser get-articles [OPTIONS]
+To extract Wikipedia tags with the `get-tags` subcommand, you need a file in the [OSM `.pbf` format](https://wiki.openstreetmap.org/wiki/PBF_Format).
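+
+For example, one way to get a small file for testing is to download a regional extract from one of the providers mentioned below and feed it to `get-tags` (the provider and region here are only an illustration):
+```sh
+# Illustrative only: fetch a tiny regional extract, then pull its wikidata/wikipedia tags.
+curl -L -O https://download.geofabrik.de/europe/monaco-latest.osm.pbf
+cargo run -r -- get-tags monaco-latest.osm.pbf > region-tags.tsv
+```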
 
-Arguments:
-          Directory to write the extracted articles to
+The "planet" file is [available directly from OSM](https://wiki.openstreetmap.org/wiki/Planet.osm) but is ~80GB in size; for testing you can [try a smaller region's data (called "Extracts") from one of the many providers](https://wiki.openstreetmap.org/wiki/Planet.osm#Extracts).
 
-Options:
-      --write-new-qids
-          Append to the provided file path the QIDs of articles matched by title but not QID.
+## Downloading Wikipedia Dumps
 
-          Use this to save the QIDs of articles you know the url of, but not the QID. The same path can later be passed to the `--wikidata-qids` option to extract them from another language's dump. Writes are atomicly appended to the file, so the same path may be used by multiple concurrent instances.
+[Enterprise HTML dumps, updated twice a month, are publicly accessible](https://dumps.wikimedia.org/other/enterprise_html/).
 
-      --no-simplify
-          Don't process extracted HTML; write the original text to disk
+> [!WARNING]
+> Each language's dump is tens of gigabytes in size, and much larger when decompressed.
+> To avoid storing the decompressed data, pipe it directly into the wikiparser as described in [Usage](#usage).
 
-  -h, --help
-          Print help (see a summary with '-h')
+To test a small number of articles, you can also use the [On-Demand API](https://enterprise.wikimedia.com/docs/on-demand/) to download them, which has a free tier.
+
+Wikimedia requests no more than 2 concurrent downloads, which the included [`download.sh`](./download.sh) script respects:
+> If you are reading this on Wikimedia servers, please note that we have rate limited downloaders and we are capping the number of per-ip connections to 2.
+> This will help to ensure that everyone can access the files with reasonable download times.
+> Clients that try to evade these limits may be blocked.
+> Our mirror sites do not have this cap.
 
-FILTERS:
-      --osm-tags
-          Path to a TSV file that contains one or more of `wikidata`, `wikipedia` columns.
+See [the list of available mirrors](https://dumps.wikimedia.org/mirrors.html) for other options. Note that most of them do not include the enterprise dumps; check to see that the `other/enterprise_html/runs/` path includes subdirectories with files. The following two mirrors are known to include the enterprise html dumps as of August 2023:
+- (US) https://dumps.wikimedia.your.org
+- (Sweden) https://mirror.accum.se/mirror/wikimedia.org
 
-          This can be generated with the `get-tags` command or `osmconvert --csv-headline --csv 'wikidata wikipedia'`.
+For the wikiparser you'll want the ["NS0"](https://en.wikipedia.org/wiki/Wikipedia:Namespace) "ENTERPRISE-HTML" `.json.tar.gz` files.
 
-      --wikidata-qids
-          Path to file that contains a Wikidata QID to extract on each line (e.g. `Q12345`)
+They are gzipped tar files containing a single file of newline-delimited JSON matching the [Wikimedia Enterprise API schema](https://enterprise.wikimedia.com/docs/data-dictionary/).
 
-      --wikipedia-urls
-          Path to file that contains a Wikipedia article url to extract on each line (e.g. `https://lang.wikipedia.org/wiki/Article_Title`)
+The included [`download.sh`](./download.sh) script handles downloading the latest set of dumps in specific languages.
+It maintains a directory with the following layout:
+```
+/
+├── latest -> 20230701/
+├── 20230701/
+│   ├── dewiki-NS0-20230701-ENTERPRISE-HTML.json.tar.gz
+│   ├── enwiki-NS0-20230701-ENTERPRISE-HTML.json.tar.gz
+│   ├── eswiki-NS0-20230701-ENTERPRISE-HTML.json.tar.gz
+│   ...
+├── 20230620/
+│   ├── dewiki-NS0-20230620-ENTERPRISE-HTML.json.tar.gz
+│   ├── enwiki-NS0-20230620-ENTERPRISE-HTML.json.tar.gz
+│   ├── eswiki-NS0-20230620-ENTERPRISE-HTML.json.tar.gz
+│   ...
+...
 ```
 
-It takes as inputs:
-- A wikidata enterprise JSON dump, extracted and connected to `stdin`.
-- A directory to write the extracted articles to, as a CLI argument.
-- Any number of filters passed:
-  - A TSV file of wikidata qids and wikipedia urls, created by the `get-tags` command or `osmconvert`, passed as the CLI flag `--osm-tags`.
-  - A file of Wikidata QIDs to extract, one per line (e.g. `Q12345`), passed as the CLI flag `--wikidata-ids`.
-  - A file of Wikipedia article titles to extract, one per line (e.g. `https://$LANG.wikipedia.org/wiki/$ARTICLE_TITLE`), passed as a CLI flag `--wikipedia-urls`.
+## Usage with Maps Generator
+
+To use with the [maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md), see the [`run.sh` script](run.sh) and its own help documentation.
+It handles extracting the tags, using multiple dumps, and re-running to convert titles to QIDs and extract them across languages.
 
-As an example of manual usage with the map generator:
+As an example of manual usage with the maps generator:
 - Assuming this program is installed to `$PATH` as `om-wikiparser`.
 - Download [the dumps in the desired languages](https://dumps.wikimedia.org/other/enterprise_html/runs/) (Use the files with the format `${LANG}wiki-NS0-${DATE}-ENTERPRISE-HTML.json.tar.gz`). Set `DUMP_DOWNLOAD_DIR` to the location they are downloaded.
@@ -150,8 +142,8 @@ export RUST_LOG=om_wikiparser=debug
 # Begin extraction.
 for dump in $DUMP_DOWNLOAD_DIR/*-ENTERPRISE-HTML.json.tar.gz
 do
-  tar xzf $dump | om-wikiparser get-articles \
-    --wikidata-ids wikidata_qids.txt \
+  tar xzOf $dump | om-wikiparser get-articles \
+    --wikidata-qids wikidata_qids.txt \
     --wikipedia-urls wikipedia_urls.txt \
     --write-new-qids new_qids.txt \
     descriptions/
@@ -159,8 +151,8 @@ done
 # Extract discovered QIDs.
 for dump in $DUMP_DOWNLOAD_DIR/*-ENTERPRISE-HTML.json.tar.gz
 do
-  tar xzf $dump | om-wikiparser get-articles \
-    --wikidata-ids new_qids.txt \
+  tar xzOf $dump | om-wikiparser get-articles \
+    --wikidata-qids new_qids.txt \
     descriptions/
 done
 ```
@@ -172,7 +164,7 @@ om-wikiparser get-tags planet-latest.osm.pbf > osm_tags.tsv
 # Begin extraction.
 for dump in $DUMP_DOWNLOAD_DIR/*-ENTERPRISE-HTML.json.tar.gz
 do
-  tar xzf $dump | om-wikiparser get-articles \
+  tar xzOf $dump | om-wikiparser get-articles \
     --osm-tags osm_tags.tsv \
     --write-new-qids new_qids.txt \
     descriptions/
@@ -180,8 +172,8 @@ done
 # Extract discovered QIDs.
 for dump in $DUMP_DOWNLOAD_DIR/*-ENTERPRISE-HTML.json.tar.gz
 do
-  tar xzf $dump | om-wikiparser get-articles \
-    --wikidata-ids new_qids.txt \
+  tar xzOf $dump | om-wikiparser get-articles \
+    --wikidata-qids new_qids.txt \
     descriptions/
 done
 ```