Reorganize README for readability
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
newsch committed Apr 28, 2024
1 parent df26010 commit aa4bc75
Showing 1 changed file with 74 additions and 82 deletions: README.md

_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._

Extracted articles are identified by Wikipedia article titles in url or text form (language-specific), and [Wikidata QIDs](https://www.wikidata.org/wiki/Wikidata:Glossary#QID) (language-agnostic).
OpenStreetMap (OSM) commonly stores these as [`wikipedia*=`](https://wiki.openstreetmap.org/wiki/Key:wikipedia) and [`wikidata=`](https://wiki.openstreetmap.org/wiki/Key:wikidata) tags on objects.

## Configuring

[`article_processing_config.json`](article_processing_config.json) is _compiled with the program_ and should be updated when adding a new language.
It defines article sections that are not important for users and should be removed from the extracted HTML.
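
For example, to see at a glance how the config is organized, you can inspect it with `jq` (assuming `jq` is installed; any JSON tool works):
```sh
jq 'keys' article_processing_config.json
```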

There are some tests for basic validation of the file; run them with `cargo test`.

## Usage

> [!NOTE]
> In production, wikiparser is run with the maps generator, which is somewhat involved to set up. See [Usage with Maps Generator](#usage-with-maps-generator) for more info.

To run the wikiparser for development and testing, see below.

First, install [the Rust language tools](https://www.rust-lang.org/).

> [!IMPORTANT]
> For best performance, use `-r`/`--release` with `cargo build`/`run`.

You can run the program from within this directory using `cargo run --release --`.

Alternatively, build it with `cargo build --release`, which places the binary in `target/release/om-wikiparser`.
Run the program with the `--help` flag to see all supported arguments.

```
$ cargo run -- --help
A set of tools to extract articles from Wikipedia Enterprise HTML dumps selected by OpenStreetMap tags.
Usage: om-wikiparser <COMMAND>

...

Options:
  ...
      Print version
```

> [!NOTE]
> Each subcommand has additional help.

The main work is done in the `get-articles` subcommand.
It takes as inputs:
- A [Wikipedia Enterprise JSON dump](#downloading-wikipedia-dumps), decompressed and connected to `stdin`.
- A directory to write the extracted articles to, as a CLI argument.
- Any number of filters for the articles:
  - Use `--osm-tags` if you have an [OSM .pbf file](#downloading-openstreetmap-osm-files) and can use the `get-tags` subcommand or the `osmconvert` tool.
  - Use `--wikidata-qids` or `--wikipedia-urls` if you have a group of urls or QIDs from another source (see the sketch below).
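
If the QIDs or URLs come from another source, the filter files are plain text with one entry per line. A minimal sketch; the file names, QIDs, and URL are made-up examples:
```sh
# One Wikidata QID per line.
printf 'Q90\nQ64\n' > my_qids.txt
# One Wikipedia article url per line.
printf 'https://en.wikipedia.org/wiki/Paris\n' > my_urls.txt
# Extract the matching articles from a decompressed dump on stdin.
tar xzOf enwiki-NS0-20230701-ENTERPRISE-HTML.json.tar.gz | \
  cargo run -r -- get-articles --wikidata-qids my_qids.txt --wikipedia-urls my_urls.txt articles/
```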

To test a single language in a specific map region, first get the matching tags for the region with `get-tags`:
```sh
cargo run -r -- get-tags $REGION_EXTRACT.pbf > region-tags.tsv
```
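
If you already have `osmconvert` installed, an equivalent TSV (with `wikidata` and `wikipedia` columns) can be produced without the wikiparser; a sketch using the flags from the `--osm-tags` help text:
```sh
osmconvert $REGION_EXTRACT.pbf --csv-headline --csv='wikidata wikipedia' > region-tags.tsv
```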

Then write the articles to a directory with `get-articles`:
```sh
tar xzOf $dump | cargo run -r -- get-articles --osm-tags region-tags.tsv $OUTPUT_DIR
```
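
To spot-check the result, list a few of the files that were written (no particular output layout is assumed here):
```sh
find $OUTPUT_DIR -type f | head
```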

## Downloading OpenStreetMap (OSM) files

To extract Wikipedia tags with the `get-tags` subcommand, you need a file in the [OSM `.pbf` format](https://wiki.openstreetmap.org/wiki/PBF_Format).

The "planet" file is [available directly from OSM](https://wiki.openstreetmap.org/wiki/Planet.osm) but is ~80GB in size; for testing you can [try a smaller region's data (called "Extracts") from one of the many providers](https://wiki.openstreetmap.org/wiki/Planet.osm#Extracts).

## Downloading Wikipedia Dumps

[Enterprise HTML dumps, updated twice a month, are publicly accessible](https://dumps.wikimedia.org/other/enterprise_html/).

> [!WARNING]
> Each language's dump is tens of gigabytes in size, and much larger when decompressed.
> To avoid storing the decompressed data, pipe it directly into the wikiparser as described in [Usage](#usage).

To test a small number of articles, you can also use the [On-Demand API](https://enterprise.wikimedia.com/docs/on-demand/) to download them, which has a free tier.

Wikimedia requests no more than 2 concurrent downloads, which the included [`download.sh`](./download.sh) script respects:
> If you are reading this on Wikimedia servers, please note that we have rate limited downloaders and we are capping the number of per-ip connections to 2.
> This will help to ensure that everyone can access the files with reasonable download times.
> Clients that try to evade these limits may be blocked.
> Our mirror sites do not have this cap.
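
As an illustration of staying within that limit when downloading by hand, something like the following caps itself at two parallel transfers (the run date and URL layout are assumptions based on the dump file names below):
```sh
# Fetch three languages while keeping at most 2 downloads in flight.
printf '%s\n' en de es | xargs -P 2 -I{} \
  wget -c "https://dumps.wikimedia.org/other/enterprise_html/runs/20230701/{}wiki-NS0-20230701-ENTERPRISE-HTML.json.tar.gz"
```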
See [the list of available mirrors](https://dumps.wikimedia.org/mirrors.html) for other options. Note that most of them do not include the enterprise dumps; check to see that the `other/enterprise_html/runs/` path includes subdirectories with files. The following two mirrors are known to include the enterprise html dumps as of August 2023:
- (US) https://dumps.wikimedia.your.org
- (Sweden) https://mirror.accum.se/mirror/wikimedia.org

For the wikiparser you'll want the ["NS0"](https://en.wikipedia.org/wiki/Wikipedia:Namespace) "ENTERPRISE-HTML" `.json.tar.gz` files.

They are gzipped tar files containing a single file of newline-delimited JSON matching the [Wikimedia Enterprise API schema](https://enterprise.wikimedia.com/docs/data-dictionary/).
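
For example, to peek at the top-level fields of the first record without unpacking the whole archive (assuming `jq` is installed):
```sh
tar xzOf enwiki-NS0-20230701-ENTERPRISE-HTML.json.tar.gz | head -n 1 | jq 'keys'
```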

The included [`download.sh`](./download.sh) script handles downloading the latest set of dumps in specific languages.
It maintains a directory with the following layout:
```
<DUMP_DIR>/
├── latest -> 20230701/
├── 20230701/
│   ├── dewiki-NS0-20230701-ENTERPRISE-HTML.json.tar.gz
│   ├── enwiki-NS0-20230701-ENTERPRISE-HTML.json.tar.gz
│   ├── eswiki-NS0-20230701-ENTERPRISE-HTML.json.tar.gz
│   ...
├── 20230620/
│   ├── dewiki-NS0-20230620-ENTERPRISE-HTML.json.tar.gz
│   ├── enwiki-NS0-20230620-ENTERPRISE-HTML.json.tar.gz
│   ├── eswiki-NS0-20230620-ENTERPRISE-HTML.json.tar.gz
│   ...
...
```
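
The `latest` symlink makes it easy to find the newest dumps without hard-coding a date; for example, assuming the directory above is `$DUMP_DIR`:
```sh
# List the newest dump of every downloaded language.
for dump in "$DUMP_DIR"/latest/*-ENTERPRISE-HTML.json.tar.gz; do
  echo "$dump"
done
```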

## Usage with Maps Generator

To use with the [maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md), see the [`run.sh` script](run.sh) and its own help documentation.
It handles extracting the tags, using multiple dumps, and re-running to convert titles to QIDs and extract them across languages.

As an example of manual usage with the maps generator:
- Assuming this program is installed to `$PATH` as `om-wikiparser`.
- Download [the dumps in the desired languages](https://dumps.wikimedia.org/other/enterprise_html/runs/) (Use the files with the format `${LANG}wiki-NS0-${DATE}-ENTERPRISE-HTML.json.tar.gz`).
  Set `DUMP_DOWNLOAD_DIR` to the location they are downloaded.
With the `wikidata_qids.txt` and `wikipedia_urls.txt` filter files prepared, run the extraction:

```sh
export RUST_LOG=om_wikiparser=debug
# Begin extraction.
for dump in $DUMP_DOWNLOAD_DIR/*-ENTERPRISE-HTML.json.tar.gz
do
tar xzOf $dump | om-wikiparser get-articles \
--wikidata-qids wikidata_qids.txt \
--wikipedia-urls wikipedia_urls.txt \
--write-new-qids new_qids.txt \
descriptions/
done
# Extract discovered QIDs.
for dump in $DUMP_DOWNLOAD_DIR/*-ENTERPRISE-HTML.json.tar.gz
do
tar xzOf $dump | om-wikiparser get-articles \
--wikidata-qids new_qids.txt \
descriptions/
done
```
Alternatively, extract the tags to filter by directly from a planet file:

```sh
om-wikiparser get-tags planet-latest.osm.pbf > osm_tags.tsv
# Begin extraction.
for dump in $DUMP_DOWNLOAD_DIR/*-ENTERPRISE-HTML.json.tar.gz
do
tar xzOf $dump | om-wikiparser get-articles \
--osm-tags osm_tags.tsv \
--write-new-qids new_qids.txt \
descriptions/
done
# Extract discovered QIDs.
for dump in $DUMP_DOWNLOAD_DIR/*-ENTERPRISE-HTML.json.tar.gz
do
tar xzOf $dump | om-wikiparser get-articles \
--wikidata-qids new_qids.txt \
descriptions/
done
```
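
After both passes, `descriptions/` holds the extracted articles and `new_qids.txt` the QIDs that were discovered by title; a quick sanity check of the example above:
```sh
wc -l osm_tags.tsv new_qids.txt
find descriptions/ -type f | wc -l
```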
