Reorganize README for readability
Signed-off-by: Evan Lloyd New-Schmidt <evan@new-schmidt.com>
newsch committed Apr 28, 2024
1 parent df26010 commit aa4bc75
Showing 1 changed file with 74 additions and 82 deletions: README.md

_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._

Extracted articles are identified by Wikipedia article titles in url or text form (language-specific), and [Wikidata QIDs](https://www.wikidata.org/wiki/Wikidata:Glossary#QID) (language-agnostic).
OpenStreetMap (OSM) commonly stores these as [`wikipedia*=`](https://wiki.openstreetmap.org/wiki/Key:wikipedia) and [`wikidata=`](https://wiki.openstreetmap.org/wiki/Key:wikidata) tags on objects.

## Configuring

[`article_processing_config.json`](article_processing_config.json) is _compiled with the program_ and should be updated when adding a new language.
It defines article sections that are not important for users and should be removed from the extracted HTML.
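
For example, to see at a glance how the config is organized, you can inspect it with `jq` (assuming `jq` is installed; any JSON tool works):
```sh
jq 'keys' article_processing_config.json
```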

There are some tests for basic validation of the file; run them with `cargo test`.

## Usage

> [!NOTE]
> In production, wikiparser is run with the maps generator, which is somewhat involved to set up. See [Usage with Maps Generator](#usage-with-maps-generator) for more info.

To run the wikiparser for development and testing, see below.

First, install [the Rust language tools](https://www.rust-lang.org/).

> [!IMPORTANT]
> For best performance, use `-r`/`--release` with `cargo build`/`run`.

You can run the program from within this directory using `cargo run --release --`.

Alternatively, build it with `cargo build --release`, which places the binary in `target/release/om-wikiparser`.
Run the program with the `--help` flag to see all supported arguments.

```
$ cargo run -- --help
A set of tools to extract articles from Wikipedia Enterprise HTML dumps selected by OpenStreetMap tags.
Usage: om-wikiparser <COMMAND>

...

Options:
  ...
      Print version
```

> [!NOTE]
> Each subcommand has additional help.

The main work is done in the `get-articles` subcommand.
It takes as inputs:
- A [Wikipedia Enterprise JSON dump](#downloading-wikipedia-dumps), decompressed and connected to `stdin`.
- A directory to write the extracted articles to, as a CLI argument.
- Any number of filters for the articles:
  - Use `--osm-tags` if you have an [OSM .pbf file](#downloading-openstreetmap-osm-files) and can use the `get-tags` subcommand or the `osmconvert` tool.
  - Use `--wikidata-qids` or `--wikipedia-urls` if you have a group of urls or QIDs from another source (see the sketch below).
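
If the QIDs or URLs come from another source, the filter files are plain text with one entry per line. A minimal sketch; the file names, QIDs, and URL are made-up examples:
```sh
# One Wikidata QID per line.
printf 'Q90\nQ64\n' > my_qids.txt
# One Wikipedia article url per line.
printf 'https://en.wikipedia.org/wiki/Paris\n' > my_urls.txt
# Extract the matching articles from a decompressed dump on stdin.
tar xzOf enwiki-NS0-20230701-ENTERPRISE-HTML.json.tar.gz | \
  cargo run -r -- get-articles --wikidata-qids my_qids.txt --wikipedia-urls my_urls.txt articles/
```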

To test a single language in a specific map region, first get the matching tags for the region with `get-tags`:
```sh
cargo run -r -- get-tags $REGION_EXTRACT.pbf > region-tags.tsv
```
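
If you already have `osmconvert` installed, an equivalent TSV (with `wikidata` and `wikipedia` columns) can be produced without the wikiparser; a sketch using the flags from the `--osm-tags` help text:
```sh
osmconvert $REGION_EXTRACT.pbf --csv-headline --csv='wikidata wikipedia' > region-tags.tsv
```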

Then write the articles to a directory with `get-articles`:
```sh
tar xzOf $dump | cargo run -r -- get-articles --osm-tags region-tags.tsv $OUTPUT_DIR
```
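
To spot-check the result, list a few of the files that were written (no particular output layout is assumed here):
```sh
find $OUTPUT_DIR -type f | head
```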

## Downloading OpenStreetMap (OSM) files

To extract Wikipedia tags with the `get-tags` subcommand, you need a file in the [OSM `.pbf` format](https://wiki.openstreetmap.org/wiki/PBF_Format).

The "planet" file is [available directly from OSM](https://wiki.openstreetmap.org/wiki/Planet.osm) but is ~80GB in size; for testing you can [try a smaller region's data (called "Extracts") from one of the many providers](https://wiki.openstreetmap.org/wiki/Planet.osm#Extracts).

## Downloading Wikipedia Dumps

[Enterprise HTML dumps, updated twice a month, are publicly accessible](https://dumps.wikimedia.org/other/enterprise_html/).

> [!WARNING]
> Each language's dump is tens of gigabytes in size, and much larger when decompressed.
> To avoid storing the decompressed data, pipe it directly into the wikiparser as described in [Usage](#usage).

To test a small number of articles, you can also use the [On-Demand API](https://enterprise.wikimedia.com/docs/on-demand/) to download them, which has a free tier.

Wikimedia requests no more than 2 concurrent downloads, which the included [`download.sh`](./download.sh) script respects:
> If you are reading this on Wikimedia servers, please note that we have rate limited downloaders and we are capping the number of per-ip connections to 2.
> This will help to ensure that everyone can access the files with reasonable download times.
> Clients that try to evade these limits may be blocked.
> Our mirror sites do not have this cap.
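
As an illustration of staying within that limit when downloading by hand, something like the following caps itself at two parallel transfers (the run date and URL layout are assumptions based on the dump file names below):
```sh
# Fetch three languages while keeping at most 2 downloads in flight.
printf '%s\n' en de es | xargs -P 2 -I{} \
  wget -c "https://dumps.wikimedia.org/other/enterprise_html/runs/20230701/{}wiki-NS0-20230701-ENTERPRISE-HTML.json.tar.gz"
```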
See [the list of available mirrors](https://dumps.wikimedia.org/mirrors.html) for other options. Note that most of them do not include the enterprise dumps; check to see that the `other/enterprise_html/runs/` path includes subdirectories with files. The following two mirrors are known to include the enterprise html dumps as of August 2023:
- (US) https://dumps.wikimedia.your.org
- (Sweden) https://mirror.accum.se/mirror/wikimedia.org

For the wikiparser you'll want the ["NS0"](https://en.wikipedia.org/wiki/Wikipedia:Namespace) "ENTERPRISE-HTML" `.json.tar.gz` files.

They are gzipped tar files containing a single file of newline-delimited JSON matching the [Wikimedia Enterprise API schema](https://enterprise.wikimedia.com/docs/data-dictionary/).
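
For example, to peek at the top-level fields of the first record without unpacking the whole archive (assuming `jq` is installed):
```sh
tar xzOf enwiki-NS0-20230701-ENTERPRISE-HTML.json.tar.gz | head -n 1 | jq 'keys'
```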

The included [`download.sh`](./download.sh) script handles downloading the latest set of dumps in specific languages.
It maintains a directory with the following layout:
```
<DUMP_DIR>/
├── latest -> 20230701/
├── 20230701/
│   ├── dewiki-NS0-20230701-ENTERPRISE-HTML.json.tar.gz
│   ├── enwiki-NS0-20230701-ENTERPRISE-HTML.json.tar.gz
│   ├── eswiki-NS0-20230701-ENTERPRISE-HTML.json.tar.gz
│   ...
├── 20230620/
│   ├── dewiki-NS0-20230620-ENTERPRISE-HTML.json.tar.gz
│   ├── enwiki-NS0-20230620-ENTERPRISE-HTML.json.tar.gz
│   ├── eswiki-NS0-20230620-ENTERPRISE-HTML.json.tar.gz
│   ...
...
```
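
The `latest` symlink makes it easy to find the newest dumps without hard-coding a date; for example, assuming the directory above is `$DUMP_DIR`:
```sh
# List the newest dump of every downloaded language.
for dump in "$DUMP_DIR"/latest/*-ENTERPRISE-HTML.json.tar.gz; do
  echo "$dump"
done
```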

## Usage with Maps Generator

To use with the [maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md), see the [`run.sh` script](run.sh) and its own help documentation.
It handles extracting the tags, using multiple dumps, and re-running to convert titles to QIDs and extract them across languages.

As an example of manual usage with the maps generator:
- Assuming this program is installed to `$PATH` as `om-wikiparser`.
- Download [the dumps in the desired languages](https://dumps.wikimedia.org/other/enterprise_html/runs/) (Use the files with the format `${LANG}wiki-NS0-${DATE}-ENTERPRISE-HTML.json.tar.gz`).
  Set `DUMP_DOWNLOAD_DIR` to the location they are downloaded.
With the `wikidata_qids.txt` and `wikipedia_urls.txt` filter files prepared, run the extraction:

```sh
export RUST_LOG=om_wikiparser=debug
# Begin extraction.
for dump in $DUMP_DOWNLOAD_DIR/*-ENTERPRISE-HTML.json.tar.gz
do
tar xzOf $dump | om-wikiparser get-articles \
--wikidata-qids wikidata_qids.txt \
--wikipedia-urls wikipedia_urls.txt \
--write-new-qids new_qids.txt \
descriptions/
done
# Extract discovered QIDs.
for dump in $DUMP_DOWNLOAD_DIR/*-ENTERPRISE-HTML.json.tar.gz
do
tar xzOf $dump | om-wikiparser get-articles \
--wikidata-qids new_qids.txt \
descriptions/
done
```
Alternatively, extract the tags to filter by directly from a planet file:

```sh
om-wikiparser get-tags planet-latest.osm.pbf > osm_tags.tsv
# Begin extraction.
for dump in $DUMP_DOWNLOAD_DIR/*-ENTERPRISE-HTML.json.tar.gz
do
tar xzOf $dump | om-wikiparser get-articles \
--osm-tags osm_tags.tsv \
--write-new-qids new_qids.txt \
descriptions/
done
# Extract discovered QIDs.
for dump in $DUMP_DOWNLOAD_DIR/*-ENTERPRISE-HTML.json.tar.gz
do
tar xzOf $dump | om-wikiparser get-articles \
--wikidata-qids new_qids.txt \
descriptions/
done
```
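
After both passes, `descriptions/` holds the extracted articles and `new_qids.txt` the QIDs that were discovered by title; a quick sanity check of the example above:
```sh
wc -l osm_tags.tsv new_qids.txt
find descriptions/ -type f | wc -l
```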
