
Information Systems 2022 Reproducibility Guide

Alexandre Quemy edited this page Mar 19, 2022 · 5 revisions

This guide aims at reproducing the experiments on the European Court of Human Rights Open Data. There are two distinct components:

  1. The dataset, generated from scratch by retrieving the documents from HUDOC,
  2. The predictions, obtained by training several algorithms on the dataset

Requirements

Combined, the dataset generation and the experiments require around 50GB of storage.
At least 16GB of RAM is required to generate the datasets.

Git and Docker need to be installed regardless of the operating system.
On Windows, we tested the procedure using Docker Desktop and Git Bash. Git Bash is bundled with Git for Windows.
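
The requirements above can be checked before starting. The following is a minimal sketch, not part of the ECHR-OD repositories: the helper names have_cmd, free_gb and total_ram_gb are hypothetical, and the RAM check assumes /proc/meminfo is readable (it prints "unknown" otherwise).

```shell
#!/usr/bin/env bash
# Pre-flight check (illustrative sketch, not part of the ECHR-OD repositories).

# Return 0 if the given command is available on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Print the free space, in whole GB, of the filesystem holding the given directory.
free_gb() {
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Print the total RAM in whole GB, or "unknown" where /proc/meminfo is unavailable.
total_ram_gb() {
    if [ -r /proc/meminfo ]; then
        awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' /proc/meminfo
    else
        echo unknown
    fi
}

for tool in git docker; do
    if have_cmd "$tool"; then
        echo "ok: $tool found"
    else
        echo "missing: $tool"
    fi
done
echo "free disk space here: $(free_gb .) GB (about 50 GB needed)"
echo "total RAM: $(total_ram_gb) GB (16 GB needed for dataset generation)"
```

The reported RAM is rounded down and excludes memory reserved by the system, so a 16GB machine may show 15.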

Fast guide

For Windows, please prepend docker run commands with winpty.

  1. Open a folder where the experiments will be replicated.
  2. ECHR_FOLDER=$(pwd)
  3. git clone --depth 1 --branch InfoSys https://github.com/echr-od/ECHR-OD_process.git
  4. cd ECHR-OD_process
  5. docker build -f Dockerfile -t echr_build .
  6. (Windows only) dos2unix entrypoint.sh
  7. docker run -ti --mount src=$(pwd),dst=/tmp/echr_process/,type=bind echr_build build --workflow local --doc_ids ./build_desc/InfoSys_cases.txt
  8. cd ..
  9. git clone --depth 1 --branch InfoSys https://github.com/echr-od/ECHR-OD_predictions.git
  10. cd ECHR-OD_predictions
  11. docker build -f Dockerfile -t echr_experiments .
  12. (Windows only) dos2unix entrypoint.sh
  13. docker run -it --rm --mount src=$(pwd),dst=/tmp/echr_experiments/,type=bind --mount src=$ECHR_FOLDER/ECHR-OD_process/build/echr_database/,dst=/tmp/echr_experiments/data/input/,type=bind echr_experiments run binary
  14. docker run -it --rm --mount src=$(pwd),dst=/tmp/echr_experiments/,type=bind --mount src=$ECHR_FOLDER/ECHR-OD_process/build/echr_database/,dst=/tmp/echr_experiments/data/input/,type=bind echr_experiments run multiclass
  15. docker run -it --rm --mount src=$(pwd),dst=/tmp/echr_experiments/,type=bind --mount src=$ECHR_FOLDER/ECHR-OD_process/build/echr_database/,dst=/tmp/echr_experiments/data/input/,type=bind echr_experiments run multilabel
  16. docker run -it --rm --mount src=$(pwd),dst=/tmp/echr_experiments/,type=bind --mount src=$ECHR_FOLDER/ECHR-OD_process/build/echr_database/,dst=/tmp/echr_experiments/data/input/,type=bind echr_experiments analyze binary
  17. docker run -it --rm --mount src=$(pwd),dst=/tmp/echr_experiments/,type=bind --mount src=$ECHR_FOLDER/ECHR-OD_process/build/echr_database/,dst=/tmp/echr_experiments/data/input/,type=bind echr_experiments analyze multiclass
  18. docker run -it --rm --mount src=$(pwd),dst=/tmp/echr_experiments/,type=bind --mount src=$ECHR_FOLDER/ECHR-OD_process/build/echr_database/,dst=/tmp/echr_experiments/data/input/,type=bind echr_experiments analyze multilabel
  19. docker run -it --rm --mount src=$(pwd),dst=/tmp/echr_experiments/,type=bind --mount src=$ECHR_FOLDER/ECHR-OD_process/build/echr_database/,dst=/tmp/echr_experiments/data/input/,type=bind echr_experiments reports
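
Steps 13 to 19 differ only in their last arguments, so they can be wrapped in a loop. The sketch below is one possible way to do so, run from inside ECHR-OD_predictions. The function run_step and the DRY_RUN variable are hypothetical, and the fallback to the parent directory when ECHR_FOLDER is unset is an assumption. By default it only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch only: wraps steps 13-19 of the fast guide. Run it from inside ECHR-OD_predictions.

# ECHR_FOLDER comes from step 2; as a fallback (an assumption), use the parent directory.
ECHR_FOLDER=${ECHR_FOLDER:-$(dirname "$PWD")}

# With DRY_RUN=1 (the default here) the command is only printed; DRY_RUN=0 executes it.
run_step() {
    # Prepend the fixed part of the command to the arguments given by the caller.
    set -- docker run -it --rm \
        --mount "src=$PWD,dst=/tmp/echr_experiments/,type=bind" \
        --mount "src=$ECHR_FOLDER/ECHR-OD_process/build/echr_database/,dst=/tmp/echr_experiments/data/input/,type=bind" \
        echr_experiments "$@"
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@" || return 1
    fi
}

# Same order as the list above: all runs, then all analyses, then the report.
for task in binary multiclass multilabel; do
    run_step run "$task" || exit 1
done
for task in binary multiclass multilabel; do
    run_step analyze "$task" || exit 1
done
run_step reports
```

Setting DRY_RUN=0 executes the steps instead of printing them. On Windows, winpty would still have to be added in front of docker.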

Dataset via ECHR_Process

Because the Docker image is rebuilt automatically every month to integrate the latest developments, using the latest image cannot guarantee reproducibility. The user therefore has to build the Docker image from a specific code revision.

To clone the repository at the appropriate revision, one can use the following command:

git clone --depth 1 --branch InfoSys https://github.com/echr-od/ECHR-OD_process.git

Then, once in the repository, the image is built as follows:

docker build -f Dockerfile -t echr_build .

To guarantee reproducibility, we introduced a build description mechanism that lets the user specify the list of cases to retrieve and process. At the end of every run, the list of cases in the final database is saved, which makes it easier to reproduce a particular run and to share builds. For this paper, we generated a full build (workflow local) and used its list of cases as the reference. This list is available in ECHR-OD_process/build_desc/InfoSys_cases.txt.

To exactly reproduce the database used for the experiments, one can run the following:

docker run -ti --mount src=$(pwd),dst=/tmp/echr_process/,type=bind echr_build build --workflow local --doc_ids ./build_desc/InfoSys_cases.txt

On Windows, the command may need to be prefixed with winpty. If bash fails because of Windows line endings, run dos2unix entrypoint.sh to convert the container entrypoint before running the container.
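
To keep the same commands usable on every platform, the winpty prefix can be chosen automatically. This is a small sketch: the function tty_prefix and the PREFIX variable are hypothetical, and it assumes that Git Bash reports a system name starting with MINGW or MSYS through uname -s.

```shell
#!/usr/bin/env bash
# Print "winpty" when the given `uname -s` value looks like Git Bash (MINGW/MSYS), else nothing.
tty_prefix() {
    case "$1" in
        MINGW*|MSYS*) echo winpty ;;
        *) : ;;
    esac
}

PREFIX=$(tty_prefix "$(uname -s)")
echo "docker commands will be prefixed with: '${PREFIX}'"

# Usage (PREFIX is left unquoted on purpose so that an empty prefix disappears):
# $PREFIX docker run -ti --mount src=$(pwd),dst=/tmp/echr_process/,type=bind echr_build build --workflow local --doc_ids ./build_desc/InfoSys_cases.txt
```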

Depending on your CPU, the entire workflow may take up to 6 hours to complete. Keep in mind that it requires at least 16GB of RAM because of the NLP model step.
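
Since a list of cases is saved at the end of a build, one way to check a rebuild is to compare that list with the reference list. The sketch below assumes both files contain one case identifier per line, which is not confirmed by this guide. The function same_cases is hypothetical, and the location of the list saved by your own build is left as a placeholder.

```shell
#!/usr/bin/env bash
# Compare two case lists irrespective of order and duplicates.
# Assumption: one case identifier per line in each file.
# Prints the identifiers found in only one list; returns 0 when both sets are identical.
same_cases() {
    ref_sorted=$(mktemp) || return 2
    new_sorted=$(mktemp) || return 2
    LC_ALL=C sort -u "$1" > "$ref_sorted"
    LC_ALL=C sort -u "$2" > "$new_sorted"
    # comm -3 drops the lines common to both files, leaving only the differences.
    diff_lines=$(LC_ALL=C comm -3 "$ref_sorted" "$new_sorted")
    rm -f "$ref_sorted" "$new_sorted"
    if [ -z "$diff_lines" ]; then
        return 0
    fi
    printf '%s\n' "$diff_lines"
    return 1
}

# Usage:
# same_cases build_desc/InfoSys_cases.txt <list saved by your build> && echo "same cases"
```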

Experiments via ECHR_Predictions

Because of the number of cross-validations to perform, we separated the experiment runner from the analysis. We also separated the binary, multiclass and multilabel runners. Finally, each runner can be restarted without starting the experiments from scratch.

To clone the repository at the appropriate revision:

git clone --depth 1 --branch InfoSys https://github.com/echr-od/ECHR-OD_predictions.git

Then, once in the repository, the image is built as follows:

docker build -f Dockerfile -t echr_experiments .

To run the experiments:

docker run -it --rm --mount src=$(pwd),dst=/tmp/echr_experiments/,type=bind --mount src=<ECHR_OD_BUILD>,dst=/tmp/echr_experiments/data/input/,type=bind echr_experiments run <task>

where <ECHR_OD_BUILD> is the absolute path to the ECHR-OD build to use and <task> is the classification task to solve, one of binary, multiclass or multilabel.

To analyze the experiments:

docker run -it --rm --mount src=$(pwd),dst=/tmp/echr_experiments/,type=bind --mount src=<ECHR_OD_BUILD>,dst=/tmp/echr_experiments/data/input/,type=bind echr_experiments analyze <task>

where <ECHR_OD_BUILD> and <task> are as defined above.
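
The two templates above can be filled in by a small helper that also rejects an unknown task or a relative build path. This is a sketch under assumptions: echr_cmd is a hypothetical function, not part of the repository, and it only prints the command so it can be inspected before being run.

```shell
#!/usr/bin/env bash
# Hypothetical helper: validate the arguments, then print the docker command for one step.
# usage: echr_cmd <run|analyze> <binary|multiclass|multilabel> <absolute path to the ECHR-OD build>
echr_cmd() {
    action=$1
    task=$2
    build=$3
    case "$action" in
        run|analyze) ;;
        *) echo "unknown action: $action" >&2; return 1 ;;
    esac
    case "$task" in
        binary|multiclass|multilabel) ;;
        *) echo "unknown task: $task" >&2; return 1 ;;
    esac
    case "$build" in
        /*) ;;
        *) echo "build path must be absolute: $build" >&2; return 1 ;;
    esac
    echo "docker run -it --rm --mount src=$PWD,dst=/tmp/echr_experiments/,type=bind --mount src=$build,dst=/tmp/echr_experiments/data/input/,type=bind echr_experiments $action $task"
}

# Example with a placeholder build path:
echr_cmd run binary /path/to/ECHR-OD_process/build/echr_database/
```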

To generate the final report (ECHR_FOLDER is the variable set in step 2 of the fast guide):

docker run -it --rm --mount src=$(pwd),dst=/tmp/echr_experiments/,type=bind --mount src=$ECHR_FOLDER/ECHR-OD_process/build/echr_database/,dst=/tmp/echr_experiments/data/input/,type=bind echr_experiments reports

For instance, assuming the build is located at /home/aquemy/projects/echr-od/ECHR-OD_process/build/InfoSys/, to run the binary classification experiments, one would use:

docker run -it --rm --mount src=$(pwd),dst=/tmp/echr_experiments/,type=bind --mount src=/home/aquemy/projects/echr-od/ECHR-OD_process/build/InfoSys/,dst=/tmp/echr_experiments/data/input/,type=bind echr_experiments run binary

The analysis can be performed using:

docker run -it --rm --mount src=$(pwd),dst=/tmp/echr_experiments/,type=bind echr_experiments analyze binary

Note that the analysis can be started before all experiment results are available. Partial results help to check reproducibility without waiting for the experiments to finish.

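Similarly, for the multiclass and multilabel experiments:

docker run -it --rm --mount src=$(pwd),dst=/tmp/echr_experiments/,type=bind --mount src=/home/aquemy/projects/echr-od/ECHR-OD_process/build/InfoSys/,dst=/tmp/echr_experiments/data/input/,type=bind echr_experiments run multiclass
docker run -it --rm --mount src=$(pwd),dst=/tmp/echr_experiments/,type=bind echr_experiments analyze multiclass

docker run -it --rm --mount src=$(pwd),dst=/tmp/echr_experiments/,type=bind --mount src=/home/aquemy/projects/echr-od/ECHR-OD_process/build/InfoSys/,dst=/tmp/echr_experiments/data/input/,type=bind echr_experiments run multilabel
docker run -it --rm --mount src=$(pwd),dst=/tmp/echr_experiments/,type=bind echr_experiments analyze multilabel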