The Philippines Audit System (PAS) is a 3-stage document-data extraction software package developed to study “infraction-level” corruption in provincial, municipal, and city governments. This repo contains all of the production code used to engineer the target dataset. The convert module, however, should be unnecessary for replication.
Raw data (.zip file reports) for the project can be found on the Republic of the Philippines Commission on Audit website, but we recommend using the PhilAuditStorage drive link found in the Prerequisites section to obtain the PDF-converted data in an organized file structure.
- Background
- Prerequisites
- Installation
- Use
- FAQ
- Download PhilAuditStorage — public drive where converted, organized data is kept.
  - Take note of the path where you save it; it will be important later! E.g. /Users/me/Downloads/philauditstorage
- Download Model Weights
- As much local device space as you can afford. Each year requires roughly 30 GB.
- Clone Repo
git clone https://github.com/jackvaughan09/PhilippinesAuditSystem.git
- Inside the repo, run:
python3 scripts/setup.py path/to/philauditstorage
  - Creates the necessary directory structure inside of philauditstorage for all subsequent script executions (uses a mkdir -p equivalent, i.e. it will not overwrite existing directories with the same name).
  - If this is your first time running setup.py, it will create and activate a virtual environment before installing the pip dependencies.

setup.py is designed so that subsequent runs will not overwrite the created directory paths (unless deleted) or the virtual environment, so you can rerun it whenever you need without worry.
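The "mkdir -p equivalent" behaviour described above can be illustrated with a minimal sketch. This is not the code in setup.py. The function name ensure_structure and the SUBDIRS list are hypothetical, and Images and Metadata are the only folder names taken from this README.

```python
from pathlib import Path

# Hypothetical list: only Images/ and Metadata/ are named in this README.
SUBDIRS = ["Images", "Metadata"]


def ensure_structure(storage_root, subdirs=SUBDIRS):
    """Create each subdirectory under storage_root if it is missing.

    Returns the list of directories that did not exist before the call.
    """
    root = Path(storage_root)
    created = []
    for name in subdirs:
        target = root / name
        if not target.exists():
            created.append(target)
        # parents=True, exist_ok=True behaves like `mkdir -p`:
        # existing directories and their contents are left untouched.
        target.mkdir(parents=True, exist_ok=True)
    return created
```

Calling ensure_structure twice on the same folder returns an empty list the second time, which is why rerunning a script built this way is safe.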
- Generate metadata for a single year:
python3 scripts/generate_metadata.py path/to/philauditstorage/year
# Example
python3 scripts/generate_metadata.py /Users/me/Documents/PhilAuditStorage/2015
  - After completing this step, there will be a file titled 20XX_metadata.csv in the directory path/to/philauditstorage/Metadata/.

The metadata files are critical for the operation of the software package and should not be edited manually unless you know exactly what you’re doing.
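Before moving on, it can help to confirm that the metadata file landed where expected. The sketch below only encodes the naming pattern stated above (Metadata/20XX_metadata.csv). The helper names metadata_path and has_metadata are hypothetical and are not part of the repo.

```python
from pathlib import Path


def metadata_path(storage_root, year):
    """Return the expected location of the metadata file for one year."""
    return Path(storage_root) / "Metadata" / f"{year}_metadata.csv"


def has_metadata(storage_root, year):
    """True if the metadata file for the given year exists on disk."""
    return metadata_path(storage_root, year).is_file()
```

For example, has_metadata("/Users/me/Documents/PhilAuditStorage", 2015) checks for Metadata/2015_metadata.csv under that folder without opening or modifying the file.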
- PDF to image
python3 scripts/generate_images.py /path/to/philauditstorage/year
# Example
python3 scripts/generate_images.py /Users/me/Documents/PhilAuditStorage/2015
Be prepared with a significant amount of storage on your local device before attempting the PDF to image conversion for a year. There will be upwards of 50,000 .png images created (~30 GB)!
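Because one year of images takes roughly 30 GB, a quick free-space check before starting the conversion may save a failed run. This is a minimal sketch using only the standard library. The function name has_enough_space and the REQUIRED_GB constant are hypothetical, and the 30 GB threshold is just the approximate figure quoted in this README.

```python
import shutil

# Approximate per-year figure quoted in this README, not a measured limit.
REQUIRED_GB = 30


def has_enough_space(path, required_gb=REQUIRED_GB):
    """True if the disk holding `path` has at least `required_gb` GiB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3
```

Pass the PhilAuditStorage folder as path so the check is made against the disk that will actually receive the images.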
- Create predictions
  - Rather than using the usual /path/to/philauditstorage/year path as we’ve been doing, we’ll actually be using the /path/to/philauditstorage/Images/year directory created during setup.py.
  - You’ll also need the path to the model weights downloaded above.
python3 scripts/sort.py /path/to/philauditstorage/Images/year path/to/weights.ckpt
# Example
python3 scripts/sort.py /Users/me/Documents/PhilAuditStorage/Images/2015 \
    /Users/me/PhilRepo/weights.ckpt
During this step, images are moved from the Images/year/All folder into Images/year/Include or Images/year/Exclude based on the detection model predictions.
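The file movement described above can be sketched as follows. This is not the logic in sort.py. The function name sort_images is hypothetical, and the predictions format (a dict mapping image file name to True for include or False for exclude) is an assumption made for illustration. Only the All, Include, and Exclude folder names come from this README.

```python
import shutil
from pathlib import Path


def sort_images(year_dir, predictions):
    """Move images from All/ into Include/ or Exclude/.

    predictions: dict of image file name -> bool (True means include).
    This input format is assumed for the sketch, not taken from the repo.
    Returns a count of files moved into each folder.
    """
    year_dir = Path(year_dir)
    moved = {"Include": 0, "Exclude": 0}
    for name, keep in predictions.items():
        src = year_dir / "All" / name
        if not src.is_file():
            continue  # no matching image in All/, nothing to move
        label = "Include" if keep else "Exclude"
        dest_dir = year_dir / label
        dest_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dest_dir / name))
        moved[label] += 1
    return moved
```

Because files are moved and not copied, the All folder shrinks as sorting proceeds, so no extra disk space is needed for this step.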
- Manually sort predictions to obtain ground truth labels
- Map sorted images back to PDFs
python3.8 scripts/map_images_to_pdfs.py /path/to/philauditstorage/year
- Extraction
python3.8 scripts/extract.py /path/to/philauditstorage/year