SDG classification of texts using LDA topic modeling
This tool classifies texts based on the 17 Sustainable Development Goals (SDGs). Each SDG is defined by training texts derived from official UN publications. Since February 2024, the training set also includes synthetic texts generated by ChatGPT, enhancing the classifier's coverage.
To learn more about the methodology, see:
- "Art is long, life is short: An SDG Classification System for DESA Publications".
- "Using large language models to help train machine learning SDG classifiers".
Note: This tool does not determine if a text is SDG-related. Instead, it calculates scores based on how well the text fits within the SDG vocabulary. For details, see the section "Interpreting the Results."
Requirements
- Mallet 2.0.8 (installed during setup, see below)
- Text files to classify (in .txt format)
  - Clean your text files to exclude irrelevant material (e.g., front matter). Cleaned data yields better classification results.
Supported platforms
- macOS (Zsh shell recommended on newer versions)
- Windows (requires additional configuration, see below)
- Linux
On Windows, the Mallet bigrams command may need fixing. Refer to this GitHub issue.
Installation
- Clone the repository and make the classification script executable:
    git clone https://github.com/SeaCelo/SDGclassy.git SDGclassy
    cd SDGclassy
    chmod +x infer-scores.sh
- Install Mallet inside the cloned SDGclassy directory:
    wget http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
    unzip mallet-2.0.8.zip
    rm mallet-2.0.8.zip
  Keeping Mallet in the same directory keeps all dependencies self-contained.
- Download the required model files:
  - sdgclassy.mallet: Download from Google Drive
  - inferring-temp.mallet: Download from Google Drive
  These files are too large to be included in the repository. After downloading, manually place both files into the /SDGclassy/classifier/ directory.
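Before the first run, it can help to confirm that the layout described above is complete. The following is a minimal sketch, not part of the repository: the function name check_setup is hypothetical, and the paths assume Mallet was unzipped inside the cloned directory as described.

```shell
#!/usr/bin/env bash
# check_setup DIR: list expected files that are missing under DIR.
# Hypothetical helper; the paths follow the installation steps in this README.
check_setup() {
  local root="${1:-.}" missing=0 f
  for f in \
    "infer-scores.sh" \
    "mallet-2.0.8/bin/mallet" \
    "classifier/sdgclassy.mallet" \
    "classifier/inferring-temp.mallet"
  do
    if [ ! -e "$root/$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  # Exit status 0 when everything is present, 1 otherwise.
  return "$missing"
}
```

From inside the cloned directory, `check_setup .` prints one line per missing file and prints nothing when the setup is complete.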
Usage on macOS and Linux
- Prepare your text files:
  - Ensure all files are in plain .txt format (no PDFs or directories).
  - Clean the files by removing irrelevant material (e.g., front matter or unrelated content).
- Place the files in the predefined input directory: /SDGclassy/target/input/
- Run the classification script:
    ./infer-scores.sh
- Find the results in the predefined output directory: /SDGclassy/target/output/SDG-scores-out.txt
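Since the input directory should hold only plain .txt files, a quick check before running the script can catch stray PDFs or subdirectories. This is a sketch under that assumption; find_non_txt is a hypothetical helper name, not something shipped with the repository.

```shell
#!/usr/bin/env bash
# find_non_txt DIR: print every entry in DIR that is not a regular .txt file.
# Hypothetical helper for checking the input directory before classification.
find_non_txt() {
  local dir="$1" entry
  for entry in "$dir"/*; do
    [ -e "$entry" ] || continue          # empty directory: the glob matched nothing
    case "$entry" in
      *.txt) [ -f "$entry" ] || echo "$entry" ;;   # a directory named *.txt still counts
      *)     echo "$entry" ;;
    esac
  done
}
```

Running `find_non_txt target/input` from the repository root should print nothing when the directory is ready for classification.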
Custom input and output paths
- If your input files are stored elsewhere or you want to save results in a different location, use the alternative script infer-scores2.sh.
- Run the script with custom paths:
    ./infer-scores2.sh -i /path/to/your/input -o /path/to/your/output
- Check the specified output directory for the results.
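When several collections of texts need separate results, the custom-path script can be called once per collection. The sketch below assumes each collection sits in its own subdirectory and that infer-scores2.sh accepts the -i and -o options exactly as shown above; run_batches and its arguments are hypothetical names, and whether the script must be started from the repository directory is not confirmed here.

```shell
#!/usr/bin/env bash
# run_batches SCRIPT IN_ROOT OUT_ROOT
# Calls SCRIPT once for every subdirectory of IN_ROOT, writing each result
# to a matching subdirectory of OUT_ROOT. Hypothetical helper.
run_batches() {
  local script="$1" in_root="$2" out_root="$3" batch name
  for batch in "$in_root"/*/; do
    [ -d "$batch" ] || continue          # no subdirectories: nothing to do
    name=$(basename "$batch")
    mkdir -p "$out_root/$name"
    "$script" -i "${batch%/}" -o "$out_root/$name"
  done
}
```

For example, `run_batches ./infer-scores2.sh /path/to/collections /path/to/results` would produce one output directory per collection.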
Usage on Windows
- Prepare your text files:
  - Convert files to plain .txt format.
  - Clean up irrelevant content.
- Place the text files in the /SDGclassy/target/input/ directory.
- Run the script: right-click infer-scores.ps1 and select "Run with PowerShell".
- Results will be saved in: /SDGclassy/target/output/SDG-scores-out.txt
Interpreting the Results
- The output file SDG-scores-out.txt lists topics (0–18) and their corresponding scores. Each topic maps to an SDG, except one filter topic, which should be ignored.
- Use /classifier/topic-sdg_mapping.csv to match topics with SDGs.
- Scores do not sum to 100% due to the extra category. Rescale them if necessary for your analysis.
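One way to rescale is to drop the filter topic and divide each remaining score by their sum, so the SDG scores of a document add up to 1. The sketch below makes two assumptions that should be checked against your actual output: that SDG-scores-out.txt is tab-separated with a document index, a document name, and then one score per topic in topic order (with any header line starting with #), and that you pass the filter topic's index yourself after looking it up in topic-sdg_mapping.csv. The name rescale_scores is hypothetical.

```shell
#!/usr/bin/env bash
# rescale_scores FILE FILTER_TOPIC
# Assumed layout (verify first): index <TAB> name <TAB> score_0 <TAB> score_1 ...
# Prints: name <TAB> rescaled scores, with the filter topic's column removed.
rescale_scores() {
  awk -F'\t' -v OFS='\t' -v skip="$2" '
    /^#/ { next }                         # skip header lines
    {
      total = 0
      for (i = 3; i <= NF; i++) if (i - 3 != skip) total += $i
      line = $2
      for (i = 3; i <= NF; i++) {
        if (i - 3 == skip) continue       # drop the filter topic
        line = line OFS sprintf("%.4f", (total > 0 ? $i / total : 0))
      }
      print line
    }' "$1"
}
```

A call might look like `rescale_scores target/output/SDG-scores-out.txt 5 > rescaled.tsv`, where 5 is only a placeholder for the filter topic index. Because the filter column is removed, the remaining columns no longer line up with the original topic numbers, so adjust the mapping accordingly.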
Notes
- You can install Mallet elsewhere and adjust the scripts accordingly. Alternatively, add Mallet to your $PATH variable.
- If Mallet runs out of memory during processing, allocate more memory:
  - Navigate to the bin directory of the Mallet installation:
      cd /path/to/mallet-2.0.8/bin
  - Open the mallet launcher script (a shell script, not a compiled binary) in an editor:
      nano mallet
  - Set the memory allocation:
      MEMORY=8g
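The memory change can also be scripted instead of edited by hand. This sketch assumes the mallet launcher script contains a line that begins with MEMORY=, as the manual steps above imply; set_mallet_memory is a hypothetical helper name. It writes through a temporary file and copies the content back so that the script keeps its executable permission.

```shell
#!/usr/bin/env bash
# set_mallet_memory SCRIPT SIZE
# Rewrites the line starting with MEMORY= in SCRIPT to MEMORY=SIZE (e.g. 8g).
# Hypothetical helper; assumes such a line exists in the launcher script.
set_mallet_memory() {
  local script="$1" size="$2" work
  work=$(mktemp) || return 1
  if sed "s/^MEMORY=.*/MEMORY=$size/" "$script" > "$work"; then
    cat "$work" > "$script"              # overwrite in place, keeping file permissions
  fi
  rm -f "$work"
}
```

For example: `set_mallet_memory /path/to/mallet-2.0.8/bin/mallet 8g`.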