SDGclassy

SDG classification of texts using LDA topic modeling

This tool classifies texts based on the 17 Sustainable Development Goals (SDGs). Each SDG is defined by training texts derived from official UN publications. Since February 2024, the training set also includes synthetic texts generated by ChatGPT, enhancing the classifier's coverage.

To learn more about the methodology, see:

Note: This tool does not determine whether a text is SDG-related. Instead, it calculates scores based on how well the text fits the SDG vocabulary. For details, see the section "Interpreting the Results."


Requirements

  • Mallet 2.0.8 (see the download command under Installation)
  • Text files to classify (in .txt format)

Text Preparation:

  • Ensure your text files are cleaned to exclude irrelevant material (e.g., front matter). Cleaned data yields better classification results.

Supported Platforms:

  • Mac OS X (Zsh shell recommended for newer macOS versions)
  • Windows (requires additional configuration, see below)
  • Linux

On Windows, the Mallet bigrams command may need fixing. Refer to this GitHub issue.


Installation

For Mac OS and UNIX:

  1. Clone the repository:

    git clone https://github.com/SeaCelo/SDGclassy.git SDGclassy
    cd SDGclassy
    chmod +x infer-scores.sh
  2. Install Mallet inside the cloned SDGclassy directory:

    wget http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
    unzip mallet-2.0.8.zip
    rm mallet-2.0.8.zip

    Keeping Mallet in the same directory ensures all dependencies remain self-contained.

  3. Download the required model files:

    Both files are too large to be included in the repository. After downloading them, place them in the /SDGclassy/classifier/ directory.
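
Once the three steps are done, a quick check can confirm that the pieces are where the scripts expect them. The sketch below is not part of the repository: check_setup is a hypothetical helper, and it only tests paths named in this README. It cannot verify the model files themselves, since it does not know their names.

```shell
#!/bin/sh
# check_setup REPO_DIR
# Hypothetical helper: reports which expected pieces are missing from a
# cloned SDGclassy directory. Returns 0 if all checks pass, 1 otherwise.
check_setup() {
    repo=$1
    status=0
    # Both of these must exist and be executable.
    for exe in "$repo/infer-scores.sh" "$repo/mallet-2.0.8/bin/mallet"; do
        if [ ! -x "$exe" ]; then
            echo "missing or not executable: $exe"
            status=1
        fi
    done
    # The model files are placed here by hand in step 3.
    if [ ! -d "$repo/classifier" ]; then
        echo "missing directory: $repo/classifier"
        status=1
    fi
    return "$status"
}

# Example, run from inside the cloned directory:
#   check_setup . && echo "setup looks complete"
```

A non-zero return value means at least one path was reported as missing.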


Usage

For Mac OS and UNIX:

Option 1: Use Predefined Directories

  1. Prepare your text files:

    • Ensure all files are in plain .txt format (no PDFs or directories).
    • Clean the files by removing irrelevant material (e.g., front matter or unrelated content).
  2. Place the files in the predefined input directory:

    /SDGclassy/target/input/
    
  3. Run the classification script:

    ./infer-scores.sh
  4. Find the results in the predefined output directory:

    /SDGclassy/target/output/SDG-scores-out.txt
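
Steps 1 and 2 require that the input directory contain only plain .txt files. A pre-flight check along these lines can catch stray PDFs or subdirectories before the script runs. This is a minimal sketch, assuming a find that supports -mindepth and -maxdepth (GNU and BSD find both do); list_invalid_inputs is a hypothetical helper, not part of SDGclassy.

```shell
#!/bin/sh
# list_invalid_inputs DIR
# Hypothetical helper: prints every entry directly inside DIR that is not a
# regular *.txt file (subdirectories, PDFs, hidden files, and so on).
# Prints nothing when the directory contains only .txt files.
list_invalid_inputs() {
    find "$1" -mindepth 1 -maxdepth 1 ! \( -type f -name '*.txt' \) | sort
}

# Example, run from the repository root:
#   list_invalid_inputs target/input
```

Empty output means the directory is ready for classification.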
    

Option 2: Specify Custom Directories

  1. If your input files are stored elsewhere or you want to save results in a different location:

    • Use the alternative script infer-scores2.sh.
  2. Run the script with custom paths:

    ./infer-scores2.sh -i /path/to/your/input -o /path/to/your/output
  3. Check the specified output directory for the results.


For Windows:

  1. Prepare your text files:

    • Convert files to plain .txt format.
    • Clean up irrelevant content.
  2. Place the text files in the /SDGclassy/target/input/ directory.

  3. Run the script:

    • Right-click infer-scores.ps1 and select "Run with PowerShell".
  4. Results will be saved in:

    /SDGclassy/target/output/SDG-scores-out.txt
    

Interpreting the Results

  • The output file SDG-scores-out.txt lists topics (0–18) and their corresponding scores. Each topic maps to an SDG, except one filter topic, which should be ignored.
  • Use /classifier/topic-sdg_mapping.csv to match topics with SDGs.
  • Because the filter topic takes a share of each text's total, the SDG scores alone do not sum to 100%. Rescale them if your analysis requires it.
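
Dropping the filter topic and rescaling the remaining scores so that they sum to 1 can be sketched as follows. This is an illustration, not part of SDGclassy, and it rests on assumptions that should be checked against your own output file: that SDG-scores-out.txt is tab-separated with one row per text, that the first two columns are a document index and a file name, that the remaining columns hold one score per topic in topic order (topic 0 first), and that header lines start with #. rescale_scores is a hypothetical helper, and the filter topic number passed to it (0 in the example) is a placeholder; look up the real one in /classifier/topic-sdg_mapping.csv.

```shell
#!/bin/sh
# rescale_scores FILE FILTER_TOPIC
# Hypothetical helper. ASSUMED layout of FILE (verify before use):
#   index <TAB> name <TAB> score_topic0 <TAB> score_topic1 ...
# Removes the FILTER_TOPIC column and divides the remaining scores by
# their sum, so each output row sums to 1 (up to rounding).
rescale_scores() {
    awk -F '\t' -v OFS='\t' -v filter="$2" '
        /^#/ { next }                       # skip header or comment lines
        {
            skip = filter + 3               # topic k sits in column k + 3
            total = 0
            for (i = 3; i <= NF; i++)
                if (i != skip) total += $i
            if (total == 0) next            # nothing left to rescale
            out = $1 OFS $2
            for (i = 3; i <= NF; i++)
                if (i != skip) out = out OFS sprintf("%.4f", $i / total)
            print out
        }
    ' "$1"
}

# Example (0 is a placeholder for the filter topic number):
#   rescale_scores target/output/SDG-scores-out.txt 0 > SDG-scores-rescaled.txt
```

The output keeps the topic order of the input, minus the filter column, so the mapping file is still needed to label each remaining column with its SDG.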

Additional Notes

  • You can install Mallet elsewhere and adjust the scripts accordingly. Alternatively, add Mallet to your $PATH variable.
  • If Mallet runs out of memory during processing, allocate more memory:
    1. Navigate to the Mallet installation directory:
      cd /path/to/mallet-2.0.8/bin
    2. Open the mallet launcher script (a plain-text shell script) in an editor:
      nano mallet
    3. Set the MEMORY variable to the amount of memory Mallet may use (8 GB in this example):
      MEMORY=8g
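
The same memory change can be made without an interactive editor. The sketch below assumes the launcher script contains a line beginning with MEMORY=, as step 3 implies; set_mallet_memory is a hypothetical helper, not part of Mallet or SDGclassy.

```shell
#!/bin/sh
# set_mallet_memory LAUNCHER SIZE
# Hypothetical helper: rewrites the line starting with MEMORY= in the Mallet
# launcher script to MEMORY=SIZE. The -i.bak form works with both GNU and
# BSD sed and leaves the original file next to it as LAUNCHER.bak.
set_mallet_memory() {
    sed -i.bak "s/^MEMORY=.*/MEMORY=$2/" "$1"
}

# Example:
#   set_mallet_memory /path/to/mallet-2.0.8/bin/mallet 8g
```

If the change causes problems, restoring the .bak copy undoes it.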
