This repository requires Python 3.7.3, `pip`, and `virtualenv`. Set up a virtual environment as follows:

```
virtualenv env
source env.sh
pip install -r requirements.txt
```
Whenever working with Python, run `source env.sh` in the current terminal session. If new packages are installed, update the list of dependencies by running `pip freeze > requirements.txt`.
Scripts in subdirectories (e.g. those in `scraping`) should be run as modules to avoid path and module conflicts:

```
python -m scraping.new_dataset  # instead of `python scraping/new_dataset.py` or `cd scraping && python new_dataset.py`
```
The TensorFlow workflow in this repository is adapted from this boilerplate.
During and after training, the training and validation losses are plotted in TensorBoard. To visualize them, run `tensorboard --logdir=experiments` and open `localhost:6006` in the browser.
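For reference, writing those scalars under `experiments` with TF2-style summary writers looks roughly like the sketch below. The actual logging in this repository comes from the boilerplate and may use a different (e.g. TF1-style) API; the run name and tag names here are hypothetical.

```python
import tensorflow as tf

# Hypothetical run name; the boilerplate decides the real layout under experiments/.
writer = tf.summary.create_file_writer("experiments/example_run")

with writer.as_default():
    for step in range(100):
        # Dummy values standing in for the real training/validation losses.
        tf.summary.scalar("train_loss", 1.0 / (step + 1), step=step)
        tf.summary.scalar("val_loss", 1.2 / (step + 1), step=step)

writer.flush()
```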
After running a script that produces visualizations (for example, `scripts.centroids`), go to [projector.tensorflow.org](https://projector.tensorflow.org) and upload the TSV files inside the `projector` directory.
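For context, the projector expects one tab-separated vectors file plus an optional metadata file. Below is a minimal sketch of producing such files; the array, labels, and file names are illustrative, not necessarily what `scripts.centroids` writes.

```python
import csv
import os

import numpy as np

# Illustrative data: 100 embeddings of dimension 16, one label per row.
embeddings = np.random.rand(100, 16)
labels = [f"word_{i}" for i in range(100)]

os.makedirs("projector", exist_ok=True)

# One embedding per line, components separated by tabs.
np.savetxt("projector/vectors.tsv", embeddings, delimiter="\t")

# Single-column metadata needs no header row.
with open("projector/metadata.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    for label in labels:
        writer.writerow([label])
```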
- Stormfront dataset: place the `all_files` directory and the `annotations_metadata.csv` file inside this repository's `data` directory. Rename `all_files` to `stormfront`, and `annotations_metadata.csv` to `stormfront.csv`.
- Twitter hate speech dataset: rename the file to `twitter.csv` and place it in the `data` directory.
- Google News Word2Vec: place the file directly in the `data` directory.
- Twitter moral foundations dataset: rename the directory to `twitter_mf` and place it in the `data` directory. To scrape the tweets from their IDs, run `python -m scraping.twitter_mf`, and then clean the data with `python -m scripts.clean_twitter_mf`. To have a fixed held-out dataset that is representative of the rest of the data, create a shuffled version of the data (a cross-platform alternative is sketched after this list):

  ```
  cat data/twitter_mf.clean.csv | head -1 > data/twitter_mf.clean.shuffled.csv
  # macOS users: `sort` by hash is a good replacement for `shuf`.
  cat data/twitter_mf.clean.csv | tail -24771 | shuf >> data/twitter_mf.clean.shuffled.csv
  ```

- WikiText: download, unzip, and place both of the word-level datasets in the `data` directory. Clean the data with `python -m scripts.clean_wikitext`.
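If `shuf` is not available, the same header-preserving shuffle can be done with a small Python script. This is only a sketch: it assumes the first line of `twitter_mf.clean.csv` is the header, and the fixed seed is just one way to keep the held-out split reproducible.

```python
import random

# Keep the header line fixed and shuffle the remaining rows.
with open("data/twitter_mf.clean.csv") as f:
    header, *rows = f.readlines()

random.seed(0)  # fixed seed so the held-out split stays reproducible
random.shuffle(rows)

with open("data/twitter_mf.clean.shuffled.csv", "w") as f:
    f.write(header)
    f.writelines(rows)
```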
Ping @danielwatson6 for access to the YouTube and the ambiguity corpora.
- Set up an environment variable `DATASETS=/path/to/youtube/data/dir` and name the directory containing the YouTube CSV files `youtube_right`. Unlike the rest of the data, this dataset is kept outside the repository because it is too large to be guaranteed to fit on the available SSD space (see the sketch after this list for how the path can be resolved).
- Run `python -m scraping.new_dataset` to scrape the rest of the YouTube data.
- Rename the ambiguity data to `ambiguity.csv` and place it in the `data` folder.
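For illustration, a script would presumably resolve the YouTube directory roughly as follows. This is an assumption about the layout (`youtube_right` sitting directly under `$DATASETS`), not the repository's actual loading code.

```python
import os

# Hypothetical helper: assumes youtube_right/ sits directly under $DATASETS.
def youtube_csv_dir() -> str:
    datasets_root = os.environ["DATASETS"]  # e.g. /path/to/youtube/data/dir
    return os.path.join(datasets_root, "youtube_right")

print(youtube_csv_dir())
```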
For the scraping scripts to work, you need your own API keys.
- Place your YouTube API key in a file `scraping/api_key`.
- Place your Twitter API keys in a JSON file `scraping/twitter_api.json` with the following keys: `consumer_key`, `consumer_secret`, `access_token_key`, `access_token_secret`.
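For example, the credentials file can be written with a short snippet like this (the placeholder values below must be replaced with your own keys):

```python
import json

# Placeholder credentials; replace these with your own keys.
credentials = {
    "consumer_key": "YOUR_CONSUMER_KEY",
    "consumer_secret": "YOUR_CONSUMER_SECRET",
    "access_token_key": "YOUR_ACCESS_TOKEN_KEY",
    "access_token_secret": "YOUR_ACCESS_TOKEN_SECRET",
}

with open("scraping/twitter_api.json", "w") as f:
    json.dump(credentials, f, indent=2)
```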