This repository requires Python 3.7.3, `pip`, and `virtualenv`. Set up a virtual environment as follows:

```
virtualenv env
source env.sh
pip install -r requirements.txt
```
Whenever working with Python, run `source env.sh` in the current terminal session. If new packages are installed, update the list of dependencies by running `pip freeze > requirements.txt`.
Scripts in subdirectories (e.g. those in `scraping`) should be run as modules to avoid path and module conflicts:

```
python -m scraping.new_dataset  # instead of `python scraping/new_dataset.py` or `cd scraping && python new_dataset.py`
```
The TensorFlow workflow in this repository is adapted from this boilerplate.
During and after training, the training and validation losses are plotted in TensorBoard. To visualize them, run `tensorboard --logdir=experiments` and open `localhost:6006` in the browser.
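For reference, writing those scalars under `experiments` with TF2-style summary writers looks roughly like the sketch below. The actual logging in this repository comes from the boilerplate and may use a different (e.g. TF1-style) API; the run name and tag names here are hypothetical.

```python
import tensorflow as tf

# Hypothetical run name; the boilerplate decides the real layout under experiments/.
writer = tf.summary.create_file_writer("experiments/example_run")

with writer.as_default():
    for step in range(100):
        # Dummy values standing in for the real training/validation losses.
        tf.summary.scalar("train_loss", 1.0 / (step + 1), step=step)
        tf.summary.scalar("val_loss", 1.2 / (step + 1), step=step)

writer.flush()
```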
After running a script that produces visualizations (for example, `scripts.centroids`), go to [projector.tensorflow.org](https://projector.tensorflow.org) and upload the TSV files inside the `projector` directory.
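For context, the projector expects one tab-separated vectors file plus an optional metadata file. Below is a minimal sketch of producing such files; the array, labels, and file names are illustrative, not necessarily what `scripts.centroids` writes.

```python
import csv
import os

import numpy as np

# Illustrative data: 100 embeddings of dimension 16, one label per row.
embeddings = np.random.rand(100, 16)
labels = [f"word_{i}" for i in range(100)]

os.makedirs("projector", exist_ok=True)

# One embedding per line, components separated by tabs.
np.savetxt("projector/vectors.tsv", embeddings, delimiter="\t")

# Single-column metadata needs no header row.
with open("projector/metadata.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    for label in labels:
        writer.writerow([label])
```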
- Stormfront dataset: place the `all_files` directory and the `annotations_metadata.csv` file inside this repository's `data` directory. Rename `all_files` to `stormfront`, and `annotations_metadata.csv` to `stormfront.csv`.
- Twitter hate speech dataset: rename the file to `twitter.csv` and place it in the `data` directory.
- Google News Word2Vec: place the file directly in the `data` directory.
- Twitter moral foundations dataset: rename the directory to `twitter_mf` and place it in the `data` directory. To scrape the tweets from their IDs, run `python -m scraping.twitter_mf`, and then clean the data with `python -m scripts.clean_twitter_mf`. To have a fixed held-out dataset that is representative of the rest of the data, create a shuffled version of the data (a cross-platform alternative is sketched after this list):

  ```
  cat data/twitter_mf.clean.csv | head -1 > data/twitter_mf.clean.shuffled.csv
  # macOS users: `sort` by hash is a good replacement for `shuf`.
  cat data/twitter_mf.clean.csv | tail -24771 | shuf >> data/twitter_mf.clean.shuffled.csv
  ```

- WikiText: download, unzip, and place both of the word-level datasets in the `data` directory. Clean the data with `python -m scripts.clean_wikitext`.
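If `shuf` is not available, the same header-preserving shuffle can be done with a small Python script. This is only a sketch: it assumes the first line of `twitter_mf.clean.csv` is the header, and the fixed seed is just one way to keep the held-out split reproducible.

```python
import random

# Keep the header line fixed and shuffle the remaining rows.
with open("data/twitter_mf.clean.csv") as f:
    header, *rows = f.readlines()

random.seed(0)  # fixed seed so the held-out split stays reproducible
random.shuffle(rows)

with open("data/twitter_mf.clean.shuffled.csv", "w") as f:
    f.write(header)
    f.writelines(rows)
```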
Ping @danielwatson6 for access to the YouTube and the ambiguity corpora.
- Set up an environment variable `DATASETS=/path/to/youtube/data/dir` and name the directory containing the YouTube CSV files `youtube_right`. Unlike the rest of the data, this dataset is kept outside the repository because it is too large to be guaranteed to fit on the available SSD space (see the sketch after this list for how the path can be resolved).
- Run `python -m scraping.new_dataset` to scrape the rest of the YouTube data.
- Rename the ambiguity data to `ambiguity.csv` and place it in the `data` folder.
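For illustration, a script would presumably resolve the YouTube directory roughly as follows. This is an assumption about the layout (`youtube_right` sitting directly under `$DATASETS`), not the repository's actual loading code.

```python
import os

# Hypothetical helper: assumes youtube_right/ sits directly under $DATASETS.
def youtube_csv_dir() -> str:
    datasets_root = os.environ["DATASETS"]  # e.g. /path/to/youtube/data/dir
    return os.path.join(datasets_root, "youtube_right")

print(youtube_csv_dir())
```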
For the scraping scripts to work, you need your own API keys.
- Place your YouTube API key in a file `scraping/api_key`.
- Place your Twitter API keys in a JSON file `scraping/twitter_api.json` with the following keys: `consumer_key`, `consumer_secret`, `access_token_key`, `access_token_secret`.
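For example, the credentials file can be written with a short snippet like this (the placeholder values below must be replaced with your own keys):

```python
import json

# Placeholder credentials; replace these with your own keys.
credentials = {
    "consumer_key": "YOUR_CONSUMER_KEY",
    "consumer_secret": "YOUR_CONSUMER_SECRET",
    "access_token_key": "YOUR_ACCESS_TOKEN_KEY",
    "access_token_secret": "YOUR_ACCESS_TOKEN_SECRET",
}

with open("scraping/twitter_api.json", "w") as f:
    json.dump(credentials, f, indent=2)
```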