Welcome! This repo contains scripts for training a machine learning model for political ad goal classification (e.g., donate, contact lawmakers, or get-out-the-vote).
This repo is part of the Cross-platform Election Advertising Transparency Initiative (CREATIVE). CREATIVE is an academic research project that aims to provide the public with analysis tools for greater transparency of political ads across online platforms. In particular, CREATIVE provides cross-platform integration and standardization of political ads collected from Google and Facebook. CREATIVE is a joint project of the Wesleyan Media Project (WMP) and the privacy-tech-lab at Wesleyan University.
To analyze the different dimensions of political ad transparency we have developed an analysis pipeline. The scripts in this repo are part of the Data Classification step in our pipeline.
In this repo, we classify ads into the following nine categories of goals:
- Donate: Request for donations
- Contact: Request for viewer to contact lawmakers
- Purchase: Selling of merchandise
- Get-out-the-vote (GOTV): Efforts to get out the vote
- Event: Advertisement of an event
- Poll: Ask the viewer to fill out a poll
- Gather info: Solicitation of viewers' information
- Learn more: Invite viewer to learn more about a candidate
- Persuade: Promote or attack a candidate, or persuade the viewer to vote for or against a specific candidate
The model is trained on hand-coded 2020 Facebook data labeled by the WMP.
This repo contains eight R scripts and eight Python scripts that are of interest. The scripts are numbered in the order in which they should be run. Scripts that directly depend on one another are ordered sequentially. Scripts with the same number are alternatives; usually they are the same script run on different data or with minor variations. The outputs of each script are saved, so it is possible, for example, to run only the inference script, since the model files are already present. There are also some additional scripts, which will be discussed in the setup section.
For an example pipeline that trains on 2020 Facebook data and then runs inference on 2022 Facebook data, see pipeline_2022.sh. This pipeline should take about 20 minutes to run on a laptop.
To run the scripts in this repo, install R and Python along with the packages the scripts require:
- First, make sure you have R installed. While R can be run from the terminal, many people find it easier to use RStudio along with R. Here is a tutorial for setting up R and RStudio. These scripts have been tested with R 4.1, 4.2, 4.3, and 4.4.
- Next, make sure you have the following packages installed in R. The exact version of each package we used is listed in the requirements_r.txt file; these are the versions we tested our scripts with, so the scripts might also work with other versions, but we have not tested them. You can install the packages by calling:

  ```R
  install.packages("data.table")
  install.packages("stringr")
  install.packages("stringi")
  install.packages("dplyr")
  install.packages("tidyr")
  ```
- To successfully run each R script, you must first set your working directory. The working directory is the location on your computer that R will use for reading and writing files. You can set it by adding the line

  ```R
  setwd("your/working/directory")
  ```

  to the top of the R scripts, replacing "your/working/directory" with your actual working directory. Make sure that your working directory agrees with the paths where any input files exist and where output files will be created. For instance, in the script 01_prepare_fbel.R the input and output paths are defined as follows:

  ```R
  # Input data
  path_input_data <- "data/fbel_w_train.csv"
  # Output data
  path_output_data <- "data/fbel_prepared.csv"
  ```

  If you do not wish to change either of these paths, your working directory should be set as follows:

  ```R
  setwd("/local/path/to/ad_goal_classifier/")
  ```

  where /local/path/to/ represents the location at which the ad_goal_classifier folder resides on your computer.
- First, make sure you have Python installed. The scripts have been tested with Python 3.9 and 3.10.
- In addition, make sure you have the following packages installed in Python. The exact version of each package we used is listed in the requirements_py.txt file; these are the versions we tested our scripts with, so the scripts might also work with other versions, but we have not tested them. You can install them by running the following in your command line:

  ```bash
  pip install pandas
  pip install scikit-learn
  pip install numpy
  pip install joblib
  pip install tqdm
  ```
NOTE: As the scripts have been tested with the package versions provided in requirements_py.txt and requirements_r.txt, we recommend using those same versions in order to avoid potential errors.
NOTE: Some parts of the data in the datasets repo include TV data. Due to contractual reasons, users must apply directly to receive raw TV data. Visit http://mediaproject.wesleyan.edu/dataaccess/ and fill out the online request form for accessing TV data.
For the inference scripts, you will need to download the following input files from Figshare. Please note that on Figshare you will have to fill out a form, after which you will immediately get data access:
- For Facebook 2022 (04_prepare_fb2022.R): fb_2022_adid_text.csv.gz
- For Google 2022 (04_prepare_google_2022.R): g2022_adid_01062021_11082022_text.csv.gz
Legacy (non-2022) data:
- For Facebook 2020 (04_prepare_140m.R): fb_2020_140m_adid_text_clean.csv.gz. This data will be made available when it is ready.
- For Google 2020 (04_prepare_google_2020.R): google_2020_adid_text_clean.csv.gz
NOTE: If you do not want to train the models from scratch, you can use the trained models we provide here and skip to 2.4.2.
To run the inference scripts, you first need to train a machine learning classification model. The training data is the FBEL dataset, located here. Here is the codebook of the dataset.
We prepare the training data, which is then used to train a Random Forest classification model (80/20 train/test split). First, run 01_prepare_fbel.R and 02_create_training_data.py to prepare the training data. Second, run 03_binomial_goal_clf_train.py to train the model on the training set and record its performance on the test set. Performance scores for each goal are saved in performance/rf/. Note that we train nine different binary classification models, one for each goal category.
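For orientation, here is a minimal sketch of what one of these binary training runs looks like in scikit-learn. The column names (text, goal_donate), the TF-IDF features, the hyperparameters, and the model output path are illustrative assumptions, not the exact configuration used in 03_binomial_goal_clf_train.py.

```python
# Minimal sketch of training one binary goal classifier (illustrative only;
# column names, features, and paths are assumptions, not the actual script).
import pandas as pd
import joblib
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

df = pd.read_csv("data/fbel_prepared.csv")  # prepared by the 01/02 scripts

# 80/20 train/test split, matching the split described above.
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["goal_donate"], test_size=0.2, random_state=123
)

vectorizer = TfidfVectorizer()
clf = RandomForestClassifier(n_estimators=500, random_state=123)
clf.fit(vectorizer.fit_transform(X_train), y_train)

# Record held-out performance, one report per goal category.
print(classification_report(y_test, clf.predict(vectorizer.transform(X_test))))

# Save the fitted objects so an inference script can reuse them later.
joblib.dump((vectorizer, clf), "rf_goal_donate.joblib")
```

In the actual pipeline this pattern is repeated once per goal category, producing nine saved models.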
To execute a Python script, run the following command from your terminal, from within the directory of the script, replacing file.py with the name of the script you want to run:

```bash
python3 file.py
```
To execute an R script, run the following command from your terminal, from within the directory of the script, replacing file.R with the name of the script you want to run:

```bash
Rscript file.R
```
The scripts that begin with 04 are all alternatives of each other, with each one preparing a different dataset so that it is in the same shape as the training data (a minimal sketch of this preparation step follows the list below). Here is a list of the scripts and the input files you will need, along with links to those files. Please note that for the Figshare links, you will have to fill out a form, after which you will immediately get data access:
- For Facebook 2022 (04_prepare_fb2022.R): fb_2022_adid_text.csv.gz
- For Google 2022 (04_prepare_google_2022.R): g2022_adid_01062021_11082022_text.csv.gz
Legacy (non-2022) data:
- For Facebook 2020 (04_prepare_140m.R): fb_2020_140m_adid_text_clean.csv.gz. This data will be made available when it is ready.
- For Google 2020 (04_prepare_google_2020.R): google_2020_adid_text_clean.csv.gz
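Conceptually, each 04 script puts its input file into the same shape as the training data, for example an ad identifier plus a combined text field. The pandas sketch below illustrates one way such a reshaping could look; the actual 04 scripts are written in R, and every column name used here is a hypothetical placeholder, not the real schema.

```python
# Illustrative reshaping step (the real 04_* scripts are in R; all column
# names below are hypothetical placeholders, not the actual schema).
import pandas as pd

ads = pd.read_csv("data/fb_2022_adid_text.csv.gz")
text_cols = ["ad_creative_body", "page_name"]  # hypothetical text fields
ads["combined_text"] = ads[text_cols].fillna("").astype(str).agg(" ".join, axis=1)
ads[["ad_id", "combined_text"]].to_csv("data/fb_2022_prepared.csv.gz", index=False)
```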
The scripts that begin with 05 are also all alternatives of each other, with each one running inference on a different dataset.
For the scripts that begin with 04 and 05, you can tell which dataset they run on from the end of their file names. For example, 05_binomial_goal_clf_inference_google_2022.py runs inference on the data-post-production/google_2022 dataset. In addition, this information can be seen in each script's input data path, which points to the dataset it references.
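As a rough sketch of what a 05 inference script does, the snippet below loads a saved model and scores a prepared dataset. The model file name, saved-object layout, and column names are assumptions for illustration; only the output path comes from the Output section of this README.

```python
# Illustrative inference pass (model file name, saved-object layout, and
# column names are assumptions; only the output path comes from this README).
import pandas as pd
import joblib

vectorizer, clf = joblib.load("rf_goal_donate.joblib")
ads = pd.read_csv("data/fb_2022_prepared.csv.gz")

features = vectorizer.transform(ads["combined_text"].fillna(""))
ads["goal_donate"] = clf.predict(features)

ads.to_csv("data/ad_goal_rf_fb2022.csv.gz", index=False)
```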
Scripts beginning with 06 and 07 test whether the model trained only on Facebook ads can also be applied to TV ads. They do so by applying the models to 2020 TV ads labeled by WMP.
Scripts beginning with 11 and 12 are inference scripts for 2020 Facebook and Google electoral candidate ads only.
Scripts beginning with 99 use Facebook's regional distribution to determine the proportion of an ad's monetary spend that goes into an electoral candidate's own state. This information is intended to be used together with the goal classifier to determine, for example, what proportion of donate ads are aimed outside of the candidate's state.
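As a toy illustration of that idea, the snippet below computes the share of one ad's spend that falls in the candidate's home state from a per-state spend breakdown. The data structure and numbers are hypothetical and do not reflect Facebook's actual regional distribution format.

```python
# Hypothetical per-state share of one ad's spend (values sum to 1.0).
regional_share = {"GA": 0.62, "FL": 0.21, "SC": 0.17}
total_spend = 5_000  # hypothetical total spend for the ad, in USD
candidate_state = "GA"

in_state_proportion = regional_share.get(candidate_state, 0.0)
in_state_spend = total_spend * in_state_proportion
print(f"{in_state_proportion:.0%} of spend (${in_state_spend:,.0f}) stayed in {candidate_state}")
```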
The output data for the scripts in this repo is in .csv format (in some cases gzip-compressed as .csv.gz); the name and location of each output file is specified in the script that creates it. For example, the output data for 05_binomial_goal_clf_inference_fb_2022.py is stored at data/ad_goal_rf_fb2022.csv.gz.
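For example, once that file exists it can be loaded directly with pandas; the exact column layout is an assumption, so inspect the file for the actual goal columns.

```python
import pandas as pd

# pandas reads gzip-compressed CSVs transparently based on the .gz extension.
results = pd.read_csv("data/ad_goal_rf_fb2022.csv.gz")
print(results.head())  # expected: one row per ad with a 0/1 column per goal (assumed layout)
```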
We would like to thank our supporters!
This material is based upon work supported by the National Science Foundation under Grant Numbers 2235006, 2235007, and 2235008.
The Cross-Platform Election Advertising Transparency Initiative (CREATIVE) is a joint infrastructure project of the Wesleyan Media Project and privacy-tech-lab at Wesleyan University in Connecticut.