NOTE: This repository is for usability study purposes only. The main Entity-Linking repository is here: https://github.com/Wesleyan-Media-Project/entity_linking_2022.
Welcome! This repository contains scripts for identifying and linking election candidates and other political entities in political ads on Google and Facebook. The scripts provided here are intended to help journalists, academic researchers, and others interested in the democratic process to understand which political entities are connected and how.
This repository is a part of the Cross-platform Election Advertising Transparency Initiative (CREATIVE). CREATIVE is an academic research project that has the goal of providing the public with analysis tools for more transparency of political ads across online platforms. In particular, CREATIVE provides cross-platform integration and standardization of political ads collected from Google and Facebook. CREATIVE is a joint project of the Wesleyan Media Project (WMP) and the privacy-tech-lab at Wesleyan University.
To analyze the different dimensions of political ad transparency we have developed an analysis pipeline. The scripts in this repo are part of the Data Classification Step in our pipeline.
1. Video Tutorial
2. Overview
3. How to Run the Scripts
4. Results Storage
5. Results Analysis
6. Thank You
NOTE: This video corresponds to an earlier version of the repository. We no longer use the file organization shown in it, but the code is the same; it has simply been moved into the corresponding sections of the Jupyter notebooks.
Entity_Linking_Tutorial_Draft1.mp4
If you are unable to see the video above (e.g., you are getting the error "No video with supported format and MIME type found"), try a different browser. The video works on Google Chrome. Alternatively, you can watch this tutorial on YouTube.
This repository contains an entity-linker for 2022 election data. The entity-linker is a machine-learning classifier trained on data that contains descriptions of people and their names, along with their aliases. Data are sourced from the 2022 WMP files person_2022.csv and wmpcand_120223_wmpid.csv, two comprehensive files with names of candidates and other people in the political process. Data are restricted to general election candidates and other non-candidate people of interest (sitting senators, cabinet members, international leaders, etc.).
While this repository applies the trained entity-linker to 2022 US elections ads, you can also apply it to your own political ad text datasets to identify which people of interest are mentioned in ads. It is especially useful if you have a large amount of ad text data and you do not want to waste time counting how many times a political figure is mentioned within these ads. You can follow the setup instructions below to apply the entity-linker to your own data.
There are separate folders for running the entity-linker depending on whether you want to run it on Facebook or Google data. For both Facebook and Google, the scripts need to be run in the order of three tasks: (1) constructing a knowledge base of political entities, (2) training the entity-linking model, and (3) making inferences with the trained model. The repo provides reusable code for these three tasks, which are described below.
Note that we provide a pre-trained entity-linking model that is ready for your use on Google and Facebook 2022 data. If you are using this pre-trained model, you can bypass the knowledge base and train steps and skip straight to making inferences. However, if you want to apply our inference scripts to a different time period (for example, another election cycle) or in a different context (for example, a non-U.S. election), then you would need to create your own knowledge base and train your own models.
- Constructing a Knowledge Base of Political Entities

  The first task is to construct a knowledge base of political entities (people) of interest.

  The knowledge base of people of interest is constructed in the `knowledge_base` section of `facebook/facebook.ipynb`. The input is the data sourced from the 2022 WMP persons file person_2022.csv. The script constructs one sentence for each person with a basic description. Districts and party are sourced from the 2022 WMP candidates file wmpcand_120223_wmpid.csv, a comprehensive file with names of candidates.

  The knowledge base has four columns that hold each entity's `id`, `name`, `descr` (for description), and `aliases`. Examples of aliases include Joseph R. Biden being referred to as Joe or Robert Francis O’Rourke generally being known as Beto O’Rourke. Here is an example of one row in the knowledge base:

  id | name | descr | aliases |
  ---|---|---|---|
  WMPID1770 | Adam Gray | Adam Gray is a Democratic candidate for the 13rd District of California. | Adam Gray,Gray,Adam Gray's,Gray's,ADAM GRAY,GRAY,ADAM GRAY'S,GRAY'S |
- Training the Entity-Linking Model

  The second task is to train an entity-linking model using the knowledge base.

  Once the knowledge base of people of interest is constructed, the entity-linker can be initialized with spaCy, a natural language processing library we use, in the `train` section of `facebook/facebook.ipynb`.

  After successfully running the above scripts, you should see the following trained models:

  - `intermediate_kb`
  - `trained_entity_linker`
- Making Inferences with the Trained Model

  The third task is to make inferences with the trained model to automatically identify and link entities mentioned in new political ad text.

  To perform this task you can use the scripts in the `inference` sections of the respective notebooks, `facebook/facebook.ipynb` and `google/google.ipynb`. The sections include variations of scripts to disambiguate people, for example, multiple "Harrises" (e.g., Kamala Harris and Andy Harris).
Note: If you are using our pre-trained entity-linker model, you should skip the `knowledge_base` and `train` sections of the notebooks. The model is available for download on our Figshare, which you can access by following this link and completing the Data Access Form.
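The alias pattern visible in the Adam Gray knowledge-base row shown earlier can be reproduced with a few lines of Python. This is a minimal sketch, not the notebook's actual code: the function names `make_aliases` and `make_kb_row` are hypothetical, and the sketch only covers the mechanical variants seen in that row (full name, last name, possessives, upper case). Nickname aliases such as "Joe" or "Beto" would need to come from the source data.

```python
def make_aliases(full_name: str) -> list[str]:
    """Return alias variants in the order seen in the example row (assumed pattern)."""
    last_name = full_name.split()[-1]
    base = [full_name, last_name]
    # Possessive forms, e.g. "Adam Gray's" and "Gray's".
    possessive = [f"{name}'s" for name in base]
    mixed_case = base + possessive
    # Upper-case copies of every variant, e.g. "ADAM GRAY'S".
    upper_case = [alias.upper() for alias in mixed_case]
    return mixed_case + upper_case


def make_kb_row(wmpid: str, name: str, descr: str) -> dict[str, str]:
    """Assemble one knowledge-base row with the four documented columns."""
    return {
        "id": wmpid,
        "name": name,
        "descr": descr,
        "aliases": ",".join(make_aliases(name)),
    }


row = make_kb_row(
    "WMPID1770",
    "Adam Gray",
    "Adam Gray is a Democratic candidate for the 13rd District of California.",
)
print(row["aliases"])
```

Taking the last whitespace-separated token as the last name is a simplification; names with suffixes or multi-word surnames would need extra handling.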
- Open up the appropriate notebook:

  a. Click here to open up `facebook.ipynb` in a Google Colab environment.

  b. Click here to open up `google.ipynb` in a Google Colab environment.

- Run the `Environment Setup` section of the notebook to install the packages necessary for running the code.

- Proceed to the appropriate section and follow the instructions provided in the notebook.
After successfully running the inference scripts, you should see the entity-linking results in the `data` folder. The data will be in `csv.gz` and `csv` format. The various Facebook results, for instance, are as follows:

- `entity_linking_results_fb22.csv.gz`: Political entity detection results at the ad ID and text field level. Detected entities in each textual variable (e.g., disclaimer, creative bodies, detected OCR text) are stored in a list. Each textual variable can have multiple detected entities or none. Entities are represented by their WMPIDs, which are WMP's unique identifiers for political figures.
- `entity_linking_results_fb22_notext.csv.gz`: This file drops the text column from `entity_linking_results_fb22.csv.gz` to save space (see the preview table below as an example).
- `detected_entities_fb22.csv.gz`: A compact entity-linking results file at the ad ID level. It concatenates all detected entities (given by `entity_linking_results_fb22.csv.gz`) from all textual fields of each ad ID.
- `detected_entities_fb22_for_ad_tone.csv.gz`: Filtered entity-linking results (compared to `detected_entities_fb22.csv.gz`) prepared as input for ad tone detection (a downstream classification task). It excludes detected entities from page names and disclaimers and aggregates text field level results to the ad ID level (see this script).
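The step from field-level results to ad-level results described in the list above can be sketched in plain Python. This is an illustration under assumptions, not the repository's implementation: the function name `aggregate_by_ad` is hypothetical, and the excluded field names (`page_name`, `disclaimer`) are guesses at how page names and disclaimers are labeled in the `field` column.

```python
from collections import defaultdict

# Assumed labels for the fields dropped before ad tone detection;
# the actual names in the data may differ.
EXCLUDED_FIELDS = {"page_name", "disclaimer"}


def aggregate_by_ad(rows, excluded_fields=frozenset()):
    """Concatenate detected entities across text fields for each ad ID."""
    by_ad = defaultdict(list)
    for row in rows:
        if row["field"] in excluded_fields:
            continue  # skip fields that should not contribute entities
        by_ad[row["ad_id"]].extend(row["text_detected_entities"])
    return dict(by_ad)


# Made-up field-level rows for illustration only.
rows = [
    {"ad_id": "x_1234", "field": "page_name",
     "text_detected_entities": ["WMPID1234"]},
    {"ad_id": "x_1234", "field": "ad_creative_body",
     "text_detected_entities": ["WMPID1234", "WMPID1770"]},
    {"ad_id": "x_5678", "field": "disclaimer",
     "text_detected_entities": ["WMPID1770"]},
]

all_entities = aggregate_by_ad(rows)
for_ad_tone = aggregate_by_ad(rows, EXCLUDED_FIELDS)
print(all_entities)
print(for_ad_tone)
```

In this sketch an ad whose only detections come from excluded fields (here `x_5678`) does not appear in the filtered result at all; whether the real files keep such ads with an empty list is not something the sketch tries to settle.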
Here is an example of the entity-linking results in `facebook/data/entity_linking_results_fb22.csv.gz`:
text | text_detected_entities | text_start | text_end | ad_id | field |
---|---|---|---|---|---|
Senator John Smith is fighting hard for Californians. | WMPID1234 | [8] | [18] | x_1234 | ad_creative_body |
In this example,

- The `text` field contains the raw ad text where entities were detected.
- The `text_detected_entities` field contains the detected entities in the ad text. They are listed by their WMPID, the unique ID that the Wesleyan Media Project assigns to each candidate in the knowledge base (e.g., Adam Gray: WMPID1770). The WMPID is used to link the detected entities to the knowledge base.
- The `text_start` and `text_end` fields indicate the character offsets where the entity mention appears in the text.
- The `ad_id` field contains the unique identifier for the ad.
- The `field` field contains the field in the ad where the entity was detected. This could be, for example, the `page_name`, `ad_creative_body`, or `google_asr_text` (texts that we extract from video ads through Google Automatic Speech Recognition).
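To make the offset fields concrete, the following sketch recovers the mention from the example row using its `text_start` and `text_end` values. It assumes that the offsets are stored in the CSV as string representations of lists (such as `[8]`) and that the end offset is exclusive, which is consistent with the example row; the helper name `extract_mentions` is hypothetical.

```python
import ast


def extract_mentions(text, starts, ends):
    """Slice each entity mention out of the text using paired offsets."""
    return [text[start:end] for start, end in zip(starts, ends)]


# The example row from the preview table, with offsets as they might
# appear after reading the CSV (assumption: list-like strings).
row = {
    "text": "Senator John Smith is fighting hard for Californians.",
    "text_start": "[8]",
    "text_end": "[18]",
}

# literal_eval safely turns "[8]" into the Python list [8].
starts = ast.literal_eval(row["text_start"])
ends = ast.literal_eval(row["text_end"])

print(extract_mentions(row["text"], starts, ends))  # ['John Smith']
```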
The `csv.gz` files produced in this repo are usually large and may contain millions of rows. To make the data easier to read and analyze, we provide two scripts, `readcsv.py` and `readcsvGUI.py`, in the `analysis` folder of this repo.
`readcsv.py` is a Python script that reads and filters the `csv.gz` files and saves the filtered data in an Excel file. It has the following features:
- Load a specified number of rows from a CSV file.
- Skip a specified number of initial rows before reading the data.
- Filter rows based on the presence of a specified text (case-insensitive).
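The three features above can be approximated with the Python standard library alone. The sketch below is not the code of `readcsv.py`; the function name `read_filtered` is hypothetical, and two behaviors are assumptions: skipped rows are counted after the header row, and the text filter is applied to the rows that were read rather than before the row limit.

```python
import csv
import gzip


def read_filtered(path, nrows=10000, skiprows=0, filter_text=""):
    """Read up to `nrows` data rows from a csv.gz file after skipping
    `skiprows` data rows, keeping rows that contain `filter_text`."""
    needle = filter_text.lower()
    matches = []
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)  # first line is used as the header
        for index, row in enumerate(reader):
            if index < skiprows:
                continue  # still inside the skipped prefix
            if index >= skiprows + nrows:
                break  # row limit reached
            # Join all non-missing cell values for a case-insensitive search.
            haystack = " ".join(
                str(value) for value in row.values() if value is not None
            ).lower()
            if not needle or needle in haystack:
                matches.append(row)
    return matches
```

Streaming the file row by row keeps memory use proportional to the number of matching rows, which matters for files with millions of rows.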
Both the `facebook/facebook.ipynb` and `google/google.ipynb` notebooks have `Results Analysis Example` sections, complete with examples of commands that can be run on the various data files. Additional instructions on how to run this section can be found in each respective notebook.
You can further customize the behavior of the `readcsv.py` script by providing any of these command-line arguments:

- `--file`: Path to the csv file (required).
- `--skiprows`: Number of rows to skip at the start of the file (default: 0).
- `--nrows`: Number of rows to read from the file (default: 10000).
- `--filter_text`: Text to filter the rows (case-insensitive). If empty, no filtering is applied (default: no filter).
For example, to filter rows containing the text "Biden", starting from row 0 and reading 100000 rows:
!python /content/readcsv.py --file /content/entity_linking_results_fb22.csv.gz --nrows 100000 --filter_text Biden
To see a help message with the description of all available arguments, you can run the following command:
!python /content/readcsv.py --h
Please note that this script may take a while (>10 min) to run depending on the size of the data and the number of rows you requested. If you request the script to read more than 1048570 rows, the output is saved in multiple Excel files because of the maximum number of rows Excel can handle.
If you feel comfortable working with Terminal and would like results presented in a graphical user interface, you can read instructions on how to set up and run our `analysis/readcsvGUI.py` script here.
We would like to thank our supporters!
This material is based upon work supported by the National Science Foundation under Grant Numbers 2235006, 2235007, and 2235008.
The Cross-Platform Election Advertising Transparency Initiative (CREATIVE) is a joint infrastructure project of the Wesleyan Media Project and privacy-tech-lab at Wesleyan University in Connecticut.