Welcome! This repo provides scripts for collecting Google political ads data using BigQuery. The scripts create tables in BigQuery, set up scheduled queries to import data, and analyze the data.
This repo is a part of the Cross-platform Election Advertising Transparency Initiative (CREATIVE). CREATIVE aims to give the public analysis tools that make political advertising across online platforms more transparent. In particular, CREATIVE provides cross-platform integration and standardization of political ads collected from Google and Facebook. CREATIVE is a joint project of the Wesleyan Media Project (WMP) and the privacy-tech-lab at Wesleyan University.
To analyze the different dimensions of political ad transparency we have developed an analysis pipeline. The scripts in this repo are part of the Data Collection Step in our pipeline.
- 1. Video Tutorial
- 2. Overview of data in BigQuery
- 3. Setup
- 4. Thank You
Google_ads_archive_Tutorial.mp4
If you are unable to see the video above (e.g., you are getting the error "No video with supported format and MIME type found"), try a different browser. The video works on Google Chrome.
Alternatively, you can watch this tutorial on YouTube.
Both Facebook and Google started their archives of political ads at about the same time, in May 2018. The approaches that the two companies took are quite different. Facebook posts CSV files with summary statistics and has an API that people outside Facebook can use to search for the ads. Google did not create an API and instead offers a web portal and summary reports. Google's reports are available as CSV files and as a dataset hosted in the `bigquery-public-data` project in BigQuery, a data warehouse in the Google Cloud Platform (GCP). Below is a screenshot of the tables in the dataset:
The following tables are of particular interest:

- `advertiser_declared_stats`: provides regulatory information on the advertiser.
- `advertiser_weekly_spend`: weekly updates on advertisers' spending. If an advertiser was not active in a specific week, there is no record for that week. Spend is reported in increments of $100.
- `advertiser_stats`: "lifetime" total stats for advertisers.
- `creative_stats`: "lifetime" total stats about the ads, one row per ad.
Even though, officially, the political ads archive is updated once a week, the tables in the dataset are updated more frequently. For instance, the `creative_stats` table is updated several times a day. We took advantage of these frequent updates and implemented a solution that is based on Google BigQuery and collects periodic snapshots of the "lifetime" tables, i.e., `advertiser_stats` and `creative_stats`.
This section will guide you through the steps of creating the tables in BigQuery so you can replicate our workflow.
- If you have not done so yet, register with Google Cloud Platform (GCP) and create a project. The scripts in this repo expect that you have a project named `wmp-sandbox`. If you choose a different name, please don't forget to modify it in the scripts as well. For more details about creating a GCP project, please see the official documentation.
- Go to the BigQuery page in the GCP console. You can find the "BigQuery" card in the project dashboard, or you can search for it in the drop-down list of services. You can also navigate there directly by following the link to BigQuery.
- Create a BigQuery dataset. The scripts in this repo expect you to have a dataset named `my_ad_archive`. To create a dataset, locate your project in the EXPLORER tab in the BigQuery console, click on the vertical ellipsis next to the name of your project, and choose "Create Dataset". For more details, please see the documentation page for creating a dataset; select the tab that describes the console-based workflow.
- Copy-paste the SQL statements from the file create_tables.sql into the code editor in the BigQuery console and run them. They will create two tables: `google_advertiser_agg_spend` and `google_creative_delta`. The `google_advertiser_agg_spend` table will store the snapshots of the data from the advertiser spend table in the source dataset, giving you snapshots of the advertisers' cumulative spending. The `google_creative_delta` table will store snapshots of the metrics for the ads running on the Google platform. A minimal sketch of the table-creation pattern is shown after this list. Note: the statements in the file perform "create or replace ..." operations: if the tables do not exist, they will be created; however, if the tables do exist, they will be overwritten and you will lose whatever data you already have in them.
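For orientation, here is a minimal sketch of the "create or replace" pattern the script uses. The column list and types below are illustrative only; the statements in create_tables.sql are the authoritative definitions.

```sql
-- Illustrative sketch only: an abridged version of the "create or replace"
-- pattern. The column list and types are assumptions; see create_tables.sql
-- for the actual table definitions.
CREATE OR REPLACE TABLE `wmp-sandbox.my_ad_archive.google_advertiser_agg_spend` (
  advertiser_id   STRING,
  advertiser_name STRING,
  spend_usd       INT64,
  import_date     DATE,
  import_time     TIMESTAMP
);
```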
Please read the "Potential adjustments and issues" section below for a discussion of how you might need to adjust the queries.
BigQuery has a functionality known as "scheduled queries" --- the user can define a data import query and that query will run on a schedule. This functionality is similar to having crontab jobs running on a regular server. A nice feature is that, in case of failure, BigQuery will send an email notification. For technical details, please see the scheduling documentation.
We take a full snapshot of the `advertiser_stats` table once a day. For the `creative_stats` table, we query it every hour and keep only the new records. The queries are provided in the SQL script files in this repository: `add_g_advertiser_spend.sql` and `daily_delta_g_creatives.sql`.
Below is the list of the columns imported from the advertiser spend table. It is assumed that you are based in the United States and are interested in US advertisers. Thus, we are importing only the column `spend_usd`, which reports the advertiser spend in US dollars. If you are operating in a different country, please make sure to replace the column `spend_usd` both in the table-definition query (`create_tables.sql`) and in the scheduled query.
| advertiser_id | advertiser_name | public_ids_list | regions | elections | total_creatives | spend_usd |
|---|---|---|---|---|---|---|
In addition to the columns from the source table, the scheduled query will insert the columns `import_date` and `import_time`. They are generated from the parameters available during the execution of the query.
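To make the mechanics concrete, a trimmed-down sketch of such a snapshot insert is shown below. The column list is abridged and the exact statement lives in add_g_advertiser_spend.sql; `@run_date` and `@run_time` are supplied by BigQuery's scheduled-query runtime.

```sql
-- Abridged sketch of a daily snapshot insert; see add_g_advertiser_spend.sql
-- for the full column list. @run_date and @run_time are filled in by BigQuery
-- when the scheduled query runs.
INSERT INTO `wmp-sandbox.my_ad_archive.google_advertiser_agg_spend`
  (advertiser_id, advertiser_name, spend_usd, import_date, import_time)
SELECT
  advertiser_id,
  advertiser_name,
  spend_usd,
  @run_date,
  @run_time
FROM `bigquery-public-data.google_political_ads.advertiser_stats`;
```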
| ad_id | ad_url | ad_type | regions | advertiser_id | advertiser_name | ad_campaigns_list | date_range_start | date_range_end | num_of_days | impressions | spend_usd | first_served_timestamp | last_served_timestamp | age_targeting | gender_targeting | geo_targeting_included | geo_targeting_excluded | spend_range_min_usd | spend_range_max_usd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
As with the advertiser stats, we focus on US elections. This is reflected in the choice of the columns: the underlying source table contains data for the currencies of all countries where Google runs political ads (for instance, `spend_range_min_brl` and `spend_range_max_brl` for Brazil). We ignore these columns and keep only `spend_range_min_usd` and `spend_range_max_usd`. We also keep the `spend_usd` column, even though it appears that Google itself is not using it, i.e., this column is always empty.
The `daily_delta_g_creatives` query inserts only those records from the underlying table which are new. This may mean a record for an entirely new ad or a record for an ad that has changed. This is why the query has the "delta" in its name. This approach is similar to how we ingest Facebook ads, where we store a record only if it is different from the one already in our system. See the "Exclusion of duplicate records" section in the `fb_ads_import` repository.
The fields that may change are:
- `date_range_end`: end of the range of dates during which the ad was active
- `num_of_days`: number of days in the date range when the ad was active
- `impressions`: a bucket for the impressions number, for instance `1000-2000`
- `last_served_timestamp`: a timestamp (date plus time down to seconds) of when the ad was last served
- `spend_range_min_usd`: the lower bound of the range for the USD spend on the ad
- `spend_range_max_usd`: the upper bound of the range for the USD spend on the ad
The "new" condition is implemented using the SELECT ... EXCEPT DISTINCT ...
clause in the query. It will include those rows that are not present in the bottom part of the query following the EXCEPT clause.
Our own table contains the columns for the date and time of the data insertion. They are derived from the parameters `@run_date` and `@run_time` available from BigQuery.
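Putting the pieces together, a condensed sketch of the delta insert looks roughly like this. Most columns are omitted for brevity, and daily_delta_g_creatives.sql is the authoritative version.

```sql
-- Condensed sketch of the "delta" insert; most columns omitted for brevity.
-- EXCEPT DISTINCT drops rows that are already present in our table, and
-- @run_date / @run_time are supplied by the scheduled-query runtime.
INSERT INTO `wmp-sandbox.my_ad_archive.google_creative_delta`
  (ad_id, advertiser_id, impressions, spend_range_min_usd, spend_range_max_usd,
   import_date, import_time)
SELECT *, @run_date, @run_time
FROM (
  SELECT ad_id, advertiser_id, impressions, spend_range_min_usd, spend_range_max_usd
  FROM `bigquery-public-data.google_political_ads.creative_stats`
  EXCEPT DISTINCT
  SELECT ad_id, advertiser_id, impressions, spend_range_min_usd, spend_range_max_usd
  FROM `wmp-sandbox.my_ad_archive.google_creative_delta`
);
```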
As an illustration, let's walk through the process of creating a query that will import new ad records every hour:
- Open the SQL editor in the console. Paste the SQL statement from the file `daily_delta_g_creatives.sql`.
- Click on the SCHEDULE -> "Create new scheduled query" menu items at the top of the editor pane. The screenshot below shows the location of the menu buttons.
- Clicking on "Create new scheduled query" will open a pop-up tab on the right of the screen. Here you will need to enter the required parameters:
  - Enter `import_creatives_delta` as the name. (Nothing in the scripts is linked to this name. It will only appear in the menu allowing you to modify or cancel the query. You can change the name if you like.)
  - Select "Hours" as the repeat frequency, and then enter `1` into the "Repeats every" field that will appear. You can modify the settings as you like. For example, we run our query every hour.
  - Leave the "destination table" fields empty. Our query contains an INSERT statement and already "knows" the destination.
  - Under "Notification settings", check the box for "Send notification emails". GCP/BigQuery will send an email to the email account associated with the owner of the project if the query fails.
  - Click "SAVE" to save the query. BigQuery will launch the first import operation.
The process for the `add_g_advertiser_spend.sql` script is the same, except you will need to pick a different name.
The console for managing the scheduled queries is accessed by clicking on the menu item in the left part of the screen (see the screenshot below).
The menu will take you to a page that will allow you to see the history of the runs and also edit the configuration.
As an example, below is a screenshot of the "Scheduled queries" dashboard. You can see that `import_creatives_delta` has successfully finished, while `add_g_advertiser_spend` was still running.
We focus on US-based activity and, because of this, the table creation scripts and the scheduled query scripts import only the columns associated with US dollars. The source tables contain columns for other currencies and countries as well. If you are interested in other countries, you need to modify the list of columns: replace `spend_range_min_usd` and `spend_range_max_usd` with the columns of your choice.
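For example, to adapt the import for Brazil, the USD spend range columns would be swapped for the corresponding BRL columns mentioned above. The bare SELECT below is only an illustration of the swap, not part of the repo's scripts.

```sql
-- Illustration of the column swap for a non-US deployment: the BRL spend
-- range columns replace spend_range_min_usd / spend_range_max_usd.
SELECT
  ad_id,
  advertiser_name,
  spend_range_min_brl,
  spend_range_max_brl
FROM `bigquery-public-data.google_political_ads.creative_stats`
LIMIT 10;
```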
There are two potential issues to be on the lookout for.
The first issue concerns the currency columns. Our initial iteration of the scripts imported all of them, and we ran into two problems:

- Some of the currency columns do not have the expected data type. For instance, in the `advertiser_spend` source table there were issues with the Indian Rupee and Hungarian Forint columns; we had to use `SAFE_CAST(xxxx AS STRING)` to import them as strings instead (see the sketch after this list).
- Changes in schema. Google may add new columns for the currencies of countries that are added to the archive; this happened with the New Zealand Dollar and the Brazilian Real. To avoid these problems, our script hard-codes the list of columns. If you choose to use a `select * ...` statement to import multiple currency columns, you need to pay attention to possible schema changes. Fortunately, the email notifications are very prompt, and you will be aware of the problem within minutes after it happens.
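As an illustration of the first problem, the cast looks like this. The column name `spend_inr` is only an example; substitute whichever currency column trips the type check in your own import.

```sql
-- Coerce a problematic currency column to STRING so the import does not fail.
-- spend_inr is only an example column name; use the column that raises the
-- type error in your own import.
SELECT
  advertiser_id,
  SAFE_CAST(spend_inr AS STRING) AS spend_inr
FROM `bigquery-public-data.google_political_ads.advertiser_stats`;
```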
The second issue does not affect the ability to import the data, but it does impact the ability to view the ad.
An ad record includes information about the advertiser, specifically the advertiser ID and the full URL for viewing the ad on the Ad Transparency website. In some situations, Google assigns a new ID to an advertiser (one possible scenario is a merger of several advertisers). When that happens, there will be more than one record for the same ad: the ad ID remains the same, but the advertiser information differs.
As a result, the `ad_url` field in the old ad record is no longer valid: the ad URLs include the advertiser ID, and once the old ID is retired, the URL no longer works. If a user follows an old URL, they will land on a page saying "ad no longer available," even though the ad is still in the archive.
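If you want to see how often this affects your own archive, a query along the following lines (run against the `google_creative_delta` table created above) lists ads that appear under more than one advertiser ID:

```sql
-- Ads whose records carry more than one advertiser_id, i.e., ads whose
-- older ad_url values may no longer resolve.
SELECT
  ad_id,
  COUNT(DISTINCT advertiser_id) AS n_advertiser_ids
FROM `wmp-sandbox.my_ad_archive.google_creative_delta`
GROUP BY ad_id
HAVING n_advertiser_ids > 1
ORDER BY n_advertiser_ids DESC
LIMIT 20;
```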
Once you have your scheduled queries running, you can start analyzing the data directly in the browser. Here is an example query that will return data on 10 ads that had more than one record. This kind of data is useful in determining the cost per impression of an ad:
```sql
with a as (select ad_id, count(*) as N
  from `wmp-sandbox.my_ad_archive.google_creative_delta`
  where regions = 'US'
  group by ad_id
  having N > 1
  order by N desc
  limit 10)
select ad_id, advertiser_name, ad_type, ad_url, impressions, spend_range_min_usd, spend_range_max_usd, import_time
from `wmp-sandbox.my_ad_archive.google_creative_delta`
inner join a using (ad_id)
order by ad_id, import_time;
```
The query below will return the latest impressions for 10 ads owned by the DCCC (Democratic Congressional Campaign Committee):
```sql
with a as (select ad_id, max(import_time) as max_time
  from my_ad_archive.google_creative_delta
  where regions = 'US'
    and advertiser_name = 'DCCC'
  group by ad_id
  limit 10)
select x.ad_id, advertiser_name, ad_type, ad_url, impressions, spend_range_min_usd, spend_range_max_usd, import_time
from my_ad_archive.google_creative_delta as x
inner join a
  on a.ad_id = x.ad_id
  and a.max_time = x.import_time
order by ad_id, import_time;
```
You can also use the BigQuery connector in Google Sheets to work with the data. Read this document for instructions.
In contrast to Facebook/Meta, Google's archive does not include the content of the ads, even when the ad consists only of text. The only content-related field is `ad_type`, which takes the values `TEXT`, `IMAGE`, or `VIDEO`. To retrieve the contents of the ads we, with the permission of the Google Political Ads Transparency team, scrape the ads and store them in a local database on a server maintained by WMP.
We would like to thank our supporters!
This material is based upon work supported by the National Science Foundation under Grant Numbers 2235006, 2235007, and 2235008.
The Cross-Platform Election Advertising Transparency Initiative (CREATIVE) is a joint infrastructure project of the Wesleyan Media Project and privacy-tech-lab at Wesleyan University in Connecticut.