Controlling for higher degrees of polynomials for the forcing variable is a common practice in regression discontinuity analysis. Such models can overfit, resulting in substantively implausible causal inferences that randomly contribute variation to the high-degree polynomial and can lead to false estimates. The aim of this paper is to investigate the three arguments presented in Gelman and Imbens (2019).
The arguments recommends against the use of high-order polynomial approximations are:
- the implicit weights are inappropriate
- the estimates are sensitive to the degree of polynomial, and
- poor coverage of confidence intervals.
Instead of using global high-order polynomial, the paper suggests using local linear or quadratic polynomial approximations.
Regression discontinuity designs have significantly gained importance in econometric and causal inference over the past few decades. Introduced in 1960 by Donald L. Thistlethwaite and Donald T. Campbell, the regression discontinuity design aims to estimate the impact of a treatment policy on the variable of interest. The simplicity of the design enables its application across research fields, like health, labor, education, political economy, etc.
The basic idea behind the regression discontinuity (RD) design is that the allotment of the treatment is based on the value of the predictor (also known as assignment/ forcing variable) being on either side of a fixed threshold. Further, the assignment variable may be related to the potential outcome, however, this relation is assumed to be smooth which implies that any discontinuity at the cutoff point in the conditional expectation of the outcome as a function of the assignment variable is because of the treatment and this discontinuity gap represents the causal effect of the treatment. For instance, Thistlethwaite and Campbell studied the role of merit awards on future academic outcomes by allocating awards only to the students who had scored above a threshold score value in a test, while the students who scored below the cutoff point did not receive any award.
In regression discontintuity, the average treatment effect can be estimated as the discontinuity in the conditional expectation of the outcome given the assignment variable at the threshold. The equaiton then can be represented as:
which can be simply written as:
Here,
This can also be seen in figure below, which illustrates a sharp RD design with the threshold at 6. Individuals right to the threshold are given the treatment, while those to the left are the control group. The dashed line represents the value of the potential outcome for the treated group and control group if they were observable beyond their treatment status. The smooth lines indicate the observed outcome conditional on the assignment variable. Under the continuity assumption, therefore, the average treatment effect, τSRD is given by the jump in outcome value at threshold 6.
Sourced from Imbens and Lemieux (2007). X-Axis is the forcing variable, while Y-Axis is the outcome value
In the above equation the problem is the estimate the values of
Since researchers are often unsure about the efficiency of the estimates based on global linear function,
they often use polynomial functions of order fifth or sixth of the assignment variable,
The other approach to estimate
Gelman and Imbens(2019)\cite{gelman2019high} believe that the former approach, that is, to use the global high-order polynomials to estimate the causal treatment effect is flawed and rather the inference should be based on local low-order polynomials. The authors bring forth threefold arguments to show global high-order polynomials as a poor model choice for regression discontinuity analysis.
-
The first argument is that the estimates of the treatment effect can be driven by the values of the forcing variable that are away from the threshold. Since the RD estimate is the difference between the weighted average of the outcome for the treated and untreated, wherein the weights depend only on the forcing variable; the weights in a global high-order polynomial approximation are extreme and unattractive compared to a local linear/quadratic approximation.
-
The second argument put forward is that the estimates in the case of global high-order polynomials are highly sensitive to the chosen degree of the polynomial of the forcing variable
$X$ . Further, a lack of just criteria to select the order of polynomials that is suitable for the estimator to satisfy its objectives of causal inference exacerbates the issue. -
Finally, the third argument is that the confidence interval based on the estimates of global high-order polynomials is often misleading. The confidence intervals can be too narrow that they reject the null hypothesis even when it shouldn't have, that is, the probability of Type 1 error rate is much higher than the nominal rate of 5%. This implies that there is a higher bias in the case of global high-order polynomials to detect a discontinuity even if there isn't one, and hence, generate poor inference.
The target of our paper is to test the veracity of these arguments through a simulation study and empirical application (wherever possible). The paper hereby proceeds as follows. In the next section, we introduce the model specification of the RD design that is used to critique the aforementioned arguments. Thereafter, we evaluate our results based on the simulation and empirical evidence so as to comment on the feasibility of global high-order polynomials in RD design.
In practice, it is convenient to run a pooled regression to obtain the average treatment effect of the SRD
rather than to estimate the values of
Where,
Hence, we can write the regression function in the form of:
The equation stated above can therefore be also considered the regression function for the global
high-order polynomial approach. The local linear/quadratic approach is distinct from the former
approach in the sense that for a certain bandwidth value
where,
The advantage of such a regression form is that it introduces varying slopes on both sides of the threshold
through the interaction term between
The objective of the simulation study is to test the arguments presented by Gelman and Imbens (2019) in favor of the local low-order polynomial approximation for an RD analysis. We conduct the study for a Sharp Regression Discontinuity framework.
To conduct a Monte Carlo study, we first develop an SRD model. We consider a model with a constant average treatment effect, which means that the only change for the treated group compared to the control is of an additive shift by a value equal to the treatment effect. Further, we consider a functional form of up to sixth-order polynomial. The model developed hereby is based on the data generating process in Imbens and Kalyanaraman 2012
The data generating process for
where, the forcing variable is derived from a beta distribution
We introduce the threshold point at
And, the random error term follows a normal distribution with mean
Based on the above specification, we simulate a dataset of sample size,
In an RD design, researchers are interested in estimating the impact of treatment at the cutoff point. Since the treatment effect estimates can be represented as the difference between the weighted average of outcomes for the treated and the control, one should expect an allocation of higher weights to the observations closer to the cutoff point. However, Gelman and Imbens (2019) argue that for the global polynomial approach, high and erratic weights are allotted to observations that are far from the threshold. This implies that the estimates calculated from such a model framework might be biased.
The weights are just a function of the assignment variable
Where,
- The values of the forcing variable,
$X$ , are above the threshold$c$ , and within the bandwidth$h$ -
$e$ is the K+1 column vector, with the first element equal to 1 and the rest elements equal to$0$ - 1$ denotes the sum vector with its length equal to the number of values of forcing variable between
$c$ and$h$ . -
$w_{i}$ is the weight allotted to each individual$i$ of the assignment variable$X$ for bandwidth value$h$ and order of polynomial$K$ . We can then obtain the weight vector,$w$ , with its elements equal to the weight allotted to each individual$i$ of the assignment variable$X$ for bandwidth value$h$ and order of polynomial$K$ . The average weight of the vector has been normalized to 1.
Given the data generating process and the weights functions defined in the above formula, we calculated the weigths results for global high-order polynomial and local polynomial. In case of the local linear and quadratic regressoin the bandwidth is calculated based on Imbens and Kalyanaraman (2012) optimal bandwidth selector. These weight vectors are thereby plotted with the observations of forcing variables above the threshold on the X-Axis and the corresponding weights on the Y-Axis which can be shown in the figure below.
Weights for Global High-Order Polynomial - Simulation Study
As can be seen from figure, the weights associated with global high-order polynomials are indeed high and erratic irrespective of the polynomial order. Further, it is also noticeable that higher weights are allotted to the observations away from the threshold. We also observe that for large values of the forcing variable, the associated weights are quite sensitive to the order of polynomials. Finally, similar to the global polynomial approach, we plot the derived weight vectors in order to inspect the weights assigned to each observation of the forcing variable that lies above the threshold.
Weights for Local Linear and Quadratic Polynomial - Simulation Study
Further, we perform the above stated steps for empirical application as well. We use the New York: MDRC, 2012 dataset for this study. we arrive at similar results as of the simulation study. The detailed analysis of this empirical application can be found in the paper.
The second argument in favor of why the researchers should not use high-order global polynomial approximations for forcing variables in a regression discontinuity design setting is that the estimates obtained from such setting are sensitive to the order of the polynomial in forcing variable.
The data generating process for RD design that is used in this simulation study is defined in
data generating process simulation study. We first created the original data of a
Order of Polynomial | Estimate (Simulation Study) | (s.e.) (Simulation Study) | Estimate (Empirical Application) | (s.e.) (Empirical Application) |
---|---|---|---|---|
global 1 | ||||
global 2 | ||||
global 3 | ||||
global 4 | ||||
global 5 | ||||
global 6 | ||||
local 1 | ||||
local 2 |
We also added the results of the empirical application in the last two columns. You can see the paper for comprehensive interpretation of the results.
The third argument is that the confidence interval for the estimates of the treatment effects can have lower nominal coverage when using global high-order polynomials for the forcing variable. On the other hand can provide a better coverage when using local linear or quadratic polynomials. For this study, we created a simulation study on an artificially generated dataset as stated in data generating process above. Further we did a similar study on a Life Expectancy (WHO) Kaggle dataset. The aim of this study is to find how efficiently the confidence interval represents the true population parameter. We estimate the treatment effect for the discontinuities in the artificial setting, where there are no discontinuities to be found. Thereafter, we calculate the rejection rates rates using confidence intervals and true parameter to check the soundness of the results.
For simulation study of this argument, we take into account the data generating process is
mentioned is mentioned in argument 2 that is defined in Data Generating Process subsection,
except for the fact that the discontinuity for the conditional expectation of the dependent
variable as a function of the forcing variable at the threshold is zero. In simpler terms,
there is no discontinuity in
First, we randomly choose
We did this exercise for the global polynomial up to the sixth order and two local polynomials.
For each iteration, we use Imbens and Kalyanaraman 2012\cite{imbens2012optimal} optimal bandwidth
selector to select the bandwidth
Figure below shows the box plot representation of the estimates of the treatment effect for
the simulation study in argument 3. Here,
Box plot showing Simulation Study Estimates
The
The Table below shows the result of global and local confidence interval rejection rate for both simulation study and empirical application. You can see the paper for comprehensive interpretation of the results of argument 3.
Order of Polynomial | Rejection Rate (Simulation Study) | Median (s.e.) (Simulation Study) | Rejection Rate (Empirical Application) | Median (s.e.) (Empirical Application) |
---|---|---|---|---|
global 1 | ||||
global 2 | ||||
global 3 | ||||
global 4 | ||||
global 5 | ||||
global 6 | $ 0.073$ | |||
local 1 | ||||
local 2 |
Regression Discontinuity Designs are one of the standard techniques used in econometrics and statistics to obtain causal effects from the observed data. One of the practices in implementing regression discontinuity designs is controlling for high-order polynomial approximations for the treatment effect of the forcing variable. This paper is a review of the paper written by Gelman and Imbens (2019) and strengthens the three arguments and recommends not to use high-order polynomials to estimate the treatment effects in regression discontinuity designs. The arguments are: the implicit weights calculated for high-order polynomial approximations are unattractive, estimates are highly sensitive to the order of polynomials, and the confidence intervals have less than nominal coverage. The arguments are presented in the context of sharp regression designs, but using local linear or quadratic regression over global polynomials is recommended for fuzzy regression designs as well.
The point of the paper is to explore the three arguments suggested by Gelman and Imbens (2019). We conduct simulation studies based on regression discontinuity and empirical applications on real data that support the authors´ arguments against using high-order polynomial approximations and instead recommends using local linear or quadratic estimates for estimating treatment effect. The question that is to be asked is that, why high-order polynomial approximations are popular despite having many unattractive properties. The motivation behind using high-order polynomials for forcing variables is that they can map a smooth function onto a compact data set. The statement is true, however high-order polynomials have a tendency to overfit boundary points and have more fluctuations at the input domain's extremes, and thus may not provide good estimates. Also, knowing the true order of relationship between the forcing variable and dependent variable is somewhat difficult; as the high-order polynomials can result in overfitting of data, where the model fits the noise or idiosyncrasies in the data rather than the underlying relationship between the variables. This is mainly in the case where the forcing variable is not a good predictor of the dependent variable. This implies to the case when including more polynomials in estimation increases noise and do not reduce the bias. Thus, we move back to our discussion of Argument 2, that the results from the estimates are sensitive to the order of polynomial. Based on the simulation results and empirical application this paper supports the arguments of the paper Gelman and Imbens (2019) and suggests to use local linear and quadratic approximations for estimates of treatment effect.
You can access the paper by clicking on this link.
Rauank Mehrotra -- s6rkmehr@uni-bonn.de
Vanshaj Bindlish -- s6vabind@uni-bonn.de