---
title: "Matching"
share:
permalink: "https://book.martinez.fyi/matching.html"
description: "Business Data Science: What Does it Mean to Be Data-Driven?"
linkedin: true
email: true
mastodon: true
author:
- name: ??
- name: Ignacio Martinez
---
In the realm of causal inference, matching stands out as a powerful and popular
statistical technique. Its primary goal? To construct a valid comparison group
by pairing treated units with untreated units that are as similar as possible
based on observable characteristics. This chapter will dive deep into the world
of matching, exploring its mechanics, applications, and limitations.
## The Bootcamp Conundrum
Imagine a tech company, eager to propel its engineers forward, rolls out a shiny
new AI bootcamp. Yet, due to scheduling quirks, the bootcamp ends up heavily
skewed towards senior engineers – those with five or more years under their
belts. This poses a classic causal inference challenge.
In the potential outcomes framework, we envision each engineer with two possible
career paths: one if they attend the bootcamp ($Y_1$), another if they don't
($Y_0$). The rub, of course, is that we only witness one reality per engineer.
The non-random enrollment in our bootcamp muddies the waters. Simply comparing
bootcamp graduates to non-participants would be like judging a footrace where
one runner had a head start. The bootcamp group, on average, boasts more
experience – a factor we know can independently turbocharge careers.
## Matching to the Rescue
To level the playing field, we construct a matched control group. For each
bootcamp attendee, we seek out a non-attendee with a similar experience level.
By comparing outcomes within these matched pairs, we can tease out the
bootcamp's true impact, disentangling it from the effects of experience.
Yet, the plot thickens. What if bootcamp participation wasn't solely about
experience? In a global company, time zones could play a role. Attending a
bootcamp during US business hours is far more convenient for an engineer in New
York than one in Tokyo. Here, time zone becomes a confounder, potentially
influencing both bootcamp attendance and career trajectory.
One might try to match on both experience and location, but this quickly becomes
unwieldy as more factors enter the picture. The elegant solution is to estimate
a propensity score – the probability of each engineer attending the bootcamp
based on their various characteristics. By matching on this propensity score, we
create comparable groups, even when those groups differ on a multitude of
individual attributes.
## The Mechanics of Matching
Matching typically involves four key steps (a short code sketch follows the list):
1. Choose a distance measure to quantify the similarity between units.
2. Match treated units to untreated units based on this distance measure.
3. Assess the quality of the matches and iterate if necessary.
4. Estimate treatment effects using the matched sample.
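To make these steps concrete, here is a minimal sketch of the full cycle using
the {MatchIt} package (covered in more detail later in this chapter). The data
frame `df` and the variables `treat`, `x1`, `x2`, and `y` are hypothetical
placeholders:

```{r matching-steps, eval=FALSE}
library(MatchIt)

# Steps 1 & 2: choose a distance measure and match on it.
# Here: nearest neighbor matching on a logistic-regression propensity score.
m.out <- matchit(treat ~ x1 + x2, data = df,
                 distance = "glm", method = "nearest")

# Step 3: assess match quality; respecify and re-match if balance is poor.
summary(m.out)

# Step 4: estimate the treatment effect on the matched sample.
md <- match.data(m.out)
lm(y ~ treat, data = md, weights = weights)
```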
Let's explore two common distance measures in detail: Mahalanobis distance and
propensity scores.
### Mahalanobis Distance: Accounting for Covariate Relationships
Mahalanobis distance is a multivariate measure of the distance between a point
and the center of a distribution. It's particularly useful in matching because
it accounts for the correlations between variables.
Key features of Mahalanobis distance include:
- Scale-invariance: It's unaffected by the scale of measurement.
- Covariance consideration: It accounts for relationships between variables.
- Euclidean equivalence: For uncorrelated variables with unit variance, it
reduces to Euclidean distance.
Mathematically, the Mahalanobis distance between two points $x$ and $y$ in
p-dimensional space is:
$$D_M(x,y) = \sqrt{(x-y)^T S^{-1} (x-y)}$$
where $S$ is the covariance matrix of the variables.
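To make the formula concrete, here is a small sketch using base R's
`mahalanobis()` function on simulated data (note that `mahalanobis()` returns
the *squared* distance $D_M^2$):

```{r mahalanobis-sketch}
set.seed(42)

# Two correlated covariates for 100 simulated units
X <- MASS::mvrnorm(100, mu = c(0, 0),
                   Sigma = matrix(c(1, 0.8, 0.8, 1), nrow = 2))
S <- cov(X)

# Squared Mahalanobis distance of every unit from unit 1
d2 <- mahalanobis(X, center = X[1, ], cov = S)

# Manual computation for unit 2 matches the built-in result
diff <- X[2, ] - X[1, ]
c(builtin = d2[2], manual = drop(t(diff) %*% solve(S) %*% diff))
```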
### Propensity Scores: Collapsing Dimensions
The propensity score represents the probability of receiving treatment given
observed covariates, often estimated using logistic regression. Key features of
propensity scores include:
- Dimension reduction: They collapse multiple covariates into a single score.
- Balance assessment: They make it easier to check balance on a single
dimension.
- Interpretability: They represent the probability of treatment.
The propensity score is given by: $$e(X) = P(T = 1 \mid X)$$
where $T$ is the treatment indicator and $X$ is the vector of covariates.
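In practice, this estimation step is a one-liner. A minimal sketch, where `df`,
`treat`, `experience`, and `timezone` are hypothetical stand-ins for your own
data:

```{r pscore-sketch, eval=FALSE}
# Estimate e(X) = P(T = 1 | X) with a logistic regression
ps_model <- glm(treat ~ experience + timezone,
                data = df, family = binomial)

# One fitted propensity score per unit
df$pscore <- predict(ps_model, type = "response")
```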
### Key Differences Between Mahalanobis Distance and Propensity Score
| Feature | Mahalanobis Distance | Propensity Score |
| ----------------------- | ------------------------------------ | --------------------------------------------------- |
| Dimensionality | Operates in original covariate space | Reduces matching to a single dimension |
| Interpretation | Measures multivariate similarity | Represents probability of treatment |
| Covariate relationships | Explicitly accounts for covariance | Implicitly captures relationships through the model |
| Model specification | Doesn't require a model | Can be sensitive to estimation method |
| Categorical variables | Can struggle with them | Naturally incorporates them |
| Curse of dimensionality | Can suffer in high dimensions | Handles higher dimensions more easily |
### When to Use Each
- **Mahalanobis distance:** Ideal when you have few continuous covariates,
relationships between covariates are important, and you want to avoid
specifying a treatment model.
- **Propensity scores:** Better suited when you have many covariates
(including categorical ones), the treatment mechanism is of interest, and
you want to easily assess balance and overlap.
### Matching Algorithms: Putting Theory into Practice
Once we've chosen a distance measure, we need an algorithm to perform the actual
matching. Three common approaches, sketched in code after this list, are:
- Nearest neighbor matching: Matches each treated unit to the closest
untreated unit.
- Optimal matching: Minimizes the total distance across all matched pairs.
- Full matching: Creates matched sets, each containing at least one treated
and one untreated unit.
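In {MatchIt}, these algorithms correspond to different values of the `method`
argument. A sketch with hypothetical data `df` (in recent versions of
{MatchIt}, optimal and full matching rely on the optmatch package):

```{r algorithms-sketch, eval=FALSE}
library(MatchIt)

# Nearest neighbor: greedily pairs each treated unit with its closest control
m_nearest <- matchit(treat ~ x1 + x2, data = df, method = "nearest")

# Optimal: minimizes the total distance across all matched pairs
m_optimal <- matchit(treat ~ x1 + x2, data = df, method = "optimal")

# Full: matched sets, each with at least one treated and one control unit
m_full <- matchit(treat ~ x1 + x2, data = df, method = "full")
```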
## The Limits of Matching: Avoiding Matching Charles to Ozzy
As with any causal inference method, matching is not a magic bullet. It works
best when you have the right data to model treatment assignment. Essentially,
after matching, whether someone is in the treatment group should be effectively
random.
For example, in our bootcamp scenario, imagine that participation is largely
explained by an engineer's "grit" – a trait we cannot directly observe or match
on. If career trajectory is also a function of grit, we might mistakenly
conclude that the bootcamp has a larger impact than it truly does. Conversely,
if procrastinators are more likely to participate, we might wrongly infer that
the bootcamp hurts career success.
A memorable way to understand this limitation is through the "Ozzy Osbourne
Conundrum." Consider these two individuals:
+--------------------------------------------------+--------------------------------------------+
| Charles | Ozzy |
+:================================================:+:==========================================:+
| | |
| ![](img/charles.webp){width=400px height=500px}  |![](img/ozzy.png){width=400px height=500px} |
| | |
| Male | Male |
| | |
| Born in 1948 | Born in 1948 |
| | |
| Raised in the UK | Raised in the UK |
| | |
| Lives in a castle | Lives in a castle |
| | |
| Wealthy & famous | Wealthy & famous |
+--------------------------------------------------+--------------------------------------------+
: Matching Charles to Ozzy {#tbl-ozzy_and_charles}
Ozzy and Charles share many observable characteristics: both are male, born in
1948, raised in the UK, living in castles, wealthy and famous. However, Ozzy
would clearly not be a good match for Charles in most studies. This example
illustrates how matching on observables alone can sometimes be misleading.
The key takeaway? Matching is a powerful tool, but it relies on the assumption
that after matching, the remaining differences between groups are essentially
random. If this assumption doesn't hold, our conclusions may be misleading.
## The Propensity Score Paradox: A Critique by King and Nielsen
In their influential paper, @King_Nielsen_2019 present a compelling critique
of propensity score matching (PSM). Their findings challenge conventional wisdom
and offer important insights for practitioners of matching methods.
### The PSM Paradox
At the heart of King and Nielsen's argument is what they term the "PSM paradox."
They demonstrate that under certain conditions, PSM can actually increase
imbalance, model dependence, and bias. This occurs because PSM approximates a
completely randomized experiment, rather than a more efficient fully blocked
randomized experiment.
Key findings include:
1. Increased Imbalance: As PSM prunes observations to improve balance, it can
paradoxically increase imbalance on the original covariates after a certain
point.
2. Model Dependence: PSM can lead to greater model dependence, meaning that
different model specifications can yield substantially different causal
estimates.
3. Bias: The combination of increased imbalance and model dependence can result
in biased causal estimates.
### The Mechanics Behind the Paradox
King and Nielsen explain that PSM's shortcomings stem from its attempt to
approximate complete randomization. In contrast, other matching methods aim to
approximate full blocking, which is generally more efficient and precise.
1. Information Loss: PSM collapses multi-dimensional covariate information into
a single dimension (the propensity score), potentially discarding valuable
information.
2. Random Pruning: Once PSM achieves its goal of approximate randomization,
further pruning of observations becomes essentially random with respect to
the original covariates. This random pruning can increase imbalance.
3. Dimensionality: The problems with PSM become more pronounced as the number
of covariates increases.
### Empirical Evidence
The authors provide evidence from both simulations and real-world datasets to
support their claims. They show that as PSM prunes more observations, other
matching methods (like Mahalanobis distance matching) continue to improve
balance, while PSM begins to worsen it.
### Recommendations
Based on their findings, King and Nielsen offer several recommendations:
1. Avoid PSM for Matching: They suggest using other matching methods that
better approximate full blocking, such as Mahalanobis distance matching or
coarsened exact matching.
2. Use PSM Carefully: If using PSM, researchers should be aware of its
limitations and stop pruning before the paradox kicks in.
3. Balance Checking: Regardless of the matching method used, researchers should
   always check covariate balance before and after matching (see the sketch
   after this list).
4. Consider Alternative Uses: While discouraging PSM for matching, the authors
note that propensity scores can be useful in other contexts, such as
weighting or subclassification.
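For the balance-checking recommendation, {MatchIt} reports standardized mean
differences before and after matching directly from the fitted object. A
sketch, assuming a `matchit` object `m.out` like the ones above:

```{r balance-sketch, eval=FALSE}
# Balance statistics for the unmatched (un = TRUE) and matched samples
summary(m.out, un = TRUE)

# Love plot of standardized mean differences (recent MatchIt versions)
plot(summary(m.out))
```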
### Implications for Practice
This critique has significant implications for how we approach matching in
causal inference:
1. Method Selection: When choosing a matching method, consider how well it
approximates full blocking rather than complete randomization.
2. Iterative Process: Matching should be an iterative process, with continuous
checks on balance and careful consideration of when to stop pruning
observations.
3. Multidimensional Balance: Pay attention to balance on the original
covariates, not just the propensity score.
4. Transparency: Given the potential for increased model dependence, it's
crucial to be transparent about the matching process and to consider
multiple model specifications.
## Practical Examples with MatchIt
The R package [{MatchIt}](https://kosukeimai.github.io/MatchIt/) provides a
comprehensive set of tools for implementing various matching methods. It was
developed based on the recommendations of [@ho2007matching] for improving
parametric models through nonparametric preprocessing.
MatchIt supports a wide range of matching techniques (a brief sketch follows the list), including:
- Exact matching
- Nearest neighbor matching
- Optimal matching
- Full matching
- Genetic matching
- Coarsened exact matching
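Each technique is selected through the `method` argument of `matchit()`. For
instance, a brief sketch of exact and coarsened exact matching on hypothetical
data `df`:

```{r cem-sketch, eval=FALSE}
library(MatchIt)

# Exact matching: units must agree on every covariate value to be paired
m_exact <- matchit(treat ~ seniority + office, data = df, method = "exact")

# Coarsened exact matching: bin the covariates, then match exactly on the bins
m_cem <- matchit(treat ~ experience + office, data = df, method = "cem")
```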
### Cautionary Tale: Unmeasured Confounders
Imagine you're a data scientist at the illustrious TechGiant Inc., a company
that recently rolled out an intensive AI bootcamp program for its engineers.
This ambitious initiative aims to elevate the workforce's skills and propel
innovation to new heights. You've been entrusted with a crucial task: to
evaluate the program's effectiveness by examining its impact on engineers'
salaries.
```{r unmeasure_confounders, message=FALSE, warning=FALSE}
library(MatchIt)
library(dplyr)
library(ggplot2)

set.seed(123)

# Generate synthetic data
n <- 1000
experience <- runif(n, 0, 10)  # Years of experience
procrastination <- rnorm(n)    # Unobserved procrastination level
bootcamp <- rbinom(n, 1, plogis(-0.3 * experience + 0.5 * procrastination))  # Bootcamp participation
salary_increase <- 2000 * bootcamp + 1000 * experience -
  9000 * procrastination + rnorm(n, 0, 5000)
# True average treatment effect is $2000

data <- data.frame(experience = experience,
                   bootcamp = bootcamp,
                   salary_increase = salary_increase)

# Naive estimate
naive_model <- lm(salary_increase ~ bootcamp, data = data)
naive_ate <- coef(naive_model)["bootcamp"]

# Matching on experience (ignoring unobserved procrastination)
m.out <- matchit(bootcamp ~ experience,
                 data = data,
                 method = "nearest",
                 ratio = 1)
matched_data <- match.data(m.out)

# Estimate ATE on matched data
matched_model <- lm(salary_increase ~ bootcamp,
                    data = matched_data,
                    weights = weights)
matched_ate <- coef(matched_model)["bootcamp"]

# Print results
cat("True ATE: $2000\n")
cat("Naive ATE estimate:", round(naive_ate, 2), "\n")
cat("Matched ATE estimate:", round(matched_ate, 2), "\n")

# Visualize results
ggplot(data, aes(x = experience,
                 y = salary_increase,
                 color = factor(bootcamp))) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "AI Bootcamp Effect on Salary Increase",
       subtitle = "True effect is positive, but observed relationship appears negative",
       x = "Years of Experience",
       y = "Salary Increase ($)",
       color = "Bootcamp Participation") +
  theme_minimal()
```
What's happening in this scenario? Let's break it down:
1. **The True Impact:** In reality, the bootcamp program is a success. It
genuinely enhances skills and, consequently, leads to higher salary
increases.
2. **Experience and Participation:** Less experienced engineers are more likely
to enroll in the bootcamp, perhaps viewing it as a way to bridge the gap
with their seasoned colleagues.
3. **Procrastination as a Hidden Factor:** Engineers who procrastinate more are
   also more likely to enroll, perhaps treating the bootcamp as a substitute
   for the self-directed learning they keep putting off.
4. **Procrastination's Influence on Salary:** That same procrastination drags
   down performance and, with it, salary raises, whether or not an engineer
   participates in the bootcamp.
5. **Matching Gone Awry:** By matching solely on experience and overlooking
   procrastination, you inadvertently compare participants, who skew toward
   procrastinators, with non-participants who procrastinate less on average.
The consequence? Your analysis paints a deceptive picture, indicating a negative
effect of the bootcamp when the true effect is, in fact, positive.
This example illustrates a critical lesson in causal inference: the danger of
unmeasured confounders. Here, procrastination acts as an unmeasured confounder,
influencing both the likelihood of bootcamp participation and subsequent salary
increases. As a business data scientist, this scenario highlights the importance
of:
1. Thinking critically about all factors that might influence both your
treatment (bootcamp participation) and outcome (salary increases).
2. Recognizing the limitations of your data and analysis methods.
3. Communicating these nuances to stakeholders who might otherwise make
decisions based on misleading results.
4. Considering additional data collection or alternative analysis methods to
account for potential unmeasured confounders.
In the end, your role isn't just to crunch numbers, but to uncover the true
story behind the data and guide your company towards informed decisions. This
might involve recommending a more comprehensive study that measures traits like
procrastination, or suggesting a randomized pilot program for future iterations
of the bootcamp.
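To see the point numerically, here is a short sketch that continues the
simulation above. It is a hypothetical fix: if procrastination could somehow be
measured, matching and adjusting on it alongside experience should recover an
estimate near the true $2,000 effect:

```{r measured-confounder}
# Hypothetical: suppose procrastination were observable after all
data$procrastination <- procrastination

m.out2 <- matchit(bootcamp ~ experience + procrastination,
                  data = data, method = "nearest", ratio = 1)

adjusted_model <- lm(salary_increase ~ bootcamp + experience + procrastination,
                     data = match.data(m.out2), weights = weights)

# Expected to land near the true $2,000, up to simulation noise
round(coef(adjusted_model)["bootcamp"], 2)
```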
## Conclusion: The Power and Pitfalls of Matching
Matching is a powerful tool in the causal inference toolkit, offering a way to
construct valid comparison groups and tease out causal effects from
observational data. However, as we've seen, it's not without its complexities
and potential pitfalls.
From the basic concept of pairing similar units to the intricacies of different
distance measures and matching algorithms, we've explored the mechanics of how
matching works. We've also delved into its limitations, illustrated vividly by
the Ozzy Osbourne Conundrum, which reminds us that observable characteristics
don't always tell the full story.
The critique by King and Nielsen serves as an important cautionary tale,
particularly regarding the use of propensity score matching. Their work
underscores the importance of understanding the theoretical underpinnings of our
methods and approaching them critically.
As data scientists, our task is to navigate these complexities, understanding
when and how to apply matching methods appropriately. We must be aware of their
strengths and limitations, always striving for transparency in our processes and
robustness in our results.
Matching, when used judiciously, can be a powerful ally in our quest to uncover
causal relationships. But like any tool, its effectiveness depends on the skill
and understanding of those who wield it. As we continue to push the boundaries
of causal inference, let's carry forward this nuanced understanding of matching,
always remaining open to new developments and critiques that can refine our
methodological toolkit.
::: {.callout-tip}
## Learn more
- @stuart2011matchit {MatchIt}: Nonparametric Preprocessing for Parametric Causal Inference.
- @King_Nielsen_2019 Why Propensity Scores Should Not Be Used for Matching.
:::