-
Notifications
You must be signed in to change notification settings - Fork 11
/
04-isolated_immune_cell_reconstruction.Rmd
466 lines (397 loc) · 16.1 KB
/
04-isolated_immune_cell_reconstruction.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
---
title: "Isolated immune cell reconstruction"
output:
html_notebook:
toc: true
toc_float: true
---
**J. Taroni 2018**
In this notebook, we'll be evaluating reconstruction of [`E-MTAB-2452`](https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-2452/)
sorted peripheral blood cells (CD4+ T cells, CD14+ monocytes, CD16+ neutrophils)
profiled on microarray from several autoimmune diseases.
as compared to isolated immune cell data included in the recount2.
[`SRP045500`](https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP045500) is a
comparison of 6 cell subsets in "immune-associated diseases."
## Functions and directory set up
```{r}
`%>%` <- dplyr::`%>%`
# custom functions
source(file.path("util", "plier_util.R"))
```
```{r}
# Noticed I needed this functionality repeatedly
ReconstructionEvalWrapper <- function(z.mat, b.mat, input.exprs,
dataset.name, lv.type = "All") {
# This function is a wrapper for reconstructing gene expression data given
# gene loadings (Z) and LVs (B) from a PLIER model. It also takes the
# input expression data so evaluations (Spearman correlation & MASE) can be
# performed. Finally, it returns a tidy data.frame of these measures.
# See also: the compare reconstructed to input section of util/plier_util.R
#
# Args:
# z.mat: a matrix of gene loadings from PLIER
# b.mat: a matrix of latent variables from PLIER
# input.exprs: gene expression matrix before reconstruction (input into
# PLIER), used as the `true.mat` argument for the reconstruction
# evaluations (should be row-normalized, can use
# GetOrderedRowNorm to obtain this matrix)
# dataset.name: a string that indicates the dataset being reconstructed,
# this will end up being the value in the `Dataset` column of
# data.frame returned by this function
# lv.type: a string that indicates what LVs were use for reconstruction,
# this will end up being the value in the
# `LVs used in reconstruction` column of the data.frame returned
#
# Returns:
# recon.eval.df: A data.frame with the following columns - Sample, MASE,
# Spearman correlation, LVs used in reconstruction, Dataset
#
## Reconstruction
recon.exprs <- GetReconstructedExprs(z.matrix = z.mat,
b.matrix = b.mat)
## Evaluations
recon.error <- GetReconstructionMASE(true.mat = input.exprs,
recon.mat = recon.exprs)
recon.cor <- GetReconstructionCorrelation(true.mat = input.exprs,
recon.mat = recon.exprs)
## get data.frame
num.samples <- ncol(input.exprs)
recon.eval.df <- as.data.frame(cbind(colnames(input.exprs),
recon.error, recon.cor,
rep(lv.type, num.samples),
rep(dataset.name, num.samples)))
colnames(recon.eval.df) <- c("Sample", "MASE", "Spearman correlation",
"LVs used in reconstruction", "Dataset")
recon.eval.df <- recon.eval.df %>%
dplyr::mutate(MASE = as.numeric(as.character(MASE)),
`Spearman correlation` =
as.numeric(as.character(`Spearman correlation`)))
return(recon.eval.df)
}
```
```{r}
# plot and result directory setup for this notebook
plot.dir <- file.path("plots", "04")
dir.create(plot.dir, recursive = TRUE, showWarnings = FALSE)
results.dir <- file.path("results", "04")
dir.create(results.dir, recursive = TRUE, showWarnings = FALSE)
```
## Load data
### E-MTAB-2452
#### Expression data
```{r}
gs.file <- file.path("data", "expression_data",
"E-MTAB-2452_hugene11st_SCANfast_with_GeneSymbol.pcl")
exprs.df <-readr::read_tsv(gs.file)
# drop gene identifiers
exprs.mat <- as.matrix(exprs.df[, 3:ncol(exprs.df)])
rownames(exprs.mat) <- exprs.df$GeneSymbol
rm(exprs.df)
```
#### B matrix
From recount2 model
```{r}
iso.b.file <- file.path("results", "03",
"E-MTAB-2452_B_matrix_recount2_model.txt")
iso.b.matrix <- read.delim(iso.b.file)
```
### recount2
#### PLIER model
```{r}
plier.results <- readRDS(file.path("data", "recount2_PLIER_data",
"recount_PLIER_model.RDS"))
```
#### Input expression data
```{r}
# data that was prepped for use with PLIER
recount.list <- readRDS(file.path("data", "recount2_PLIER_data",
"recount_data_prep_PLIER.RDS"))
# input expression data from intermediate file
recount.input.exprs <- recount.list$rpkm.cm
rm(recount.list)
```
#### Evaluation of reconstruction
Spearman correlation, MASE
```{r}
eval.file <- file.path("results", "02",
"recount2_recount2_model_recon_eval_df.tsv")
recount.eval.df <- readr::read_tsv(eval.file)
recount.eval.df$Dataset <- rep("recount2", nrow(recount.eval.df))
```
Identify the samples that belong to `SRP045500`, we'll duplicate these values
in `recount.eval.df` so we can plot them separately.
```{r}
srp.df <- recount.eval.df %>%
dplyr::filter(grepl("SRP045500", recount.eval.df$Sample)) %>%
dplyr::mutate(Dataset = "SRP045500")
recount.eval.df <- dplyr::bind_rows(recount.eval.df, srp.df)
```
## Reconstruction of E-MTAB-2452 with all LVs
```{r}
# recount2 model Z matrix
z.matrix <- plier.results$Z
```
### Reconstruction and evaluation
We need to obtain a (row-normalized) gene expression matrix as the "true"
expression matrix.
We use this for evaluation with Spearman and MASE as in previous notebooks.
We'll use this for the other evaluations (e.g., pathway-associated LVs only) as
well.
```{r}
iso.ord.rownorm <- GetOrderedRowNorm(exprs.mat = exprs.mat,
plier.model = plier.results) #recount2
```
```{r}
iso.recon.all.df <- ReconstructionEvalWrapper(z.mat = as.matrix(z.matrix),
b.mat = as.matrix(iso.b.matrix),
input.exprs = iso.ord.rownorm,
lv.type = paste("All, n =",
ncol(z.matrix)),
dataset.name = "E-MTAB-2452")
```
```{r}
head(iso.recon.all.df)
```
### Plotting
Add the isolated cell type reconstruction evaluation measures to the recount
data.frame
```{r}
eval.df <- dplyr::bind_rows(recount.eval.df, iso.recon.all.df)
rm(iso.recon.all.df)
```
#### Correlation
```{r}
dplyr::filter(eval.df, `LVs used in reconstruction` ==
paste("All, n =", ncol(z.matrix))) %>%
ggplot2::ggplot(ggplot2::aes(x = `Spearman correlation`,
group = Dataset, fill = Dataset)) +
ggplot2::geom_density(alpha = 0.3) +
ggplot2::theme_bw() +
ggplot2::scale_fill_manual(values = c("white", "gray50", "black")) +
ggplot2::ggtitle(paste("All, n =", ncol(z.matrix)))
```
#### MASE
```{r}
dplyr::filter(eval.df, `LVs used in reconstruction` ==
paste("All, n =", ncol(z.matrix))) %>%
ggplot2::ggplot(ggplot2::aes(x = MASE,
group = Dataset, fill = Dataset)) +
ggplot2::geom_density(alpha = 0.3) +
ggplot2::theme_bw() +
ggplot2::scale_fill_manual(values = c("white", "gray50", "black")) +
ggplot2::ggtitle(paste("All, n =", ncol(z.matrix)))
```
### Summary
The isolated immune cell samples (`SRP045500`) are among some of the best
reconstructed (high correlation, low error) in the recount2 compendium.
`E-MTAB-2452`, the microarray dataset, is reconstructed relatively poorly.
This is unsurprising, as it was not included in training and is from a different
technology.
We'll look next at reconstruction using _subsets_ of the recount2 PLIER model
LVs.
Specifically, we hypothesize that reconstruction using only pathway-associated
LVs (e.g., those LVs rooted in biological signal) may make the measures of
reconstruction performance -- correlation and error -- more similar between
the two isolated immune cell datasets.
If this is the case, it supports the idea that the PLIER-learned LVs that are
not associated with pathways are, at least in part, capturing variation that
can be attributed to technical factors.
We would expect that the technical signals in recount2 would be distinct
(e.g., an LV associated with library prep strategy) from what's in
`E-MTAB-2452` and, therefore, that including this information may negatively
impact reconstruction of this dataset.
## Reconstruction of E-MTAB-2452 with pathway-associated LVs only
```{r}
# Get the significant LVs (from recount2 model) using 0.05 as the FDR cutoff
plier.summary <- plier.results$summary
sig.summary <- plier.summary %>%
dplyr::filter(FDR < 0.05)
sig.lvs <- unique(sig.summary$`LV index`)
```
```{r}
# drop columns (LVs) from Z that are not significantly associated with prior
# info
sig.z.mat <- z.matrix[, as.integer(sig.lvs)]
# drop rows (LVs) from E-MTAB-2452 B that are not significantly associated with
# prior info
iso.sig.b.mat <- as.matrix(iso.b.matrix[as.integer(sig.lvs), ])
```
### Reconstruction and evaluation
```{r}
path.lv.type <- paste("Pathway-associated, n =", ncol(sig.z.mat))
iso.recon.sig.df <- ReconstructionEvalWrapper(z.mat = as.matrix(sig.z.mat),
b.mat = iso.sig.b.mat,
input.exprs = iso.ord.rownorm,
dataset.name = "E-MTAB-2452",
lv.type = path.lv.type)
head(iso.recon.sig.df)
```
Add to rest of evaluations
```{r}
eval.df <- dplyr::bind_rows(eval.df, iso.recon.sig.df)
rm(iso.recon.sig.df)
```
## Reconstruction with LVs that are not associated with pathways
Here, we'll have to reconstruct the recount2 data with the LVs that are not
significantly associated with pathways, as we did not perform this analysis
in `02-recount2_PLIER_exploration`.
```{r}
# some LVs don't even make it to the summary data.frame from the PLIER model
# the LVs we want here are *all* the LVs that do not have at least one pathway
# significantly associated with them
all(1:nrow(plier.results$B) %in% unique(plier.summary$`LV index`))
```
```{r}
# non-significant LVs, then, are all LVs that are not in the list of
# pathway-associated LVs
non.lvs <- setdiff(1:nrow(plier.results$B), as.integer(sig.lvs))
```
```{r}
# z matrix for just these LVs
non.z.mat <- z.matrix[, as.integer(non.lvs)]
```
### recount2
```{r}
# get B matrix
non.b.mat <- as.matrix(plier.results$B[as.integer(non.lvs), ])
# reconstruction & evaluation
non.lv.type <- paste("LVs not associated with pathway, n =", length(non.lvs))
recount.recon.non.df <-
ReconstructionEvalWrapper(z.mat = as.matrix(non.z.mat),
b.mat = non.b.mat,
input.exprs = recount.input.exprs,
lv.type = non.lv.type,
dataset.name = "recount2")
head(recount.recon.non.df)
```
#### SRP045500
```{r}
srp.non.df <- recount.recon.non.df %>%
dplyr::filter(grepl("SRP045500",
recount.recon.non.df$Sample)) %>%
dplyr::mutate(Dataset = "SRP045500")
recount.recon.non.df <- dplyr::bind_rows(recount.recon.non.df, srp.non.df)
```
```{r}
eval.df <- dplyr::bind_rows(eval.df, recount.recon.non.df)
rm(recount.recon.non.df, srp.non.df)
```
### E-MTAB-2452
```{r}
# get b matrix
iso.non.b.mat <- as.matrix(iso.b.matrix[as.integer(non.lvs), ])
# reconstruction & evaluation
iso.recon.non.df <-
ReconstructionEvalWrapper(z.mat = as.matrix(non.z.mat),
b.mat = iso.non.b.mat,
input.exprs = iso.ord.rownorm,
lv.type = non.lv.type,
dataset.name = "E-MTAB-2452")
head(iso.recon.non.df)
```
```{r}
eval.df <- dplyr::bind_rows(eval.df, iso.recon.non.df)
rm(iso.recon.non.df)
```
## Plotting
Plot all three conditions on the same plot for easy comparison
### Density
#### Correlation
```{r}
ggplot2::ggplot(eval.df,
ggplot2::aes(x = `Spearman correlation`, group = Dataset, fill = Dataset)) +
ggplot2::facet_wrap(~ `LVs used in reconstruction`, ncol = 1) +
ggplot2::geom_density(alpha = 0.3) +
ggplot2::theme_bw() +
ggplot2::scale_fill_manual(values = c("white", "gray50", "black"))
```
```{r}
plot.file <- file.path(plot.dir,
"recount2_model_isolated_cell_recon_correlation.pdf")
ggplot2::ggsave(filename = plot.file, plot = ggplot2::last_plot(),
height = 11, width = 8.5)
```
#### Error
```{r}
ggplot2::ggplot(eval.df,
ggplot2::aes(x = MASE, group = Dataset, fill = Dataset)) +
ggplot2::facet_wrap(~ `LVs used in reconstruction`, ncol = 1) +
ggplot2::geom_density(alpha = 0.3) +
ggplot2::theme_bw() +
ggplot2::scale_fill_manual(values = c("white", "gray50", "black"))
```
```{r}
plot.file <- file.path(plot.dir,
"recount2_model_isolated_cell_recon_error.pdf")
ggplot2::ggsave(filename = plot.file, plot = ggplot2::last_plot(),
height = 11, width = 8.5)
```
### E-MTAB-2452 Boxplots
```{r}
array.eval.df <- dplyr::filter(eval.df, Dataset == "E-MTAB-2452")
```
#### Correlation
```{r}
ggplot2::ggplot(array.eval.df,
ggplot2::aes(x = `LVs used in reconstruction`,
y = `Spearman correlation`)) +
ggplot2::geom_boxplot() +
ggplot2::theme_bw() +
ggplot2::scale_x_discrete(labels = c("All", "Not pathway-associated",
"Pathway-associated")) +
ggplot2::ggtitle("E-MTAB-2452 reconstruction with recount2 model")
```
```{r}
plot.file <- file.path(plot.dir,
"E-MTAB-2452_reconstruction_corr_recount2_model.pdf")
ggplot2::ggsave(plot.file, plot = ggplot2::last_plot())
```
```{r}
pairwise.t.test(x = array.eval.df$`Spearman correlation`,
g = array.eval.df$`LVs used in reconstruction`,
p.adjust.method = "bonferroni")
```
#### Error
```{r}
ggplot2::ggplot(array.eval.df,
ggplot2::aes(x = `LVs used in reconstruction`, y = MASE)) +
ggplot2::geom_boxplot() +
ggplot2::theme_bw() +
ggplot2::scale_x_discrete(labels = c("All", "Not pathway-associated",
"Pathway-associated")) +
ggplot2::ggtitle("E-MTAB-2452 reconstruction with recount2 model")
```
```{r}
plot.file <- file.path(plot.dir,
"E-MTAB-2452_reconstruction_error_recount2_model.pdf")
ggplot2::ggsave(plot.file, plot = ggplot2::last_plot())
```
```{r}
pairwise.t.test(x = array.eval.df$MASE,
g = array.eval.df$`LVs used in reconstruction`,
p.adjust.method = "bonferroni")
```
## Summary
Reconstruction of a test dataset, `E-MTAB-2452`, using only PLIER latent
variables that significantly associated with a pathway or gene set is
better in terms of both MASE and Spearman correlation than when using only
latent variables that are not associated with any prior information.
This is despite the fact that there are more non-significant LVs (`n = 788` vs.
`n = 199`).
However, neither of these subsets of LVs outperforms using all LVs
and the performance gap between reconstructing `E-MTAB-2452` and `SRP045500`
(the sorted immune cell dataset included in recount2) is not considerably
reduced when using only pathway-associated LVs, which refutes our initial
hypothesis.
It's also interesting that, when reconstructing using only non-significant LVs,
the pre- and post-reconstruction correlation values are much more similar
between `E-MTAB-2452` and `SRP045500`.
This suggests that the pathway-associated LVs are important for accurate
reconstruction of the `SRP045500` dataset.
_It's worth noting that these sorted immune cell datasets are from
patients with different diagnoses and contain different cell type populations
(in addition to different technologies) that may be more or less
well-represented in the recount2 compendium.
We expect that these differences contribute to the results.
Future analyses may focus on a microarray dataset more comparable to `SRP045500`
in these ways._