forked from esherm/geneTherapyPatientReportMaker
-
Notifications
You must be signed in to change notification settings - Fork 2
/
GTSPreport.Rmd
471 lines (404 loc) · 24.4 KB
/
GTSPreport.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
# Analysis of integration site distributions and relative clonal abundance for subject `r sanitize(patient)`
`r format(Sys.Date(), "%b %d %Y")`
```{r setup,echo=FALSE}
libs <- c("knitr", "RColorBrewer", "grid", "gridExtra")
null <- suppressMessages(sapply(libs, library, character.only=TRUE))
opts_chunk$set(
fig.path=paste0(fig.path, "/"),
fig.align='left',
comment="",
echo=FALSE,
warning=FALSE,
error=TRUE,
message=FALSE,
cache=FALSE,
dpi=100,
dev=c("png","pdf","postscript"),
results="asis")
options(knitr.table.format = 'html')
n_cell_types <- barplotAbunds %>%
select(CellType) %>%
distinct(CellType) %>%
nrow(.)
```
***
<P style="page-break-before: always">
## Introduction
The attached report describes results of analysis of integration site
distributions and relative abundance for samples from gene therapy trials. For
cases of gene correction in circulating blood cells, it is possible to harvest
cells sequentially from blood to monitor cell populations. Frequency of
isolation information can provide information on the clonal structure of the
population. This report summarizes results for subject `r sanitize(patient)`
over time points `r sanitize(paste(timepoint,collapse=", "))` in UCSC genome
draft `r sanitize(freeze)`.
The samples studied in this report, the numbers of sequence reads, recovered
integration vectors, and unique integration sites available for this subject
are shown below. We quantify population clone diversity using [Gini
coefficients](https://en.wikipedia.org/wiki/Gini_coefficient), [Shannon
index](https://en.wikipedia.org/wiki/Diversity_index#Shannon_index), and UC50. The Gini
coefficient provides a measure of inequality in clonal abundance in each
sample. The coefficient equals zero when all sites are equally abundant
(polyclonal) and increases as fewer sites account for more of the total
(oligoclonal). Shannon index is another widely used measure of diversity and it
accounts for both abundance and evenness of the integration events. Alternatively, the
UC50 is the number of unique clones which make up the top 50% of the sample's abundance.
For polyclonal samples, one may expect a low Gini coefficient, high Shannon Index, and
high UC50 (proportional to the total number of unique sites identified in the sample).
Under most circumstances only a subset of sites will be sampled. We thus
include an estimate of sample size based on frequency of isolation information
from the SonicLength method [(Berry, 2012)](http://www.ncbi.nlm.nih.gov/pubmed/22238265).
The 'S.chao1' column denotes the estimated lower bound for population size derived using Chao
estimate [(Chao, 1987)](http://www.ncbi.nlm.nih.gov/pubmed/3427163). If sample
replicates were present then estimates were subjected to jackknife bias
correction.
```{r summaryTable,results="asis"}
kable(summaryTable, caption="Sample Summary Table", row.names=FALSE, format="html", digits = 4)
```
`r percent_cutoff <- 0.2`
***
<P style="page-break-before: always">
## Do any uniquely mapped clones account for greater than `r percent(percent_cutoff)` of the total?
For some trials, a reporting criteria is whether any cell clones expand to account for greater than `r percent(percent_cutoff)` of all clones. This is summarized below for subject `r patient`. Abundance is estimated using the SonicLength method. Data such as this must, of course, be interpreted in the context of results from other assays. Distances reported refer to transcription start sites (5').
```{r TwentyPercSites, results="asis"}
rows <- standardizedDereplicatedSites$estAbundProp >= percent_cutoff
if( any(rows) ) {
df <- as.data.frame(standardizedDereplicatedSites[rows,])
df$Timepoint <- sortFactorTimepoints(df$Timepoint)
df <- arrange(df, Timepoint, CellType, estAbund)
df$position <- ifelse(df$strand=="+", df$start, df$end)
df$estAbundProp <- percent(df$estAbundProp)
colnames(df)[ colnames(df)=="seqnames" ]="Chr"
colnames(df)[ colnames(df)=="estAbund" ]="SonicAbundance"
colnames(df)[ colnames(df)=="estAbundProp" ]="Fraction"
colnames(df)[ colnames(df)=="estAbundRank" ]="Rank"
df$GeneName=df$nearest_refSeq_gene
df$nearest_Txn_Start_Dist=sprintf("%s(%s) %s", df$X5pnearest_refSeq_gene, df$X5pnearest_refSeq_geneOrt, df$nearest_refSeq_geneDist)
df$nearest_Onco_Txn_Start_Dist=sprintf("%s(%s) %s", df$X5pNrstOnco, df$X5pNrstOncoOrt, df$X5pNrstOncoDist)
needed <- c("Chr",
"strand",
"position",
"GTSP",
"SonicAbundance",
"Fraction",
"Rank",
"Timepoint",
"CellType",
"GeneName",
"nearest_Txn_Start_Dist",
"nearest_Onco_Txn_Start_Dist")
df <- subset(df, select=needed)
df <- arrange(df, desc(Fraction))
kable(df, caption=sprintf("Sites >%s of the Total", percent(percent_cutoff)), format="html", row.names=FALSE)
} else {
cat(sprintf("<strong>No sites found in this patient which are greater than %s of the total data.</strong>", percent(percent_cutoff)))
}
```
## Do any multihit event account for greater than `r percent(percent_cutoff)` of the total?
Up until now, all the analysis has been looking at integration sites that can be uniquely mapped. But it is also helpful to look at reads finding multiple equally good alignments in the genome which can be reffered to as 'Multihits'. If an integration site occurred within a repeat element (i.e. Alus, LINE, SINE, etc), then it might be helpful to access those sites for potential detrimental effects. These collection of sequences are analyzed separately due to their ambiguity. To make some sense of these multihits, we bin any sequence(s) which share 1 or more genomic locations hence forming psuedo-collections which can be reffered to as OTUs (operation taxonomic units). Once the OTUs are formed, we compare breakpoints of unique sites and multihits. The idea is to see if there are any multihits which higher in abundance than a unique site in a given sample. Below is a table similar to the one shown previously except we show any events instead of genomic locations which might account for greater than `r percent(percent_cutoff)` of all clones in the data.
```{r TwentyPercSitesMulti, results="asis"}
if( nrow(sites.multi) < 1 ) {
cat(sprintf("<strong>No multihits sites found in this patient which are greater than %s of the total data.</strong>", percent(percent_cutoff)))
} else {
total.abundance <- sum(mcols(standardizedDereplicatedSites)$estAbund,
sites.multi$estAbund)
is.multi.expanded <- sites.multi$estAbund > total.abundance*percent_cutoff
if( !any(is.multi.expanded) ) {
cat(sprintf("<strong>No multihits sites found in this patient which are greater than %s of the total data.</strong>", percent(percent_cutoff)))
} else {
df <- subset(sites.multi, is.multi.expanded)
df <- data.frame(
##multihitEvent=paste0("multihit", df$multihitID),
multihitEvent="multihit",
GTSP=df$GTSP,
SonicAbundance=df$estAbund,
Fraction=df$estAbund/total.abundance,
Rank=df$Rank,
Timepoint=df$Timepoint,
CellType=df$CellType
)
kable(df, caption=sprintf("Multihits Sites >%s of the Total", percent(percent_cutoff)), format="html", row.names=FALSE)
}
}
```
***
<P style="page-break-before: always">
## Relative abundance of cell clones
The relative abundance of cell clones is summarized in the attached stacked bar graphs. The cell fraction studied is named at the top, the time points are marked at the bottom. The different bars in each panel show the major cell clones, as marked by integration sites. A key to the sites is shown at the right. Throughout the whole report, each integration site is assigned a ```GeneName``` given by:
<font face="Courier" color="#333399">
- ```GeneName``` refers to the closest gene to either end and strand,
- __*__ indicates the site is within a transcription unit,
- __~__ indicates the site is within 50kb of a cancer related gene,
- __!__ indicates the gene was assocaited with lymphoma in humans.
</font>
Integration sites were recovered using ligation mediated PCR after random fragmentation of genomic DNA, which reduces recovery biases compared with restriction enzyme cleavage. Relative abundance was not measured from read counts, which are known to be inaccurate, but from marks introduced into DNA specimens prior to PCR amplification using the SonicLength method [PMID:22238265](http://www.ncbi.nlm.nih.gov/pubmed/22238265).
In the barplots below, the x-axis indicates each sample type and time point, the y-axis is scaled by the number of cells sampled (SonicBreaks), where the range is taken from the most abundant sample. The top 10 sites from each cell type have been named for the nearest gene and pooled over all cell types. The remaining sites are binned as low abundance (LowAbund; grey).
```{r totalBarPlots, fig.height=ifelse(n_cell_types <= 2, 6, 2*n_cell_types), fig.width=12}
gene_names_barplot <- frequent_genes_barplot
colors_barplot <- colorRampPalette(brewer.pal(12, "Paired"))(length(gene_names_barplot))
siteColors <- structure(colors_barplot, names=gene_names_barplot)
siteColors["LowAbund"] <- "#E0E0E0"
n_timepoints_for_cell_type <- barplotAbunds %>%
distinct(CellType, Timepoint) %>%
group_by(CellType) %>%
summarize(TimePointCount = n())
n_timepoints_for_cell_type <- mutate(
n_timepoints_for_cell_type,
FractionTimePointCount = TimePointCount / max(TimePointCount)
)
barplotAbunds <- left_join(barplotAbunds, n_timepoints_for_cell_type, by='CellType')
ggplot(data=barplotAbunds, aes(Timepoint, estAbund, fill=maskedRefGeneName)) +
geom_bar(aes(), stat="identity") +
facet_grid(CellType ~ ., scales="fixed") +
scale_fill_manual(values=siteColors) +
labs(y="Sonic Abundance", x="Timepoint", fill="GeneNames") +
scale_y_continuous(labels = scales::comma) +
theme(axis.text.x=element_text(angle=45, hjust=1, vjust=1),
panel.background = element_rect(fill = "white"),
panel.border = element_rect(colour = "grey", fill=NA, size=0.5)
)
```
***
<P style="page-break-before: always">
Below are similar barplots to the previous figure, but the y-axis is scaled by proportion of the total, not number of cells sampled. Comparison to the plot above helps distinguish samples with low yield of integration sites from samples with high yield and clonal expansions. The key indicates the 10 most abundant clones in each sample. Cutoff values for binning clones as LowAbund (grey) are indicated at the top of each panel.
```{r sampleBarPlots, fig.height=6*(1+as.integer((n_cell_types-0.1)/3)), fig.width=12}
barplotAbundsBySample <- left_join(barplotAbundsBySample, n_timepoints_for_cell_type, by='CellType')
barplotAbundsBySample$CellType <- paste0(
as.character(barplotAbundsBySample$CellType),
"\n(cutoff = ",
abundCutoff.barplots[barplotAbundsBySample$CellType],
" cells)")
barplotAbundsBySample <- split(barplotAbundsBySample, barplotAbundsBySample$CellType)
#Relative abundance by CellTypes
grid.arrange(
grobs = lapply(barplotAbundsBySample, function(data){
plot <- ggplot(data=data, aes(Timepoint, estAbundProp, fill=maskedRefGeneName)) +
geom_bar(aes(width=0.9*FractionTimePointCount), stat="identity") +
facet_wrap(~CellType, scales="free") +
scale_fill_manual(values=siteColors) +
labs(y="Relative Sonic Abundance", x="Timepoint", fill="GeneNames") +
scale_y_continuous(labels=percent) +
theme(axis.text.x=element_text(angle=45, hjust=1, vjust=1),
panel.background = element_rect(fill = "white"),
panel.border = element_rect(colour = "grey", fill=NA, size=0.5))
ggplotGrob(plot)
}),
ncol = 3,
nrow = 1+as.integer((n_cell_types-0.1)/3)
)
```
***
<P style="page-break-before: always">
Here is another way to perceive top ranking integration sites by genes within each celltype. Any sites with Estimated Absolute Abundance below `r abundCutoff.detailed` are binned as LowAbund.
```{r}
nrows_plot <- as.integer(n_cell_types/6) + 1*ifelse( n_cell_types %% 6 > 0, 1, 0)
ncols_plot <- ifelse(
nrows_plot > 1,
as.integer(n_cell_types / nrows_plot) + ifelse(n_cell_types %% 2 == 1, 1, 0),
n_cell_types)
```
```{r sitetype_heatmap, fig.width=2*ncols_plot+2, fig.height=10*(1+as.integer((n_cell_types-0.1)/6))}
ggplot(data=detailedAbunds, aes(Timepoint, maskedRefGeneName, fill=estAbundProp)) + geom_tile() +
scale_fill_continuous(name='Relative\nAbundance', labels=percent, low="#E5F5E0", high="#2B8CBE") +
facet_wrap(~CellType, scales="free_x", ncol=ncols_plot) + labs(y="SiteType", x="Timepoint") +
theme(axis.text.x=element_text(angle=45, hjust=1, vjust=1))
```
***
<P style="page-break-before: always">
## Longitudinal behavior of major clones
When multiple time points are available, it is of interest to track the behavior of the most abundant clones. A plot of the relative abundances of major clones, based on output from SonicLength, is shown below. For cases where only a single time point is available, the data is just plotted as unlinked points.
```{r ParallelLines, fig.width=10, fig.height=10}
if (has_longitudinal_data) {
ggplot(longitudinal, aes(x=Timepoint, y=estAbundProp)) +
geom_point(size=3) +
geom_line(aes(colour=posid, group=posid), alpha=.5, show_guide=FALSE) +
facet_wrap(~CellType, scales="free") +
ggtitle(paste("Patient:", patient, "Trial:", trial)) + xlab("Timepoint") +
scale_y_continuous(name="Relative Sonic Abundance",
labels=percent,expand=c(0,0)) +
theme(axis.text.x=element_text(angle=45,hjust=1,vjust=1))
}else{cat(paste0("**Only one timepoint, ", unique(levels(timepointPopulationInfo$group)), ", present. Insufficient data available to plot changes of clone densities across timepoints.**"))}
```
***
<P style="page-break-before: always">
## Integration sites near particular genes of interest
Integration sites near genes that have been associated with adverse events are of particular interest. Thus, we have cataloged all integration sites for which a gene of interest is the nearest cancer-related gene.
Results are summarized below as a scatter plot where the y-axis shows relative abundance of sites and x-axis is distance to the nearest onconogene 5' end.
Negative distances indicate that the integration site is downstream from (i.e. after) the TSS. Positive distances indicate that the integration site is upstream from (i.e. before) the TSS. Note that all RefSeq splicing isoforms are used for this analysis, so the reference TSS may not be the same for each listed integration site.
```{r badActors, include=FALSE}
#this chunk has to have include=FALSE otherwise it inexplicably displays a verbatim
#copy of the longitudinal data graph... I have absolutely no idea why...
#a bit hackey but it works
badActorOut <- NULL #clear it out
badActorOut <- lapply(badActors, function(badActor){
sites <- as.data.frame(badActorData[[badActor]])
if(nrow(sites)>0){
knit_child("badActorPartial.Rmd", quiet=T, envir=environment())
}else{
knit_expand(text=paste0("### ", badActor, "\n **No sites within 100kb of any ", badActor," TSS for this patient.**\n"))
}
})
```
`r knit(text = unlist(badActorOut))`
***
<P style="page-break-before: always">
## What are the most frequently occuring gene types in subject `r patient`?
`r max_num_words=100`
The word clouds below show the abundances of integration sites as designated by genenames for each sample. Each word cloud is labled by the timepoint, celltype, and the range of sonic abundance for the top `r max_num_words` sites.
```{r wordlelist, fig.width=4, fig.height=4, fig.show='hide'}
if(! require('PubMedWordcloud')) stop("Need PubMedWordcloud package")
df <- as.data.frame(standardizedDereplicatedSites)
df <- data.frame(
Timepoint=df$Timepoint,
CellType=factor(df$CellType, levels=CellType_order),
word=df$nearest_refSeq_gene,
freq=df$estAbund)
df <- arrange(df, CellType, Timepoint)
df.wordlist <- split(df, f=paste(df$Timepoint, df$CellType))
df.wordlist <- df.wordlist[unique(paste(df$Timepoint, df$CellType))]
max_num_words=100
drange <- lapply(df.wordlist, function(df) paste(range(sort(df$freq, decreasing=TRUE)[1:100], na.rm=TRUE), collapse=":") )
names(df.wordlist) <- paste(names(df.wordlist), "<strong>", drange, "</strong>")
null <- sapply(seq(df.wordlist), function(i)
plotWordCloud(df.wordlist[[i]],
scale=c(3,0.5),
min.freq=1, max.words=max_num_words,
rot.per = 0,
colors=c(colSets("Set1")[-6],colSets("Paired")))
)
```
```{r worldbubble_Table, results='asis'}
df.tab <- data.frame(
name=names(df.wordlist),
pngfile=file.path(fig.path, sprintf("wordlelist-%i.png", seq(df.wordlist))),
stringsAsFactors=FALSE)
df.tab$html <- sprintf("<img src='%s' alt='wordBubble'> <br />%s",
df.tab$pngfile,
gsub('.', ' ', df.tab$name, fixed=TRUE) )
df.tab$isthere <- file.exists(df.tab$pngfile)
df.tab$isthere <- TRUE #Why is this here?
if(any(!df.tab$isthere)) {
message("The word bubbles are not generated for the following time point and cell types:\n", paste(df.tab$name[!df.tab$isthere], collapse="\n") )
}
df.tab <- subset(df.tab, isthere)
## add page breaks if there are too many figures
df.tab.list <- split(df.tab, (1:nrow(df.tab)-1) %/% 9)
for( i in seq(df.tab.list) ) {
df.tab <- df.tab.list[[i]]
df.kable <- matrix(df.tab$html, byrow=TRUE, ncol=3)
df.kable[which(duplicated(c(df.kable)))] <- ""
print(kable(df.kable, row.names=FALSE,format="html", escape=FALSE))
cat("<P style=\"page-break-before: always\">\n")
}
```
```{r knitexit}
knit_exit()
```
### Do any multihit account for greater than 20% of the total?
Up until now, all the analysis has been looking at unique integration sites. But it is also helpful to look at reads finding multiple equally good scoring hits/places in the genome which can be reffered to as 'Multihits'. If an integration site occurred within a repeat element (i.e. Alus, LINE, SINE, etc), then it might be helpful to access those sites for potential detrimental effects. These collection of sequences are binned and analyzed separately due to their ambiguity. To make some sense of these multihits, we bin any sequence(s) which share 1 or more genomic locations hence forming psuedo-collections which can be reffered to as OTUs (operation taxonomic units). Once the OTUs are formed, we compare breakpoints of unique sites and multihits. The idea is to see if there are any multihits which higher in abundance than a unique site in a given sample. Below is a table similar to the one shown previously except we show any site which might be greater than 20\% of all clones in the data.
```{r Top10All, results="asis"}
rows <- sites.all$estAbundance1Prop >= .2
if(any(rows)) {
toprint <- unique(subset(sites.all, estAbundance1Prop >= .2)
[,setdiff(names(sites.all), c('posID','Chr','strand','Position',
'Sequence','otuID','Alias',
'AliasOTUid','estAbundance1PropRank',
'timepointDay'))
])
toprint$Aliasposid <- NULL
tps <- sortTimePoints(toprint$timepoint)
toprint$timepoint <- factor(toprint$timepoint, levels=names(tps))
toprint <- arrange(toprint, patient,timepoint,celltype,estAbundance1Rank)
toprint$isMultiHit <- NULL
names(toprint) <- col.keys[names(toprint)]
toprint$RelativeAbundance <- percent(toprint$RelativeAbundance)
kable(toprint, caption="All Sites >20% of the Total",
format="html", row.names=FALSE)
} else {
cat("<strong>No sites found in this patient which are >20% of the total data after combining multihits.</strong>")
}
```
### SiteTypes
The plot in previous section summarizes overlapping sites at the genomic coordinate level. However, integration sites are often represented by the gene they are in or nearby (SiteType). The plot below summarizes which 'SiteTypes' are often found to be abundant across samples relative to the entire landscape. The sites with abundance greater than 5% and rank within top two are colored.
```{r global_siteType, fig.width=10, fig.height=9}
sums <- aggregateSiteTypes(sites.qc, siteTypeVar="geneType")
sums$timepoint <- sub("(.+):(.+)","\\1",sums$Alias)
sums$celltype <- sub("(.+):(.+)","\\2",sums$Alias)
sums <- merge(sums, with(sums, getRanks(Props, siteType, Alias, "Ranks2")),
by.x=c("siteType", "Alias"),by.y=c("posID","grouping"))
sums$SiteRank <- factor(with(sums, ifelse(Ranks2<4, Ranks2,">3")), levels=c(1:3,">3"))
sums$siteType2 <- with(sums, ifelse(Props>=0.05 & Ranks2<3, siteType, ""))
tps <- names(sortTimePoints(as.character(sums$timepoint)))
sums$timepoint <- factor(as.character(sums$timepoint), levels=tps)
counts <- count(sums, c("timepoint","celltype"))
sums$celltype <- factor(sums$celltype, levels=unique(counts$celltype))
# set custom color scale #
siteTypeCols <- structure(gg_color_hue(length(unique(sums$siteType2))),
names=unique(sums$siteType2))
siteTypeCols[names(siteTypeCols)==""] <- "grey70"
p <- qplot(data=sums, x=timepoint, y=Props, colour=siteType2, xlab="Timepoint",
geom="jitter", position = position_jitter(h = 0.001)) +
geom_hline(y=0.05,linetype='dotted') +
scale_colour_manual(values=siteTypeCols) +
scale_y_continuous(name="Relative Abundance", labels=percent, expand=c(0,0.01)) +
facet_grid(.~celltype, scales="free_x", space="free_x") +
theme(axis.text.x=element_text(angle=45, hjust=1, vjust=1))
p <- direct.label(p,"smart.grid")
print(p)
```
What is the most frequently occuring SiteType in subject `r patient`?
```{r wordle, fig.width=8, fig.height=8}
counts <- count(sums,"siteType"); names(counts)[1] <- "word"
suppressWarnings(plotWordCloud(counts, scale=c(3,0.5), min.freq=1, max.words=500,
rot.per = 0,
colors=c(colSets("Set1")[-6],colSets("Paired"))))
```
```{r overlaps_REvsFrag, results='hide', eval=FALSE}
### Overlap Analysis Restriction Enzyme Vs Fragmentase
Often it is of interest to investigate whether integration sites recovered using Restriction Enzyme(s) are seen again using Fragmentase method or not. In the analysis to follow, we divide each sample data by the isolation method and test how many integration sites overlap using window size of 5bp. For comparability, samples labelled with general celltypes such as WB/Blood were replaced with PBMC.
test <- sites.qc
test$isFrag <- grepl('FRAG',as.character(test$enzyme))
## change celltype WB/Blood to PBMC for comparability ##
test$Alias <- gsub("WB|Blood","PBMC",test$Alias,ignore.case=T)
toCheck <- pmin(xtabs(~Alias+isFrag,test),1)
toCheck <- toCheck[rowSums(toCheck)>1,]
if(is.null(dim(toCheck))) {
cat("<strong>No samples found which used both isolation methods.</strong>")
} else {
toCheck.aliases <- rownames(toCheck)
sites.gr <- with(unique(test[test$Alias %in% toCheck.aliases,
c('Chr','strand','Alias','Position','isFrag')]),
GRanges(seqnames=Chr, strand=strand, Alias=Alias, isFrag=isFrag,
IRanges(start=Position, width=1)))
test.gr <- split(sites.gr, paste(mcols(sites.gr)$Alias))
overlap.res <- sapply(test.gr, findOverlaps, maxgap=5,
ignoreSelf=T, ignoreRedundant=T)
## find overlap & get union of sites per isolation method for percent total ##
overlap.res <- lapply(test.gr,
function(x) {
res <- as.data.frame(findOverlaps(x, maxgap=5,
ignoreSelf=TRUE,
ignoreRedundant=TRUE))
res$isFrag1 <- mcols(x)$isFrag[res$queryHits]
res$isFrag2 <- mcols(x)$isFrag[res$subjectHits]
union.res <- length(union(subset(x,mcols(x)$isFrag),
subset(x,!mcols(x)$isFrag)))
cbind(Sample=as.character(x$Alias[1]),
count(res,c("isFrag1","isFrag2")),
UnionSites=union.res)
})
rm("sites.gr","test.gr")
cleanit <- gc()
overlap.res <- do.call(rbind, overlap.res)
names(overlap.res)[grepl("freq",names(overlap.res))] <- "TotalOverlap"
overlap.res$PercentOverlap <- percent(with(overlap.res,TotalOverlap/UnionSites))
overlap.res$Tp <- sub("(.+):.+","\\1",overlap.res$Sample)
overlap.res$Cell <- sub(".+:(.+)","\\1",overlap.res$Sample)
tps <- sortTimePoints(overlap.res$Tp)
overlap.res$Tp <- factor(overlap.res$Tp, levels=names(tps))
overlap.res <- arrange(overlap.res, Tp, Cell)
wanted.cols <- c("Tp", "Cell", "TotalOverlap", "UnionSites", "PercentOverlap")
kable(overlap.res[,wanted.cols], row.names=FALSE, format="html", digits = 0,
caption="Sites Overlaping between Isolation Methods")
}
```