-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mgeneSim with Resnik measure produces incorrect results #23
Comments
I can reproduce this: library("GOSemSim")
#>
#> GOSemSim v2.10.0 For help: https://guangchuangyu.github.io/GOSemSim
#>
#> If you use GOSemSim in published research, please cite:
#> Guangchuang Yu, Fei Li, Yide Qin, Xiaochen Bo, Yibo Wu, Shengqi Wang. GOSemSim: an R package for measuring semantic similarity among GO terms and gene products Bioinformatics 2010, 26(7):976-978
my_gene_list <- c("150", "151", "152", "1814", "1815", "1816")
d = godata('org.Hs.eg.db', ont="BP", computeIC=TRUE)
#> Loading required package: org.Hs.eg.db
#> Loading required package: AnnotationDbi
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> Loading required package: parallel
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:parallel':
#>
#> clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
#> clusterExport, clusterMap, parApply, parCapply, parLapply,
#> parLapplyLB, parRapply, parSapply, parSapplyLB
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> anyDuplicated, append, as.data.frame, basename, cbind,
#> colnames, dirname, do.call, duplicated, eval, evalq, Filter,
#> Find, get, grep, grepl, intersect, is.unsorted, lapply, Map,
#> mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#> pmin.int, Position, rank, rbind, Reduce, rownames, sapply,
#> setdiff, sort, table, tapply, union, unique, unsplit, which,
#> which.max, which.min
#> Loading required package: Biobase
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with
#> 'browseVignettes()'. To cite Bioconductor, see
#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
#> Loading required package: IRanges
#> Loading required package: S4Vectors
#>
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:base':
#>
#> expand.grid
#>
#> preparing gene to GO mapping data...
#> preparing IC data...
gsimmat3 = mgeneSim(my_gene_list, semData=d, drop="NULL", measure="Resnik", combine="BMA")
#> |=================================================================| 100%
gsimmat3
#> 150 151 152 1814 1815 1816
#> 150 0.638 0.502 0.502 0.376 0.379 0.352
#> 151 0.502 0.641 0.605 0.311 0.322 0.349
#> 152 0.502 0.605 0.645 0.311 0.320 0.335
#> 1814 0.376 0.311 0.311 0.667 0.507 0.424
#> 1815 0.379 0.322 0.320 0.507 0.673 0.408
#> 1816 0.352 0.349 0.335 0.424 0.408 0.701 Created on 2019-06-07 by the reprex package (v0.3.0) I though it could be an error on lr <- geneSim("150", "150", semData = d, measure = "Resnik", combine = "BMA")
lr$geneSim
#> [1] 0.639 The problem seems to come from the Resnik similarity as calculating the similarity between the same GOs with Resnik reports a value different from 0: mgoSim(lr$GO1[1:5], lr$GO2[1:5], semData = d, measure = "Resnik", combine = NULL)
#> GO:0001819 GO:0007186 GO:0007189 GO:0007193 GO:0007265
#> GO:0001819 0.415 0.075 0.075 0.075 0.075
#> GO:0007186 0.075 0.367 0.367 0.367 0.181
#> GO:0007189 0.075 0.367 0.594 0.542 0.276
#> GO:0007193 0.075 0.367 0.542 0.627 0.181
#> GO:0007265 0.075 0.181 0.276 0.181 0.495 However I'm not sure if the Resnik similarity between the same element must be 1. I couldn't understand what happens in that case from the original article. |
I also can reproduce this with Rel also. hsGO <- godata('org.Hs.eg.db', ont="BP", keytype = "UNIPROT")
|
I'd like to confirm that I'm having a similar issue with Resnik scores not being calculated correctly. library(GOSemSim)
library(org.Hs.eg.db)
GO.org <- org.Hs.eg.db
GO.anno <- godata(GO.org, ont = "BP")
# Term 1 vs. Term 2
> goSim("GO:0042060", "GO:0061041", semData = GO.anno, measure = "Resnik")
[1] 0.469
# Term 1 vs. Term 1
> goSim("GO:0042060", "GO:0042060", semData = GO.anno, measure = "Resnik")
[1] 0.469
# Term 2 vs. Term 2
> goSim("GO:0061041", "GO:0061041", semData = GO.anno, measure = "Resnik")
[1] 0.569 As you can see, these values make no sense, and should be much greater, especially if one term is being compared to itself. When testing Resnik similarity using NaviGO's web portal, it accurately determines Term 1 (wound healing) and Term 2 (regulation of wound healing) are highly similar with a Resnik score of 2.4. I'm surprised there aren't more people raising concerns about this, but perhaps Resnik isn't used by many and most just use the default Wang method. This just goes to show, never blindly believe the results you get from one tool/package. Always make sure you verify results using tools from multiple devs to be sure what you get from any one tool is accurate. |
Hi,
I'm trying to use GOSemSim to compute pairwise similarity scores for a large set of genes and it seems that the Resnik measure at least is incorrect because the self-similarities on the diagonal which should all be 1 are not 1 actually!
Here is the command I run
d = godata('org.Hs.eg.db', ont="BP", computeIC=TRUE)
gsimmat3 = mgeneSim(my_gene_list, semData=d, drop="NULL", measure="Resnik", combine="BMA")
print(gsimmat3)
and here is the output
[1] "gsimmat3"
150 151 152 1814 1815 1816
150 0.615 0.502 0.503 0.407 0.388 0.364
151 0.502 0.623 0.594 0.360 0.359 0.379
152 0.503 0.594 0.626 0.357 0.353 0.364
1814 0.407 0.360 0.357 0.678 0.509 0.451
1815 0.388 0.359 0.353 0.509 0.669 0.398
1816 0.364 0.379 0.364 0.451 0.398 0.680
As you can see the diagonal is not 1 as it should be!
Could you please fix this?
Thank you,
Denise
The text was updated successfully, but these errors were encountered: