Lexmatch revamp #397

Merged
merged 7 commits into from
Jan 29, 2024
23 changes: 19 additions & 4 deletions src/ontology/config/prefixes.csv
@@ -227,12 +227,27 @@ ICD10CM,http://purl.bioontology.org/ontology/ICD10CM/
ICD10CM2,https://icd.codes/icd10cm/
ICD10WHO,https://icd.who.int/browse10/2019/en#/
ICD10WHO2010,http://apps.who.int/classifications/icd10/browse/2010/en#/
OMIMPS,https://www.omim.org/phenotypicSeries/PS
OMIMPS,https://omim.org/phenotypicSeries/PS
Contributor

prefixes.csv updates

That will be supplanted eventually by dynamic creation of prefixes.csv but these changes seem necessary for now.

Harshad and I should have thought to check the status of this sooner.
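
Until prefixes.csv is generated dynamically, prefixes bound to more than one URI prefix (in this hunk, `OMIMPS`, `SCTID` and `UMLS` each appear more than once) could be flagged with a small check. This is a minimal sketch, assuming a two-column `prefix,URI prefix` layout; `find_conflicting_prefixes` is a hypothetical helper, not something in this repository:

```python
import csv
import io
from collections import defaultdict


def find_conflicting_prefixes(csv_text: str) -> dict[str, list[str]]:
    """Return prefixes that are bound to more than one distinct URI prefix."""
    seen: dict[str, list[str]] = defaultdict(list)
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        prefix, uri_prefix = (cell.strip() for cell in row)
        if uri_prefix not in seen[prefix]:
            seen[prefix].append(uri_prefix)
    # Keep only the prefixes with two or more different URI prefixes.
    return {p: uris for p, uris in seen.items() if len(uris) > 1}


# Rows copied from the hunk above; the real file would be read from disk.
sample = """\
OMIMPS,https://omim.org/phenotypicSeries/PS
SCTID,http://identifiers.org/snomedct/
SCTID,http://snomed.info/id/
OMIM,https://omim.org/entry/
"""
conflicts = find_conflicting_prefixes(sample)
print(conflicts)
```

Exact duplicate rows are not reported, only prefixes whose rows disagree on the URI prefix.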

OMIM,https://omim.org/entry/
Orphanet,http://www.orpha.net/ORDO/Orphanet_
GARD,http://purl.obolibrary.org/obo/GARD_
MEDRA,http://identifiers.org/meddra/
MESH,http://identifiers.org/mesh/
SCTID,http://identifiers.org/snomedct/
SCTID,http://snomed.info/id/
UMLS,http://purl.obolibrary.org/obo/UMLS_
UMLS2,http://linkedlifedata.com/resource/umls/id/
UMLS,http://linkedlifedata.com/resource/umls/id/
HGNC,http://identifiers.org/hgnc/
MEDDRA,http://identifiers.org/meddra/
MEDGEN,http://identifiers.org/medgen/
SCTID,http://identifiers.org/snomedct/
UMLS,http://linkedlifedata.com/resource/umls/id/
orcid,https://orcid.org/
swrl,http://www.w3.org/2003/11/swrl#
semapv,https://w3id.org/semapv/vocab/
HGNC_SYMBOL,https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/
HGNC,https://identifiers.org/hgnc/
ncbi.gene,https://www.ncbi.nlm.nih.gov/gene/
OMIMPS,https://www.omim.org/phenotypicSeries/PS
STY,http://purl.bioontology.org/ontology/STY/
sssom,https://w3id.org/sssom/
biolink,https://w3id.org/biolink/vocab/
doap,http://usefulinc.com/ns/doap#
27 changes: 13 additions & 14 deletions src/ontology/metadata/doid.metadata.sssom.yml
@@ -6,32 +6,31 @@ license: https://creativecommons.org/licenses/by/4.0/
creator_id: https://orcid.org/0000-0002-2906-7319
curie_map:
ICD10: http://apps.who.int/classifications/icd10/browse/2010/en#/
MedDRA: https://identifiers.org/meddra/
# MedDRA: https://identifiers.org/meddra/
MESH: https://meshb.nlm.nih.gov/record/ui?ui=
OMIM: https://omim.org/entry/
OMIMPS: https://www.omim.org/phenotypicSeries/PS
Orphanet: http://www.orpha.net/ORDO/Orphanet_
# Orphanet: http://www.orpha.net/ORDO/Orphanet_
UMLS: http://linkedlifedata.com/resource/umls/id/
DOID: http://purl.obolibrary.org/obo/DOID_
EFO: http://www.ebi.ac.uk/efo/EFO_
ENVO: http://purl.obolibrary.org/obo/ENVO_
GARD: "https://bioregistry.io/reference/gard:"
GARD: "https://bioregistry.io/gard:"
ICD10CM: https://icd.codes/icd10cm/
ICD9CM: https://icd.codes/icd9cm/
ICD9CM_2005: https://icd.codes/icd9cm/
ICDO: "https://bioregistry.io/reference/gard:"
# ICD9CM_2005: https://icd.codes/icd9cm/
ICDO: "https://bioregistry.io/icdo:"
KEGG: http://www.kegg.jp/entry/
MEDDRA: https://identifiers.org/meddra/
NCI: http://purl.obolibrary.org/obo/NCIT_
OMIT: http://purl.obolibrary.org/obo/ENVO_
OMIT: http://purl.obolibrary.org/obo/OMIT_
ORDO: http://www.orpha.net/ORDO/Orphanet_
SNOMEDCT_US_2020_03_01: http://identifiers.org/snomedct/
SNOMEDCT_US_2020_09_01: http://identifiers.org/snomedct/
SNOMEDCT_US_2021_07_31: http://identifiers.org/snomedct/
SNOMEDCT_US_2021_09_01: http://identifiers.org/snomedct/
SNOMEDCT_US_2022_07_31: http://identifiers.org/snomedct/
SNOMEDCT_US_2022_10_31: http://identifiers.org/snomedct/
SNOMEDCT_US_2023_02_28: http://identifiers.org/snomedct/
# SNOMEDCT_US_2020_09_01: http://identifiers.org/snomedct/
# SNOMEDCT_US_2021_07_31: http://identifiers.org/snomedct/
# SNOMEDCT_US_2021_09_01: http://identifiers.org/snomedct/
# SNOMEDCT_US_2022_07_31: http://identifiers.org/snomedct/
# SNOMEDCT_US_2022_10_31: http://identifiers.org/snomedct/
# SNOMEDCT_US_2023_02_28: http://identifiers.org/snomedct/
Contributor

@matentzn why are these commented out rather than removed? If the comments are meant as reminders, what should they remind us of in the future?

SYMP: http://purl.obolibrary.org/obo/SYMP_
UMLS_CUI: http://linkedlifedata.com/resource/umls/id/

# UMLS_CUI: http://linkedlifedata.com/resource/umls/id/
41 changes: 21 additions & 20 deletions src/ontology/metadata/mondo.sssom.config.yml
@@ -5,27 +5,27 @@ curie_map:
DOID: http://purl.obolibrary.org/obo/DOID_
EFO: http://www.ebi.ac.uk/efo/EFO_
HGNC: "http://identifiers.org/hgnc/"
HGNC__2: "https://identifiers.org/hgnc:"
HGNC__3: "https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:"
HGNC_symbol: https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/
# HGNC__2: "https://identifiers.org/hgnc:"
# HGNC__3: "https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:"
# HGNC_symbol: https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/
HP: http://purl.obolibrary.org/obo/HP_
SCTID: http://identifiers.org/snomedct/
SCTID__2: http://snomed.info/id/
# SCTID: http://identifiers.org/snomedct/
# SCTID__2: http://snomed.info/id/
OMIM: https://omim.org/entry/
OMIM__2: http://identifiers.org/omim/
OMIM__3: http://purl.obolibrary.org/obo/OMIM_
OMIM__4: http://omim.org/entry/
# OMIM__2: http://identifiers.org/omim/
# OMIM__3: http://purl.obolibrary.org/obo/OMIM_
# OMIM__4: http://omim.org/entry/
MESH: http://identifiers.org/mesh/
Orphanet: http://www.orpha.net/ORDO/Orphanet_
Orphanet__2: https://www.orpha.net/ORDO/Orphanet_
Orphanet__3: http://purl.obolibrary.org/obo/Orphanet_
# Orphanet__2: https://www.orpha.net/ORDO/Orphanet_
# Orphanet__3: http://purl.obolibrary.org/obo/Orphanet_
oboInOwl: http://www.geneontology.org/formats/oboInOwl#
NCBITaxon: http://purl.obolibrary.org/obo/NCBITaxon_
skos: http://www.w3.org/2004/02/skos/core#
ICD10CM: http://purl.bioontology.org/ontology/ICD10CM/
ICD10CM__2: https://icd.codes/icd10cm/
#ICD10CM__2: https://icd.codes/icd10cm/
ICD10WHO: https://icd.who.int/browse10/2019/en#/
ICD10WHO__2: http://apps.who.int/classifications/icd10/browse/2010/en#/
#ICD10WHO__2: http://apps.who.int/classifications/icd10/browse/2010/en#/
OMIMPS: https://omim.org/phenotypicSeries/PS
MEDGEN: http://identifiers.org/medgen/
MedDRA: http://identifiers.org/meddra/
@@ -34,9 +34,9 @@ curie_map:
semapv: https://w3id.org/semapv/vocab/
rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
sssom: https://w3id.org/sssom/
oio: http://www.geneontology.org/formats/oboInOwl#
# oio: http://www.geneontology.org/formats/oboInOwl#
GTR: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/GTR/"
NCI: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/NCI/"
# NCI: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/NCI/"
NIFSTD: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/NIFSTD/"
PO_GIT: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/PO_GIT/"
CALOHA: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/CALOHA/"
@@ -45,7 +45,7 @@ curie_map:
IMDRF: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/IMDRF/"
LOINC: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/LOINC/"
MEDDRA: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/MEDDRA/"
ncithesaurus: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/ncithesaurus/"
# ncithesaurus: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/ncithesaurus/"
COHD: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/COHD/"
ONCOTREE: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/ONCOTREE/"
ICD9: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/ICD9/"
@@ -63,17 +63,18 @@ curie_map:
Wikipedia: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/Wikipedia/"
Fyler: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/Fyler/"
EPCC: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/EPCC/"
UMLS_CUI: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/UMLS_CUI/"
# UMLS_CUI: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/UMLS_CUI/"
KUPO: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/KUPO/"
OMOP: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/OMOP/"
ICD10: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/ICD10/"
# ICD10: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/ICD10/"
ICD10EXP: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/ICD10EXP/"
DERMO: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/DERMO/"
GARD: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/GARD/"
SNOMEDCT_US: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/SNOMEDCT_US/"
MSH: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/MSH/"
# SNOMEDCT_US: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/SNOMEDCT_US/"
# MSH: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/MSH/"
GC_ID: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/GC_ID/"
SNOMEDCT_2010_1_31: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/SNOMEDCT_2010_1_31/"
# SNOMEDCT_2010_1_31: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/SNOMEDCT_2010_1_31/"
OMIA: "http://purl.obolibrary.org/obo/mondo/mappings/unknown_prefix/OMIA/"

extended_prefix_map:
- prefix: MONDO
4 changes: 2 additions & 2 deletions src/ontology/metadata/omim.metadata.sssom.yml
@@ -11,6 +11,6 @@ curie_map:
OMIM: https://omim.org/entry/
Orphanet: http://www.orpha.net/ORDO/Orphanet_
UMLS: http://linkedlifedata.com/resource/umls/id/
HGNC: 'https://identifiers.org/hgnc:'
#HGNC: 'https://identifiers.org/hgnc:'
hgnc.symbol: 'https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/'
ncbigene: https://www.ncbi.nlm.nih.gov/gene/
ncbigene: https://www.ncbi.nlm.nih.gov/gene/
4 changes: 2 additions & 2 deletions src/ontology/metadata/ordo.metadata.sssom.yml
@@ -8,7 +8,7 @@ curie_map:
ICD10: http://apps.who.int/classifications/icd10/browse/2010/en#/
MedDRA: https://identifiers.org/meddra/
MESH: http://identifiers.org/mesh/
MeSH: http://id.nlm.nih.gov/mesh/
#MeSH: http://id.nlm.nih.gov/mesh/
OMIM: https://omim.org/entry/
Orphanet: http://www.orpha.net/ORDO/Orphanet_
UMLS: http://linkedlifedata.com/resource/umls/id/
UMLS: http://linkedlifedata.com/resource/umls/id/
9 changes: 5 additions & 4 deletions src/ontology/mondo-ingest.Makefile
@@ -39,6 +39,7 @@ dependencies:
@rm -f .template.db
@rm -f .template.db.tmp
@rm -f $*-relation-graph.tsv.gz
.PRECIOUS: %.db

####################################
### Relevant signature #############
@@ -341,7 +342,7 @@ documentation: j2 $(ALL_DOCS) unmapped-terms-docs mapped-deprecated-terms-docs s
build-mondo-ingest:
$(MAKE) refresh-imports exclusions-all slurp-all mappings matches \
mapped-deprecated-terms mapping-progress-report \
recreate-unmapped-components sync documentation
recreate-unmapped-components documentation
matentzn marked this conversation as resolved.
$(MAKE) prepare_release

.PHONY: build-mondo-ingest-no-imports
@@ -429,8 +430,8 @@ tmp/merged.owl: tmp/mondo.owl mondo-ingest.owl tmp/mondo.sssom.ttl
$(MAPPINGSDIR)/mondo-sources-all-lexical.sssom.tsv: $(SCRIPTSDIR)/match-mondo-sources-all-lexical.py tmp/merged.db $(MAPPINGSDIR)/rejected-mappings.tsv
rm -f $(MAPPINGSDIR)/mondo-sources-all-lexical.sssom.tsv
rm -f $(MAPPINGSDIR)/mondo-sources-all-lexical-2.sssom.tsv
pip install bioregistry
python $< run tmp/merged.db \
pip install -U bioregistry curies
matentzn marked this conversation as resolved.
python $(SCRIPTSDIR)/match-mondo-sources-all-lexical.py run tmp/merged.db \
-c metadata/mondo.sssom.config.yml \
-r config/mondo-match-rules.yaml \
--rejects $(MAPPINGSDIR)/rejected-mappings.tsv \
@@ -443,7 +444,7 @@ lexical-matches: $(MAPPINGSDIR)/mondo-sources-all-lexical.sssom.tsv
###################################
lexmatch/README.md: $(SCRIPTSDIR)/lexmatch-sssom-compare.py $(MAPPINGSDIR)/mondo-sources-all-lexical.sssom.tsv $(ALL_EXCLUSION_FILES)
find lexmatch/ -name "*.tsv" -type f -delete
python $< extract_unmapped_matches $(ALL_COMPONENT_IDS) \
python $(SCRIPTSDIR)/lexmatch-sssom-compare.py extract_unmapped_matches $(ALL_COMPONENT_IDS) \
--matches $(MAPPINGSDIR)/mondo-sources-all-lexical.sssom.tsv \
--output-dir lexmatch \
--summary $@ \
2 changes: 1 addition & 1 deletion src/scripts/lexmatch-sssom-compare.py
@@ -204,7 +204,7 @@ def extract_unmapped_matches(input: str, matches: TextIO, output_dir: str, summa
combined_df = pd.concat(ont_df_list)

combined_msdf = MappingSetDataFrame(
df=combined_df, prefix_map=msdf_lex.prefix_map, metadata=msdf_lex.metadata
df=combined_df, converter=msdf_lex.converter, metadata=msdf_lex.metadata
matentzn marked this conversation as resolved.
)
df_dict = split_dataframe(combined_msdf)
summary.write("## mondo_XXXXmatch_ontology")
3 changes: 2 additions & 1 deletion src/scripts/match-mondo-sources-all-lexical.py
@@ -85,7 +85,8 @@ def main(verbose: int, quiet: bool):
def run(input: str, config: str, rules: str, rejects: str, output: str):
# Implemented `meta` param in `lexical_index_to_sssom`

meta = get_metadata_and_prefix_map(config)
#meta = get_metadata_and_prefix_map(config)
meta = None
Contributor

@joeflack4 joeflack4 Jan 22, 2024

lexical_index_to_sssom(meta=None)

Nico wrote:

skips meta in src/scripts/match-mondo-sources-all-lexical.py which is not urgently needed and was causing all the problems @joeflack4 is fixing

You are more familiar with all of the lexmatch outputs and you say that running it locally was fine, so it seems there are no significant negative side effects from this.

You say it is not urgently needed, so I take it that this means that there is still a desire to eventually do what I did in #394 (pass a Converter) or similar, perhaps after addressing INCATools/ontology-access-kit#698. So perhaps I should make like a medium priority ticket for this, or keep my current PR open and rebase from / conform to what you have here.

Not important but to clarify, this wasn't causing all the problems I am fixing. Rather it is a solution for the last, only, conditional, remaining unfixed problem in #394.

Contributor

Trish wrote:

@joeflack4 have you also run the entire pipeline locally? Did you find that Nico's solution to "ditch meta" worked as intended?

I haven't run the whole pipeline locally. I'll give it a go but I'm terrified of it; I've heard it takes over 24 hours and sometimes this can interfere w/ other projects I'm working on. My computer wasn't actually capable of doing this until last month (without some skipping of things like NCIT). When I've asked Nico if he wants me to do this before, he's said I could hold off. But I'll at least give it a go.

Regarding checking the results of meta=None myself, I'll give that a go. I was trusting Nico, and it sounds like he got the results he expected, but it's good to have a second pair of eyes, especially if you think so. Another difference I just noticed is that I've also been passing no meta (same as passing None), but I've been passing prefix_map instead. I'm not sure it matters in this case, but I'd better remove that as well and run in a way that more closely matches what Nico is running. I'd check out his PR as well, but I think it's better to run on my PR because Nico made some other changes to prefixes.csv, and I want to compare the outputs I got before and after with only the meta=None/prefix_map=None change.

Contributor

#394 redundancy

The main difference is that #397 passes meta=None to lexical_index_to_sssom(), whereas #394 passes a Converter. Supposedly the #397 approach is fine.

Everything else in that PR is now cosmetic, so I can merge just those changes afterwards or ignore and close.

My instinct is to feel a little irked that I spent a lot of work and found a way to get the pipeline working without throwing the oio err (although #394 currently has oio removed and loses outputs), and that my work had I think the same overall effect but was supplanted by Nico and Charlie's PRs. But maybe I shouldn't feel irked; I laid the groundwork and this is just the nature of collaboration, and Nico/Charlie are more senior here and have a better idea of how things should exactly be.

Also, I see that this PR does add more than #394, such as missing prefixes.
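
One possible reading of the oio error, and it is only an assumption drawn from the diff rather than something stated in this thread, is that `oio` and `oboInOwl` are bound to the same URI prefix in mondo.sssom.config.yml, so the curie_map is not one-to-one. A minimal sketch of a check for that situation, using an inline dict in place of the parsed YAML; `find_shared_uri_prefixes` is a hypothetical helper:

```python
from collections import defaultdict


def find_shared_uri_prefixes(curie_map: dict[str, str]) -> dict[str, list[str]]:
    """Group prefixes by URI prefix and keep the URI prefixes used more than once."""
    by_uri: dict[str, list[str]] = defaultdict(list)
    for prefix, uri_prefix in curie_map.items():
        by_uri[uri_prefix].append(prefix)
    return {uri: sorted(prefixes) for uri, prefixes in by_uri.items() if len(prefixes) > 1}


# Entries copied from the curie_map shown in the diff for this PR.
curie_map = {
    "oboInOwl": "http://www.geneontology.org/formats/oboInOwl#",
    "oio": "http://www.geneontology.org/formats/oboInOwl#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}
shared = find_shared_uri_prefixes(curie_map)
print(shared)
```

A check like this could run before the lexical match step, whichever of the two approaches (meta=None or a Converter) ends up being kept.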

with open(config, "r") as f:
yml = yaml.safe_load(f)

2 changes: 1 addition & 1 deletion src/sparql/fix_xref_prefixes.ru
@@ -30,5 +30,5 @@ WHERE
?p ?v .
}
FILTER (isIRI(?entity))
BIND(REPLACE(REPLACE(REPLACE(?value, "ICD-10:", "ICD10:", "i"), "MeSH:", "MESH:", "i"), "OMIM:PS", "OMIMPS:", "i") AS ?value_fixed)
BIND(REPLACE(REPLACE(REPLACE(REPLACE(?value, "ICD-11:", "ICD11:", "i"), "ICD-10:", "ICD10:", "i"), "MeSH:", "MESH:", "i"), "OMIM:PS", "OMIMPS:", "i") AS ?value_fixed)
matentzn marked this conversation as resolved.
}
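
For anyone checking the rewrite rules outside of a SPARQL engine, the nested REPLACE chain in the new BIND line can be approximated in Python. This is a sketch, not the pipeline's code: `fix_xref_prefix` and `REWRITES` are hypothetical names, and `re.IGNORECASE` stands in for the SPARQL "i" flag.

```python
import re

# (pattern, replacement) pairs in the order the nested REPLACE calls apply them,
# innermost first.
REWRITES = [
    ("ICD-11:", "ICD11:"),
    ("ICD-10:", "ICD10:"),
    ("MeSH:", "MESH:"),
    ("OMIM:PS", "OMIMPS:"),
]


def fix_xref_prefix(value: str) -> str:
    """Apply each case-insensitive rewrite in turn to an xref string."""
    for pattern, replacement in REWRITES:
        # re.escape keeps the pattern literal; none of these contain regex operators anyway.
        value = re.sub(re.escape(pattern), replacement, value, flags=re.IGNORECASE)
    return value


print(fix_xref_prefix("icd-10:A00.1"))
print(fix_xref_prefix("OMIM:PS123456"))
```

Because matching is case-insensitive, inputs such as `mesh:` and `MeSH:` both come out as `MESH:`.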
2 changes: 1 addition & 1 deletion src/sparql/rm_xref_by_prefix.ru
@@ -16,5 +16,5 @@ WHERE {
oboInOwl:hasDbXref ?xref ;
?p1 ?o2 .
}
FILTER( STRSTARTS(str(?xref), "UMLS_ICD9CM_2005_AUI:") || STRSTARTS(str(?xref), "ICD11:") || STRSTARTS(str(?xref), "SNOMEDCT_US_") ||STRSTARTS(str(?xref), "IMDRF:"))
FILTER( STRSTARTS(str(?xref), "UMLS_ICD9CM_2005_AUI:") || STRSTARTS(str(?xref), "ICD11:") || STRSTARTS(str(?xref), "SNOMEDCT_US_") || STRSTARTS(str(?xref), "IMDRF:") || STRSTARTS(str(?xref), "url:") )
matentzn marked this conversation as resolved.
}
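
The updated FILTER is a case-sensitive starts-with test against a fixed list of prefixes, now including `url:`. A rough Python equivalent, useful for spot-checking which xrefs would be removed; `REMOVED_XREF_PREFIXES` and `should_remove_xref` are hypothetical names, not part of the repository:

```python
# Prefixes copied from the new FILTER line above.
REMOVED_XREF_PREFIXES = (
    "UMLS_ICD9CM_2005_AUI:",
    "ICD11:",
    "SNOMEDCT_US_",
    "IMDRF:",
    "url:",
)


def should_remove_xref(xref: str) -> bool:
    """Return True if the xref starts with any removed prefix (case-sensitive, like STRSTARTS)."""
    return xref.startswith(REMOVED_XREF_PREFIXES)


print(should_remove_xref("url:https://example.org/page"))
print(should_remove_xref("OMIM:123456"))
```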