Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in match-mondo-sources-all-lexical.py #394

Closed
wants to merge 1 commit into from

Conversation

joeflack4
Copy link
Contributor

@joeflack4 joeflack4 commented Jan 12, 2024

Updates

Bugfixes: match-mondo-sources-all-lexical.py

  • Bugfix: AttributeError: 'tuple' object has no attribute 'pop': wrong datatype for metadata was being passed to lexical_index_to_sssom()
  • Bugfix: Several other bugs in mondo-ingest, and upgrading OAK/sssom-py/curies to fix other bugs related to prefix maps.
  • Update: mondo.sssom.config.yml: Commented out duplicate prefix 'oio'
  • Update: prefixes.csv: Removed duplicate prefix oio
  • Update: Python requirements: Upgraded curies for bugfix involving get_prefixes(include_synonyms)
  • Update: Not bugfixes, but did some codestyle fixes: (i) converted regular comments to docstring, (ii) added missing docs
    tring, (iii) renamed built-in 'input' param

Sub-tasks

  • 1. Determine if OAK changes needed or not & PR if so (not needed)
  • 2. Fulfill all requests / tasks in sssom-py PR: Bugfix: clean_prefix_map() not detecting prefix synonyms mapping-commons/sssom-py#485
    • Remove passing of non-bijective map into sssom-py if possible
    • Upgrade sssom-py and curies and use those recent changes instead of my PR changes, if needed (wasn't necessary for this PR but upgraded curies anyway. may need in near future)
  • 3. If root problem needs to be addressed (oio and oboInOwl both in Mondo), make PR in mondo repo too (doesn't seems so. seems to be related to oio in config/prefixes.csv)
  • 4. Fix original bug in issue: AttributeError: 'tuple' object has no attribute 'pop'
  • 5. Add EPM to mondo-ingest
  • 6. Remove oio from mondo.sssom.config.yaml curie_map and prefixes.csv

Related


@souzadevinicius FYI. Harshad is out ATM so I'm helping out with a bug that appeared here, but it opened a can of worms, involving conflicting prefixes that are causing trouble when mondo-ingest interacts with SSSOM and OAK. And we've been decided how to fix this, involving using extended prefix maps instead of plain, flat, bijective prefix maps (key: val), etc.

@joeflack4 joeflack4 marked this pull request as draft January 12, 2024 01:54
@joeflack4 joeflack4 added the bug Something isn't working label Jan 12, 2024
@joeflack4 joeflack4 self-assigned this Jan 12, 2024
@joeflack4 joeflack4 linked an issue Jan 12, 2024 that may be closed by this pull request
@joeflack4 joeflack4 force-pushed the bugfix-converter-pop branch from 3b30be4 to a3db7fd Compare January 12, 2024 03:58
@joeflack4 joeflack4 force-pushed the bugfix-converter-pop branch from a3db7fd to 4e82092 Compare January 12, 2024 04:06
src/ontology/imports/omo_import.owl Outdated Show resolved Hide resolved
src/ontology/metadata/mondo.sssom.config.yml Show resolved Hide resolved
src/scripts/match-mondo-sources-all-lexical.py Outdated Show resolved Hide resolved
src/scripts/match-mondo-sources-all-lexical.py Outdated Show resolved Hide resolved
@joeflack4 joeflack4 requested a review from twhetzel January 16, 2024 22:08
@joeflack4 joeflack4 changed the title Bug in match-mondo-sources-all-lexical.py #390 Bug in match-mondo-sources-all-lexical.py Jan 16, 2024
@joeflack4 joeflack4 changed the title Bug in match-mondo-sources-all-lexical.py Bug in match-mondo-sources-all-lexical.py Jan 16, 2024
@joeflack4 joeflack4 force-pushed the bugfix-converter-pop branch 2 times, most recently from 2e416fb to 71fd56a Compare January 18, 2024 01:23
@joeflack4 joeflack4 force-pushed the bugfix-converter-pop branch 2 times, most recently from da2a7f4 to 2ed4612 Compare January 18, 2024 22:13
@@ -3,7 +3,7 @@ rdf,http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs,http://www.w3.org/2000/01/rdf-schema#
xsd,http://www.w3.org/2001/XMLSchema#
owl,http://www.w3.org/2002/07/owl#
oio,http://www.geneontology.org/formats/oboInOwl#
Copy link
Contributor Author

@joeflack4 joeflack4 Jan 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bugfix~: oio removed from prefixes.csv @matentzn

@matentzn It looks like this has unintended consequences; may want to revert.

Details

Overview

The advantage of removing this is that lexmatch succeeds without needing the changes that Charlie and I introduced to clean_prefix_map(), namely converter.get_prefixes(include_synonyms=True).

The negative is that now OAK is not checking for lexical matches on synonyms identified by the oio:hasExactSynonym field.

You can see a sample of the differences in the outputs by looking at the "b4" and "after" spreadsheets in the outputs GSheet. Comparing the before/after for the first of the outputs, called "oak output prefiltered", the 'after' version has 8 less entries of a total 198 entries in the 'before' set, a reduction of 4%.

Choices

a. Leave oio out of prefixes.csv. Forfeit matching on oio:hasExactSynonym.
b. Keep oio in prefixes.csv in tandem with [Charlie and Joe's bugfix](the changes that Charlie and I introduced) which fixes the perceived missing oio prefix (which is now a prefix synonym of oboInOwl).
c. Change OAK so that even if we remove oio from prefixes.csv, there is no lossiness (more below).

I think we should choose 'b', and possibly do 'c' separately, unless matching on oio:hasExactSynonym is considered unimportant, in which case we can just go with 'a'.

Additional discussion

I want to expand on 'c', possible OAK changes. I think we want OAK to somehow treat oio and oboInOwl as the same, but this is tricky and I can only begin to think about what to be done here. I can only guess and can't speak to the full nature of why, when oio is left out of prefixes.csv, lexmatches on oio:hasExactSynonym go away.

I wonder if it might have something to do with this, at the top of OAK's lexical_indexer.py:

QUALIFIER_DICT = {
    "exact": "oio:hasExactSynonym",
    "broad": "oio:hasBroadSynonym",
    "narrow": "oio:hasNarrowSynonym",
    "related": "oio:hasRelatedSynonym",
}

If you guys think OAK or SemanticSQL needs any sort of update, maybe you should tag Chris here.

Copy link
Member

@matentzn matentzn Jan 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The negative is that now OAK is not checking for lexical matches on synonyms identified by the oio:hasExactSynonym field.

This makes no sense. The oio:hasExactSynonym should not exist, if the input prefix map does not contain it.

I will look into it...

Copy link
Contributor Author

@joeflack4 joeflack4 Jan 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matentzn If I understand you correctly, that's what I'm saying. If prefixes.csv contains oio, then when I run the pipeline, it does matches on just on rdfs:label but also oio:hasExactSynonym, and you can see that reflected in my subsample outputs. And if I remove oio from prefixes.csv I no longer see those in the outputs.

So if we want to keep matching on oio:hasExactSynonym, it seems that we should keep oio in prefixes.csv, even though that causes it to be non-bijective. It's only passed to SemanticSQL anyway. Alternative is to ask OAK to rename oio to oboInOwl.

Also, if you change your mind and want to keep oio in prefixes.csv, then I think we need to merge mapping-commons/sssom-py#485 first, otherwise error will be thrown. So maybe best to merge this PR now and then if we want to put oio back, we can do it in a subsequent PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of today here's where we are:
Nico has looked into further and opened: INCATools/ontology-access-kit#698

Joe looked to see other places in which merged.db is used for identifying possible side effects of changing oio to oboInOwl in prefixes.csv when constructing merged.db. These other usages were found, but do not know if unintended side effects could occur:

  • lexmatch-sssom-compare.py
  • mappings.ipynb
  • oak-lexmatch-sssom-compare.ipynb
  • oak-lexmatch-sssom-compare-checkpoint.ipynb
  • remove-mappings-test.ipynb
  • synchronizeMappingsWithMondo-checkpoint.ipynb

Nico said he would put in some time to solve this, so Joe is pausing activity on this.

src/ontology/metadata/mondo.sssom.config.yml Show resolved Hide resolved
@@ -145,6 +194,9 @@ def run(input: str, config: str, rules: str, rejects: str, output: str):
kwargs = {"subject_id": ("MONDO:%",), "object_id": prefix_args}
with open(str(Path(output.replace("lexical", "lexical-2"))), "w") as f:
filter_file(input=str(Path(output)), output=f, **kwargs)
t1 = datetime.now() # todo: temp
print('match-mondo-sources-all-lexical complete in seconds:', (t1 - t0).seconds) # todo temp
Copy link
Contributor Author

@joeflack4 joeflack4 Jan 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outputs

@matentzn @twhetzel I will be uploading outputs here.

You'll notice there are some outputs labeled 'subsample'. This is an exercise I did to quickly compare the before/after ramifications of removing oio from `prefixes.csv.

For the full output, look at these sheets:

  • after - mondo-sources-all-lexical-2.sssom.tsv
  • after - mondo-sources-all-lexical.sssom.tsv

If you're having trouble loading those tabs (I was, probably because they're 20MB and 50MB), I uploaded the raw files that you can download here.

@joeflack4 joeflack4 force-pushed the bugfix-converter-pop branch from 2ed4612 to 80eb1d1 Compare January 19, 2024 21:39
@joeflack4 joeflack4 force-pushed the bugfix-converter-pop branch 3 times, most recently from 63adbb6 to 4d2f551 Compare January 19, 2024 23:51
joeflack4

This comment was marked as outdated.

@joeflack4 joeflack4 marked this pull request as ready for review January 20, 2024 01:19
- Bugfix: AttributeError: 'tuple' object has no attribute 'pop': wrong datatype for metadata was being passed to lexical_index_to_sssom()
- Bugfix: Several other bugs in mondo-ingest, and upgrading OAK/sssom-py/curies to fix other bugs related to prefix maps.
- Update: mondo.sssom.config.yml: Commented out duplicate prefix 'oio'
- Update: prefixes.csv: Removed duplicate prefix oio
- Update: Python requirements: Upgraded curies for bugfix involving get_prefixes(include_synonyms)
- Update: Not bugfixes, but did some codestyle fixes: (i) converted regular comments to docstring, (ii) added missing docstring, (iii) renamed built-in 'input' param
@joeflack4 joeflack4 force-pushed the bugfix-converter-pop branch from cebb558 to d2328dc Compare January 20, 2024 01:19
@joeflack4 joeflack4 marked this pull request as draft January 20, 2024 01:36
# Implemented `meta` param in `lexical_index_to_sssom`

meta = get_metadata_and_prefix_map(config)
"""Run the script"""
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AttributeError: 'Converter' object has no attribute 'items' (fixed)

You can mark this resolved. It's fixed.

Details

I thought this was ready to merge because it ran successfully through the debugger. However I'm got error in OAK now: AttributeError: 'Converter' object has no attribute 'items'.

ls Traceback (most recent call last):
  File "/work/src/ontology/../scripts/[match-mondo-sources-all-lexical.py](http://match-mondo-sources-all-lexical.py/)", line 176, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/click/[core.py](http://core.py/)", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/[core.py](http://core.py/)", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/[core.py](http://core.py/)", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/dist-packages/click/[core.py](http://core.py/)", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/[core.py](http://core.py/)", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/work/src/ontology/../scripts/[match-mondo-sources-all-lexical.py](http://match-mondo-sources-all-lexical.py/)", line 122, in run
    msdf = lexical_index_to_sssom(oi, lexical_index, ruleset=ruleset, prefix_map=converter)
  File "/usr/local/lib/python3.10/dist-packages/oaklib/utilities/lexical/[lexical_indexer.py](http://lexical_indexer.py/)", line 317, in lexical_index_to_sssom
    {k: v for k, v in prefix_map.items() if k not in meta.prefix_map.keys()}
AttributeError: 'Converter' object has no attribute 'items'

I upgraded OAK within ODK but that didn't work. However, upgrading my ODK image did.

@@ -118,9 +119,9 @@ def run(input: str, config: str, rules: str, rejects: str, output: str):
save_lexical_index(lexical_index, OUT_INDEX_DB)

if rules:
msdf = lexical_index_to_sssom(oi, lexical_index, ruleset=ruleset, meta=meta)
msdf = lexical_index_to_sssom(oi, lexical_index, ruleset=ruleset, prefix_map=converter)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundancy with #397

The main difference is that #397 does passes meta=None to lexical_index_to_sssom(), whereas #394 passes a Converter. Supposedly #397 approach is fine.

Everything else in that #394 is now cosmetic, so i can merge just those changes afterwards or ignore and close.

@joeflack4 joeflack4 mentioned this pull request Jan 22, 2024
@joeflack4
Copy link
Contributor Author

I'm going to close this, as this is essentially supplanted by #397. There's just some minor stylistic stuff in here that we can re-open and merge if we really want.

@joeflack4 joeflack4 closed this Jan 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug in match-mondo-sources-all-lexical.py
3 participants