Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Synchronization: SubClassOf #363

Merged
merged 1 commit into from
Jan 12, 2024
Merged

Synchronization: SubClassOf #363

merged 1 commit into from
Jan 12, 2024

Conversation

joeflack4
Copy link
Contributor

@joeflack4 joeflack4 commented Sep 10, 2023

Updates

Synchronization: SubClassOf
Pipeline to keep subclass relationships in sync between Mondo and sources.
Currently handled: Case 1, 3, 5

  • Add: makefile goals: sync, sync-subclassof, and more.
  • Update: makefile goal: build-mondo-ingest
  • Add: src/scripts/sync_subclassof.py
  • Add: Outputs to reports/

General

  • Bugfix: utils.py remove_angle_brackets(): Now correctly returns a str if receives a str.
  • Update: config/prefixes.csv: Add: New entries
  • Add: utils.py: get_monarch_curies_converter()
  • Update: Python requirements: oaklib upgrade
  • Update: Minor refactor in _get_all_owned_terms()
  • Update: .gitignore: entries, ordering, and comments

Related to other PRs

  • Update: metadata/icd10who.yml: The 2 prefix maps in this file were inconsistent with each other. Corrected here and also in ICD10WHO prefix updates #377

@joeflack4 joeflack4 marked this pull request as draft September 10, 2023 23:58
@joeflack4 joeflack4 self-assigned this Sep 10, 2023
@joeflack4 joeflack4 linked an issue Sep 10, 2023 that may be closed by this pull request
13 tasks
@joeflack4 joeflack4 added the enhancement New feature or request label Sep 10, 2023
@joeflack4 joeflack4 force-pushed the sync-subclass branch 4 times, most recently from 32979c1 to 763e959 Compare September 16, 2023 00:23
@joeflack4 joeflack4 force-pushed the sync-subclass branch 3 times, most recently from c36db42 to b156653 Compare September 16, 2023 22:17
src/scripts/sync_subclassof.py Outdated Show resolved Hide resolved
src/ontology/config/prefixes.csv Outdated Show resolved Hide resolved
@joeflack4 joeflack4 force-pushed the sync-subclass branch 2 times, most recently from bef16dd to 4336a2f Compare September 17, 2023 01:58
matentzn

This comment was marked as outdated.

@joeflack4 joeflack4 force-pushed the sync-subclass branch 8 times, most recently from 893c817 to 0d206cd Compare December 19, 2023 03:00
src/scripts/sync_subclassof.py Outdated Show resolved Hide resolved
src/scripts/sync_subclassof.py Show resolved Hide resolved
Copy link
Member

@matentzn matentzn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some general things we should clean up in a future refactoring (cc @twhetzel maybe we should start developing a style guide for these):

  • Docstrings and Comments: The script includes thorough comments and docstrings, which is good for understanding the purpose and functionality of each function. However, some comments are excessively verbose and could be more concise.
  • TODO Comments: There are several 'todo' comments scattered throughout the script. Leaving numerous todos can indicate unfinished or potentially unstable code.
  • Use of Global Variables: There's extensive use of global variables (e.g., EX_DEFAULTS, ONTOLOGY_DIR, etc.). It's generally better to encapsulate global variables or move configuration settings to a separate config file or environment variables, especially if they are subject to change.
  • Function Length: Some functions are quite long and perform multiple tasks (e.g., sync_subclassof). Consider breaking down these functions into smaller, more manageable pieces. This improves readability and testability.
  • Hardcoded Values: There are hardcoded values and paths (like EX_ONTO_NAME = 'ordo'). I don't really know what their role is in the finished script - IMO they should not exist/ or exist as default values in the argparse.
  • Unused Code: There are commented-out sections and potentially unused variables (e.g., # rels_indirect_source_source = ...). It's important to remove unused code to maintain clarity.

src/ontology/reports/out_of_sync_mondo_ids.tsv Outdated Show resolved Hide resolved
tests/input/merged.owl Outdated Show resolved Hide resolved
src/scripts/sync_subclassof.py Show resolved Hide resolved
src/scripts/sync_subclassof.py Outdated Show resolved Hide resolved
sync_subclassof(**d)


def run_defaults(use_cache=True): # todo: #remove-temp-defaults
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the point of this method? I don't see it used anywhere and I would rather we don't add any code that is not needed..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default settings @matentzn

Review and decide.

Details

Basically there are some toggleable default settings that can be run when running the Python script directly rather than in the makefile.

This is very helpful during development. It helps me quickly run and debug things. And there are some things I can do with this defaults setup that I can't do using a debug config with saved params.

We've discussed this type of thing in the past and you've indicated that you prefer to leave out such defaults code from the final product. That's what this todo is. It's an indicator that this stuff should be removed when development is done.

Of course, this feature is too big for a single PR, and there are perhaps it will undergo development for many months, so I'd like to leave the defaults there so that I can toggle their usage when I develop further.

Note that these are off by default. So they will have no effect unless I'm doing local development.

Options:
a. Delete them now; Will slow Joe down when he continues to develop in the future, but he can temporarily add them back in another PR.
b. Leave them for now. Delete at some point in the future when other cases are completed.
c. Simply leave them indefinitely into the future. They are no harm in production, and useful for further development and debugging.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the detailed analysis. This is TMD for me, I will trust you to make the right call!

src/scripts/sync_subclassof.py Outdated Show resolved Hide resolved
src/scripts/sync_subclassof.py Outdated Show resolved Hide resolved
@joeflack4 joeflack4 force-pushed the sync-subclass branch 4 times, most recently from 0015e59 to 348f5c5 Compare December 29, 2023 01:00
@@ -498,6 +499,41 @@ slurp-all-no-updates: $(foreach n,$(ALL_COMPONENT_IDS), slurp-no-updates-$(n))
.PHONY: slurp-all
slurp-all: $(foreach n,$(ALL_COMPONENT_IDS), slurp-$(n))


#############################
###### Synchronization ######
Copy link
Contributor Author

@joeflack4 joeflack4 Dec 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General review comment discussion

@matentzn Just responding to your general / review closing comment as a thread instead so that it can be marked resolved

Context (Nico's review comment):

Details

Some general things we should clean up in a future refactoring (cc @twhetzel maybe we should start developing a style guide for these):

  • Docstrings and Comments: The script includes thorough comments and docstrings, which is good for understanding the purpose and functionality of each function. However, some comments are excessively verbose and could be more concise.
  • TODO Comments: There are several 'todo' comments scattered throughout the script. Leaving numerous todos can indicate unfinished or potentially unstable code.
  • Use of Global Variables: There's extensive use of global variables (e.g., EX_DEFAULTS, ONTOLOGY_DIR, etc.). It's generally better to encapsulate global variables or move configuration settings to a separate config file or environment variables, especially if they are subject to change.
  • Function Length: Some functions are quite long and perform multiple tasks (e.g., sync_subclassof). Consider breaking down these functions into smaller, more manageable pieces. This improves readability and testability.
  • Hardcoded Values: There are hardcoded values and paths (like EX_ONTO_NAME = 'ordo'). I don't really know what their role is in the finished script - IMO they should not exist/ or exist as default values in the argparse.
  • Unused Code: There are commented-out sections and potentially unused variables (e.g., # rels_indirect_source_source = ...). It's important to remove unused code to maintain clarity.


Joe's responses:

Docstrings and Comments

The only long one is in _convert_edge_namespace():

    failed_lookups: Each entry in this set can be 1 of 2 cases: (a) the edge exists in one ontology, represented by the
    namespace of IDs in rels & the keys in ns_map, but it does not exist in the mapped otology, represented by the
    namespace of IDs in rels2 & the values in ns_map; it may simply be that there are some terms in the 1st ontology
    that do not exist in the 2nd ontology, or (b) the edges should be in both ontologies, but if the inputs
    to this pipeline are out of sync, the mapped terms did not appear in ns_map.

Probably could be a bit more concise. But generally, I am fond of detailed docstrings. If you look at popular libraries, they often have very long docstrings. I don't mind them because they can be collapsed more easily, and there is really nowhere better to put them. You could put in external docs, but that's a hassle because it's not linked to the code, and it also just pollutes the docs with stuff that's not going to be relevant to most people. I think erring on the side of verbose is good. Generally people only read these on the rare occasion where they're digging deeply for something, and when it comes to digging deeply, you want to have access to as much information as possible.

todo comments

I have a convention where I distinguish between TODO and todo. I use TODO for something that is important; e.g. it is necessary for completing the development objective, or it is related to a bug. I use todo for things like optimizations, refactoring, additional testing, and ideas.

Everything that's left here is a todo, so not a major worry.

Also I should note that I differ from other developers in that I am in favor of leaving (lots of) todos even in production. For these low priority things, there are basically 3 choices:
a. Don't leave anything written about them
b. Put in a GitHub issue
c. Leave in the code
d. Leave written elsewhere

For 'b'. Basically, I don't think everything is important enough to go into a GitHub issue, though perhaps these kind of things could go into 1+ issues marked such as "low priority misc".

I think 'd' is a bad idea.

I think 'a' is not a good idea. Similar to my advocation of long docstrings, I think it is better when more is said than not.

So I advocate for 'c'.

Global variables

Not confident on the best solutions but I offer my thoughts here.

Mostly all of the global variables are used only within the global dictionary EX_DEFAULTS. The design decision for this being a global variable, even though it's really only used in cli(), is because it is essentially a kind of configuration, configs are typically placed at the top of a file by convention for easy access.

I would advocate against moving to a separate file in this case because this config is specific to the sync subclass script and no other files. (edit: Changed; see update below)

I suppose it is actually not great that you can't tell just looking at the code whether or not the globals are re-used in a lot of places. I am not a fan of global variables, myself.

A potential solution is to create a config() or defaults() function and place everything in there. I don't like it aesthetically, but perhaps it is better.

I'd advocate against environmental variables because this makes collaboration more difficult. Even for me, I program on multiple machines sometime, and keeping these files in sync would add additional hassle for me.

Also there can probably be a little cleanup here. Some of these vars could be removed.

Update: I created a config file and moved these variables there. Reason I did that was because I split the CLI into 2 files, and so this change became necessary. Further DRYification is also possible, actually. Many of these path variables already exist in utils.py and are being imported by other scripts. I should have utilized that before. But I didn't refactor further at this time because (i) really those variables should be in a config.py anyway, (ii) those variables are str and not Path objects, so it would require more refactoring, decision making, and possibly testing to make that change.

Function length

Yeah I think sync_subclassof() is really the only long one. Actually I was thinking it needed to be broken down too. It's actually only about 200 lines, and about 50% of that is comments, so it is not actually that long. But it is pretty complex.

I've been thinking about making it smaller, and this was the idea I had. Maybe you agree?: Each of the "cases" (about 20 lines each) gets its own function.

Beyond that, I think it would be difficult to make this much smaller, though I do have a todo related to doing an OO refactor that would simplify things a bit / reduce lines here, but it would not be a quick change.

Hardcoded Values

I put EX_ to stand for "example". The concerns you bring up here, I have addressed in this comment.

Unused Code

I generally agree with this, and I have a todo about this at the top of the block of unused code. Much of this code would become relevant should we do some additional cases. But I can stash this code in my notes somewhere.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your thoughtful comments! I wanted you to be aware of my thoughts, and trust you with the execution! I am quite liberal, maybe to the border of anarchical, when it comes to code, and @twhetzel may be implementing stricter code standards moving forward. This was the last PR where I set the tone of the review, so I am happy if you are.

@joeflack4 joeflack4 force-pushed the sync-subclass branch 4 times, most recently from 8f56a9a to 01b3033 Compare January 5, 2024 17:13
@joeflack4 joeflack4 force-pushed the sync-subclass branch 2 times, most recently from 3a90496 to ba99144 Compare January 6, 2024 01:09

# Case 3: SCR is direct in the source, but not at all in Mondo
subheader = deepcopy(ROBOT_SUBHEADER)
subheader[0]['object_mondo_id'] = 'AI MONDO:excluded_subClassOf'
Copy link
Contributor Author

@joeflack4 joeflack4 Jan 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dual bugfix on case 3 'added' template

Addresses:

  • object_mondo_id is currently SC % in non-obsolete template
  • AI missing from MONDO:excluded_subClassOf

I did the bug fixes but I did not have the time to re-run the outputs with my limited time this weekend and with the urgent work on fixing the pipeline.

I reran and updated the outputs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
subheader[0]['object_mondo_id'] = 'AI MONDO:excluded_subClassOf'
subheader[0]['object_mondo_id'] = 'AI obo:mondo#excluded_subClassOf'

This was not the exactly correct IRI, but this one is (I also tested it).

Copy link
Contributor Author

@joeflack4 joeflack4 Jan 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, very good! I just changed it now via amended commit.

By "tested it", I take it that this means that a CURIE is fine here; doesn't need to be a URI.

Do you want to do the honors for merging?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trish and I decided today it was time to merge this!

Pipeline to keep subclass relationships in sync between Mondo and sources.
Currently handled: Case 1, 3, 5
- Add: makefile goals: sync, sync-subclassof, and more.
- Update: makefile goal: build-mondo-ingest
- Add: src/scripts/sync_subclassof.py
- Add: Outputs to reports/

General
- Bugfix: utils.py remove_angle_brackets(): Now correctly returns a str if receives a str.
- Update: config/prefixes.csv: Add: New entries
- Add: utils.py: get_monarch_curies_converter()
- Update: Python requirements: oaklib upgrade
- Update: Minor refactor in _get_all_owned_terms()
- Update: .gitignore: entries, ordering, and comments

Related to other PRs
- Update: metadata/icd10who.yml: The 2 prefix maps in this file were inconsistent with each other. Corrected here and also in #377
@joeflack4 joeflack4 merged commit edbf75a into main Jan 12, 2024
@joeflack4 joeflack4 deleted the sync-subclass branch January 12, 2024 00:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Synchronization: subClassOf axioms (MVP)
4 participants