-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Synchronization: SubClassOf #363
Conversation
5b9873e
to
bf07365
Compare
bf07365
to
771c2dc
Compare
32979c1
to
763e959
Compare
c36db42
to
b156653
Compare
b156653
to
f92a5c2
Compare
f92a5c2
to
64bdfe6
Compare
bef16dd
to
4336a2f
Compare
4336a2f
to
927936c
Compare
893c817
to
0d206cd
Compare
0d206cd
to
3518287
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some general things we should clean up in a future refactoring (cc @twhetzel maybe we should start developing a style guide for these):
- Docstrings and Comments: The script includes thorough comments and docstrings, which is good for understanding the purpose and functionality of each function. However, some comments are excessively verbose and could be more concise.
- TODO Comments: There are several 'todo' comments scattered throughout the script. Leaving numerous todos can indicate unfinished or potentially unstable code.
- Use of Global Variables: There's extensive use of global variables (e.g.,
EX_DEFAULTS
,ONTOLOGY_DIR
, etc.). It's generally better to encapsulate global variables or move configuration settings to a separate config file or environment variables, especially if they are subject to change. - Function Length: Some functions are quite long and perform multiple tasks (e.g.,
sync_subclassof
). Consider breaking down these functions into smaller, more manageable pieces. This improves readability and testability. - Hardcoded Values: There are hardcoded values and paths (like
EX_ONTO_NAME = 'ordo'
). I don't really know what their role is in the finished script - IMO they should not exist/ or exist as default values in the argparse. - Unused Code: There are commented-out sections and potentially unused variables (e.g.,
# rels_indirect_source_source = ...
). It's important to remove unused code to maintain clarity.
sync_subclassof(**d) | ||
|
||
|
||
def run_defaults(use_cache=True): # todo: #remove-temp-defaults |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the point of this method? I don't see it used anywhere and I would rather we don't add any code that is not needed..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Default settings @matentzn
Review and decide.
Details
Basically there are some toggleable default settings that can be run when running the Python script directly rather than in the makefile.
This is very helpful during development. It helps me quickly run and debug things. And there are some things I can do with this defaults setup that I can't do using a debug config with saved params.
We've discussed this type of thing in the past and you've indicated that you prefer to leave out such defaults code from the final product. That's what this todo
is. It's an indicator that this stuff should be removed when development is done.
Of course, this feature is too big for a single PR, and there are perhaps it will undergo development for many months, so I'd like to leave the defaults there so that I can toggle their usage when I develop further.
Note that these are off by default. So they will have no effect unless I'm doing local development.
Options:
a. Delete them now; Will slow Joe down when he continues to develop in the future, but he can temporarily add them back in another PR.
b. Leave them for now. Delete at some point in the future when other cases are completed.
c. Simply leave them indefinitely into the future. They are no harm in production, and useful for further development and debugging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the detailed analysis. This is TMD for me, I will trust you to make the right call!
0015e59
to
348f5c5
Compare
@@ -498,6 +499,41 @@ slurp-all-no-updates: $(foreach n,$(ALL_COMPONENT_IDS), slurp-no-updates-$(n)) | |||
.PHONY: slurp-all | |||
slurp-all: $(foreach n,$(ALL_COMPONENT_IDS), slurp-$(n)) | |||
|
|||
|
|||
############################# | |||
###### Synchronization ###### |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General review comment discussion
@matentzn Just responding to your general / review closing comment as a thread instead so that it can be marked resolved
Context (Nico's review comment):
Details
Some general things we should clean up in a future refactoring (cc @twhetzel maybe we should start developing a style guide for these):
- Docstrings and Comments: The script includes thorough comments and docstrings, which is good for understanding the purpose and functionality of each function. However, some comments are excessively verbose and could be more concise.
- TODO Comments: There are several 'todo' comments scattered throughout the script. Leaving numerous todos can indicate unfinished or potentially unstable code.
- Use of Global Variables: There's extensive use of global variables (e.g.,
EX_DEFAULTS
,ONTOLOGY_DIR
, etc.). It's generally better to encapsulate global variables or move configuration settings to a separate config file or environment variables, especially if they are subject to change. - Function Length: Some functions are quite long and perform multiple tasks (e.g.,
sync_subclassof
). Consider breaking down these functions into smaller, more manageable pieces. This improves readability and testability. - Hardcoded Values: There are hardcoded values and paths (like
EX_ONTO_NAME = 'ordo'
). I don't really know what their role is in the finished script - IMO they should not exist/ or exist as default values in the argparse. - Unused Code: There are commented-out sections and potentially unused variables (e.g.,
# rels_indirect_source_source = ...
). It's important to remove unused code to maintain clarity.
Joe's responses:
Docstrings and Comments
The only long one is in _convert_edge_namespace()
:
failed_lookups: Each entry in this set can be 1 of 2 cases: (a) the edge exists in one ontology, represented by the
namespace of IDs in rels & the keys in ns_map, but it does not exist in the mapped otology, represented by the
namespace of IDs in rels2 & the values in ns_map; it may simply be that there are some terms in the 1st ontology
that do not exist in the 2nd ontology, or (b) the edges should be in both ontologies, but if the inputs
to this pipeline are out of sync, the mapped terms did not appear in ns_map.
Probably could be a bit more concise. But generally, I am fond of detailed docstrings. If you look at popular libraries, they often have very long docstrings. I don't mind them because they can be collapsed more easily, and there is really nowhere better to put them. You could put in external docs, but that's a hassle because it's not linked to the code, and it also just pollutes the docs with stuff that's not going to be relevant to most people. I think erring on the side of verbose is good. Generally people only read these on the rare occasion where they're digging deeply for something, and when it comes to digging deeply, you want to have access to as much information as possible.
todo comments
I have a convention where I distinguish between TODO
and todo
. I use TODO
for something that is important; e.g. it is necessary for completing the development objective, or it is related to a bug. I use todo
for things like optimizations, refactoring, additional testing, and ideas.
Everything that's left here is a todo
, so not a major worry.
Also I should note that I differ from other developers in that I am in favor of leaving (lots of) todo
s even in production. For these low priority things, there are basically 3 choices:
a. Don't leave anything written about them
b. Put in a GitHub issue
c. Leave in the code
d. Leave written elsewhere
For 'b'. Basically, I don't think everything is important enough to go into a GitHub issue, though perhaps these kind of things could go into 1+ issues marked such as "low priority misc".
I think 'd' is a bad idea.
I think 'a' is not a good idea. Similar to my advocation of long docstrings, I think it is better when more is said than not.
So I advocate for 'c'.
Global variables
Not confident on the best solutions but I offer my thoughts here.
Mostly all of the global variables are used only within the global dictionary EX_DEFAULTS
. The design decision for this being a global variable, even though it's really only used in cli()
, is because it is essentially a kind of configuration, configs are typically placed at the top of a file by convention for easy access.
I would advocate against moving to a separate file in this case because this config is specific to the sync subclass script and no other files. (edit: Changed; see update below)
I suppose it is actually not great that you can't tell just looking at the code whether or not the globals are re-used in a lot of places. I am not a fan of global variables, myself.
A potential solution is to create a config()
or defaults()
function and place everything in there. I don't like it aesthetically, but perhaps it is better.
I'd advocate against environmental variables because this makes collaboration more difficult. Even for me, I program on multiple machines sometime, and keeping these files in sync would add additional hassle for me.
Also there can probably be a little cleanup here. Some of these vars could be removed.
Update: I created a config file and moved these variables there. Reason I did that was because I split the CLI into 2 files, and so this change became necessary. Further DRYification is also possible, actually. Many of these path variables already exist in utils.py
and are being imported by other scripts. I should have utilized that before. But I didn't refactor further at this time because (i) really those variables should be in a config.py
anyway, (ii) those variables are str
and not Path
objects, so it would require more refactoring, decision making, and possibly testing to make that change.
Function length
Yeah I think sync_subclassof()
is really the only long one. Actually I was thinking it needed to be broken down too. It's actually only about 200 lines, and about 50% of that is comments, so it is not actually that long. But it is pretty complex.
I've been thinking about making it smaller, and this was the idea I had. Maybe you agree?: Each of the "cases" (about 20 lines each) gets its own function.
Beyond that, I think it would be difficult to make this much smaller, though I do have a todo
related to doing an OO refactor that would simplify things a bit / reduce lines here, but it would not be a quick change.
Hardcoded Values
I put EX_
to stand for "example". The concerns you bring up here, I have addressed in this comment.
Unused Code
I generally agree with this, and I have a todo
about this at the top of the block of unused code. Much of this code would become relevant should we do some additional cases. But I can stash this code in my notes somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your thoughtful comments! I wanted you to be aware of my thoughts, and trust you with the execution! I am quite liberal, maybe to the border of anarchical, when it comes to code, and @twhetzel may be implementing stricter code standards moving forward. This was the last PR where I set the tone of the review, so I am happy if you are.
8f56a9a
to
01b3033
Compare
3a90496
to
ba99144
Compare
src/scripts/sync_subclassof.py
Outdated
|
||
# Case 3: SCR is direct in the source, but not at all in Mondo | ||
subheader = deepcopy(ROBOT_SUBHEADER) | ||
subheader[0]['object_mondo_id'] = 'AI MONDO:excluded_subClassOf' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dual bugfix on case 3 'added' template
Addresses:
-
object_mondo_id
is currentlySC %
in non-obsolete template -
AI
missing fromMONDO:excluded_subClassOf
I did the bug fixes but I did not have the time to re-run the outputs with my limited time this weekend and with the urgent work on fixing the pipeline.
I reran and updated the outputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
subheader[0]['object_mondo_id'] = 'AI MONDO:excluded_subClassOf' | |
subheader[0]['object_mondo_id'] = 'AI obo:mondo#excluded_subClassOf' |
This was not the exactly correct IRI, but this one is (I also tested it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, very good! I just changed it now via amended commit.
By "tested it", I take it that this means that a CURIE is fine here; doesn't need to be a URI.
Do you want to do the honors for merging?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trish and I decided today it was time to merge this!
ba99144
to
8e9ca4b
Compare
Pipeline to keep subclass relationships in sync between Mondo and sources. Currently handled: Case 1, 3, 5 - Add: makefile goals: sync, sync-subclassof, and more. - Update: makefile goal: build-mondo-ingest - Add: src/scripts/sync_subclassof.py - Add: Outputs to reports/ General - Bugfix: utils.py remove_angle_brackets(): Now correctly returns a str if receives a str. - Update: config/prefixes.csv: Add: New entries - Add: utils.py: get_monarch_curies_converter() - Update: Python requirements: oaklib upgrade - Update: Minor refactor in _get_all_owned_terms() - Update: .gitignore: entries, ordering, and comments Related to other PRs - Update: metadata/icd10who.yml: The 2 prefix maps in this file were inconsistent with each other. Corrected here and also in #377
8e9ca4b
to
1169129
Compare
Updates
Synchronization: SubClassOf
Pipeline to keep subclass relationships in sync between Mondo and sources.
Currently handled: Case 1, 3, 5
General
Related to other PRs