Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

new OBO loader wont accept KEGG OBO. #18

Open
bradfordcondon opened this issue Oct 2, 2018 · 12 comments
Open

new OBO loader wont accept KEGG OBO. #18

bradfordcondon opened this issue Oct 2, 2018 · 12 comments
Labels

Comments

@bradfordcondon
Copy link
Member

error:

Step 1: Preloading File sites/all/modules/custom/tripal_analysis_kegg/kegg.obo...
array_key_exists(): The first argument should be either a string or an integer OBOImporter.inc:2203                             [warning]
Cannot cache terms without a default DB.Array
(
    [id] => Array
        (
            [0] => KEGG:Pathway
        )

    [name] => Array
        (
            [0] => Pathway
        )

)

[site http://default] [TRIPAL ERROR] [TRIPAL_JOB] Cannot cache terms without a default DB.Array(    [id] => Array        (            [0] => KEGG:Pathway        )    [name] => Array        (            [0] => Pathway        ))

Probably some OBO header info that isnt printed out

@bradfordcondon bradfordcondon self-assigned this Nov 7, 2018
@bradfordcondon
Copy link
Member Author

simply adding ontology:KEGG to the header results in:


Step 1: Preloading File sites/all/modules/custom/tripal_analysis_kegg/kegg.obo...
A term that belongs to another ontology is used within this vocabulary.  Therefore a lookup will be performed with the EBI Ontology Lookup Service to retrieve the information for this term. Please note, that vocabularies with many non-local terms require remote lookups and these lookups can dramatically decrease loading time.
Performing EBI OLS Lookup for: KEGG:Cancers: Overview (Pathway)
WD php: TypeError: Argument 2 passed to t() must be of the type array, integer given, called in                                            [error]
/Users/bc/tripal/sites/all/modules/tripal/tripal/includes/TripalImporter.inc on line 582 in t() (line 1735 of
/Users/bc/tripal/includes/bootstrap.inc).
Cannot modify header information - headers already sent by (output started at                                                              [warning]
/Users/bc/.composer/vendor/drush/drush/includes/output.inc:40) bootstrap.inc:1486
TypeError: Argument 2 passed to t() must be of the type array, integer given, called in /Users/bc/tripal/sites/all/modules/tripal/tripal/includes/TripalImporter.inc on line 582 in t() (line 1735 of /Users/bc/tripal/includes/bootstrap.inc).
Drush command terminated abnormally due to an unrecoverable error.                                                                         [error]
________________________________________________________________________________

oh dear.


[Term]
id: KEGG:Cancers: Overview (Pathway)
name: Cancers: Overview (Pathway)
is_a: KEGG:Human Diseases (Pathway)

We defined terms that we didnt want to "take seriously" because they dont have kegg accessions like the above one. But now we're going to fail the EBI OLS lookup for each.

@spficklin this is actually a big problem. We need an "override" option for the importer that wont look up terms and will just blindly insert.

I dontk now that we have an alternative. There are literally no KEGG terms in EBI to look up, and because some terms use redundant names/accessions, we HAVE to define new DBs for them or it wont load properly.

@bradfordcondon
Copy link
Member Author

additional idea:
we provide the OBO importer from RC2 which will work with this OBO because theres no EBI lookup. We dont add it to tripal or anything, but we bundle it with this module and run it on install hardcoded for KEGG.
This makes updating the OBO a huge pain, though.

@bradfordcondon
Copy link
Member Author

background info to help:

background: kegg d iscontinued the heir files, which is what the 2.0 module imported
so  we assigned an undergrad to reverse engineer the OBO from KEGG's API
sighhhh which brings up another possibility- forgetting about the OBO and just importing the term accessions (ie inserting an ew cvterm when reading hte annotation file)
the resulting OBO is wonky, because the KEGG accessions are not real or even valid terms
the names are not unique.
furthermore only the "leaf" terms HAVE accessions.
id have to dig up the issues to provide more info on that.  but its part of why we took kind of a janky solution, in appending the sub-ontology to the DB, and using the name as the accession for accessionless terms
so for example, you havel ike 3 "metabolism" terms
with different child terms
so you cant just make them all KEGG:metabolism because db, accession is a unique constraint
so we name them KEGG_pathways:metabolism, KEGG_modules:metabolism for example
but the new loader isnt having that, since it wants to look up those terms
and i developed it prior to the term lookup

and see also : #7

@bradfordcondon
Copy link
Member Author

bradfordcondon commented Nov 8, 2018

@spficklin has brought up using local terms instead. This seems like a good, direct solution.

  • we'd have to insert the local terms on the module install
  • we'd want to make sure the accessions we create are unique to prevent term overlap
  • we'd change the OBO generated to reference these accessionless terms as local:Metabolism etc

Local DB, but KEGG CV i guess? I'd have to think about that...

@spficklin
Copy link
Member

spficklin commented Nov 8, 2018

The previous version of the KEGG module created three different databases: KEGG_PATHWAY, KEGG_BRITE, KEGG_ORTHOLOGY, and KEGG_MODULE. I see in your OBO file that you've used just a single database: KEGG. Was there a reason not to keep the previous databases?

As a second thought.... each of the leaf terms seem to have a properly formed ID already created by KEGG. For example: PATH:ko04810 and BR:ko04812. Had you considered using PATH and BR as the database names rather than KEGG. The terms defined by KEGG have these identifiers and I think we should use them as they defined them.

For example, the OBO right now converts this term: KO:K03022 to KEGG:K03022

I would suggest it just be left as KO:K03022 in the OBO file.

Oh, and I think the OBO may have other issues too. For example. This term:

id: KEGG:K03020
name: RPC19, POLR1D (K03020)
def: DNA-directed RNA polymerases I and III subunit RPAC2
is_a: KEGG:00230 Purine metabolism [PATH:ko00230] (ko00000)
is_a: KEGG:00240 Pyrimidine metabolism [PATH:ko00240] (ko00000)
is_a: KEGG:03020 RNA polymerase [PATH:ko03020] (ko00000)
is_a: KEGG:03021 Transcription machinery [BR:ko03021] (ko00000)
is_a: KEGG:04623 Cytosolic DNA-sensing pathway [PATH:ko04623] (ko00000)
is_a: KEGG:05169 Epstein-Barr virus infection [PATH:ko05169] (ko00000)
is_a: KEGG:00230 Purine metabolism [PATH:ko00230] (ko00001)
is_a: KEGG:00240 Pyrimidine metabolism [PATH:ko00240] (ko00001)
is_a: KEGG:03020 RNA polymerase [PATH:ko03020] (ko00001)
is_a: KEGG:04623 Cytosolic DNA-sensing pathway [PATH:ko04623] (ko00001)
is_a: KEGG:05169 Epstein-Barr virus infection [PATH:ko05169] (ko00001)
is_a: KEGG:03021 Transcription machinery [BR:ko03021] (ko00001)
is_a: KEGG:M00181 (Module)
is_a: KEGG:M00182 (Module)

This is for an ortholog (gene). and the relationship with pathways is is_a. But I don't think that is correct. The ortholog is not a pathway it's part_of a pathway, right?

I think the term stanza should look like this instead:

id: KO:K03020
name: RPC19, POLR1D
def: DNA-directed RNA polymerases I and III subunit RPAC2
part_of: PATH:ko00230 ! Purine metabolism [PATH:ko00230] (ko00000)
part_of: PATH:ko00240 ! Pyrimidine metabolism [PATH:ko00240] (ko00000)
part_of: PATH:ko03020 ! RNA polymerase [PATH:ko03020] (ko00000)
part_of: PATH:ko03021 ! Transcription machinery [BR:ko03021] (ko00000)
part_of: PATH:ko04623 ! Cytosolic DNA-sensing pathway [PATH:ko04623] (ko00000)
part_of: PATH:ko05169 ! Epstein-Barr virus infection [PATH:ko05169] (ko00000)
part_of: PATH:ko00230 ! Purine metabolism [PATH:ko00230] (ko00001)
part_of: PATH:ko00240 ! Pyrimidine metabolism [PATH:ko00240] (ko00001)
part_of: PATH:ko03020 ! RNA polymerase [PATH:ko03020] (ko00001)
part_of: PATH:ko04623 ! Cytosolic DNA-sensing pathway [PATH:ko04623] (ko00001)
part_of: PATH:ko05169 ! Epstein-Barr virus infection [PATH:ko05169] (ko00001)
part_of: PATH:ko03021 !vTranscription machinery [BR:ko03021] (ko00001)
part_of: MODULE:M00181
part_of: MODULE:M00182

@spficklin
Copy link
Member

spficklin commented Nov 8, 2018

So after talking with @bradfordcondon a bit on Slack it seems the desire is to have all of the KEGG vocabularies appear in a single tree like the Gene Ontology vocabularies do. This is why KEGG was used in the ID. However as these KEGG terms represent real terms created by KEGG, I do not think we should be creating new IDs to represent the terms. We should store them as KEGG created them (as in the example I gave above).

However, The problem here for putting them all in the same "tree" for viewing is that KEGG uses a different prefix ID for each vocabulary similar to the EDAM vocabulary. Chado doesn't support this and for EDAM we've had to reverse the CV and DB records to make this work and we've always programmatically loaded EDAM terms. So, while I still think we should setup the OBO to use the correct terms (created by KEGG) we would need to adjust the OBO loader to somehow "reverse" the CV and DBs when storing in Chado. This wouldn't be unprecedented because we do it for EDAM.

@spficklin
Copy link
Member

Also, for backwards compatibility we should do something with the old Tv2 vocabularies created by this module as they are in use on some sites

image

I think this module, in the install, once we settle on how the actual CV/DB records will be should rename those old CV/DB records to use the same names. Then, the OBO loader should update old terms when it's loaded.

@spficklin
Copy link
Member

Perhaps working out the solution to this issue for Chado will resolve the problem for us: GMOD/Chado#58

@bradfordcondon
Copy link
Member Author

to add to this, the new KEGG annotations being returned have new terms that weren't in the previous OBO, forcing us to ignore them or fix this problem....

@songtaogui
Copy link

Hi, I have tried to upload kegg obo using tripal v3.2, and the same error occured as @bradfordcondon described .

Is this problem solved so far?

Thank you

@bradfordcondon
Copy link
Member Author

bradfordcondon commented Jul 17, 2019 via email

@Ferrisx4
Copy link
Member

Ferrisx4 commented Feb 5, 2021

We'll be taking a look at this again. From what I can tell from all the discussion here and in other related issues, there are three main possible routes:

  1. Alter the Tripal OBO Importer to not perform a lookup against the EBI OLS and blindly insert.
  2. "reverse" the CV and DBs when storing in Chado
  3. Overhaul the OBO to use correct CVs/DBs (harder)

There was also some movement on modifying Chado to better accept KEGG (as well as EDAM)

I'm looking to revive the discussion here. Is there a preferred direction?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants