Bio::Tools::GFF doesn't write eukaryotic multi-exon genes correctly #369
Comments
Thanks for reporting a bug rather than just a Twitter rant! Maybe others can look at this too, @cjfields. It's been 15+ years with that code. I think this is more about assumptions about features vs. locations that are obscuring the problem you describe. If one reads and writes multi-exon genes as individual features, which is the typical way the frame is encoded, this all works as planned. But if a single feature is encoded as a split location, the frame isn't necessarily encoded in a multi-location GenBank file. If you wanted it to be computed from the data, that might be helpful, but it also makes the assumption that the Generic feature is a CDS. This goes back to pre-GFF3 days, when there were multiple interpretations of how parent/child relationships should be encoded across gff1->gff2->gff2.5/gtf etc. I think much better validators and correctors for GFF (perhaps http://genometools.org/) have implemented more dedicated logic. Maybe you can show the input data that you used? Are you converting GenBank to GFF and expecting the frame to be computed, on the assumption that it is a CDS with a frame to be carried through?
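To make the feature-vs-location distinction concrete, here is a minimal Python sketch (not BioPerl code) that expands a GenBank-style split location into the individual ranges a writer would need in order to emit one GFF line per exon. The function name `parse_split_location` is hypothetical, and the parser only handles simple `join(...)` and `complement(...)` forms with optional `<`/`>` partial markers; real GenBank locations can be more involved.

```python
import re

def parse_split_location(location):
    """Expand a simple GenBank location string into (strand, ranges).

    Hypothetical helper for illustration only. Supports 'a..b',
    'join(a..b,c..d)' and 'complement(...)' wrapped around either form.
    Ranges are returned as 1-based inclusive (start, end) tuples in the
    order they appear in the string.
    """
    text = location.strip()
    strand = 1
    if text.startswith("complement(") and text.endswith(")"):
        strand = -1
        text = text[len("complement("):-1]
    if text.startswith("join(") and text.endswith(")"):
        text = text[len("join("):-1]

    ranges = []
    for part in text.split(","):
        # '<' and '>' mark partial ends; they are ignored in this sketch.
        match = re.fullmatch(r"\s*<?(\d+)\.\.>?(\d+)\s*", part)
        if match is None:
            raise ValueError(f"unsupported location segment: {part!r}")
        ranges.append((int(match.group(1)), int(match.group(2))))
    return strand, ranges
```

Once a split location is expanded like this, each range can be treated as its own feature, which is the case the existing code was written for.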
The zip file has a simple GenBank-formatted entry and a simple program that exposes the problem. The correct sequence of frames is 0,0,2,1,0.
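For reference, per-exon values like these can be derived from the segment lengths alone. The Python sketch below uses the GFF3 convention, in which the phase of a CDS segment is the number of bases to skip at its 5' end to reach the next full codon. It is not the Bio::Tools::GFF implementation; the function name `cds_phases` and the coordinates in the usage example are hypothetical (the coordinates were chosen so the result matches the sequence quoted above, they are not taken from the attached entry), and it assumes translation starts at the first base of the first segment unless `first_phase` says otherwise.

```python
def cds_phases(segments, strand=1, first_phase=0):
    """Return the GFF3 phase of each CDS segment, in transcription order.

    segments    -- iterable of 1-based inclusive (start, end) tuples
    strand      -- 1 for the plus strand, -1 for the minus strand
    first_phase -- phase of the first segment (assumed 0, i.e. the CDS
                   begins on a codon boundary)
    """
    # Transcription order: ascending on the plus strand, descending on minus.
    ordered = sorted(segments, reverse=(strand < 0))
    phases = []
    phase = first_phase
    for start, end in ordered:
        phases.append(phase)
        length = end - start + 1
        # Bases left over after the last complete codon determine how many
        # bases the next segment must contribute to finish that codon.
        phase = (3 - (length - phase) % 3) % 3
    return phases


# Hypothetical five-exon gene with segment lengths 30, 31, 31, 31, 20.
example = [(1, 30), (101, 131), (201, 231), (301, 331), (401, 420)]
print(cds_phases(example))  # [0, 0, 2, 1, 0]
```

A writer that assigns one value to every range of a split location cannot produce a sequence like this, since the phase of each segment depends on the lengths of all segments before it.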
Yeah, I agree with @hyphaltip; I suspect there's bit rot from prior logical assumptions that have changed over time. I also vaguely recall Bio::Tools::GFF was to be deprecated in preference to Bio::DB::SeqFeature, though I'm not sure that is still the case. Would it be worth looking into Bio::Tools::GFF, or should we check Bio::DB::SeqFeature? If @scottcain is around, maybe he would know? I think there was a GenBank-to-GFF conversion script for Bio::DB::SeqFeature (maybe within the GBrowse2 code?); we could check to see if it gives the correct frames.
Ugh. Bio::Tools::GFF was old and janky a long time ago and should probably be marked as such, since I don't think it is likely to have improved with age. It is hard to remember the logic that went into that bit of code (I don't recall if I wrote it--I hope not--but I certainly might have!). I think @cjfields is right about there having been a GB to GFF3 script, but I don't recall where it lived. There is a script with GBrowse, https://github.com/GMOD/GBrowse/blob/master/bin/load_genbank.pl, but it loads into a Bio::DB::GFF database (so, GFF2 and mysql or postgres). I don't have the time to do the code archeology to determine if it handles strand better.
Chris Mungall wrote a gbk to gff script that used feature or name overlap
to assign gene, mRNA, and CDS features to a common parent group. It should
be in the scripts folder.
I think Tools::GFF predates you, Scott, and was written before we really had
the same workflow and full feature implementations. I think split location
support was an add-on; previously it round-tripped features whose locations
were explicitly start/stop only. A more db-style interface with Lincoln's
DB::GFF was one solution.
Now I would use NCBI tbl format more aggressively and map to GFF / GBK /
ASN.1 from there anyway.
Jason
There is There was also a |
Apologies to @krobison13 about the wait, but all of us 'old-timers' are pretty time-constrained these days. Coming back around to this, I think we should deprecate Bio::Tools::GFF, particularly if there are better options, but we should definitely point people in the right direction regardless of what we decide.
When writing GFF, the same frame is assigned to every range in a multi-exon gene, rather than each range being assigned its own correct frame of 0, 1, or 2.
Twitter note including image of the offending loop