Line length limit on input FASTA file: 65,536 characters (limit imposed by bioperl) #56

Closed
lstevens17 opened this issue Jul 1, 2020 · 4 comments
Labels
Info FYI

Comments

@lstevens17

Hello,

I'm trying to run the following command:

agat_sp_extract_sequences.pl -g JU2526_Y39G10AR.22.gff -f JU2526*_region.fa -p

And it throws the following error:

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Each line of the file must be less than 65,536 characters. Line 2 is 67824 chars.
STACK: Error::throw
STACK: Bio::Root::Root::throw /home/lgs6452/.conda/envs/exonerate_env/lib/site_perl/5.26.2/Bio/Root/Root.pm:447
STACK: Bio::DB::IndexedBase::_check_linelength /home/lgs6452/.conda/envs/exonerate_env/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm:757
STACK: Bio::DB::Fasta::_calculate_offsets /home/lgs6452/.conda/envs/exonerate_env/lib/site_perl/5.26.2/Bio/DB/Fasta.pm:227
STACK: Bio::DB::IndexedBase::_index_files /home/lgs6452/.conda/envs/exonerate_env/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm:659
STACK: Bio::DB::IndexedBase::index_file /home/lgs6452/.conda/envs/exonerate_env/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm:487
STACK: Bio::DB::IndexedBase::new /home/lgs6452/.conda/envs/exonerate_env/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm:364
STACK: /home/lgs6452/.conda/envs/exonerate_env/bin/agat_sp_extract_sequences.pl:125
-----------------------------------------------------------

It would appear that the use of BioPerl means your scripts won't accept single-line FASTAs whose sequence lines are 65,536 characters or longer. Would it be possible to pre-process the FASTAs (i.e. convert from single-line to multi-line) within your scripts so that they work regardless of the input format? While it's straightforward enough to convert the FASTA file before running your scripts, it would be far more convenient to have it done by the script itself. It would probably save you a tonne of time with confused users, too.

Thanks,

Lewis

PS: I've only begun using AGAT but it seems like it will largely solve the constant pain of working with GFF3 files. Huge thanks for developing it!
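
For anyone who wants to check up front whether a FASTA file will trip this limit, here is a minimal Python sketch. The 65,536 threshold is taken from the error message above; the function name find_long_lines is hypothetical, and the sketch assumes the line terminator is not counted towards the length, which has not been confirmed against BioPerl.

```python
# Limit quoted in the BioPerl error message: each line must be shorter than this.
MAX_LINE_LENGTH = 65536


def find_long_lines(handle, limit=MAX_LINE_LENGTH):
    """Return (line_number, length) for every line that is `limit` characters or longer.

    `handle` is any iterable of text lines, e.g. an open FASTA file.
    Assumption: trailing newline characters are not counted.
    """
    offenders = []
    for number, line in enumerate(handle, start=1):
        length = len(line.rstrip("\r\n"))
        if length >= limit:
            offenders.append((number, length))
    return offenders
```

Calling find_long_lines on an open file handle returns an empty list when every line is short enough for indexing.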

@lstevens17
Author

Incidentally, if anyone bumps into the same issue, you can use FASTX-Toolkit to reformat your FASTA (see http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fasta_formatter_usage). It can be installed using conda.

# install with conda
conda install -c bioconda fastx_toolkit

# convert (where 60 = desired line length)
fasta_formatter -i [original.fasta] -w 60 >[new.fasta]

@Juke34
Collaborator

Juke34 commented Jul 1, 2020

Yes, I could add a patch to reformat the FASTA file in such cases, but I would prefer that this type of fix be handled within BioPerl directly.
If your headers are shorter than 80 characters, you could also use a shell command directly (fold wraps every line at 80 characters by default, so longer headers would be split too):
fold input.fa > output.fa
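
If some headers are longer than 80 characters, fold would break them as well. As a hedged alternative, here is a minimal Python sketch that wraps only sequence lines and leaves header lines untouched. The function name wrap_fasta and the default width of 60 are illustrative assumptions, not part of AGAT or BioPerl. It wraps each input line independently, so it is meant for single-line FASTAs and does not re-join sequences that are already split across lines.

```python
def wrap_fasta(lines, width=60):
    """Yield FASTA lines with sequence lines wrapped to `width` characters.

    Header lines (starting with '>') and blank lines are passed through whole.
    `lines` is any iterable of text lines, e.g. an open file.
    """
    if width < 1:
        raise ValueError("width must be at least 1")
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">") or not line:
            yield line
        else:
            # Cut the sequence into consecutive chunks of at most `width` characters.
            for start in range(0, len(line), width):
                yield line[start:start + width]


# Example usage (file names are placeholders):
# with open("input.fa") as src, open("output.fa", "w") as dst:
#     for out_line in wrap_fasta(src):
#         dst.write(out_line + "\n")
```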

@Juke34 Juke34 changed the title from "Line length limit on input FASTA file (agat_sp_extract_sequences.pl)" to "Line length limit on input FASTA file: 65,536 characters (limit imposed by bioperl)" on Sep 7, 2020
@Juke34
Collaborator

Juke34 commented Sep 7, 2020

See here for discussion with bioperl team: bioperl/bioperl-live#345

@oushujun

oushujun commented May 2, 2022

I see BioPerl does not plan to fix this issue. Here is a Perl alternative to the fastx_toolkit, written by Ning Jiang: https://github.com/oushujun/LTR_retriever/blob/master/bin/fasta-reformat.pl. It's slower but free of third-party dependencies.
