Line length limit on input FASTA file: 65,536 characters (limit imposed by bioperl) #56

lstevens17 · 2020-07-01T10:55:57Z

Hello,

I'm trying to run the following command:

agat_sp_extract_sequences.pl -g JU2526_Y39G10AR.22.gff -f JU2526*_region.fa -p

And it throws the following error:

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Each line of the file must be less than 65,536 characters. Line 2 is 67824 chars.
STACK: Error::throw
STACK: Bio::Root::Root::throw /home/lgs6452/.conda/envs/exonerate_env/lib/site_perl/5.26.2/Bio/Root/Root.pm:447
STACK: Bio::DB::IndexedBase::_check_linelength /home/lgs6452/.conda/envs/exonerate_env/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm:757
STACK: Bio::DB::Fasta::_calculate_offsets /home/lgs6452/.conda/envs/exonerate_env/lib/site_perl/5.26.2/Bio/DB/Fasta.pm:227
STACK: Bio::DB::IndexedBase::_index_files /home/lgs6452/.conda/envs/exonerate_env/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm:659
STACK: Bio::DB::IndexedBase::index_file /home/lgs6452/.conda/envs/exonerate_env/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm:487
STACK: Bio::DB::IndexedBase::new /home/lgs6452/.conda/envs/exonerate_env/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm:364
STACK: /home/lgs6452/.conda/envs/exonerate_env/bin/agat_sp_extract_sequences.pl:125
-----------------------------------------------------------

It would appear that the use of BioPerl means your scripts won't accept single-line FASTAs whose sequence lines reach 65,536 characters. Would it be possible to pre-process the FASTAs within your scripts (i.e. convert from single-line to multi-line) so that they work regardless of the input format? While it's straightforward enough to convert the FASTA file before running your scripts, it would be far more convenient to have it done by the script itself. It would probably save you a tonne of time with confused users, too.

Thanks,

Lewis

PS: I've only begun using AGAT but it seems like it will largely solve the constant pain of working with GFF3 files. Huge thanks for developing it!
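For anyone who wants to check a FASTA before handing it to the script, here is a minimal Python sketch that reports lines at or above the limit quoted in the error message. The function name `find_long_lines` and the constant `BIOPERL_LINE_LIMIT` are made up for illustration. Whether BioPerl counts the line terminator is not verified here, so the check deliberately errs on the strict side.

```python
from typing import Iterable, Iterator, Tuple

# Limit quoted in the BioPerl error message above. The exact comparison
# BioPerl performs is an assumption, so lines at exactly this length are
# flagged as well.
BIOPERL_LINE_LIMIT = 65_536


def find_long_lines(
    lines: Iterable[str], limit: int = BIOPERL_LINE_LIMIT
) -> Iterator[Tuple[int, int]]:
    """Yield (line_number, length) for every line of `limit` characters or more.

    Line numbers are 1-based, and the length excludes the line terminator.
    """
    for number, line in enumerate(lines, start=1):
        length = len(line.rstrip("\r\n"))
        if length >= limit:
            yield number, length
```

Passing an open file handle (for example `find_long_lines(open("input.fa"))`, where `input.fa` is a placeholder name) streams the file line by line, so the whole FASTA does not need to fit in memory.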


lstevens17 · 2020-07-01T11:06:06Z

Incidentally, if anyone bumps into the same issue, you can use FASTX-Toolkit to reformat your FASTA (see http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fasta_formatter_usage). It can be installed using conda.

# install with conda
conda install -c bioconda fastx_toolkit

# convert (where 60 = desired line length)
fasta_formatter -i [original.fasta] -w 60 >[new.fasta]

Juke34 · 2020-07-01T11:51:42Z

Yes, I could add a patch to reformat the FASTA file in such cases, but I would prefer that this type of fix be handled within BioPerl directly.
If your headers are shorter than 80 characters, you could also directly use a bash command:
fold input.fa > output.fa
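Since `fold` wraps every line at 80 characters by default, it would also split headers longer than that. As a hedged alternative, the following Python sketch re-wraps only the sequence lines and leaves header lines untouched. The function names `wrap_fasta` and `_chunks` are hypothetical, and this is not how AGAT or BioPerl handle the file internally.

```python
from typing import Iterable, Iterator


def _chunks(sequence: str, width: int) -> Iterator[str]:
    """Yield consecutive slices of `sequence`, each at most `width` characters."""
    for start in range(0, len(sequence), width):
        yield sequence[start:start + width]


def wrap_fasta(lines: Iterable[str], width: int = 60) -> Iterator[str]:
    """Yield FASTA lines (without terminators) with sequences wrapped at `width`.

    Header lines (starting with '>') are passed through unchanged. The
    sequence lines of each record are joined first and then re-wrapped, so
    every sequence line in a record except the last has the same length.
    Blank lines are dropped. One record is held in memory at a time.
    """
    if width < 1:
        raise ValueError("width must be a positive integer")
    parts = []  # sequence fragments of the record currently being read
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # Flush the previous record's sequence before the new header.
            yield from _chunks("".join(parts), width)
            parts = []
            yield line
        elif line.strip():
            parts.append(line.strip())
    # Flush the sequence of the last record.
    yield from _chunks("".join(parts), width)
```

To convert a file, iterate over `wrap_fasta(open("input.fa"))` and write each yielded line followed by a newline to the output file (`input.fa` is again a placeholder).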

Juke34 · 2020-09-07T09:07:25Z

See here for discussion with bioperl team: bioperl/bioperl-live#345

oushujun · 2022-05-02T15:04:46Z

I see BioPerl does not have a plan to fix this issue. Here is a Perl alternative to the fastx_toolkit, written by Ning Jiang: https://github.com/oushujun/LTR_retriever/blob/master/bin/fasta-reformat.pl. It's slower but free of third-party dependencies.

Juke34 changed the title ~~Line length limit on input FASTA file (agat_sp_extract_sequences.pl)~~ Line length limit on input FASTA file: 65,536 characters (limit imposed by bioperl) Sep 7, 2020

pmagwene mentioned this issue Jul 16, 2021

FASTA files with very large unwrapped records generate exceptions in agat_sp_extract_sequences.pl #150

Closed

Juke34 added the Info FYI label Jul 29, 2021

Juke34 closed this as completed Oct 27, 2022

kenji-yt mentioned this issue Aug 16, 2024

Each line of the file must be less than 65,536 characters bioperl/bioperl-live#345

Open
