-
Notifications
You must be signed in to change notification settings - Fork 0
Sequence shuffler and other sequence-related utilities.
License
bjmt/sequence-utils
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
sequence-utils ============== Installation ------------ git clone https://github.com/bjmt/sequence-utils cd sequence-utils make The following binaries are created: bin/countfa Count length of sequences inside a fasta file bin/countlets Count k-lets in a string bin/countwin Count k-lets by window in a string bin/seqgen Generate a random string bin/shuffler Shuffle a string or sequences inside a fasta file Run these with the -h flag to see usage. countfa ------- Counts the number of characters per sequence in a fasta file. For each sequence, the name followed by the character count are returned to stdout. The aim is to count sequence lengths without taking up too much memory. To this end, only one character is loaded into memory at a time. Example usage: echo ">1\nACAAG\n>2\nGCCCGGTTAT" | bin/countfa >1 5 >2 10 countlets --------- This utility counts the total number of k-lets in the input sequence. Be aware that the total number of k-lets is n^k, where n is the alphabet length. For use cases involving memory constraints, providing the sequence alphabet ahead of time will allow countlets to count k-lets while only needing to load k + 1 letters into memory at a time. When the alphabet is provided, it will typically never take up more than several MBs of memory. Optionally, k-lets with counts of zero can be ommitted from the output. Example usage: bin/countlets -k 1 -i example/sequence.txt A 3301 C 1572 G 1619 T 3508 bin/countlets -k 1 -a ACGT -i example/sequence.txt A 3301 C 1572 G 1619 T 3508 echo ACGTGA | bin/countlets -k 4 -n ACGT 1 CGTG 1 GTGA 1 countwin -------- Count k-lets in the input string in windows. The window and step sizes can be controlled. The result is a tsv-formatted table with columns START, STOP, LET, and COUNT. Optionally, rows where the COUNT column is zero can be ommitted from the output. Example usage: echo ACGTGTGA | bin/countwin -a ACGT -k 3 -s 1 -n START STOP LET COUNT 1 3 ACG 1 2 4 CGT 1 3 5 GTG 1 4 6 TGT 1 5 7 GTG 1 6 8 TGA 1 bin/countwin -i example/sequence.txt -a ACGT -k 1 -w 5000 START STOP LET COUNT 1 5000 A 1607 1 5000 C 815 1 5000 G 828 1 5000 T 1750 5001 10000 A 1694 5001 10000 C 757 5001 10000 G 791 5001 10000 T 1758 seqgen ------ Create random sequences from any alphabet. Letters can be made up of any number of characters. Weights can be provided to modify random generation. If any of the input letters contain spaces, use quotation marks. New letters are streamed directly to the output, meaning memory usage is independent of output sequence length. Use commas to separate letters with multiple characters. Example usage: bin/seqgen -a ACGT -l 10 -s 11 CGAACTATTC bin/seqgen -a ACGT -l 1000 -s 1 | bin/countlets A 246 C 258 G 233 T 263 bin/seqgen -a "A,B,CD, " -l 10 -s 3 BCD BBBAAB bin/seqgen -a ACGT -l 10 -s 11 -w 1,0.5,0.5,1 TTAGAATTTT bin/seqgen -a ACGT -l 1000 -s 11 -w 1,0.5,0.5,1 | bin/countlets A 338 C 150 G 169 T 343 shuffler -------- Shuffles an input sequence using one of three methods. The default, euler, preserves exact k-let counts by performing a random Eulerian walk through the sequence. The implementation is based on the concept proposed by Altschul and Erickson (1985) as well as the cycle-popping algorithm proposed by Propp and Wilson (1998) for finding randomised Eulerian paths. One warning regarding this method: the number of vertices is equal to n^k, where n is the alphabet length. This means that if trying to shuffle a DNA sequence with k = 15, there will be over one billion vertices; and every vertex present in the input sequence has to be walked to the ending vertex, and those walks could potential be quite long themselves. Once a new Eulerian path has been found, the process is quite fast regardless of k; but the program may stall for quite some time trying to find such a path. The second method, linear, splits the sequence every k letters before shuffling these around. See: AAACAGATT After splitting (k = 2): AA AC AG AT T 1 2 3 4 r The k-let indices are shuffled, and the shuffled sequence reassembled from the split k-lets by their new index position. Any remainder letters which do not fit in a full k-let are left at the end. The third method, markov, works by first getting the total number of all possible k-lets. Following this, a new sequence (of equal length) is created. Every new letter is chosen based on the probability of that letter (and the last k - 1 letters) appearing in a k-let in the original sequence. This results in a new sequence with similar (but not identical) k-let counts. The idea for this method of (pseudo-)shuffling is discussed by Fitch (1983). Note: these methods only apply for k > 1. Otherwise, a simple shuffle call is performed. Example usage: euler shuffling: bin/shuffler -i example/sequence.txt -k 2 | bin/countlets A 3301 C 1572 G 1619 T 3508 markov shuffling: bin/shuffler -i example/sequence.txt -k 2 -s 1 -m | bin/countlets A 3326 C 1500 G 1613 T 3560 linear + fasta-formatted shuffling: echo ">seq\nASCASCASDASCASDASDASC" | bin/shuffler -k 2 -l -s 1 -f >seq CAASASASSCSCDAASSDDAC repeatedly shuffle input: echo AAAACCCCGGGGTTTT | bin/shuffler -n 4 ACAATAGTGCTCTCGG CACCGAGATCTTGGAT TCATGCATACGGAGTC AGATAGGAGCCTTCCT TAAATCGAGGCGTTCC A note on memory usage: This program is not terribly memory efficient, usually requiring memory several times the size of the input sequence. This means shuffling billions of characters is not recommended unless you don't mind letting shuffler take up several GBs of memory. Comparison with the command line version of uShuffle (Jiang et al. 2008): Pros: Cons: - Additional linear and markov - Memory usage does not scale as shuffling methods well with high k (>8) - Input/output can be read/written from/to disk or piped - Shuffles both strings and fasta files - No limit on string/sequence size (uShuffle is limited to ARG_MAX) shuffler VS uShuffle benchmarks (using GNU Time): A random DNA string with 100,000 characters was used as input. (e = euler, l = linear, m = markov) k Program Peak memory (kB) Elapsed time (s) -- ------------ ---------------- ---------------- 1 uShuffle 1036 0.00 shuffler 1872 0.01 2 uShuffle 3796 0.00 shuffler (e) 3184 0.02 shuffler (l) 2272 0.01 shuffler (m) 2672 0.03 3 uShuffle 3796 0.01 shuffler (e) 3504 0.02 shuffler (l) 2144 0.01 shuffler (m) 2672 0.02 4 uShuffle 3780 0.01 shuffler (e) 3728 0.02 shuffler (l) 2080 0.00 shuffler (m) 2736 0.03 5 uShuffle 3792 0.01 shuffler (e) 3808 0.02 shuffler (l) 2032 0.01 shuffler (m) 2704 0.03 6 uShuffle 3820 0.01 shuffler (e) 3824 0.03 shuffler (l) 2016 0.00 shuffler (m) 2880 0.04 7 uShuffle 3908 0.02 shuffler (e) 4224 0.03 shuffler (l) 1984 0.00 shuffler (m) 3792 0.04 8 uShuffle 4296 0.05 shuffler (e) 5248 0.04 shuffler (l) 1984 0.01 shuffler (m) 8112 0.05 9 uShuffle 5384 0.08 shuffler (e) 12080 0.05 shuffler (l) 1872 0.00 shuffler (m) 28896 0.08 10 uShuffle 6380 0.10 shuffler (e) 39600 0.09 shuffler (l) 1872 0.00 shuffler (m) 107984 0.13 shuffler (euler) is quite competitive with uShuffle in terms of elapsed time and memory usage at low k (1-8). Above that shuffler becomes a bit inefficient; this is because shuffler creates a table of all possible k-lets (vertices) whereas uShuffle only keeps those found in the sequence in memory. Though realistically, most shuffling tasks do not require such high k values. References ---------- Altschul, S.F. and Erickson, B.W. (1985). Significance of Nucleotide Sequence Alignments: A Method for Random Sequence Permutation That Preserves Dinucleotide and Codon Usage. _Molecular Biology and Evolution_, 2(6), 526-538. Fitch, W.M. (1983). Random Sequences. _Journal of Molecular Biology_, 163(2), 171-176. Jiang, M., Anderson, J., Gillespie, J. and Mayne, M. (2008). uShuffle: A Useful Tool for Shuffling Biological Sequences While Preserving the k-let Counts. _BMC Bioinformatics_, 9(192). Propp, J.G. and Wilson, D.W. (1998). How to Get a Perfectly Random Sample from a Generic Markov Chain and Generate a Random Spanning Tree of a Directed Graph. _Journal of Algorithms_, 27(2), 170-217.
About
Sequence shuffler and other sequence-related utilities.
Topics
Resources
License
Stars
Watchers
Forks
Packages 0
No packages published