OrthoMCLight is an edited version of OrthoMCL v1.4 that requires less memory and has a faster runtime. It is a genome-scale algorithm for grouping orthologous protein sequences.
VIB-PSB/OrthoMCLight
OrthoMCL: Ortholog Clustering Analysis
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

AUTHOR: Feng Chen <fengchen@sas.upenn.edu>, Li Li
## EDITED BY: Cecilia Sensalari and Michiel Van Bel
## OrthoMCLight [2024-04-17]
Edited OrthoMCL Version 1.4
Copyright (C) 2004~2006 by University of Pennsylvania, Philadelphia, PA USA.
All rights reserved.

Before orthomclight.pl can be used, some variables (directory variables and
parameter variables) in orthomclight_module.pm need to be set, as described
in README.

Reference:
1. Li Li, Christian J. Stoeckert, Jr. and David S. Roos.
   OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes.
   Genome Research 13:2178-2189, 2003
   http://www.genome.org/cgi/content/full/13/9/2178
2. Feng Chen, Aaron J. Mackey, Christian J. Stoeckert, Jr., and David S. Roos.
   OrthoMCL-DB: querying a comprehensive multi-species collection of
   ortholog groups.
   Nucleic Acids Res. 2006 34: D363-8.
   http://nar.oxfordjournals.org/cgi/content/full/34/suppl_1/D363

Preconditions: A Unix system, Perl, and some experience with Unix & Perl


1. Installation of required software and Perl modules
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

OrthoMCL is a Perl script and does not need compilation. However, it
requires the software and Perl modules listed below to run:

Software:
 1. BLAST (NCBI-BLAST, WU-BLAST, etc.)
*2. MCL (Markov Clustering algorithm), available at http://micans.org/mcl/.
    NOTE: MCL changed its output format recently, and the new format is not
    compatible with OrthoMCL. Please use the MCL-02-063 version enclosed
    with this package, which has been the default for all test analyses.

Perl Modules:
 1. Storable
    Note: OrthoMCLight does not need Bio::SearchIO anymore.


2. Setting the variables in "OrthoMCL/orthomclight_module.pm"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Most global variables used in "OrthoMCL/orthomclight.pl" need to be set in
the Perl module "OrthoMCL/orthomclight_module.pm".

--- $PATH_TO_ORTHOMCL: the orthomcl directory itself
    (example: $PATH_TO_ORTHOMCL = "/disk3/fengchen/orthomcl/";)
--- $BLASTALL: your BLAST software
    (example: $BLASTALL = "/genomics/share/bin/blastall";)
--- $BLAST_FORMAT: how the BLAST result is stored
    Options: a) "compact" corresponds to NCBI-BLAST's -m 8
             b) "full" corresponds to NCBI-BLAST's -m 0
    # for WU-BLAST, make changes in subroutine executeBLASTALL
--- $BLAST_NOCPU: the number of CPUs
    On a multi-processor machine, setting it higher than 1 will save
    significant time in the BLAST step
--- $FORMATDB: your FORMATDB software
    (example: $FORMATDB = "/genomics/share/bin/formatdb";)
--- $MCL: your MCL software
    (example: $MCL = "/disk2/fengchen/mcl-02-063/shmcl/mcl";)
--- $MAX_WEIGHT_DEFAULT: weight used for protein pairs whose BLAST p-value
    is zero (0). This depends on the algorithm you use: if the second
    smallest p-value (the smallest non-zero one) has an exponent on the
    order of -99, maximum_weight should be 100; if -299, maximum_weight
    should be 300 <DEFAULT>.

Once the variables $PATH_TO_ORTHOMCL, $BLASTALL, $FORMATDB and $MCL are set,
you can run orthomclight.pl on a three-species test set:

    % orthomclight.pl --mode 1 --fa_files "Ath.fa,Hsa.fa,Sce.fa"

Note: The test set Ath.fa, Hsa.fa and Sce.fa contains only 15, 16 and 11
sequences, respectively. It was selected to confirm that everything is set
up and that OrthoMCL can run on your machine. Since it takes OrthoMCL a long
time to finish clustering a big data set, from BLAST to MCL, it is wise to
try a very small set first.

To use OrthoMCL on your own data, collect protein FASTA files ".fa" (each
".fa" file representing one species only, and having a simple name, e.g.
"Eco.fa") and put them in the directory "data", or reset the following
variable:

--- $ORTHOMCL_DATA_DIR: the data directory to store the FASTA files
    $ORTHOMCL_DATA_DIR = $PATH_TO_ORTHOMCL."/data/"; (DEFAULT)


3. Running OrthoMCL
~~~~~~~~~~~~~~~~~~~

In theory, the COMPLETE proteome data for each species should be chosen.
You should also have enough memory (>=800MB) if you have around 100,000
sequences to cluster, because this stand-alone version tries to read the
BLAST information into memory.

There are five modes to run OrthoMCL, each with a different process. We
strongly suggest using MODE 4 for very big sets, since BLAST was not
programmed to run in parallel. For mode 4 you only need to prepare two
files, a BPO file and a GG file. It is also fast: our test set of 200,000
sequences took 8 hours to finish on a Mac G5 computer.

The five modes of OrthoMCL are:

Mode 1: OrthoMCL analysis from FASTA files. OrthoMCL runs from the initial
        BLAST through to the final MCL.
        Example:
        % orthomclight.pl --mode 1 --fa_files Ath.fa,Hsa.fa,Sce.fa

Mode 2: OrthoMCL analysis based on a former OrthoMCL run (the former run
        directory needs to be given). Use it if you want to change the
        inflation parameter, the p-value cutoff (which can only be lower
        than the BLAST p-value cutoff of your former run), the percent
        identity cutoff or the percent match cutoff.
        No BLAST or BLAST parsing performed.
        Example:
        % orthomclight.pl --mode 2 --former_run_dir Sep_8 --inflation 1.4

Mode 3: OrthoMCL analysis from a user-provided BLAST output file and a
        genome gene relation (GG) file telling which genome has which gene
        (please refer to 5. File Format in OrthoMCL).
        No BLAST performed.
        Example:
        % orthomclight.pl --mode 3 --blast_file AtCeHs_blast.out --gg_file AtCeHs.gg --num-threads 4

Mode 4: OrthoMCL analysis from a user-provided BPO (BLAST PARSING OUT) file
        and a GG (genome gene relation) file telling which genome has which
        gene (please refer to 5. File Format in OrthoMCL).
        No BLAST or BLAST parsing performed.
        Example:
        % orthomclight.pl --mode 4 --bpo_file AtCeHs.bpo --gg_file AtCeHs.gg --num-threads 4

Mode 5: OrthoMCL analysis based on a previous run, but with fewer taxa
        included or with only the inflation value changed (FASTER than
        mode 2, no selection on reciprocal best/better hits performed).
        Example:
        % orthomclight.pl --mode 5 --former_run_dir Sep_8 --taxa_file AtCeHs.gg --inflation=1.1 --num-threads 4


4. OrthoMCL Arguments
~~~~~~~~~~~~~~~~~~~~~

fa_files=<String>        Protein FASTA file names, separated by commas, with
                         each file containing protein sequences from one
                         species (e.g. "Eco.fa,Sce.fa,Afu.fa")
pv_cutoff=<Float>        P-Value or E-Value Cutoff in BLAST search and/or
                         ortholog clustering, 1e-5 (DEFAULT).
pi_cutoff=<Int>          Percent Identity Cutoff <0-100> in ortholog
                         clustering, 0 (DEFAULT).
pmatch_cutoff=<Int>      Percent Match Cutoff <0-100> in ortholog
                         clustering, 0 (DEFAULT).
inflation=<Float>        Markov Inflation Index, used in the MCL algorithm,
                         1.5 (DEFAULT). Increasing this index increases
                         cluster tightness and the number of clusters.
former_run_dir=<String>  Former run directory, required in Mode 2, e.g.
                         "July_21". The BLAST result file and BPO file in
                         the former run directory will then be used instead
                         of running from the very beginning.
blast_file=<String>      BLAST out file provided by user, required in
                         Mode 3. It will be parsed into a BPO file and
                         further used for ortholog clustering.
bpo_file=<String>        BPO (Blast Parse Out) file provided by user,
                         required in Mode 4. Please refer to
                         ORTHOMCL_INSTALL about its format.
gg_file=<String>         GG (Genome Gene mapping) file provided by user,
                         required in Mode 3 & 4. Please refer to
                         ORTHOMCL_INSTALL about its format.
taxa_file=<String>       TAXA file provided by user, required in Mode 5.
                         Please refer to ORTHOMCL_INSTALL about its format.
num_threads=<Int>        Number of threads when running the MCL command.


5. File Format in OrthoMCL
~~~~~~~~~~~~~~~~~~~~~~~~~~

The following files are generated during an OrthoMCL run:

all_orthomcl.out    Final result showing how genes are clustered
all_blast.bbh       List of reciprocal best/better hits, for further
                    reference
parameter.log       Parameter log file recording all parameters
orthomcl.log        OrthoMCL log file
tmp/all.fa          FASTA file containing protein sequences from all species
tmp/all.gg          Genome gene relation file, telling which genome contains
                    which gene (format shown below)
tmp/all_blast.out   All-against-all BLAST out file
tmp/all.bpo         BLAST PARSING OUT file from all_blast.out
tmp/all_bpo.idx     Stores a data array, used in BLAST query
tmp/all_bpo.se      Stores a hash, used in BLAST query
tmp/all_ortho.mtx   Matrix file storing all weights
tmp/all_ortho.idx   Mapping between gene IDs and the indexes used in the
                    matrix
tmp/all_ortho.mcl   MCL clustering result


FORMAT of all.gg or "usr_gg_file"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Ath: At1g01190 At1g01280 At1g04160 ...
Hsa: Hs10834998 Hs10835119 Hs10835271 ...
Sce: YAL029c YAR009c YAR010c YHR023w ...

Each line stands for one genome. Each line starts with the genome name,
followed by a colon ":", and then by all the gene IDs separated by a
space " ".


FORMAT of all.bpo or "usr_bpo_file"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1;At1g01190;535;At1g01190;535;0.0;97;1:1-535:1-535.
2;At1g01190;535;At1g01280;510;2e-56;29;1:69-499:28-474.
3;At1g01190;535;At1g11600;510;1e-45;27;1:59-531:21-509.

Each line represents one query-subject similarity relation. The fields are
separated by ";" and are, in order: similarity id, query id, query length,
subject id, subject length, BLAST E-value, percent identity, HSP info (each
HSP is in the format HSP_id:query_start-query_end:subject_start-subject_end;
different HSPs are separated by ".").

IMPORTANT:
1. The similarity ID represents the BPO file line id, so it should start
   from 1 for the first line and be consecutive for the whole file.
2. The BPO file is a parsing result from BLAST, so for each query gene id,
   its hits can't be scattered in the file but should be listed in adjacent
   lines.
3. For BLAST m8 format (i.e. $BLAST_FORMAT="compact"), sequence length
   information is not stored. So when running OrthoMCL in mode 3, the
   corresponding columns (i.e. query length and subject length) will be 0.
   Please do not use the percent match cutoff in this case.


FORMAT of user taxa list file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This file is simple: you can use a file in the same format as the GG file,
or a file with each taxa id listed on a separate line. (Each taxa id should
be defined in the GG file.)

I attached sample GG and BPO files in directory "sample_data".

Please report any bug to us. Thanks a lot.

**********************BLAST result truncation problem***********************
I got several emails reporting a "division by zero" error during the BLAST
parsing process. If you happen to meet this as well, it means that some of
your BLAST result is truncated, so that some hits are not listed in the
alignment part. In this case, set $blast_flag{'hsp'}=0 in orthomclight.pl
lines 48-60, which disables HSP parsing. You then cannot use any percent
identity cutoff or percent coverage cutoff, since your data lack such
information.
****************************************************************************

Feng Chen
fengchen@sas.upenn.edu
Roos Lab
Department of Biology
University of Pennsylvania
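
When preparing your own GG and BPO files for mode 4, it can help to check
them against the format rules described above before starting a long run.
The following is a minimal sketch of such a check in Python. It is not part
of OrthoMCLight: the function names parse_gg and check_bpo are hypothetical,
and the check covers only the field count and the first two IMPORTANT rules
(consecutive similarity IDs starting from 1, and adjacent hits per query).

```python
from typing import Dict, Iterable, List


def parse_gg(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Parse GG lines of the form 'Genome: gene1 gene2 ...'.

    Hypothetical helper, not part of OrthoMCLight.
    Returns a dict mapping each genome name to its list of gene IDs.
    """
    genomes: Dict[str, List[str]] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # Genome name comes before the first colon, gene IDs after it.
        name, _, genes = line.partition(":")
        genomes[name.strip()] = genes.split()
    return genomes


def check_bpo(lines: Iterable[str]) -> List[str]:
    """Return a list of problems found in BPO lines (empty list means OK).

    Hypothetical helper, not part of OrthoMCLight. Checks:
      - each line has 8 ';'-separated fields
      - similarity IDs start at 1 and are consecutive
      - all hits of one query are on adjacent lines
    """
    problems: List[str] = []
    seen_queries = set()
    previous_query = None
    expected_id = 1
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        fields = line.split(";")
        if len(fields) != 8:
            problems.append(
                f"line {expected_id}: expected 8 fields, got {len(fields)}"
            )
            expected_id += 1
            continue
        sim_id, query = fields[0], fields[1]
        if sim_id != str(expected_id):
            problems.append(
                f"line {expected_id}: similarity id is {sim_id}, "
                f"expected {expected_id}"
            )
        if query != previous_query:
            # A query that reappears after another query means its hits
            # are scattered in the file.
            if query in seen_queries:
                problems.append(
                    f"line {expected_id}: hits of query {query} are scattered"
                )
            seen_queries.add(query)
            previous_query = query
        expected_id += 1
    return problems


if __name__ == "__main__":
    sample_bpo = [
        "1;At1g01190;535;At1g01190;535;0.0;97;1:1-535:1-535.",
        "2;At1g01190;535;At1g01280;510;2e-56;29;1:69-499:28-474.",
        "3;At1g01190;535;At1g11600;510;1e-45;27;1:59-531:21-509.",
    ]
    for problem in check_bpo(sample_bpo):
        print(problem)
```

To check real files, pass an open file handle instead of a list, for example
check_bpo(open("AtCeHs.bpo")). Reading line by line keeps memory use low even
for large BPO files.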