How to use
In this section, we will look at how to use each of CATE's functions.
By reading this section the end user will know the following:
- How to execute each function of CATE.
- How to prepare the prerequisites that a specific function might need.
- What CATE's CLI prompts are and what they mean.
- How to understand errors and how to solve them.
Before we dive into the functions we will start by making ourselves familiar with CATE's prompts and other end-user information.
CATE is designed to be executed through the command line and will run without any further input from the user. However, CATE will continue to provide the user with a series of messages updating the user on its progress and execution.
First, we will look at the simplest form of such a message, through the image below.
As shown CATE's typical output can be divided into 5 regions. Let us look at regions 2, 3, 4, and 5 in a bit more detail to get a better understanding of what these messages mean.
Region 2
Region 2 describes the location of the output and intermediate folders. These are specified via the parameters file which we will look at later in the parameter file sections.
In the given example these folders already exist. If they do not, CATE will automatically create them in the user-specified location and indicate this with a folder-created message.
In certain instances, CATE does not require an intermediate file. In such instances, its information will not be displayed.
Region 3
Following Region 2, Region 3 indicates the details of the selected function. All functions using a CUDA GPU will have their names preceded by the statement CUDA powered.
Region 4
Region 4 states the calculation type. It can take two main forms:
- FILE mode
- WINDOW mode
FILE mode is available for all tests and requires a gene file. The intricacies of this file are discussed later in this section, under the gene file and calculation mode discussions. WINDOW mode is available for the neutrality test functions (Tajima's D, Fu and Li's, and Fay and Wu's test statistics) and for the Fixation Index.
In the example above, FILE mode was used; therefore, CATE shows the location of the gene file being read. In WINDOW mode, it will instead state that the calculation mode is WINDOW, along with the window and step sizes.
Region 5
Before executing the selected function's statistics, CATE will display the details of the user-specified GPU device. This serves two purposes. First, the user will know the details of the selected device, enabling them to confirm that the correct device was selected. Second, it helps the user confirm that CATE is communicating properly with the GPU.
The instance above is just an example of a typical CATE run cycle. These outputs will change and you will get error messages depending on your job parameters and the query being processed. In the following sub-sections of this "How to use" section, we will look at the execution of the different functions on CATE using various configurations and the meanings of the different outputs and error messages you might encounter.
As discussed in the Software Overview, CATE has 17 functions. Here we will discuss how to execute all of them, one function at a time. We will start with the simplest functions, the helper functions, followed by the complementary tools, before moving on to the most important implementations, the evolutionary test functions.
Before starting off looking into the functions we have one stop to make. This is looking into CATE's parameter file and gene file and how to make them for yourself.
CATE has a proprietary parameters file. It is written in JSON format. The passing of parameters for specific functions into CATE will be done using this parameters file. A complete example of the parameters.json file can be found in CATE's GitHub repository. CATE also has a built-in function to automatically create a standard version of this file called Print sample parameter file. We will discuss the required parameters that have to be added to the parameter file based on each function when we get to those functions. For now, let's keep in mind that this file exists.
The gene file is a tab-delimited text file. In its most basic form, it will consist of two columns. The gene file is used to tell CATE what regions of the genome to analyze.
The information retained in the two columns is as follows:
- Column 1: The gene/query region's names
- Column 2: The coordinates/positions of the query regions in genomic space.
Column 1 is simple enough, it can be something as simple as "gene_1".
Column 2 is slightly more tricky. It will contain information regarding the query region's chromosome and start and stop positions based on base pairs. This information is separated using colons (:).
So if we want to analyze the region of the Cystatin C (CST3) protein situated in chromosome 20 between the positions of 23,626,706 and 23,638,473 base pairs it would be written in our gene file as:
- CST3 20:23626706:23638473
Note: avoid any form of spaces in the gene file. Use only the tab delimiters that separate the columns.
The next query region's information will occupy the next line. In this manner, you can specify an unlimited list of regions to be analyzed by CATE. A sample gene list file used to analyze all known regions of Chromosome 1 in the Human Genome built according to Genome Reference Consortium Human Build 37 (GRCh37) is provided in CATE's GitHub repository.
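As a minimal sketch of the gene file layout described above, the Python snippet below builds a two-line gene file in memory and parses it back. The helper names (`parse_gene_line`, `read_gene_file`) and the second region (`gene_1`) are hypothetical and used only for illustration; this is not CATE's own parser.

```python
import csv
import io

def parse_gene_line(name, coordinates):
    """Split a 'chromosome:start:stop' coordinate string into typed fields."""
    chromosome, start, stop = coordinates.split(":")
    # The chromosome stays a string because names such as "X" are not numeric.
    return name, chromosome, int(start), int(stop)

def read_gene_file(handle):
    """Read a two-column, tab-delimited gene file into a list of regions."""
    reader = csv.reader(handle, delimiter="\t")
    return [parse_gene_line(row[0], row[1]) for row in reader if row]

# One region per line; the second entry is a made-up region for illustration.
gene_file_text = "CST3\t20:23626706:23638473\ngene_1\t1:1000:2000\n"
regions = read_gene_file(io.StringIO(gene_file_text))
for region in regions:
    print(region)
```

Checking a gene file this way before a run can catch stray spaces or missing colons early.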
If you have already looked at one of the parameters files in our repository, you will have noticed that some parameters will be common to pretty much all the functions, maybe except the helper functions.
We'll look at these parameters now:
"CUDA Device ID"
This parameter specifies the CUDA device to be used. At present, CATE can only use one CUDA device at a time. If a system has more than one CUDA device, the user can look up the ID of the device they want CATE to use and assign it via this parameter. If you do not know the ID of your CUDA device, you can simply use CATE's CUDA device list function. By default, the CUDA device ID is zero (0).
"Input path"
This parameter specifies the location of CATE's proprietary file structure. CATE's file structure will be created using its VCF splitter function.
"Output path"
This parameter specifies the directory to which CATE's results will be written.
"Intermediate path"
This parameter specifies the directory in which CATE's intermediate data will be stored. It holds data that might be used to resume CATE after an unscheduled termination, or data that might be needed later on during the execution of a function.
"Ploidy"
This parameter specifies the number of sets of chromosomes present in an organism in the sample. For example, diploid organisms such as humans will have a ploidy of 2, and viruses that are haploid will have a ploidy of 1.
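The common parameters above could be collected into a JSON file along the following lines. This is only a sketch: the parameter names are taken from the text, but the paths are placeholders and the value types (number versus string) are assumptions. The sample parameters file in CATE's GitHub repository remains the authoritative reference.

```python
import json

# Paths are placeholders; the value types (number vs. string) are assumed.
common_parameters = {
    "CUDA Device ID": 0,
    "Input path": "cate_file_structure",
    "Output path": "results",
    "Intermediate path": "intermediate",
    "Ploidy": 2,
}

# Serialise the dictionary; this text could be saved as parameters.json.
parameters_text = json.dumps(common_parameters, indent=4)
print(parameters_text)
```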
And finally, we are ready to dive into the execution of the functions.
In this section, we will first look at what each function does and how each function is executed in the command line. This is usually the easiest part to understand because it is usually just the function's argument followed by the location of our parameters.json file. Where needed we will provide snippets of outputs that we feel are important.
For all of these, we assume that our CATE executable is named simply "CATE".
CATE's functions' arguments are NOT caSe-sEnsitive.
However, as noted earlier, the argument alone is not always enough. Where needed, we will therefore also look at which prerequisites are required and how they can be prepared before the function is executed.
Let us start with the simplest of functions that require only the function's argument to run. This is a very good place to start.
The "Help menu" function prints CATE's current help menu. It will contain a detailed list of all of CATE's functions that are currently available including their arguments for execution and the literature from which they have been adopted. It will also state the various features available in CATE as well as how to use each function and activate features.
Function execution:
./CATE -h
or ./CATE --help
The "CUDA device list" function lists all available CUDA devices that are present in the current system. From this list, the user can get the "CUDA Device ID" to configure the parameter file in CATE. This is listed as GPU number as shown in the sample output below.
Function execution:
./CATE -c
or ./CATE --cuda
Sample output:
If there is more than one CUDA device they will be listed sequentially.
This is also a good time to show that all successful program executions are indicated by the statement:
Program has completed its run.
CATE's complementary tools enable it to become a software suite for genomic data processing. These functions provide the user with some genomic data processing capabilities so that they are able to perform certain tasks without switching software.
Let us start with the most important of the complementary tools in CATE.
There is a very good reason as to why the VCF Splitter is CATE's most important function. CATE requires a very specific file structure to function. To create this file structure we use the VCF splitter tool. But first, let us learn a bit about what this file structure is.
CATE functions by splitting a VCF file into segments. This allows us to overcome a major drawback that exists with the processing of large VCF files, sequential data access. Sequential data access is essentially reading a file one line at a time from the beginning till the desired data point is reached. Due to CATE's file structure and algorithms, it allows us to randomly access the file structure and data points on the VCF. A schematic overview of this file structure is shown below.
Now that we have a rough understanding of the importance of the file structure and what it looks like, let's see how we can create it.
The VCF Splitter has two modes and we will look at them one at a time:
- CHR
If a VCF contains data on multiple chromosomes, or sample columns with fields other than GT, we must use the CHR mode before creating the segmented file structure with the CTSPLIT mode below. The CHR mode splits a large VCF containing multiple chromosomes into one VCF per chromosome and extracts only the GT field from the sample columns, because CTSPLIT needs only the GT field. So if your VCF file contains more than one chromosome's data, sample data with fields other than GT (check the FORMAT column), or both, run the CHR mode first. It can also be used to extract variants by the number of reference or alternate alleles present.
Knowing what the CHR mode does let's see how we can set it up. We start with the configuration of the parameters.json file. Here we fill out the following parameters (you can always refer to the sample parameter file):
"CUDA Device ID"
"Intermediate path"
"Output path"
"Ploidy"
Since we have looked at the above parameters before, we won't go through them again. In this instance, the input path can contain multiple VCFs. Now let's look at the new parameters below.
"Input path"
The folder containing the VCF files to be split.
"Split mode: CHR"
We will set the Split mode to CHR. This will tell CATE to execute this mode of the VCF Splitter function.
"CHR individual summary"
CHR individual summary can be either Yes or No. CATE is able to provide a ratio of coverage for each sample per chromosome. This will tell you the extent of allelic information present in each sample for a chromosome. This summary file will be created at the end of the program execution and will have a *.summary extension.
An example of the file is present in CATE's GitHub repository. It shows the coverage by each individual with its ID for each chromosome that was present in the full VCF. There was information on 2 chromosomes in the original VCF.
"Split cores"
This will enable CATE to know how many CPU cores it will have access to when executing the VCF Splitter function.
"Split SNPs per_time_CPU"
This will enable CATE to know how many variants from the VCF file should be loaded into RAM at once.
A good rule of thumb: if you have a slower hard disk (for instance, not an SSD) and a large amount of RAM, set this to a high value, for instance "500000", so that less time is spent waiting on hard disk access.
"Split SNPs per_time_GPU"
This will enable CATE to know how many variants from the RAM will be processed by the GPU at a time. Helps to prevent overloading of the GPU.
We have tested this at about 100000.
"Reference allele count"
This will enable CATE to know if variants should be filtered by the number of reference alleles.
"Alternate allele count"
This will enable CATE to know if variants should be filtered by the number of alternate alleles.
If the "Reference allele count" and the "Alternate allele count" are both set to 1, you can directly extract the SNPs for each chromosome.
CATE will create a folder in the output directory named after the VCF file, and the VCFs split by chromosome will be written into that folder. Each of these VCFs will be named after its chromosome. Finally, if the individual summary was selected, CATE will write the summary data into a *.summary file named after the input VCF.
- CTSPLIT
The CTSPLIT mode will create CATE's unique file hierarchy. For this, the VCF file being processed must only have a single chromosome's data and only information on the GT column. This can be achieved using the CHR mode explained above.
Additionally, CTSPLIT requires a population file, which specifies the population of each sample in the VCF file. This is a tab-delimited file, which has to have a minimum of two columns and it is assumed that the columns have headings.
The information retained in the two columns is as follows:
Column 1: Sample ID, this column should contain all the IDs of the samples present in the VCF.
Column 2: Population ID, this column must have the name of the population the sample belongs to.
The purpose of the population file is to specify which samples belong to which population, so that CATE can carry out evolutionary tests by population. Every sample in the VCF file must have a population assigned to it. Besides splitting the sample data by population, this also serves as an opportunity to discard unwanted samples' data. Even then, each unwanted sample must still be given a population ID; you can discard that population's folder later. If you want all the samples in a VCF to be in the same population, simply assign them all the same ID in the population file.
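To make the population file layout concrete, here is a small sketch that reads a tab-delimited population file with a heading row and groups sample IDs by population. The function name, sample IDs and population names are hypothetical, and the column numbers are counted from 1. It illustrates the file format only, not how CATE processes it internally.

```python
import csv
import io
from collections import defaultdict

def group_samples_by_population(handle, sample_column, population_column):
    """Map each population ID to its sample IDs, using 1-based column numbers."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the heading row the population file is assumed to have
    populations = defaultdict(list)
    for row in reader:
        if not row:
            continue
        populations[row[population_column - 1]].append(row[sample_column - 1])
    return dict(populations)

# Hypothetical population file; the sample and population IDs are made up.
population_text = (
    "sample\tpopulation\n"
    "S1\tPOP_A\n"
    "S2\tPOP_A\n"
    "S3\tPOP_B\n"
)
groups = group_samples_by_population(io.StringIO(population_text), 1, 2)
print(groups)
```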
Knowing what the CTSPLIT mode does let's see how we can set it up. We start with the configuration of the parameters.json file. Here we fill out the following parameters (you can always refer to the sample parameter file):
"CUDA Device ID"
"Input path"
"Intermediate path"
"Output path"
"Ploidy"
"Split cores"
"Split SNPs per_time_CPU"
"Split SNPs per_time_GPU"
"Reference allele count"
"Alternate allele count"
Since we have looked at the above parameters before, we won't go through them again. In this instance, the input path can contain multiple VCFs. Now let's look at the new parameters below.
"Population file path"
"Sample_ID Column number"
"Population_ID Column number"
Let's look at these three parameters together as they all relate to the same thing, the population file.
We use "Population file path" to specify the location of the aforementioned population file. A sample of such a population file from the 1000 Genomes Project is provided in CATE's repository.
The "Sample_ID Column number" refers to the column containing the IDs of the samples in the VCF. In the given example this is column 1, so the value will be 1.
The "Population_ID Column number" refers to the column containing the IDs of the populations that each sample belongs to. In the given example this is column 6, so the value will be 6.
"SNP count per file"
This value will tell CATE the maximum number of filtered SNPs that will be present in a file segment.
"MAF frequency"
This value will tell CATE the MAF (Minor Allele Frequency) value with which to filter SNPs. If populations are present then each population's MAF will be determined separately.
"Frequency logic"
This can be any one of five comparison operators. They are as follows:
- "=" equal to the set MAF.
- ">" greater than the set MAF.
- "<" less than the set MAF.
- ">=" greater than or equal to the set MAF.
- "<=" less than or equal to the set MAF.
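The meaning of the five "Frequency logic" values can be sketched in Python as below. The function name `passes_maf_filter` and the MAF values are hypothetical; the snippet only illustrates the comparison semantics and is not CATE's GPU implementation.

```python
import operator

# The five "Frequency logic" values mapped to Python comparison functions.
FREQUENCY_LOGIC = {
    "=": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}

def passes_maf_filter(maf, threshold, logic):
    """Return True when a SNP's MAF satisfies the chosen comparison."""
    return FREQUENCY_LOGIC[logic](maf, threshold)

# Hypothetical MAF values for three SNPs, filtered against a 0.05 threshold.
snp_mafs = {"snp_1": 0.01, "snp_2": 0.05, "snp_3": 0.20}
kept = [name for name, maf in snp_mafs.items() if passes_maf_filter(maf, 0.05, ">=")]
print(kept)
```

With ">=" and a threshold of 0.05, the SNP sitting exactly on the threshold is kept, whereas ">" would drop it.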
Function execution:
To execute the VCF Splitter we pass the argument below followed by the location of the parameters.json file.
./CATE -svcf parameters.json
or ./CATE --splitvcf parameters.json
After you have made your segmented file structure you can move directly into the evolutionary analyses. But in this section let's finish looking at the rest of the complementary tools.
The FASTA Split function allows you to split a single FASTA file containing multiple sequences. It is NOT a GPU-driven function, so it needs neither an NVIDIA CUDA-capable GPU nor the configuration of the "CUDA Device ID" parameter.
FASTA Split requires the configuration of the following parameters:
"Output path"
The output path is to direct CATE to the output folder where all the split FASTA files will be written to. Each FASTA file's name will be the name of the sequence.
"Sequence"
You can extract a single sequence by specifying the name of the sequence here. Or, to extract all the sequences present in the FASTA file, simply configure it as "All". CATE will then extract all the sequences present in the FASTA file.
"Raw FASTA file"
This parameter configures the location of the FASTA file to be split. Since this is a single input file, it cannot be configured by the "Input path" parameter.
Function execution:
To execute FASTA Split we pass the argument below followed by the location of the parameters.json file.
./CATE -sfasta parameters.json
or ./CATE --splitfasta parameters.json
This function performs the opposite of the FASTA Split function above. It combines all the FASTA files present in the "Input path" folder into a single FASTA file. Its configurable parameters are as follows:
"FASTA files folder"
Specifies the location of the folder containing the FASTA files (FASTA files can have the following extensions: .fasta, .fna, .ffn, .faa, .frn, .fa).
"Merge FASTA path"
Name and location of the FASTA file to which all the sequence data will be written.
Function execution:
To execute FASTA Merge we pass the argument below followed by the location of the parameters.json file.
./CATE -mfasta parameters.json
or ./CATE --mergefasta parameters.json
The extract genes function is used to extract FASTA sequences of genes using a reference genome in FASTA format and CATE's gene file coordinates. It requires the configuration of the following parameters:
"Output path"
"Intermediate path"
Since we know what the above parameters do already let's look at the ones unique to this function.
"Reference genome ex"
Specifies the location of the reference genome in FASTA format.
"Extract gene list"
Points CATE to the location of the gene list file containing the gene coordinates; it uses a standard gene list file. You can also set the value to "universal". This points CATE to the "Universal gene list" parameter, which can be used to point to a single gene list file shared by multiple functions.
Function execution:
To execute Extract Genes we pass the argument below followed by the location of the parameters.json file.
./CATE -egenes parameters.json
or ./CATE --extractgenes parameters.json
Creates CATE's gene list file from a GFF3 file. Note that only regions annotated as genes will be extracted. It requires the configuration of the following parameters:
"Output path"
"GFF file"
The location of the GFF3 file that needs to be converted.
Function execution:
To execute Convert a GFF to Gene File we pass the argument below followed by the location of the parameters.json file.
./CATE -g2g parameters.json
or ./CATE --gff2gene parameters.json
This function identifies the haplotypes present in a gene region, given the VCF file and the original reference genome with which the VCF file was created. It outputs a summary file with a .hsum extension listing the haplotypes, the positions in the sequence where mutations have occurred relative to the reference genome, and the number of samples carrying each haplotype. It can also reconstruct the FASTA sequences of the haplotypes and, when needed, perform a complete population-wide reconstruction of the samples' FASTA sequences.
A sample of the output summary file can be found in CATE's GitHub repository. This is a tab-delimited file. If you look at the sample file you will notice that some fields in the "mutated_positions" column contain the value "NA". This means that these haplotypes contain the original reference genome sequence.
Executing this function requires the configuration of the following parameters:
"CUDA Device ID"
"Intermediate path"
"Output path"
"Ploidy"
Since we have looked at the above parameters before, we won't go through them again. But let us look at the parameters unique to this function.
"Input path"
Here, the input path points to CATE's segmented file structure, which has to be created beforehand using CATE's VCF Splitter function.
"Reference genome hap"
Tells CATE the location of the reference genome used to create the original VCF file. It should be in FASTA format.
"Population out"
This parameter can be either "YES" or "NO". If it is set to "YES", CATE will reconstruct the entire population's FASTA sequences; otherwise it will create the FASTA sequence of each unique haplotype only once.
The FASTA sequences will be written to a folder named as follows: PopulationID_genefilename.
Each FASTA file will be named after the gene name in Column 1 of the gene file. It will contain all the unique haplotypes found in that region. If "Population out" was set to "YES", an additional FASTA file with the suffix population will be created for each gene.
"Hap extract gene list"
Points CATE to the location of the gene list file containing the gene coordinates; it uses a standard gene list file. You can also set the value to "universal". This points CATE to the "Universal gene list" parameter, which can be used to point to a single gene list file shared by multiple functions.
Function execution:
To execute Extract haplotypes from VCFs we pass the argument below followed by the location of the parameters.json file.
./CATE -hapext parameters.json
or ./CATE --hapfromvcf parameters.json
This function converts a MAP file to a SNP-based gene file (a gene file containing the coordinates of single SNPs). It also separates the data by the chromosome the SNPs belong to, writing each chromosome's SNPs to a separate gene file. This is typically used by CATE's EHH function. Executing this function requires the configuration of the following parameters:
"Output path"
"MAP file"
Points to the location of the MAP file that needs to be split and converted.
"SNP prefix"
This is the set of characters or character that will be added in front of each SNP as its Column 1 identifier in the gene file. It will be incremented by the number of the SNP in the gene file.
Function execution:
To execute Convert MAP File to Gene File we pass the argument below followed by the location of the parameters.json file.
./CATE -m2g parameters.json
or ./CATE --map2gene parameters.json
This function will simply print a standard default parameters.json file complete with all possible parameters.
It requires no special parameters, since anyone using this function most likely does not yet have a parameter file to configure.
Function execution:
To execute Print default Parameter File we pass the argument below followed by the location at which you want your parameters.json file to be created.
./CATE -pparam parameters.json
or ./CATE --printparam parameters.json
The creation of our framework and the software CATE is centered around these 6 evolutionary tests. There are 7 functions involved in their execution and they are all CUDA powered (which means they require a CUDA-enabled NVIDIA GPU). They are listed below:
- Tajima's D statistic (1989)
- Fu and Li's D, D*, F and F* statistics (1993 & 1995)
- Fay and Wu's normalized H and E statistics (2006)
- Neutrality test full (Runs all three neutrality tests above in unison)
- Fixation Index (Fst) (1965)
- McDonald–Kreitman Neutrality Index (1991)
- Extended Haplotype Homozygosity (EHH) (2002)
Running these functions is simple enough, but they have quite a bit of flexibility when it comes to their parameter configuration.
We will start this section by first focussing on the running of two main calculation modes which involve 5 out of 7 of these functions. Then we will talk about the 4 functions designed around the three neutrality tests of evolution and finish the neutrality tests by discussing the execution of CATE's high-performance mode called Prometheus.
Finally, we will finish off the "How to use" part of CATE by discussing the execution of the remaining functions.
For these functions, the "Input path" parameter will refer to the location of CATE's segmented file structure created using its VCF Splitter function's CTSPLIT mode.
CATE's "Calculation mode" parameter in its parameter file is used to configure the following five functions:
- Tajima's D statistic (1989)
- Fu and Li's D, D*, F and F* statistics (1993 & 1995)
- Fay and Wu's normalized H and E statistics (2006)
- Neutrality test full (Runs all three neutrality tests above in unison)
- Fixation Index (Fst) (1965)
"Calculation mode" can be either one of two modes:
"WINDOW"
Window mode is a calculation type where the bins (calculation ranges) are defined by a fixed window and step size. There are essentially three configurations that window mode can take as shown below.
These configurations can be accessed using the following two parameters after setting the "Calculation mode" to "WINDOW":
"Window size"
As shown in the diagram window size defines the size of the window to be analyzed in base pairs (bp).
"Step size"
The step size is also set in bp. If the "Step size" is set to the same value as the "Window size", the calculation will use non-overlapping windows. If they differ, a smaller step size produces overlapping windows, while a larger one leaves gaps between windows.
If the "Step size" is set to 0 (zero), CATE performs a continuous sliding window calculation. Here the step size is set to 1 SNP within CATE, so a window is calculated for every SNP in the dataset.
NOTE: Do not add bp to the end of the "Window size" and "Step size" values. Write only the numbers.
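The relationship between window size and step size can be illustrated with a short sketch. The function `window_ranges` is hypothetical, and its boundary convention (inclusive ends, first window starting at the first position) is an assumption made for the example; CATE's exact window boundaries may differ. The continuous mode (step size 0) is not covered here.

```python
def window_ranges(first_position, last_position, window_size, step_size):
    """Yield (start, end) pairs in base pairs, with inclusive ends."""
    if window_size <= 0 or step_size <= 0:
        raise ValueError("window_size and step_size must be positive in this sketch")
    start = first_position
    while start <= last_position:
        yield start, start + window_size - 1
        start += step_size

# Step size equal to window size: windows sit side by side without overlap.
non_overlapping = list(window_ranges(1, 3000, 1000, 1000))
# Step size smaller than window size: consecutive windows overlap.
overlapping = list(window_ranges(1, 3000, 1000, 500))
print(non_overlapping)
print(overlapping)
```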
"FILE"
In FILE mode, the user provides CATE with a gene file. CATE then uses the coordinates in the gene file to calculate the test statistics for each region. FILE mode is activated by setting "Calculation mode" to "FILE".
For the "FILE" mode, each of the seven test functions has a parameter in which you specify the location of its gene list file. Each of these can instead be set to "universal", which points CATE to the "Universal gene list" parameter, which in turn holds the location of the gene list file. This makes it possible to configure all, or any combination, of the functions to use a common gene list file.
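The "universal" indirection described above amounts to a simple lookup, sketched below. The function `resolve_gene_list` and the file names are hypothetical; the parameter names come from the text. Whether CATE matches "universal" case-insensitively is not stated, so the sketch uses an exact match.

```python
def resolve_gene_list(parameters, function_key):
    """Return the gene list path for a function, following "universal" if set."""
    value = parameters[function_key]
    if value == "universal":
        # Fall back to the shared gene list used by several functions.
        return parameters["Universal gene list"]
    return value

# Hypothetical configuration: one function shares the list, one has its own.
parameters = {
    "Universal gene list": "shared_genes.txt",
    "Tajima gene list": "universal",
    "Fu and Li gene list": "fu_li_genes.txt",
}
tajima_path = resolve_gene_list(parameters, "Tajima gene list")
fu_li_path = resolve_gene_list(parameters, "Fu and Li gene list")
print(tajima_path, fu_li_path)
```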
The algorithm is designed based on the original paper (1989) by Fumio Tajima.
Executing this function requires the configuration of the following parameters:
"CUDA Device ID"
"Input path"
"Intermediate path"
"Output path"
"Ploidy"
"Calculation mode"
If "Calculation mode" is "FILE":
"Tajima gene list"
Specifies the location of the gene list file for the calculation of Tajima's D statistic.
If "Calculation mode" is "WINDOW":
"Window size"
"Step size"
"Prometheus activate"
This parameter activates CATE's high-performance mode and can be either "YES" or "NO". The parameters of Prometheus are discussed in a later section.
Function execution:
To execute Tajima's D statistic function we pass the argument below followed by the location of the parameters.json file.
./CATE -t parameters.json
or ./CATE --tajima parameters.json
Results of the Tajima's D statistic function are written to a tab-delimited *.td file. Example results can be found in CATE's GitHub repository.
The D, D* and F statistics are calculated based on the original paper by Fu and Li (1993). However, due to errors in the equations in the original paper, the F* statistic's vf* and uf* are calculated based on the corrected equations in the paper by Simonsen et al. (1995).
Executing this function requires the configuration of the following parameters:
"CUDA Device ID"
"Input path"
"Intermediate path"
"Output path"
"Ploidy"
"Calculation mode"
If "Calculation mode" is "FILE":
"Fu and Li gene list"
Specifies the location of the gene list file for the calculation of Fu and Li statistics.
If "Calculation mode" is "WINDOW":
"Window size"
"Step size"
"Prometheus activate"
This parameter activates CATE's high-performance mode and can be either "YES" or "NO". The parameters of Prometheus are discussed in a later section.
Function execution:
To execute Fu and Li statistics function we pass the argument below followed by the location of the parameters.json file.
./CATE -f parameters.json
or ./CATE --fuli parameters.json
Results of the Fu and Li statistics function are written to a tab-delimited *.fl file. Example results can be found in CATE's GitHub repository.
The algorithm is designed based on the original paper of Fay and Wu's normalized H and E statistics (2006).
Executing this function requires the configuration of the following parameters:
"CUDA Device ID"
"Input path"
"Intermediate path"
"Output path"
"Ploidy"
"Calculation mode"
If "Calculation mode" is "FILE":
"Fay and Wu gene list"
Specifies the location of the gene list file for the calculation of Fay and Wu statistics.
If "Calculation mode" is "WINDOW":
"Window size"
"Step size"
"Prometheus activate"
This parameter activates CATE's high-performance mode and can be either "YES" or "NO". The parameters of Prometheus are discussed in a later section.
Function execution:
To execute Fay and Wu statistics function we pass the argument below followed by the location of the parameters.json file.
./CATE -w parameters.json
or ./CATE --faywu parameters.json
Results of the Fay and Wu statistics function are written to a tab-delimited *.fw file. Example results can be found in CATE's GitHub repository.
Runs all three neutrality tests above in unison.
Executing this function requires the configuration of the following parameters:
"CUDA Device ID"
"Input path"
"Intermediate path"
"Output path"
"Ploidy"
"Calculation mode"
If "Calculation mode" is "FILE":
"Neutrality gene list"
Specifies the location of the gene list file for the calculation of the neutrality test statistics.
If "Calculation mode" is "WINDOW":
"Window size"
"Step size"
"Prometheus activate"
This parameter activates CATE's high-performance mode and can be either "YES" or "NO". The parameters of Prometheus are discussed in a later section.
Function execution:
To execute the Neutrality tests statistics function we pass the argument below followed by the location of the parameters.json file.
./CATE -n parameters.json
or ./CATE --neutrality parameters.json
Results of the Neutrality tests statistics function are written to a tab-delimited *.nt file. Example results can be found in CATE's GitHub repository.
Prometheus, CATE’s high-performance mode, is designed to accelerate CATE’s GPU-based architecture by processing multiple query regions in unison through CPU multithreading and, optionally, SSD-based multi read and write techniques. It has been implemented for the neutrality tests (Tajima’s D; Fu and Li's D, D*, F and F*; and Fay and Wu’s H and E) because they treat each segregating site independently.
Prometheus is activated by setting the parameter "Prometheus activate" to "YES". Afterward, the user must configure the four additional parameters listed below:
"CPU cores"
Specifies the maximum number of CPU cores that CATE can access.
"SNPs per time"
Specifies the number of SNPs that will be loaded into the GPU at a time.
"Number of genes"
Specifies the number of query regions CATE will process at a time.
"Multi read"
Can be either "YES" or "NO". Should be used if the system is equipped with an SSD. Allows CATE to read multiple files at once.
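As a rough illustration, the four Prometheus parameters can be layered on top of an existing parameter set. The helper below is hypothetical and the numeric values are placeholders chosen for the example, not recommended settings. Whether CATE expects numbers or strings for these values is an assumption.

```python
def with_prometheus(params: dict, cpu_cores: int, snps_per_time: int,
                    number_of_genes: int, multi_read: bool) -> dict:
    """Return a copy of params with Prometheus switched on and configured."""
    updated = dict(params)  # leave the caller's dictionary untouched
    updated.update({
        "Prometheus activate": "YES",
        "CPU cores": cpu_cores,              # maximum CPU cores CATE may use
        "SNPs per time": snps_per_time,      # SNPs loaded into the GPU at a time
        "Number of genes": number_of_genes,  # query regions processed at a time
        "Multi read": "YES" if multi_read else "NO",
    })
    return updated


base_params = {"Calculation mode": "FILE", "Prometheus activate": "NO"}
prometheus_params = with_prometheus(base_params, cpu_cores=8,
                                    snps_per_time=100000,
                                    number_of_genes=50, multi_read=True)
print(prometheus_params)
```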
Prometheus terminal output differs slightly from that of a normal run. A normal run provides real-time output of the results, but because Prometheus processes query regions in batches, it reports only the status of the batch currently being processed. A capture of CATE's Prometheus mode is shown below.
Let us take a quick look at the regions in the figure above to get an understanding of the CLI output.
REGION 1
Will provide the user with confirmation that Prometheus has been activated and that the entered parameters have been successfully received by CATE.
REGION 2
Will state the range of query regions being processed.
REGION 3
To prevent GPU overloading, the "SNPs per time" parameter limits how many of the collected SNPs the GPU processes at a time, so CATE breaks the SNPs into batches. Each batch is processed in a GPU round, and CATE states the round currently being processed in its CLI messages.
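The batching arithmetic can be sketched as follows. This is an illustration of the idea only, not CATE's internal code: given the total number of collected SNPs and the "SNPs per time" limit, it lists the slice handled in each GPU round.

```python
def gpu_rounds(total_snps: int, snps_per_time: int) -> list:
    """Return (start, end) index pairs, end exclusive, one pair per GPU round."""
    if snps_per_time <= 0:
        raise ValueError("snps_per_time must be a positive integer")
    return [
        (start, min(start + snps_per_time, total_snps))
        for start in range(0, total_snps, snps_per_time)
    ]


# 25,000 collected SNPs with a limit of 10,000 per round need three rounds;
# the last round holds the remaining 5,000 SNPs.
rounds = gpu_rounds(25000, 10000)
for number, (start, end) in enumerate(rounds, start=1):
    print(f"Round {number}: SNPs {start} to {end - 1}")
```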
Prometheus works on all calculation modes for the four neutrality test functions and the resultant outputs are the same as when Prometheus is not activated. Execution of the neutrality test functions also remains the same.
This function calculates the Fixation Index (Fst) between multiple populations. CATE's implementation of Fst places no limit on the number of populations that can be defined.
Executing this function requires the following parameters to be configured:
"CUDA Device ID"
"Input path"
"Intermediate path"
"Output path"
"Ploidy"
"Population index file path"
The population that each sample belongs to is defined by this three-column tab-delimited file. CATE assumes that the file has a header row, so column headers must always be present. The columns are as follows:
- Sample_name
These are the IDs of the samples in the VCF files.
- population_ID
These are the custom populations or sub-populations the samples belong to.
- Super_population
This is the name of the main populations the samples belong to, which was used to define the populations during the VCF Splitter's CTSPLIT execution. These are the populations under which CATE has created the segmented file hierarchy.
A few examples of these files are present in CATE's GitHub repository.
"Population ID"
Provide the combination of populations whose Fixation Index needs to be calculated. Any combination of two or more populations can be configured using the population IDs defined in the second column of the file given in "Population index file path".
The IDs must be separated by a "," (comma): "AFR,XKY,ZXY"
"Calculation mode"
If "Calculation mode" is "FILE":
"Fst gene list"
Specifies the location of the gene list file for the calculation of Fixation Index statistics.
If "Calculation mode" is "WINDOW":
Window size
Step size
Function execution:
To execute the Fixation Index statistics function we pass the argument below followed by the location of the parameters.json file.
./CATE -x parameters.json
or ./CATE --fst parameters.json
Results of the Fixation Index statistics function are written to a tab-delimited *.fst file. Example results can be found in CATE's GitHub repository.
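To make the population index file and the "Population ID" value described above concrete, here is a small sketch that builds both. The column headers are the ones listed in this section. The sample names and the super-population label are invented for illustration, and the population IDs reuse the "AFR,XKY,ZXY" example.

```python
import csv
import io

HEADER = ["Sample_name", "population_ID", "Super_population"]


def population_index_text(rows) -> str:
    """Render the three-column, tab-delimited population index file."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)  # the header row must always be present
    writer.writerows(rows)
    return buffer.getvalue()


def population_id_value(population_ids, rows) -> str:
    """Join IDs with commas after checking they occur in the second column."""
    known = {row[1] for row in rows}
    unknown = [pid for pid in population_ids if pid not in known]
    if unknown:
        raise ValueError(f"IDs not found in population_ID column: {unknown}")
    if len(population_ids) < 2:
        raise ValueError("at least two populations are required")
    return ",".join(population_ids)


# Hypothetical samples; "SUPER_A" is a made-up super-population label.
samples = [
    ("sample_1", "AFR", "SUPER_A"),
    ("sample_2", "XKY", "SUPER_A"),
    ("sample_3", "ZXY", "SUPER_A"),
]

index_text = population_index_text(samples)
population_id = population_id_value(["AFR", "XKY", "ZXY"], samples)
print(index_text)
print(population_id)
```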
There are two main methods of McDonald–Kreitman Neutrality Index calculation in CATE. We will first look at the parameters that are unique to each one and then follow through with the common parameters.
The modes are specified using the "Alignment mode" parameter.
They are as follows:
- "CHROM" mode
Takes as input an alignment of the entire reference chromosome against the entire chromosome or gene region of the outgroup sequence.
The preferred software with which chromosome-wide whole genome alignment should be performed is: GSAlign
PLEASE ensure that the REFERENCE sequence is first and the OUTGROUP sequence is second.
The alignment of the reference genome to the outgroup genome must be provided as a *.maf file.
This mode takes only a single alignment file, and its location must be specified using the "Alignment file" parameter.
- "GENE" mode
This is the per-gene alignment method. Here we take the query region and align it to the respective sequence of the outgroup.
Alignments should be performed using NCBI's BLASTN tool. This is available either online or for download via BLAST+.
PLEASE ensure that the REFERENCE gene sequence is the QUERY and OUTGROUP gene sequence is the SUBJECT.
You can use CATE's Extract Genes function to get the REFERENCE gene sequences.
Alignments must be provided in the blastn *.txt format.
This mode uses a special format of CATE's gene file with a third column. The third column will contain the location of the alignment file for each query region. A sample of such a gene list file is available in CATE's GitHub repository.
Now let's take a look at the common parameters:
"CUDA Device ID"
"Input path"
"Intermediate path"
"Output path"
"Ploidy"
Protein coding parameters
Since the McDonald–Kreitman test compares synonymous and non-synonymous mutations, we allow the user to define the codon information using the following three parameters. We have provided some standard examples for each.
"Start codon(s)"
Defines the codons that indicate the start of a coding region.
Example: "ATG"
"Stop codon(s)"
Defines codons that indicate the stop of a coding region.
Example: "TAA,TAG,TGA"
"Genetic code"
Defines the amino acids and the codons that code for them. It uses the following format:
AMINO_ACID|CODONS_SEPARATED_BY_COMMAS;
The vertical line (|) separates the amino acid from its codon information.
The semi-colon (;) signifies the end of that particular amino acid's coding information.
Example:
"A|GCT,GCC,GCA,GCG;R|CGT,CGC,CGA,CGG,AGA,AGG;N|AAT,AAC;D|GAT,GAC;B|AAT,AAC,GAC;C|TGT,TGC;Q|CAA,CAG;E|GAA,GAG;Z|CAA,CAG,GAA;G|GGT,GGC,GGA,GGG;H|CAT,CAC;M|ATG;I|ATT,ATC,ATA;L|CTT,CTC,CTA,CTG,TTA,TTG;K|AAA,AAG;F|TTT,TTC;P|CCT,CCC,CCA,CCG;S|TCT,TCC,TCA,TCG,AGT,AGC;T|ACT,ACC,ACA,ACG;W|TGG;Y|TAT,TAC;V|GTT,GTC,GTA,GTG;X|TAA,TGA,TAG"
"ORF known"
ORF stands for Open Reading Frame, and this parameter can be either "YES" or "NO".
If "YES", the coordinates of the query region stated in the gene file are taken to be the coordinates of the target ORF/coding region, and CATE does not need to find the ORF within the query region.
If "NO", CATE will look for the ORF within the query region. It examines the individual frames within the query region to find the most viable ORF. The longest coding sequence is considered the most likely to correspond to the Open Reading Frame.
"Reference genome mk"
The location of the FASTA sequence that was used for the creation of the VCF files.
"McDonald–Kreitman gene list"
Specifies the location of the gene list file for the calculation of McDonald–Kreitman statistics.
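The "Genetic code" format described above can be read back with a few lines of code. The parser below is a minimal sketch of that format, not CATE's own parser: it splits entries on the semi-colon, separates the amino acid from its codons on the vertical line, and splits the codons on commas. The shortened example string is a subset of the full example given above.

```python
def parse_genetic_code(genetic_code: str) -> dict:
    """Map each amino acid symbol to the list of codons given for it."""
    codon_table = {}
    for entry in genetic_code.split(";"):
        entry = entry.strip()
        if not entry:  # tolerate a trailing semi-colon
            continue
        amino_acid, codons = entry.split("|")
        codon_table[amino_acid] = codons.split(",")
    return codon_table


short_code = "A|GCT,GCC,GCA,GCG;M|ATG;W|TGG;X|TAA,TGA,TAG"
codon_table = parse_genetic_code(short_code)
for amino_acid, codons in codon_table.items():
    print(amino_acid, codons)
```

The table is kept as amino acid to codons rather than inverted, because the full example assigns some codons to more than one symbol (for instance AAT appears under both N and B).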
Function execution:
To execute the McDonald–Kreitman statistics function, we pass the argument below followed by the location of the parameters.json file.
./CATE -m parameters.json
or ./CATE --mk parameters.json
Results of the McDonald–Kreitman statistics function are written to a tab-delimited *.mc file. Example results can be found in CATE's GitHub repository.
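For the case where "ORF known" is "NO", the idea of scanning individual frames and keeping the longest coding sequence can be sketched as below. This is a simplified stand-in and not CATE's implementation: it assumes only the three forward frames are examined and that an ORF runs from the first start codon in a frame to the next in-frame stop codon. The default codons are the standard examples given for "Start codon(s)" and "Stop codon(s)".

```python
def longest_orf(sequence: str, start_codons=("ATG",),
                stop_codons=("TAA", "TAG", "TGA")):
    """Return (start, end) of the longest ORF, 0-based and end-exclusive.

    The stop codon is included in the returned span. Returns None when
    no complete ORF is found in any of the three forward frames.
    """
    sequence = sequence.upper()
    best = None
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if orf_start is None and codon in start_codons:
                orf_start = i  # open a candidate ORF
            elif orf_start is not None and codon in stop_codons:
                candidate = (orf_start, i + 3)
                if best is None or candidate[1] - candidate[0] > best[1] - best[0]:
                    best = candidate
                orf_start = None  # look for the next ORF in this frame
    return best


query_region = "CCATGAAATAGGATGCCCGGGTTTTAACC"
orf = longest_orf(query_region)
print(orf, query_region[orf[0]:orf[1]])
```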
As shown below, CATE's EHH has four modes divided into two main methods.
The mode is specified by the "Range mode" parameter.
We will first look at FIXED and FILE modes:
In these modes, the query region defined is the core haplotype region. The extended haplotypes for these modes can be defined in two ways.
- FIXED mode
In FIXED mode, the value can be either positive (+) or negative (-). The core haplotype's start position is augmented by the user-specified value, and this defines the extended region to be analyzed.
For this mode, the "Range mode" parameter should be set to "FIXED" and the "FIXED mode" parameter must hold the augmentation value (Example: "+10000").
- FILE mode
This mode uses a specialized gene file with three columns. An example of such a gene list file is provided in CATE's GitHub repository. For this mode, the "Range mode" parameter should be set to "FILE". The coordinates of the extended region can be provided as a range, or as a positive (+) or negative (-) value, similar to FIXED mode.
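The signed augmentation value can be illustrated with a short sketch. The helper is hypothetical and only shows the arithmetic on the core haplotype's start position. How CATE turns the resulting position into the analyzed region is not specified here, and the coordinates are made up.

```python
def augmented_position(core_start: int, fixed_value: str) -> int:
    """Apply a signed value such as "+10000" or "-10000" to the core start."""
    fixed_value = fixed_value.strip()
    if not fixed_value or fixed_value[0] not in "+-":
        raise ValueError("the value must begin with '+' or '-'")
    return core_start + int(fixed_value)


core_start = 50000  # hypothetical start of the core haplotype region
print(augmented_position(core_start, "+10000"))
print(augmented_position(core_start, "-10000"))
```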
Now let's move on to the SNP and BP modes:
Most EHH software uses this method for EHH calculations. Here the core haplotypes are always 0 (reference allele) and 1 (alternate allele), so each query region consists of a single SNP. An example of the gene list file used by CATE in this instance is also provided in CATE's GitHub repository.
Both modes use CPU-level multi-threading and therefore require the "EHH CPU cores" parameter, which defines the number of CPU cores CATE can use.
The outputs are written to a folder in the "Output path" named after the gene list file. A separate output file is created for each query SNP.
- SNP mode
For this mode, the "Range mode" parameter should be set to "SNP" and the "SNP default count" parameter should hold the number of SNPs by which CATE will extend on either side of the query SNP.
- BP mode
For this mode, the "Range mode" parameter should be set to "BP" and the "SNP BP displacement" parameter should hold the number of base pairs by which CATE will extend on either side of the query SNP.
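The difference between the two modes can be sketched as follows. Both helpers are hypothetical illustrations rather than CATE's code: BP mode extends by a number of base pairs, while SNP mode extends by a number of neighboring SNPs taken from a sorted list of SNP positions. Clamping the lower coordinate at 1 and truncating at the ends of the list are assumptions, and the positions are made up.

```python
from bisect import bisect_left


def bp_window(snp_position: int, displacement: int) -> tuple:
    """BP mode: extend by a fixed number of base pairs on either side."""
    return (max(1, snp_position - displacement), snp_position + displacement)


def snp_window(sorted_positions: list, snp_position: int, snp_count: int) -> list:
    """SNP mode: take snp_count SNPs on either side of the query SNP."""
    index = bisect_left(sorted_positions, snp_position)
    if index == len(sorted_positions) or sorted_positions[index] != snp_position:
        raise ValueError("query SNP not found among the known positions")
    first = max(0, index - snp_count)
    last = min(len(sorted_positions), index + snp_count + 1)
    return sorted_positions[first:last]


positions = [100, 250, 400, 900, 1500, 2100, 3000]  # hypothetical SNP positions
print(bp_window(900, 500))
print(snp_window(positions, 900, 2))
```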
The common parameters are as follows:
"CUDA Device ID"
"Input path"
"Intermediate path"
"Output path"
"Ploidy"
"EHH FILE path"
Specifies the location of the gene list file for the calculation of Extended Haplotype Homozygosity statistics.
Function execution:
To execute the Extended Haplotype Homozygosity statistics function we pass the argument below followed by the location of the parameters.json file.
./CATE -e parameters.json
or ./CATE --ehh parameters.json
Results of the Extended Haplotype Homozygosity statistics function are written to a tab-delimited *.ehh file. Example results can be found in CATE's GitHub repository.
With that, we conclude CATE's How to use. We hope this has provided you with the required preliminary understanding of the functions.
We wish you the best of luck with your speedy calculations🔥.