-
Notifications
You must be signed in to change notification settings - Fork 4
panHiTE tutorial
We are excited to announce a significant update to HiTE, which now includes the powerful panHiTE functionality. This release introduces the following key features:
-
Improved LTR Detection: The original
LTR_retriever
has been replaced with our new tool,FiLTR
, offering improved accuracy and completeness in LTR detection. -
Introducing panHiTE: This new workflow is designed for population genomics, facilitating TE annotation and analysis across multiple genomes. panHiTE accepts
RNA-seq
data, enabling the detection of gene expression changes associated with specific TE insertions, making it ideal for exploring individual-specific traits.
In this tutorial, we'll demonstrate how to use panHiTE with a demo data.
(Docker and Singularity versions will be available once panHiTE stabilizes.)
# 1. Install Nextflow
conda create -n nextflow -c conda-forge -c bioconda nextflow==24.10.3
# 2. Download HiTE
git clone https://github.com/CSU-KangHu/HiTE.git
# 3. Grant execution permissions for the required tools.
cd HiTE && python configure.py
# 4. Create the HiTE environment and record the environment path
source ~/.bashrc # or open a new terminal
conda env create --name HiTE -f environment.yml
conda activate HiTE
which python
# For example, the output might be: /home/xxx/miniconda3/envs/HiTE/bin/python.
# Record this path as it will be used later: /home/xxx/miniconda3/envs/HiTE
To install R packages, you may need to configure the CRAN mirror first. For example, to use the default CRAN mirror, add the following configuration to your ~/.Rprofile
file:
options(repos = c(CRAN = "https://cloud.r-project.org/"))
We provide a script to install all necessary R dependencies in one step:
cd HiTE
conda activate HiTE
Rscript RNA_seq/install_R_dependencies.R
If the installation completes successfully, the following R packages will be installed:
- argparser
- tibble
- dplyr
- minpack.lm
- readr
- stringr
- tidyr
- Rsubread
- limma
- edgeR
- ggplot2
If the script fails to install the packages, you can also install them manually. We strongly recommend that you follow the steps below to ensure the required R packages are correctly installed:
R
# R version 4.2.3 (2023-03-15) -- "Shortstop Beagle"
# ...
> library(argparser)
> library(tibble)
> library(dplyr)
> library(minpack.lm)
> library(readr)
> library(stringr)
> library(tidyr)
> library(Rsubread)
> library(limma)
> library(edgeR)
> library(ggplot2)
Download the demo data from Zenodo. A complete genome assembly, annotation, and RNA-seq reads data were downloaded from the publication: Kang M, Wu H, Liu H, Liu W, Zhu M, Han Y, Liu W, Chen C, Song Y, Tan L, Yin K. The pan-genome and local adaptation of Arabidopsis thaliana. Nature Communications. 2023 Oct 6;14(1):6259.
-
pan_genomes_dir
(Required)
All genome assemblies should be stored in a single directory, which should be specified as thepan_genomes_dir
parameter. -
genome_list
(Required)
A tab-delimited file with the following columns:- Column 1: Genome assembly file name
- Column 2: Gene annotation file name (optional)
- Columns 3 & 4: Paths to RNA-seq data (optional, single-end data in column 3, paired-end data in columns 3 & 4)
2.1. A Complete Example (genome_list):
# genome_name gene_annotation_name is_paired (1:True/0:False) Illumina RNA-seq reads (tab-delimited) 44.ket_10.fa 44.ket_10.gff 1 CRR624282_Chr1_f1.fq.gz CRR624282_Chr1_r2.fq.gz 02.tibet.fa 02.tibet.gff 1 CRR624279_Chr1_f1.fq.gz CRR624279_Chr1_r2.fq.gz 25.per_1.fa 25.per_1.gff 0 SRR748686_Chr1.fq.gz SRR748686_Chr1.copy.fq.gz
2.2 Example Without Gene Expression Analysis (genome_list_no_RNA):
# genome_name gene_annotation_name 44.ket_10.fa 44.ket_10.gff 02.tibet.fa 02.tibet.gff 25.per_1.fa 25.per_1.gff
2.3 Example for panTE Detection and Annotation Only (genome_list_no_RNA_no_gene):
# genome_name 44.ket_10.fa 02.tibet.fa 25.per_1.fa
-
out_dir
(Required): Specify the output directory path. -
gene_dir
(Optional, required for analyses in example 2.1 and 2.2): Place all gene annotation files in a single directory and set this directory as thegene_dir
parameter.
Important:- Please ensure that the chromosome names in the genome assembly are consistent with those in the genome annotation GFF file to avoid errors. For example, using
CM072657.1
in the genome assembly while usingChr1
in the annotation file is not allowed. It is recommended to use a unified naming convention, such asChr1
, across both files. - Ensure that the gene_id in multiple gene annotation files has a consistent name, with the last element separated by an
underscore
.
For example, in the file44.ket_10.gff
, a gene_id might beket_10_AT1G01010
, and in the file 02.tibet.gff, the gene_id should betibet_AT1G01010
.
- Please ensure that the chromosome names in the genome assembly are consistent with those in the genome annotation GFF file to avoid errors. For example, using
-
RNA_dir
(Optional, required for analysis in example 2.1): Set this as the parent directory for RNA-seq reads. The paths listed in columns 3 & 4 of thegenome_list
should be accessible via this directory.
To run panHiTE from start to end, use the following command:
# 1. activate nextflow
conda activate nextflow
# 2. Write the execution script
# Make sure to replace the `xxx` placeholders with your actual absolute paths.
source_dir=xxx
pan_genomes_dir=xxx
gene_dir=xxx
RNA_dir=xxx
genome_list=xxx
out_dir=xxx
conda_name=/home/xxx/miniconda3/envs/HiTE # You need to replace the previously recorded HiTE conda environment path here
cd $source_dir && /usr/bin/time -v nextflow run panHiTE.nf \
-profile conda --conda_name ${conda_name} \
--pan_genomes_dir ${pan_genomes_dir} \
--genome_list ${genome_list} \
--genes_dir ${gene_dir} \
--RNA_dir ${RNA_dir} \
--out_dir ${out_dir} \
--threads ${threads}
# Example script:
#source_dir=/home/xxx/HiTE
#pan_genomes_dir=/home/xxx/ath_pan_genome/pan_genome/ath/ath_genome_and_annotation/genomes
#gene_dir=/home/xxx/ath_pan_genome/pan_genome/ath/ath_genome_and_annotation/gff_files
#RNA_dir=/home/xxx/ath_pan_genome/pan_genome/ath/ath_genome_and_annotation/RNA_seq_files
#genome_list=/home/xxx/ath_pan_genome/pan_genome/ath/ath_genome_and_annotation/genome_list
#out_dir=/home/xxx/ath_pan_genome/pan_genome/ath/panHiTE_output
#cd $source_dir && /usr/bin/time -v nextflow run panHiTE.nf \
# -profile conda --conda_name /home/xxx/miniconda3/envs/HiTE \
# --pan_genomes_dir ${pan_genomes_dir} \
# --genome_list ${genome_list} \
# --genes_dir ${gene_dir} \
# --RNA_dir ${RNA_dir} \
# --out_dir ${out_dir} \
# --threads 40 \
# --miu 7e-9
If you only require the panTE library and TE annotation for each genome without performing differential gene detection caused by TEs, you do not need to specify the --RNA_dir
parameter. Additionally, the --genome_list
input file only needs to include two columns: genome_name
and gene_annotation_name
. Refer to the example file demo/genome_list_no_RNA
.
source_dir=xxx
pan_genomes_dir=xxx
gene_dir=xxx
genome_list=xxx
out_dir=xxx
conda_name=/home/xxx/miniconda3/envs/HiTE # Replace this with the path to your HiTE conda environment
cd $source_dir && /usr/bin/time -v nextflow run panHiTE.nf \
-profile conda --conda_name ${conda_name} \
--pan_genomes_dir ${pan_genomes_dir} \
--genome_list ${genome_list} \
--genes_dir ${gene_dir} \
--out_dir ${out_dir} \
--threads ${threads}
Run HiTE for each genome and annotate each genome using the panTE library. These two steps can be parallelized on the HPC platform to effectively reduce runtime. We tested this on an HPC platform managed by Slurm, and the key step is to provide the correct HPC configuration.
- Modify the
HiTE/nextflow.config
file
// HPC with singularity-loading, adapted from nanome
hpc { // general HPC configuration
params {
// HPC Slurm default parameters
qos = 'resq'
partition = 'resQ'
queue = partition
processors = 40
memory = '100.GB'
time = '50.h'
...
account = 'xxx'
queueSize = 5 // max number of job submit
...
}
}
# Please consult your HPC platform to modify the above parameters
# If running a large and complex genome, you may want to set the `time` parameter a bit higher; otherwise, it may terminate unexpectedly due to exceeding the runtime limit.
# It is recommended not to set the `queueSize` too high, as too much parallelism could lead to high disk I/O, potentially causing an unexpected termination.
- Run panHiTE using conda and HPC
# Add the hpc option in the -profile section
source_dir=xxx
pan_genomes_dir=xxx
gene_dir=xxx
RNA_dir=xxx
genome_list=xxx
out_dir=xxx
conda_name=/home/xxx/miniconda3/envs/HiTE # Here, replace with the previously recorded HiTE conda environment name
cd $source_dir && /usr/bin/time -v nextflow run panHiTE.nf \
-profile conda,hpc --conda_name ${conda_name} \
--pan_genomes_dir ${pan_genomes_dir} \
--genome_list ${genome_list} \
--genes_dir ${gene_dir} \
--RNA_dir ${RNA_dir} \
--out_dir ${out_dir} \
--threads ${threads} \
--skip_analyze 0
Please verify that the ${out_dir}/run_hite_single/${genome}
directory contains all the necessary output files to ensure that HiTE has successfully run for each genome.
The complete output should include the following files:
- Files (2), (3), (4), (9), (10), and (11) are results from the LTR module.
- File (8) is from the TIR module.
- File (1) is the output of the Helitron module.
- Files (5) and (6) are results from the non-LTR module.
- File (7) contains the merged results of all TEs.
The expected structure of the output directory is as follows:
.
├── confident_helitron.fa (1)
├── confident_ltr_cut.fa (2)
├── confident_ltr.internal.fa (3)
├── confident_ltr.terminal.fa (4)
├── confident_non_ltr.fa (5)
├── confident_other.fa (6)
├── confident_TE.cons.fa (7)
├── confident_tir.fa (8)
├── intact_LTR.fa (9)
├── intact_LTR.fa.classified (10)
└── intact_LTR.list (11)
If any file has a size of 0, such as confident_tir.fa
, it indicates that HiTE did not detect any TIR elements in the genome. This could be due to one of two reasons:
- The genome genuinely lacks TIR elements.
- The TIR detection step in HiTE did not run correctly.
To determine the cause, you can check if other similar genomes contain TIR elements. If you're unable to confirm, the safest approach is to delete the output directory for the affected genome and rerun the pipeline to ensure proper execution.
The panHiTE pipeline consists of 9 processes:
preprocess_genomes
run_hite_single
merge_terminal_te
pan_remove_redundancy
annotate_genomes
summarize_tes
pan_gene_te_relation
pan_generate_bam_for_RNA_seq
pan_detect_de_genes
To ensure smooth execution, the pipeline stores the output files of all processes. If the program is interrupted for any reason, simply rerun the pipeline. Nextflow will automatically detect these intermediate files and skip the completed processes, making it particularly efficient for large-scale pangenome analyses.
If you need to rerun a specific process and its downstream processes, you can delete the corresponding process_name
directory within the output directory specified by the --out_dir
parameter.
Type nextflow run panHiTE.nf --help
for help.
panHiTE - Nextflow PIPELINE (v0.0.1)
=================================
Usage:
The typical command is as follows:
nextflow run panHiTE.nf --pan_genomes_dir xxx --genome_list xxx --genes_dir xxx --RNA_dir xxx --out_dir xxx --threads 40 --skip_analyze 0 --miu 7e-9
Mandatory arguments:
--pan_genomes_dir A directory containing the pan-genomes
--genome_list A text file with genome and gene names. Each line represents a pair of genome and gene names, separated by a tab ( ). The genome name is mandatory, while the gene name is optional. If a gene name is provided, the genes_dir parameter must also be specified.
--out_dir Output directory
General options:
--softcore_threshold occurrence of core_TE = num_of_genomes, softcore_threshold * num_of_genomes <= softcore_TE < num_of_genomes, 2 <= dispensable_TE < softcore_threshold * num_of_genomes, private_TE = 1. default = [ 0.8 ]
--genes_dir A directory containing the gene annotation files, gff format.
--RNA_dir A directory containing the RNA-seq files.
--te_type Retrieve specific type of TE output [ltr|tir|helitron|non-ltr|all]. default = [ all ]
--threads Input thread num. default = [ 10 ]
--skip_analyze Whether to skip analyze, only generate panTE library. default = [ 0 ]
--miu The neutral mutation rate (per bp per ya). default = [ 1.3e-8 ]
-
DE_genes_from_TEs.tsv
This file contains TEs inserted upstream, downstream, or inside genes, which significantly alter gene expression levels across populations.
"Gene_name" "NoInsertion (mean_TPM)" "Upstream (mean_TPM)" "Downstream (mean_TPM)" "Inside (mean_TPM)" "Upstream_vs_NoInsertion_logFoldChange" "Inside_vs_NoInsertion_logFoldChange" "Downstream_vs_NoInsertion_logFoldChange"
"AT4G34410" 562.145 NA 1.14 NA NA NA -8.03975183205617
"AT3G48360" 150.99 0.16 16.46 NA -7.033707790821 NA -3.12185094231913
"AT4G01950" 389.61 2.33 17.3366666666667 NA -6.87406289658701 NA -4.4129255767341
"AT5G45890" 137.77 0.21 NA NA -6.84154485479315 NA NA
"AT4G22880" 68.44 NA 0.07 NA NA NA -6.02008424604333
...
Column Descriptions:
-
Gene_name
: Gene name. -
NoInsertion (mean_TPM)
: Mean expression (TPM) of the gene with no TE insertion. -
Upstream (mean_TPM)
: Mean expression (TPM) of the gene with a TE inserted upstream. -
Downstream (mean_TPM)
: Mean expression (TPM) of the gene with a TE inserted downstream. -
Inside (mean_TPM)
: Mean expression (TPM) of the gene with a TE inserted inside. -
Upstream_vs_NoInsertion_logFoldChange
: Log2 fold change of upstream insertion compared to no insertion. -
Inside_vs_NoInsertion_logFoldChange
: Log2 fold change of inside insertion compared to no insertion. -
Downstream_vs_NoInsertion_logFoldChange
: Log2 fold change of downstream insertion compared to no insertion.
Positive fold changes indicate upregulation, while negative values indicate downregulation. The formula for fold change is:
Upstream_vs_NoInsertion_logFoldChange = log2(Upstream + 1) - log2(NoInsertion + 1)
.
Result Interpretation:
"AT4G34410" 562.145 NA 1.14 NA NA NA -8.03975183205617
The average expression level (TPM) of gene AT4G34410 without TE insertion is 562.145. In specific samples (one or more), a TE insertion is present downstream of this gene, which results in an average expression level of 1.14 after the insertion. The log fold change is -8.03, indicating that the gene is downregulated due to the TE insertion.
You can further examine the all_gene_TEs_details.tsv
file to identify which samples contain the TE insertion for this gene, the distance between the TE insertion and the gene, expression levels in specific samples, and other related information.
-
all_gene_TEs_details.tsv
This file details the positional relationships between TEs and genes across samples (e.g., upstream, downstream, inside, and distance), along with gene expression data.
"Gene_name" "Genome_name" "TE_name" "Chromosome" "TE_start" "TE_end" "Gene_start" "Gene_end" "Position" "Species" "Distance" "expression"
"AT1G01020" "44.ket_10.fa" "44-Helitron_44" "Chr1" 19241 20919 8958 11367 "Upstream" "44.ket_10" 7874 12.65
"AT1G01030" "44.ket_10.fa" "44-Helitron_44" "Chr1" 19241 20919 13881 15947 "Upstream" "44.ket_10" 3294 0.98
"AT1G01040" "44.ket_10.fa" "44-Helitron_44" "Chr1" 19241 20919 25358 33453 "Upstream" "44.ket_10" 4439 31.61
"AT1G01050" "44.ket_10.fa" "25-TIR_69" "Chr1" 44534 44768 33396 35397 "Upstream" "44.ket_10" 9137 91.79
"AT1G01050" "02.tibet.fa" "25-TIR_69" "Chr1" 47261 47494 36147 38160 "Upstream" "02.tibet" 9101 102.58
"AT1G01050" "25.per_1.fa" "25-TIR_69" "Chr1" 44736 44971 33603 35605 "Upstream" "25.per_1" 9131 88.57
...
Column Descriptions:
-
Gene_name
: Gene name. -
Genome_name
: Genome name. -
TE_name
: TE name. -
Chromosome
: Chromosome name. -
TE_start
/TE_end
: Start and end positions of the TE. -
Gene_start
/Gene_end
: Start and end positions of the gene. -
Position
: TE position relative to the gene (e.g., upstream, downstream, inside). -
Species
: Species name. -
Distance
: Distance between the TE and the gene. -
expression
: Gene expression level.
-
panHiTE.CorePan_fitmodel.pdf and panHiTE.CorePan_fitsmooth.pdf
Saturation curves of Pan and Core TE family counts based on a fitted model and LOESS smoothing.
-
panTE.fa
A pan-TE library generated by panHiTE. -
TE_summary.pdf
Statistical summaries of core, softcore, dispensable, and private TEs in the pan-genome, including:- (a) Proportions of TE families based on full-length TE annotations:
- (b) Proportions of TE families based on all TE annotations:
- (c) Genome coverage statistics based on full-length TE annotations:
- (d) Genome coverage statistics based on all TE annotations:
- (e) Counts of different TE superfamilies based on full-length TE annotations:
- (f) Counts of different TE superfamilies based on all TE annotations:
- (g) Genome coverage by TE superfamilies based on full-length TE annotations:
- (h) Genome coverage by TE superfamilies based on all TE annotations:
- (i) Insertion time distributions of Copia and Gypsy TEs across samples (default mutation rate: 1.3e-8):
...
Hunan Provincial Key Lab on Bioinformatics, Central South University.