MaizeGDB will assign the annotation prefix and provide you with the genome/species accession ID. Once you have both, you can run the script to produce a GFF3 file that meets MaizeGDB's requirements.
The script itself does not require any extra packages, but because GFF3 files come in various formats, your file must be standardized before you run the script, and that step requires AGAT. Below is the recommended way to install it:
module load miniconda3
conda create -y -n agat
conda activate agat
conda install -c bioconda agat
Get the script:
git clone git@github.com:HuffordLab/MaizeGDB_gff3_format.git
cd MaizeGDB_gff3_format
chmod +x maizegdb_gff3_formatter.py
You can run the script as follows:
maizegdb_gff3_formatter.py <renamed_agat_formatted.gff3> <canonical_transcript_ids.txt>
renamed_agat_formatted.gff3: The GFF3 file sanitized using the agat_sp_manage_IDs.pl script. Typically, you should obtain the gene ID prefix from MaizeGDB and then run the script as follows:
agat_sp_manage_IDs.pl --gff input.gff3 --prefix Ab00001aa --tair --output prefinal.gff3
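After renaming, a quick sanity check can confirm that the prefix was applied to the gene IDs. The following is a minimal sketch and not part of the repository: the function name gene_ids_missing_prefix is hypothetical, and it assumes that each gene feature carries an ID attribute in column 9 that should begin with the prefix assigned by MaizeGDB.

```python
def gene_ids_missing_prefix(gff3_lines, prefix):
    """Return the IDs of gene features that do not start with `prefix`.

    `gff3_lines` is any iterable of GFF3 lines (for example, an open file).
    Assumption: gene IDs are stored as ID=<value> in the ninth column.
    """
    missing = []
    for line in gff3_lines:
        # Skip blank lines, comments, and directives such as ##gff-version.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        # Parse the key=value pairs of the attributes column.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        gene_id = attrs.get("ID", "")
        if not gene_id.startswith(prefix):
            missing.append(gene_id)
    return missing
```

For example, calling gene_ids_missing_prefix(open("prefinal.gff3"), "Ab00001aa") should return an empty list if every gene ID was renamed with the prefix.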
canonical_transcript_ids.txt: A list of transcript IDs that are considered the primary transcripts. You can run the TRaCE program to determine the canonical transcript for each gene and create a list of mRNA IDs (one per line). The number of IDs should equal the gene count in the GFF3 file. For example:
Ab00001aa000001_T001
Ab00001aa000002_T002
Ab00001aa000006_T001
Ab00001aa000009_T003
Ab00001aa000011_T001
Ab00001aa000012_T001
Ab00001aa000017_T001
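Because the number of canonical IDs should match the gene count, it can help to verify the list against the GFF3 file before running the formatter. The sketch below is illustrative only and not part of the repository: the function names are hypothetical, and it assumes that transcripts are mRNA features whose Parent attribute holds the gene ID.

```python
def parse_attributes(column9):
    """Turn a GFF3 attributes column into a dict of key -> value."""
    return dict(f.split("=", 1) for f in column9.strip().split(";") if "=" in f)


def check_canonical_list(gff3_lines, canonical_lines):
    """Return a list of problems; an empty list means the inputs agree.

    Assumption: genes have type 'gene', transcripts have type 'mRNA',
    and each mRNA points to its gene through the Parent attribute.
    """
    genes = set()
    mrna_parent = {}
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attrs = parse_attributes(cols[8])
        if cols[2] == "gene":
            genes.add(attrs.get("ID"))
        elif cols[2] == "mRNA":
            mrna_parent[attrs.get("ID")] = attrs.get("Parent")

    ids = [entry.strip() for entry in canonical_lines if entry.strip()]
    problems = []
    # The list should contain exactly one transcript per gene.
    if len(ids) != len(genes):
        problems.append(f"{len(ids)} canonical IDs but {len(genes)} genes")
    seen_genes = set()
    for tid in ids:
        if tid not in mrna_parent:
            problems.append(f"{tid} is not an mRNA ID in the GFF3")
            continue
        parent = mrna_parent[tid]
        if parent in seen_genes:
            problems.append(f"gene {parent} has more than one canonical transcript")
        seen_genes.add(parent)
    return problems
```

Passing the open GFF3 file and the open ID list to check_canonical_list returns a list of human-readable problems, which should be empty before you proceed to the final step.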
maizegdb_gff3_formatter.py \
renamed_agat_formatted.gff3 \
canonical_transcript_ids.txt > maizeGDB_specifications.gff3