GenBank Loader is a Java-based command-line utility that may be used to download the latest GenBank data from the National Center for Biotechnology Information, and to load those publicly available DNA sequences into a local relational database to facilitate advanced analysis.
The original code was written by Matthew Storer.
This is an updated version, edited by Vivek Ramanan in 2021 for the Brown Center for Biomedical Informatics (BCBI). I have added comments marked with ** so they hopefully make running the process easier.
Building the GenBank Loader is accomplished by running the following from the project's base directory:
$ mvn package
Two preparatory steps must be taken before GenBank Loader may be used: creating the local database into which GenBank data will ultimately be stored, and ensuring your system has sufficient disk space available to store everything.
Before the GenBank Loader can be used, the database into which GenBank DNA sequences will be stored must be created. GenBank Loader is pre-configured to use MySQL as its database back-end. If you don't already have it installed, you should first download and install the MySQL Community Server.
Once MySQL is installed, run the following command from the project's base directory:
$ mysql -uroot < createdb.sql
This will create the genbank database, with appropriate default values to integrate with the GenBank Loader's default settings.
** You should not need to run this command, because you will be using the BCBI DB pursamydbcit.
The GenBank Loader's intermediate files and target database tables can take hundreds of gigabytes of disk space. Please be sure that you have at least 500 gigabytes of disk space available before starting this process!
To use the GenBank Loader and to see a list of command-line switches, run the program without any arguments:
$ java -jar genbank-loader-1.0.jar
GenBank Loader
--------------
Copyright 2015 The University of Vermont and State Agricultural College. All rights reserved.
usage: genbank-loader [-d <string>] [-h <string>] --load | --prepare [-p <string>] [-u <string>]
-d,--db <string> the database name (default 'genbank')
-h,--host <string> the database host (default 'localhost')
--load only load prepared files into the target database
-p,--pass <string> the database user password (default 'genbank')
--prepare only prepare database files for import
-u,--user <string> the database user name (default 'genbank')
The GenBank Loader has two primary modes of operation: prepare and load. The preparation stage involves downloading GenBank files and transforming them such that they may be loaded into the target database. The loading stage involves taking those prepared files and importing them into the target database.
To download and prepare files for import, run the following command:
$ java -Xmx1000m -jar genbank-loader-1.0.jar --prepare
GenBank Loader
--------------
Copyright 2015 The University of Vermont and State Agricultural College. All rights reserved.
INFO Load - process started at Thu Jun 18 10:09:31 EDT 2015
INFO FTPClient - connected to 'ftp.ncbi.nlm.nih.gov'
INFO FTPClient - disconnected from 'ftp.ncbi.nlm.nih.gov'
INFO FTPClient - connected to 'ftp.ncbi.nlm.nih.gov'
INFO FTPClient - disconnected from 'ftp.ncbi.nlm.nih.gov'
INFO AbstractFTPLoader - [2] start
INFO AbstractFTPLoader - [1] start
INFO FTPClient - connected to 'ftp.ncbi.nlm.nih.gov'
INFO FTPClient - connected to 'ftp.ncbi.nlm.nih.gov'
INFO AbstractFTPLoader - [2] (1/3749, 0%) processing 'gbbct1.seq.gz'
INFO AbstractFTPLoader - [1] (2/3749, 0%) processing 'gbbct10.seq.gz'
INFO AbstractFTPLoader - [1] (3/3749, 0%) processing 'gbbct100.seq.gz'
INFO AbstractFTPLoader - [2] (4/3749, 0%) processing 'gbbct101.seq.gz'
...
INFO AbstractFTPLoader - [2] (3748/3749, 99%) processing 'complete.999.genomic.gbff.gz'
INFO AbstractFTPLoader - [1] (3749/3749, 100%) processing 'complete.wgs_mstr.gbff.gz'
INFO AbstractFTPLoader - [1] done.
INFO FTPClient - disconnected from 'ftp.ncbi.nlm.nih.gov'
INFO AbstractFTPLoader - [2] done.
INFO FTPClient - disconnected from 'ftp.ncbi.nlm.nih.gov'
INFO AbstractLoader - processing 3749 files across 2 threads took 13 hours, 24 minutes, 19 seconds
INFO Load - preparing GenBank files finished at Thu Jun 18 23:33:54 EDT 2015 (took 13 hours, 24 minutes, 23 seconds).
Downloaded files will be stored in a temporary folder while they are being processed, and are deleted after processing is completed. Resultant files that will be imported into the local database are stored in the out folder in the current working directory:
$ ls out
annotations.txt authors.txt basic.txt dbxrefs.txt journals.txt keywords.txt
Each of these files is associated with a corresponding table in the target database.
I ran the prepare and load steps through batch jobs on Oscar, which allowed them to run in the background without tying up an interactive session. I used a maximum walltime of 100 hours followed by the appropriate prepare or load command, for example:
java -jar target/genbank-loader-1.0.jar --prepare
pom.xml:
- This file holds information on Maven, MySQL, and the Oracle MySQL Connector/J dependency
- Change the Connector/J version to 8.0.21
- Added a repository section to the end of the pom.xml for Oscar specifically (pom_Oscar.xml)
Load.java:
- This file holds the JDBC driver and URL used to download from and access the MySQL database (see the sketch directly below this group)
- private static final String JDBC_DRIVER = "com.mysql.cj.jdbc.Driver"
- Added "?serverTimezone=UTC&allowLoadLocalInfile=true" to the JDBC url String
Record parser (the file containing populateGInumber):
- This file goes through the individual GenBank records and processes them into files
- Changed populateGInumber to use index [0] because there is no index [1] anymore
- This change is due to GenBank changing their data organization (see the illustration below)
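As an illustration of why the index changed (this is hypothetical example code, not the actual parser): current GenBank flat files no longer carry a GI number as a second token on the VERSION line, so only index [0] of the split value exists.

public class VersionLineSketch {
    static String accessionVersion(String versionLine) {
        // Older flat files:   "VERSION     U49845.1  GI:1293613"
        // Current flat files: "VERSION     U49845.1"
        String value = versionLine.replaceFirst("^VERSION\\s+", "").trim();
        String[] parts = value.split("\\s+");
        // parts[0] is the accession.version; a parts[1] (the GI number) no longer
        // exists, so indexing past [0] fails on current files.
        return parts[0];
    }

    public static void main(String[] args) {
        System.out.println(accessionVersion("VERSION     U49845.1")); // prints U49845.1
    }
}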
With this, hopefully, prepare will work! If it doesn't, try changing some of these variables, particularly the parameters appended to the JDBC url String in Load.java; some of these settings are server dependent.
During the preparation stage, thousands of files are downloaded over potentially many hours. It is possible that during this time, the process might be interrupted (perhaps by an OutOfMemoryError, lost network connection, user error, etc.). Should any such interruption occur, it is important to know that the download and processing of data files may be resumed where it left off, with an extremely remote chance of data corruption.
This is because metadata is recorded for each successfully processed file (allowing the system to know where to resume operations), and because each downloaded file is processed independently; the results for a given file are appended to the master target files for database import only after that file's processing has completed.
Corruption of the master files can occur only if the GenBank Loader is interrupted during the few milliseconds required to append an individual file's data to its respective master file. This is extremely unlikely: while the processing of an individual file may take many seconds, appending those data to the master file takes just a few milliseconds.
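The pattern described above, sketched in simplified form (the file layout and method names here are hypothetical, not the loader's actual code):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class AppendSketch {
    static void processOne(Path downloadedFile, Path masterFile) throws IOException {
        // 1. The slow part: transform the downloaded file into its own temporary output.
        Path temp = Files.createTempFile("genbank-", ".part");
        Files.write(temp, transform(downloadedFile));

        // 2. The only step that touches the master file is this short append, so an
        //    interruption is very unlikely to land inside this window.
        Files.write(masterFile, Files.readAllBytes(temp),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);

        // 3. Metadata about the successfully processed file would be recorded here,
        //    so that a restarted run can skip it and resume where it left off.
        Files.deleteIfExists(temp);
    }

    static byte[] transform(Path in) throws IOException {
        return Files.readAllBytes(in); // stand-in for the real per-record processing
    }
}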
GenBank Loader is a multi-threaded process that can leverage the cores of your CPU to download and prepare GenBank files in parallel, which can dramatically improve performance.
When GenBank Loader executes its prepare function, it determines how much memory it has been allocated and creates as many threads as it can safely use without negatively impacting other user and system processes. GenBank Loader will use at least one core, but may use up to (N-2) cores, where N is the total number of cores in your system's CPU.
GenBank Loader's preparation stage requires 400 megabytes of Java heap memory per thread to properly execute, and may otherwise suffer severely degraded performance, or even fail with an OutOfMemoryError. It is therefore suggested to allocate at least 2.5 times the required memory per thread to the JVM, but more is better and will translate to improved performance, especially on systems with many CPU cores.
To allocate the minimum suggested memory to the GenBank Loader process, use the JVM option -Xmx1000m as in the example above. See Oracle's Java SE documentation for details about -Xmx and other JVM options.
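As a rough illustration of that sizing rule (the loader's actual calculation may differ), the number of worker threads can be derived from the JVM's maximum heap and the available cores:

public class ThreadSizingSketch {
    public static void main(String[] args) {
        long heapBytes = Runtime.getRuntime().maxMemory();   // controlled by -Xmx
        int cores = Runtime.getRuntime().availableProcessors();

        long perThreadBytes = 400L * 1024 * 1024;             // ~400 MB of heap per thread
        int byMemory = (int) (heapBytes / perThreadBytes);
        int byCores = Math.max(1, cores - 2);                 // leave two cores for other processes

        int threads = Math.max(1, Math.min(byMemory, byCores));
        System.out.printf("heap=%dMB cores=%d -> %d worker thread(s)%n",
                heapBytes / (1024 * 1024), cores, threads);
    }
}

With -Xmx1000m this works out to two worker threads on a typical multi-core machine, which matches the two loader threads ([1] and [2]) visible in the prepare log above.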
To load the prepared GenBank files into the local database, execute the following:
$ java -jar genbank-loader-1.0.jar --load
GenBank Loader
--------------
Copyright 2015 The University of Vermont and State Agricultural College. All rights reserved.
INFO Load - process started at Wed Jun 17 16:55:32 EDT 2015
INFO AbstractLoader - populating database from './out' -
INFO AbstractLoader - loading './out/basic.txt' into table 'basic'
INFO DataSource - getConnection : establishing connection to 'genbank'
INFO AbstractLoader - building indexes for table 'basic'
INFO AbstractLoader - loading './out/keywords.txt' into table 'keywords'
INFO AbstractLoader - building indexes for table 'keywords'
INFO AbstractLoader - loading './out/dbxrefs.txt' into table 'dbxrefs'
INFO AbstractLoader - building indexes for table 'dbxrefs'
INFO AbstractLoader - loading './out/journals.txt' into table 'journals'
INFO AbstractLoader - building indexes for table 'journals'
INFO AbstractLoader - loading './out/authors.txt' into table 'authors'
INFO AbstractLoader - building indexes for table 'authors'
INFO AbstractLoader - loading './out/annotations.txt' into table 'annotations'
INFO AbstractLoader - building indexes for table 'annotations'
INFO AbstractLoader - finished populating database. took 12 hours, 6 minutes, 59 seconds
INFO Load - populating database with release '207' finished at Thu Jun 18 05:02:35 EDT 2015 (took 12 hours, 7 minutes, 3 seconds).
Note that as soon as the load process completes, you may safely delete the intermediate database import files in the out folder.
Load required a few more changes for me to get this process working. The main issue is that the current GenBank Loader code relies on a set of supporting UVM scripts, which I was able to get access to in order to change some of them. Link: https://bitbucket.org/UVM-BIRD/ccts-common/src/master/
In particular, the DB folder within src/ contains the scripts I was able to edit to fix the load process. The DB folder can be copied and placed into your local genbank-loader (within the genbank folder), where I edited the following:
- import edu.uvm.ccts.genbank.db.loader.AbstractFTPLoader
- Add the package declaration "package edu.uvm.ccts.genbank.db.loader;" at the top of the following files in loader:
- Loader/AbstractFileLoader.java
- Loader/AbstractFTPLoader.java
- Loader/AbstractLoader.java: This code has the MySQL commands that load the data
- In AbstractLoader.java, update dbLoad (line 94) by removing the ALTER statements issued via DBUtil.executeUpdate, so that the load boils down to the statements below (a plain-JDBC sketch of this step follows this list):
DBUtil.executeUpdate("set session sql_log_bin = OFF", dataSource);
DBUtil.executeUpdate("load data local infile '" + filename + "' into table " + table, dataSource);
DBUtil.executeUpdate("set session sql_log_bin = ON", dataSource);
- FeatureTableParser.java holds the variable names for each of the files to be processed - if you want to remove one or change a filename, do so here
- Use truncate from the MySQL command line to clean out the data tables you are going to update
- However, this requires additional permission for the "drop" command from the DB team
- Once you have the permissions, run truncate on the command line rather than in the script; I had some issues with the scripted version, and doing it manually also lets the manual operations used later on append to pre-existing tables that already contain data
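For reference, here is a minimal plain-JDBC sketch of what the trimmed-down dbLoad step amounts to; the real AbstractLoader.java routes these statements through DBUtil.executeUpdate, and the class and method names here are illustrative only:

import java.sql.Connection;
import java.sql.Statement;
import javax.sql.DataSource;

public class BulkLoadSketch {
    static void dbLoad(String filename, String table, DataSource dataSource) throws Exception {
        try (Connection conn = dataSource.getConnection();
             Statement stmt = conn.createStatement()) {
            // Turn off binary logging for this session so the bulk load is not logged
            stmt.executeUpdate("set session sql_log_bin = OFF");
            // Requires allowLoadLocalInfile=true on the JDBC URL (see the Load.java notes above)
            // and local_infile enabled on the server
            stmt.executeUpdate("load data local infile '" + filename + "' into table " + table);
            stmt.executeUpdate("set session sql_log_bin = ON");
        }
    }
}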
java -jar target/genbank-loader-1.0.jar --load -h pursamydbcit.services.brown.edu -u vramanan -p <password>
*I had to use my password on the command line to get this to work with the batch job
Changes to the file loading:
- I was able to load the following tables easily: basic, keywords, journals, and dbxrefs
- Authors and annotations were too large to load, so I split them into 5 and 10 files respectively and loaded them manually after that.
- I do not advise trying to mass load them at once; I tried it twice, and each time it crashed due to communication failures about a day before it would have completed (after approximately 11-12 days)
These were the results of the “wc -l” command on the authors and annotations files to find their line count:
- Authors: 3,322,431,281
- Annotations: 6,712,480,648
This is the split command for annotations (again, run it as a batch job rather than on the command line):
split -a 1 -l 700000000 annotations.txt annotations
Apply the same logic to the authors file.
Once split, make sure not to truncate the authors and annotations tables, as you are appending each individual file onto the previously loaded table.
I made a flowchart so I could quickly go through the motions for each manual update:
- rm the previously loaded file
- mv <nextfile.txt> (this adds the ".txt" extension to the next split file)
- Update FeatureTableParser.java with new name of file
- Remove "truncate" or "delete" from AbstractLoader.java in dbLoad
- Comment out the variables defining the files you don't need
- Change the name of the annotations or authors file you are using to, for example, "authorsa.txt" ("a" is the first of the 5 authors files, followed by b, c, d, e)
- Run mvn package
- Update the batch file script
- Run script
Use the following command to check the row count of the tables so you can watch the updates as you go:
SELECT table_rows AS 'Rows Count' FROM information_schema.tables WHERE table_name = 'annotations' AND table_schema = 'genbank';
Try not to use "select count(*) from ..." because that scans the whole table and takes a very long time on tables this size.
GenBank Loader is Copyright 2015 The University of Vermont and State Agricultural College. All rights reserved.
GenBank Loader is licensed under the terms of the GNU General Public License (GPL) version 3.