Storage hadoop in 15 minutes
This tutorial is just for testing purposes. The Hadoop Storage Engine is not production ready.
OpenCGA uses dependencies from Hortonworks HDP-2.5.0 internally. It has not been tested with other flavours of Hadoop.
If you already have a working installation of Hadoop, skip this step.
You can use one of the many Hadoop sandboxes provided by Hortonworks or Cloudera, or download and manually install the required components: Hadoop, Spark, HBase and Phoenix.
Download or pull the version you want to try.
git clone https://github.com/opencb/opencga.git
You can build the application from source by executing:
mvn clean install -DskipTests -Dstorage-hadoop
You can customize some configuration parameters by adding them to the build with -D<param>=<value>. Some useful parameters are:
- OPENCGA.INSTALLATION.DIR: changes the installation directory.
- OPENCGA.CELLBASE.HOST: specifies the CellBase installation.
- OPENCGA.CELLBASE.VERSION: specifies the CellBase version.
- OPENCGA.STORAGE.ENGINE.DEFAULT: specifies the default storage engine. The default is "mongodb", so --storage-engine hadoop would have to be added to each command. Compile with -DOPENCGA.STORAGE.ENGINE.DEFAULT=hadoop to avoid that.
To see the rest of the configurable parameters, check the default-config profile in the main pom.xml.
For example, to change the installation directory, execute:
mvn clean install -DskipTests -Dstorage-hadoop -DOPENCGA.INSTALLATION.DIR=${HOME}/opt/opencga/
Then copy the application (the contents of the build folder) into the installation directory. By default, and in this tutorial, this is /opt/opencga.
mkdir -p /opt/opencga
cp -r ./build/* /opt/opencga
- See Download and Installation for more information.
The computer where OpenCGA is installed must have access to the Hadoop cluster.
In order to interact with Hadoop, we need to provide its configuration files. In OpenCGA there are two ways of doing that, depending on how Hadoop is accessed.
This configuration is for Hadoop client nodes (or local installations, or Hadoop nodes) where the commands 'hadoop', 'yarn' and 'hbase' are installed and the client configuration is up to date. The script bin/opencga-env.sh will add the configuration files to the Java classpath. Nothing else is needed.
In this scenario, you will be able to execute these commands:
hadoop classpath
hbase classpath
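Before relying on this mode, it can help to confirm that the three client commands are actually on the PATH. The following is a minimal sketch of such a check; the helper names `missing_commands` and `check_hadoop_client` are hypothetical and not part of OpenCGA.

```shell
# Hypothetical helper (not part of OpenCGA): print every command from the
# argument list that cannot be found on the PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
    return 0
}

# Report whether the Hadoop client commands named in this tutorial exist.
check_hadoop_client() {
    missing=$(missing_commands hadoop yarn hbase)
    if [ -n "$missing" ]; then
        # Unquoted on purpose: joins the missing names on a single line.
        echo "Missing Hadoop client commands:" $missing >&2
        return 1
    fi
    echo "Hadoop client commands found"
}
```

Running `check_hadoop_client` on the machine where OpenCGA is installed should succeed before `hadoop classpath` and `hbase classpath` are expected to work.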
Otherwise, we need to obtain the configuration files from the Hadoop cluster. In this scenario, just copy the configuration files into a folder called etc in the installation directory. This folder is automatically added to the classpath.
With this configuration, you will only be able to execute queries.
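As an illustration of the copy step, the sketch below gathers the XML client configuration files from one or more local directories into the etc folder of the installation. It assumes the cluster configuration has already been fetched to the local machine and that the relevant files are the usual *.xml ones (for example core-site.xml or hbase-site.xml); the helper name `copy_hadoop_conf` is hypothetical.

```shell
# Hypothetical helper (not part of OpenCGA): copy every *.xml configuration
# file from the given source directories into <INSTALL_DIR>/etc.
# Usage: copy_hadoop_conf INSTALL_DIR CONF_DIR...
copy_hadoop_conf() {
    install_dir=$1
    shift
    mkdir -p "$install_dir/etc" || return 1
    for conf_dir in "$@"; do
        for f in "$conf_dir"/*.xml; do
            # Skip the unexpanded pattern when a directory has no XML files.
            [ -e "$f" ] || continue
            cp "$f" "$install_dir/etc/" || return 1
        done
    done
    return 0
}
```

A call could look like `copy_hadoop_conf /opt/opencga /etc/hadoop/conf /etc/hbase/conf`, where the two source directories are assumed locations that vary between distributions.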
To simplify the installation, we are going to use the embedded server for the REST API.
# Set up opencga
## Install OpenCGA
./opencga-admin.sh catalog install -p <<< $CATALOG_ADMIN_PASSWORD
## Start servers
mkdir -p ../logs/
./opencga-admin.sh server rest --start -p <<< $CATALOG_ADMIN_PASSWORD \
2>> ../logs/server.err \
>> ../logs/server.out &
./opencga-admin.sh catalog daemon --start -p <<< $CATALOG_ADMIN_PASSWORD \
2>> ../logs/daemon.err \
>> ../logs/daemon.out &
## Create our first user
./opencga-admin.sh users create -u platinum --user-email platinum@illumina.com \
--user-name Platinum \
--user-organization Illumina \
--user-password $USER_PASSWORD \
--project-alias platinum \
--project-name Platinum \
--password <<< $CATALOG_ADMIN_PASSWORD
- See Getting started in 5 min for more info.
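The commands above read the passwords from the shell variables CATALOG_ADMIN_PASSWORD and USER_PASSWORD, so an unset variable would silently pass an empty password. A small guard run beforehand can catch that; this is only a sketch, and `require_vars` is a hypothetical helper, not an OpenCGA command.

```shell
# Hypothetical helper (not part of OpenCGA): fail if any of the named
# shell variables is unset or empty. Only pass trusted variable names.
require_vars() {
    for name in "$@"; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "Variable $name is not set" >&2
            return 1
        fi
    done
    return 0
}
```

For example, `require_vars CATALOG_ADMIN_PASSWORD USER_PASSWORD || exit 1` at the top of a setup script stops it before any OpenCGA command runs.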
For this test, we are going to use sample VCF data from the Platinum Genomes. You can use any other file, but all the examples below use the VCF file platinum-genomes-vcf-NA12877_S1.genome.vcf.gz.
You can find all the files required for this tutorial at these links:
- Original VCF files: http://bioinfo.hpc.cam.ac.uk/downloads/datasets/vcf/platinum_genomes/gz/
- Transformed proto files: http://bioinfo.hpc.cam.ac.uk/downloads/datasets/vcf/platinum_genomes/proto/
- CellBase annotation: http://bioinfo.hpc.cam.ac.uk/downloads/datasets/vcf/platinum_genomes/annotation/
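The files can be fetched from the command line. The sketch below builds a download URL from the three directories listed above and fetches one file with curl. It assumes each file sits directly under its directory with the name used in this tutorial, which has not been verified here; `platinum_url` and `download_platinum` are hypothetical helper names.

```shell
# Base URL shared by the three links listed above.
PLATINUM_BASE_URL="http://bioinfo.hpc.cam.ac.uk/downloads/datasets/vcf/platinum_genomes"

# Build the URL of one file. KIND is one of the directory names from the
# links: gz, proto or annotation.
platinum_url() {
    kind=$1
    file=$2
    case $kind in
        gz|proto|annotation)
            printf '%s/%s/%s\n' "$PLATINUM_BASE_URL" "$kind" "$file" ;;
        *)
            echo "Unknown kind: $kind" >&2
            return 1 ;;
    esac
}

# Download one file into DEST_DIR with curl
# (-f: fail on HTTP errors, -L: follow redirects, -o: output path).
download_platinum() {
    kind=$1
    file=$2
    dest_dir=$3
    url=$(platinum_url "$kind" "$file") || return 1
    mkdir -p "$dest_dir" && curl -fL -o "$dest_dir/$file" "$url"
}
```

For instance, `download_platinum gz platinum-genomes-vcf-NA12877_S1.genome.vcf.gz /path/to/platinum/vcfs` would place the example VCF where the link commands in this tutorial expect it.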
Once OpenCGA is installed and running, we need to create a new study in Catalog and load the proto files:
# Create study and folder structure
./opencga.sh users login -u platinum -p <<< PlatinumP@ss
./opencga.sh studies create --project-id platinum --alias platinum --name Platinum
./opencga.sh files create-folder -s platinum --folder 10_input
./opencga.sh files create-folder -s platinum --folder 20_transformed
./opencga.sh files create-folder -s platinum --folder 30_load
./opencga.sh files create-folder -s platinum --folder 40_annotation
# Link files
./opencga.sh files link -s platinum -i /path/to/platinum/vcfs/* --path 10_input
./opencga.sh files link -s platinum -i /path/to/platinum/proto/* --path 20_transformed
./opencga.sh files link -s platinum -i /path/to/platinum/annotation/* --path 40_annotation
Once everything is set up, we just need to load the files. This command will create an internal job that will be executed by the catalog daemon.
# Index asynchronously via Daemon
./opencga.sh files index --id 10_input --outdir 30_load --load
Optionally, we can use the opencga-analysis.sh command line for a synchronous execution:
# Index synchronously
mkdir -p /tmp/opencga_job
rm -rf /tmp/opencga_job/*
./opencga-analysis.sh variants index --file-id 10_input --outdir /tmp/opencga_job --load --path 30_load
For testing purposes, it may be useful to have a standalone installation of OpenCGA-Storage. A simple indexing run can be done by executing the following command:
./opencga-storage.sh variant index --storage-engine hadoop --study-id 1 --study-name platinum --gvcf --database opencga_platinum_platinum -i /path/to/platinum/proto/*
The last step, but not the least, is to annotate the variants. Although this can be done while indexing the variant files, it is clearer as a separate execution:
./opencga-storage.sh variant annotation --storage-engine hadoop --database opencga_platinum_platinum
This will annotate all the variants in the database that have no annotation, so already annotated variants are not annotated again.
If we have already downloaded the annotations, we can load them by executing:
for file in /path/to/platinum/annotations/*
do
./opencga-storage.sh variant annotation --storage-engine hadoop --database opencga_platinum_platinum --load "$file"
done
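The loop above keeps going if one file fails to load and, when the directory is empty, passes the literal pattern to the command. A slightly more defensive variant is sketched below. It uses the same flags as the loop; the OPENCGA_STORAGE variable and the `load_annotations` function are hypothetical additions that allow a dry run.

```shell
# Path to the storage CLI; override it (for example with "echo") for a dry run.
OPENCGA_STORAGE=${OPENCGA_STORAGE:-./opencga-storage.sh}

# Load every regular file found in ANNOTATION_DIR into DATABASE, stopping at
# the first failure.
# Usage: load_annotations ANNOTATION_DIR DATABASE
load_annotations() {
    annotation_dir=$1
    database=$2
    for file in "$annotation_dir"/*; do
        # An empty directory leaves the pattern unexpanded; skip it.
        [ -f "$file" ] || continue
        "$OPENCGA_STORAGE" variant annotation --storage-engine hadoop \
            --database "$database" --load "$file" || return 1
    done
    return 0
}
```

With `OPENCGA_STORAGE=echo`, calling `load_annotations /path/to/platinum/annotations opencga_platinum_platinum` only prints the commands that would be run.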
And we are done! At this point we are ready to query variants. Here are some example commands:
- Count number of variants
./opencga-storage.sh variant query --storage-engine hadoop --database opencga_platinum_platinum --count
- Get the first 10 variants from chromosome 8
./opencga-storage.sh variant query --storage-engine hadoop --database opencga_platinum_platinum --region 8 --limit 10 --sort
- Count variants in gene BRCA2
./opencga-storage.sh variant query --storage-engine hadoop --database opencga_platinum_platinum --gene BRCA2 --count
You can find the full list of options in the help:
./opencga-storage.sh variant query --help
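The per-gene count shown earlier can be repeated for several genes with a small wrapper. This is a sketch only: it reuses the --gene and --count flags from the example above and assumes that the command prints the count on standard output, which may not match the real output format. OPENCGA_STORAGE and `count_variants_by_gene` are hypothetical names.

```shell
# Path to the storage CLI; override it (for example with "echo") for a dry run.
OPENCGA_STORAGE=${OPENCGA_STORAGE:-./opencga-storage.sh}

# Print "<gene><TAB><output of the --count query>" for each gene name given.
# Usage: count_variants_by_gene DATABASE GENE...
count_variants_by_gene() {
    database=$1
    shift
    for gene in "$@"; do
        count=$("$OPENCGA_STORAGE" variant query --storage-engine hadoop \
            --database "$database" --gene "$gene" --count) || return 1
        printf '%s\t%s\n' "$gene" "$count"
    done
}
```

For example, `count_variants_by_gene opencga_platinum_platinum BRCA2 BRCA1` runs one count query per gene and prints one line for each.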
OpenCGA is an open source project and it is freely available.