
Indexing Variant Data v0.6.0

Jacobo Coll Moragón edited this page Aug 30, 2016 · 2 revisions

Indexing VCF files

VCF files can be indexed either with the pipeline implemented in the CLI or with the Java API. The aim of indexing is to make the data queryable. Indexing happens in two consecutive steps: transformation and load. During the transformation, the VCF data is normalized and converted into an internal variant data model (see Data Models). During the load, this normalized and validated file is loaded into the active storage engine plugin. For more information about the indexing process, see OpenCGA Storage Engine.

For these tests, we are going to use sample VCF data from the 1000 Genomes Project. You can use any other file, but all the examples below use the VCF file ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz.
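Before indexing, it can help to confirm that the input really is a gzip-compressed VCF. The sketch below is not part of OpenCGA: `looks_like_vcf` is a hypothetical helper that relies only on the VCF convention that the first line of the file starts with `##fileformat=VCF`.

```shell
#!/bin/sh
# Hypothetical helper (not part of OpenCGA): succeeds if the first line of a
# gzip-compressed file is a VCF "##fileformat=" header line.
looks_like_vcf() {
  # Decompress to stdout and keep only the first line; errors are silenced
  # so that a non-gzip file simply yields an empty string.
  first_line=$(gzip -cd -- "$1" 2>/dev/null | head -n 1) || true
  case "$first_line" in
    '##fileformat=VCF'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage: the check only runs if the sample file is in the working directory.
vcf=ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz
if [ -f "$vcf" ]; then
  if looks_like_vcf "$vcf"; then
    echo "$vcf: VCF header found"
  else
    echo "$vcf: no VCF header found"
  fi
fi
```

This only inspects the first header line; it does not validate the rest of the file, which is the job of the transformation step.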

Using the Command Line Interface

OpenCGA provides two ways of indexing VCF data:

  1. OpenCGA Analysis: a high-level CLI that uses OpenCGA Catalog.
  2. OpenCGA Storage: a low-level CLI completely independent of OpenCGA Catalog. Use this one if you already have a metadata server and only need the storage indexing capabilities.

OpenCGA Analysis and Catalog

The complete description of the OpenCGA command line interface is at Command Line; this is just a quick start example. First of all, you will need a user account. The user information is stored only in your Catalog. For instance:

./opencga.sh users create -u myuser -p mypass -e my@e.mail -n "my name"

Then you can organize your data into several projects, with several studies in each project. Here we will have only one of each:

./opencga.sh projects create -a myproject -n "Default project" -d "First project created." -u myuser -p mypass
./opencga.sh studies create -a mystudy --project-id myuser@myproject -n "Default study" -d "First study created." -u myuser -p mypass

Now we can do the actual indexing. First, we have to tell Catalog where the file we want to operate on is:

./opencga.sh files create  -u myuser -p mypass -i ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz  --study-id myuser@myproject/mystudy --bioformat VARIANT

Then ask for the indexing. If you have properly configured your storage engine (currently MongoDB or HBase) and want to do both the transformation and the load, you can go straight ahead and run:

./opencga.sh files index -u myuser -p mypass -id myuser@myproject/mystudy/ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz -Dannotate=false -- --include-genotypes --compress-genotypes

If you also want to annotate the variants and compute statistics about the genotypes, run instead:

./opencga.sh files index -u myuser -p mypass -id myuser@myproject/mystudy/ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz  -- --include-genotypes --compress-genotypes --calculate-stats
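The file registration and indexing steps above can be collected into one small script. The following is a minimal sketch, not an official OpenCGA tool: the `run` wrapper, the `DRY_RUN` and `OPENCGA` variables and the `register_and_index` function are hypothetical, and every `opencga.sh` option is copied from the commands above, with only the user, password, study and file name factored into arguments. It assumes the user, project and study already exist and that the VCF is in the working directory. By default it only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
# Sketch only: `run`, DRY_RUN and OPENCGA are hypothetical, not OpenCGA features.
DRY_RUN="${DRY_RUN:-1}"
OPENCGA="${OPENCGA:-./opencga.sh}"

# Print the command in dry-run mode (quoting is not preserved in the
# printed form), otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Arguments: user, password, study id (user@project/study), VCF file name.
register_and_index() {
  user=$1 pass=$2 study_id=$3 vcf=$4
  # Tell Catalog about the file.
  run "$OPENCGA" files create -u "$user" -p "$pass" -i "$vcf" \
      --study-id "$study_id" --bioformat VARIANT || return 1
  # Transform and load without annotation, as in the first indexing command.
  run "$OPENCGA" files index -u "$user" -p "$pass" -id "$study_id/$vcf" \
      -Dannotate=false -- --include-genotypes --compress-genotypes
}

register_and_index myuser mypass myuser@myproject/mystudy \
    ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz
```

Keeping the dry run as the default makes it easy to review the exact commands before anything is written to Catalog or the storage engine.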

Now your VCF should be indexed and ready to receive queries. You can query it with:

./opencga.sh files info -id myuser@myproject/mystudy/ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz -u myuser -p mypass --output-format IDS > myfileid.txt
./opencga-storage.sh fetch-variants -r 22:16050050-16050100 -a $(cat myfileid.txt) --database opencga_myuser_myproject
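The `--database` value in the query above appears to follow the pattern `opencga_<user>_<project>`. This is inferred from the example only and may depend on your Catalog configuration, so treat the helper below as an assumption rather than documented behaviour; `catalog_db_name` is a hypothetical function, not part of OpenCGA.

```shell
#!/bin/sh
# Hypothetical helper: builds the database name using the pattern seen in the
# example (opencga_<user>_<project>). The pattern itself is an assumption.
catalog_db_name() {
  printf 'opencga_%s_%s\n' "$1" "$2"
}

catalog_db_name myuser myproject   # prints: opencga_myuser_myproject
```

The printed name can then be passed to `--database` when querying with `opencga-storage.sh`.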

We are still deciding which options should be the defaults, so that the CLI is as intuitive to use as possible.

OpenCGA Storage

⚠️ This is a low-level CLI. Any metadata records must be kept by your application.

A VCF can be indexed in one or two steps, depending on whether you want to delay the database load.

A simple indexing run looks like the following command. Note that at this level you must manage your own IDs. Here we use 1 and 2 as an example:

./opencga-storage.sh index-variants --studyId 1 --file-id 2 -i ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz --database chr22_test_db

If your dataset is big and you want to proceed in smaller steps, we recommend splitting the ETL process in two, a transform step followed by a load of the transformed file:

./bin/opencga-storage.sh index-variants --studyId 1 --file-id 2 -i ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz --transform 

./bin/opencga-storage.sh index-variants --studyId 1 --file-id 2 -i ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz.variants.json.gz --database chr22_test_db --load  
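The two-step process above can be wrapped so that the load only runs if the transformation succeeds. This is a minimal sketch under stated assumptions: `run`, `DRY_RUN`, `STORAGE`, `transformed_name` and `transform_then_load` are hypothetical names, and the `.variants.json.gz` suffix for the transformed file is taken from the example above rather than from a documented rule. The `index-variants` options are exactly the ones used in the commands above. By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
# Sketch only: these helpers are hypothetical, not part of OpenCGA.
DRY_RUN="${DRY_RUN:-1}"
STORAGE="${STORAGE:-./bin/opencga-storage.sh}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Name of the transformed file, assuming the suffix shown in the example.
transformed_name() {
  printf '%s.variants.json.gz\n' "$1"
}

# Arguments: study id, file id, input VCF, database name.
transform_then_load() {
  study=$1 file=$2 vcf=$3 db=$4
  # Step 1: transform only; stop here if it fails.
  run "$STORAGE" index-variants --studyId "$study" --file-id "$file" \
      -i "$vcf" --transform || return 1
  # Step 2: load the transformed file into the database.
  run "$STORAGE" index-variants --studyId "$study" --file-id "$file" \
      -i "$(transformed_name "$vcf")" --database "$db" --load
}

transform_then_load 1 2 ALL.chr22.phase1.projectConsensus.genotypes.vcf.gz chr22_test_db
```

Because the two steps are separate invocations, the load can also be postponed and run later by hand on the transformed file.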