This repository contains my submission for the Streptococcus pneumoniae Distance Coding Challenge concerning my application for the Pathogen Informatics and Modelling group at EMBL-EBI, which implements the MinHash Algorithm to compute genetic distances between isolates and visualize the relationships using a Neighbor-Joining Tree as in (Ondov, B.D., Treangen, T.J., Melsted, P. et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol 17, 132 (2016). https://doi.org/10.1186/s13059-016-0997-x).
Here is a graphical workflow :
- Efficient processing of genome sequences using Biopython.
- Implementation of MinHash Sketching for scalable computation.
- Calculation of Jaccard Distances.
- Construction and visualization of a Neighbor-Joining Tree.
- Modular Python code.
Before you begin, ensure you have the following:
- Python 3.8 or later installed on your system.
- Required Python packages installed. You can install them via:
pip install -r requirements.txt
When cloning the repo, you will have 4 fasta files associated :
R6.fa
and TIGR4.fa
: reference sequences.
14412_3#82.contigs_velvet.fa
and 14412_3#84.contigs_velvet.fa
: draft sequences, assembled from shotgun sequencing runs.
Clone the repository and navigate to the project directory:
git clone https://github.com/PhilRTFM/EBI_BacPop_Technical-Interview.git
cd EBI_BacPop_Technical-Interview
- Ensure the genome files (
R6.fa
,TIGR4.fa
,14412_3#82.contigs_velvet.fa
,14412_3#84.contigs_velvet.fa
) are present in the working directory. - Run the main script:
python main.py
- Pairwise Distance Matrix: Saved as
pairwise_distance_matrix.txt
. - Neighbor-Joining Tree:
- Newick format:
phylogenetic_{method}_tree.nwk
- PNG visualization:
phylogenetic_{method}_tree.png
Where {method} isfull
for k-mer Jaccard orminhash
with sketches.
- Newick format:
-
FASTA Parsing
- Function(s):
load_fasta(file_path)
: Loads genome sequences from FASTA files and concatenates them into a single string.extract_kmers(sequence, k=14)
: Extracts all unique k-mers (of length 14) from the concatenated sequence.
- Role:
- Converts genome sequences into k-mer sets for further analysis.
- Prepares the dataset for exact and approximate distance calculations.
- Function(s):
-
14-mers and Jaccard Distance
- Function(s):
full_distance(kmers_a, kmers_b)
: Computes the exact Jaccard distance between two sets of k-mers.construct_distance_matrix(labels, distances)
: Constructs a symmetric distance matrix from pairwise distances.
- Role:
- Computes exact pairwise Jaccard distances for all genome pairs using the full k-mer sets.
- Produces a numerical representation of genetic similarity between genomes in matrix form.
- Function(s):
-
MinHash Sketching
- Function(s):
hash_kmer(kmer)
: Generates a hash value for each k-mer, considering both forward and reverse complement.minhash_sketch(kmers, sketch_size=1000)
: Generates a sketch of thesketch_size
smallest hash values from a set of k-mers.
- Role:
- Provides a compact representation of k-mer sets for efficient distance computation.
- Handles large datasets by reducing memory requirements while retaining essential information.
- Function(s):
-
Approximate Jaccard Distance
- Function(s):
minhash_distance(sketch_a, sketch_b)
: Computes the approximate Jaccard distance between two MinHash sketches.construct_distance_matrix(labels, distances)
: Constructs a distance matrix for the approximate distances.
- Role:
- Approximates the genetic distances between genomes using MinHash sketches, saving computation time and memory.
- Function(s):
-
Neighbor-Joining Tree
- Function(s):
doNeighbourJoining(distance_matrix, labels)
: Constructs a phylogenetic tree using the Neighbor-Joining algorithm.save_newick(tree, output_path)
: Saves the tree in Newick format.plot_tree_from_newick(newick_path, output_path)
: Visualizes the tree and saves it as an image.
- Role:
- Constructs phylogenetic trees based on both full and MinHash distance matrices.
- Saves the trees in both Newick format and graphical PNG format for further analysis and visualization.
- Function(s):
The MinHash-based distances, computed using sketches of size 1000, approximate the true Jaccard distances. This method is computationally efficient and well-suited for large datasets:
[[0.0000 0.2624 0.9868 0.9868]
[0.2624 0.0000 0.9858 0.9863]
[0.9868 0.9858 0.0000 0.2847]
[0.9868 0.9863 0.2847 0.0000]]
The exact Jaccard distances, calculated using full k-mer sets, provide the baseline for evaluating the accuracy of the MinHash approximation:
[[0.0000 0.9744 0.9887 0.9887]
[0.9744 0.0000 0.9885 0.9885]
[0.9887 0.9885 0.0000 0.5608]
[0.9887 0.9885 0.5608 0.0000]]
The Neighbor-Joining tree, built from the computed distance matrices, highlights the evolutionary relationships between isolates. Below is the Newick format output:
(((R6.fa:0.1316,TIGR4.fa:0.1316):0.7129,14412_3#82.contigs_velvet.fa:0.7129):0.1425,14412_3#84.contigs_velvet.fa:0.1425);
(((R6.fa:0.1261,TIGR4.fa:0.1261):0.7202,14412_3#82.contigs_velvet.fa:0.7202):0.1408,14412_3#84.contigs_velvet.fa:0.1408);
Both trees, visualized side by side, demonstrate a similarity between the Full Jaccard and MinHash methods:
- Method: Full Jaccard Distance
Time for Full Jaccard Distance Matrix: 3.7777 seconds
- Method: MinHash Jaccard Distance
Time for MinHash Sketches: 31.3916 seconds
Time for MinHash Distance Matrix: 0.0030 seconds
-
Comparison of Full and MinHash Distances:
- The Full Jaccard distances serve as the ground truth for assessing the MinHash approximation.
- Despite slight numerical differences, the MinHash-based tree retains the same topology as the Full Jaccard tree, confirming its accuracy for phylogenetic inference.
-
Effect of Sketch Size:
-
Biological Relevance:
- Phylogenetic trees elucidate evolutionary relationships between isolates, aiding in understanding pathogen lineages.
- Even if the MinHash sketches computation takes time, computational efficiency of MinHash may enables analysis of large-scale genomic datasets, critical in epidemiological studies.
For queries, reach out to:
- GitHub: PhilRTFM