Genome-centric long-read metagenomics workflow for automated recovery and analysis of prokaryotic genomes with Nanopore or PacBio HiFi sequencing data.
The mmlong2 workflow is a continuation of mmlong.
- Snakemake workflow running dependencies from a Singularity container for enhanced reproducibility
- Bioinformatics tool and parameter optimizations for processing high complexity metagenomic samples
- Circular prokaryotic genome extraction as separate genome bins
- Eukaryotic contig removal for reduced prokaryotic genome contamination
- Differential coverage support for improved prokaryotic genome recovery
- Iterative ensemble binning strategy for improved prokaryotic genome recovery
- Recovered genome quality classification according to MIMAG guidelines
- Supplemental genome quality assessment, including microdiversity approximation and chimerism checks
- Automated taxonomic classification at genome, contig and 16S rRNA levels
- Generation of analysis-ready dataframes at genome and contig levels
The recommended way of installing mmlong2 is by setting up a Conda environment through Bioconda:
mamba install -c bioconda mmlong2
A local Conda environment with the latest workflow code can also be created by using the following code:
mamba create --prefix mmlong2 -c conda-forge -c bioconda snakemake=8.2.3 singularity=3.8.6 zenodo_get pv pigz tar yq ncbi-amrfinderplus -y
mamba activate ./mmlong2 || source activate ./mmlong2
git clone https://github.com/Serka-M/mmlong2 mmlong2/repo
cp -r mmlong2/repo/src/* mmlong2/bin
chmod +x mmlong2/bin/mmlong2
mmlong2 -h
Bioinformatics tools and other software dependencies will be automatically installed when running the workflow for the first time.
By default, a pre-built Singularity container will be downloaded and set up, although pre-defined Conda environments can also be used by running the workflow with the --conda_envs_only setting.
To acquire prokaryotic genome taxonomy and annotation results, databases are necessary and can be automatically installed by running the following command:
mmlong2 --install_databases
If some of the databases are already installed, they can be re-used by the workflow without downloading (e.g. with the --database_gtdb option). Alternatively, a guide for manual database installation is also provided.
For trying out the mmlong2 workflow, small test datasets can be downloaded from Zenodo:
zenodo_get -r 12168493
Once downloaded, the workflow can be tested in Nanopore mode up until genome binning completes (ETA 2 hours, 110 GB peak RAM):
mmlong2 -np mmlong2_np.fastq.gz -o mmlong2_testrun_np -p 60 -run binning
To test the workflow in PacBio HiFi mode, using metaMDBG as the assembler and performing both genome recovery and analysis (ETA 4.5 hours, 170 GB peak RAM):
mmlong2 -pb mmlong2_pb.fastq.gz -o mmlong2_testrun_pb -p 60 -dbg
MAIN INPUTS:
-np --nanopore_reads Path to Nanopore reads
-pb --pacbio_reads Path to PacBio HiFi reads
-o --output_dir Output directory name (default: mmlong2)
-p --processes Number of processes/multi-threading (default: 3)
OPTIONAL SETTINGS:
-db --install_databases Install missing databases used by the workflow
-dbd --database_dir Output directory for database installation (default: current working directory)
-cov --coverage CSV dataframe for differential coverage binning (e.g. NP/PB/IL,/path/to/reads.fastq)
-run --run_until Run pipeline until a specified stage completes (e.g. assembly polishing filtering singletons coverage binning taxonomy annotation extraqc stats)
-tmp --temporary_dir Directory for temporary files (default: current working directory)
-dbg --use_metamdbg Use metaMDBG for assembly of PacBio reads (default: use metaFlye)
-med --medaka_model Medaka polishing model (default: r1041_e82_400bps_sup_v5.0.0)
-mo --medaka_off Do not run Medaka polishing with Nanopore assemblies (default: use Medaka)
-vmb --use_vamb Use VAMB for binning (default: use GraphMB)
-sem --semibin_model Binning model for SemiBin (default: global)
-mlc --min_len_contig Minimum assembly contig length (default: 3000)
-mlb --min_len_bin Minimum genomic bin size (default: 250000)
-rna --database_rrna 16S rRNA database to use
-gunc --database_gunc Gunc database to use
-bkt --database_bakta Bakta database to use
-kj --database_kaiju Kaiju database to use
-gtdb --database_gtdb GTDB-tk database to use
-h --help Print help information
-v --version Print workflow version number
ADVANCED SETTINGS:
-fmo --flye_min_ovlp Minimum overlap between reads used by Flye assembler (default: auto)
-fmc --flye_min_cov Minimum initial contig coverage used by Flye assembler (default: 3)
-env --conda_envs_only Use conda environments instead of container (default: use container)
-n --dryrun Print summary of jobs for the Snakemake workflow
-t --touch Touch Snakemake output files
-r1 --rule1 Run specified Snakemake rule for the MAG production part of the workflow
-r2 --rule2 Run specified Snakemake rule for the MAG processing part of the workflow
-x1 --extra_inputs1 Extra inputs for the MAG production part of the Snakemake workflow
-x2 --extra_inputs2 Extra inputs for the MAG processing part of the Snakemake workflow
-xb --extra_inputs_bakta Extra inputs (comma-separated) for MAG annotation using Bakta
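Because the --run_until option only makes sense with one of the stage names listed in the help text, a small wrapper can catch a typo before a long run starts. The following is a minimal sketch, assuming the stage list in the help text above is complete; the helper name is_valid_stage and the file name reads.fastq.gz are hypothetical and not part of mmlong2.

```shell
#!/usr/bin/env bash
# Stage names copied from the --run_until help text (assumed to be complete).
STAGES="assembly polishing filtering singletons coverage binning taxonomy annotation extraqc stats"

# Hypothetical helper (not part of mmlong2): succeeds only for a known stage name.
is_valid_stage() {
  local wanted="$1" stage
  for stage in $STAGES; do
    [ "$stage" = "$wanted" ] && return 0
  done
  return 1
}

stage="binning"
if is_valid_stage "$stage"; then
  # The command is printed rather than executed, so the sketch is safe to run anywhere.
  # reads.fastq.gz and mmlong2_out are placeholder names.
  echo "mmlong2 -np reads.fastq.gz -o mmlong2_out -p 60 -run $stage"
else
  echo "Unknown stage: $stage" >&2
fi
```

Replacing the echo with the actual command turns this into a guarded launcher.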
To perform genome recovery with differential coverage, prepare a 2-column comma-separated dataframe, indicating the additional read datatype (NP for Nanopore, PB for PacBio, IL for short reads) and the read file location.
Dataframe example:
PB,/path/to/your/reads/file1.fastq
NP,/path/to/your/reads/file2.fastq
IL,/path/to/your/reads/file3.fastq.gz
The prepared dataframe can be provided to the workflow through the -cov option.
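The dataframe format described above can be checked before starting a run. The following is a minimal sketch that writes the example dataframe to a temporary file and verifies that each row has exactly two comma-separated fields and a datatype of NP, PB or IL. The helper name check_cov_format is hypothetical and not part of mmlong2, and the check covers the format only, not whether the read files exist.

```shell
#!/usr/bin/env bash
# Write a coverage dataframe; the read paths are the placeholders from the example above.
cov_file="$(mktemp)"
cat > "$cov_file" <<'EOF'
PB,/path/to/your/reads/file1.fastq
NP,/path/to/your/reads/file2.fastq
IL,/path/to/your/reads/file3.fastq.gz
EOF

# Hypothetical helper (not part of mmlong2): fails if any row does not have
# exactly two fields, an NP/PB/IL datatype and a non-empty path.
check_cov_format() {
  awk -F',' 'NF != 2 || $1 !~ /^(NP|PB|IL)$/ || $2 == "" { bad = 1 }
             END { exit bad + 0 }' "$1"
}

if check_cov_format "$cov_file"; then
  echo "Coverage dataframe looks well-formed: $cov_file"
  # Printed rather than executed; reads.fastq.gz and mmlong2_out are placeholder names.
  echo "mmlong2 -np reads.fastq.gz -o mmlong2_out -p 60 -cov $cov_file"
else
  echo "Coverage dataframe has malformed rows: $cov_file" >&2
fi
```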
- <output_name>_assembly.fasta: assembled and polished metagenome
- <output_name>_16S.fa: 16S rRNA sequences, recovered from the polished metagenome
- <output_name>_bins.tsv: per-bin results dataframe
- <output_name>_contigs.tsv: per-contig results dataframe
- <output_name>_general.tsv: workflow result summary as a single-row dataframe
- dependencies.csv: list of dependencies used and their versions
- bins: directory for metagenome-assembled genomes
- bakta: directory containing genome annotation results from Bakta
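After a run finishes, the presence of the outputs listed above can be verified with a short script. The following is a minimal sketch; the helper name check_outputs is hypothetical and not part of mmlong2, and the directory holding the results is passed as an argument because its exact location is an assumption that should be adjusted to your run. The demonstration uses a mock directory, so it runs without a finished workflow.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of mmlong2): reports which expected outputs are missing.
# $1 = directory holding the results (assumed; adjust to your run), $2 = output name.
check_outputs() {
  local dir="$1" name="$2" missing=0 f
  for f in "${name}_assembly.fasta" "${name}_16S.fa" "${name}_bins.tsv" \
           "${name}_contigs.tsv" "${name}_general.tsv" dependencies.csv; do
    [ -f "$dir/$f" ] || { echo "missing file: $f" >&2; missing=1; }
  done
  for f in bins bakta; do
    [ -d "$dir/$f" ] || { echo "missing directory: $f" >&2; missing=1; }
  done
  return "$missing"
}

# Mock results directory with empty placeholder files, for demonstration only.
demo="$(mktemp -d)"
mkdir "$demo/bins" "$demo/bakta"
touch "$demo/dependencies.csv"
for suffix in assembly.fasta 16S.fa bins.tsv contigs.tsv general.tsv; do
  touch "$demo/demo_$suffix"
done

check_outputs "$demo" demo && echo "all expected outputs present"
```

A run stopped early with --run_until will not produce every file, so a non-zero exit status here is not necessarily an error.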
Suggestions on improving the workflow or fixing bugs are always welcome.
Please use the GitHub Issues section or e-mail mase@bio.aau.dk to provide feedback.
If you use mmlong2 in a publication, please cite:
Sereika M, Mussig AJ, Jiang C, Knudsen KS, Jensen TBN, Petriglieri F, et al. Recovery of highly contiguous genomes from complex terrestrial habitats reveals over 15,000 novel prokaryotic species and expands characterization of soil and sediment microbial communities. bioRxiv. 2024.12.19.629313. https://doi.org/10.1101/2024.12.19.629313