This repository provides MOSA, a bespoke Variational Autoencoder (VAE) that integrates the molecular and phenotypic data sets available for cancer cell lines.
- Clone this repository
- Create a Python 3.10 environment, e.g. `conda create -n mosa python=3.10`
- Activate the environment: `conda activate mosa`
- Run `pip install -r requirements.txt`
- Install shap from https://github.com/ZhaoxiangSimonCai/shap, which is customised to support the data format used in MOSA (see the example after this list)
- Run `pip install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/cu118`
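One common way to install the customised shap fork, assuming pip can install it directly from GitHub (alternatively, clone the repository and run `pip install .` inside it):

```bash
pip install git+https://github.com/ZhaoxiangSimonCai/shap.git
```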
Installation time depends largely on internet speed, since the packages must be downloaded. Typically, installation takes less than 10 minutes.
- Download the data files from the figshare repository (see links in the manuscript)
- Configure the paths of the data files in `reports/vae/files/hyperparameters.json` (a sketch of the relevant entries is shown below)
- Run MOSA with `python PhenPred/vae/Main.py`
The expected output, including the latent space matrix and reconstructed data matrices, can be downloaded from the figshare repository as described in the paper.
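The exact key names are defined in the shipped `hyperparameters.json`; purely as an illustration (the `datasets` key and the omic names below are hypothetical), the idea is to map each omic layer to a local file path. The `skip_benchmarks` flag, used later in this README, is shown alongside:

```json
{
  "datasets": {
    "transcriptomics": "reports/vae/files/transcriptomics.csv",
    "proteomics": "reports/vae/files/proteomics.csv"
  },
  "skip_benchmarks": false
}
```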
As a deep learning-based method, the runtime of MOSA depends on whether a GPU is available for training. MOSA took 52 minutes to train and generate the results using a V100 GPU on the DepMap dataset.
Although MOSA is specifically designed for analysing the DepMap dataset, the model can be adapted to any multi-omic dataset. To use MOSA with custom datasets:
- Prepare the custom dataset following the format of the DepMap data, which can be downloaded from the figshare repositories described in the manuscript
- Configure the paths of the data files in `reports/vae/files/hyperparameters.json`; at least two omic datasets are required
- Run MOSA with `python PhenPred/vae/Main.py`
- If certain benchmark analyses cannot be run properly, MOSA can be run with `skip_benchmarks=true` set in `hyperparameters.json` to only save the output data, which includes the integrated latent space matrix and the reconstructed data for each omic
- To further customise data pre-processing, create your own dataset class following the style of `PhenPred/vae/DatasetDepMap23Q2.py`, and then use the custom dataset class in `Main.py` (a minimal sketch is given after this list)
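Purely as an illustration, a custom dataset class might look like the sketch below; the class name, constructor arguments and return format are hypothetical, and the actual interface expected by `Main.py` should be copied from `PhenPred/vae/DatasetDepMap23Q2.py`:

```python
# Hypothetical sketch of a custom multi-omic dataset class, loosely
# following the style of PhenPred/vae/DatasetDepMap23Q2.py; names and
# interface are illustrative, not the actual MOSA API.
import pandas as pd
import torch
from torch.utils.data import Dataset


class DatasetCustom(Dataset):
    def __init__(self, omics_paths):
        # omics_paths: dict mapping omic name -> CSV path
        # (samples as rows, features as columns); MOSA requires
        # at least two omic datasets.
        self.dfs = {n: pd.read_csv(p, index_col=0) for n, p in omics_paths.items()}

        # Keep only the samples present in every omic layer.
        samples = sorted(set.intersection(*(set(df.index) for df in self.dfs.values())))
        self.samples = samples
        self.dfs = {n: df.loc[samples] for n, df in self.dfs.items()}

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # One float32 tensor per omic layer for the idx-th sample.
        return {n: torch.tensor(df.iloc[idx].values, dtype=torch.float32)
                for n, df in self.dfs.items()}
```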
- Download the data from figshare
- Place the downloaded files in `reports/vae/files/`
- In `Main.py`, configure MOSA to run from the pre-computed data with `hyperparameters = Hypers.read_hyperparameters(timestamp="20231023_092657")` (see the snippet after this list)
- Directly run MOSA with the default configurations as described above.
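For context, the relevant change in `Main.py` is a single line (`Hypers` is the hyperparameter helper already used by `Main.py`; the timestamp identifies the pre-computed run downloaded from figshare):

```python
# Load the hyperparameters of the pre-computed run instead of training
# from scratch; "20231023_092657" is the run shipped via figshare.
hyperparameters = Hypers.read_hyperparameters(timestamp="20231023_092657")
```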
To incorporate disentanglement learning, two additional terms are included in the loss function, following the Disentangled Inferred Prior Variational Autoencoder (DIP-VAE) approach, as described by Kumar et al. (2018):
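Following Kumar et al. (2018), writing $C$ for the covariance matrix of the inferred prior (the covariance of the encoder means $\mu_\phi(x)$ for type `"i"`, and of the full latent posterior for type `"ii"`), the two added terms are

$$\lambda_{od} \sum_{i \neq j} C_{ij}^{2} + \lambda_{d} \sum_{i} \left( C_{ii} - 1 \right)^{2},$$

which push $C$ towards the identity matrix: off-diagonal entries (correlations between latent dimensions) towards zero and diagonal entries towards one.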
To use this, update the `hyperparameters.json` file by specifying `dip_vae_type` as either `"i"` or `"ii"` (type ii is recommended), and define the parameters `lambda_d` and `lambda_od` as float values, which control the diagonal and off-diagonal regularization, respectively.
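For example, the relevant entries in `hyperparameters.json` could look as follows (the lambda values are illustrative placeholders, not recommended settings):

```json
{
  "dip_vae_type": "ii",
  "lambda_d": 0.05,
  "lambda_od": 0.1
}
```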
The pre-trained models can be downloaded from the Hugging Face model hub: MOSA
Cai, Z. et al. Synthetic multi-omics augmentation of cancer cell lines using unsupervised deep learning (2023).