Topyfic_analysis

This repo contains all the analysis related to this paper.

I would also encourage you to take a look at the Topyfic package.

Overview

The gene expression profiles of distinct cell types reflect complex genomic interactions among multiple simultaneous biological processes within each cell that can be altered by disease progression as well as genetic background. The identification of these active cellular programs is an open challenge in the analysis of single-cell RNA-seq data. Latent Dirichlet Allocation (LDA) is a generative method used to identify recurring patterns in count data, commonly referred to as topics, that can be used to interpret the state of each cell. However, LDA’s interpretability is hindered by several key factors, including the choice of the number of topics as a hyperparameter and the variability in topic definitions due to random initialization. We developed Topyfic, a Reproducible LDA (rLDA) package, to accurately infer the identity and activity of cellular programs in single-cell data, providing insights into the relative contributions of each program in individual cells. We apply Topyfic to brain single-cell and single-nucleus datasets of two 5xFAD mouse models of Alzheimer’s disease crossed with C57BL6/J or CAST/EiJ mice to identify distinct cell types and states in different cell types such as microglia. We find that 8-month 5xFAD/Cast F1 males show a higher level of microglial activation than matching 5xFAD/BL6 F1 males, whereas female mice show similar levels of microglial activation. We show that regulatory genes such as TFs, microRNA host genes, and chromatin regulatory genes alone capture cell types and cell states. Our study highlights how topic modeling with a limited vocabulary of regulatory genes can identify gene expression programs in single-cell data in order to quantify similar and divergent cell states in distinct genotypes.

Data

ENCODE data: cart
MODEL-AD data: GEO

Preprocessing

Preprocessing consists of the steps listed below, but you can also download the preprocessed gene count data from the ENCODE portal here.

Get the unfiltered gene count h5ad for each experiment.
Merge data by experimental batch across file IDs and keep nuclei with > 500 UMIs.
Run Scrublet to remove doublets (threshold > 0.25).
Annotate nuclei.
Normalize counts using depth normalization.
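
The UMI filter and the depth normalization step above can be sketched on a toy count matrix. This is a minimal, dependency-free illustration and not the pipeline's own code: the function names are hypothetical, the numbers are made up, and scaling every nucleus to the median depth is an assumption, since the text only says "depth normalization".

```python
from statistics import median

MIN_UMI = 500  # threshold from the filtering step above


def filter_nuclei(counts, min_umi=MIN_UMI):
    """Keep rows (nuclei) whose total UMI count exceeds min_umi."""
    return [row for row in counts if sum(row) > min_umi]


def depth_normalize(counts, target=None):
    """Scale each nucleus so that its counts sum to a common depth.

    Assumption: the target depth defaults to the median depth of the
    retained nuclei; the actual pipeline may use a different target.
    """
    depths = [sum(row) for row in counts]
    if target is None:
        target = median(depths)
    return [[value * target / depth for value in row]
            for row, depth in zip(counts, depths)]


# Toy matrix: rows are nuclei, columns are genes (made-up numbers).
toy_counts = [
    [300, 200, 100],  # 600 UMIs, kept
    [100, 100, 100],  # 300 UMIs, removed
    [400, 400, 200],  # 1000 UMIs, kept
]

kept = filter_nuclei(toy_counts)
normalized = depth_normalize(kept)
```

After normalization every retained nucleus has the same total count, so differences between nuclei reflect relative expression rather than sequencing depth.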

For an in-depth analysis, please look at this GitHub repository.

Regulatory genes

We hypothesize that we can define meaningful topics for cell identity using only regulatory genes, which account for 12% of protein-coding genes.

For more information about how we determine regulatory genes, please look at this GitHub repository.
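
Restricting the vocabulary to regulatory genes amounts to keeping only the matching columns of the count matrix before training. The sketch below shows that subsetting on a toy matrix; the helper name and the gene identifiers are hypothetical placeholders, not entries from the actual regulatory gene list.

```python
def restrict_to_genes(counts, gene_names, keep):
    """Return the count matrix and gene list limited to genes in `keep`."""
    keep = set(keep)
    idx = [i for i, gene in enumerate(gene_names) if gene in keep]
    sub_counts = [[row[i] for i in idx] for row in counts]
    sub_genes = [gene_names[i] for i in idx]
    return sub_counts, sub_genes


# Placeholder gene names and counts, for illustration only.
gene_names = ["gene_a", "regulator_1", "gene_b", "regulator_2"]
counts = [
    [5, 1, 0, 3],
    [2, 0, 7, 4],
]
regulatory = ["regulator_1", "regulator_2"]

sub_counts, sub_genes = restrict_to_genes(counts, gene_names, regulatory)
```

The column order of the original matrix is preserved, so topics trained on the reduced matrix can still be mapped back to gene names.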

Training TopModel

To find the best number of topics (K), we train the model for several values of K, from K=5 to K=50. We first train on WT and 5xFAD mice separately.

For each K:

Train the model with 100 different random seeds using topyfic.py to get one train object per random seed.
Aggregate all train objects using make_train.py.
Make the TopModel using make_topmodel.py.

At the end you have one train object and one TopModel object for each K.
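
The size of the resulting training grid can be sketched as follows. This only enumerates the (K, seed) pairs and does not call the scripts, whose arguments are not documented here. Scanning K in steps of 5 and numbering the seeds 0 to 99 are both assumptions; the text gives only the range 5 to 50 and the count of 100 seeds.

```python
from itertools import product

# Assumption: K is scanned in steps of 5 between 5 and 50 inclusive.
K_VALUES = list(range(5, 51, 5))
# Assumption: seeds are numbered 0..99 (100 different random seeds).
SEEDS = list(range(100))


def training_jobs(k_values=K_VALUES, seeds=SEEDS):
    """List every (K, seed) pair that needs a single-seed training run."""
    return list(product(k_values, seeds))


jobs = training_jobs()

# Group the runs by K, mirroring the per-K aggregation step.
jobs_by_k = {}
for k, seed in jobs:
    jobs_by_k.setdefault(k, []).append(seed)
```

Each pair corresponds to one single-seed run of topyfic.py, and each key of `jobs_by_k` corresponds to one run of make_train.py followed by make_topmodel.py. Because the single-seed runs are independent, they are natural candidates for parallel execution on a cluster.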

Description of each folder in Training directory

Training on WT (BL6 and BL6/CAST) using all genes: here.
Training on 5xFAD (BL6 and BL6/CAST) using all genes: here.
Combine training using all genes: here.
Training on WT (BL6 and BL6/CAST) using regulatory genes: here.
Training on 5xFAD (BL6 and BL6/CAST) using regulatory genes: here.
Combine training using regulatory genes: here.
Training on microglia single cell using all genes: here.

Analysis TopModel

Here you can find the downstream analysis for each model:

Single-nucleus RNA-seq data using all genes: the notebook for this dataset and Figure 2 is here.
Single-nucleus RNA-seq data using regulatory genes: the notebook for this dataset and Figure 3 is here.
Single-cell microglia RNA-seq data: the notebook for this dataset and Figure 4 is here.

Hosting results

Single-nucleus RNA-seq results: ENCODE portal
Single-cell (microglia) RNA-seq results: Zenodo
