zzchr1s edited this page Nov 8, 2022 · 7 revisions

Introduction

RefTM (Reference-guided Topic Modeling of single-cell chromatin accessibility data) is a reference-guided approach, based on topic modeling, for analyzing single-cell chromatin accessibility sequencing (scCAS) data. It uses the information in existing bulk chromatin accessibility data and annotated scCAS data while retaining the advantages of topic models for single-cell data analysis. RefTM simultaneously models: 1) the biological variation shared between the reference data and the target scCAS data; 2) the biological variation unique to the scCAS data; 3) variation from known covariates in the scCAS data, which enables batch effect correction. RefTM can also be applied to more general integrative data analysis by choosing appropriate reference data and single-cell data.

Installation

You can install the released version of the RefTM package from GitHub:

devtools::install_github("cuhklinlab/RefTM")

Data format

Two count matrices are required as input: sc_data and ref_data.

sc_data: Peak-by-Cell count matrix

ref_data: Cell-by-Peak count matrix

The two matrices must contain the same peaks. Besides peaks, other features such as motifs and genomic bins are also acceptable as input.
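Because both matrices must cover the same peaks, it can help to check this before modeling. The following is a minimal sketch in base R, assuming peak names are stored as the row names of sc_data and the column names of ref_data. `align_peaks` is a hypothetical helper, not part of RefTM, and the toy matrices only stand in for real data.

```r
# Hypothetical helper (not part of RefTM): restrict both matrices to their
# shared peaks and put those peaks in the same order.
# Assumes peaks are row names of sc_data and column names of ref_data.
align_peaks <- function(sc_data, ref_data) {
  shared <- intersect(rownames(sc_data), colnames(ref_data))
  if (length(shared) == 0) stop("sc_data and ref_data share no peaks")
  list(
    sc_data  = sc_data[shared, , drop = FALSE],
    ref_data = ref_data[, shared, drop = FALSE]
  )
}

# Toy data: 4 peaks x 3 cells, and 2 reference samples x 4 peaks (3 shared).
toy_sc <- matrix(1:12, nrow = 4,
                 dimnames = list(paste0("peak", 1:4), paste0("cell", 1:3)))
toy_ref <- matrix(1:8, nrow = 2,
                  dimnames = list(c("refA", "refB"),
                                  paste0("peak", c(3, 1, 2, 5))))

aligned <- align_peaks(toy_sc, toy_ref)

# After alignment, the peaks agree in both content and order.
stopifnot(identical(rownames(aligned$sc_data), colnames(aligned$ref_data)))
```

Peaks present in only one matrix are dropped here; whether dropping or failing is the better choice depends on how the reference was built.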

A. RefTM-LDA workflow on scATAC data guided by aggregated single-cell data.

First, load package RefTM:

library(RefTM)

Input data

Reference data: pseudo-bulk forebrain_ref_data. MG and OC cells are left out when constructing the pseudo-bulk reference data, in order to investigate the influence of incomplete reference data.

scCAS data: forebrain_sc_data

sc_data <- forebrain_sc_data
ref_data <- forebrain_ref_data
cell_label <- forebrain_label_mat

Modeling

set.seed(2022)
result <- RefTM(sc_data, ref_data)

Visualization

Visualization of latent topics obtained by LDA (no reference)

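library(topicmodels)  # provides LDA(); expects a document-by-term (here cell-by-peak) matrix
result_LDA <- LDA(t(sc_data), k = 10)
# The gamma slot holds the per-cell topic proportions
RetTM_tsne(result_LDA@gamma, cell_label)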

Visualization of shared latent topics between reference data and scCAS data

k1 <- 5  # number of shared topics; defined here because it is reused for column indexing
theta <- RefTM_postprocess(result, k1 = k1)
RetTM_tsne(theta[, 1:k1], cell_label)

Visualization of unique latent topics in scCAS data

RetTM_tsne(theta[, -c(1:k1)], cell_label)
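The column indexing used in the two calls above can be illustrated on a toy matrix. This sketch assumes, as those calls imply, that theta is a cell-by-topic matrix whose first k1 columns hold the shared topics. The values are random placeholders, not RefTM output.

```r
set.seed(1)
k1 <- 5                                        # number of shared topics, as in k1 = 5
toy_theta <- matrix(runif(4 * 8), nrow = 4)    # 4 cells x 8 topics, placeholder values

# Columns 1..k1: topics shared with the reference data.
shared_topics <- toy_theta[, 1:k1, drop = FALSE]
# Remaining columns: topics unique to the scCAS data.
unique_topics <- toy_theta[, -(1:k1), drop = FALSE]

dim(shared_topics)
dim(unique_topics)
```

Using `drop = FALSE` keeps the result a matrix even if only one column remains.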

Visualization of RefTM

RetTM_tsne(theta, cell_label)

B. RefTM-STM workflow on scATAC data guided by bulk data, with cell-specific covariates included.

First, load package RefTM:

library(RefTM)

Input data

Reference data: pseudo-bulk CLPLMPPMPP_ref_data

scCAS data: CLPLMPPMPP_sc_data

Covariate: CLPLMPPMPP_donor_label

sc_data <- CLPLMPPMPP_sc_data
ref_data <- CLPLMPPMPP_ref_data
donor_label <- CLPLMPPMPP_donor_label
cell_label <- CLPLMPPMPP_label_mat

Modeling

set.seed(2022)
result <- RefTM(sc_data, ref_data, workflow = "STM", covariate = as.factor(donor_label))
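Before passing a covariate, it may be worth checking that it has one entry per cell. The following is a minimal sketch on toy data. It assumes the covariate is a vector ordered like the columns of sc_data, which the source does not state explicitly. The names `toy_sc` and `toy_donor` are illustrative only.

```r
# Toy stand-ins: 3 peaks x 6 cells, with one donor label per cell.
toy_sc <- matrix(0L, nrow = 3, ncol = 6)
toy_donor <- c("d1", "d1", "d2", "d2", "d3", "d3")

# Convert to a factor, as in the RefTM call above.
covariate <- as.factor(toy_donor)

# Assumed layout: one covariate value per cell (column of the peak-by-cell matrix).
stopifnot(length(covariate) == ncol(toy_sc))

# Number of cells per donor; very small groups may deserve a second look.
table(covariate)
```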

Visualization

Visualization of RefTM without batch effect correction

theta = RefTM_postprocess(result, k1 = 5, erase.BF = FALSE)
RetTM_tsne(theta, cell_label)

Visualization of RefTM with batch effect correction

theta = RefTM_postprocess(result, k1 = 5)
RetTM_tsne(theta, cell_label, donor_label)

Cell clustering

Seurat_louvain <- RA3::RA3_clustering(t(theta), length(unique(cell_label)))