Infer spatially resolved cell-cell communication signaling at cellular resolution

Introduction

CytoSignal detects cell-cell signaling from spatial transcriptomics data at single-cell resolution. CytoSignal performs a nonparametric statistical test to identify which cells within a tissue have significant activity for a particular signaling interaction. CytoSignal considers multi-component interactions and separately models interactions mediated by diffusion vs. contact-dependent molecules.

Example data

In this tutorial, we demonstrate how to preform CytoSignal on a mouse embryonic E14 brain data captured by Slide-tags, a spatial transcriptomic protocol at single-cell resolution. This data was originally produced in the work of Andrew J. C. Russell et al., 2023, and is now publicly available on Single-Cell Portal via SCP2170. For convenience, we have pre-processed the data so it can be easily loaded into user R environment, which is available at figshare.

Raw gene expression count matrix: SCP2170_annotated_dgCMatrix.rds, 26,944 genes by 4623 nuclei)
Corresponding cluster annotation: SCP2170_cluster.csv
Spatial location: SCP2170_spatial.csv

Download the files listed above from the provided link into a desired local directory.

Create CytoSignal object

Loading example data

Next, load the downloaded data into the R environment using the following codes. We assume that users have the files placed at the current working directory (as can be shown by getwd()). Alternatively, users can also specify the path to the files.

## The RDS file will be loaded into a ready-to-use object
dge <- readRDS("SCP2170_annotated_dgCMatrix.rds")

## The cluster annotation need to be presented as a factor object
cluster <- read.csv("SCP2170_cluster.csv")
cluster <- factor(cluster$cell_type)
names(cluster) <- colnames(dge)

## The spatial coordinates need to be presented as a matrix object
spatial <- as.matrix(read.csv("SCP2170_spatial.csv", row.names = 1))
## Please make sure that the dimension names are lower case "x" and "y"
colnames(spatial) <- c("x", "y")

Then a CytoSignal object can be created with the following command.

library(cytosignal)

cs <- createCytoSignal(raw.data = dge, cells.loc = spatial, clusters = cluster)

Adding Ligand-receptor Database

CytoSignal has a built-in ligand-receptor interaction database resorted from CellphoneDB, which can be simply loaded into the object running the following command.

cs <- addIntrDB(cs, g_to_u, db.diff, db.cont, inter.index)

Preprocessing

Next, remove low-quality cells and genes, retain only those genes available in the interaction database, and convert gene names to Uniprot IDs.

cs <- removeLowQuality(cs, counts.thresh = 300)
cs <- changeUniprot(cs)

Inferring spatilly resolved significant Ligand-receptor interation activities

The spatially resolved interaction scores of each interaction in each location (LRscore) is defined as the co-expression of ligand and receptor genes within close spatial proximity. The computation can be divided into three main steps: 1) defining spatial neighbors for each location; 2) calculating the amount of ligand (L) and receptor (R) each location can receive from their spatial neighbors; 3) calculating ligand-receptor co-expression within each spatial neighborhood; 4) performing spatial permutation test to infer significant interactions.

Defining spatial neighborhoods

CytoSignal defines spatial neighborhoods for diffusion-dependent and contact-dependent interactions differently. We use the Epsilon ball approach for diffusible interactions and Delaunay Triangulation for contact-dependent interactions.

For diffusion-dependent interactions, for each location i, we define its spatial neighbors as all locations j within a circle centered on location i with a predefined radius r (200 µm by default). We next weight the amount of L that i receives based on the physical distance between i and j transformed by a Gaussian kernel. For determining the parameters of this kernel, we will need a scaling factor between the arbitrary units of the spatial coordinates and real unit (such as µm), which is based on prior knowledge of user dataset. For the specific example Slide-tags data, the conversion ratio is 0.73.

cs <- inferEpsParams(cs, scale.factor = 0.73)

It’s recommended to review the inferred parameters for a sanity check.

cs

## An object of class CytoSignal
## Number of cells: 4623 
## Number of genes: 15343 
## Number of clusters: 21 
## Nearest neighbor imputation:  
##   (default: )
## LR scores:  
##   (default: )
## Parameters:
##   lib.size: numeric of length 4623
##   r.diffuse.scale: 273.97
##   sigma.scale: 73.71

For each location, we can then use findNN() to find its spatial neighborhood and then calculate the weights between each it and its neighbors. The results for diffusible interactions are saved with model name GauEps and contact-dependent interactions are saved with model name DT.

cs <- findNN(cs)

Inferring ligand and receptor expression in neighborhood

The next step is to calculate the amount of ligand (L) and receptor (R) each location can receive from their spatial neighbors using function imputeLR.

cs <- imputeLR(cs)

Calculating ligand-receptor co-expression and identifying spatially significant interactions

For calculating the LRscore of each interaction within each spatial neighborhood, we multiply L and R within each location and apply an average within its DT neighborhood. Next, we perform a spatial permutation test, calculate the null distribution of the imputed ligands and receptors, and calculate a one-sided p-value. To control for multiple hypothesis testing and potential biases caused by cellular density differences, we further perform spatial false discovery rate correction.

The output of CytoSignal is a test statistic S and adjusted p-value for each signaling interaction within each location in the tissue. Finally, cells with significant signaling activity can be identified by setting a significance level such as p.value = 0.05. For convenience, we rank interactions by either the number of significant spatial positions or the spatial variability statistics of the LRscore calculated by SPARK-X.

The function inferIntrScore is an integrated function that performs the LRscore calculation and permutation tests. For the calculation of LRscore, by default, receptors are taken from the normalized expression. Ligands of diffusible interactions are imputed with Gaussian Epsilon ball (GauEps-Raw), while those from contact-dependent interactions are imputed with Delaunay Triangulation (DT-Raw). The function also provides an option recep.smooth to smooth the receptor expressions with Delaunay Triangulation (GauEps-DT and DT-DT), which is useful when the sparsity of the data is relatively high. The function inferSignif() should be called subsequently to identify significant interactions. For ranking the interactions by spatial variability statistics, we provide rankIntrSpatialVar().

Note that we perform a permutation test to infer the significance of the LR-score, where randomized sampling procedure is applied to the spatial locations. We set the seed at the beginning to ensure reproducible result in the tutorial.

set.seed(42) # Optional
cs <- inferIntrScore(cs)
cs <- inferSignif(cs, p.thresh = 0.05, reads.thresh = 100, sig.thresh = 100)
cs <- rankIntrSpatialVar(cs)

reads.thresh is the minimum number of reads for a ligand-receptor interaction to be considered. sig.thresh is the minimum number of beads for a ligand-receptor interaction to be considered. The three thresholds can be changed arbitrarily if the number of the significant beads is too large or too small. The function inferSignif() by default is applied to all LRscore types available. When any thresholds need to be tweaked for a specific LRscore type, users can specify the slot.use argument explicitly.

Users can use function showIntr() to view all significant interactions, where the argument slot.use is for specifying the LRscore calculation model used. Possible options are explained as followed:

"GauEps-Raw": Ligand expressions are imputed with Gaussian Epsilon ball and receptor expressions are taken from normalized expression. Typically for the diffusion-dependent interactions.
"DT-Raw": Ligand expressions are imputed with Delaunay Triangulation and receptor values are taken from raw expression. Typically for the contact-dependent interactions.
"Raw-Raw": Ligand and receptor expressions are all taken from the raw expression. Typically for the contact-dependent interactions that are expected to happen in the same spot that contains multiple cells. This can be useful for Visium data (See description below).
"GauEps-DT": Ligand expressions are imputed with Gaussian Epsilon ball and receptor values are imputed with Delaunay Triangulation. This is for diffusion-dependent interactions, but exists only when inferIntrScore() is run with recep.smooth = TRUE.
"DT-DT": Ligand values and receptor values are all imputed with Delaunay Triangulation. This is for contact-dependent interactions, but exists only when inferIntrScore() is run with recep.smooth = TRUE.

The argument signif.use can return interactions that are ranked by different metrics:

"result": have p-value less that specified threshold.
"result.hq": have significant p-value and passes the quality control on the minimum number of reads and beads as mentioned above.
"result.spx": are spatially variable and are of high quality according to that in "result.hq".

Setting return.name = TRUE displays both interaction unique IDs and the interaction names for easier interpretation. Interaction names are shown as a “ligand-receptor” gene symbol.

allIntrs <- showIntr(cs, slot.use = "diffusion-Raw", signif.use = "result.spx", return.name = TRUE)
print(head(allIntrs))

## CPI-SS0659DBE72 CPI-SS04124F4E1 CPI-SS0008137B6 CPI-SS00F4DDF4B CPI-SS0E063192D 
##   "EFNA4-EPHA4"     "EPO-EFNB2"   "EFNA4-EPHA5"     "PTN-PTPRS"    "NRG2-ERBB4" 
## CPI-SS0C694CB44 
##   "VEGFA-NMDE2"

When working with spatial technology where each spot contains multiple cells (e.g. 10X Visium), CytoSignal is able to quantify signaling at spot level without requiring deconvolution. For diffusion-dependent interactions, if the spot size is smaller than the preset diffusion radius, we can still infer signaling activities for spatial neighbors within a given radius. The result of such inference can be retrieved by setting slot.use = "diffusion-Raw". For contact-dependent interactions, because cells will only touch their directly connected neighbor cells within the same spot, we only need to consider the gene expression within each spot without applying spatial smoothing across spots. This step is already performed along with inferIntrScore(), and the result significant interactions should be accessed with slot.use = "Raw-Raw".

Visualization

CytoSignal makes cellular-resolution, spatially-resolved signaling inference. We developed several new visualizations for plotting each inferred signaling interaction.

Most of the functions take two arguments, intr and slot.use, for specifying individual interactions. Users can first show a list of significant interactions that is available for visualization with showIntr(). intr should then be the unique ID of available interaction(s) and slot.use should be a selection as explained above.

Visualizing significant cell-cell communication in 3D

We developed a 3D edge plot for visualizing the signal-sending and signal-receiving cells of an interaction at single-cell resolution. The plot comes with two layers of scatter plot of cells placed at the top and bottom of a 3D box space, for labeling the receivers and senders, respectively. In each layer, we use low transparency to highlight the cells that is sending or receiving the signal of the specified interaction. Cells are colored by cluster at the mean time. Finally, we bring lines that connect the senders with their corresponding tentative receivers, which are the edges.

intr.use <- names(allIntrs)[1]
plotEdge(cs, intr.use, slot.use = "diffusion-Raw", pt.size = 0.15)

To plot the cluster annotation legend alone, users can simply use plotCluster() for reference.

plotCluster(cs)

Combination plots of 3D edge plot, gene expression, LRscore and cluster annotation

We provide a function named plotSignif2 for making a general combination figure. This function plots on a per-interaction basis, and for each interaction it shows 1) the imputed gene expression of the ligand and receptor; 2) the original raw expression values of the ligand and receptor; 3) the inferred LRscore; 4) cluster annotations for each location; 5) 3D edge plot for visualizing the signal-sending and signal-receiving cells. General graphical setting can be found in the documentation of the function (?plotSignif2).

plotSignif2(cs, intr = intr.use, slot.use = "diffusion-Raw", return.plot = TRUE, edge = TRUE, pt.size = 0.2)

Recommended: When plotting a large number of interactions, it’s recommended to use numeric index with intr and turn to return.plot = FALSE (by default). In this way, plots of all input interactions at high resolution will be stored on the device instead of on the screen.

The codes below show the top 5 significant interactions ranked by SPARK-X for both diffusion-dependent and contact-dependent interactions.

# For diffusion dependent interactions
signif.use <- "result.spx"
lrscore.slot <- "diffusion-Raw"
plot_dir <- paste0("path_to_result/diff-dep_", signif.use, "_", lrscore.slot, "/")
plotSignif2(cs, intr = 1:5, slot.use = lrscore.slot, signif.use = signif.use, plot_dir = plot_dir)

# For contact dependent interactions
## Choose a level of significance metric
lrscore.slot <- "contact-Raw"
plot_dir <- paste0("path_to_result/cont-dep_", signif.use, "_", lrscore.slot, "/")
plotSignif2(cs, intr = 1:5, slot.use = lrscore.slot, signif.use = signif.use, plot_dir = plot_dir)

Detection of signaling-associated differentially expressed genes

Since CytoSignal predicts the locations within a tissue where a protein-protein interaction occurs, it enables new types of analyses not possible with existing computational approaches.

For instance, we provide a function inferIntrDEG() for detecting signaling-associated differentially expressed genes. It first identifies genes that are differentially expressed in the cells with significant signaling activities for each interaction. Next, it performs a sparse regression analysis based on a glmnet model using the output statistic S from CytoSignal as the response variable.

The returned object contains a list of results such as model parameters and significant genes for each interaction. Users can check the genes that are significant for a specific using intrDEG[[intr.use]]$sign_genes.

set.seed(233)
intrDEG <- inferIntrDEG(cs, intr = intr.use, slot.use = "diffusion-Raw", signif.use = "result.spx")
intrDEG[[1]]$sign_genes

##  [1] "Gm29260"       "Epha4"         "Zeb2"          "Neb"          
##  [5] "Mpped2"        "Meis2"         "Efna4"         "Mob3b"        
##  [9] "E130006D01Rik" "Mki67"         "Gm45680"       "B020031H02Rik"
## [13] "Sox1ot"        "Hmgb2"         "Nfix"          "Tox3"         
## [17] "Cdon"          "Gm30373"       "Nrxn3"         "Gli3"         
## [21] "Gas1"          "Dleu2"         "Gm27017"       "Celsr1"       
## [25] "Zbtb20"        "Tcf4"          "Gldc"          "Emx2os"       
## [29] "Sytl5"

GO term enrichment analysis

The signaling-associated genes can subsequently be used to identify GO terms or transcription factors associated with the signaling interaction. This analysis can be done with any GO tools of preference. We recommend using GORILLA which we also utilized in our manuscript. However, we also provide an alternative of using gprofiler2.

# Need to set `evcodes = TRUE` to get list genes that hits each GO term
goRes <- gprofiler2::gost(intrDEG[[1]]$sign_genes,
                          organism = "mmusculus", evcodes = TRUE)

To present these results more clearly, we provide one more function heatmap_GO() for plotting the regression weights of the selected genes as a heatmap along with their enriched GO terms. The heatmap will be colored by the coefficient value that is calculated from the regression analysis, while white-colored grids indicate the corresponding gene is not enriched in the corresponding GO term.

If the GO analysis was performed using other tools, we require the result to be in the form of data.frame and has the following three columns:

A column for the GO term names, i.e. the short text description. The name of this column need to be set to argument description.col and the content will be shown as the row labels of the heatmap.
A column for the p-values of the GO terms, for ranking the terms. The name of this column need to be set to argument pval.col.
A column for the gene intersection evidence. Should have some thing like "gene1, gene2, gene3" so we can identify the hit. The name of this column need to be set to argument gene.col. This information could have been formatted variously by different tools, users might need to further provide a splitting function that splits the string into a character vector of the gene names to argument gene.split.fun.

heatmap_GO(intrDEG, goRes$result, intr = intr.use, 
           description.col = "term_name", pval.col = "p_value", 
           gene.col = "intersection")

In our manuscript, we used a web-based tool REVIGO to visualize the GO terms. Users can simply copy the term IDs from the result into the web form and get an interactive 2D scatter plot in the browser. For command-line users (e.g. on HPCs), we provided a function revigo() to send query and fetch the result from their server. This allows us to generate a similar scatter plot without using an internet browser.

In the output 2D scatter plot, each dot represents a GO term and is surrounded by other similar or related GO terms. The color scale represents the log transformed p-value in this example, which corresponds to value in the query. The size of dots indicates the frequency of the GO term in the underlying GOA database.

revigoRes <- revigo(GOterms = goRes$result$term_id, 
                    value = goRes$result$p_value, valueType = "PValue",
                    speciesTaxon = "10090")
plotREVIGO(revigoRes, labelSize = 3)