CytoSignal detects cell-cell signaling from spatial transcriptomics data at single-cell resolution. CytoSignal performs a nonparametric statistical test to identify which cells within a tissue have significant activity for a particular signaling interaction. CytoSignal considers multi-component interactions and separately models interactions mediated by diffusion vs. contact-dependent molecules.
In this tutorial, we demonstrate how to preform CytoSignal on a mouse embryonic E14 brain data captured by Slide-tags, a spatial transcriptomic protocol at single-cell resolution. This data was originally produced in the work of Andrew J. C. Russell et al., 2023, and is now publicly available on Single-Cell Portal via SCP2170. For convenience, we have pre-processed the data so it can be easily loaded into user R environment, which is available at figshare.
Download the files listed above from the provided link into a desired local directory.
Next, load the downloaded data into the R environment using the
following codes. We assume that users have the files placed at the
current working directory (as can be shown by getwd()
).
Alternatively, users can also specify the path to the files.
## The RDS file will be loaded into a ready-to-use object
dge <- readRDS("SCP2170_annotated_dgCMatrix.rds")
## The cluster annotation need to be presented as a factor object
cluster <- read.csv("SCP2170_cluster.csv")
cluster <- factor(cluster$cell_type)
names(cluster) <- colnames(dge)
## The spatial coordinates need to be presented as a matrix object
spatial <- as.matrix(read.csv("SCP2170_spatial.csv", row.names = 1))
## Please make sure that the dimension names are lower case "x" and "y"
colnames(spatial) <- c("x", "y")
Then a CytoSignal object can be created with the following command.
Next, remove low-quality cells and genes, retain only those genes available in the interaction database, and convert gene names to Uniprot IDs.
The spatially resolved interaction scores of each interaction in each location (LRscore) is defined as the co-expression of ligand and receptor genes within close spatial proximity. The computation can be divided into three main steps: 1) defining spatial neighbors for each location; 2) calculating the amount of ligand (L) and receptor (R) each location can receive from their spatial neighbors; 3) calculating ligand-receptor co-expression within each spatial neighborhood; 4) performing spatial permutation test to infer significant interactions.
CytoSignal defines spatial neighborhoods for diffusion-dependent and contact-dependent interactions differently. We use the Epsilon ball approach for diffusible interactions and Delaunay Triangulation for contact-dependent interactions.
For diffusion-dependent interactions, for each location i, we define its spatial neighbors as all locations j within a circle centered on location i with a predefined radius r (200 µm by default). We next weight the amount of L that i receives based on the physical distance between i and j transformed by a Gaussian kernel. For determining the parameters of this kernel, we will need a scaling factor between the arbitrary units of the spatial coordinates and real unit (such as µm), which is based on prior knowledge of user dataset. For the specific example Slide-tags data, the conversion ratio is 0.73.
It’s recommended to review the inferred parameters for a sanity check.
## [1] 273.9726
## [1] 73.70953
For each location, we can then use findNN()
to find its
spatial neighborhood and then calculate the weights between each it and
its neighbors. The results for diffusible interactions are saved with
model name GauEps
and contact-dependent interactions are
saved with model name DT
.
The next step is to calculate the amount of ligand (L) and receptor (R) each location can receive from
their spatial neighbors using function imputeLR
.
For calculating the LRscore of each interaction within each spatial neighborhood, we multiply L and R within each location and apply an average within its DT neighborhood. Next, we perform a spatial permutation test, calculate the null distribution of the imputed ligands and receptors, and calculate a one-sided p-value. To control for multiple hypothesis testing and potential biases caused by cellular density differences, we further perform spatial false discovery rate correction.
The output of CytoSignal is a test statistic S and adjusted p-value for each
signaling interaction within each location in the tissue. Finally, cells
with significant signaling activity can be identified by setting a
significance level such as p.value = 0.05
. For convenience,
we rank interactions by either the number of significant spatial
positions or the spatial variability statistics of the LRscore
calculated by SPARK-X.
The function inferIntrScore
is an integrated function
that performs the LRscore calculation and permutation tests. For the
calculation of LRscore, by default, receptors are taken from the
normalized expression. Ligands of diffusible interactions are imputed
with Gaussian Epsilon ball (GauEps-Raw
), while those from
contact-dependent interactions are imputed with Delaunay Triangulation
(DT-Raw
). The function also provides an option
recep.smooth
to smooth the receptor expressions with
Delaunay Triangulation (GauEps-DT
and DT-DT
),
which is useful when the sparsity of the data is relatively high. The
function inferSignif()
should be called subsequently to
identify significant interactions. For ranking the interactions by
spatial variability statistics, we provide
rankIntrSpatialVar()
.
cs <- inferIntrScore(cs)
cs <- inferSignif(cs, p.value = 0.05, reads.thresh = 100, sig.thresh = 100)
cs <- rankIntrSpatialVar(cs)
reads.thresh
is the minimum number of reads for a
ligand-receptor interaction to be considered. sig.thresh
is
the minimum number of beads for a ligand-receptor interaction to be
considered. The three thresholds can be changed arbitrarily if the
number of the significant beads is too large or too small. The function
inferSignif()
by default is applied to all LRscore types
available. When any thresholds need to be tweaked for a specific LRscore
type, users can specify the slot.use
argument
explicitly.
Users can use function showIntr()
to view all
significant interactions, where the argument slot.use
is
for specifying the LRscore calculation model used. Possible options are
explained as followed:
"GauEps-Raw"
: Ligand expressions are imputed with
Gaussian Epsilon ball and receptor expressions are taken from normalized
expression. Typically for the diffusion-dependent interactions."DT-Raw"
: Ligand expressions are imputed with Delaunay
Triangulation and receptor values are taken from raw expression.
Typically for the contact-dependent interactions."Raw-Raw"
: Ligand and receptor expressions are all
taken from the raw expression. Typically for the contact-dependent
interactions that are expected to happen in the same spot that contains
multiple cells. This can be useful for Visium data (See description
below)."GauEps-DT"
: Ligand expressions are imputed with
Gaussian Epsilon ball and receptor values are imputed with Delaunay
Triangulation. This is for diffusion-dependent interactions, but exists
only when inferIntrScore()
is run with
recep.smooth = TRUE
."DT-DT"
: Ligand values and receptor values are all
imputed with Delaunay Triangulation. This is for contact-dependent
interactions, but exists only when inferIntrScore()
is run
with recep.smooth = TRUE
.The argument signif.use
can return interactions that are
ranked by different metrics:
"result"
: have p-value less that specified
threshold."result.hq"
: have significant p-value and passes the
quality control on the minimum number of reads and beads as mentioned
above."result.spx"
: are spatially variable and are of high
quality according to that in "result.hq"
.Setting return.name = TRUE
displays both interaction
unique IDs and the interaction names for easier interpretation.
Interaction names are shown as a “ligand-receptor” gene symbol.
allIntrs <- showIntr(cs, slot.use = "GauEps-Raw", signif.use = "result.spx", return.name = TRUE)
print(head(allIntrs))
## CPI-SS0659DBE72 CPI-SS04124F4E1 CPI-SS0008137B6 CPI-SS00F4DDF4B CPI-SS0E063192D
## "EFNA4-EPHA4" "EPO-EFNB2" "EFNA4-EPHA5" "PTN-PTPRS" "NRG2-ERBB4"
## CPI-SS0C694CB44
## "VEGFA-NMDE2"
When working with spatial technology where each spot contains
multiple cells (e.g. 10X Visium), CytoSignal is able to quantify
signaling at spot level without requiring deconvolution. For
diffusion-dependent interactions, if the spot size is smaller than the
preset diffusion radius, we can still infer signaling activities for
spatial neighbors within a given radius. The result of such inference
can be retrieved by setting slot.use = "GauEps-Raw"
. For
contact-dependent interactions, because cells will only touch their
directly connected neighbor cells within the same spot, we only need to
consider the gene expression within each spot without applying spatial
smoothing across spots. This step is already performed along with
inferIntrScore()
, and the result significant interactions
should be accessed with slot.use = "Raw-Raw"
.
CytoSignal makes cellular-resolution, spatially-resolved signaling inference. We developed several new visualizations for plotting each inferred signaling interaction.
Most of the functions take two arguments, intr
and
slot.use
, for specifying individual interactions. Users can
first show a list of significant interactions that is available for
visualization with showIntr()
. intr
should
then be the unique ID of available interaction(s) and
slot.use
should be a selection as explained above.
We developed a 3D edge plot for visualizing the signal-sending and signal-receiving cells of an interaction at single-cell resolution. The plot comes with two layers of scatter plot of cells placed at the top and bottom of a 3D box space, for labeling the receivers and senders, respectively. In each layer, we use low transparency to highlight the cells that is sending or receiving the signal of the specified interaction. Cells are colored by cluster at the mean time. Finally, we bring lines that connect the senders with their corresponding tentative receivers, which are the edges.
To plot the cluster annotation legend alone, users can simply use
plotCluster()
for reference.
We provide a function named plotSignif2
for making a
general combination figure. This function plots on a per-interaction
basis, and for each interaction it shows 1) the imputed gene expression
of the ligand and receptor; 2) the original raw expression values of the
ligand and receptor; 3) the inferred LRscore; 4) cluster annotations for
each location; 5) 3D edge plot for visualizing the signal-sending and
signal-receiving cells. General graphical setting can be found in the
documentation of the function (?plotSignif2
).
plotSignif2(cs, intr = intr.use, slot.use = "GauEps-Raw", return.plot = TRUE, edge = TRUE, pt.size = 0.2)
Recommended: When plotting a large number of
interactions, it’s recommended to use numeric index with
intr
and turn to return.plot = FALSE
(by
default). In this way, plots of all input interactions at high
resolution will be stored on the device instead of on the screen.
The codes below show the top 5 significant interactions ranked by SPARK-X for both diffusion-dependent and contact-dependent interactions.
# For diffusion dependent interactions
signif.use <- "result.spx"
lrscore.slot <- "GauEps-Raw"
plot_dir <- paste0("path_to_result/diff-dep_", signif.use, "_", lrscore.slot, "/")
plotSignif2(cs, intr = 1:5, slot.use = lrscore.slot, signif.use = signif.use, plot_dir = plot_dir)
# For contact dependent interactions
## Choose a level of significance metric
lrscore.slot <- "DT-Raw"
plot_dir <- paste0("path_to_result/cont-dep_", signif.use, "_", lrscore.slot, "/")
plotSignif2(cs, intr = 1:5, slot.use = lrscore.slot, signif.use = signif.use, plot_dir = plot_dir)
Since CytoSignal predicts the locations within a tissue where a protein-protein interaction occurs, it enables new types of analyses not possible with existing computational approaches.
For instance, we provide a function inferIntrDEG()
for
detecting signaling-associated differentially expressed genes. It first
identifies genes that are differentially expressed in the cells with
significant signaling activities for each interaction. Next, it performs
a sparse regression analysis based on a glmnet model
using the output statistic S
from CytoSignal as the response variable.
The returned object contains a list of results such as model
parameters and significant genes for each interaction. Users can check
the genes that are significant for a specific using
intrDEG[[intr.use]]$sign_genes
.
intrDEG <- inferIntrDEG(cs, intr = intr.use, slot.use = "GauEps-Raw", signif.use = "result.spx")
intrDEG[[1]]$sign_genes
## [1] "Gm29260" "Epha4" "Zeb2" "Neb"
## [5] "Mpped2" "Meis2" "Efna4" "Mob3b"
## [9] "E130006D01Rik" "Mki67" "Gm45680" "B020031H02Rik"
## [13] "Sox1ot" "Hmgb2" "Nfix" "Tox3"
## [17] "Cdon" "Gm30373" "Nrxn3" "Gli3"
## [21] "Gas1" "Dleu2" "Gm27017" "Celsr1"
## [25] "Zbtb20" "Tcf4" "Gldc" "Emx2os"
## [29] "Sytl5"
The signaling-associated genes can subsequently be used to identify GO terms or transcription factors associated with the signaling interaction. This analysis can be done with any GO tools of preference. We recommend using GORILLA which we also utilized in our manuscript. However, we also provide an alternative of using gprofiler2.
# Need to set `evcodes = TRUE` to get list genes that hits each GO term
goRes <- gprofiler2::gost(intrDEG[[1]]$sign_genes,
organism = "mmusculus", evcodes = TRUE)
To present these results more clearly, we provide one more function
heatmap_GO()
for plotting the regression weights of the
selected genes as a heatmap along with their enriched GO terms. The
heatmap will be colored by the coefficient value that is calculated from
the regression analysis, while white-colored grids indicate the
corresponding gene is not enriched in the corresponding GO term.
If the GO analysis was performed using other tools, we require the
result to be in the form of data.frame
and has the
following three columns:
description.col
and the content will be shown as the row
labels of the heatmap.pval.col
."gene1, gene2, gene3"
so we can identify the hit. The
name of this column need to be set to argument gene.col
.
This information could have been formatted variously by different tools,
users might need to further provide a splitting function that splits the
string into a character vector of the gene names to argument
gene.split.fun
.heatmap_GO(intrDEG, goRes$result, intr = intr.use,
description.col = "term_name", pval.col = "p_value",
gene.col = "intersection")
In our manuscript, we used a web-based tool REVIGO to visualize the GO terms. Users
can simply copy the term IDs from the result into the web form and get
an interactive 2D scatter plot in the browser. For command-line users
(e.g. on HPCs), we provided a function revigo()
to send
query and fetch the result from their server. This allows us to generate
a similar scatter plot without using an internet browser.
In the output 2D scatter plot, each dot represents a GO term and is
surrounded by other similar or related GO terms. The color scale
represents the log transformed p-value in this example, which
corresponds to value
in the query. The size of dots
indicates the frequency of the GO term in the underlying GOA
database.