Spatial pattern mining

Previously we have published a paper that describes the hidden markov random field (HMRF) method for identifying spatial patterns from spatial gene expression data.

Unlike typical clustering techniques, HMRF integrates both spatial locations and the gene expression to find groups of cells that are both spatially adjacent and which are similar to each other in expression space.

The main steps are:

  1. Create neighborhood network based on spatial positions
  2. Define genes of interest
  3. Call HMRF routine to find spatial clusters


1. Spatial network

To define spatial relationship between cells, we first create a spatial network using createSpatialNetwork function.

createSpatialNetwork function
createSpatialNetwork <- function(gobject,name = NULL, dimensions = "all", method = c('Delaunay', 'kNN'), delaunay_method = c("deldir", "delaunayn_geometry", "RTriangle"), maximum_distance_delaunay = "auto", options = "Pp", Y = TRUE,j = TRUE,S = 0,minimum_k = 0,knn_method = "dbscan", k = 4,maximum_distance_knn = NULL, verbose = F, return_gobject = TRUE, ...)

Users can choose between two methods for the spatial network (Voronoi diagram by Delaunay triangulation or KNN network).

  • KNN network (method="kNN") is simpler in concept and is faster to compute. The option k needs to be defined to specify the number of neighbors. When maximum_distance_knn is defined, no two cells are connected in the spatial network if they are further than maximum_distance_knn apart regardless of k. The minimum_k specifies how many minimum number of neighbors each cell will have in order to ensure that there are no singletons in the resulting network. For more details, see createSpatialKNNnetwork.

    • Note for 3D datasets, dimensions="all" will use all three dimensions to calculate the spatial distance. However it is sometimes desirable to use only x and y dimensions, but not z dimension. Pass dimensions=c("sdimx", "sdimy").
    • Note to create a network based on maximum distance only, one can set k to be a very high value (e.g. k=100)
    • Guidance for HMRF: we recommend approximately 5-10 neighbors per cell.
  • Voronoi network (method="Delaunay"). Voronoi diagrams are used to model a number of different biological structures, including cells and bone microarchitecture. It can be used to understand physical constraints in a tissue organization.

    The default implementation of Delaunay triangulation is "deldir". Besides the typical option for Delaunay, there are also a couple of other options to further filter the network. The maximum_distance_delaunay will cut-off the network based on a maximum distance. The minimum_k will ensure that there is a minimum number of neighbors to each cell. Y, j, and S parameters refer to the RTriangle settings (safely leave as default). The dimension option is also available for Voronoi. For details, see createSpatialDelaunayNetwork().

2. Define genes of interest

Genes of interest can be a list of spatial genes returned by a spatial gene detection algorithm (see tutorial).

3. Call HMRF routine

doHMRF function
doHMRF <- function(gobject, expression_values = c('normalized', 'scaled', 'custom'), spatial_network_name = 'Delaunay_network', spatial_genes = NULL, spatial_dimensions = c('sdimx', 'sdimy', 'sdimz'), dim_reduction_to_use = NULL, dim_reduction_name = 'pca', dimensions_to_use = 1:10, name = 'test', k = 10, betas = c(0, 2, 50), tolerance = 1e-10,zscore = c('none','rowcol', 'colrow'), numinit = 100, python_path = NULL,output_folder = NULL,overwrite_output = TRUE)

This is the function that performs the HMRF routine.

  • expression_values: the version of gene expression matrix to use. "scaled" is appropriate for HMRF as it is a row and column wise z-scored version of expression matrix.
  • spatial_network_name: the version of spatial network to use. Either Delaunay_network or kNN_network.
  • dim_reduction_to_use: whether to use dimension reduced expression matrix. Set it to NULL if we wish to rather use the original gene expression matrix. (Defaults to NULL)
  • dim_reduction_name, dimensions_to_use: apply only if dim_reduction_to_use is set to "pca"
  • k: number of spatial clusters
  • betas: smoothness parameter with three values - (1) starting beta, (2) beta increment, (3) number of betas to run. E.g. c(0, 2, 50) means to run betas 50 times from beta=0, 2, 4, 6, 8, …, to 100.
  • tolerance: tolerance value
  • zscore: set to "none" if the gene expression is z-scored already. Otherwise, perform row and col-wise z-scoring ("rowcol"). "colrow" means that the order is first col then row-wise zscoring. Note the longer dimension should be zscored first.
  • numinit: number of initializations to try

A general guideline for choosing k is to use the gap-statistics by Tibshrani. The elbow point of the gap size vs. k plot usually indicates choices of k.

To decide which beta one should use, the current state of algorithm repeats HMRF for a number of betas. Here are some guidelines:

  • if the number of genes is from 10 to 50, the recommended range is 0 to 10 at increment of 0.5. E.g. c(0,0.5,20)
  • if the number of genes is below 50, the recommended range is 0 to 15 at increment of 1. E.g. c(0,1,15)
  • if the number of genes is between 50 to 100, the range is 0 to 50 at increment of 2. E.g. c(0, 2, 25)
  • if the number of genes is between 100 and 500, the range is 0 to 100 at increment of 5. E.g. c(0, 5, 20)

Within the range of beta, we recommend selecting the best beta by the Bayes information criterion. This requires first performing randomization of spatial positions to generate the null distribution of log-likelihood scores for randomly distributed cells for the same range of betas. Then find the beta where the difference between the observed and the null log-likelihood is maximized. See the HMRF tutorial from the Chapter.

Returns a list type object with the following fields:

  • name: name used
  • output_data: output path
  • k: k used,
  • betas: betas used,
  • python_path: python_path_used

After running HMRF, one can further view the HMRF spatial clusters (viewHMRFresults2D), and add HMRF annotation to pDataDT table (addHMRF).

#from seqfish+ dataset
#100 spatial genes
HMRF_spatial_genes = doHMRF(gobject = VC_test, expression_values = 'scaled', spatial_genes = my_spatial_genes, spatial_network_name = 'Delaunay_network', k = 9, betas = c(0,1,50), output_folder = paste0(hmrf_folder, '/', 'Spatial_genes/SG_top100_k9_scaled'))
## view results of HMRF
for(i in seq(0, 50, by = 1)) {
  viewHMRFresults2D(gobject = VC_test,HMRFoutput = HMRF_spatial_genes,k = 9, betas_to_view = i,point_size = 2)
## add HMRF of interest to giotto object
VC_test = addHMRF(gobject = VC_test,HMRFoutput = HMRF_spatial_genes, k = 9, betas_to_add = c(28), hmrf_name = 'HMRF_2')

4. Spatial coexpression modules

4.1 Based on HMRF domains
One can create spatial coexpression modules by computing differentially expressed genes between HMRF domains that are obtained in the previous step. Most statistical tests (t-test or Mann-Whitney test) will suffice. Then the differentially expressed genes per spatial domain can form one metagene for further analysis.

4.2 A separate approach
Alternative to HMRF, there is another way to detect spatial coexpression modules using a simple network-based smoothing approach. This approach consists of two functions mentioned below.

detectSpatialCorGenes function
The detectSpatialCorGenes function creates a smoothened gene expression matrix based on identified spatial genes (see step 1) and calculates pairwise gene-gene correlation scores.
detectSpatialCorGenes(gobject, method = c('grid', 'network'), expression_values = c('normalized', 'scaled', 'custom'), subset_genes = NULL, spatial_network_name = 'Delaunay_network', network_smoothing = NULL, spatial_grid_name = 'spatial_grid', min_cells_per_grid = 4, cor_method = c('pearson', 'kendall', 'spearman'))
  • gobject: the Giotto object
  • method: this is the method to use for spatial averaging or smoothening. This can be either based on a pre-computed spatial network or based on a pre-defined grid.
  • expression_values: this defines which expression values to use. Log normalized expression values are used by default.
  • subset_genes: this decides which genes to use. In order to obtain clear spatial co-expressio modules, we would recommend restricting this to identified spatial genes.
  • spatial_network_name: name of the spatial network to use if method is set to 'network'. Of note, multiple spatial networks can be created with createSpatialNetwork by giving them a unique name.
  • spatial_grid_name: name of the spatial grid to use if method is set to 'grid'. Of note, multiple spatial grids can be created with createSpatialGrid by giving them a unique name.
  • network_smoothing: this determines how much smoothening will occur. Setting b = 0 means no smoothing from neighboring cells and b = 1 means no contribution from its own expression. If this is set to NULL (default) then it will use b = 1 - 1/k, where k is the median number of k-neighbors in the selected spatial network
  • min_cells_per_grid: this is a cutoff to remove grids that do not have sufficient cells for meaningful averaging (defaults to 4 cells)
  • cor_method: correlation method to use, can be either Pearson, Kendall or Spearman.
This function will return a spatial correlation object, which stores all the necessary information and calculations and can be used in downstream convenience functions, such as showSpatialCorGenes().

clusterSpatialCorGenes function
clusterSpatialCorGenes uses the previously calculated result (the spatial correlation object) to create spatial co-expression modules, which are based on hierarchical clustering and a user defined cluster number (k).
clusterSpatialCorGenes(spatCorObject, name = "spat_clus", hclust_method = "ward.D", k = 10, return_obj = TRUE)
  • spatCorObject: this is the spatial correlation object obtained from detectSpatialCorGenes
  • name: name of the spatial clustering results. Different results (e.g. with different k) can be saved using a unique name.
  • hclust_method: determines which clustering method to use, defaults to 'ward.D'. See method argument for the base hclust function for alternative methods.
  • k: number of clusters to return
  • return_obj: return an updated spatial correlation object, otherwise a simple list with clustering results will be returned.
ext_spatial_genes = km_spatialgenes[1:500]$genes # obtained with a spatial gene detection method
# 1. calculate gene spatial correlation and single-cell correlation 
# create spatial correlation object
spat_cor_netw_DT = detectSpatialCorGenes(SS_seqfish, method = 'network', spatial_network_name = 'Delaunay_network', subset_genes = ext_spatial_genes)
spat_cor_netw_DT = clusterSpatialCorGenes(spat_cor_netw_DT, name = 'spat_netw_clus', k = 8)
heatmSpatialCorGenes(SS_seqfish, spatCorObject = spat_cor_netw_DT, use_clus_name = 'spat_netw_clus', heatmap_legend_param = list(title = NULL))
# Spatial co-expression modules can be ranked based on how strong correlation scores are within and outside the module using the function rankSpatialCorGroups().
netw_ranks = rankSpatialCorGroups(SS_seqfish, spatCorObject = spat_cor_netw_DT, use_clus_name = 'spat_netw_clus')
# show coexpression genes
top_netw_spat_cluster = showSpatialCorGenes(spat_cor_netw_DT, use_clus_name = 'spat_netw_clus', selected_clusters = 6, show_top_genes = 1)
# create metagenes
cluster_genes_DT = showSpatialCorGenes(spat_cor_netw_DT, use_clus_name = 'spat_netw_clus', show_top_genes = 1) # extract all the spatial co-expression module genes
cluster_genes = cluster_genes_DT$clus; names(cluster_genes) = cluster_genes_DT$gene_ID
SS_seqfish = createMetagenes(SS_seqfish, gene_clusters = cluster_genes, name = 'cluster_metagene')
spatCellPlot(SS_seqfish, spat_enr_names = 'cluster_metagene', cell_annotation_values = netw_ranks$clusters, point_size = 1.5, cow_n_col = 3)