# Guide to calculating spatial genes

Giotto supports multiple ways for searching for spatially variable genes. Currently we have incorporated SpatialDE, trendceek, Spark, as well as two methods that we have developed binSpect and silhouetteRank. The common goal is to score genes in the spatial transcriptomic dataset based on the extent to which individual genes' expression values form a spatially coherent pattern (or whether there is a dependence of expression on spatial locations). The methods achieve this goal through various algorithms and statistical tests.

### Methods

#### SpatialDE

The method uses Gaussian process regression to decompose expression variability into a spatial covariance term and nonspatial variance term. The spatial covariance term assumes a linear trend and periodic pattern of gene expression variation. Multiple different spatial covariance functions are tested including: (1) null model, (2) general Gaussian covariance (squared exponential), (3) linear covariance, and (4) periodic covariance functions. A suitable model is selected using Bayes information criterion.

#### Trendsceek

The method employs previous functions within R called spatstat. In basic, it computes four statistics mark-seggregation summary statistics, including the Stoyan's mark-correlation function, mean-mark function, variance-mark function and mark-variogram of marked point process. Mark seggregation computes the probability of finding two marks given the separation of two points. Considering the distribution of all pairs at a particular radius, a mark segregation is said to be present if this distribution is dependent on r, such that it deviates from what would be expected if the marks would be randomly distributed over the spatial locations of the points. For statistical testing, a null distribution of summary statistics is computed for every radius r, after permuting expression labels. The maximum deviation (from expected value) observed among n randomly distributed patterns is compared with the observed deviation (from expected value). A p-value is calculated via the rank.

#### SPARK

SPARK is a method perceived as an extension of SpatialDE. SPARK models count data directly and employs properly calibrated p-values. Well-calibrated p-values allows this method to find more spatially variable genes at a given FDR cut-off. This is sometimes an improvement over SpatialDE which may produce overly conservative p-values. SPARK models expression levels across spatial locations using generalized linear spatial model (GLSM). It allows modeling the distribution of expression values through an overdispersed Poisson distribution (for count data) or Gaussian distribution (for normalized data). Notably, to make sure that the algorithm can discover various spatial patterns, SPARK employs ten different spatial kernels, including five periodic kernels with different periodicity parameters and five Gaussian kernels with different smoothness parameters.

#### BinSpect

This method uses a graph based approach to compute the probability of encountering two physically neighboring cells being both expressed. It first computes a KNN neighborhood graph for a given k (defined by user) based on cells' physical locations. For each gene, it binarizes gene expression value across all the cells (1 for expressed, 0 for nonexpressed). An edge in the KNN graph connecting two cells is classified as 1-0, 1-1, 0-1, or 0-0 based on gene expression label. A contigency table tallying all the 1-1's, 1-0's, 0-1's, and 0-0's edges for the entire KNN network is created. A P-value is reported based on the hypergeometric distribution test.

#### SilhouetteRank

This method stands for silhouette coefficient on binarized spatial gene expression profiles. As the name suggests, gene expression is first binarized to 1's (expressed) and 1's (nonexpressed). It next calculates silhouette coefficient of the 1-marked cells on a locally weighted spatial distance matrix. The distance function is 1 -similarity, where the similarity function is a rank-based, exponentially-transformed score emphasizing more the closely located cells and penalizes far away cells' distance. For statistical testing, N randomly permuted patterns are generated by shuffling the cell locations while keeping the distribution of 1's and 0's the same as real data. The extreme right-tail of distribution of random patterns' silhouette scores is modeled using the GPD distribution.

### Usage

#### SpatialDE

##### spatialDE function

```
spatialDE <- function(gobject = NULL, expression_values = c('raw', 'normalized', 'scaled', 'custom'), size = c(4,2,1), color = c("blue", "green", "red"), sig_alpha = 0.5, unsig_alpha = 0.5, python_path = NULL, show_plot = NA, return_plot = NA, save_plot = NA, save_param = list(), default_save_name = 'SpatialDE')
```

The input is a gene expression matrix. There are 4 version of expression matrix (indicated by `expression_values`

). Raw version (in counts) is recommended. SpatialDE performs library size normalization (by default) if raw expression is used. Otherwise, one can also use “normalized” and skip SpatialDE normalization step.

There are no other parameters required. The parameters `color`

, `sig_alpha`

, `unsig_alpha`

are used for plotting the Fraction spatial variance vs Adj. P-value https://github.com/Teichlab/SpatialDE, and is optional. To disable this FSV vs. Adj P-value plot, `show_plot`

is set to NA (default). The parameters `return_plot`

, `save_plot`

, `save_param`

are for saving the results automatically to disk (default values are NA). They are attached to every function (see `CreateGiottoInstructions()`

).

#### Outputs:

A data frame with the results. There are 3 fields reported per gene: LLR, pval, qval. LLR is log-likelihood of model, useful for creating a whole ranking of genes unambiguously. P-val, Q-val are useful for cut-off based approach to filtering the spatial genes.

#### Trendsceek

##### trendSceek function

```
trendSceek <- function(gobject, expression_values = c("normalized", "raw"), subset_genes = NULL, nrand = 100, ncores = 8, ...)
```

The input is a gene expression matrix. The “normalized” version of the matrix is recommended - this is library size normalized in normalized counts. Trendceek requires performing library size normalization externally, hence “normalized” version is required. The `subset_genes`

is used for scoring only a set of genes in the expression matrix.

The `nrand`

is the number of random resamplings of the mark distribution as to create the null distribution. The `ncores`

is the number of cores.

##### Outputs:

A list containing trendceek-statistics for every gene.

##### References:

- https://github.com/edsgard/trendsceek/blob/master/inst/doc/refman.pdf
- https://github.com/edsgard/trendsceek

#### SPARK

##### spark function

```
spark = function(gobject, percentage = 0.1, min_count = 10, expression_values = 'raw', num_core = 5, covariates = NULL, return_object = 'data.table', fit_model=c("poisson", "gaussian"), ...)
```

The input is a gene expression matrix. The SPARK algorithm requires as input a raw unfiltered expression matrix (in counts). The matrix should be unfiltered as SPARK performs its own gene selection, cell selection, and minimal quality controls to remove unwanted genes and cells. The `percentage`

and `min_count`

are SPARK's gene and cell filtering options. The `covariates`

specifies any additional covariate to account for prior to the algorithm. By default, SPARK already considers the total count per cell as covariate.

`min_counts`

: The minimum counts for each cell for filtering`percentage`

: The percentage of cells that are expressed for each gene for filtering (range from 0 to 1 for 0% to 100%)`fit_model`

: Whether to use poisson or gaussian model. Poisson is recommended for count data. Gaussian is recommended for normalized data.

##### Outputs

A data table containing gene name, combined P-value, and adjusted P-value. Combined P-value contains the P-value after applying Cauchy combination rules to combine the results of 10 spatial kernels. Adj P-value contains the adjusted p-value for multiple hypothesis testing (coming from all the genes).

##### References

#### SilhouetteRank

##### silhouetteRank function

```
silhouetteRank = function(gobject, expression_values = c('normalized', 'scaled', 'custom'), subset_genes = NULL, metric='euclidean', rbp_p = 0.95, examine_top = 0.300)
```

The input is a gene expression matrix that is normalized by library size (specified by `expression_values="normalized"`

).
SilhouetteRank uses two key settings: `rbp_p`

and `examine_top`

. The `rbp_p`

is the local neighborhood distance weighting constant (0.95 - 0.995). The smaller the `rbp_p`

, the more focus to put on local distances. The distances are exponentially decayed according to `rbp_p`

. The option `examine_top`

is the proportion of cells that participate in the spatial pattern (0 - 1.0). Small `examine_top`

(0.01) tests for streak and very focal pattern. Large `examine_top`

(0.3) finds large spatial pattern.

If `example_top`

is set to a high proportion such that `(example_top * num_cells) > (num_expressed_cells)`

of a given gene, then a number of cells equal to the difference is randomly selected among the non-expressed groups to become the expressed population. Then recalculate the silhouette score. This is done so as to make all genes comparable regardless of any difference in terms of the number of detected cells.

Note: the function can be repeated on a spatially permuted expression matrix to generate an empirical P-value for each silhouette score.

##### Outputs

A data frame containing gene name, the silhouette score.

##### References

#### BinSpect

##### binSpect function

```
binSpect = function(gobject, bin_method = c('kmeans', 'rank'), expression_values = c('normalized', 'scaled', 'custom'), subset_genes = NULL, reduce_network = FALSE, kmeans_algo = c('kmeans', 'kmeans_arma', 'kmeans_arma_subset'), nstart = 3, iter_max = 10, extreme_nr = 50, spatial_network_name='Delaunay_network', sample_nr = 50, percentage_rank = 30, do_fisher_test = TRUE, adjust_method = 'fdr', calc_hub = FALSE, hub_min_int = 3, get_av_expr = TRUE, get_high_expr = TRUE, implementation = c('data.table', 'simple', 'matrix'), group_size = 'automatic', do_parallel = TRUE, cores = NA, verbose = T, knn_params = NULL, set.seed = NULL, summarize = c('adj.p.value', 'p.value'))
```

The input is a normalized gene expression matrix (`expression_values="normalized"`

).

Prior to `binSpect`

, the spatial network (for example KNN network) based on spatial locations should be created (see function `createSpatialNetwork`

for details). Afterward, the name of the spatial network should be given to `spatial_network_name`

in `binSpect`

. For example: `createSpatialNetwork(g_object, method='kNN', name='knn_network', k=20)`

, then `binSpect(gobject, spatial_network_name='knn_network', ...)`

.

The key parameter is the `percentage_rank`

, which is the percentage of top expressed cells to be binarized to 1's.

The `bin_method`

is binarization based on K-means or rank. For k-means, a 2-class K-means is performed and the higher class is assigned 1's (`nstart`

, `iter_max`

are pertinent parameters of K-means). For rank, the top ranked percentage of cells are assigned 1's (`percentage_rank`

is the pertinent parameter).

Other optional settings `calc_hub`

, `hub_min_int`

, `get_av_expr`

, and `get_high_expr`

will also report statistics about individual genes:

`calc_hub`

: calculate the number of hub cells`hub_min_int`

: minimum number of cell-cell interactions for a hub cell`get_av_expr`

: calculate the average expression per gene of the high expressing cells`get_high_expr`

: calculate the number of high expressing cells per gene

BinSpect runs in parallel mode by default (`do_parallel`

) and will utilize all cores. Number of cores can be set by `cores`

.

##### Outputs

A data table with results.