Data preparation for cytomarker

cytomarker accepts single-cell RNA-sequencing (scRNA-seq) data in three popular formats:

SingleCellExperiment
Seurat
Anndata (NOTE: Currently supported on local sessions only, not on shinyapps.io)

cytomarker upload size limits

cytomarker limits uploaded files to 20GB on both the publicly hosted shinyapps instance and local sessions. However, for optimal performance and speed, we recommend keeping file sizes under 500MB (shinyapps) or under 1GB (local).
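
As a rough illustration, the limits above can be checked before uploading. This is a minimal sketch, not part of cytomarker: the function names are hypothetical, and it assumes decimal units (1GB = 10^9 bytes), which the limits above do not specify.

```python
import os

# Assumption: decimal units; the documented limits do not say which unit is meant.
MB = 10**6
GB = 10**9

HARD_LIMIT = 20 * GB                                  # maximum accepted upload size
RECOMMENDED = {"shinyapps": 500 * MB, "local": 1 * GB}  # recommended upper bounds


def classify_upload_size(n_bytes, session="shinyapps"):
    """Return 'too_large', 'slow', or 'ok' for a file of n_bytes."""
    if n_bytes > HARD_LIMIT:
        return "too_large"   # exceeds the 20GB upload limit
    if n_bytes >= RECOMMENDED[session]:
        return "slow"        # accepted, but above the recommended size
    return "ok"


def check_file(path, session="shinyapps"):
    """Classify a file on disk, e.g. check_file('sce_for_cytomarker.rds')."""
    return classify_upload_size(os.path.getsize(path), session)
```

A file classified as 'slow' is a good candidate for the subsampling described in the next section.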

Cell number recommendation

It is recommended to subsample large datasets to reduce upload and compute time. Sample code for subsampling for the three data types is below.

SingleCellExperiment
Seurat
Anndata

SingleCellExperiment

To save a SingleCellExperiment object called sce for upload to cytomarker, run

save-singlecellexperiment.R
saveRDS(sce, "sce_for_cytomarker.rds")

Key points

For data in the SingleCellExperiment format, the cell type/state to design the panel around is taken as a column of colData(sce)
The expression data is taken from assays(sce). By default, logcounts is used, though the user may select a different assay
cytomarker searches for gene names in the form of HUGO gene symbols to match targets to antibody IDs. By default, these are taken from rownames(sce) or rowData(sce). See heuristics for matching gene names for more details.

Subsampling a SingleCellExperiment object

The following code subsamples a SingleCellExperiment to 2000 cells:

subsample-singlecellexperiment.R
set.seed(123L)
cells_to_sample <- sample(ncol(sce), min(2000, ncol(sce)))
sce_subsampled <- sce[,cells_to_sample]
saveRDS(sce_subsampled, "sce_for_cytomarker.rds")

Seurat

To save a Seurat object called seu for upload to cytomarker, run

save-seurat.R
saveRDS(seu, "seu_for_cytomarker.rds")

Seurat object conversion

cytomarker converts Seurat objects to SingleCellExperiment objects using Seurat's as.SingleCellExperiment function.

Key points

For data in the Seurat format, the cell type/state to design the panel around is taken as a column of seu@meta.data
The expression data is taken from the active assay of the Seurat object. By default, this is converted to an assay named logcounts, though the user may select a different assay
cytomarker searches for gene names in the form of HUGO gene symbols to match targets to antibody IDs. By default, these are taken from rownames(seu). See heuristics for matching gene names for more details.

Anndata

To save an Anndata object called adata for upload to cytomarker, run

save-anndata.py
adata.write_h5ad('adata_for_cytomarker.h5ad')

Key points

For data in the Anndata format, the cell type/state to design the panel around is taken as a column of adata.obs
The expression data is taken by default as adata.X, though if additional assays are present these may be selected
cytomarker searches for gene names in the form of HUGO gene symbols to match targets to antibody IDs. By default, these are taken from adata.var_names or adata.var. See heuristics for matching gene names for more details.

Subsampling an Anndata object

The following code subsamples an Anndata object to 2000 cells:

subsample-anndata.py
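import scanpy as sc

# Cap the sample size at the number of cells so that small datasets do not raise an error
sc.pp.subsample(adata, n_obs=min(2000, adata.n_obs), random_state=123)
adata.write_h5ad('adata_for_cytomarker.h5ad')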

Heuristics for matching gene symbols

The uploaded object is searched for HUGO gene names. For each possible gene name slot in the uploaded dataset, the overlap with a gene name database is calculated. If more than 50% of the uploaded gene names are listed in the database, the gene names are used without warning. If fewer than 50% but more than 100 of the uploaded gene names are listed in the database, the gene names are used but a warning is shown. If no HUGO symbols are found but Ensembl IDs are found, these are automatically converted and collapsed to the gene level if necessary.
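
The decision rule above can be sketched for a single gene name slot as follows. This is an illustrative sketch, not cytomarker's implementation: the function name and return labels are hypothetical, duplicates are counted once, and a slot with exactly 50% overlap is assumed to fall into the warning branch because the rule above does not cover that case.

```python
def classify_gene_name_slot(candidate_names, known_symbols):
    """Classify one gene name slot by its overlap with a symbol database.

    Returns 'use', 'use_with_warning', or 'no_match'.
    """
    unique = set(candidate_names)           # assumption: duplicates count once
    matched = len(unique & set(known_symbols))
    fraction = matched / len(unique) if unique else 0.0

    if fraction > 0.5:
        return "use"                        # more than 50% overlap: no warning
    if matched > 100:
        return "use_with_warning"           # low fraction but over 100 matches
    # Not covered by the rule above; Ensembl ID conversion or an error
    # would be handled from here.
    return "no_match"
```

For example, a slot with 150 recognised symbols out of 500 names (30%) would be used with a warning, while a slot with 3 unrecognised names would not match.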

Note that currently only human gene names are supported. If a dataset with gene names from a different species is uploaded, such as mouse or rat, an error is shown.

Removal of potentially confounding genes

Several genes that frequently confound single-cell RNA-seq analyses are removed upon loading the selected dataset. The specific genes removed are:

MALAT1
Genes of the JUN family (symbols starting with JUN)
Genes of the FOS family (symbols starting with FOS)
Genes encoding heat shock proteins (symbols starting with HSP)
Genes located on the mitochondrial genome
Genes encoding ribosomal proteins (symbols starting with RPL or RPS)
Genes encoding mitoribosomal proteins (symbols starting with MRPL or MRPS)
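
To preview which genes in a dataset would be affected, the prefix rules in the list above can be sketched as a simple filter. This is an approximation rather than cytomarker's own code: the function names are hypothetical, matching is case-sensitive, and the "MT-" prefix for mitochondrial genes is an assumption based on human gene symbol conventions.

```python
# Exact symbols and symbol prefixes taken from the list above.
CONFOUNDING_EXACT = {"MALAT1"}
CONFOUNDING_PREFIXES = (
    "JUN", "FOS", "HSP",    # JUN family, FOS family, heat shock proteins
    "RPL", "RPS",           # ribosomal proteins
    "MRPL", "MRPS",         # mitoribosomal proteins
    "MT-",                  # assumption: mitochondrial genome genes
)


def is_confounding(symbol):
    """Return True if a gene symbol matches one of the removal rules."""
    return symbol in CONFOUNDING_EXACT or symbol.startswith(CONFOUNDING_PREFIXES)


def remove_confounding(symbols):
    """Return the symbols that would be kept, in their original order."""
    return [s for s in symbols if not is_confounding(s)]
```

Running remove_confounding on a list of row names gives the genes that would remain available for panel design under these assumptions.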

Creating pre-computed UMAP coordinates

cytomarker uses the UMAP dimensionality reduction for visualization, which is described briefly here. The app can detect whether the uploaded dataset contains pre-computed UMAP coordinates. Pre-computing UMAP coordinates is highly recommended, especially for large datasets, to speed up the analysis. An example of creating UMAP coordinates with default settings using the scater package is shown below:

library(scater)
sce <- runUMAP(sce)

UMAP computation parameters

There are several input parameters for computing UMAP coordinates that can be changed depending on the desired output. For example, the user can change the number of neighbours and overall resolution to emphasize local or global data structure, and subsequently alter the number of clusters generated. The user should be sure to visit a resource like this one, which gives a brief overview of the effects of parameter modification for UMAP.

By default, runUMAP stores the reduced dimensions under the name "UMAP". cytomarker auto-detects any of these reduced dimensions as long as some portion of the name contains UMAP in any letter case (such as umap or uMAP).
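
The name matching described above amounts to a case-insensitive substring test. A minimal sketch follows; the function name and the example names are hypothetical, and this is not cytomarker's own code.

```python
def find_umap_names(names):
    """Return the names that contain 'umap' in any letter case."""
    return [name for name in names if "umap" in name.lower()]


# Hypothetical names of stored dimensionality reductions in an uploaded object.
reductions = ["PCA", "UMAP", "X_umap", "TSNE", "uMAP_harmony"]
detected = find_umap_names(reductions)
```

Here detected holds "UMAP", "X_umap", and "uMAP_harmony", while "PCA" and "TSNE" are ignored.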

Dataset metadata/colData

In SingleCellExperiment objects, the metadata (useful tabular information about the cells in the dataset) is typically stored in colData. From its columns, the user may select a category of interest on which to evaluate the cells.

caution

Users should avoid using a column named keep_for_analysis in colData. cytomarker uses this identifier internally to subsample and filter cells on each run. If this column exists in colData before running, it must be renamed.
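
One way to guard against the reserved name before saving a dataset is to rename it in the list of metadata column names. The sketch below is only illustrative: the function name and the "_original" suffix are hypothetical choices, not something cytomarker requires.

```python
RESERVED = "keep_for_analysis"  # column name used internally by cytomarker


def rename_reserved(columns, suffix="_original"):
    """Return column names with the reserved name replaced by a free name."""
    existing = set(columns)
    renamed = []
    for name in columns:
        if name == RESERVED:
            new_name = name + suffix
            # Keep appending the suffix until the new name is unused.
            while new_name in existing:
                new_name += suffix
            existing.add(new_name)
            name = new_name
        renamed.append(name)
    return renamed
```

For an Anndata object, the resulting names could then be applied to adata.obs with the pandas DataFrame.rename method before writing the file.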
