Data preparation for cytomarker

cytomarker accepts single-cell RNA sequencing (scRNA-seq) data in three popular formats:

  1. SingleCellExperiment
  2. Seurat
  3. AnnData

Cell number recommendation

It is recommended to subsample large datasets to reduce upload and compute time. Sample subsampling code for each of the three data types is given below.

SingleCellExperiment

To save a SingleCellExperiment object called sce for upload to cytomarker, run

save-singlecellexperiment.R
saveRDS(sce, "sce_for_cytomarker.rds")

Key points

  • For data in the SingleCellExperiment format, the cell type/state to design the panel around is taken as a column of colData(sce)
  • The expression data is taken from assays(sce). By default, logcounts is used, though the user may select a different assay
  • Gene names in the form of HUGO gene symbols are searched for in order to match targets to antibody IDs. By default, these are taken from rownames(sce) or rowData(sce). See "Heuristics for matching gene symbols" for more details.

Subsampling a SingleCellExperiment object

The following code subsamples a SingleCellExperiment to 2000 cells:

subsample-singlecellexperiment.R
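set.seed(123L)  # fix the random seed so the subsample is reproducible
# Draw up to 2000 column (cell) indices without replacement
cells_to_sample <- sample(ncol(sce), min(2000, ncol(sce)))
sce_subsampled <- sce[, cells_to_sample]
saveRDS(sce_subsampled, "sce_for_cytomarker.rds")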

Heuristics for matching gene symbols

cytomarker searches the uploaded object for HUGO gene symbols by calculating the overlap between each possible gene name slot in the dataset and a gene name database. If more than 50% of the uploaded gene names are listed in the database, the gene names are used without warning. If fewer than 50% but more than 100 of the uploaded gene names are listed in the database, the gene names are used and a warning is shown. If no HUGO symbols are found but Ensembl IDs are, these are automatically converted and collapsed to the gene level if necessary.
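The thresholds above can be sketched in base R as follows. This is only an illustration of the described rule, not cytomarker's actual implementation: the function name classify_gene_overlap, its return labels, and the toy database are all hypothetical, and the behaviour when neither threshold is met is not specified in this section.

```r
# Hypothetical helper illustrating the thresholds described above.
# `uploaded` and `database` are character vectors of gene names.
classify_gene_overlap <- function(uploaded, database) {
  uploaded <- unique(uploaded)
  stopifnot(length(uploaded) > 0)
  n_matched <- sum(uploaded %in% database)
  frac_matched <- n_matched / length(uploaded)
  if (frac_matched > 0.5) {
    "use"               # more than 50% matched: accept without warning
  } else if (n_matched > 100) {
    "use_with_warning"  # under 50% but more than 100 matched: accept, warn
  } else {
    "insufficient"      # outcome for this case is not described here
  }
}

# Toy example with a made-up three-gene "database"
db <- c("CD3E", "CD19", "MS4A1")
classify_gene_overlap(c("CD3E", "CD19", "ENSG00000000003"), db)
```

Running a check like this on rownames(sce) before upload may help anticipate whether a warning will be shown, provided a suitable list of HUGO symbols is available locally.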

Note that currently only human gene names are supported. If a dataset with gene names from a different species is uploaded, such as mouse or rat, an error is shown.

Removal of potentially confounding genes

Several genes that frequently confound single-cell RNA-seq analyses are removed when the selected dataset is loaded. The specific genes removed are:

  1. MALAT1
  2. The JUN family proteins (those starting with JUN)
  3. The FOS family proteins (those starting with FOS)
  4. All heat shock proteins (those starting with HSP)
  5. Genes located on the mitochondrial genome
  6. All ribosomal proteins (those starting with RPL or RPS)
  7. All mitoribosomal proteins (those starting with MRPL or MRPS)
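
To preview which genes this filter would drop from a dataset, the prefixes listed above can be approximated with a regular expression. This is a sketch under assumptions rather than cytomarker's own code: the exact patterns used by the app are not given here, and the MT- prefix for mitochondrial genes is assumed from the usual human symbol convention.

```r
# Assumed patterns: the prefixes listed above, plus "MT-" for mitochondrial
# genes (the common human symbol prefix; an assumption, not from cytomarker).
confounding_pattern <- "^(MALAT1$|JUN|FOS|HSP|MT-|RPL|RPS|MRPL|MRPS)"

# Toy gene list; in practice this would be rownames(sce)
genes <- c("CD3E", "MALAT1", "JUNB", "FOSL1", "HSPA1A",
           "MT-CO1", "RPL13", "MRPS7", "MS4A1")

is_confounding <- grepl(confounding_pattern, genes)
genes[!is_confounding]  # "CD3E" "MS4A1"
```

Since cytomarker removes these genes automatically on loading, there is no need to filter them before upload; the sketch is only useful for checking whether a marker of interest would be affected.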

Creating pre-computed UMAP coordinates

cytomarker makes use of the UMAP dimensionality reduction projection, which is described briefly here. The app can detect whether the uploaded dataset contains pre-computed UMAP coordinates for this aspect of the visualization. Pre-computing UMAP coordinates is highly recommended, especially for large datasets, because it speeds up the analysis. An example of creating UMAP coordinates with the scater package, using default settings, is shown below:

library(scater)
sce <- runUMAP(sce)

UMAP computation parameters

Several input parameters for computing UMAP coordinates can be changed depending on the desired output. For example, the user can change the number of neighbours and the overall resolution to emphasize local or global data structure, which in turn alters the number of clusters generated. Users should consult a resource like this one, which gives a brief overview of the effects of modifying UMAP parameters.
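
As a rough illustration of changing these parameters, the sketch below wraps scater's runUMAP in a small helper. The helper name umap_with_params and the parameter values are hypothetical and chosen only for illustration; n_neighbors is an argument of scater's UMAP functions, while min_dist is assumed to be passed through to the underlying uwot implementation.

```r
# Hypothetical helper: recompute UMAP with explicit neighbourhood settings.
# Larger n_neighbors tends to emphasize global structure; smaller values
# emphasize local structure.
umap_with_params <- function(sce, n_neighbors = 30, min_dist = 0.3,
                             seed = 123L) {
  set.seed(seed)  # UMAP is stochastic; fix the seed for reproducibility
  scater::runUMAP(sce,
                  n_neighbors = n_neighbors,
                  min_dist = min_dist,
                  name = "UMAP")  # stored under the default name
}

# Usage (requires scater and an sce with a logcounts assay):
# sce <- umap_with_params(sce, n_neighbors = 50)
```

Keeping the result under the name "UMAP" means the coordinates remain detectable by cytomarker after the object is saved.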

By default, runUMAP stores the coordinates as a reduced dimension named "UMAP". cytomarker auto-detects any such entry as long as some portion of its name contains UMAP in any letter case (such as umap, uMAP, etc.).
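
The name matching described above amounts to a case-insensitive substring search. The following base R sketch shows the idea on a made-up vector of names; it is an approximation of the described behaviour, not cytomarker's own detection code. For a real object, the names would typically come from reducedDimNames(sce).

```r
# Made-up names standing in for reducedDimNames(sce)
dim_names <- c("PCA", "TSNE", "umap_harmony", "X_UMAP")

# Keep every name that contains "umap" in any letter case
umap_slots <- dim_names[grepl("umap", dim_names, ignore.case = TRUE)]
umap_slots  # "umap_harmony" "X_UMAP"
```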

Dataset metadata/colData

In SingleCellExperiment objects, the metadata (useful tabular information about the cells in the dataset) is usually stored in colData. From its columns, the user may select a category of interest on which to evaluate the cells.

caution

Users should avoid naming any colData column keep_for_analysis. cytomarker uses this identifier internally to subsample and filter cells on each run. If this column already exists in colData, it must be renamed before running cytomarker.
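
One way to rename a clashing column is sketched below on a plain data.frame standing in for the cell metadata. The replacement name keep_for_analysis_original is an arbitrary, hypothetical choice; the same column-name assignment should apply to colnames(colData(sce)) on a real object.

```r
# Stand-in for the cell metadata; in practice, apply the same renaming
# to colnames(colData(sce)).
cell_metadata <- data.frame(
  cell_type = c("T cell", "B cell"),
  keep_for_analysis = c(TRUE, FALSE)
)

reserved <- "keep_for_analysis"
new_name <- "keep_for_analysis_original"  # hypothetical replacement name

# Rename only the reserved column, leaving all others untouched
colnames(cell_metadata)[colnames(cell_metadata) == reserved] <- new_name
colnames(cell_metadata)  # "cell_type" "keep_for_analysis_original"
```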