Data preparation for cytomarker
cytomarker accepts single-cell RNA-sequencing (scRNA-seq) data in three popular formats:
- SingleCellExperiment
- Seurat
- Anndata
Cell number recommendation
It is recommended to subsample large datasets to reduce upload and compute time. Sample code for subsampling SingleCellExperiment and Anndata objects is given in the sections below.
SingleCellExperiment
To save a SingleCellExperiment object called sce for upload to cytomarker, run:
saveRDS(sce, "sce_for_cytomarker.rds")
Key points
- For data in the SingleCellExperiment format, the cell type/state to design the panel around is taken as a column of colData(sce).
- The expression data is taken from assays(sce). By default, logcounts is used, though the user may select a different assay.
- Gene names in the form of HUGO gene symbols will be searched for to match targets to antibody IDs. By default these will be taken from rownames(sce) or rowData(sce). See heuristics for matching gene names for more details.
Subsampling a SingleCellExperiment object
The following code subsamples a SingleCellExperiment to 2000 cells:
set.seed(123L)
cells_to_sample <- sample(ncol(sce), min(2000, ncol(sce)))
sce_subsampled <- sce[,cells_to_sample]
saveRDS(sce_subsampled, "sce_for_cytomarker.rds")
Seurat
To save a Seurat object called seu for upload to cytomarker, run:
saveRDS(seu, "seu_for_cytomarker.rds")
Seurat object conversion
cytomarker converts Seurat objects to SingleCellExperiment objects using Seurat's as.SingleCellExperiment function.
Key points
- For data in the Seurat format, the cell type/state to design the panel around is taken as a column of seu@meta.data.
- The expression data is taken from the active assay of the Seurat object. By default, this is then converted to an assay named logcounts, though the user may select a different assay.
- Gene names in the form of HUGO gene symbols will be searched for to match targets to antibody IDs. By default these will be taken from rownames(seu). See heuristics for matching gene names for more details.
Anndata
To save an Anndata object called adata for upload to cytomarker, run:
adata.write_h5ad('adata_for_cytomarker.h5ad')
Key points
- For data in the Anndata format, the cell type/state to design the panel around is taken as a column of adata.obs.
- The expression data is taken by default as adata.X, though if additional assays are present these may be selected.
- Gene names in the form of HUGO gene symbols will be searched for to match targets to antibody IDs. By default these will be taken from adata.var_names or adata.var. See heuristics for matching gene names for more details.
Subsampling an Anndata object
import scanpy as sc
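# Cap n_obs at the number of cells so datasets with fewer than 2000 cells are kept whole
sc.pp.subsample(adata, n_obs=min(2000, adata.n_obs))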
adata.write_h5ad('adata_for_cytomarker.h5ad')
Heuristics for matching gene symbols
The uploaded object is searched for HUGO gene names. The overlap between each possible gene name slot in the uploaded dataset and a gene name database is calculated. If more than 50% of the uploaded gene names are listed in the database, the gene names are used without warning. If fewer than 50% but more than 100 of the uploaded gene names are listed in the database, the gene names are used but a warning is shown. If no HUGO symbols are found but Ensembl IDs are found, these are automatically converted and collapsed to the gene level if necessary.
Note that currently only human gene names are supported. If a dataset with gene names from a different species is uploaded, such as mouse or rat, an error is shown.
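The thresholds described above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not cytomarker's own code: the function name classify_gene_names, the returned labels, and the tiny reference set are hypothetical, and the "no_match" outcome when neither threshold is met is an assumption, since the text above does not say what happens in that case.

```python
def classify_gene_names(uploaded, reference):
    """Classify one candidate gene-name slot by its overlap with a reference set.

    Returns "ok" (used without warning), "warn" (used, with a warning),
    or "no_match" (assumed outcome when neither threshold is met).
    """
    uploaded = list(uploaded)
    if not uploaded:
        return "no_match"
    n_matched = sum(1 for name in uploaded if name in reference)
    fraction = n_matched / len(uploaded)
    if fraction > 0.5:    # more than 50% of uploaded names are known symbols
        return "ok"
    if n_matched > 100:   # a minority matches, but still more than 100 names
        return "warn"
    return "no_match"     # assumption: not specified in the text above


# Hypothetical stand-in for a gene name database
reference = {"CD3E", "CD4", "CD8A", "MS4A1"}
print(classify_gene_names(["CD3E", "CD4", "Gm1234"], reference))  # prints "ok"
```

Running the same function over each candidate slot (for example row names and each gene annotation column) and keeping the best result would mirror the "each possible gene name slot" wording, though how cytomarker ranks slots is not stated here.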
Removal of potentially confounding genes
Several genes that frequently confound single cell RNA-Seq analyses are removed upon loading the selected dataset. The specific genes removed are:
- MALAT1
- The JUN family genes (those starting with JUN)
- The FOS family genes (those starting with FOS)
- All heat shock protein genes (those starting with HSP)
- Genes located on the mitochondrial genome
- All ribosomal protein genes (those starting with RPL or RPS)
- All mitoribosomal protein genes (those starting with MRPL or MRPS)
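To preview which genes in a dataset would be affected, the prefix rules in the list above can be approximated as follows. This is a rough sketch rather than cytomarker's actual filter: the helper names are hypothetical, matching is assumed to be case-sensitive, and mitochondrial genes are assumed to be identified by the usual MT- symbol prefix, which the list above does not state.

```python
# Prefixes taken from the list above. "MT-" for mitochondrial genes is an
# assumption (the common HGNC symbol prefix), not stated by cytomarker.
CONFOUNDER_PREFIXES = ("JUN", "FOS", "HSP", "RPL", "RPS", "MRPL", "MRPS", "MT-")
CONFOUNDER_GENES = {"MALAT1"}


def is_confounder(symbol):
    """Return True if a gene symbol matches one of the removal rules."""
    return symbol in CONFOUNDER_GENES or symbol.startswith(CONFOUNDER_PREFIXES)


def drop_confounders(symbols):
    """Keep only the symbols that do not match any removal rule."""
    return [s for s in symbols if not is_confounder(s)]


genes = ["CD3E", "MALAT1", "JUNB", "FOSL1", "HSPA1A",
         "MT-CO1", "RPL13", "MRPS7", "MS4A1"]
print(drop_confounders(genes))  # ['CD3E', 'MS4A1']
```

Checking a marker list this way before upload can explain why an expected gene, such as a heat shock gene, does not appear among the candidates.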
Creating pre-computed UMAP coordinates
cytomarker makes use of the UMAP dimensionality reduction projection, which is described briefly here. The app can detect whether the uploaded dataset contains pre-computed UMAP coordinates for this aspect of the visualization. Creating pre-computed UMAP coordinates is highly recommended, especially for large datasets, to speed up the analysis. An example of creating UMAP coordinates with default settings using the scater package is shown below:
library(scater)
sce <- runUMAP(sce)
UMAP computation parameters
There are several input parameters for computing UMAP coordinates that can be changed depending on the desired output. For example, the number of neighbours and the overall resolution can be changed to emphasize local or global data structure, which in turn affects the number of clusters that appear in the projection. The user should be sure to visit a resource like this one, which gives a brief overview of the effects of parameter modification for UMAP.
By default, the reduced dimension result is stored with the name "UMAP". cytomarker can auto-detect any such entry so long as some portion of its name contains UMAP, matched case-insensitively (such as umap, uMAP, etc.).
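The name-matching rule can be illustrated with a case-insensitive substring check. This is a minimal sketch of the rule as described, not cytomarker's code; the function name find_umap_names and the example names are hypothetical.

```python
def find_umap_names(names):
    """Return the names that contain "umap" in any letter case."""
    return [name for name in names if "umap" in name.lower()]


# Hypothetical names of stored dimensionality reductions
stored = ["PCA", "UMAP", "X_umap", "TSNE", "uMAP_harmony"]
print(find_umap_names(stored))  # ['UMAP', 'X_umap', 'uMAP_harmony']
```

If several names match, it is not stated here which one cytomarker uses, so keeping a single UMAP entry in the uploaded object avoids ambiguity.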
Dataset metadata/colData
In SingleCellExperiment objects, the metadata (useful tabular information about the cells in the dataset) is often contained in the colData slot. From its columns, the user may select a category of interest on which to evaluate the cells.
Caution
Users should avoid using a column named keep_for_analysis in colData. This identifier is used internally by cytomarker to subsample and filter cells on each run. If this column exists in colData prior to running, it must be renamed.
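A small pre-upload check for the reserved name could look like the sketch below. The helper rename_reserved and the _original suffix are hypothetical choices, and applying the check to adata.obs assumes that the same restriction holds for Anndata uploads, which the caution above does not state explicitly.

```python
RESERVED = "keep_for_analysis"


def rename_reserved(columns, suffix="_original"):
    """Build a rename mapping for the reserved column, or {} if it is absent."""
    columns = list(columns)
    if RESERVED not in columns:
        return {}
    new_name = RESERVED + suffix
    # Avoid colliding with a column that already carries the new name
    while new_name in columns:
        new_name += "_"
    return {RESERVED: new_name}


print(rename_reserved(["cell_type", "keep_for_analysis"]))
# {'keep_for_analysis': 'keep_for_analysis_original'}
```

For an Anndata object, whose obs table is a pandas DataFrame, the mapping could then be applied with adata.obs = adata.obs.rename(columns=rename_reserved(adata.obs.columns)) before writing the file.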