Curated datasets
cytomarker provides a number of pre-curated datasets from publicly available scientific publications so that users can explore the functionality and outputs of the application without needing their own experimental scRNA-seq data. This article gives a brief overview of the datasets currently available to cytomarker users, with information on their publication origin and any relevant pre-processing performed prior to their inclusion in cytomarker.
Datasets were selected based on the following criteria: well-defined and annotated cell types, clear separation and resolution of cell type clusters under UMAP dimensionality reduction, and a modest overall number of cells (between 2000 and 4000).
Tabula Sapiens
The publicly available datasets from the Tabula Sapiens human cell atlas, released by the Tabula Sapiens Consortium, have been processed to be compatible with cytomarker.
The following tissue types are available for analysis:
- Bladder
- Blood
- Bone Marrow
- Eye
- Fat
- Heart
- Kidney
- Large Intestine
- Liver
- Lung
- Lymph Node
- Mammary
- Muscle
- Pancreas
- Prostate
- Salivary Gland
- Skin
- Small Intestine
- Spleen
- Thymus
- Tongue
- Trachea
- Uterus
- Vasculature
Within each of the tissue types, one or more of the following tissue "compartments", which represent broader cell-type categories, may be present:
- Endothelial
- Epithelial
- Immune
- Stromal
cytomarker offers the user the ability to filter any of the tissue datasets above by compartment prior to analysis.
Processing of Tabula Sapiens datasets for cytomarker
The following general scheme was used for each of the datasets available through Tabula Sapiens. The general objective was to sub-sample each tissue set to between 2000 and 3000 cells while preserving balanced inclusion of the different tissue sub-types provided in the cell_ontology_class metadata column. This was done to allow analysis of rare sub-types and to avoid dominating the initial panel construction with sub-types that are highly abundant.
- Each dataset was downloaded from this url
- The dataset was filtered to exclude genes with fewer than 10 total counts across all cells profiled
- A frequency distribution of cell types, as annotated in cell_ontology_class, was created
- Any sub-types with a frequency of 20 or less were removed
- Any sub-types with a frequency above 20 and up to 100 were kept with no subsampling
- For sub-types with a frequency above 100, cells were subsampled to n = 3000 / (number of sub-types in the tissue)
- The precomputed UMAP coordinates were taken from the X_umap metadata column
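The gene filter and sub-type subsampling rules above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code used to build the cytomarker datasets: the function names, the random seed, and the use of plain lists and dicts in place of the real data object are hypothetical, and it assumes that the "number of sub-types in the tissue" is counted before rare sub-types are removed.

```python
import random
from collections import Counter


def genes_with_enough_counts(counts_by_gene, min_total=10):
    """Return genes whose total count across all cells is at least min_total.

    counts_by_gene: dict mapping gene name -> list of per-cell counts.
    """
    return [g for g, counts in counts_by_gene.items() if sum(counts) >= min_total]


def subsample_by_cell_type(labels, target_total=3000, min_count=20,
                           keep_all_max=100, seed=0):
    """Return sorted indices of the cells kept after per-sub-type subsampling.

    labels: one cell_ontology_class value per cell.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    # Assumption: the divisor is the number of sub-types before rare ones are dropped.
    per_type_cap = target_total // len(counts)

    # Group cell indices by their sub-type label.
    indices_by_type = {}
    for idx, label in enumerate(labels):
        indices_by_type.setdefault(label, []).append(idx)

    kept = []
    for label, indices in indices_by_type.items():
        n = len(indices)
        if n <= min_count:
            continue                      # 20 cells or fewer: sub-type removed
        if n <= keep_all_max:
            kept.extend(indices)          # 21 to 100 cells: kept without subsampling
        else:
            k = min(n, per_type_cap)      # above 100 cells: subsample to the cap
            kept.extend(rng.sample(indices, k))
    return sorted(kept)


# Toy usage with a reduced target so the effect is visible on a small example.
toy_labels = ["T cell"] * 500 + ["B cell"] * 60 + ["mast cell"] * 10
toy_kept = subsample_by_cell_type(toy_labels, target_total=300)
print(Counter(toy_labels[i] for i in toy_kept))
```

In the toy example the abundant sub-type is capped, the mid-sized one is kept whole, and the rare one is dropped, which mirrors the three frequency bands in the list above.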
Human PBMC (Peripheral blood mononuclear cells)
This dataset is provided as the tutorial data for Seurat, one of the canonical analysis packages for scRNA-seq data. The introductory tutorial can be found here, while the raw data files are provided on the 10X Genomics website.
These data are processed according to the steps outlined in the tutorial. Briefly, the raw expression matrix is first filtered using the following parameters:
- each cell requires a minimum of 200 genes
- each gene requires expression in a minimum of 3 cells
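As a rough illustration of the two thresholds above, the sketch below filters a small count matrix held as nested Python lists. The function name and data layout are hypothetical, and the order of operations (cells first, then genes counted over the remaining cells) is an assumption rather than something stated in this article.

```python
def filter_counts(matrix, min_genes_per_cell=200, min_cells_per_gene=3):
    """Return (kept_cell_indices, kept_gene_indices).

    matrix: list of per-cell rows, each a list of per-gene counts.
    A gene counts as detected in a cell when its count is above zero.
    """
    # Keep cells that express at least min_genes_per_cell genes.
    kept_cells = [
        i for i, row in enumerate(matrix)
        if sum(1 for c in row if c > 0) >= min_genes_per_cell
    ]
    n_genes = len(matrix[0]) if matrix else 0
    # Keep genes detected in at least min_cells_per_gene of the remaining cells.
    kept_genes = [
        j for j in range(n_genes)
        if sum(1 for i in kept_cells if matrix[i][j] > 0) >= min_cells_per_gene
    ]
    return kept_cells, kept_genes


# Toy usage: 4 cells x 3 genes, with thresholds lowered to fit the tiny matrix.
toy_matrix = [
    [1, 0, 2],
    [0, 0, 1],
    [3, 1, 0],
    [2, 0, 5],
]
toy_cells, toy_genes = filter_counts(toy_matrix, min_genes_per_cell=2,
                                     min_cells_per_gene=2)
print(toy_cells, toy_genes)
```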
After the initial filtration, standard QC procedures are applied to the data and further filtering thresholds are used prior to downstream cell annotation:
- Cells must have a minimum of 200 and a maximum of 2500 genes
- Cells must have a mitochondrial content of 5% or less
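The per-cell QC thresholds above can be expressed as a small predicate. This is a sketch with hypothetical function names. It assumes mitochondrial genes are those whose symbol starts with "MT-", and it treats the bounds as inclusive, following the wording of the list. The tutorial's own boundary handling may differ.

```python
def percent_mito(gene_names, cell_counts, prefix="MT-"):
    """Percentage of a cell's total counts that come from mitochondrial genes."""
    total = sum(cell_counts)
    if total == 0:
        return 0.0
    mito = sum(c for g, c in zip(gene_names, cell_counts) if g.startswith(prefix))
    return 100.0 * mito / total


def passes_qc(gene_names, cell_counts, min_genes=200, max_genes=2500,
              max_pct_mito=5.0):
    """True when the cell meets the gene-count and mitochondrial thresholds."""
    n_genes = sum(1 for c in cell_counts if c > 0)
    within_gene_range = min_genes <= n_genes <= max_genes
    return within_gene_range and percent_mito(gene_names, cell_counts) <= max_pct_mito


# Toy usage: three genes only, so the gene-count bounds are lowered to match.
toy_genes_qc = ["MT-CO1", "CD3E", "MS4A1"]
print(passes_qc(toy_genes_qc, [1, 50, 49], min_genes=2, max_genes=3))
print(passes_qc(toy_genes_qc, [10, 45, 45], min_genes=2, max_genes=3))
```

The first toy cell has 1% mitochondrial counts and passes. The second has 10% and is rejected.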
After filtration, the data are log-transformed and scaled with a standard linear transformation. Dimensionality reduction is conducted using PCA on the variable gene features. Clustering is performed at a resolution of 0.5, and UMAP dimension reduction uses the top 10 PCA components. Lastly, cluster annotation is performed with canonical immune cell markers as listed in this portion of the tutorial.
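To make the log-transformation and scaling steps concrete, here is a minimal sketch in plain Python. It is not the Seurat implementation: the function names are hypothetical, the scale factor of 10,000 is assumed, and scaling is shown as simple per-gene centering and division by the sample standard deviation.

```python
import math
import statistics


def log_normalize(cell_counts, scale_factor=10_000):
    """Divide each count by the cell total, multiply by scale_factor, take log1p."""
    total = sum(cell_counts)
    if total == 0:
        return [0.0 for _ in cell_counts]
    return [math.log1p(c / total * scale_factor) for c in cell_counts]


def scale_gene(values):
    """Center one gene's values across cells to mean 0 and scale to unit variance."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values) if len(values) > 1 else 0.0
    if sd == 0:
        return [0.0 for _ in values]   # constant gene: nothing to scale
    return [(v - mean) / sd for v in values]


# Toy usage: normalize one cell, then scale one gene across three cells.
print(log_normalize([0, 5, 5]))
print(scale_gene([1.0, 2.0, 3.0]))
```

Normalization operates on each cell (so that cells with different sequencing depth become comparable), while scaling operates on each gene across cells, which is why the two functions take differently oriented inputs.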