Overview
Specific details for how to apply nomenclature are laid out in the protocol below and are included on GitHub (https://github.com/AllenInstitute/nomenclature) along with associated scripts. The general steps are as follows. First, a unique `taxonomy_id` is chosen, which will be used as a prefix for all the cell set accession IDs. To ensure uniqueness across all taxonomies, `taxonomy_ids` are tracked in a public-facing database, with future plans to transfer these to a more permanent solution that will also provide storage for accompanying taxonomy files. Second, a dendrogram is read in and used as the starting point for defining cell sets by including both provisional cell types (terminal leaf nodes) and groups of cell types with similar expression patterns (internal nodes). Third, the main script assigns accession IDs and labels for each cell set and outputs an intermediate table. Fourth, the user manually annotates these cell sets to include common usage terms (aligned aliases) and any additional aliases, and can also manually add cell sets that correspond to any combination of cell types in the taxonomy or to associated metadata. Fifth, dendrograms are optionally updated to include the new information from this nomenclature table. Sixth, cells are assigned nomenclature tags corresponding to their cell set assignments (if any). Finally, the code produces a set of standardized files for visualization of the updated taxonomic structure, for inclusion in manuscripts, and for input into a future database for cross-taxonomy comparison. We note that this protocol is current as of the publication of "Common cell type nomenclature for the mammalian brain" (Miller et al., 2020), but we expect updates to the schema to be made in the GitHub repository as we extend the CCN to encompass other use cases.
Prerequisites
The critical prerequisite is a completed cell type characterization analysis with associated annotation. The CCN assumes that this analysis is complete; the files described below store its results in the specific format that the scripts expect as input. The scripts are then run in R.
1. Install R and prepare the environment and directory structure.
- Install R on your computer
- Install these libraries: `dplyr`, `dendextend`, `data.table`, and `ggplot2` in R.
- Download required_scripts.R to your working directory and put it in a `scripts` subfolder.
- Create a `data` subfolder and put the dendrogram and metadata files to be annotated therein (see next section for details).
- (Optional) Download `dend.RData`, `nomenclature_table.csv`, and `cell_metadata.csv` as an example taxonomy for annotation (see our Allen Institute Transcriptomics Explorer) and put them in the `data` subfolder.
- (Optional) Install RStudio on your computer.
- (Optional) Install the `jsonlite` library if you want to save the final dendrogram in json format.
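The setup in step 1 can be sketched in R as follows. This is a minimal sketch and not part of the provided scripts: it only reports which packages are missing rather than installing them, and it assumes that the current working directory is where the `scripts` and `data` subfolders should live.

```r
# Packages named in step 1; jsonlite is only needed for json output.
required_packages <- c("dplyr", "dendextend", "data.table", "ggplot2")
optional_packages <- c("jsonlite")

# Report which packages are available (nothing is installed here).
is_installed <- function(pkgs) {
  vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
}

missing_required <- required_packages[!is_installed(required_packages)]
if (length(missing_required) > 0) {
  message("Missing packages. Install with: install.packages(c(",
          paste0('"', missing_required, '"', collapse = ", "), "))")
}

# Create the subfolders used by the rest of the protocol.
for (folder in c("scripts", "data")) {
  dir.create(folder, showWarnings = FALSE)
}
```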
2. Generate the required input files.
- dend.RData (optional but recommended): this is an R dendrogram object representing the taxonomy to be annotated. If you used R for cell typing in your manuscript, this is likely a variable that was created at some point during this process and from which your dendrogram images are made. While this assumes a hierarchical structure of the data, additional cell sets of any kind can be made later in the script. Code for converting from other formats to the R dendrogram format is not provided, but please post to the Community Forum if you have questions about this.
- cell_metadata.csv: a table which includes a unique identifier for each cell (in this example it is stored in the `sample_name` column) as well as the corresponding cell type from the dendrogram for each cell (in this example it is stored in the `cluster_label` column). Additional metadata of any kind can be optionally included in this table.
- nomenclature_table.csv: a table that includes names along with optional manual annotations of cell sets (e.g., aliases), which typically would be completed during taxonomy generation. Note that a file in the required format will be generated as an intermediate step of this protocol if the scripts are run from start to finish; in this case, only the optional annotations are needed, in any format, to be added later.
We provide files for the taxonomy from Hodge et al. (2019) as an example.
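Before running the scripts, it can help to confirm that the cell metadata table has the expected shape. The sketch below is an illustration rather than part of the provided scripts: `check_cell_metadata` is a hypothetical helper, and the toy cluster names are made up. It checks that the identifier and cell type columns exist, that identifiers are unique, and (optionally) that every cell type appears among the dendrogram leaf labels.

```r
# Toy stand-in for data/cell_metadata.csv (cluster names are made up).
# In practice: cell_metadata <- read.csv("data/cell_metadata.csv")
cell_metadata <- data.frame(
  sample_name   = c("cell_1", "cell_2", "cell_3", "cell_4"),
  cluster_label = c("TypeA", "TypeA", "TypeB", "TypeC"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: stops with an error if the table is malformed.
check_cell_metadata <- function(metadata,
                                id_col = "sample_name",
                                type_col = "cluster_label",
                                dend_labels = NULL) {
  # Both columns must be present.
  stopifnot(all(c(id_col, type_col) %in% colnames(metadata)))
  # Each cell identifier must be unique.
  stopifnot(!anyDuplicated(metadata[[id_col]]))
  # Every cell type should be a leaf of the dendrogram, if one is given.
  if (!is.null(dend_labels)) {
    stopifnot(all(metadata[[type_col]] %in% dend_labels))
  }
  invisible(TRUE)
}

check_cell_metadata(cell_metadata,
                    dend_labels = c("TypeA", "TypeB", "TypeC"))
```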
Generation of the initial nomenclature table
This section of the protocol converts the initial R dendrogram into the nomenclature format used for the remainder of the scripts. Running the scripts in this section is optional, as the nomenclature table could be generated manually, but it is strongly recommended. In either case, the steps defining taxonomy and cell set variables are required.
3. Set up the R workspace and load required libraries and scripts (see GitHub).
4. Define taxonomy and cell set variables.
- taxonomy_id is the name of the taxonomy in the format CCN[YYYYMMDD][#], where YYYYMMDD represents an 8 digit date format (Y=year, M=month, D=day) and # is an index for compiling multiple taxonomies on a single day.
- taxonomy_author is the name of a point person for this taxonomy (e.g., the person who built the taxonomy, the person who uploaded the data, or the first or corresponding author on a relevant manuscript).
- taxonomy_citation is a citation or permanent data identifier corresponding to the taxonomy (or "" if there is no associated citation). Ideally the DOI for the publication will be used, or alternatively some other permanent link. Additional / updated authors and citations can be provided for specific cell sets at a later step.
- cell_set_label has been deprecated but a suitable prefix is still required for the code to run properly and it is a useful way to track provisional cell types. We recommend choosing a relevant delimiter (in this case MTG).
- structure is the location in the brain (or body) from where the data in the taxonomy was collected. Ideally this will be linked to a standard ontology.
- ontology_tag is the standard ontology term associated with the anatomic structure. In this case, we choose “middle temporal gyrus” from “UBERON” since UBERON is specifically designed to be a species-agnostic ontology and we are interested in building cross-species brain references. Additional / updated `structures` and `ontology_tags` can be defined separately for each cell set at a later step.
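The variables in step 4 might be defined as in the sketch below. The values are illustrative: the author name is a placeholder, the citation is the DOI of Hodge et al. (2019) from the reference list, the UBERON identifier should be confirmed against UBERON before use, and `cell_set_prefix` and `is_valid_taxonomy_id` are hypothetical names that may not match the provided scripts. The check assumes that the index after the date is one or more digits.

```r
taxonomy_id       <- "CCN201908210"               # CCN + YYYYMMDD + index
taxonomy_author   <- "Point Person"               # placeholder name
taxonomy_citation <- "10.1038/s41586-019-1506-7"  # DOI, or "" if none
cell_set_prefix   <- "MTG"                        # hypothetical variable name
structure         <- "middle temporal gyrus"
ontology_tag      <- "UBERON:0002771"             # assumed ID; confirm in UBERON

# Hypothetical check of the CCN[YYYYMMDD][#] format.
is_valid_taxonomy_id <- function(id) {
  # "CCN", eight digits for the date, then an index of one or more digits.
  has_shape <- grepl("^CCN[0-9]{8}[0-9]+$", id)
  # The eight digits must also parse as a real calendar date.
  date_part <- as.Date(substr(id, 4, 11), format = "%Y%m%d")
  has_shape & !is.na(date_part)
}

stopifnot(is_valid_taxonomy_id(taxonomy_id))
```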
5. Read the dendrogram into R.
6. Build the nomenclature table for this taxonomy by providing the dendrogram and defined variables to the function “build_nomenclature_table”. The output of this function is a list with three components:
- cell_set_information, a data.frame (table) of taxonomy information (see below)
- initial_dendrogram, the initially inputted dendrogram except that all nodes are labeled with short labels at this point (n1, n2, n3, etc.) to aid in manual annotation
- updated_dendrogram, a dendrogram updated to include everything in `cell_set_information`.
7. Output the initial nomenclature table to a csv file and (optionally) the updated dendrogram to a pdf.
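To make the idea behind steps 5 to 7 concrete, the sketch below builds a toy dendrogram, gives every internal node a short label (n1, n2, ...), and assembles a small table with one row per cell set. It is not the `build_nomenclature_table` implementation from the GitHub repository, whose arguments and full set of output columns are not reproduced here; the toy cell type names and the column subset are assumptions made for illustration.

```r
# Toy dendrogram with five provisional cell types (made-up names).
# In practice the dendrogram is loaded from data/dend.RData.
centers <- matrix(c(1, 2, 6, 7, 15), ncol = 1,
                  dimnames = list(c("TypeA", "TypeB", "TypeC",
                                    "TypeD", "TypeE"), NULL))
dend <- as.dendrogram(hclust(dist(centers)))

# Give each internal node a short label to aid manual annotation.
node_count <- 0
label_internal_nodes <- function(node) {
  if (!is.leaf(node)) {
    node_count <<- node_count + 1
    attr(node, "label") <- paste0("n", node_count)
  }
  node
}
labeled_dend <- dendrapply(dend, label_internal_nodes)

# One row per cell set: leaves first, then internal nodes.
taxonomy_id <- "CCN201908210"  # illustrative value
leaf_labels <- labels(dend)
node_labels <- paste0("n", seq_len(node_count))
all_labels  <- c(leaf_labels, node_labels)

cell_set_information <- data.frame(
  cell_set_accession = paste0("CS", sub("^CCN", "", taxonomy_id),
                              "_", seq_along(all_labels)),
  original_label     = all_labels,
  taxonomy_id        = taxonomy_id,
  stringsAsFactors   = FALSE
)

# Step 7: write the table to a csv file (a temporary file in this sketch).
write.csv(cell_set_information, tempfile(fileext = ".csv"),
          row.names = FALSE)
```

A binary dendrogram with five leaves has four internal nodes, so the toy table has nine rows.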
Manual annotation of cell sets
This step is where you can manually annotate the taxonomy (e.g., add or change aliases and structures, or add additional cell sets entirely) by updating the `nomenclature_table.csv` file. This is also a useful step for versioning: if additional information about cell sets is collected (e.g., new aliases), the result of this manual annotation can be used as a starting point for future annotation. It is important to note that some strategy for matching cell sets with prior knowledge (e.g., computational alignment) should be completed prior to application of the CCN, as this will aid in reliably assigning alias tags. Data set alignment is not performed as part of the CCN.
8. Open `nomenclature_table.csv` in a text or spreadsheet editor (e.g., Excel, vi, Notepad++). If the “generation of the initial nomenclature table” section was run, then `nomenclature_table.csv` should exist in your working directory; otherwise, it will have to be created from scratch.
9. Confirm that nomenclature_table.csv is correctly formatted. If the “generation of the initial nomenclature table” section was run, the following columns will be included. Most of these columns (indicated by a ^) are components of the CCN.
- cell_set_accession^: The unique identifier (cell set accession ID) assigned for each cell set of the format CS[YYYYMMDD][T]_#, where CS stands for cell set, [YYYYMMDD][T] matches the taxonomy_id (see above), and the # is a unique number starting from 1 for each cell set.
- original_label: The original cell type label in the dendrogram. This is used for QC only and is not part of the CCN.
- cell_set_label: A label of the format [prefix] #, where the prefix is assigned above and the # corresponds to a single number or range of numbers (if the cell set includes multiple provisional cell types). This used to be part of the CCN and is now deprecated but is useful and is needed for coding purposes.
- cell_set_preferred_alias^: This is the label that will be shown in the dendrogram and should represent what you would want the cell set to be called in a manuscript or product. If the CCN is applied to a published work, this tag would precisely match what is included in the paper.
- cell_set_aligned_alias^: This is a special tag designed to match cell types across different taxonomies, as described below. As output from `build_nomenclature_table`, this will be blank for all cell sets.
- cell_set_additional_aliases^: Any additional aliases desired for a given cell set, separated by a “|”. For example, this allows inclusion of multiple historical names. As output from `build_nomenclature_table`, this will be blank for all cell sets.
- cell_set_structure^: The `structure`, as described above. Can be modified for specific cell sets below. Multiple `cell_set_structures` can be given separated by a “|”.
- cell_set_ontology_tag^: The `ontology_tag`, as described above. Can be modified for specific cell sets below. Multiple `cell_set_ontology_tags` can be given separated by a “|” and must match cell_set_structure above.
- cell_set_alias_assignee^: By default, the `taxonomy_author`, as described above. If aliases are assigned by different people, additional assignees can be added, separated by a “|”. The format is [preferred_alias_assignee]|[aligned_alias_assignee]|[additional_alias_assignee(s)]. If aliases are added without adding additional assignees, it is assumed that the assignee is the same for all aliases.
- cell_set_alias_citation^: By default, the `taxonomy_citation`, as described above (or can be left blank). In this case, if preferred (or other) aliases are assigned based on a different citation, additional citations can be added by separating using a “|”, with the same rules as described by `cell_set_alias_assignee`. Ideally the DOI for the publication will be used (or another permanent link).
- taxonomy_id^: The `taxonomy_id`, as described above. This should not be changed.
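Several of the columns above follow small formatting rules that are easy to check programmatically. The helpers below are hypothetical (they are not part of the provided scripts) and only sketch the rules as described: fields separated by “|”, one ontology tag per structure, the [preferred]|[aligned]|[additional(s)] order for assignees, and accession IDs that share the numeric part of the `taxonomy_id`.

```r
# Split a pipe-delimited field into its trimmed entries.
split_pipe <- function(x) {
  if (is.na(x) || x == "") return(character(0))
  trimws(strsplit(x, "|", fixed = TRUE)[[1]])
}

# A cell set should list one ontology tag per structure.
structures_match_tags <- function(structure, ontology_tag) {
  length(split_pipe(structure)) == length(split_pipe(ontology_tag))
}

# Resolve assignees following [preferred]|[aligned]|[additional(s)].
# A single entry is taken to apply to all aliases.
parse_alias_assignee <- function(x) {
  parts <- split_pipe(x)
  if (length(parts) <= 1) {
    return(list(preferred = parts, aligned = parts, additional = parts))
  }
  list(preferred = parts[1], aligned = parts[2],
       additional = parts[-(1:2)])
}

# An accession ID is "CS", the digits of the taxonomy_id, "_", a number.
is_valid_accession <- function(accession, taxonomy_id) {
  prefix <- paste0("CS", sub("^CCN", "", taxonomy_id), "_")
  number <- substring(accession, nchar(prefix) + 1)
  startsWith(accession, prefix) & grepl("^[1-9][0-9]*$", number)
}

parse_alias_assignee("Person A|Person B|Person C|Person D")
```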
10. Manually annotate cell sets. All of these annotation steps are optional for each cell set, but some annotations will likely be needed. The `original_label` column can be used to identify cell sets at this stage by matching with the node (or leaf) label shown in the plotted dendrogram. Do not update the `cell_set_accession`, `cell_set_label`, or `taxonomy_id` columns of existing cell sets (although these will need to be provided for added cell sets).
- Add `cell_set_aligned_alias` terms. This slot is designed to match cell types across multiple taxonomies. Ideally these terms will be selected from a semi-controlled vocabulary of terms that are agreed-upon in the relevant cell typing community (e.g., in a respected ontology). For mammalian neocortex we propose a specific format for such aligned aliases:
  - Glutamatergic neuron: [Layer] [Projection] # (e.g., L2/3 IT 4).
  - GABAergic neuron: [Canonical gene(s)] # (e.g., Pvalb 3).
  - Non-neuron: [Cell class] # (e.g., Microglia 2).
  - For any cell type a historical name could be substituted (e.g., Chandelier 1).
- Add other alias terms in the `cell_set_additional_aliases` slot.
- Update other meta-data terms (`cell_set_structure`, `cell_set_ontology_tag`, `cell_set_alias_assignee`, and `cell_set_alias_citation`) for existing cell sets, retaining the format laid out in the previous step.
- Add any additional cell sets. To do this, take the following steps.
  - In a new row, define the `cell_set_accession` as the accession ID whose trailing number is one greater than the largest existing one.
  - Set the `taxonomy_id` to match the other cell sets.
  - If the new cell set corresponds to a combination of cell types present in the tree, the `cell_set_label` must include the numeric components of the `cell_set_labels` for the relevant cell types. For example, if you wanted to build a new cell set that includes “MTG 001”, “MTG 002”, and “MTG 005”, the `cell_set_label` would be set as “MTG 001-002, 005”. If the cell set is unrelated to cell types, it should be given a name distinct from what is shown in the tree (in our example, any name except “MTG #”).
  - Any of the metadata and alias columns can be set as described above.
- Add any other columns to this table. Any added columns will be appended to the dendrogram object at a later step along with all the standard CCN information.
- Save the resulting table as a csv file. This file will be used as input to the next step.
- We recommend checking your work very carefully before proceeding.
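Adding a cell set by hand involves two small computations: finding the next accession ID and writing the `cell_set_label` in range notation. The sketch below shows one way to do both in R. `next_accession` and `make_cell_set_label` are hypothetical helpers, not functions from the provided scripts, and the toy table contains only a subset of the columns.

```r
# Toy nomenclature table with five provisional cell types.
nomenclature <- data.frame(
  cell_set_accession       = paste0("CS201908210_", 1:5),
  cell_set_label           = sprintf("MTG %03d", 1:5),
  cell_set_preferred_alias = paste0("Type", LETTERS[1:5]),
  taxonomy_id              = "CCN201908210",
  stringsAsFactors         = FALSE
)

# Next accession ID: largest trailing number plus one.
next_accession <- function(accessions) {
  prefix  <- sub("_[0-9]+$", "", accessions[1])
  numbers <- as.integer(sub("^.*_", "", accessions))
  paste0(prefix, "_", max(numbers) + 1L)
}

# Collapse cell type numbers into "PREFIX 001-002, 005" notation.
make_cell_set_label <- function(prefix, ids) {
  ids <- sort(unique(as.integer(ids)))
  # A new run starts wherever consecutive numbers differ by more than 1.
  run_id <- cumsum(c(1, diff(ids) != 1))
  runs <- tapply(ids, run_id, function(run) {
    if (length(run) == 1) sprintf("%03d", run)
    else sprintf("%03d-%03d", min(run), max(run))
  })
  paste(prefix, paste(runs, collapse = ", "))
}

# Append a cell set combining MTG 001, MTG 002, and MTG 005.
new_row <- nomenclature[1, ]
new_row$cell_set_accession       <- next_accession(nomenclature$cell_set_accession)
new_row$cell_set_label           <- make_cell_set_label("MTG", c(1, 2, 5))
new_row$cell_set_preferred_alias <- "Example group"  # made-up alias
nomenclature <- rbind(nomenclature, new_row)
rownames(nomenclature) <- NULL
```

With this toy table the new row receives the accession ID `CS201908210_6` and the label `MTG 001-002, 005`.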
Creation of standard nomenclature files
This section includes several steps that take as input the manually annotated nomenclature table and the associated mappings from each cell to its cell type (and potentially other metadata) and output a series of standard nomenclature files. These nomenclature files will ideally be zipped and included as supplemental materials in relevant manuscripts and can be used as input files for databases or web products.
11. Read into R the updated nomenclature file from the previous step.
12. Update the dendrogram and output the results in multiple formats (optional but recommended if you started with a dendrogram).
- Update the dendrogram using the function “update_dendrogram_with_nomenclature”. This code will add information from the table above to the initial dendrogram object. The new cell set alias names (if any) will replace the n## labels from the initial plot. In addition, all of the meta-data read in from the table will be added to the relevant nodes or leaves.
- Plot a pdf of the updated dendrogram
- Save the dendrogram (which includes the complete set of information not visible in the pdf plots) in the R “dendrogram” and “json” formats. We find both useful for different applications at the Allen Institute.
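The idea behind step 12 can be sketched with base R alone. The example below is not the repository's `update_dendrogram_with_nomenclature`; it is a simplified stand-in that relabels the leaves of a toy dendrogram with their preferred aliases, attaches the accession ID as a node attribute, plots a pdf, and saves the object. The toy names and output file names are assumptions, and json export is omitted.

```r
# Toy dendrogram and matching nomenclature rows (made-up names).
centers <- matrix(c(1, 2, 6, 7), ncol = 1,
                  dimnames = list(c("TypeA", "TypeB", "TypeC", "TypeD"),
                                  NULL))
dend <- as.dendrogram(hclust(dist(centers)))

nomenclature <- data.frame(
  original_label           = c("TypeA", "TypeB", "TypeC", "TypeD"),
  cell_set_accession       = paste0("CS201908210_", 1:4),
  cell_set_preferred_alias = c("Alias A", "Alias B", "Alias C", "Alias D"),
  stringsAsFactors         = FALSE
)

# Copy table information onto any node whose label is in the table.
annotate_node <- function(node, table) {
  label <- attr(node, "label")
  row <- if (is.null(label)) NA_integer_ else match(label, table$original_label)
  if (!is.na(row)) {
    attr(node, "label") <- table$cell_set_preferred_alias[row]
    attr(node, "cell_set_accession") <- table$cell_set_accession[row]
  }
  node
}
updated_dend <- dendrapply(dend, annotate_node, table = nomenclature)

# Plot to pdf and save the R object (written to a temporary folder here).
out_dir <- tempdir()
pdf_file <- file.path(out_dir, "updated_dendrogram.pdf")
pdf(pdf_file, height = 5, width = 8)
plot(updated_dend)
invisible(dev.off())

rdata_file <- file.path(out_dir, "updated_dend.RData")
save(updated_dend, file = rdata_file)
```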
13. Read into R the meta-data associated with each cell and specify the variables that will allow linking to the updated nomenclature for each cell set. To do this, create a character vector of `cell_set_accession_ids` that corresponds to each cell used for generating the taxonomy. This variable is used as a starting point to assign all cells to all cell sets.
14. Define cell to cell set mappings based on cell type information and metadata. The result of this section is a data frame where the first column corresponds to the cell `sample_name`, which is a unique cell ID (across all data at the Allen Institute). The remaining columns correspond to the probabilities of each cell mapping to each cell set. In this case we define hard probabilities (0 = unassigned to cell set; 1 = assigned to cell set) but this could be adapted to reflect real probabilities calculated elsewhere.
- Automatically link each cell to each cell set that is available in the dendrogram using the “cell_assignment_from_dendrogram” function (optional).
- Automatically assign cells to cell sets that were defined as combinations of cell types, but that were NOT included in the above section, using the function “cell_assignment_from_groups_of_cell_types”. As written, this function assumes that the `cell_set_label` is assigned using the specific format described above.
- Add cell to cell set mappings based on any other metadata (optional; typically skipped). We present an example cell set including only cells collected from neurosurgical cases.
- Add additional columns to the mapping data frame (optional; typically skipped).
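The mapping described in steps 14 and 15 can be illustrated with a small, self-contained sketch. It does not call the repository's `cell_assignment_from_dendrogram` or `cell_assignment_from_groups_of_cell_types`; instead, the hypothetical helper `parse_cell_set_label` expands a label such as “MTG 001-002, 005” into cell type numbers, and hard 0/1 assignments are derived from those. The `tissue_source` column and its values are made up to stand in for the neurosurgical example.

```r
# Toy inputs (names and values are made up).
cell_metadata <- data.frame(
  sample_name   = paste0("cell_", 1:4),
  cluster_label = c("TypeA", "TypeB", "TypeB", "TypeC"),
  tissue_source = c("neurosurgical", "postmortem",
                    "neurosurgical", "postmortem"),
  stringsAsFactors = FALSE
)
cell_sets <- data.frame(
  cell_set_accession = paste0("CS201908210_", 1:4),
  cell_set_label     = c("MTG 001", "MTG 002", "MTG 003", "MTG 001-002"),
  stringsAsFactors   = FALSE
)
# Number of each provisional cell type, as used in cell_set_label.
leaf_number <- c(TypeA = 1L, TypeB = 2L, TypeC = 3L)

# Expand "MTG 001-002, 005" into c(1, 2, 5).
parse_cell_set_label <- function(label, prefix) {
  body   <- trimws(substring(label, nchar(prefix) + 1))
  pieces <- trimws(strsplit(body, ",", fixed = TRUE)[[1]])
  unlist(lapply(pieces, function(piece) {
    bounds <- as.integer(strsplit(piece, "-", fixed = TRUE)[[1]])
    if (length(bounds) == 1) bounds else seq(bounds[1], bounds[2])
  }))
}

# Hard probabilities: 1 = assigned to the cell set, 0 = unassigned.
cell_leaf <- leaf_number[cell_metadata$cluster_label]
mapping <- data.frame(sample_name = cell_metadata$sample_name,
                      stringsAsFactors = FALSE)
for (i in seq_len(nrow(cell_sets))) {
  members <- parse_cell_set_label(cell_sets$cell_set_label[i], "MTG")
  mapping[[cell_sets$cell_set_accession[i]]] <-
    as.integer(cell_leaf %in% members)
}

# A metadata-based cell set, e.g., cells from neurosurgical cases.
mapping[["CS201908210_5"]] <-
  as.integer(cell_metadata$tissue_source == "neurosurgical")

# Step 15: write the assignments to csv (a temporary file in this sketch).
mapping_file <- tempfile(fileext = ".csv")
write.csv(mapping, mapping_file, row.names = FALSE)
```

Real probabilities computed elsewhere could replace the 0/1 values in the same columns.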
15. Output the cell to cell_set assignments as a csv file.
16. Save the output files (the nomenclature table, the dendrogram files, and the cell to cell set assignments) into a single zip file for direct inclusion in a relevant manuscript.
References
Hodge RD, Bakken TE, Miller JA, Smith KA, Barkan ER, Graybuck LT, Close JL, Long B, Johansen N, Penn O, Yao Z, Eggermont J, Hollt T, Levi BP, Shehata SI, Aevermann B, Beller A, Bertagnolli D, Brouner K, Casper T, et al. 2019. Conserved cell types with divergent features in human versus mouse cortex. Nature 573:61–68. DOI: 10.1038/s41586-019-1506-7, PMID: 31435019.
Miller JA, Gouwens NW, Tasic B, Collman F, van Velthoven CT, Bakken TE, Hawrylycz MJ, Zeng H, Lein ES, Bernard A. 2020. Common cell type nomenclature for the mammalian brain. eLife 9:e59928. DOI: 10.7554/eLife.59928, PMID: 33372656.