By: Nicole Vasilevsky, Matthew H. Brush, Holly Paddock, Laura Ponting, Shreejoy J. Tripathy, Gregory M. LaRocca, Anita de Waard, Maryann Martone, Anita Bandrowski, Yvonne Bradford, Ceri Van Slyke, Pascale Gaudet, Susanna-Assunta Sansone, and Melissa A. Haendel
PART I: Summary Reporting Guideline Checklist
Introduction
The purpose of the guidelines below is to determine what information is required to uniquely identify a resource, either by using "extrinsic" identifiers or by providing sufficient metadata about "intrinsic" attributes to characterize the resource. The next step, beyond the scope of this document, is to provide more specific recommendations on how to document this information: for example, which authoritative sources of identifiers should be used to reference a specific reagent or type of biological sequence, and in what format or syntax complex sequence-related information should be described.
Below we present a summarized list of recommendations for referencing sequence molecules in general, and then six specific research resource types (antibodies, organisms, cell lines, constructs, gene knockdown reagents, and software tools). Each recommendation is discussed in more detail in Part II of this document.
Recommendation 1 — Sequence Molecule Identification
Sequence identification is a central aspect of "intrinsic" identifiability for many resource types. General recommendations here require provision of at least one of the following:
- Directly reporting the full sequence of the molecule
- Reference to a resource from which the full sequence can be determined (e.g. an Entrez Gene ID, a GenBank sequence accession number, or a dbSNP accession number)
- Recourse to approximate a sequence or generate a sequence molecule experimentally through unambiguous reference to related entities (e.g. a primer set and biological source of the template for a PCR reaction used to generate a DNA construct insert).
Example: dbSNP:rs12345
Recommendation 2 — Reporting of Antibodies
Provide at least one of the following:
- A stable public identifier from the Antibody Registry (www.antibodyregistry.org)
- For antibodies lacking identifiers, submit the following information at antibodyregistry.org/add (an Antibody Registry ID can be obtained this way for such antibodies) and provide sufficient protocol details on production of the antibody to allow reproduction. Minimally:
- the host organism (NCBI taxonomy ID and/or strain ID from a MOD or vendor)
- the identity of the immunogen used (as per sequence molecule identification recommendations above)
Example: Thermo Fisher Scientific Cat# 13-1900, Lot#123, RRID:AB_2533005
Recommendation 3 — Reporting of Model Organisms
- For referencing species, an NCBI taxonomy name and identifier should be used. For example, [NCBItax:63221]. This can be omitted when more specific reference to a strain is known (see below).
- For specific 'wild-type' strains:
- an official name or identifier from an authoritative source [1], AND
- a source where the organism was obtained (Stock center and full catalog number).
- For genetically modified strains, report the RRID (based on the genotype identifier or the stock center identifier) or all known genotype information:
- reference an identifier in a Model Organism Database (MOD) or other authoritative source, OR
- report genotype information directly, including genetic background, breeding information, and known sequence, genomic location, and zygosity of alterations.
Example: AGSC Cat# 100A, RRID:AGSC_100A
Recommendation 4 — Reporting of Cell Lines
- For standard publicly available lines (amended from the original instructions because Cellosaurus has joined the RRID initiative):
- an official identifier from the Cellosaurus database https://web.expasy.org/cellosaurus/, AND
- a source from which it was obtained (Vendor, Catalog Number)
- For lab-generated cell lines, identifiers should be obtained from Cellosaurus, and authors should provide information regarding ALL of the following:
- the organismal source (including known genotype information where applicable),
- the anatomical source
- a developmental stage of origin
- important procedures applied in establishing the line
Example: ATCC Cat# CCL-20.2, RRID:CVCL_2260
Recommendation 5 — Constructs
- Reference to a repository or vendor can be provided when applicable (e.g. Addgene: http://www.addgene.org/)
- In the majority of cases where a construct is not publicly documented:
- construct backbone sequence should be reported directly or resolvable through reference to a vendor or repository identifier, AND
- construct insert sequences should be uniquely identified according to sequence identification criteria in Recommendation 1 above.
Example: RRID:Addgene_74745
Recommendation 6 — Reporting of Knockdown Reagents
As these are typically short oligos, complete sequences should be reported according to the sequence criteria outlined above. However, reference to catalog numbers or other databases that record oligo sequences is also sufficient.
Example: Lipofectamine Transfection Reagent Invitrogen Cat# 18324020
Recommendation 7 — Reporting of Software Tools and Databases
Non-commercial software tools can be reported using their name, URL, and identifier from a cross-mapped repository. For a list of repositories, see [2] below.
Example: ImageJ, Version 1.2, RRID:SCR_003070
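The examples in the recommendations above share a common shape: the literal prefix "RRID:", a short authority prefix (AB, CVCL, SCR, Addgene, and so on), an underscore, and a local identifier. As a minimal sketch, a curator could pull such tokens out of a methods sentence as follows. The pattern is an assumption inferred only from the examples in this document, not an official RRID grammar, and the sentence itself is a hypothetical one assembled from those examples.

```python
import re

# Assumed pattern, inferred only from the examples in this document:
# "RRID:" + alphabetic authority prefix + "_" + local identifier.
# The local identifier may contain ":", "_" or "-" internally
# (e.g. IMSR_JAX:009592) but must end in a letter or digit.
RRID_PATTERN = re.compile(
    r"RRID:\s*([A-Za-z]+)_([A-Za-z0-9]+(?:[:_-][A-Za-z0-9]+)*)"
)


def find_rrids(text):
    """Return (prefix, local_id) pairs for every RRID-like token in text."""
    return RRID_PATTERN.findall(text)


# Hypothetical methods sentence built from the examples above.
methods_sentence = (
    "Cells (ATCC Cat# CCL-20.2, RRID:CVCL_2260) were stained with an antibody "
    "(Thermo Fisher Scientific Cat# 13-1900, RRID:AB_2533005) and imaged "
    "using ImageJ (RRID:SCR_003070)."
)

for prefix, local_id in find_rrids(methods_sentence):
    print(prefix, local_id)
```

A sentence with no RRID returns an empty list, which is one simple way to flag a resource mention that relies on a vendor name alone.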
PART II: Expanded Discussion of ‘Identifiability’ Requirements
As a brief prelude to our recommendations, we want to consider two important facets tied into the notion of "identifiability" as we have characterized it. The first relates to an ability to find a unique public record of the resource, such as a vendor catalog or database entry. This information will typically point to a way to obtain the resource, by purchasing from a vendor or contacting authors of the reporting publication. Our ability to provide this type of metadata requires establishing authoritative sources for minting and managing identifiers, and a means for authors and curators to discover them. We might call this first facet "extrinsic" identifiability, as it results from simply tagging a resource with some external identifier that is supported by a system enabling its discovery and use. At this level, no actual knowledge about the resource itself is required besides this artificial identifier. Such an identifier alone is insufficient for supporting the actual application of the resource in a study and the analysis of the resulting data if key biological and experimental attributes of the resource are not resolvable from it in some way.
A second facet of "identifiability" concerns an actual characterization of the core biological and experimental attributes that define a resource. This is particularly relevant for resources that are not tied to “extrinsic” identifiers (e.g. things like constructs or immunogens, when there is no registry or effort for assigning and managing such identifiers). Determining what information is critical in this regard, and providing access to such useful metadata, requires a deeper understanding of each resource type, and consideration of each separately in light of its biological attributes and experimental context of use. This second facet of identifiability we might call "intrinsic" identifiability, as it relates to the inherent qualities of the resource that must be known to create or effectively apply it.
The initial LAMHDI guidelines on the FORCE11 site begin to address the first facet of extrinsic identifiability by suggesting authoritative sources for identifiers of genes (NCBI gene IDs), organisms/strains (NCBI Taxonomy/MOD strain/genotype IDs), and reagents (vendor catalog numbers, Ab Registry IDs). Here, it is noted that additional efforts are needed to establish authorities for reagent types where none exist, and ensure that existing authorities adhere to common standards for assigning and managing identifiers. What is missing in these guidelines is a characterization of the key inherent attributes of these resource types that are required for identifying them independent of an assigned identifier, and for guiding their effective application. While it is assumed that such metadata will be provided by the ‘owner’ of an identifier (and resolvable through the identifier to a catalog or database entry), this is not always the case. And more importantly, many resources exist that lack extrinsic identifiers altogether, such as lab-sourced reagents or organisms. Here, defining what types of descriptive metadata should be provided for identifiability is crucial. Below we consider several resource types in this light, and provide recommendations concerning key attributes that are required for intrinsic identifiability independent of an identifier. But to begin, we discuss the issue of sequence identification, as this is tied to the identifiability of several specific resource types.
Recommendation 1 — Sequence Molecule Identification
Sequence identification is a central aspect of "intrinsic" identifiability for many resource types. Examples include specifying the sequence of an immunogenic peptide for a lab-sourced antibody, the sequence of a DNA insert in a construct, or the sequence of a transgene incorporated into the genome of an organism or cell line. In such cases, a meaningful understanding of the resource requires reporting the sequence of some biological molecule (to the extent that it is known). This can be achieved in many ways:
- by directly providing the full sequence
- by referencing a resource from which the full sequence can be determined, e.g. by providing a gene ID or accession number
- by providing recourse to estimate a sequence or generate a sequence molecule experimentally, through unambiguous reference to related entities (for example, a primer set and biological source of the template for a PCR reaction used to generate a DNA construct insert).
Note that resolution to a fully specified sequence is not always required or possible, as this information is sometimes not known. Note also that care should be taken when using gene identifiers to reference resources at the cDNA or protein level: a single gene may give rise to many different transcripts or peptides (e.g. through alternative splicing), so specification of a gene sequence may not be sufficient to resolve a single cDNA or peptide sequence. Here, we will need to decide on a strategy to provide unique reference to sequences at the gene, mRNA/cDNA, and protein level.
Below we will discuss where different resource types may require different levels of specificity in defining a sequence. In any case, unique reference to a gene should be provided using an Entrez Gene ID. In some cases, more precise reporting of specific parts of this gene or specific variants/alleles of a gene is required. Here, the mechanism for providing unique reference using existing identifiers (such as those in GenBank’s plethora of resources) is more complicated, and beyond the scope of this document. This includes issues such as how to specify that a specific splice variant is incorporated into a resource, or that a specific subregion of a complete protein is used in generating an antibody, which are points for future consideration.
Good examples:
“Human IL-12 (Accession # P29459)”
“Human IL-12 residues 1-32 (ATGTGGCCCCCTGGGTCAGCCTCCCAGCCACC)”
“Human IL-10 (Gene ID: 3586; Primers: Forward: GGACTGATCGTATATATTC, Reverse: TTAAAAAAGTTGATTTCCT)”
“Trp53 (MGI:98834)”
Bad example: “IL-12”
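The "at least one of three routes" rule above can be sketched as a small record type. This is only an illustration under stated assumptions: the class name, field names, and route labels are hypothetical and not part of the guidelines, and the check only tests whether a field is filled in, not whether its content is correct.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SequenceReport:
    """Hypothetical record of how a publication identifies one sequence."""
    label: str
    full_sequence: Optional[str] = None        # route 1: sequence reported directly
    accession: Optional[str] = None            # route 2: gene ID or accession number
    primers: Optional[Tuple[str, str]] = None  # route 3: (forward, reverse) primers
    template_source: Optional[str] = None      # route 3: source of the PCR template

    def satisfied_routes(self):
        """List which of the three routes this report satisfies."""
        routes = []
        if self.full_sequence:
            routes.append("direct")
        if self.accession:
            routes.append("reference")
        # Route 3 needs both the primer set and the template source.
        if self.primers and self.template_source:
            routes.append("experimental")
        return routes

    def is_identifiable(self):
        """True when at least one route is satisfied."""
        return bool(self.satisfied_routes())


good = SequenceReport("Human IL-12", accession="P29459")
bad = SequenceReport("IL-12")
print(good.satisfied_routes())  # ['reference']
print(bad.is_identifiable())    # False
```

Keeping the primer set and the template source as separate fields makes the third route fail visibly when only one of the two is reported.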
Recommendation 2 — Reporting of Antibodies
Unique antibody identification can be achieved in two ways:
- Provision of an identifier from the Antibody Registry (www.antibodyregistry.org), a universal registry/database of antibodies. Note that if an antibody is not found in the Antibody Registry, it should be added by the user, as this registry aims to be a comprehensive single source for antibodies, including lab-generated antibodies. Note also that our recommendation does not require provision of a lot or batch number, although issues with reproducibility across rounds of production could be used to justify such a requirement.
- For antibodies that lack identifiers and are not publicly available, provision of sufficient protocol details on production of the antibody to allow reproduction. This minimally includes specifying the host organism and the identity of the immunogen used. For peptide immunogens, the criteria for sequence identification in Recommendation 1 above apply, i.e. the sequence is reported in full, or a unique identifier resolving to a single gene product is provided (including start/end coordinates against this reference sequence when only a portion of a complete protein is used).
Good examples:
Commercial Ab: (Millipore Cat# AB1542 RRID:AB_90755)
Non-commercial Ab: “OCAM antibody (K. Mori, RIKEN Cat# OCAM, RRID:AB_2314995)”
Bad example: “anti-Wnt3 antibody (Santa Cruz Biotechnology, Santa Cruz, CA)”
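When an immunogen is only a portion of a complete protein, reporting coordinates against a reference sequence lets a reader recover the exact peptide. The sketch below shows that lookup. It assumes 1-based, inclusive coordinates, which is a common convention for protein positions but is not specified in this document, and it uses an artificial 20-residue string rather than a real protein.

```python
def subregion(reference, start, end):
    """Return residues start..end of reference (assumed 1-based, inclusive)."""
    if not 1 <= start <= end <= len(reference):
        raise ValueError(
            f"coordinates {start}-{end} fall outside 1-{len(reference)}"
        )
    # Convert the 1-based inclusive range to Python's 0-based half-open slice.
    return reference[start - 1:end]


# Artificial reference: the 20 standard amino acid letters, not a real protein.
toy_reference = "ACDEFGHIKLMNPQRSTVWY"
print(subregion(toy_reference, 3, 7))  # DEFGH
```

Rejecting out-of-range coordinates, rather than silently truncating as a plain slice would, catches the case where the coordinates were given against a different reference sequence or isoform.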
Recommendation 3 — Reporting of Model Organisms
- For all organisms an NCBI taxonomy name and identifier should be used to indicate species (e.g. "Mus musculus, NCBI Taxonomy ID:10090"). This can be omitted when more specific reference to a strain is provided (see below)
- For 'wild-type' strains (such as 'C57BL/6' mice or 'AB' zebrafish), an unambiguous name or identifier from an authoritative source [1] should be provided, as well as a source vendor, repository, or lab.
- For genetically modified strains, reference to an identifier in a MOD or other authoritative source is sufficient. In the absence of this, all known genotype information should be reported. This includes genetic background, breeding information relevant to propagating the variation, and precise alterations identified in or introduced into the genome (including the known sequence, genomic location, and zygosity of alterations). For random transgene insertions, the genomic location of the insertion(s) need not be known, but the precise sequence of the inserted transgene should be unambiguously resolvable according to the sequence identification criteria above. For targeted alterations, reporting the sequence and location of an alteration is required to the degree that it is known, according to the sequence identification criteria above. Note that this information can be provided directly in a publication, or through reference to an external source such as a MOD record or catalog offering where such information is provided.
Good example:
“B6.129(Cg)-Kcnn2tm1.1Jpad/J (RRID:IMSR_JAX:009592) (Background: C57BL/6, SK2delta targeted knockout (Gene ID: 140492), obtained from Jackson Labs)”
Bad example: “SK2delta knockout mouse”
Recommendation 4 — Reporting of Cell Lines
For standard, publicly available lines, an unambiguous name or identifier from an authoritative source such as the Cell Line Ontology (http://www.clo-ontology.org/), ATCC (http://www.atcc.org/) or Coriell (http://ccr.coriell.org/) should be provided, as well as a source vendor, repository, or lab. For novel lab-generated cell lines, authors should provide information regarding (1) the organismal source (according to the criteria in Recommendation 3 above, and including genotype information where applicable); (2) the anatomical source; (3) a developmental stage of origin; and (4) any unique or critical procedures applied to establish a stable lineage of cells. Additionally, some indication of passage number is recommended but not strictly required. For lines genetically modified in vitro, criteria are analogous to those for genetically modified organisms, including known information about sequence alterations, genomic location, and zygosity/copy number.
Good example: “MCF 10A cell line (IZSLER Cat# BS CL 174, RRID:CVCL_0598)”
Bad example: “MCF cells”
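For novel lab-generated lines, all four items listed above are expected, so a submission checklist can simply report which ones are still missing. The following is a rough sketch of such a check. The field names are hypothetical labels for the four items, and the draft record is an invented example, apart from the species string, which is taken from Recommendation 3.

```python
# Hypothetical field names for the four items required of lab-generated lines.
REQUIRED_FIELDS = (
    "organismal_source",
    "anatomical_source",
    "developmental_stage",
    "establishment_procedures",
)


def missing_fields(record):
    """Return the required fields that are absent or blank in a metadata dict."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


draft = {
    "organismal_source": "Mus musculus, NCBI Taxonomy ID:10090",
    "anatomical_source": "kidney",  # hypothetical value
    "developmental_stage": "",      # left blank by the author
}
print(missing_fields(draft))  # ['developmental_stage', 'establishment_procedures']
```

Passage number is deliberately left out of the required fields, since the text above recommends it without strictly requiring it.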
Recommendation 5 — Constructs
No authority exists for providing identifiers for specific constructs; however, databases such as Addgene and PlasmID do contain information about specific constructs. If a construct is not identifiable in an existing database, it is best practice to reference identifiers for the construct backbone and the gene/genetic sequence of the insert. The construct backbone should be unambiguously identified and resolvable to a complete vector sequence (typically through a vendor or repository). The sequence of construct inserts should be identifiable according to the sequence identification criteria above. Most expression constructs incorporate cDNA, so it is particularly important that the exons included in the insert are resolvable when more than one splice variant exists for a gene transcript. This means that specifying the name of a gene or of the protein expressed may not be sufficient if this does not allow for unambiguous resolution to a cDNA sequence. It is ideal to include a precise description of the MCS restriction sites used for cloning. The relative location and sequence of epitope tags and regulatory sequences (promoters, enhancers, etc.) should be specified (e.g. 'N-terminal dual FLAG tag' is sufficient).
Good example:
“pCruzHA-SIRT1 (Backbone: pCruzHA, Santa Cruz Biotechnology, Cat # sc-5045; Insert: SIRT1 (Gene ID: 93759); Insert cloned into BamHI and HindIII restriction sites)”
Bad example: “pCruz-SIRT1”
Recommendation 6 — Reporting of Knockdown Reagents
Specific and complete sequences should be reported according to the sequence criteria outlined above. This will typically be direct reporting of the sequence, as these are generally short oligos. However, reference to catalog numbers or other databases that record oligo sequences is also sufficient.
Good examples:
“Morpholino Moarx targeting the intron 2 – exon 2 junction (5’-GCGTCATATTTACCTGGTGAACACA)”
“MO1-gata1a morpholino (ZDB-MRPHLNO-050208-10)”
Bad example: “Moarx Morpholino”
Recommendation 7 — Reporting of Software Tools
Specific information about each software tool, including its identifier, date accessed, and/or platform, is requested. For all non-commercial tools, the tool should be registered with an authority [see 2 below], and whenever possible the source code should be deposited in a repository.
Good examples:
“generated using ImageJ 1.46r software RRID:SCR_003070”
“Graphpad Prism Version 5.03, RRID:SCR_002798”
Bad example: “Prism (GraphPad Software)”
[1] Sources for unique identifiers for model organisms:
Mammalian Models
Mouse: http://www.informatics.jax.org/ or http://www.findmice.org/
Rat: http://rgd.mcw.edu/
Pig: http://www.nsrrc.missouri.edu/StrainAvail.asp
Non-Mammalian Models
S. cerevisiae (budding yeast): http://www.yeastgenome.org/
S. pombe (fission yeast): http://www.nih.gov/science/models/Schizosaccharomyces/index.html
Neurospora (filamentous fungus): http://www.nih.gov/science/models/neurospora/
D. discoideum (social amoebae): http://dictybase.org/
C. elegans (round worm): http://www.wormbase.org/
Tetrahymena: https://tetrahymena.vet.cornell.edu/index.php
Daphnia: http://www.nih.gov/science/models/daphnia/
D. melanogaster (fruit fly): http://flybase.org/
D. rerio (zebrafish): http://zfin.org/
Xenopus (frog): http://www.xenbase.org/common/
Gallus gallus (chicken): http://128.175.126.109/cgi-bin/gbrowse/gallus/
Other Model Organisms
Arabidopsis: http://www.arabidopsis.org/
[2] Sources for unique identifiers for software tools and databases:
SciCrunch Registry: http://scicrunch.org/resource
BioSharing databases: https://biosharing.org/databases/