The following is a set of recommendations that arose from a workshop held at NIH in August 2010 as part of the Link Animal Models to Human Disease (LAMHDI) initiative, a database of animal models that allows cross-species searching. The recommendations are still relevant, although they probably need updating, e.g., to use full URIs rather than bare identifiers. Nevertheless, I believe it is still useful to consider the struggles that curators (and by extension, computers) have in identifying key entities within a scientific paper.
Publishing in the 21st century: Minimal (really) data standards
Although researchers write papers for other researchers, the primary consumers of data and information these days are not humans but computers. Computers find data, display and analyze them — provided that the data are structured in a way that allows these functions to happen. The form of the scientific paper has been honed over many hundreds of years for humans to use. But scientific papers are difficult for a computer to understand. The beauty and frustration of human language is that the same word or phrase can mean many things and the same thing can be described many different ways. While poetry is enriched by this mystery, scientific literature is hamstrung by it. Sometimes even expert scientific curators have a difficult time extracting accurate information from an article when the information is incomplete, ambiguous, or (in more cases than scientists perhaps would care to admit) missing altogether. It is no surprise that computers have a problem.
We are on the cusp of an evolution in scientific publishing, and that evolution may involve new ways of reporting information, e.g., the structured digital abstract, the use of ontologies [1],[2]. However, during this evolution, while tools and strategies are being developed and tested, those of us who are charged with providing access to information in the scientific literature — database curators — have identified three simple practices that could make extracting relevant information from the literature more efficient and thorough. We recommend that scientific journals require that authors do the following in order to meet the publishing needs of the 21st Century:
- Provide gene accession numbers for all genes referenced in the methods section of a paper, per http://www.ncbi.nlm.nih.gov/gene
- Identify the species of the study subject, and the species from which each gene product is derived, using the NCBI taxonomy; identify strains using the model organism databases for mice, rats, worms, zebrafish and Drosophila, employing any existing unique identifiers and correct species-specific nomenclature[3].
- Provide catalog numbers and vendor information for all reagents and animals described in the methods section of a paper.
These requirements are minimal, really; authors will not find them onerous, reviewers can easily enforce them, and publishers will not find them difficult to implement.
Our impetus was a recent invitation-only meeting at NIH** on the initiative to Link Animal Models to Human Disease (LAMHDI), a database of animal models that allows cross-species searching. We explored best practices that had been covered by others (e.g., http://zfin.org/zf_info/author_guidelines.html; http://wiki.geneontology.org/index.php/Letter_to_Editors).
Our goal is to have authors use a permanent unique identifier, a “Social Security number”, if you will, for specific concepts in a paper. Even if a concept has many different names, all those names map to the same identifier. Similarly, different concepts cannot share the same identifier. Thus, this identifier serves to disambiguate shared names, and link the same concepts across different papers and databases.
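The behaviour described above, many names resolving to one identifier while no identifier is shared between concepts, can be sketched as a small lookup structure. This is only an illustration: the class `IdentifierRegistry` and the identifiers `EX:0001` and `EX:0002` are hypothetical placeholders, not real accession numbers, and the GAD synonyms are used purely as an example of a shared name.

```python
from collections import defaultdict


class IdentifierRegistry:
    """Toy registry: each concept has one identifier and any number of names."""

    def __init__(self):
        self._names_by_id = defaultdict(set)  # identifier -> all known names
        self._ids_by_name = defaultdict(set)  # lowercased name -> identifiers

    def register(self, identifier, *names):
        """Attach one or more names to a single identifier."""
        for name in names:
            self._names_by_id[identifier].add(name)
            self._ids_by_name[name.lower()].add(identifier)

    def lookup(self, name):
        """Return the sorted identifiers that a free-text name could refer to."""
        return sorted(self._ids_by_name.get(name.lower(), set()))

    def is_ambiguous(self, name):
        """A name is ambiguous when it maps to more than one identifier."""
        return len(self.lookup(name)) > 1


# Hypothetical placeholder identifiers; real papers would cite database IDs.
registry = IdentifierRegistry()
registry.register("EX:0001", "GAD1", "GAD67", "GAD")
registry.register("EX:0002", "GAD2", "GAD65", "GAD")

print(registry.lookup("GAD67"))      # ['EX:0001']
print(registry.lookup("GAD"))        # ['EX:0001', 'EX:0002']
print(registry.is_ambiguous("GAD"))  # True
```

A name such as "GAD" alone leaves a curator with two candidates, whereas citing the identifier next to the name removes the guesswork.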
**Editor's note: Workshop was held in August of 2010
Below we give details on our three recommendations.
Recommendation 1 — Gene accession numbers:
NCBI’s Entrez Gene is one of the resources that uses permanent unique identifiers for gene symbols and names. Data in Entrez Gene result from a mixture of curation and automated analyses. Annotation of sequences is integrated with information from collaborating model organism databases, literature review and public users (with curation by RefSeq staff as required). Once a gene is assigned a unique accession number, that number can never be re-used to identify another gene, even if the gene assignment is later determined to be in error. Using this accession number in papers alongside the gene name and abbreviation lets both computers and humans identify the gene unambiguously. Because the accession number is used universally, it also links that gene to the many databases and knowledge sources that contain the same identifier. Even if a gene has many different names, or if many different genes or other types of entities are called by the same name, curators and automated agents can easily identify the gene by its unique accession number[4].
Recommendation 2 — Organism identification:
NCBI provides a taxonomy ID for each major species. Within a species, many unique strains and stocks may be developed, each carrying many sequence variants, mutations, and genetically modified genes. Each of the major model organism databases (MGI, ZFIN, WormBase, FlyBase, RGD) provides strain IDs for these genetically unique strains and stocks. If authors use one of these IDs in their papers, a search agent (or a human) can confidently relate the research results to the right organism. These registries also offer standardized, detailed characterizations of organisms’ genotypic information, so authors do not have to provide it. Finally, the registries help researchers obtain identifiers for new organisms, ensuring accessibility and comparability with all other records.
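As a rough sketch of what an author-side record for an organism might hold, the snippet below pairs a species-level NCBI Taxonomy ID with an optional strain ID from a model organism database. The `OrganismRef` class and its `cite()` wording are assumptions made for illustration, not a format prescribed by any journal or database. The ferret and zebrafish AB identifiers are the ones used in the Appendix examples; the zebrafish taxonomy ID 7955 is added here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OrganismRef:
    """Hypothetical record tying an organism mention to its identifiers."""
    common_name: str
    taxonomy_id: int                  # NCBI Taxonomy ID for the species
    strain: Optional[str] = None      # strain or stock name, if any
    strain_db: Optional[str] = None   # e.g. "ZFIN", "MGI", "RGD"
    strain_id: Optional[str] = None   # ID issued by that database

    def cite(self) -> str:
        """Render the organism as it might appear in a methods section."""
        text = f"{self.common_name} (NCBI Taxonomy ID: {self.taxonomy_id})"
        if self.strain and self.strain_db and self.strain_id:
            text += f", strain {self.strain} ({self.strain_db} ID: {self.strain_id})"
        return text


ferret = OrganismRef("ferret", 9669)
zebrafish_ab = OrganismRef("zebrafish", 7955, "AB", "ZFIN", "ZDB-GENO-960809-7")

print(ferret.cite())
print(zebrafish_ab.cite())
```

Keeping the species ID and the strain ID as separate fields mirrors the two levels described above: the taxonomy names the species, and the model organism database names the genetically distinct strain.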
Recommendation 3 — Reagent identification:
Experimental results in bioscience are fundamentally reliant on reagents. For gene expression studies, antibodies are a critical tool for providing detailed spatial information on the localization of gene products. Antibodies are generally made to specific sequences that may be further modified through phosphorylation or some other event. As any experienced anatomist or cell biologist knows, different antibodies to the same protein can give very different results, even within the same laboratory. Identifying the reagents used is thus critical, not only for data mining, but also for experimental troubleshooting. Yet many papers bury or leave out this information. Curators may have to track reagents through several papers when they find sections like “…we used the reagents described in study [X]…”. Worse, sometimes the information provided may not be sufficient to identify a specific reagent. For example, a paper may say that the researchers used a mouse monoclonal antibody against GAD from Sigma – and Sigma has multiple mouse monoclonal antibodies against GAD.
Ideally, each reagent should have a unique identifier; organizations such as NIF are working on tools to provide that (http://antibodyregistry.org). In the meantime, authors should identify reagents by both the vendor and the catalog number. We recognize that catalog numbers are not foolproof: different vendors may sell the same reagent under different catalog numbers and catalog numbers may be re-used. (Indeed, some manufacturers of gene chips use identifiers over and over again for different probe sets.) However, the vendor/catalog number identifier will go a long way in providing more accurate information. Perhaps as important, providing that information is not a big burden on authors, and may be easier than the current practice at some journals of requiring the location of a vendor. Some journals, e.g., the Journal of Comparative Neurology, are already requiring a more complete accounting of antibodies used.
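To illustrate why the vendor/catalog number pair is more useful than a prose description, here is a minimal sketch that treats the pair as a composite key. The `Reagent` class, the helper functions, and the Sigma catalog numbers `X0001` and `X0002` are hypothetical and exist only for the demonstration; the US Biological entry reuses the catalog number quoted in the Appendix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reagent:
    vendor: str
    catalog_number: str
    description: str

    @property
    def key(self):
        """Normalized (vendor, catalog number) pair used as the identifier."""
        return (self.vendor.strip().lower(), self.catalog_number.strip().upper())


def find_by_description(reagents, vendor, phrase):
    """What a curator can do when a paper gives only vendor and description."""
    return [r for r in reagents
            if r.vendor.lower() == vendor.lower()
            and phrase.lower() in r.description.lower()]


def find_by_key(reagents, vendor, catalog_number):
    """What a curator can do when vendor and catalog number are both given."""
    wanted = (vendor.strip().lower(), catalog_number.strip().upper())
    return [r for r in reagents if r.key == wanted]


# Hypothetical catalog; the two Sigma catalog numbers are placeholders.
catalog = [
    Reagent("Sigma", "X0001", "mouse monoclonal anti-GAD"),
    Reagent("Sigma", "X0002", "mouse monoclonal anti-GAD"),
    Reagent("US Biological", "G1016", "monoclonal mouse antibody bd-17"),
]

print(len(find_by_description(catalog, "Sigma", "anti-GAD")))  # 2
print(len(find_by_key(catalog, "US Biological", "g1016")))     # 1
```

The sketch does not address the caveats noted above, such as re-used catalog numbers or the same reagent sold by several vendors; a registry-issued identifier would be needed for that.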
Technology is changing rapidly, and automated search and analysis agents are getting better at extracting meaning from unstructured information. But they are far from perfect. The simple steps outlined here to better identify the genes and organisms under investigation and the reagents used will accelerate the development and effectiveness of the algorithms and maximize the utility of hard-earned published results. We can then turn our curators to more important work, extracting meaning and mining knowledge for new insights. Our curators can undertake the subtle distillation of meaning, rather than spend time emailing authors for basic information. The three key recommendations described here are neither onerous nor controversial because they utilize existing technologies and informatics resources. Thus far, isolated efforts by individual communities have not been successful. We believe that it is time for those charged with providing access to the literature, NCBI and the journals, to take a firm stand on adapting scientific publishing for the 21st century.
We urge you to adopt our recommendations.
Contributors to this letter: Maryann Martone, UCSD (Neuroscience Information Framework; http://neuinfo.org); Janan Eppig, Jackson Laboratories; Judith Turner (Turner Consulting Group; LAMHDI, http://lamhdi.org); Mark Ellisman, UCSD; Anita Bandrowski, UCSD; Monte Westerfield, University of Oregon; Brian Canada, PSU; Keith Cheng, Penn State College of Medicine; Kara Dolinski, Princeton; Mike Tyers, Princeton; Dave Anderson, Washington University
Appendix:
Examples*:
Organism: “We compared the horizontal optokinetic reaction (OKR) and response properties of retinal slip neurons in the nucleus of the optic tract and dorsal terminal nucleus (NOT-DTN) of albino and wild-type ferrets (Mustela putorius furo; NCBI Taxonomy ID: 9669).” (Hoffman et al. J Neurosci. 2004 Apr 21;24(16):4061-9).
Strain: “Wild-type zebrafish strains AB (ZFIN ID: ZDB-GENO-960809-7), … and Tübingen (ZFIN ID: ZDB-GENO-990623-3) were kept and bred as described [50].” (Yang et al., Genome Biol. 2007;8(10):R227)
Antibodies: “… immunolabeling of the GABAAR _1, _2, _3, and _5 subunit, in each respective mice (Kralic et al., 2006). The monoclonal mouse antibody bd-17 (US Biological, bovine, cat # G1016; 1:400) directed against both _2 and _3 subunits of GABAARs, recognize the major GABAAR …” (Sadlauod et al., Journal of Neuroscience, March 3, 2010; 30(9):3358–3369)
* These sentences were extracted from the referenced articles. However, the identifiers for organism and strain were inserted by the authors of this letter for demonstration purposes; they were not supplied by the original authors. The antibody information was, however, included in the paper and is a good example of the recommended best practice.
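Once identifiers are written inline as in the organism and strain examples above, even a very simple program can pull them out. The sketch below uses regular expressions; the label strings ("NCBI Taxonomy ID:", "ZFIN ID:") and the ZFIN ID shape are assumptions inferred only from the two example sentences, not an official grammar for either database.

```python
import re

# Patterns inferred from the example sentences; treat them as assumptions.
ID_PATTERNS = {
    "ncbi_taxonomy": re.compile(r"NCBI Taxonomy ID:\s*(\d+)"),
    "zfin": re.compile(r"ZFIN ID:\s*(ZDB-[A-Z]+-\d+-\d+)"),
}


def extract_identifiers(text):
    """Return every identifier found in the text, grouped by source."""
    return {label: pattern.findall(text)
            for label, pattern in ID_PATTERNS.items()}


organism_sentence = (
    "albino and wild-type ferrets "
    "(Mustela putorius furo; NCBI Taxonomy ID: 9669)"
)
strain_sentence = (
    "Wild-type zebrafish strains AB (ZFIN ID: ZDB-GENO-960809-7), … and "
    "Tübingen (ZFIN ID: ZDB-GENO-990623-3) were kept and bred as described"
)

print(extract_identifiers(organism_sentence))
print(extract_identifiers(strain_sentence))
```

Without the inline identifiers, the same program would have to guess what "AB" or "ferrets" refers to, which is exactly the problem curators face today.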
[1] Ceol A et al., FEBS Lett. 2008 Apr 9;582(8):1171-7. Epub 2008 Mar 6.
[3] Links to the nomenclature standards for each database are provided at the end of this letter
[4] For more information on the scope of the gene name problem, see Hirschman et al. (2010) Molecular Genetics and Genomics 283, Number 5, 415-425.
Nomenclature standards for each of the model organism databases
Mice: http://www.informatics.jax.org/mgihome/nomen/index.shtml
Rats: http://rgd.mcw.edu/nomen_rules.html
Flies: http://flybase.org/static_pages/docs/nomenclature/nomenclature3.html
Zebrafish: http://zfin.org/zf_info/nomen.html
Worms: http://www.wormbase.org/about/userguide/nomenclature
Addendum
I would also add that we should publish the complete set of materials in methods within each paper, rather than referencing a previous paper (and another, and another).