Resource Identification and Tracking in the Neuroscience Literature
Executive Summary and Action Items
1) Perform pre-pilot project (2 months; Resource Identification Group: NIF, NITRC, INCF, Monarch, Cross-Ref, antibodies-online, eagle-i and other interested parties):
- Form the Resource Identification Group: The RIG will develop and evaluate the specific technologies and implementation. Ensuring that other groups who are working in this area are involved will be important for the success of the project.
- Make sure that the appropriate identifiers are available for all model organisms
- Establish a single website with an easy to use front end for obtaining identifiers
- Prepare instructions for authors
- Perform usability studies with naive users (~25)
- Present results to workshop consortium
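2) Discuss potential pilot project with publishers (meeting attendees; 1 month)
- Get initial commitments from publishers for proposed pilot project: what journals, what resources
- Discuss potential implementation per journal
3) Prepare detailed proposal for publishers (at completion of pre-pilot project; Resource Identification Group)
- Include a link to a demonstration site and the results of the usability study
- Allow flexibility in implementation
- Launch pilot project at SFN???
4) Continue to improve the automated pipeline and authoring/curation tools (Resource Identification Group)
- Contact Biocreative to see if they are interested in hosting a text mining challenge
5) Seek sponsorship for implementation and promoting the project (all)
- Mozilla Foundation: Open Science and Science in the Web
- Society for Neuroscience?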
- It will be easier to design and troubleshoot experiments, because researchers would be able to find all studies that used a particular reagent, animal or tool
- Researchers would be able to find studies that used a particular type of reagent, e.g., a mouse monoclonal antibody, even if that information was not explicitly included in the paper, because a complete characterization of the entity is present in an external database that can be accessed at time of query
- It will be easier to aggregate and compare results across studies, using both human effort and data mining approaches
- Problems found in a resource, e.g., specificity of an antibody, an error in a database or algorithm, can be easily propagated across the literature, even retrospectively. With proper tools, readers could be alerted to any potential problem, thereby reducing the time, effort and money wasted on problematic resources and on incorrect conclusions based on the results of these studies.
Background: Summary of first meeting held at Society for Neuroscience, Oct 15th, 2012
- The author did not supply sufficient identifying information, e.g., catalog number, such that an antibody could be reliably found in a vendor catalog. Rather, general information, e.g., mouse monoclonal antibody against actin from Sigma, was provided. As many vendors sell multiple antibodies that fit these descriptions, we could not identify the reagent used.
- The vendor no longer sold the antibody referenced or the vendor no longer existed, so information could not be discovered about its properties. In many cases, the same manufacturer would sell their products through multiple vendors, with no ability to cross-reference.
- The same antibody identifier, e.g., clone ID, could point to multiple antibodies
- Methods were not referenced within the paper, but readers were referred to other papers, which then referred to other papers…
- Authors did not supply sufficient information to identify the exact transgenic animal used, e.g., stock number. As with antibodies, a given reference to a transgenic from Jackson Labs could not be resolved to a particular transgenic line, but could point to more than one.
- The notation adopted by the IMSR does not lend itself for use by automated agents or search systems. It employs superscripts, subscripts and special characters.
- NIF’s semi-automated pipeline did fairly well at recognizing research resources listed in the NIF catalog within papers, except for those resources with names that were very common or short, e.g., R, Enzyme.
- To determine meaningful use of a resource, as opposed to a mention without actual use in the study, NIF focused its search on the Materials and Methods section. However, NIF had access to the Materials and Methods sections of only a subset of the literature: the PMC Open Access Subset, which is a relatively small part of the total collection of articles in PMC, and other open access journals.
- Who would do the identification?
- Author? Algorithm? Curator? Editor? How would the information be verified once supplied?
- Would a special tool be needed? If so, who will pay?
- Would it scale to 40,000 papers/month?
- Is the information available from authors in general?
- Difficult to implement only for neuroscience journals, as publishers have many different journals in their portfolios
- Granularity: Would we be able to specify the requirements at a level of granularity that would be useful but still feasible?
- What will be the benefit to the user? How will we show that?
Summary of discussion at June 26th meeting
- The issue of proper resource identification is not unique to neuroscience. Nicole Vasilevsky and her colleagues from eagle-i (https://www.eagle-i.net/) performed a comprehensive study of resource reporting across a spectrum of journals and fields, tracking the reporting of antibodies, cell lines, model organisms, knockdown reagents and constructs. Although the results differed across fields and type of resource, the general conclusions reached by the neuroscience pilot held: most papers did not contain sufficient identifying information for either a human or automated algorithm to identify the resources used. The study is under review in PeerJ.
- Vasilevsky et al. examined the reporting requirements of journals and found no correlation between proper identification and the stringency of reporting requirements.
- Although the pilot project did not address availability of the information from the author systematically, in a case study of a single laboratory at Carnegie Mellon University, Anita de Waard and colleagues found that the identifying information for reagents and animals was kept in good order by the researcher, i.e., the appropriate identifying information was available, but this information by and large did not make it into published papers.
- Although only an N of 1, the finding affirms the contention that authors simply do not think to put this information in a paper. In contrast, the vendor location and city are routinely supplied, because this information is requested by many journals and mentors teach their students to supply it.
- Scalability: NIF and Elsevier worked on a text mining project to see if a machine-learning algorithm could be used to automate the process of resource identification. They focused on antibodies and tools registered within the NIF Antibody Registry and NIF Catalog (databases and software tools). Over 500 articles were hand annotated and then used for text mining. The algorithm was reasonably accurate at detecting antibodies and identifying them if the catalog numbers were provided (~87%), although the many different styles of reporting catalog numbers decreased the total number identified (~63%). Identification of tools was better, approaching 100%. The algorithms are still under development, but the results were encouraging in that:
- This project suggested that automated text mining would be helpful in verifying information supplied by the authors.
- This project also suggested that at some point, a “resource identification” step could be incorporated into the manuscript submission pipeline that would be able to assist authors in identifying their resources.
- Commercial antibody providers are interested in helping to support such efforts. In particular, NIF has interacted with antibodies-online (http://www.antibodies-online.com/), a company that seeks to provide more transparency in the antibody market. Its staff are experts in the antibody market and are willing to help underwrite costs for the NIF Antibody Registry, an online database for assigning unique identifiers to antibody reagents.
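The scalability item above notes that the many different styles of reporting catalog numbers reduced the number of antibodies identified. As a purely illustrative sketch (it is not the NIF-Elsevier machine-learning algorithm), the Python snippet below shows one simple way catalog numbers might be normalized before a registry lookup. The registry contents, the vendor name, the catalog number, the identifier, and the list of label prefixes are all invented placeholders.

```python
import re

# Hypothetical registry keyed by (vendor, normalized catalog number).
# The entry is an invented placeholder, not a real registry record.
REGISTRY = {
    ("examplevendor", "AB1234"): "EXAMPLE_ID_0001",
}

# Label prefixes such as "Cat#", "cat. no." or "catalog number:" (assumed list).
# The label must be followed by "." or a word boundary, so a catalog number
# that merely starts with "CAT" (e.g. "CAT123") is left untouched.
PREFIX = re.compile(
    r"^(?:catalog(?:ue)?|cat)(?:\.|\b)\s*(?:(?:number|num|no)\b\.?)?\s*[#:]?\s*",
    re.IGNORECASE,
)


def normalize_catalog_number(raw):
    """Strip label prefixes, punctuation, spaces and case differences."""
    stripped = PREFIX.sub("", raw.strip())
    # Dropping every non-alphanumeric character also removes a bare
    # leading "#" as well as hyphens and spaces inside the number.
    return re.sub(r"[^A-Za-z0-9]", "", stripped).upper()


def lookup(vendor, raw_catalog_number):
    """Return the registry identifier, or None if nothing matches."""
    key = (vendor.strip().lower(), normalize_catalog_number(raw_catalog_number))
    return REGISTRY.get(key)
```

Collapsing punctuation trades precision for recall: two genuinely different products whose catalog numbers differ only by a hyphen would be merged, so a real pipeline would likely still need curator or author verification.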
Discussion of presentations
Scope: 3 types of entities should be identified as an initial pilot project: 1) Antibodies; 2) Tools; 3) Model organisms
For tools, the scope should be those that are registered within the NIF Registry and not all commercial tools or instruments used. The NIF Registry focuses on digital resources that are largely, although not exclusively, produced by the academic community. Note that the NIF Registry links with NITRC (Neuroimaging Tools and Resource Clearinghouse; http://nitrc.org), which has catalogued software tools and databases for neuroimaging. For the purposes of this proposal, references to the NIF Registry will also include NITRC, as that is the authoritative source of neuroimaging tools.
Who: The group discussed whether the author should be asked to supply this information, or whether we should attempt to use semi-automated means to identify potential research resources and then go back to the author. One can envision a two-step process in which the authors are asked to supply the information and the article is then screened via NLP for verification. The need to ensure that the process is not overly onerous for the author was emphasized.
When: Should the process of resource identification be done at time of submission, during review or after acceptance? The general feeling was that during review or after acceptance would be the time when we would likely get the most compliance. If this process is done during review, then the reviewers would need to be alerted that they should look for this information and be able to communicate with the author that they need to supply this information. We do not want to make this an absolute requirement for publication, as we all recognize that the authors may not possess this information and we don’t want them supplying false information in order to have the article published. If it is after acceptance, then the onus would be on the staff or the editor to ensure compliance.
How: If authors are going to supply these identifiers, then it needs to be easy for them to obtain them. Dr. Martone felt that the proper identifiers were sometimes difficult to find in the model organism databases, but that NIF could help with a simple service. NIF itself would also need to be made simpler, as it is currently difficult to know where to look. Communication with the Mutant Mouse Resources is necessary to ensure that proper identifiers are being given to all mouse strains.
Dr. Pollock raised the possibility of identifying animals through a bar code, and perhaps spiking reagents with a sequence or some other identifier that could be read automatically. It is clear that novel technology solutions are now possible or on the horizon and that investments into laboratory information management need to be made. Once the research community begins to make the shift towards a web-enabled platform for scholarly communication, one that handles all types of diverse research objects, we believe that there will be numerous opportunities to streamline the process of working with these objects.
The implementation group mapped out what an end-to-end workflow might look like for a pilot project and beyond. The minimum requirement is that we have the appropriate registries that are viewed as authoritative for the entities to be identified.
1) Tagging: The option of having an independent group like NIF do the tagging, rather than the author, was discussed but would likely bring up privacy concerns from authors. As with the feasibility group, one can see pros and cons to performing the resource identification at different steps in the publication process: at time of submission, during review, after acceptance.
2) Verification step: The suggestion was made that we contact Biocreative (http://biocreative.sourceforge.net): “Critical Assessment of Information Extraction systems in Biology”, an organization that runs challenges for evaluating text mining and information extraction systems applied to the biological domain. We could make the verification of research resources within the materials and methods section a challenge project.
3) Where would the identifiers be? The request was that any identifiers supplied would be available in a uniform format across publishers, would not be stripped out by PubMed, and be available to 3rd parties outside of the paywall. In the NIF-Elsevier pilot, identifiers are placed in the author-supplied keyword field, which is indexed by PubMed. This solution may be unwieldy if larger numbers of antibodies are used, for example. Alternatively, Geoff Bilder suggested that they could be stored in a single URL that points to a metadata record. Placing the identifiers in text is something that is done already for entities like gene accession numbers, but unless the text was accessible, this would not satisfy the requirements for 3rd party accessibility. However, with access to materials and methods, these identifiers could be extracted and placed in a location outside of a paywall. Clearly, as indicated in Mike Huerta’s talk, the NIH Data Catalog will face similar issues.
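To make the single-URL option more concrete, here is a hedged Python sketch of what a per-article metadata record might contain. Every field name, the DOI, and the identifier values are hypothetical placeholders; no schema was agreed on at the meeting.

```python
import json


def build_resource_record(article_doi, resources):
    """Bundle an article's resource identifiers into one JSON-serializable record.

    `resources` is a list of (resource_type, registry, identifier) tuples.
    All field names here are illustrative assumptions, not an agreed schema.
    """
    return {
        "article": article_doi,
        "resources": [
            {"type": rtype, "registry": registry, "id": identifier}
            for rtype, registry, identifier in resources
        ],
    }


record = build_resource_record(
    "10.0000/example.doi",  # placeholder DOI
    [
        ("antibody", "Antibody Registry", "EXAMPLE_AB_0001"),
        ("tool", "NIF Registry", "EXAMPLE_TOOL_0002"),
        ("organism", "Model organism database", "EXAMPLE_ORG_0003"),
    ],
)
print(json.dumps(record, indent=2))
```

A record of this kind would sit outside the paywall and could list any number of antibodies without crowding the keyword field, which is the limitation noted for the keyword-based approach.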
4) Sustainability: The issue of sustainability of projects like NIF was brought up, as some publishers are concerned about investing in a strategy only to have the database disappear. Of course, no one can guarantee that any organization will exist in perpetuity. Possible solutions are to replicate the services, e.g., the INCF and eagle-i both offered to mirror the NIF system, to provide robustness. Geoff Bilder also noted that if the identifiers and systems are covered by a CC-0 license, then they would be available to anyone to pick up should NIF go out of business.
At the end of the session, almost all attendees indicated their interest in a pilot project to identify antibodies, model organisms and tools in a machine-processable form across neuroscience journals. One goal of this project will be to gather data on the best implementation strategy for engaging authors in providing these identifiers and for establishing a scalable process for verifying that the correct identifiers are used. Another goal will be to provide a demonstration project to the research community that shows the benefits of machine-processable information within papers by making it easier to find research resources.
Considerable groundwork has been done, and the major resources (NIF Registry, Antibody Registry, NITRC, NIF Integrated Model Organism database) required for this project are largely in place. Before a large-scale pilot project can be launched, however, we will need to do a pre-pilot. Thus far, the work done by NIF and Monarch has not engaged the author but has relied on curators or automated agents to identify research resources. As the author must be engaged in this process, a pre-pilot was outlined in which a small group of users is given 5-10 papers and asked to supply appropriate identifiers for antibodies, tools and animals. We would monitor whether:
- naive users were able to understand which entities needed to be identified
- naive users were able to look up the appropriate identifiers
- users got frustrated or annoyed at the process
- the appropriate entities within papers were available through NIF (measured as a percentage)
We didn’t discuss what would constitute success for this pre-pilot phase, but clearly we would like to see that a majority of users could successfully complete the task. This pre-pilot could be conducted via webinar so that it did not involve a large expense.
Once the system is in place for obtaining the appropriate identifiers, a larger scale pilot project would be launched across journals. This project would involve asking the authors to supply the correct identifiers at some point in the publication process: at submission, during review or after acceptance. We will leave it up to the individual journals and publishers to decide at which stage they send the author the request, in order to give them some flexibility and to allow us to test different strategies for acquiring this information. Ideally, the project would run for a specified period of time, e.g., one month, during which time all articles from a particular journal would be tagged. Again, the journals and publishers can have some flexibility in choosing the journals and the exact number of articles. However, it is important that high-impact journals participate in this project, as authors are usually highly motivated to comply with requests from high-impact journals and because it would give high visibility to the project.
Authors would be notified by the editors by email that they are participating in a pilot project to make science more reproducible and to make articles easier for machines to read. NIF will provide the appropriate instructions and a link to the website where the authors can obtain the information. Geoff Bilder offered to work with colleagues, e.g., Steve Pettifer, to create a nice front end for the system.
For the initial project, the authors should insert the identifiers into their materials and methods section, as they would a gene accession number or a URL for a tool. Some journals have author guidelines for this type of citation, and we would follow this convention, e.g., BMC Genomics states that nucleic acid sequences, protein sequences, and the atomic coordinates of macromolecular structures should be deposited in the appropriate database, and that the accession number should be provided in square brackets with their corresponding database name (e.g. [EMBL:AB026295, GenBank:U49845, PDB:1BFM] (Kafkas et al., 2013)).
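As a minimal sketch of how identifiers written in this bracketed style could later be pulled out of a Materials and Methods section, the following Python snippet parses `[Database:ID, ...]` groups with regular expressions. The function name and the patterns are hypothetical, the code assumes identifiers contain no spaces, commas or brackets, and it is not a tool used in the NIF pilot.

```python
import re

# Matches one bracketed group such as "[EMBL:AB026295, GenBank:U49845]".
BRACKET_GROUP = re.compile(r"\[([^\[\]]+)\]")
# Assumption: every entry inside the brackets has the form "Database:ID".
ENTRY = re.compile(r"^\s*([A-Za-z][\w .-]*?)\s*:\s*([^\s,\[\]]+)\s*$")


def extract_identifiers(text):
    """Return a list of (database, identifier) pairs found in bracketed groups."""
    pairs = []
    for group in BRACKET_GROUP.findall(text):
        entries = [ENTRY.match(part) for part in group.split(",")]
        # Skip brackets that are not identifier lists, e.g. "[see Fig. 2]".
        if not all(entries):
            continue
        pairs.extend((m.group(1), m.group(2)) for m in entries)
    return pairs
```

Applied to a sentence containing `[EMBL:AB026295, GenBank:U49845, PDB:1BFM]`, the function yields one pair per entry, which could then be checked against the corresponding registry.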
To oversee the implementation issues and ensure that the effort can extend beyond neuroscience, we will create a Resource Identification Group that includes participants in this workshop and others who have expertise and tools relevant to the pilot. We will utilize FORCE11 (Future of Research Communications and e-Scholarship; https://www.force11.org), a community platform for stakeholders interested in advancing scholarly communication through technology, to align our efforts with those underway in different areas of biomedicine, as the goal is to establish a uniform citation style. FORCE11 is already coordinating discussions on data citation styles (https://www.force11.org/data-citation-synthesis-working-group/) and can provide feedback and advice about the proposed implementation.
Once the papers have been annotated with resource identifiers, we would then need access to the full text so that we could verify and extract these identifiers. For the initial pilot project, we need not determine the final solution for where the identifiers are to be stored outside of the paywall, as an outside organization like NIF, INCF or Cross Ref can store them. As per the discussion above, the pilot project may involve mirroring at all 3 sites. After the pilot project is complete, we will follow up with a questionnaire to find out how the authors viewed the task. We will also provide them with a link where they can view the results of the pilot, and give them a search interface so that they can find papers that used their reagent or tool. We hope to engage the publishers to create various widgets that might provide this information through the article itself.
1) Perform pre-pilot project (2 months; Resource Identification Group: NIF, NITRC, INCF, Monarch, Cross-Ref, antibodies-online, eagle-i and other interested parties):
- Form the Resource Identification Group: The RIG will develop and evaluate the specific technologies and implementation. Ensuring that other groups who are working in this area are involved will be important for the success of the project.
- Make sure that the appropriate identifiers are available for all model organisms
- Establish a single website with an easy to use front end for obtaining identifiers
- Prepare instructions for authors
- Perform usability studies with naive users (~25)
- Present results to workshop consortium
2) Discuss potential pilot project with publishers (meeting attendees; 1 month)
- Get initial commitments from publishers for proposed pilot project: what journals, what resources
- Discuss potential implementation per journal
3) Prepare detailed proposal for publishers (at completion of pre-pilot project; Resource Identification Group)
- Include a link to a demonstration site and the results of the usability study
- Allow flexibility in implementation
- Launch pilot project at SFN???
4) Continue to improve the automated pipeline and authoring/curation tools (Resource Identification Group)
- Contact Biocreative to see if they are interested in hosting a text mining challenge
5) Seek sponsorship for implementation and promoting the project (all)
- Mozilla Foundation: Open Science and Science in the Web
- Society for Neuroscience?