Below are a few examples of tools and thoughts for advancing scientific communication. We hope this will be a living document that helps in the search for the right toolkit to tackle this digital jungle. Maintaining a list of tools and exemplars of research documentation done well can help us use each other’s work in the best possible way – asking our own questions, but – in terms of tools, at least – joining our fellow scholars in inventing this future.
New! Try out the new search function, brought to FORCE11 via the Neuroscience Information Framework.
Relevant papers and web articles can also be added to our FORCE11 Reading List in Mendeley by joining the FORCE11 group. You are welcome to add relevant articles, regardless of whether or not they are open access, but articles that are not open access should be tagged as such.
Computational Linguistics/Text Mining Efforts
Hypothesis/claim-based representation of the rhetorical structure of a scientific paper
Mapping initiatives between ontologies
Metadata standards and ontologies
Modular formats for science publishing
Publications and reports relevant to scholarly digital publication and data
Semantic publishing initiatives and other enriched forms of publication
Structured Digital Abstracts – modeling science (especially biology) as triples
Structured experimental methods and workflows
Alternative metrics: Alternative ways of measuring impact
2. Total impact
3. Publish or Perish (Harzing.com). Software that allows authors to perform citation analysis using Google Scholar. Calculates a variety of impact factors.
Author Identification:
1. I am not a Scientist I am a Number
2. Open Researcher & Contributor ID (ORCID)
Annotation
1. W3C Open Annotation Community Group
2. Ciccarese P, Ocana M, Das S, Clark T. AO: An Open Annotation Ontology for Science on the Web. Paper at Bio-ontologies 2010 http://esw.w3.org/images/c/c4/AO_paper_Bio-Ontologies_2010_preprint.pdf.
3. How to express and exchange annotations. https://github.com/nichtich/marginalia/wiki/Support-of-PDF-annotations.
4. PDFX A PDF-to-XML converter for scientific articles via Utopia Documents.
5. Peter Sefton on annotation: http://ptsefton.com/2010/11/05/towards-beyond-the-pdf-a-summary-of-work-weve-been-doing.htm/comment-page-1#id9.
Authoring Tools:
1. Knowledge Blog – using WordPress for Science.
3. Google Docs!
4. David Argue's zebrafish HTML/JavaScript paper format: http://www.zfishbook.org/NGP/.
5. The Scalar Project. The Alliance for Networking Visual Culture. Our work explores new forms of scholarly publishing aimed at easing the current economic crisis faced by many university presses while also serving as a model for media-rich digital publication.
7. WordPress with relatively modest customizations: http://wiki.code4lib.org/index.php/Code4Lib_Journal_WordPress_Customi…
Citation analysis
1. Publish or Perish (Harzing.com). Software that allows authors to perform citation analysis using Google Scholar. Calculates a variety of impact factors.
Computational Linguistics/Text Mining Efforts
1. AcroMine (NaCTeM, University of Manchester). Automatically determines the full forms of acronyms.
2. Argumentative Zoning, work by Simone Teufel and others.
4. Current work in defining elements within Chemistry papers, with Colin Batchelor.
5. Automatic recognition of sentence types in biomedical abstracts. Tsujii lab, University of Tokyo. Title, conclusion, method, objective, result. See MEDIE (advanced search) for a demo.
6. GENIA (Tsujii Lab, University of Tokyo) and GREC (NaCTeM, University of Manchester). Textual corpora annotated with biomedical events – permit system training to identify and structure relevant information in biomedical documents automatically.
7. Hypothesis identification at Xerox. The Xerox Incremental Parser is used to find key rhetorical statements in biology research papers.
8. In-Context Summaries. The work of Stephen Wan of CSIRO, Sydney, providing in-browser summaries of referenced papers, weighted by the textual context of the in-text reference.
9. Linking of biomedical Named Entities in documents to related database entries – such links are provided in the BioLexicon. Examples of search engines providing such links are MEDIE and UKPMC.
10. Metaknowledge annotation of biomedical events. NaCTeM, University of Manchester.
11. Annotation of interpretative information for biomedical events along 5 different dimensions: Knowledge Type (fact, analysis, observation, etc), Certainty Level, Polarity, Manner and Source.
12. OpenCalais. A web service provided by Thomson Reuters that creates semantic markup of submitted text. Good for terms relating to current events, commerce and politics. Weak for scientific terms. Check conditions of use – Thomson Reuters retains text submitted for its own purposes!
13. REFLECT. European Molecular Biology Laboratory, Heidelberg. A Web service that provides semantic markup for gene and protein names in submitted HTML documents, with links to relevant bioinformatics databases.
14. U-Compare. An integrated text mining/natural language processing system based on the UIMA Framework, allowing documents to be processed by various text-mining tools.
Data citation
1. Australian National Data Service has a nice page on data citation awareness: http://ands.org.au/guides/data-citation-awareness.html.
2. David Shotton’s Data Citation Best Practice Discussion Document.
3. Gary King on data sharing http://gking.harvard.edu/projects/repl.shtml.
4. The challenges with tracking dataset reuse today, based on DOIs and paper-oriented tools: http://researchremix.wordpress.com/2010/11/09/tracking-dataset-citations-using-common-citation-tracking-tools-doesnt-work/.
5. Universal Numerical Fingerprint (UNF)
6. Micah Altman, Gary King, 2007. "A Proposed Standard for the Scholarly Citation of Quantitative Data", D-Lib 3(3/4). http://www.dlib.org/dlib/march07/altman/03altman.html.
7. Oak Ridge National Laboratory Distributed Active Archive Center data citation policy http://daac.ornl.gov/citation_policy.html.
8. DataCite.org: helping you find, access and reuse research data.
Data repositories
Links to repositories that will store research data for outside users
- re3data.org – Registry of Research Data Repositories
Ereaders:
1. For some really useful articles on this issue from someone who does understand typography and design see Craig Mod's site, for example this one on Books: http://craigmod.com/journal/ebooks/.
2. Or this on how the reading experience should work: http://craigmod.com/satellite/bad_ereaders/.
Hypothesis/claim-based representation of the rhetorical structure of a scientific paper
These projects all start with the assumption that a scientific paper is, at heart, a persuasive text that makes a number of claims, that are backed by research data and references. The paper comprises a set of hypotheses supported by evidence in the form of included data or references to other work.
1. aTags. DERI, 2009 – now. aTags ("associative tags") are snippets of HTML that capture the information that is most important to you in a machine-readable, interlinked format. aTags works with any Web text and can store and connect any textual element that is highlighted in a browser.
2. Cohere. KMI, 2007 – now. The Cohere project, which builds on the earlier 'ClaiMaker' project, offers a web-based interface to create claims, hypotheses, or statements, and relate these to other claims using an open set of relationships. It is usable for science, but also for structuring online debates on other topics.
3. Hypotheses in Biology. UvA, 2009. A methodology and set of proto-ontologies in OWL for capturing different aspects of a text mining experiment: the biological hypothesis, text and documents, text mining, and workflow provenance.
4. HyBrow. Stanford, 2008. A prototype bioinformatics tool for designing hypotheses and evaluating them for consistency with existing knowledge
5. HypER. 2009 – now. HypER is an ad-hoc group of researchers who all represent scientific communications as a set of hypotheses, with relations to evidence. It includes representatives of LiquidPub, Cohere, SWAN, SALT, SPAR, aTags and abcde work. The main focus of HypER has shifted to the W3C HCLS work on Scientific Discourse structures.
6. SALT. DERI, 2008. SALT is a LaTeX-based authoring tool that allows authors to identify Rhetorical Structure Theory (RST-) relations between sentences in their paper. It offers the author the opportunity to define main and secondary (satellite) sentences and create relations between them
7. SWAN. Alzheimer Research Forum and Massachusetts General Hospital / Harvard Medical School, 2006 – now. The SWAN Alzheimer Knowledgebase project adds a collection of hand-curated hypotheses and claims to a research paper, which are then related through a set of discourse relationships. They can be browsed and relations between claims, as well as support networks for a specific claim, are made and visualised.
8. The SWAN Scientific Discourse Ontology is publicly available and has been harmonized with the SIOC and CiTO ontologies for wider use.
Mapping initiatives between ontologies:
1. SWAN/SIOC/CiTO alignment. 2010, HCLS SiG of W3C: Harmonization and alignment between three ontology systems of relevance to citations and rhetorical relationships between publications: SWAN, used for the SWAN project, SIOC, used to describe social media, and the SPAR ontologies CiTO (Citation Typing Ontology) and FaBiO (FRBR-aligned Bibliographic Ontology).
2. NCBO BioPortal: http://bioportal.bioontology.org/.
Metadata standards and ontologies
1. Bioinformatics Links Directory: http://www.bioinformatics.ca/links_directory/.
2. Catalogue of standards and ontologies relevant to life sciences: http://www.biosharing.org/standards_view.
3. MIBBI: Minimum Information for Biological and Biomedical Investigations: http://www.mibbi.org/.
4. NCBO BioPortal: http://bioportal.bioontology.org/.
5. Neuroscience Information Framework: http://www.neuinfo.org/nif/nifgwt.html?tab=registry.
6. Ontology of Biomedical Investigation. A broad-based community effort to develop an ontology that provides a representation for biomedical experiments.
7. Open Biological and Biomedical Ontologies: http://www.obofoundry.org/.
8. Open Archives Initiative Object Reuse & Exchange (OAI-ORE): defines standards for the description and exchange of aggregations of Web resources.
9. OREChem project on the Experiment Ontology – There's a slightly out of date description of this idea at: http://www.aejournal.net/content/1/1/3
10. SPAR (Semantic Publishing and Referencing Ontologies) http://purl.org/spar/.
11. ScholOnto: this is the precursor of the Cohere work; the relationship ontology is available in RDF.
Modular formats for science publishing
These propose greater granularity for the scientific paper, the 'smallest publishable unit' being smaller than the size of a full paper.
1. abcde format. Utrecht University, 2007. The abcde format is a proposal for a simple, structured format for conference papers in computer science, based on LaTeX. Each paper consists of three sections: Background, Contribution, and Discussion, and three added elements: A = Annotation, a Dublin Core annotation; E = Entities, RDF-formatted entities of interest, including references; and (with no contribution to the acronym) Core Sentences, sentences that are marked up by the author as core elements. These can be extracted to form a structured abstract.
2. The Annotation Ontology is an OWL vocabulary designed to support stand-off annotation of web documents, without requiring these documents to be under update control of the annotators. It is orthogonal to domain ontologies.
3. 'Coarse-grained rhetorical structure'. Work done in the HCLS SiG of the W3C since 2009. This group aims to define a 'rhetorical structure' for scientific papers, to use in authoring or mark-up tools. They are working towards a definition of such a format, have an intermediary proposal of their own, and are beginning to compile an overview of existing publishers' proposals.
4. LiquidPub. EU Project, U Trento and others (2008 – 2011). A 'liquid' format for science papers is proposed that consists of a set of research objects, connected by links.
5. Modular Physics Paper. University of Amsterdam (1999). A modular form for Physics papers: by investigating a collection of papers, a more fine-grained structure for science papers and an extensive relationships taxonomy are proposed.
6. Nanopublications. NBIC, the Netherlands Bioinformatics Centre: The notion of a nanopublication is basically a general scientific assertion, represented using controlled vocabularies as “triples” (subject-predicate-object) in the semantic-web standard RDF format, with additional meta-data concerning provenance.
7. The Concept Web Alliance proposes to model scientific research as sets of triples (CWA Nanopublications, 2010).
8. The definition of the format has been published (The Anatomy of a Nanopublication).
9. See also The Value of Data, motivating the use of nanopublications in Nature Genetics.
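The nanopublication idea described above (one assertion expressed as a subject-predicate-object triple, plus provenance metadata) can be illustrated with a minimal sketch. This is not the published nanopublication format: the class name, the `ex:` identifiers and the provenance keys are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass
class Nanopublication:
    """Hypothetical container: one assertion plus provenance metadata."""
    assertion: Triple
    provenance: Dict[str, str] = field(default_factory=dict)

    def as_triples(self, np_id: str) -> List[Triple]:
        """Return the assertion followed by provenance triples about np_id."""
        triples = [self.assertion]
        # Sort keys so the output order is deterministic.
        for key, value in sorted(self.provenance.items()):
            triples.append((np_id, key, value))
        return triples


# All identifiers below are made-up placeholders, not real vocabulary terms.
example = Nanopublication(
    assertion=("ex:GeneA", "ex:isAssociatedWith", "ex:DiseaseB"),
    provenance={"ex:assertedBy": "ex:SomeCurator", "ex:assertedOn": "2010-01-01"},
)
triples = example.as_triples("ex:np1")
for triple in triples:
    print(triple)
```

Keeping the assertion separate from its provenance is the point of the sketch: the scientific statement stays a single triple, while who asserted it and when travel with it as additional triples.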
Open Citations
1. The Open Citation Corpus. University of Oxford, 2010 onward.
2. A public RDF triplestore of biomedical literature citations encoded as Open Linked Data, linked using CiTO, the Citation Typing Ontology. Encoding references to some 3.4 million unique papers, representing >20% of all PubMed Central papers published between 1950 and 2010, including all the most highly cited papers in every biomedical field. Citation data freely available under a CC0 waiver from http://opencitations.net/data/ in a variety of formats including RDF and BibJSON. Hopefully soon to include data citations from the Dryad data repository.
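As a rough illustration of what a citation typed with CiTO looks like as a triple, the sketch below formats one citation as an N-Triples line. It is a simplified assumption, not the actual layout of the Open Citation Corpus data: the function name is hypothetical and the paper URIs are example.org placeholders. Only the CiTO namespace and the `cites` property come from the ontology itself.

```python
# Namespace of CiTO, the Citation Typing Ontology.
CITO = "http://purl.org/spar/cito/"


def citation_to_ntriple(citing: str, relation: str, cited: str) -> str:
    """Format one typed citation as a single N-Triples line.

    `relation` is the local name of a CiTO property, e.g. "cites".
    """
    return f"<{citing}> <{CITO}{relation}> <{cited}> ."


# Placeholder URIs; real data would use identifiers of actual papers.
line = citation_to_ntriple(
    "http://example.org/paperA", "cites", "http://example.org/paperB"
)
print(line)
```

Swapping "cites" for a more specific CiTO property is what turns a plain reference into a typed one.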
Peer Review: new models
1. F1000 Research. New on-line journal that employs a fast publication process followed by open peer review.
Provenance
A key part of science is knowing the provenance of a paper, experiment, data item, etc. Provenance includes attribution, sources, experimental workflow, citations and quotes, i.e. who, what, when, where, why.
1. A comprehensive review of provenance research: Moreau, L. (2010) The Foundations for Provenance on the Web. Foundations and Trends in Web Science, 2 (2–3). pp. 99-241. ISSN 1555-077X.
2. Open Provenance Model – a model for the interoperable exchange of provenance information arising out of a series of Provenance Challenges focusing on understanding the compatibility and interchange of information between provenance systems
3. Open Provenance Model Vocabulary: http://open-biomed.sourceforge.net/opmv/ns.html.
4. W3C Provenance Working Group – follow-on from the group below. Will standardize an exchange format for provenance on the Web.
1. W3C incubator group on provenance – mission was to provide a state-of-the-art understanding and develop a roadmap in the area of provenance for Semantic Web technologies, development, and possible standardization. Finishes Dec. 2010.
5. Workflows4Ever. This EU Project has a strong workflow-provenance component.
Publications and reports relevant to scholarly digital publication and data
More relevant papers and web articles are also available through our FORCE11 library in Mendeley.
1. Charles Bailey's Scholarly Electronic Publishing Bibliography: http://www.digital-scholarship.org/sepb/sepb.html.
The Scholarly Electronic Publishing Bibliography presents over 3,800 articles, books, and a limited number of other textual sources that are useful in understanding scholarly electronic publishing efforts on the Internet. It covers digital copyright, digital libraries, digital preservation, digital rights management, digital repositories, economic issues, electronic books and texts, electronic serials, license agreements, metadata, publisher issues, open access, and other related topics.
2. Publishing Research Consortium list of links: Publishing Research Links.
1. Geoffrey Boulton, Michael Rawlins, Patrick Vallance, Mark Walport (2011). Science as a public enterprise: the case for open data. The Lancet, Volume 377, Issue 9778, Pages 1633 – 1635, 14 May 2011. doi:10.1016/S0140-6736(11)60647-8.
2. Liz Lyon (2010). Open science in the data decade – article in Issue 20 of the Central Government edition of Public Service Review. http://www.ukoln.ac.uk/ukoln/staff/e.j.lyon/publications.html#central-government-2010-04.
3. Liz Lyon (2007). Dealing with Data: Roles, Rights, Responsibilities and Relationships – Consultancy Report. http://www.ukoln.ac.uk/ukoln/staff/e.j.lyon/publications.html#2007-06-19.
4. O'Donnell RP, Supp SR, Cobbold SM. (2010). Hindrance of conservation biology by delays in the submission of manuscripts. Conserv. Biol. 24 (2): 615-620. Epub 2010 Jan 11. http://www.ncbi.nlm.nih.gov/pubmed/20067489.
5. Open Biology. The Royal Society has just launched Open Biology, its first fully open access journal. Open Biology is a rapid, open-access, peer-reviewed online journal publishing high quality research in cell biology, developmental and structural biology, molecular biology, biochemistry, neuroscience, immunology, microbiology and genetics. The Editor-in-Chief, Professor David Glover (FRS) from the University of Cambridge, aims to provide a journal with a fair and speedy review system, run by active, practicing scientists with high expertise in this area, allowing good papers to be published quickly.
6. Sommers J (2010). The delay in sharing research data is costing lives. Nature Medicine 16 (7): 744. https://chordoma.box.net/shared/static/azpn8pxuzk.pdf.
Semantic publishing initiatives and other enriched forms of publication
1. Adventures in Semantic Publishing. Oxford University, 2009. A paper reporting a manually marked-up version of an epidemiological research paper in PLoS Neglected Tropical Diseases, with data enhancements, better browsing, reference linking and citation typing.
2. Article of the Future. Cell, 2009 onwards. Tabbed and hyperlinked presentation of the article; Graphical Abstract and Highlights on the landing page
3. Open Access journals published by Pensoft Journals come with semantic enhancements. Example: PhytoKeys.
4. Project Prospect. Royal Society of Chemistry, 2009 onwards. RSC editors annotate compounds, concepts and data within the articles and link these to additional electronic resources such as biological databases.
5. The Scalar Project. The Alliance for Networking Visual Culture. Our work explores new forms of scholarly publishing aimed at easing the current economic crisis faced by many university presses while also serving as a model for media-rich digital publication.
6. Semantic Biochemical Journal. 2010 onwards. Uses Utopia, an innovative PDF reader which allows enrichment of the PDF with interactive figures and active data.
Structured Digital Abstracts – modeling science (especially biology) as triples
Representing scientific information as sets of triples. There is a special interest in this representation within biology and life sciences. Some initiatives include:
1. FEBS Letters SDA, 2008 – now. The journal FEBS Letters adds curator-created triples to describe protein-protein interaction to every appropriate paper
2. The Structured Digital Abstract, Seringhaus/Gerstein, 2008. This paper basically proposes to include a 'structured XML-readable summary of pertinent facts'
Structured experimental methods and workflows
1. Eagle-I – eagle-i is a distributed platform for creating and sharing semantically rich data. It is built around semantic web technologies and follows linked open data principles. In its current incarnation and operational deployment, eagle-i focuses on biomedical research resources.
2. crowdLabs: a platform for sharing and executing computational tasks.
3. Investigation/Study/Assay (ISA). European Bioinformatics Institute and University of Oxford, 2009 – present. The ISA infrastructure is a general-purpose format and freely available desktop software suite targeted to curators and experimentalists that assists in management of experimental metadata, engages with minimum information checklists, ontologies and formats, particularly relating to genomics data for submission to international public repositories (e.g. ENA for genomics, PRIDE for proteomics and ArrayExpress for transcriptomics).
4. Knowledge Engineering from Experimental Design (KEfED): A structured way of constructing 'observational assertions' based on statistical relationships from experiments. The model is general-purpose and forms a basis for reasoning over experimental data.
5. My Experiment: A platform to create and exchange experimental workflow components.
6. VisTrails: an open-source data analysis and visualization tool that supports the creation of documents whose results have deep captions that point to their provenance, and thus can be reproduced and verified. Provenance-rich results derived by VisTrails can be included in LaTeX, Wiki, Microsoft Word and PowerPoint documents.
7. The NIH Neuroscience Information Framework has developed a registry of over 800,000 unique antibodies from the neuroscience literature, with sourcing and availability information, based on a semantic annotation pipeline supported by the Domeo web annotation toolkit.
8. Workflows 4Ever: Wf4Ever addresses some of the biggest challenges for the preservation of scientific workflows in data intensive science, including: (a) the consideration of complex digital objects that comprise both their static and dynamic aspects, including workflow models, the provenance of their executions, and interconnections between workflows and related resources, (b) the provision of access, manipulation, sharing, reuse and evolution functions to these complex digital objects, (c) integral lifecycle management functions for workflows and their associated materials. To address these challenges, the Wf4Ever project will investigate and develop technological infrastructure for the preservation and efficient retrieval and reuse of scientific workflows in a range of disciplines.
Text Extraction
1. LA-PDFText A PDF-to-XML converter for scientific articles from the Biomedical Knowledge Engineering group at the Information Sciences Institute.
2. PDFX A PDF-to-XML converter for scientific articles via Utopia Documents.