Structured Digital Abstracts: The FEBS Letters experiment in conjunction with the BioCreative II.5 online challenge.
In biology, and especially in Systems Biology, the data needed to establish networks of protein-protein interactions are mostly provided by interaction databases such as MINT, IntAct, or BioGRID. Establishing the proteome's network is a fundamental goal in Molecular Biology, as it provides a holistic view of the cell's internal workings. However, manually extracting this information is time-consuming, and the databases continuously lag behind the published literature: many articles have not yet been integrated into their repositories as structured information usable for protein network reconstruction.
An approach to relieve this situation was initiated by FEBS Letters. In 2008, with fellowship salary support from FEBS, the journal introduced the Structured Digital Abstract (SDA). SDAs are appendices to regular, textual abstracts that follow very tight syntactical conventions, so that the texts are both human and machine readable. They describe protein-protein interactions, i.e., the database identifiers of the interacting proteins and the interaction type, and include the experimental method as evidence for a given interaction, using a controlled vocabulary for the type and method. To produce these SDAs, FEBS Letters asked authors during most of 2008 to provide the necessary annotations on a voluntary basis. These data were then refined by MINT curators prior to publication to ensure a high quality standard for the annotations.
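As a rough illustration of what "both human and machine readable" can mean here, the following minimal Python sketch models a single SDA statement as a record and renders it as a sentence. The class name, field names, sentence template, and placeholder identifiers are assumptions made for illustration only; they are not the format specification used by FEBS Letters or MINT.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class InteractionAnnotation:
    """One SDA statement (hypothetical model): two proteins, a type, a method."""
    protein_a_name: str
    protein_a_id: str      # database identifier (placeholder values used below)
    protein_b_name: str
    protein_b_id: str
    interaction_type: str  # term from a controlled vocabulary
    method: str            # controlled-vocabulary term for the experimental evidence

    def to_sentence(self) -> str:
        """Human-readable form, following an assumed sentence template."""
        return (f"{self.protein_a_name} {self.interaction_type} "
                f"{self.protein_b_name}: shown by {self.method}")

    def to_record(self) -> dict:
        """Machine-readable form: a plain dictionary of all fields."""
        return asdict(self)


# The identifiers are deliberately placeholders, not real database accessions.
annotation = InteractionAnnotation(
    protein_a_name="Rex", protein_a_id="PLACEHOLDER_A",
    protein_b_name="Dicer", protein_b_id="PLACEHOLDER_B",
    interaction_type="physically interacts with", method="pull down",
)
print(annotation.to_sentence())
print(annotation.to_record())
```

Keeping the sentence and the record as two views of the same object is one way to guarantee that the text a reader sees and the data a database imports cannot drift apart.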
In parallel, over the last decade, a broad range of text-mining and information-extraction tools have been developed to address the issue of producing annotations on biologically relevant texts, including protein-protein interactions. These tools are only useful if they are designed to meet real-life tasks and if their performance can be estimated and compared. The BioCreative challenge (Critical Assessment of Information Extraction in Biology) is a collaborative initiative to provide a common evaluation framework for monitoring and assessing the state of the art of text-mining systems applied to biologically relevant problems, similar to the CASP challenges for protein structure prediction. So far, there have been three such community efforts: BioCreative I in 2004 (published as a special issue of BMC Bioinformatics in 2005), BioCreative II in 2007 (published as a special issue of Genome Biology in 2008), and BioCreative II.5 in 2009 (forthcoming in 2010 as a special issue of TCBB).
The last two BioCreative challenges focused on protein-protein interaction extraction. The main tasks were to extract UniProt accessions (database identifiers) for proteins that have experimental interaction evidence described in the body of the articles, and then to report a list of binary interaction pairs. To this end, text-mining requires a collection of relevant texts (called a "corpus") that can subsequently be used by automated machine-learning systems to produce the corresponding annotations. Traditionally, obtaining a sufficiently sized corpus together with a set of relevant annotations on that corpus (called a "ground truth", or "gold standard") has been one of the two main hurdles for text-mining. Since BioCreative II, the organizers have collaborated with the MINT database (and, for BC II, also with IntAct), which contributed their curation facilities to provide the necessary high-accuracy annotations. Arguably the most long-lasting product of BioCreative and similar challenges is the corresponding annotated corpus.
These freely available corpora allow researchers to further improve their systems and methods.
With this background, a unique setting was created for BioCreative II.5: In conjunction with FEBS Letters and MINT, the BioCreative organizers announced a challenge to the biological text-mining community, asking participants to reproduce these annotations with their automated systems in a realistic, online setting - i.e., by providing web servers that could reproduce these annotations upon request within a limited time. However, holding such a challenge requires a reasonably sized corpus that can be distributed to the participants. Therefore, the organizers approached FEBS for the rights to distribute two years' worth of FEBS Letters publications to the participants, 1190 articles in total, in a machine-readable format (XML). In addition, MINT made an extra effort to provide exceptionally high-quality annotations for the 122 articles in this set that did contain protein-protein interaction evidence; this set also covers the articles that authors annotated during the FEBS Letters experiment. The other 1068 articles were used as negative examples for training machine-learning systems to discern relevant from irrelevant papers, a task that database curators carry out manually when deciding whether or not to annotate an article.
The BioCreative II.5 challenge was then held in spring 2009, with a subsequent workshop in October in Madrid (Spain). The three tasks were (1) article classification (annotation-relevant or not), (2) UniProt accession assignment (identifying the relevant, interacting proteins), and (3) the extraction of the actual binary interaction pairs. Naturally, the results fell short of human-made annotations; however, the systems performed significantly better than anticipated and are clearly good enough to assist human annotators in tasks such as deciding which articles to annotate, or to reduce the time-consuming task of finding the correct database identifiers for the annotations. Details of this collaborative effort between FEBS Letters, MINT, and BioCreative will be published shortly, while a special issue on the challenge itself is planned for publication in Transactions on Computational Biology and Bioinformatics (TCBB) in summer 2010.
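To make task (3) more concrete, the sketch below shows one way a system's predicted binary interaction pairs for an article could be compared against curated pairs. The text above does not say which metrics the challenge used; precision and recall are assumed here only as a common choice for this kind of comparison. The function names and the accession strings are hypothetical placeholders, not real UniProt accessions or BioCreative code.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Pair = FrozenSet[str]


def as_pairs(pairs: Iterable[Tuple[str, str]]) -> Set[Pair]:
    """Normalise (a, b) tuples into unordered pairs, so (a, b) equals (b, a)."""
    return {frozenset(p) for p in pairs}


def precision_recall(predicted: Iterable[Tuple[str, str]],
                     gold: Iterable[Tuple[str, str]]) -> Tuple[float, float]:
    """Compare predicted interaction pairs against curated (gold) pairs."""
    pred, ref = as_pairs(predicted), as_pairs(gold)
    hits = len(pred & ref)                      # pairs found in both sets
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    return precision, recall


# Placeholder accessions for illustration only.
gold_pairs = [("ACC_A", "ACC_B"), ("ACC_A", "ACC_C")]
predicted_pairs = [("ACC_B", "ACC_A"), ("ACC_A", "ACC_D")]

precision, recall = precision_recall(predicted_pairs, gold_pairs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Treating each pair as an unordered set reflects the fact that a binary interaction between two proteins has no inherent direction, so a prediction should not be penalised for listing the partners in the opposite order.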
The BioCreative II.5 corpora were kindly provided by Elsevier and FEBS. This machine-readable corpus of nearly 1200 FEBS Letters articles is freely accessible to computational biologists, and together with the MINT and BioCreative annotations it provides a large resource of intrinsic scientific value.
Both the corpus and more information about the BioCreative resources can be found on the BioCreative website at www.biocreative.org; to download the corpus and other data, see the Resources sections. (Note: this requires users to create an account and to accept the terms of use, under which the data are provided exclusively for scientific purposes.)
The FEBS Letters experiment proved to be such a success that SDAs are now added to all FEBS Letters manuscripts that contain protein-protein interactions. In addition, in 2009, FEBS Journal also started publishing manuscripts with SDAs, bringing added value to its manuscripts as well.
For more details:
FEBS Letters - http://www.febsletters.org/content/sda_summary
FEBS Journal - http://www.febsjournal.org/structured_digital_abstracts.asp
FEBS Letters and FEBS Journal now have the functionality to add Structured Digital Abstracts to papers containing protein-protein interactions. Novel interactions described in manuscripts are linked directly to a molecular interaction database (MINT).
Advantages of Structured Digital Abstracts for the scientific community:
An SDA is an extension of an article abstract comprising a series of sentences that contain links to the database entries (in the example below, the linked terms appear as blue hyperlinks in the online version).
Abe, M., Suzuki, H., Nishitsuji, H., Shida, H. and Takaku, H. (2010). Interaction of human T-cell lymphotropic virus type I Rex protein with Dicer suppresses RNAi silencing. FEBS Letters 584, 4313-4318
Double-stranded RNAs suppress the expression of homologous genes through an evolutionarily conserved process called RNA interference (RNAi) or post-transcriptional gene silencing. A bidentate nuclease called Dicer has been implicated as the protein responsible for the production of short interfering RNAs (siRNAs). In our experiments, Rex overexpression reduced the efficiency of short hairpin RNA (shRNA)-mediated RNAi. The interaction of Dicer with Rex inhibited the conversion of shRNA to siRNA. These results suggest that the interaction of Dicer with HTLV-I Rex inhibits Dicer activity and thereby reduces the efficiency of the conversion of shRNA to siRNA.
Structured summary of protein interactions
Rex physically interacts with Dicer: shown by pull down (view interaction)
An SDA brings added value to the manuscript, as the reader has immediate access to the 'big picture' about their protein of interest.
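Because the structured summary follows a fixed sentence pattern, a line such as "Rex physically interacts with Dicer: shown by pull down" from the example above can in principle be read back by a program. The following minimal Python sketch assumes the pattern "<protein> <interaction type> with <protein>: shown by <method>", inferred from that single example; the regular expression, the single-token protein names, and all identifiers are assumptions for illustration and not an official SDA grammar.

```python
import re
from typing import NamedTuple, Optional


class SdaStatement(NamedTuple):
    protein_a: str
    interaction_type: str
    protein_b: str
    method: str


# Assumed pattern, inferred from one example line; protein names are taken to
# be single tokens, and a trailing "(view interaction)" link label is optional.
_SDA_LINE = re.compile(
    r"^(?P<a>\S+)\s+(?P<type>.+?)\s+with\s+(?P<b>\S+):\s+shown by\s+(?P<method>.+?)"
    r"(?:\s+\(view interaction\))?$"
)


def parse_sda_line(line: str) -> Optional[SdaStatement]:
    """Return the parsed statement, or None if the line does not fit the pattern."""
    match = _SDA_LINE.match(line.strip())
    if match is None:
        return None
    return SdaStatement(match["a"], match["type"], match["b"], match["method"])


statement = parse_sda_line(
    "Rex physically interacts with Dicer: shown by pull down (view interaction)"
)
print(statement)
```

Lines that do not fit the assumed pattern, such as the heading "Structured summary of protein interactions", yield None rather than a partially filled record.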