Database: the Journal of Biological Databases and Curation

Yifan Peng, Anthony Rios, Ramakanth Kavuluru, Zhiyong Lu
Mining relations between chemicals and proteins from the biomedical literature is an increasingly important task. The CHEMPROT track at BioCreative VI aims to promote the development and evaluation of systems that can automatically detect the chemical-protein relations in running text (PubMed abstracts). This work describes our CHEMPROT track entry, which is an ensemble of three systems, including a support vector machine, a convolutional neural network, and a recurrent neural network. Their output is combined using majority voting or stacking for final predictions...
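The majority-voting step mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `majority_vote`, the tie-breaking rule (fall back to one designated system) and the label strings are all assumptions made for the example.

```python
from collections import Counter


def majority_vote(predictions, fallback_index=0):
    """Return the label predicted by more than half of the systems.

    `predictions` holds one label per system for a single candidate pair.
    If no label wins a strict majority, fall back to the system at
    `fallback_index` (a hypothetical tie-breaking rule).
    """
    label, votes = Counter(predictions).most_common(1)[0]
    if votes > len(predictions) / 2:
        return label
    return predictions[fallback_index]


# Placeholder outputs of three systems on three candidate chemical-protein pairs.
svm_out = ["CPR:3", "CPR:4", "NONE"]
cnn_out = ["CPR:3", "CPR:9", "NONE"]
rnn_out = ["CPR:4", "CPR:4", "CPR:5"]

ensemble = [majority_vote(list(votes)) for votes in zip(svm_out, cnn_out, rnn_out)]
print(ensemble)
```

Stacking, the alternative named in the abstract, would instead train a second-level classifier on the three systems' outputs; it is not shown here.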
Fei Chen, Jiawei Zhang, Junhao Chen, Xiaojiang Li, Wei Dong, Jian Hu, Meigui Lin, Yanhui Liu, Guowei Li, Zhengjia Wang, Liangsheng Zhang
With over 6000 species in seven classes, red algae (Rhodophyta) have diverse economic, ecological, experimental and evolutionary values. However, red algae are usually absent or rare in comparative analyses because genomic information of this phylum is often under-represented in various comprehensive genome databases. To improve the accessibility to the ome data and omics tools for red algae, we provided 10 genomes and 27 transcriptomes representing all seven classes of Rhodophyta. Three genomes and 18 transcriptomes were de novo assembled and annotated in this project...
Lana Yeganova, Won Kim, Donald C Comeau, W John Wilbur, Zhiyong Lu
PubMed® is a search engine providing access to a collection of over 27 million biomedical bibliographic records as of 2017. PubMed processes millions of queries a day, and understanding these queries is one of the main building blocks for successful information retrieval. In this work, we present Field Sensor, a domain-specific tool for understanding the composition and predicting the user intent of PubMed queries. Given a query, the Field Sensor infers a field for each token or sequence of tokens in a query in a multi-step process that includes syntactic chunking, rule-based tagging and probabilistic field prediction...
P Corbett, J Boyle
In this paper, we explore the application of artificial neural network ('deep learning') methods to the problem of detecting chemical-protein interactions in PubMed abstracts. We present here a system using multiple Long Short Term Memory layers to analyse candidate interactions, to determine whether there is a relation and which type. A particular feature of our system is the use of unlabelled data, both to pre-train word embeddings and also pre-train LSTM layers in the neural network. On the BioCreative VI CHEMPROT test corpus, our system achieves an F score of 61...
Wei-Sheng Wu, Yu-Xuan Jiang, Jer-Wei Chang, Yu-Han Chu, Yi-Hao Chiu, Yi-Hong Tsao, Torbjörn E M Nordling, Yan-Yuan Tseng, Joseph T Tseng
Translational regulation plays an important role in protein synthesis. Dysregulation of translation causes abnormal cell physiology and leads to diseases such as inflammatory disorders and cancers. An emerging technique, called ribosome profiling (ribo-seq), was developed to capture a snapshot of translation. It is based on deep sequencing of ribosome-protected mRNA fragments. A lot of ribo-seq data have been generated in various studies, so databases are needed for depositing and visualizing the published ribo-seq data...
Chen Li, Zhiqiang Rao, Qinghua Zheng, Xiangrong Zhang
Current research in bio-text mining mainly focuses on event extraction. Biological networks present much richer and more meaningful information to biologists than events. Bio-entity coreference resolution (CR) is a very important method to complete a bio-event's attributes and interconnect events into bio-networks. Though general CR methods have been studied for a long time, they could not produce a practically useful result when applied to a special domain. Therefore, bio-entity CR needs attention to better assist biological network extraction...
Huiwei Zhou, Zhuang Liu, Shixian Ning, Yunlong Yang, Chengkun Lang, Yingyu Lin, Kun Ma
Automatically extracting protein-protein interactions (PPIs) from biomedical literature provides additional support for precision medicine efforts. This paper proposes a novel memory network-based model (MNM) for PPI extraction, which leverages prior knowledge about protein-protein pairs with memory networks. The proposed MNM captures important context clues related to knowledge representations learned from knowledge bases. Both entity embeddings and relation embeddings of prior knowledge are effective in improving the PPI extraction model, leading to a new state-of-the-art performance on the BioCreative VI PPI dataset...
Lu Zhou, Qingyu Xiao, Jie Bi, Zhen Wang, Yixue Li
The rabbit is a very important species for both biomedical research and agricultural animal breeding. Rabbits are not only the most-used experimental animals for the production of antibodies, but are also widely used for studying a variety of human diseases. Here we developed RabGTD, the first comprehensive rabbit database containing both genome and transcriptome data generated by next-generation sequencing. Genomic variations coming from 79 samples were identified and annotated, including 33 samples of wild rabbits and 46 samples of domestic rabbits with diverse populations...
Kai Feng, Xi-Lin Hou, Meng-Yao Li, Qian Jiang, Zhi-Sheng Xu, Jie-Xia Liu, Ai-Sheng Xiong
Celery (Apium graveolens L.) is a plant belonging to the Apiaceae family, and a popular vegetable worldwide because of its abundant nutrients and various medical functions. Although extensive genetic and molecular biological studies have been conducted on celery, its genomic data remain unclear. Given the significance of celery and the growing demand for its genomic data, the whole genome of 'Q2-JN11' celery (a highly inbred line obtained by artificial selfing of 'Jinnan Shiqin') was sequenced using HiSeq 2000 sequencing technology...
Xin Chen, Zhengqiang Miao, Mayur Divate, Zuxianglan Zhao, Edwin Cheung
The identification and functional characterization of novel biomarkers in cancer requires survival analysis and gene expression analysis of both patient samples and cell line models. To help facilitate this process, we have developed KM-Express. KM-Express holds an extensive manually curated transcriptomic profile of 45 different datasets for prostate and breast cancer with phenotype and pathoclinical information, spanning from clinical samples to cell lines. KM-Express also contains The Cancer Genome Atlas datasets for 30 other cancer types with matching cell line expression data for 23 of them...
Elton J R Vasconcelos, Vinícius C Mesel, Lucas F daSilva, David S Pires, Guilherme M Lavezzo, Adriana S A Pereira, Murilo S Amaral, Sergio Verjovski-Almeida
Long non-coding RNAs (lncRNAs) have been widely discovered in several organisms with the help of high-throughput RNA sequencing. LncRNAs are over 200 nt-long transcripts that do not have protein-coding (PC) potential, having been reported in model organisms to act mainly on the overall control of PC gene expression. Little is known about the functionality of lncRNAs in evolutionarily ancient non-model metazoan organisms, like Schistosoma mansoni, the parasite that causes schistosomiasis, one of the most prevalent infectious-parasitic diseases worldwide...
Fabienne Lambusch, Dagmar Waltemath, Olaf Wolkenhauer, Kurt Sandkuhl, Christian Rosenke, Ron Henkel
Computational models in biology encode molecular and cell biological processes. Many of these models can be represented as biochemical reaction networks. Studying such networks, one is mostly interested in systems that share similar reactions and mechanisms. Typical goals of an investigation thus include understanding of model parts, identification of reoccurring patterns and recognition of biologically relevant motifs. The large number and size of available models, however, require automated methods to support researchers in achieving their goals...
Kabita Tripathy, Balwant Singh, Nisha Singh, Vandna Rai, Gauri Misra, Nagendra Kumar Singh
Rice is a staple food for the people of Asia that supplies more than 50% of the food energy globally. It is widely accepted that the crop domestication process has left behind substantial useful genetic diversity in their wild progenitor species that has huge potential for developing crop varieties with enhanced resistance to an array of biotic and abiotic stresses. In this context, Oryza rufipogon, Oryza nivara and their intermediate types wild rice germplasm/s collected from diverse agro-climatic regions would provide a rich repository of genes and alleles that could be utilized for rice improvement using genomics-assisted breeding...
Muhammad Nabeel Asim, Muhammad Wasim, Muhammad Usman Ghani Khan, Waqar Mahmood
Biomedical information retrieval systems are becoming popular and complex due to the massive amount of ever-growing biomedical literature. Users are often unable to construct a precise and accurate query that clearly represents the intended information. Therefore, the query is expanded with terms or features that retrieve more relevant information. Selection of appropriate expansion terms plays a key role in improving the performance of the retrieval task. We propose document frequency chi-square, a newer version of chi-square in pseudo relevance feedback for term selection...
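To make the idea of chi-square term selection in pseudo relevance feedback concrete, here is a minimal sketch using the standard 2x2 chi-square statistic. It is not the document frequency chi-square variant the abstract proposes, whose definition is not given here. The function names, the toy documents and the choice of `k` are all assumptions for illustration.

```python
def chi_square(term, relevant_docs, other_docs):
    """Standard 2x2 chi-square between term presence and the feedback set.

    Each document is a set of tokens. `relevant_docs` are the top-ranked
    (pseudo-relevant) documents; `other_docs` are the rest of the collection.
    """
    a = sum(term in d for d in relevant_docs)   # relevant, contains term
    b = sum(term in d for d in other_docs)      # other, contains term
    c = len(relevant_docs) - a                  # relevant, lacks term
    d = len(other_docs) - b                     # other, lacks term
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def select_expansion_terms(query_terms, relevant_docs, other_docs, k=2):
    """Rank terms from the feedback documents and keep the top k."""
    candidates = set().union(*relevant_docs) - set(query_terms)
    ranked = sorted(
        candidates,
        key=lambda t: (-chi_square(t, relevant_docs, other_docs), t),
    )
    return ranked[:k]


# Toy collection (hypothetical data).
relevant = [{"brca1", "breast", "cancer", "mutation"}, {"brca1", "mutation", "risk"}]
other = [{"cancer", "therapy"}, {"diabetes", "risk"}, {"therapy", "insulin"}]

expansion = select_expansion_terms(["brca1"], relevant, other, k=2)
print(expansion)
```

Note that the plain statistic does not distinguish positive from negative association, so a practical system would likely also check that a candidate is over-represented in the feedback documents.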
Cheng Zeng, Weihua Zhan, Lei Deng
Annotating functional terms with individual domains is essential for understanding the functions of full-length proteins. We describe SDADB, a functional annotation database for structural domains. SDADB provides associations between gene ontology (GO) terms and SCOP domains calculated with an integrated framework. GO annotations are assigned probabilities of being correct, which are estimated with a Bayesian network by taking advantage of structural neighborhood mappings, SCOP-InterPro domain mapping information, position-specific scoring matrices (PSSMs) and sequence homolog features, with the most substantial contribution coming from high-coverage structure-based domain-protein mappings...
Huan Zhang, Asim Ali, Jianing Gao, Rongjun Ban, Xiaohua Jiang, Yuanwei Zhang, Qinghua Shi
PIWI-interacting RNAs (piRNAs) are essential for transcriptional and post-transcriptional regulation of transposons and coding genes in the germline. With the development of sequencing technologies, length variations of piRNAs have been identified in several species. However, the extent to which piRNA isoforms exist, and whether these isoforms are functionally distinct from canonical piRNAs, remain uncharacterized. Through data mining from 2154 datasets of small RNA sequencing data from four species (Homo sapiens, Mus musculus, Danio rerio and Drosophila melanogaster), we have identified 8 749 139 piRNA isoforms from 175 454 canonical piRNAs, and classified them on the basis of variations on 5' or 3' end via the alignment of isoforms with canonical sequence...
Yining Liu, Mingyu Luo, Zhaochen Jin, Min Zhao, Hong Qu
Leukemia is a group of cancers with increased numbers of immature or abnormal leucocytes that originated in the bone marrow and other blood-forming organs. The development of differentially diagnostic biomarkers for different subtypes largely depends on understanding the biological pathways and regulatory mechanisms associated with leukemia-implicated genes. Unfortunately, the leukemia-implicated genes that have been identified thus far are scattered among thousands of published studies, and no systematic summary of the differences between adult and childhood leukemia exists with regard to the causative genetic mutations and genetic mechanisms of the various subtypes...
Sangrak Lim, Jaewoo Kang
In this article, we describe our system for the CHEMPROT task of the BioCreative VI challenge. Although considerable research on the named entity recognition of genes and drugs has been conducted, there is limited research on extracting relationships between them. Extracting relations between chemical compounds and genes from the literature is an important element in pharmacological and clinical research. The CHEMPROT task of BioCreative VI aims to promote the development of text mining systems that can be used to automatically extract relationships between chemical compounds and genes...
Cong Pian, Guangle Zhang, Tengfei Tu, Xiangyu Ma, Fei Li
Long non-coding RNAs (lncRNAs) are endogenous molecules longer than 200 nucleotides that lack coding potential. LncRNAs that interact with microRNAs (miRNAs) are known as competing endogenous RNAs (ceRNAs) and have the ability to regulate the expression of target genes. The ceRNAs play an important role in the initiation and progression of various cancers. However, until now, there has been no database collecting experimentally verified human ceRNAs. We developed the LncCeRBase database, which encompasses 432 lncRNA-miRNA-mRNA interactions, including 130 lncRNAs, 214 miRNAs and 245 genes from 300 publications...
Nikita Gupta, Ajeet Singh, Shafaque Zahra, Shailesh Kumar
Transfer RNA-derived fragments (tRFs) represent a novel class of small RNAs (sRNAs) generated through endonucleolytic cleavage of both mature and precursor transfer RNAs (tRNAs). These 14-28 nt tRFs have been extensively studied in the animal kingdom but remain to be explored in plants. In this study, we introduce a database of plant tRFs named PtRFdb for the scientific community. We analyzed a total of 1344 sRNA sequencing datasets of 10 different plant species and identified a total of 5607 unique tRFs (758 tRF-1, 2269 tRF-3 and 2580 tRF-5), represented by 487 765 entries...
