International Journal of Data Mining and Bioinformatics

Aron Henriksson
The scarcity of large labelled datasets comprising clinical text that can be exploited within the paradigm of supervised machine learning creates barriers for the secondary use of data from electronic health records. It is therefore important to develop capabilities to leverage the large amounts of unlabelled data that, indeed, tend to be readily available. One technique utilises distributional semantics to create word representations in a wholly unsupervised manner and uses existing training data to learn prototypical representations of predefined semantic categories...
2015: International Journal of Data Mining and Bioinformatics
Yan Wang, Tingting He, Xingpeng Jiang, Jie Yuan, Xianjun Shen
In this paper, we develop a novel regularisation method for MVAR via weighted fusion which considers the correlation among variables. In theory, we discuss the grouping effect of weighted fusion regularisation for linear models. By virtue of the probability method, we show that coefficients corresponding to highly correlated predictors have small differences. A quantitative estimate for such small differences is given regardless of the coefficients' signs. The estimate is also improved when considering the empirical approximation error if the model fits the data well...
2015: International Journal of Data Mining and Bioinformatics
Fatema Tuz Zohora, M Sohel Rahman
In this paper, an algorithm is proposed that detects the existence of a common ancestor gene sequence for non-overlapping transposition metric given two input DNA sequences. We consider two cases: fixed length transposition and all length transposition. For the first one, the algorithm has the time complexity of O(n^3), where n is the length of input sequences. In case of all length transposition, theoretical worst case time complexity of the algorithm is proven to be O(n^4). However, practically the worst case and the average case time complexity for all length transposition are found to be O(n^3) and O(n^2) respectively...
2015: International Journal of Data Mining and Bioinformatics
Benjamin Ulfenborg, Karin Klinga-Levan, Björn Olsson
In silico prediction of novel miRNAs from genomic sequences remains a challenging problem. This study presents a genome-wide miRNA discovery software package called GenoScan and evaluates two hairpin classification methods. These methods, one ensemble-based and one using logistic regression, were benchmarked along with 15 published methods. In addition, the sequence-folding step is addressed by investigating the impact of secondary structure prediction methods and the choice of input sequence length on prediction performance...
2015: International Journal of Data Mining and Bioinformatics
Xiaofeng Song, Lizhen Hu, Ping Han, Xuejiang Guo, Jiahao Sha
Rsp5, an E3 ligase conserved from yeast to mammals, plays a key role in diverse processes in yeast. However, many of Rsp5's substrates are still unclear. Therefore we propose an in silico method to recognise new substrates of Rsp5. To investigate the molecular determinants that affect the interaction between Rsp5 and its substrates, we have systematically analysed many features that may correlate with Rsp5 substrate recognition. It is found that the PPxY motif, transmembrane region, disorder region and N-linked glycosylation modification are the most important features for substrate recognition...
2015: International Journal of Data Mining and Bioinformatics
Stephan Richter, Ingo Fetzer, Martin Thullner, Florian Centler, Peter Dittrich
Knowledge of metabolic processes is collected in easily accessible online databases which are increasing rapidly in content and detail. Using these databases for the automatic construction of metabolic network models requires high accuracy and consistency. In this two-part study we evaluate current accuracy and consistency problems using the KEGG database as a prominent example and propose design principles for dealing with such problems. In the first half, we present our computational approach for classifying inconsistencies and provide an overview of the classes of inconsistencies we identified...
2015: International Journal of Data Mining and Bioinformatics
Nadia Ben Nsira, Thierry Lecroq, Mourad Elloumi
In the last decade, biology and medicine have undergone a fundamental change: next generation sequencing (NGS) technologies have made it possible to obtain genomic sequences very quickly and at low cost compared to the traditional Sanger method. These NGS technologies have thus made it possible to collect genomic sequences (genes, exomes or even full genomes) of individuals of the same species. These sequences are more than 99% identical. There is thus a strong need for efficient algorithms for indexing and performing fast pattern matching in such specific sets of sequences...
2015: International Journal of Data Mining and Bioinformatics
C Gunavathi, K Premalatha
The Cuckoo Search (CS) optimisation algorithm is used for feature selection in cancer classification using microarray gene expression data. Since gene expression data has thousands of genes and a small number of samples, feature selection methods can be used to select informative genes and improve classification accuracy. Initially, the genes are ranked based on T-statistics, Signal-to-Noise Ratio (SNR) and F-statistics values. CS is then used to find the informative genes among the top-m ranked genes...
2015: International Journal of Data Mining and Bioinformatics
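The ranking step mentioned in the Cuckoo Search abstract above can be illustrated with a minimal sketch. It assumes the SNR definition commonly used for two-class microarray data, (mean_a - mean_b) / (std_a + std_b); the abstract does not state the exact formula, and the gene names and expression values below are hypothetical toy data. The Cuckoo Search stage itself is not shown.

```python
from statistics import mean, stdev
from typing import Dict, List


def snr(class_a: List[float], class_b: List[float]) -> float:
    """Signal-to-noise ratio between two classes (assumed definition)."""
    denom = stdev(class_a) + stdev(class_b)
    if denom == 0:
        return 0.0
    return (mean(class_a) - mean(class_b)) / denom


def top_m_genes(expr: Dict[str, List[float]], labels: List[int], m: int) -> List[str]:
    """Rank genes by |SNR| and return the names of the top m."""
    scores = {}
    for gene, values in expr.items():
        a = [v for v, y in zip(values, labels) if y == 1]
        b = [v for v, y in zip(values, labels) if y == 0]
        scores[gene] = abs(snr(a, b))
    return sorted(scores, key=scores.get, reverse=True)[:m]


# Hypothetical toy data: three genes, six samples (three per class).
labels = [1, 1, 1, 0, 0, 0]
expr = {
    "g1": [5.0, 5.2, 4.8, 1.0, 1.1, 0.9],  # strongly separates the classes
    "g2": [2.0, 3.0, 1.0, 2.1, 2.9, 1.2],  # barely separates the classes
    "g3": [3.0, 3.5, 2.5, 2.0, 2.4, 1.6],  # moderate separation
}
ranked = top_m_genes(expr, labels, m=2)
print(ranked)
```

A search heuristic such as CS would then explore subsets of the returned top-m genes rather than all genes.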
Wael Zakaria, Yasser Kotb, Fayed F M Ghaleb
The MCR-Miner algorithm aims to mine all maximal high-confidence association rules from the microarray up/down-expressed genes data set. This paper introduces two new algorithms: IMCR-Miner and PMCR-Miner. The IMCR-Miner algorithm is an extension of the MCR-Miner algorithm with some improvements. These improvements implement a novel way to store the samples of each gene in a list of unsigned integers in order to benefit from bitwise operations. In addition, the IMCR-Miner algorithm overcomes the drawbacks faced by the MCR-Miner algorithm by setting some restrictions to ignore repeated comparisons...
2015: International Journal of Data Mining and Bioinformatics
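The idea of storing each gene's samples as unsigned integers, mentioned in the IMCR-Miner abstract above, can be sketched roughly as follows. This is not the authors' implementation: it is a generic bitmask illustration in which bit i is set when the gene is (say) up-expressed in sample i, so that the support of a gene set becomes a bitwise AND followed by a bit count. All function names and the toy data are hypothetical.

```python
from typing import Iterable, List


def pack_samples(sample_ids: Iterable[int]) -> int:
    """Pack the sample indices in which a gene is expressed into one integer."""
    mask = 0
    for s in sample_ids:
        mask |= 1 << s
    return mask


def support(masks: List[int]) -> int:
    """Number of samples shared by every gene in the set (AND, then popcount)."""
    if not masks:
        return 0
    common = masks[0]
    for m in masks[1:]:
        common &= m
    return bin(common).count("1")


def confidence(antecedent: List[int], consequent: List[int]) -> float:
    """Confidence of the rule antecedent -> consequent."""
    denom = support(antecedent)
    if denom == 0:
        return 0.0
    return support(antecedent + consequent) / denom


# Hypothetical toy data: samples in which each gene is up-expressed.
gene_a_up = pack_samples([0, 1, 2, 5])
gene_b_up = pack_samples([1, 2, 5, 7])

print(support([gene_a_up, gene_b_up]))       # samples 1, 2 and 5 are shared
print(confidence([gene_a_up], [gene_b_up]))  # shared samples / samples of gene a
```

Python integers are arbitrary precision, so one integer stands in for the list of fixed-width unsigned integers a lower-level implementation would use.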
Watshara Shoombuatong, Panuwat Mekha, Jeerayut Chaijaruwanich
Prediction of different classes within the human leukocyte antigen (HLA) gene family can provide insight into the human immune system and its response to viral pathogens. Therefore, it is desirable to develop an efficient and easily interpretable method for predicting HLA gene class compared to existing methods. We investigated the HLA gene prediction problem as follows: (a) establishing a dataset (HLA262) such that the sequence identity of the complete HLA dataset was reduced to 30%; (b) proposing a feature set of informative physicochemical properties that cooperate with SVM (named HLAPred) to achieve high accuracy and sensitivity (90...
2015: International Journal of Data Mining and Bioinformatics
Maryam Farhadian, Hossein Mahjub, Abbas Moghimbeigi, Paulo J G Lisboa, Jalal Poorolajal, Muharram Mansoorizadeh
Microarray technology allows simultaneous measurements of expression levels for thousands of genes. An important aspect of microarray studies is the prediction of patient survival based on gene expression profiles. This naturally calls for the use of a dimension reduction procedure together with the survival prediction model. In this study, a new method based on wavelet transform for survival-relevant gene selection is presented. The Cox proportional hazards model is typically used to build a prediction model for patients' survival using the selected genes...
2015: International Journal of Data Mining and Bioinformatics
Limin Li, Shuqin Zhang
The existence of confounders such as population structure in genome-wide association studies makes it difficult to apply machine learning methods directly to solve biological problems. It is still unclear how to effectively correct for confounders. In this work, we propose an Orthogonal Projection Correction (OPC) method to correct for confounders. This is achieved by orthogonally decomposing each feature into a confounding component and a non-confounding component, such that the original data can be best reconstructed by only the non-confounding components of features...
2015: International Journal of Data Mining and Bioinformatics
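The orthogonal decomposition mentioned in the OPC abstract above can be illustrated in its simplest form. The sketch assumes a single, already known confounder vector and splits one feature into its projection onto that vector and the orthogonal remainder. The abstract is truncated and describes choosing the decomposition so that the data are best reconstructed, which this sketch does not attempt; the function names and numbers are hypothetical.

```python
from typing import List, Tuple


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def decompose(feature: List[float], confounder: List[float]) -> Tuple[List[float], List[float]]:
    """Split a feature into (confounding part, non-confounding part).

    The confounding part is the orthogonal projection of the feature onto
    the confounder; the non-confounding part is what remains.
    """
    cc = dot(confounder, confounder)
    if cc == 0:
        return [0.0] * len(feature), list(feature)
    scale = dot(feature, confounder) / cc
    confounding = [scale * c for c in confounder]
    residual = [x - p for x, p in zip(feature, confounding)]
    return confounding, residual


# Hypothetical toy data: one feature over four samples, and a confounder
# encoding two sub-populations as +1 / -1.
feature = [2.0, 4.0, 6.0, 9.0]
confounder = [1.0, 1.0, -1.0, -1.0]
confounding_part, clean_part = decompose(feature, confounder)
print(clean_part)
```

By construction the non-confounding part has zero inner product with the confounder, and the two parts sum back to the original feature.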
Ali Katanforoush, Ehsan Mahdavi
MicroRNAs (miRNAs) are a class of short RNA molecules that regulate gene expression by binding directly to messenger RNAs. Conventional approaches to miRNA target prediction estimate the accessibility of target sites and the strength of the binding miRNA by finding optima of some energy models, which involves O(n^3) computations. Alternatively, we narrow down potential binding sites of miRNAs to suboptimal hits of a pairwise alignment algorithm called Fitting Alignment in O(n^2). We invoke the same algorithm once for all candidate sites to measure the site accessibilities...
2015: International Journal of Data Mining and Bioinformatics
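Fitting alignment, named in the miRNA target prediction abstract above, aligns a whole short sequence against the best-matching substring of a longer one. A minimal sketch of the generic dynamic program follows. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration and are not the scoring scheme of the paper, which would have to reflect base-pair complementarity and suboptimal hits.

```python
from typing import Tuple


def fitting_alignment_score(text: str, pattern: str,
                            match: int = 1, mismatch: int = -1,
                            gap: int = -1) -> Tuple[int, int]:
    """Best score of aligning all of `pattern` to some substring of `text`.

    Returns (score, end), where `end` is the exclusive end index in `text`
    of the best-scoring substring. Runs in O(len(text) * len(pattern)).
    """
    n, m = len(text), len(pattern)
    # Row 0: the pattern prefix is aligned against nothing but gaps.
    prev = [j * gap for j in range(m + 1)]
    best_score, best_end = prev[m], 0
    for i in range(1, n + 1):
        # Column 0 stays 0: the alignment may start anywhere in the text.
        curr = [0] * (m + 1)
        for j in range(1, m + 1):
            pair = match if text[i - 1] == pattern[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + pair,  # align the two characters
                          prev[j] + gap,       # text character against a gap
                          curr[j - 1] + gap)   # pattern character against a gap
        # The alignment may end anywhere in the text, so track the last column.
        if curr[m] > best_score:
            best_score, best_end = curr[m], i
        prev = curr
    return best_score, best_end


print(fitting_alignment_score("GGGACGTTT", "ACGT"))
```

Keeping only two rows of the table is enough when just the score and end position are needed; recovering the alignment itself would require the full table or a traceback structure.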
Shuliang Wang, Yiping Zhao
Existing computational frameworks use traditional similarity measures to find significant relationships in biological annotations, but their prerequisite that biological annotations do not co-occur with each other is restrictive. To overcome this, a new method, Improved Algorithm for Maximal Information Coefficient (IAMIC), is proposed in this paper to discover the hidden regularities between biological annotations. IAMIC approximates a novel similarity coefficient on maximal information coefficient with generality and equitability, by improving the axis partition through quadratic optimisation instead of brute-force search...
2015: International Journal of Data Mining and Bioinformatics
Tom Johnsten, Laura Fain, Leanna Fain, Ryan G Benton, Ethan Butler, Lewis Pannell, Ming Tan
Analysing and classifying sequences based on similarities and differences is a mathematical problem of escalating relevance and importance in many scientific disciplines. One of the primary challenges in applying machine learning algorithms to sequential data, such as biological sequences, is the extraction and representation of significant features from the data. To address this problem, we have recently developed a representation, entitled Multi-Layered Vector Spaces (MLVS), which is a simple mathematical model that maps sequences into a set of MLVS...
2015: International Journal of Data Mining and Bioinformatics
Yuan Zhang, Yue Cheng, Liang Ge, Nan Du, Kebin Jia, Aidong Zhang
Many clustering methods have been developed to identify functional modules in Protein-Protein Interaction (PPI) networks, but the results are far from satisfactory. To overcome the noise and incompleteness of PPI networks and find more accurate and stable functional modules, we propose an integrative method, bipartite graph-based Non-negative Matrix Factorisation method (BiNMF), in which we adopt multiple biological data sources as different views that describe PPIs. Specifically, traditional clustering models are adopted as preliminary analysis of different views of protein functional similarity...
2015: International Journal of Data Mining and Bioinformatics
Fei Han, Shanxiu Yang, Jian Guan
In this paper, a hybrid approach based on clustering and Particle Swarm Optimisation (PSO) is proposed to perform gene selection and classification for microarray data. In the new method, genes are first partitioned into a predetermined number of clusters by the K-means method. Since the genes in each cluster have much redundancy, the Max-Relevance Min-Redundancy (mRMR) strategy is used to reduce the redundancy of the clustered genes. Then, PSO is used to perform further gene selection from the remaining clustered genes...
2015: International Journal of Data Mining and Bioinformatics
Dengju Yao, Jing Yang, Xiaojuan Zhan, Xiaorong Zhan, Zhiqiang Xie
High-dimensional data and a large number of redundant features in bioinformatics research have created an urgent need for feature selection. In this paper, a novel random forests-based feature selection method is proposed that adopts the idea of stratifying feature space and combines generalised sequence backward searching and generalised sequence forward searching strategies. A random forest variable importance score is used to rank features, and different classifiers are used as a feature subset evaluating function...
2015: International Journal of Data Mining and Bioinformatics
Guangyu Cui, Byungmin Kim, Saud Alguwaizani, Kyungsook Han
The Gene Ontology (GO) has been used in estimating the semantic similarity of proteins since it has the largest and most reliable vocabulary of gene products and characteristics. We developed a new method which can assess Protein-Protein Interactions (PPI) using the branching factor and information content of the common ancestor of interacting proteins in the GO hierarchy. We performed a comparative evaluation of the measure with other GO-based similarity measures, and the evaluation results showed that our method outperformed others in most GO domains...
2015: International Journal of Data Mining and Bioinformatics
Su-Ping Deng, De-Shuang Huang
A method is developed to analyse the similarity among microbial communities in terms of functional state after assigning 16S rRNA sequences from all microbial communities to species. It is an important addition to the species-level relationship between two compared communities and can quantify their differences in function. We downloaded all functional annotation data of several microbiotas. The method identifies the functional distribution and the significantly enriched functional categories of microbial communities. We analysed the similarity between two microbial communities in terms of functional state...
2015: International Journal of Data Mining and Bioinformatics