Hadoop

https://www.readbyqxmd.com/read/29194413/an-interface-for-biomedical-big-data-processing-on-the-tianhe-2-supercomputer
#1
Xi Yang, Chengkun Wu, Kai Lu, Lin Fang, Yong Zhang, Shengkang Li, Guixin Guo, YunFei Du
Big data, cloud computing, and high-performance computing (HPC) are on the verge of convergence. Cloud computing is already playing an active part in big data processing with the help of big data frameworks like Hadoop and Spark. The recent upsurge of high-performance computing in China provides extra possibilities and capacity to address the challenges associated with big data. In this paper, we propose Orion, a big data interface on the Tianhe-2 supercomputer, to enable big data applications to run on Tianhe-2 via a single command or a shell script...
December 1, 2017: Molecules: a Journal of Synthetic Chemistry and Natural Product Chemistry
https://www.readbyqxmd.com/read/29185792/metres-an-efficient-database-for-genomic-applications
#2
Jordi Vilaplana, Rui Alves, Francesc Solsona, Jordi Mateo, Ivan Teixidó, Marc Pifarré
MetReS (Metabolic Reconstruction Server) is a genomic database that is shared between two software applications that address important biological problems. Biblio-MetReS is a data-mining tool that enables the reconstruction of molecular networks based on automated text-mining analysis of published scientific literature. Homol-MetReS allows functional (re)annotation of proteomes, to properly identify both the individual proteins involved in the processes of interest and their function. The main goal of this work was to identify the areas where the performance of the MetReS database could be improved and to test whether this improvement would scale to larger datasets and more complex types of analysis...
November 29, 2017: Journal of Computational Biology: a Journal of Computational Molecular Cell Biology
https://www.readbyqxmd.com/read/29178837/vispa2-a-scalable-pipeline-for-high-throughput-identification-and-annotation-of-vector-integration-sites
#3
Giulio Spinozzi, Andrea Calabria, Stefano Brasca, Stefano Beretta, Ivan Merelli, Luciano Milanesi, Eugenio Montini
BACKGROUND: Bioinformatics tools designed to identify lentiviral or retroviral vector insertion sites in the genome of host cells are used to address the safety and long-term efficacy of hematopoietic stem cell gene therapy applications and to study the clonal dynamics of hematopoietic reconstitution. The increasing number of gene therapy clinical trials, combined with the increasing amount of Next Generation Sequencing data aimed at identifying integration sites, requires highly accurate and efficient computational software able to correctly process "big data" in a reasonable computational time...
November 25, 2017: BMC Bioinformatics
https://www.readbyqxmd.com/read/29068640/handling-data-skew-in-mapreduce-cluster-by-using-partition-tuning
#4
Yufei Gao, Yanjie Zhou, Bing Zhou, Lei Shi, Jiacai Zhang
The healthcare industry has generated large amounts of data, and analyzing these data has emerged as an important problem in recent years. The MapReduce programming model has been successfully used for big data analytics. However, data skew invariably occurs in big data analytics and seriously affects efficiency. To overcome the data skew problem in MapReduce, we previously proposed a data processing algorithm called Partition Tuning-based Skew Handling (PTSH). In comparison with the one-stage partitioning strategy used in the traditional MapReduce model, PTSH uses a two-stage strategy and the partition tuning method to disperse key-value pairs in virtual partitions and recombines each partition in case of data skew...
2017: Journal of Healthcare Engineering
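
The two-stage idea can be made concrete with a simplified sketch (this is not the PTSH algorithm itself; the sampling rate, the greedy planner and all function names below are assumptions for illustration): a first pass estimates key frequencies from a sample, and a second pass plans partition assignments so that heavy keys do not pile onto one reducer, with unseen keys falling back to hash partitioning.

    # Simplified two-stage, skew-aware partition planning (illustrative only; not PTSH).
    from collections import Counter
    import random

    def sample_key_frequencies(pairs, sample_rate=0.01):
        """Stage 1: estimate key frequencies from a random sample of key-value pairs."""
        return Counter(k for k, _ in pairs if random.random() < sample_rate)

    def plan_partitions(freq, num_reducers):
        """Stage 2: greedily assign keys (heaviest first) to the least-loaded reducer."""
        load = [0] * num_reducers
        plan = {}
        for key, count in freq.most_common():
            target = min(range(num_reducers), key=lambda r: load[r])
            plan[key] = target
            load[target] += count
        return plan

    def partition(key, plan, num_reducers):
        """Use the plan for sampled keys; fall back to hash partitioning otherwise."""
        return plan.get(key, hash(key) % num_reducers)

    if __name__ == "__main__":
        data = [("flu", 1)] * 90 + [("cold", 1)] * 5 + [("rash", 1)] * 5
        plan = plan_partitions(sample_key_frequencies(data, sample_rate=1.0), 3)
        print({k: partition(k, plan, 3) for k, _ in data})
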
https://www.readbyqxmd.com/read/29060620/deepdeath-learning-to-predict-the-underlying-cause-of-death-with-big-data
#6
Hamid Reza Hassanzadeh, Ying Sha, May D Wang
Multiple cause-of-death data provide a valuable source of information that can be used to enhance health standards by predicting health-related trajectories in societies with large populations. These data are often available in large quantities across U.S. states and require Big Data techniques to uncover complex hidden patterns. We design two different classes of models suitable for large-scale analysis of mortality data: a Hadoop-based ensemble of random forests trained over N-grams, and DeepDeath, a deep classifier based on recurrent neural networks (RNNs)...
July 2017: Conference Proceedings: Annual International Conference of the IEEE Engineering in Medicine and Biology Society
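
To make the N-gram representation mentioned above concrete, the sketch below (not the paper's pipeline; the ICD-style codes and helper names are made up) turns each record's sequence of cause-of-death codes into bag-of-N-gram counts, the kind of features a random-forest ensemble could be trained on.

    # Illustrative only: bag-of-N-gram features from cause-of-death code sequences.
    from collections import Counter

    def ngrams(codes, n):
        """Return all contiguous N-grams of a code sequence as tuples."""
        return [tuple(codes[i:i + n]) for i in range(len(codes) - n + 1)]

    def bag_of_ngrams(records, n=2):
        """Map each death record (a list of codes) to its N-gram counts."""
        return [Counter(ngrams(codes, n)) for codes in records]

    if __name__ == "__main__":
        records = [["I25", "I50", "J18"], ["E11", "N18", "I50"]]
        for features in bag_of_ngrams(records, n=2):
            print(dict(features))
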
https://www.readbyqxmd.com/read/28961134/apriori-versions-based-on-mapreduce-for-mining-frequent-patterns-on-big-data
#7
Jose Maria Luna, Francisco Padillo, Mykola Pechenizkiy, Sebastian Ventura
Pattern mining is one of the most important tasks for extracting meaningful and useful information from raw data. This task aims to extract item-sets that represent any type of homogeneity and regularity in data. Although many efficient algorithms have been developed in this regard, the growing interest in data has caused the performance of existing pattern mining techniques to drop. The goal of this paper is to propose new efficient pattern mining algorithms that work on big data. To this aim, a series of algorithms based on the MapReduce framework and the Hadoop open-source implementation have been proposed...
September 27, 2017: IEEE Transactions on Cybernetics
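
The abstract does not give implementation details, but the building block that Apriori-style MapReduce algorithms share is a counting pass over transactions. The Hadoop Streaming sketch below only illustrates that pass (the itemset size, support threshold, and the invocation in the comment are assumptions; a real Apriori pass would also restrict counting to candidates generated from the previous pass).

    #!/usr/bin/env python3
    # One Apriori-style counting pass as a Hadoop Streaming mapper/reducer (sketch).
    # Assumed invocation: hadoop jar hadoop-streaming.jar \
    #   -mapper "apriori_pass.py map" -reducer "apriori_pass.py reduce" ...
    import sys
    from itertools import combinations

    K = 2            # size of the itemsets counted in this pass (assumed)
    MIN_SUPPORT = 2  # absolute support threshold (assumed)

    def mapper():
        """Emit every K-item subset of each whitespace-separated transaction."""
        for line in sys.stdin:
            items = sorted(set(line.split()))
            for itemset in combinations(items, K):
                print(",".join(itemset) + "\t1")

    def reducer():
        """Sum counts per itemset and keep those above the support threshold."""
        current, total = None, 0
        def flush():
            if current is not None and total >= MIN_SUPPORT:
                print(f"{current}\t{total}")
        for line in sys.stdin:
            key, count = line.rstrip("\n").split("\t")
            if key != current:
                flush()
                current, total = key, 0
            total += int(count)
        flush()

    if __name__ == "__main__":
        mapper() if sys.argv[1:] == ["map"] else reducer()
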
https://www.readbyqxmd.com/read/28945604/a-distributed-fuzzy-associative-classifier-for-big-data
#8
Armando Segatori, Alessio Bechini, Pietro Ducange, Francesco Marcelloni
Fuzzy associative classification has not been widely analyzed in the literature, although associative classifiers (ACs) have proved to be very effective in a variety of real-world applications. The main reason is that learning fuzzy ACs is a very heavy task, especially when dealing with large datasets. To overcome this drawback, in this paper, we propose an efficient distributed fuzzy associative classification approach based on the MapReduce paradigm. The approach exploits a novel distributed discretizer based on fuzzy entropy for efficiently generating fuzzy partitions of the attributes...
September 19, 2017: IEEE Transactions on Cybernetics
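
For readers unfamiliar with fuzzy partitions, the toy sketch below shows what a triangular fuzzy partition of a numeric attribute looks like; it does not reproduce the paper's fuzzy-entropy-based discretizer, and the partition points in the example are arbitrary.

    # Triangular fuzzy partition of a numeric attribute (illustrative partition points).
    def triangular(x, a, b, c):
        """Membership of x in a triangular fuzzy set with feet a, c and peak b."""
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

    def fuzzy_partition(x, points):
        """Memberships of x in the fuzzy sets centred on consecutive partition points."""
        memberships = []
        for i in range(1, len(points) - 1):
            memberships.append(triangular(x, points[i - 1], points[i], points[i + 1]))
        return memberships

    if __name__ == "__main__":
        # Attribute range [0, 100] covered by three overlapping fuzzy sets.
        print(fuzzy_partition(42.0, [0.0, 0.0, 50.0, 100.0, 100.0]))
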
https://www.readbyqxmd.com/read/28884169/cloud-engineering-principles-and-technology-enablers-for-medical-image-processing-as-a-service
#9
Shunxing Bao, Andrew J Plassard, Bennett A Landman, Aniruddha Gokhale
Traditional in-house, laboratory-based medical imaging studies use hierarchical data structures (e.g., NFS file stores) or databases (e.g., COINS, XNAT) for storage and retrieval. The resulting performance from these approaches is, however, impeded by standard network switches since they can saturate network bandwidth during transfer from storage to processing nodes for even moderate-sized studies. To that end, a cloud-based "medical image processing-as-a-service" offers promise in utilizing the ecosystem of Apache Hadoop, which is a flexible framework providing distributed, scalable, fault tolerant storage and parallel computational modules, and HBase, which is a NoSQL database built atop Hadoop's distributed file system...
April 2017: Proceedings of the IEEE International Conference on Cloud Engineering
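
As a minimal sketch of the storage side only (the table name, column family, row-key scheme, and host below are assumptions, not the paper's schema), an image blob can be written to and read back from HBase with the happybase Python client.

    # Store and fetch an image blob in HBase via the Thrift gateway (happybase).
    import happybase

    def store_image(host, row_key, image_path):
        """Write the raw image bytes into an assumed 'medical_images' table."""
        connection = happybase.Connection(host)
        table = connection.table("medical_images")        # assumed table name
        with open(image_path, "rb") as f:
            table.put(row_key, {b"img:data": f.read()})   # assumed column family "img"
        connection.close()

    def load_image(host, row_key):
        """Read the image bytes back for a given row key."""
        connection = happybase.Connection(host)
        row = connection.table("medical_images").row(row_key)
        connection.close()
        return row.get(b"img:data")

    if __name__ == "__main__":
        store_image("hbase-master.example.org", b"subject-001/scan-01", "scan01.nii.gz")
        print(len(load_image("hbase-master.example.org", b"subject-001/scan-01")))
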
https://www.readbyqxmd.com/read/28873323/survey-of-gene-splicing-algorithms-based-on-reads
#10
Xiuhua Si, Qian Wang, Lei Zhang, Ruo Wu, Jiquan Ma
Gene splicing is the process of assembling a large number of unordered short sequence fragments into the original genome sequence as accurately as possible. Several popular splicing algorithms based on reads are reviewed in this article, including reference genome algorithms and de novo splicing algorithms (Greedy-extension, Overlap-Layout-Consensus graph, De Bruijn graph). We also discuss a new splicing method based on the MapReduce strategy and Hadoop. By comparing these algorithms, some conclusions are drawn and some suggestions on gene splicing research are made...
September 5, 2017: Bioengineered
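
To make the De Bruijn graph strategy mentioned above concrete, here is a toy construction (illustrative only, not any specific assembler): nodes are (k-1)-mers, and every k-mer in a read contributes an edge from its prefix to its suffix.

    # Toy De Bruijn graph built from short reads.
    from collections import defaultdict

    def de_bruijn(reads, k):
        """Map each (k-1)-mer prefix to the list of (k-1)-mer suffixes it connects to."""
        graph = defaultdict(list)
        for read in reads:
            for i in range(len(read) - k + 1):
                kmer = read[i:i + k]
                graph[kmer[:-1]].append(kmer[1:])
        return graph

    if __name__ == "__main__":
        reads = ["ACGTAC", "CGTACG", "GTACGT"]
        for prefix, suffixes in de_bruijn(reads, k=4).items():
            print(prefix, "->", suffixes)
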
https://www.readbyqxmd.com/read/28861861/gate-monte-carlo-simulation-of-dose-distribution-using-mapreduce-in-a-cloud-computing-environment
#11
Yangchuan Liu, Yuguo Tang, Xin Gao
The GATE Monte Carlo simulation platform has good application prospects in treatment planning and quality assurance. However, accurate dose calculation using GATE is time consuming. The purpose of this study is to implement a novel cloud computing method for accurate GATE Monte Carlo simulation of dose distribution using MapReduce. An Amazon Machine Image installed with Hadoop and GATE is created to set up Hadoop clusters on Amazon Elastic Compute Cloud (EC2). Macros, the input files for GATE, are split into a number of self-contained sub-macros...
August 31, 2017: Australasian Physical & Engineering Sciences in Medicine
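
The split-and-merge idea can be sketched without invoking GATE at all (the sketch below is a stand-in under stated assumptions: the map step only divides the particle budget into self-contained sub-simulations, and the reduce step sums the resulting dose grids, which is valid because dose deposition is additive).

    # Split/merge skeleton for distributed Monte Carlo dose calculation (no GATE calls).
    import numpy as np

    def split_primaries(total_primaries, num_jobs):
        """Map side: divide the particle budget into near-equal sub-simulation budgets."""
        base, rest = divmod(total_primaries, num_jobs)
        return [base + (1 if i < rest else 0) for i in range(num_jobs)]

    def merge_dose(partial_doses):
        """Reduce side: dose deposition is additive, so the 3-D grids are simply summed."""
        return np.sum(partial_doses, axis=0)

    if __name__ == "__main__":
        budgets = split_primaries(10_000_000, num_jobs=8)
        fake_partials = [np.random.rand(4, 4, 4) * b for b in budgets]  # stand-in results
        print(budgets)
        print(merge_dose(fake_partials).shape)
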
https://www.readbyqxmd.com/read/28822042/implementation-of-a-big-data-accessing-and-processing-platform-for-medical-records-in-cloud
#12
Chao-Tung Yang, Jung-Chun Liu, Shuo-Tsung Chen, Hsin-Wen Lu
Big Data analysis has become a key factor in being innovative and competitive. Along with worldwide population growth and the aging trend in developed countries, the rate of national medical care usage has been increasing. Because individual medical data are usually scattered across different institutions and stored in varied formats, integrating these ever-growing data is challenging. In order to give these data platforms scalable load capacity, they must be built on a sound platform architecture...
August 18, 2017: Journal of Medical Systems
https://www.readbyqxmd.com/read/28737699/efficient-retrieval-of-massive-ocean-remote-sensing-images-via-a-cloud-based-mean-shift-algorithm
#13
Mengzhao Yang, Wei Song, Haibin Mei
The rapid development of remote sensing (RS) technology has resulted in the proliferation of high-resolution images. There are challenges involved not only in storing large volumes of RS images but also in rapidly retrieving the images for ocean disaster analysis such as for storm surges and typhoon warnings. In this paper, we present an efficient retrieval of massive ocean RS images via a Cloud-based mean-shift algorithm. A distributed construction method via the pyramid model, based on the maximum hierarchical layer algorithm, is proposed and used to realize an efficient storage structure for RS images on the Cloud platform...
July 23, 2017: Sensors
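
For reference, the mean-shift building block itself, here a plain single-machine version with a flat kernel and synthetic 2-D data, looks like the sketch below; the paper's contribution is running this kind of computation distributed on a Cloud platform, which is not reproduced here.

    # Plain mean-shift with a flat kernel on synthetic data (single machine).
    import numpy as np

    def mean_shift(points, bandwidth, iterations=20):
        """Repeatedly shift every point to the mean of its neighbours within the bandwidth."""
        shifted = points.copy()
        for _ in range(iterations):
            for i, p in enumerate(shifted):
                neighbours = points[np.linalg.norm(points - p, axis=1) < bandwidth]
                shifted[i] = neighbours.mean(axis=0)
        return shifted

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        data = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
        modes = mean_shift(data, bandwidth=1.0)
        print(np.unique(np.round(modes, 1), axis=0))  # approximate cluster modes
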
https://www.readbyqxmd.com/read/28736473/theoretical-and-empirical-comparison-of-big-data-image-processing-with-apache-hadoop-and-sun-grid-engine
#14
Shunxing Bao, Frederick D Weitendorf, Andrew J Plassard, Yuankai Huo, Aniruddha Gokhale, Bennett A Landman
The field of big data is generally concerned with the scale of processing at which traditional computational paradigms break down. In medical imaging, traditional large scale processing uses a cluster computer that combines a group of workstation nodes into a functional unit that is controlled by a job scheduler. Typically, a shared-storage network file system (NFS) is used to host imaging data. However, data transfer from storage to processing nodes can saturate network bandwidth when data is frequently uploaded/retrieved from the NFS, e...
February 11, 2017: Proceedings of SPIE
https://www.readbyqxmd.com/read/28661707/using-hadoop-mapreduce-for-parallel-genetic-algorithms-a-comparison-of-the-global-grid-and-island-models
#15
Filomena Ferrucci, Pasquale Salza, Federica Sarro
The need to improve the scalability of Genetic Algorithms (GAs) has motivated the research on Parallel Genetic Algorithms (PGAs), and different technologies and approaches have been used. Hadoop MapReduce represents one of the most mature technologies for developing parallel algorithms. Since parallel algorithms introduce communication overhead, the aim of the present work is to understand if, and possibly when, parallel GA solutions using Hadoop MapReduce show better performance than sequential versions in terms of execution time...
June 29, 2017: Evolutionary Computation
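
A minimal sketch of the "global" (master-slave) parallelization model discussed in the paper: only fitness evaluation is distributed, here with a local multiprocessing pool standing in for the Hadoop MapReduce map phase, and a toy one-max fitness. All names and parameters here are illustrative assumptions.

    # Global-model parallel GA: distributed fitness evaluation, serial selection/variation.
    import random
    from multiprocessing import Pool

    def fitness(individual):
        """Toy one-max fitness: count the ones in the bit string."""
        return sum(individual)

    def evolve(pop_size=40, genome_len=32, generations=10):
        population = [[random.randint(0, 1) for _ in range(genome_len)] for _ in range(pop_size)]
        with Pool() as pool:
            for _ in range(generations):
                scores = pool.map(fitness, population)        # the distributed "map" step
                ranked = [ind for _, ind in sorted(zip(scores, population), reverse=True)]
                parents = ranked[: pop_size // 2]
                # One-point crossover refills the population on the master node.
                children = []
                while len(children) < pop_size - len(parents):
                    a, b = random.sample(parents, 2)
                    cut = random.randrange(1, genome_len)
                    children.append(a[:cut] + b[cut:])
                population = parents + children
        return max(population, key=fitness)

    if __name__ == "__main__":
        print(sum(evolve()))
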
https://www.readbyqxmd.com/read/28655296/sparkblast-scalable-blast-processing-using-in-memory-operations
#16
Marcelo Rodrigo de Castro, Catherine Dos Santos Tostes, Alberto M R Dávila, Hermes Senger, Fabricio A B da Silva
BACKGROUND: The demand for processing ever increasing amounts of genomic data has raised new challenges for the implementation of highly scalable and efficient computational systems. In this paper we propose SparkBLAST, a parallelization of a sequence alignment application (BLAST) that employs cloud computing for the provisioning of computational resources and Apache Spark as the coordination framework. As a proof of concept, some radionuclide-resistant bacterial genomes were selected for similarity analysis...
June 27, 2017: BMC Bioinformatics
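
A rough sketch of the coordination pattern only (not SparkBLAST itself): Spark partitions the query sequences and each partition is piped through a local BLAST binary via a subprocess. It assumes pyspark is available, that a blastn executable and a database called refdb exist on every worker, that blastn accepts the query on stdin, and it ignores FASTA records split across partition boundaries.

    # Partition-wise BLAST over Spark (coordination sketch; paths and db name assumed).
    import subprocess
    from pyspark import SparkContext

    def blast_partition(lines):
        """Feed one partition's FASTA lines to a local blastn process, return tabular hits."""
        fasta = "\n".join(lines)
        if not fasta.strip():
            return []
        result = subprocess.run(
            ["blastn", "-db", "refdb", "-query", "-", "-outfmt", "6"],
            input=fasta, capture_output=True, text=True, check=True)
        return result.stdout.splitlines()

    if __name__ == "__main__":
        sc = SparkContext(appName="blast-sketch")
        hits = sc.textFile("hdfs:///data/queries.fasta").mapPartitions(blast_partition)
        hits.saveAsTextFile("hdfs:///data/blast-hits")
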
https://www.readbyqxmd.com/read/28610458/large-scale-parallel-genome-assembler-over-cloud-computing-environment
#17
Arghya Kusum Das, Praveen Kumar Koppa, Sayan Goswami, Richard Platania, Seung-Jong Park
The size of high throughput DNA sequencing data has already reached the terabyte scale. To manage this huge volume of data, many downstream sequencing applications started using locality-based computing over different cloud infrastructures to take advantage of elastic (pay as you go) resources at a lower cost. However, the locality-based programming model (e.g. MapReduce) is relatively new. Consequently, developing scalable data-intensive bioinformatics applications using this model and understanding the hardware environment that these applications require for good performance, both require further research...
June 2017: Journal of Bioinformatics and Computational Biology
https://www.readbyqxmd.com/read/28475668/mardre-efficient-mapreduce-based-removal-of-duplicate-dna-reads-in-the-cloud
#18
Roberto R Expósito, Jorge Veiga, Jorge González-Domínguez, Juan Touriño
Summary: This article presents MarDRe, a de novo cloud-ready duplicate and near-duplicate removal tool that can process single- and paired-end reads from FASTQ/FASTA datasets. MarDRe takes advantage of the widely adopted MapReduce programming model to fully exploit Big Data technologies on cloud-based infrastructures. Written in Java to maximize cross-platform compatibility, MarDRe is built upon the open-source Apache Hadoop project, the most popular distributed computing framework for scalable Big Data processing...
September 1, 2017: Bioinformatics
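
The core of duplicate removal can be shown in a few lines (MarDRe itself adds near-duplicate clustering, paired-end support and the actual Hadoop job, none of which are reproduced here): grouping reads by their sequence, which is exactly what the MapReduce shuffle does when the sequence is used as the key, leaves one representative per duplicate group.

    # Exact-duplicate read removal by grouping on the read sequence (in-memory sketch).
    from collections import OrderedDict

    def parse_fastq(lines):
        """Yield (header, sequence, quality) records from FASTQ text lines."""
        it = iter(lines)
        for header in it:
            seq, _, qual = next(it), next(it), next(it)
            yield header.strip(), seq.strip(), qual.strip()

    def remove_duplicates(records):
        """Keep the first read seen for every distinct sequence (the 'reduce' step)."""
        unique = OrderedDict()
        for header, seq, qual in records:
            unique.setdefault(seq, (header, seq, qual))
        return list(unique.values())

    if __name__ == "__main__":
        fastq = ["@r1", "ACGT", "+", "IIII", "@r2", "ACGT", "+", "IIII", "@r3", "TTTT", "+", "IIII"]
        for header, seq, _ in remove_duplicates(parse_fastq(fastq)):
            print(header, seq)
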
https://www.readbyqxmd.com/read/28423824/querying-archetype-based-electronic-health-records-using-hadoop-and-dewey-encoding-of-openehr-models
#19
Erik Sundvall, Fang Wei-Kleiner, Sergio M Freire, Patrick Lambrix
Archetype-based Electronic Health Record (EHR) systems using generic reference models from e.g. openEHR, ISO 13606 or CIMI should be easy to update and reconfigure with new types (or versions) of data models or entries, ideally with very limited programming or manual database tweaking. Exploratory research (e.g. epidemiology) leading to ad-hoc querying on a population-wide scale can be a challenge in such environments. This publication describes implementation and test of an archetype-aware Dewey encoding optimization that can be used to produce such systems in environments supporting relational operations, e...
2017: Studies in Health Technology and Informatics
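
For readers unfamiliar with Dewey encoding, the toy sketch below (the nested entry and field names are invented, and the paper's archetype-aware optimization is not reproduced) assigns each node its parent's code plus its 1-based position, so ancestor/descendant tests become simple prefix tests on the codes.

    # Dewey encoding of a nested record: each node's code is parent code + position.
    def dewey_encode(node, prefix="1"):
        """Return {dewey_code: node_name} for a nested dict with optional 'children'."""
        codes = {prefix: node["name"]}
        for i, child in enumerate(node.get("children", []), start=1):
            codes.update(dewey_encode(child, f"{prefix}.{i}"))
        return codes

    if __name__ == "__main__":
        ehr_entry = {"name": "blood_pressure", "children": [
            {"name": "systolic"}, {"name": "diastolic"},
            {"name": "state", "children": [{"name": "position"}]}]}
        for code, name in sorted(dewey_encode(ehr_entry).items()):
            print(code, name)
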
https://www.readbyqxmd.com/read/28358893/halvade-rna-parallel-variant-calling-from-transcriptomic-data-using-mapreduce
#20
Dries Decap, Joke Reumers, Charlotte Herzeel, Pascal Costanza, Jan Fostier
Given the current cost-effectiveness of next-generation sequencing, the amount of DNA-seq and RNA-seq data generated is ever increasing. One of the primary objectives of NGS experiments is calling genetic variants. While highly accurate, most variant calling pipelines are not optimized to run efficiently on large data sets. However, as variant calling in genomic data has become common practice, several methods have been proposed to reduce runtime for DNA-seq analysis through the use of parallel computing. Determining the effectively expressed variants from transcriptomics (RNA-seq) data has only recently become possible, and as such does not yet benefit from efficiently parallelized workflows...
2017: PloS One