Read by QxMD
Keyword: Hadoop
https://www.readbyqxmd.com/read/28961134/apriori-versions-based-on-mapreduce-for-mining-frequent-patterns-on-big-data
#1
Jose Maria Luna, Francisco Padillo, Mykola Pechenizkiy, Sebastian Ventura
Pattern mining is one of the most important tasks for extracting meaningful and useful information from raw data. It aims to extract item-sets that represent any type of homogeneity and regularity in data. Although many efficient algorithms have been developed for this purpose, the growing scale of data has caused the performance of existing pattern mining techniques to drop. The goal of this paper is to propose new, efficient pattern mining algorithms for big data. To this aim, a series of algorithms based on the MapReduce framework and the Hadoop open-source implementation have been proposed...
September 27, 2017: IEEE Transactions on Cybernetics
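The paper's exact algorithms are not reproduced in the snippet above, but the core MapReduce pattern behind distributed Apriori-style support counting (mappers emit candidate itemsets per transaction split; a reducer aggregates counts and applies a minimum support threshold) can be sketched in plain Python. The function names and the local driver below are illustrative, not taken from the paper:

```python
from itertools import combinations
from collections import defaultdict

def map_phase(transactions, k):
    """Mapper: emit (candidate k-itemset, 1) for every k-subset of each transaction."""
    for tx in transactions:
        for itemset in combinations(sorted(tx), k):
            yield itemset, 1

def reduce_phase(pairs, min_support):
    """Reducer: sum counts per itemset and keep those meeting min_support."""
    counts = defaultdict(int)
    for itemset, c in pairs:
        counts[itemset] += c
    return {i: n for i, n in counts.items() if n >= min_support}

# Local driver simulating one MapReduce round over two input "splits"
splits = [
    [{"a", "b", "c"}, {"a", "b"}],
    [{"a", "c"}, {"b", "c"}, {"a", "b", "c"}],
]
pairs = [p for split in splits for p in map_phase(split, 2)]
frequent = reduce_phase(pairs, min_support=3)
```

On a real Hadoop cluster the shuffle between the two phases is handled by the framework; the local driver above simply concatenates the mapper output.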
https://www.readbyqxmd.com/read/28945604/a-distributed-fuzzy-associative-classifier-for-big-data
#2
Armando Segatori, Alessio Bechini, Pietro Ducange, Francesco Marcelloni
Fuzzy associative classification has not been widely analyzed in the literature, although associative classifiers (ACs) have proved to be very effective in different real domain applications. The main reason is that learning fuzzy ACs is a very heavy task, especially when dealing with large datasets. To overcome this drawback, in this paper, we propose an efficient distributed fuzzy associative classification approach based on the MapReduce paradigm. The approach exploits a novel distributed discretizer based on fuzzy entropy for efficiently generating fuzzy partitions of the attributes...
September 19, 2017: IEEE Transactions on Cybernetics
https://www.readbyqxmd.com/read/28884169/cloud-engineering-principles-and-technology-enablers-for-medical-image-processing-as-a-service
#3
Shunxing Bao, Andrew J Plassard, Bennett A Landman, Aniruddha Gokhale
Traditional in-house, laboratory-based medical imaging studies use hierarchical data structures (e.g., NFS file stores) or databases (e.g., COINS, XNAT) for storage and retrieval. The resulting performance from these approaches is, however, impeded by standard network switches since they can saturate network bandwidth during transfer from storage to processing nodes for even moderate-sized studies. To that end, a cloud-based "medical image processing-as-a-service" offers promise in utilizing the ecosystem of Apache Hadoop, which is a flexible framework providing distributed, scalable, fault tolerant storage and parallel computational modules, and HBase, which is a NoSQL database built atop Hadoop's distributed file system...
April 2017: Proc IEEE Int Conf Cloud Eng
https://www.readbyqxmd.com/read/28873323/survey-of-gene-splicing-algorithms-based-on-reads
#4
Xiuhua Si, Qian Wang, Lei Zhang, Ruo Wu, Jiquan Ma
Gene splicing is the process of assembling a large number of unordered short sequence fragments to the original genome sequence as accurately as possible. Several popular splicing algorithms based on reads are reviewed in this article, including reference genome algorithms and de novo splicing algorithms (Greedy-extension, Overlap-Layout-Consensus graph, De Bruijn graph). We also discuss a new splicing method based on the MapReduce strategy and Hadoop. By comparing these algorithms, some conclusions are drawn and some suggestions on gene splicing research are made...
September 5, 2017: Bioengineered
https://www.readbyqxmd.com/read/28861861/gate-monte-carlo-simulation-of-dose-distribution-using-mapreduce-in-a-cloud-computing-environment
#5
Yangchuan Liu, Yuguo Tang, Xin Gao
The GATE Monte Carlo simulation platform has good application prospects for treatment planning and quality assurance. However, accurate dose calculation using GATE is time-consuming. The purpose of this study is to implement a novel cloud computing method for accurate GATE Monte Carlo simulation of dose distribution using MapReduce. An Amazon Machine Image with Hadoop and GATE installed is created to set up Hadoop clusters on Amazon Elastic Compute Cloud (EC2). Macros, the input files for GATE, are split into a number of self-contained sub-macros...
August 31, 2017: Australasian Physical & Engineering Sciences in Medicine
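The splitting step the abstract describes, dividing a simulation into self-contained sub-macros whose results are later merged, follows a scatter/gather pattern. A minimal sketch with hypothetical helper names (not from the paper) might look like:

```python
from collections import defaultdict

def split_primaries(total_particles, n_jobs):
    """Scatter: divide a simulation's particle count into near-equal,
    self-contained sub-jobs (one per sub-macro)."""
    base, extra = divmod(total_particles, n_jobs)
    return [base + (1 if i < extra else 0) for i in range(n_jobs)]

def merge_dose_maps(dose_maps):
    """Gather/reduce: sum per-voxel dose from independent sub-simulations."""
    merged = defaultdict(float)
    for dose_map in dose_maps:
        for voxel, dose in dose_map.items():
            merged[voxel] += dose
    return dict(merged)

# 10M primaries split across 3 workers; the counts must sum to the original total
jobs = split_primaries(10_000_000, 3)
```

Because Monte Carlo histories are statistically independent, the sub-simulations can run on separate Hadoop nodes and their dose maps can simply be summed.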
https://www.readbyqxmd.com/read/28822042/implementation-of-a-big-data-accessing-and-processing-platform-for-medical-records-in-cloud
#6
Chao-Tung Yang, Jung-Chun Liu, Shuo-Tsung Chen, Hsin-Wen Lu
Big Data analysis has become a key factor in being innovative and competitive. With worldwide population growth and aging populations in developed countries, national medical care usage has been rising. Because individual medical data are usually scattered across different institutions in varied formats, integrating these ever-growing data is challenging. For these data platforms to have scalable load capacity, they must be built on a sound architecture...
August 18, 2017: Journal of Medical Systems
https://www.readbyqxmd.com/read/28737699/efficient-retrieval-of-massive-ocean-remote-sensing-images-via-a-cloud-based-mean-shift-algorithm
#7
Mengzhao Yang, Wei Song, Haibin Mei
The rapid development of remote sensing (RS) technology has resulted in a proliferation of high-resolution images. Challenges arise not only in storing large volumes of RS images but also in rapidly retrieving them for ocean disaster analysis, such as storm surge and typhoon warnings. In this paper, we present an efficient retrieval method for massive ocean RS images via a Cloud-based mean-shift algorithm. A distributed construction method using the pyramid model, based on the maximum hierarchical layer algorithm, is proposed and used to realize an efficient storage structure for RS images on the Cloud platform...
July 23, 2017: Sensors
https://www.readbyqxmd.com/read/28736473/theoretical-and-empirical-comparison-of-big-data-image-processing-with-apache-hadoop-and-sun-grid-engine
#8
Shunxing Bao, Frederick D Weitendorf, Andrew J Plassard, Yuankai Huo, Aniruddha Gokhale, Bennett A Landman
The field of big data is generally concerned with the scale of processing at which traditional computational paradigms break down. In medical imaging, traditional large scale processing uses a cluster computer that combines a group of workstation nodes into a functional unit that is controlled by a job scheduler. Typically, a shared-storage network file system (NFS) is used to host imaging data. However, data transfer from storage to processing nodes can saturate network bandwidth when data is frequently uploaded/retrieved from the NFS, e...
February 11, 2017: Proceedings of SPIE
https://www.readbyqxmd.com/read/28661707/using-hadoop-mapreduce-for-parallel-genetic-algorithms-a-comparison-of-the-global-grid-and-island-models
#9
Filomena Ferrucci, Pasquale Salza, Federica Sarro
The need to improve the scalability of Genetic Algorithms (GAs) has motivated research on Parallel Genetic Algorithms (PGAs), and different technologies and approaches have been used. Hadoop MapReduce represents one of the most mature technologies for developing parallel algorithms. Because parallel algorithms introduce communication overhead, the aim of the present work is to understand if, and possibly when, parallel GA solutions using Hadoop MapReduce show better performance than sequential versions in terms of execution time...
June 29, 2017: Evolutionary Computation
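Of the models the title mentions, the island model partitions the population into subpopulations that evolve independently and periodically exchange individuals. A minimal ring-migration step, illustrative only and not the paper's implementation, could look like:

```python
def migrate(islands, fitness, k=1):
    """Ring migration for an island-model GA: each island sends its k fittest
    individuals to the next island, which drops its k least fit to make room."""
    n = len(islands)
    bests = [sorted(isl, key=fitness, reverse=True)[:k] for isl in islands]
    migrated = []
    for i, isl in enumerate(islands):
        kept = sorted(isl, key=fitness, reverse=True)[: len(isl) - k]
        migrated.append(kept + bests[(i - 1) % n])
    return migrated

# Three islands of integer "individuals", fitness = identity
islands = migrate([[1, 2, 3], [7, 8, 9], [4, 5, 6]], fitness=lambda x: x)
```

In a Hadoop setting such as the one studied here, each island would typically be processed by one mapper, with migration performed between MapReduce rounds.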
https://www.readbyqxmd.com/read/28655296/sparkblast-scalable-blast-processing-using-in-memory-operations
#10
Marcelo Rodrigo de Castro, Catherine Dos Santos Tostes, Alberto M R Dávila, Hermes Senger, Fabricio A B da Silva
BACKGROUND: The demand for processing ever increasing amounts of genomic data has raised new challenges for the implementation of highly scalable and efficient computational systems. In this paper we propose SparkBLAST, a parallelization of a sequence alignment application (BLAST) that employs cloud computing for the provisioning of computational resources and Apache Spark as the coordination framework. As a proof of concept, some radionuclide-resistant bacterial genomes were selected for similarity analysis...
June 27, 2017: BMC Bioinformatics
https://www.readbyqxmd.com/read/28610458/large-scale-parallel-genome-assembler-over-cloud-computing-environment
#11
Arghya Kusum Das, Praveen Kumar Koppa, Sayan Goswami, Richard Platania, Seung-Jong Park
The size of high throughput DNA sequencing data has already reached the terabyte scale. To manage this huge volume of data, many downstream sequencing applications started using locality-based computing over different cloud infrastructures to take advantage of elastic (pay as you go) resources at a lower cost. However, the locality-based programming model (e.g. MapReduce) is relatively new. Consequently, developing scalable data-intensive bioinformatics applications using this model and understanding the hardware environment that these applications require for good performance, both require further research...
June 2017: Journal of Bioinformatics and Computational Biology
https://www.readbyqxmd.com/read/28475668/mardre-efficient-mapreduce-based-removal-of-duplicate-dna-reads-in-the-cloud
#12
Roberto R Expósito, Jorge Veiga, Jorge González-Domínguez, Juan Touriño
Summary: This article presents MarDRe, a de novo cloud-ready duplicate and near-duplicate removal tool that can process single- and paired-end reads from FASTQ/FASTA datasets. MarDRe takes advantage of the widely adopted MapReduce programming model to fully exploit Big Data technologies on cloud-based infrastructures. Written in Java to maximize cross-platform compatibility, MarDRe is built upon the open-source Apache Hadoop project, the most popular distributed computing framework for scalable Big Data processing...
September 1, 2017: Bioinformatics
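MarDRe's internals are not detailed in the snippet above, but duplicate-read removal maps naturally onto MapReduce: key each read by its sequence (or a sequence prefix, for near-duplicates), then keep one representative per cluster. A toy exact-duplicate version with illustrative names, not MarDRe's actual code:

```python
from collections import defaultdict

def map_reads(reads):
    """Mapper: key each read by its sequence (exact-duplicate clustering)."""
    for read_id, seq, quals in reads:
        yield seq, (read_id, quals)

def reduce_reads(pairs):
    """Reducer: within each cluster, keep the read with the highest mean quality."""
    clusters = defaultdict(list)
    for seq, rec in pairs:
        clusters[seq].append(rec)
    keep = []
    for seq, recs in clusters.items():
        best = max(recs, key=lambda r: sum(r[1]) / len(r[1]))
        keep.append((best[0], seq))
    return keep
```

The shuffle phase of a real Hadoop job would group the mapper output by sequence key automatically; the reducer then emits one surviving read per cluster.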
https://www.readbyqxmd.com/read/28423824/querying-archetype-based-electronic-health-records-using-hadoop-and-dewey-encoding-of-openehr-models
#13
Erik Sundvall, Fang Wei-Kleiner, Sergio M Freire, Patrick Lambrix
Archetype-based Electronic Health Record (EHR) systems using generic reference models from, e.g., openEHR, ISO 13606, or CIMI should be easy to update and reconfigure with new types (or versions) of data models or entries, ideally with very limited programming or manual database tweaking. Exploratory research (e.g., epidemiology) leading to ad-hoc querying on a population-wide scale can be a challenge in such environments. This publication describes the implementation and testing of an archetype-aware Dewey encoding optimization that can be used to produce such systems in environments supporting relational operations, e...
2017: Studies in Health Technology and Informatics
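Dewey encoding itself is simple to illustrate: each node's code is its parent's code plus a child index, so ancestor/descendant tests reduce to string-prefix tests that a relational engine or a Hadoop job can evaluate cheaply. A generic sketch, unrelated to the paper's openEHR specifics:

```python
def dewey_encode(tree, prefix=""):
    """Assign Dewey codes to a tree given as [(name, children), ...]: each
    node's code is its parent's code plus its 1-based child index."""
    codes = {}
    for i, (name, children) in enumerate(tree, start=1):
        code = f"{prefix}.{i}" if prefix else f"{i}"
        codes[name] = code
        codes.update(dewey_encode(children, code))
    return codes

def is_ancestor(code_a, code_b):
    """Node a is a proper ancestor of node b iff a's code is a dot-prefix of b's."""
    return code_b.startswith(code_a + ".")

# A tiny record hierarchy: an EHR containing an observation (with a nested
# blood-pressure element) and a medication entry
codes = dewey_encode([("ehr", [("obs", [("bp", [])]), ("med", [])])])
```

The prefix property is what makes ad-hoc hierarchical queries expressible as plain string comparisons over a flat table of (code, value) rows.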
https://www.readbyqxmd.com/read/28358893/halvade-rna-parallel-variant-calling-from-transcriptomic-data-using-mapreduce
#14
Dries Decap, Joke Reumers, Charlotte Herzeel, Pascal Costanza, Jan Fostier
Given the current cost-effectiveness of next-generation sequencing, the amount of DNA-seq and RNA-seq data generated is ever increasing. One of the primary objectives of NGS experiments is calling genetic variants. While highly accurate, most variant calling pipelines are not optimized to run efficiently on large data sets. However, as variant calling in genomic data has become common practice, several methods have been proposed to reduce runtime for DNA-seq analysis through the use of parallel computing. Determining the effectively expressed variants from transcriptomics (RNA-seq) data has only recently become possible, and as such does not yet benefit from efficiently parallelized workflows...
2017: PloS One
https://www.readbyqxmd.com/read/28317049/programming-and-runtime-support-to-blaze-fpga-accelerator-deployment-at-datacenter-scale
#15
Muhuan Huang, Di Wu, Cody Hao Yu, Zhenman Fang, Matteo Interlandi, Tyson Condie, Jason Cong
With the end of CPU core scaling due to dark silicon limitations, customized accelerators on FPGAs have gained increased attention in modern datacenters due to their low power, high performance, and energy efficiency. Evidenced by Microsoft's FPGA deployment in its Bing search engine and Intel's $16.7 billion acquisition of Altera, integrating FPGAs into datacenters is considered one of the most promising approaches to sustaining future datacenter growth. However, it is quite challenging for existing big data computing systems, such as Apache Spark and Hadoop, to access the performance and energy benefits of FPGA accelerators...
October 2016: Proceedings of the ... ACM Symposium on Cloud Computing [electronic Resource]: SOCC ... ... SoCC (Conference)
https://www.readbyqxmd.com/read/28316653/large-scale-virtual-screening-on-public-cloud-resources-with-apache-spark
#16
Marco Capuccini, Laeeq Ahmed, Wesley Schaal, Erwin Laure, Ola Spjuth
BACKGROUND: Structure-based virtual screening is an in-silico method for screening a target receptor against a virtual molecular library. Applying docking-based screening to large molecular libraries can be computationally expensive; however, it constitutes a trivially parallelizable task. Most of the available parallel implementations are based on the message passing interface, relying on low-failure-rate hardware and fast network connections. Google's MapReduce revolutionized large-scale analysis, enabling the processing of massive datasets on commodity hardware and cloud resources, providing transparent scalability and fault tolerance at the software level...
2017: Journal of Cheminformatics
https://www.readbyqxmd.com/read/28243601/mapreduce-algorithms-for-inferring-gene-regulatory-networks-from-time-series-microarray-data-using-an-information-theoretic-approach
#17
Yasser Abduallah, Turki Turki, Kevin Byron, Zongxuan Du, Miguel Cervantes-Cervantes, Jason T L Wang
Gene regulation is a series of processes that control gene expression and its extent. The connections among genes and their regulatory molecules, usually transcription factors, and a descriptive model of such connections are known as gene regulatory networks (GRNs). Elucidating GRNs is crucial to understand the inner workings of the cell and the complexity of gene interactions. To date, numerous algorithms have been developed to infer gene regulatory networks. However, as the number of identified genes increases and the complexity of their interactions is uncovered, networks and their regulatory mechanisms become cumbersome to test...
2017: BioMed Research International
https://www.readbyqxmd.com/read/28208684/a-real-time-high-performance-computation-architecture-for-multiple-moving-target-tracking-based-on-wide-area-motion-imagery-via-cloud-and-graphic-processing-units
#18
Kui Liu, Sixiao Wei, Zhijiang Chen, Bin Jia, Genshe Chen, Haibin Ling, Carolyn Sheaff, Erik Blasch
This paper presents the first attempt at combining Cloud with Graphic Processing Units (GPUs) in a complementary manner within the framework of a real-time high-performance computation architecture for detecting and tracking multiple moving targets based on Wide Area Motion Imagery (WAMI). More specifically, the GPU and Cloud Moving Target Tracking (GC-MTT) system uses a front-end web-based server to handle the interaction with Hadoop and highly parallelized computation functions based on the Compute Unified Device Architecture (CUDA)...
February 12, 2017: Sensors
https://www.readbyqxmd.com/read/28110735/optimizing-r-with-sparkr-on-a-commodity-cluster-for-biomedical-research
#19
Martin Sedlmayr, Tobias Würfl, Christian Maier, Lothar Häberle, Peter Fasching, Hans-Ulrich Prokosch, Jan Christoph
BACKGROUND AND OBJECTIVES: Medical researchers are challenged today by the enormous amount of data collected in healthcare. Analysis methods such as genome-wide association studies (GWAS) are often computationally intensive and thus require enormous resources to be performed in a reasonable amount of time. While dedicated clusters and public clouds may deliver the desired performance, their use requires upfront financial efforts or anonymous data, which is often not possible for preliminary or occasional tasks...
December 2016: Computer Methods and Programs in Biomedicine
https://www.readbyqxmd.com/read/28093410/fastdoop-a-versatile-and-efficient-library-for-the-input-of-fasta-and-fastq-files-for-mapreduce-hadoop-bioinformatics-applications
#20
Umberto Ferraro Petrillo, Gianluca Roscigno, Giuseppe Cattaneo, Raffaele Giancarlo
Summary: MapReduce Hadoop bioinformatics applications require special-purpose routines to manage the input of sequence files. Unfortunately, the Hadoop framework does not provide any built-in support for the most popular sequence file formats, such as FASTA or BAM. Moreover, developing these routines is not easy, both because of the diversity of these formats and the need to efficiently manage sequence datasets that may contain billions of characters. We present FASTdoop, a generic Hadoop library for the management of FASTA and FASTQ files...
May 15, 2017: Bioinformatics
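The job such an input library performs per split, reassembling multi-line records so that each mapper receives whole sequences, can be illustrated with a simple FASTA record grouper in Python (FASTdoop itself is a Java Hadoop library; this sketch is only an analogy):

```python
def read_fasta(lines):
    """Group FASTA lines into (header, sequence) records: the task a custom
    Hadoop InputFormat/RecordReader performs for each input split."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)
```

The hard part a real library must also solve, not shown here, is handling records that straddle HDFS block boundaries so that no read is lost or duplicated across splits.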

Search Tips

Use Boolean operators AND/OR:

diabetic AND foot
diabetes OR diabetic

Exclude a word using the minus sign:

Virchow -triad

Use parentheses to group terms:

water AND (cup OR glass)

Add an asterisk (*) at the end of a word to include word stems:

Neuro* will match Neurology, Neuroscientist, Neurological, and so on

Use quotes to search for an exact phrase:

"primary prevention of cancer"

These can be combined:

(heart OR cardiac OR cardio*) AND arrest -"American Heart Association"