Hadoop

https://www.readbyqxmd.com/read/29346410/an-evaluation-of-multi-probe-locality-sensitive-hashing-for-computing-similarities-over-web-scale-query-logs
#1
Graham Cormode, Anirban Dasgupta, Amit Goyal, Chi Hoon Lee
Many modern applications of AI such as web search, mobile browsing, image processing, and natural language processing rely on finding similar items from a large database of complex objects. Due to the very large scale of data involved (e.g., users' queries from commercial search engines), computing such near or nearest neighbors is a non-trivial task, as the computational cost grows significantly with the number of items. To address this challenge, we adopt Locality Sensitive Hashing (a.k.a. LSH) methods and evaluate four variants in a distributed computing environment (specifically, Hadoop)...
2018: PloS One
https://www.readbyqxmd.com/read/29342232/informational-and-linguistic-analysis-of-large-genomic-sequence-collections-via-efficient-hadoop-cluster-algorithms
#2
Umberto Ferraro Petrillo, Gianluca Roscigno, Giuseppe Cattaneo, Raffaele Giancarlo
Motivation: Information theoretic and compositional/linguistic analysis of genomes have a central role in bioinformatics, even more so since the associated methodologies are becoming very valuable also for epigenomic and meta-genomic studies. The kernel of those methods is based on the collection of k-mer statistics, i.e., how many times each k-mer in {A, C, G, T}^k occurs in a DNA sequence. Although this problem is computationally very simple and efficiently solvable on a conventional computer, the sheer amount of data available now in applications demands to resort to parallel and distributed computing...
January 12, 2018: Bioinformatics
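The k-mer statistics described in the abstract above can be illustrated with a minimal single-machine sketch written in a map/reduce style. This is not the authors' Hadoop implementation; the function names (`map_kmers`, `reduce_counts`, `count_kmers`) are hypothetical, and skipping windows that contain symbols outside {A, C, G, T} is an assumption made here for illustration.

```python
from collections import Counter


def map_kmers(sequence, k):
    """Map step: emit (k-mer, 1) for every length-k window of the sequence."""
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Assumption: windows with symbols outside {A, C, G, T} are skipped.
        if set(kmer) <= set("ACGT"):
            yield kmer, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per k-mer."""
    counts = Counter()
    for kmer, n in pairs:
        counts[kmer] += n
    return counts


def count_kmers(sequences, k):
    """Run the map step over all sequences, then reduce the emitted pairs."""
    pairs = (pair for s in sequences for pair in map_kmers(s, k))
    return reduce_counts(pairs)


if __name__ == "__main__":
    print(count_kmers(["ACGTAC", "ACNGT"], 2))
```

On a cluster, the map step would run per input split and the reduce step per k-mer key; here both run in one process purely to show the counting logic.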
https://www.readbyqxmd.com/read/29320579/emotional-modelling-and-classification-of-a-large-scale-collection-of-scene-images-in-a-cluster-environment
#3
Jianfang Cao, Yanfei Li, Yun Tian
The development of network technology and the popularization of image capturing devices have led to a rapid increase in the number of digital images available, and it is becoming increasingly difficult to identify a desired image from among the massive number of possible images. Images usually contain rich semantic information, and people usually understand images at a high semantic level. Therefore, achieving the ability to use advanced technology to identify the emotional semantics contained in images to enable emotional semantic image classification remains an urgent issue in various industries...
2018: PloS One
https://www.readbyqxmd.com/read/29297337/reconstructing-evolutionary-trees-in-parallel-for-massive-sequences
#4
Quan Zou, Shixiang Wan, Xiangxiang Zeng, Zhanshan Sam Ma
BACKGROUND: Building the evolutionary trees for massive unaligned DNA sequences is challenging and crucial. However, reconstructing an evolutionary tree for ultra-large sequences is hard. Massive multiple sequence alignment is also challenging and time/space consuming. Hadoop and Spark, both developed recently, offer new hope for classical computational biology problems. In this paper, we tried to solve the multiple sequence alignment and evolutionary reconstruction in parallel...
December 14, 2017: BMC Systems Biology
https://www.readbyqxmd.com/read/29295690/hadoop-mcc-efficient-multiple-compound-comparison-algorithm-using-hadoop
#5
Guan-Jie Hua, Che-Lun Hung, Chuan Yi Tang
In this paper, we propose a novel heterogeneous high performance computing method, named Hadoop-MCC, integrating Hadoop and GPU, to compare huge numbers of chemical structures efficiently. The proposed method gains high availability and fault tolerance from Hadoop, as Hadoop is used to scatter input data to GPU devices and gather the results from GPU devices. A comparison of LINGO is performed on each GPU device in parallel. According to the experimental results, the proposed method on multiple GPU devices can achieve better computational performance than the CUDA-MCC on a single GPU device...
January 2, 2018: Combinatorial Chemistry & High Throughput Screening
https://www.readbyqxmd.com/read/29194413/an-interface-for-biomedical-big-data-processing-on-the-tianhe-2-supercomputer
#6
Xi Yang, Chengkun Wu, Kai Lu, Lin Fang, Yong Zhang, Shengkang Li, Guixin Guo, YunFei Du
Big data, cloud computing, and high-performance computing (HPC) are on the verge of convergence. Cloud computing is already playing an active part in big data processing with the help of big data frameworks like Hadoop and Spark. The recent upsurge of high-performance computing in China provides extra possibilities and capacity to address the challenges associated with big data. In this paper, we propose Orion, a big data interface on the Tianhe-2 supercomputer, to enable big data applications to run on Tianhe-2 via a single command or a shell script...
December 1, 2017: Molecules: a Journal of Synthetic Chemistry and Natural Product Chemistry
https://www.readbyqxmd.com/read/29185792/metres-an-efficient-database-for-genomic-applications
#7
Jordi Vilaplana, Rui Alves, Francesc Solsona, Jordi Mateo, Ivan Teixidó, Marc Pifarré
MetReS (Metabolic Reconstruction Server) is a genomic database that is shared between two software applications that address important biological problems. Biblio-MetReS is a data-mining tool that enables the reconstruction of molecular networks based on automated text-mining analysis of published scientific literature. Homol-MetReS allows functional (re)annotation of proteomes, to properly identify both the individual proteins involved in the processes of interest and their function. The main goal of this work was to identify the areas where the performance of the MetReS database could be improved and to test whether this improvement would scale to larger datasets and more complex types of analysis...
November 29, 2017: Journal of Computational Biology: a Journal of Computational Molecular Cell Biology
https://www.readbyqxmd.com/read/29178837/vispa2-a-scalable-pipeline-for-high-throughput-identification-and-annotation-of-vector-integration-sites
#8
Giulio Spinozzi, Andrea Calabria, Stefano Brasca, Stefano Beretta, Ivan Merelli, Luciano Milanesi, Eugenio Montini
BACKGROUND: Bioinformatics tools designed to identify lentiviral or retroviral vector insertion sites in the genome of host cells are used to address the safety and long-term efficacy of hematopoietic stem cell gene therapy applications and to study the clonal dynamics of hematopoietic reconstitution. The increasing number of gene therapy clinical trials combined with the increasing amount of Next Generation Sequencing data, aimed at identifying integration sites, require both highly accurate and efficient computational software able to correctly process "big data" in a reasonable computational time...
November 25, 2017: BMC Bioinformatics
https://www.readbyqxmd.com/read/29068640/handling-data-skew-in-mapreduce-cluster-by-using-partition-tuning
#9
Yufei Gao, Yanjie Zhou, Bing Zhou, Lei Shi, Jiacai Zhang
The healthcare industry has generated large amounts of data, and analyzing these has emerged as an important problem in recent years. The MapReduce programming model has been successfully used for big data analytics. However, data skew invariably occurs in big data analytics and seriously affects efficiency. To overcome the data skew problem in MapReduce, we have in the past proposed a data processing algorithm called Partition Tuning-based Skew Handling (PTSH). In comparison with the one-stage partitioning strategy used in the traditional MapReduce model, PTSH uses a two-stage strategy and the partition tuning method to disperse key-value pairs in virtual partitions and recombines each partition in case of data skew...
2017: Journal of Healthcare Engineering
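The general idea of dispersing key-value pairs over virtual partitions and then recombining them to counter skew can be sketched as follows. This is only an illustration of that idea under stated assumptions, not the PTSH algorithm itself: the CRC32 hash, the greedy largest-first recombination, and all function names (`virtual_partition`, `assign_partitions`, `reducer_loads`) are hypothetical choices made here.

```python
import heapq
import zlib
from collections import Counter


def virtual_partition(key, num_virtual):
    """Stage 1: hash a key into one of many small virtual partitions."""
    return zlib.crc32(key.encode("utf-8")) % num_virtual


def assign_partitions(keys, num_virtual, num_reducers):
    """Stage 2: recombine virtual partitions onto reducers, largest first,
    always giving the next partition to the least-loaded reducer."""
    load = Counter(virtual_partition(k, num_virtual) for k in keys)
    heap = [(0, r) for r in range(num_reducers)]  # (current load, reducer id)
    heapq.heapify(heap)
    assignment = {}
    for vp, size in sorted(load.items(), key=lambda kv: (-kv[1], kv[0])):
        total, reducer = heapq.heappop(heap)
        assignment[vp] = reducer
        heapq.heappush(heap, (total + size, reducer))
    return assignment


def reducer_loads(keys, num_virtual, num_reducers):
    """Count how many records each reducer receives under the assignment."""
    assignment = assign_partitions(keys, num_virtual, num_reducers)
    loads = [0] * num_reducers
    for k in keys:
        loads[assignment[virtual_partition(k, num_virtual)]] += 1
    return loads


if __name__ == "__main__":
    sample = ["a"] * 6 + ["b", "c", "d", "e", "f", "g"]
    print(reducer_loads(sample, num_virtual=16, num_reducers=2))
```

Because all records with the same key land in the same virtual partition, this sketch cannot split a single hot key; it only balances how whole virtual partitions are grouped onto reducers.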
https://www.readbyqxmd.com/read/29065568/handling-data-skew-in-mapreduce-cluster-by-using-partition-tuning
#10
Yufei Gao, Yanjie Zhou, Bing Zhou, Lei Shi, Jiacai Zhang
The healthcare industry has generated large amounts of data, and analyzing these has emerged as an important problem in recent years. The MapReduce programming model has been successfully used for big data analytics. However, data skew invariably occurs in big data analytics and seriously affects efficiency. To overcome the data skew problem in MapReduce, we have in the past proposed a data processing algorithm called Partition Tuning-based Skew Handling (PTSH). In comparison with the one-stage partitioning strategy used in the traditional MapReduce model, PTSH uses a two-stage strategy and the partition tuning method to disperse key-value pairs in virtual partitions and recombines each partition in case of data skew...
2017: Journal of Healthcare Engineering
https://www.readbyqxmd.com/read/29060620/deepdeath-learning-to-predict-the-underlying-cause-of-death-with-big-data
#11
Hamid Reza Hassanzadeh, Ying Sha, May D Wang
Multiple cause-of-death data provides a valuable source of information that can be used to enhance health standards by predicting health related trajectories in societies with large populations. These data are often available in large quantities across U.S. states and require Big Data techniques to uncover complex hidden patterns. We design two different classes of models suitable for large-scale analysis of mortality data, a Hadoop-based ensemble of random forests trained over N-grams, and the DeepDeath, a deep classifier based on the recurrent neural network (RNN)...
July 2017: Conference Proceedings: Annual International Conference of the IEEE Engineering in Medicine and Biology Society
https://www.readbyqxmd.com/read/28961134/apriori-versions-based-on-mapreduce-for-mining-frequent-patterns-on-big-data
#12
Jose Maria Luna, Francisco Padillo, Mykola Pechenizkiy, Sebastian Ventura
Pattern mining is one of the most important tasks to extract meaningful and useful information from raw data. This task aims to extract item-sets that represent any type of homogeneity and regularity in data. Although many efficient algorithms have been developed in this regard, the growing interest in data has caused the performance of existing pattern mining techniques to drop. The goal of this paper is to propose new efficient pattern mining algorithms for big data. To this aim, a series of algorithms based on the MapReduce framework and the Hadoop open-source implementation have been proposed...
September 27, 2017: IEEE Transactions on Cybernetics
https://www.readbyqxmd.com/read/28945604/a-distributed-fuzzy-associative-classifier-for-big-data
#13
Armando Segatori, Alessio Bechini, Pietro Ducange, Francesco Marcelloni
Fuzzy associative classification has not been widely analyzed in the literature, although associative classifiers (ACs) have proved to be very effective in different real domain applications. The main reason is that learning fuzzy ACs is a very heavy task, especially when dealing with large datasets. To overcome this drawback, in this paper, we propose an efficient distributed fuzzy associative classification approach based on the MapReduce paradigm. The approach exploits a novel distributed discretizer based on fuzzy entropy for efficiently generating fuzzy partitions of the attributes...
September 19, 2017: IEEE Transactions on Cybernetics
https://www.readbyqxmd.com/read/28884169/cloud-engineering-principles-and-technology-enablers-for-medical-image-processing-as-a-service
#14
Shunxing Bao, Andrew J Plassard, Bennett A Landman, Aniruddha Gokhale
Traditional in-house, laboratory-based medical imaging studies use hierarchical data structures (e.g., NFS file stores) or databases (e.g., COINS, XNAT) for storage and retrieval. The resulting performance from these approaches is, however, impeded by standard network switches since they can saturate network bandwidth during transfer from storage to processing nodes for even moderate-sized studies. To that end, a cloud-based "medical image processing-as-a-service" offers promise in utilizing the ecosystem of Apache Hadoop, which is a flexible framework providing distributed, scalable, fault tolerant storage and parallel computational modules, and HBase, which is a NoSQL database built atop Hadoop's distributed file system...
April 2017: Proceedings of the IEEE International Conference on Cloud Engineering
https://www.readbyqxmd.com/read/28873323/survey-of-gene-splicing-algorithms-based-on-reads
#15
Xiuhua Si, Qian Wang, Lei Zhang, Ruo Wu, Jiquan Ma
Gene splicing is the process of assembling a large number of unordered short sequence fragments to the original genome sequence as accurately as possible. Several popular splicing algorithms based on reads are reviewed in this article, including reference genome algorithms and de novo splicing algorithms (Greedy-extension, Overlap-Layout-Consensus graph, De Bruijn graph). We also discuss a new splicing method based on the MapReduce strategy and Hadoop. By comparing these algorithms, some conclusions are drawn and some suggestions on gene splicing research are made...
September 5, 2017: Bioengineered
https://www.readbyqxmd.com/read/28861861/gate-monte-carlo-simulation-of-dose-distribution-using-mapreduce-in-a-cloud-computing-environment
#16
Yangchuan Liu, Yuguo Tang, Xin Gao
The GATE Monte Carlo simulation platform has good application prospects in treatment planning and quality assurance. However, accurate dose calculation using GATE is time consuming. The purpose of this study is to implement a novel cloud computing method for accurate GATE Monte Carlo simulation of dose distribution using MapReduce. An Amazon Machine Image installed with Hadoop and GATE is created to set up Hadoop clusters on Amazon Elastic Compute Cloud (EC2). Macros, the input files for GATE, are split into a number of self-contained sub-macros...
August 31, 2017: Australasian Physical & Engineering Sciences in Medicine
https://www.readbyqxmd.com/read/28822042/implementation-of-a-big-data-accessing-and-processing-platform-for-medical-records-in-cloud
#17
Chao-Tung Yang, Jung-Chun Liu, Shuo-Tsung Chen, Hsin-Wen Lu
Big Data analysis has become a key factor in being innovative and competitive. Along with population growth worldwide and the trend of population aging in developed countries, the rate of national medical care usage has been increasing. Because individual medical data are usually scattered across different institutions and their data formats vary, integrating these continually growing data is challenging. In order to have scalable load capacity for these data platforms, we must build them on a good platform architecture...
August 18, 2017: Journal of Medical Systems
https://www.readbyqxmd.com/read/28737699/efficient-retrieval-of-massive-ocean-remote-sensing-images-via-a-cloud-based-mean-shift-algorithm
#18
Mengzhao Yang, Wei Song, Haibin Mei
The rapid development of remote sensing (RS) technology has resulted in the proliferation of high-resolution images. There are challenges involved in not only storing large volumes of RS images but also in rapidly retrieving the images for ocean disaster analysis such as for storm surges and typhoon warnings. In this paper, we present an efficient retrieval of massive ocean RS images via a Cloud-based mean-shift algorithm. Distributed construction method via the pyramid model is proposed based on the maximum hierarchical layer algorithm and used to realize efficient storage structure of RS images on the Cloud platform...
July 23, 2017: Sensors
https://www.readbyqxmd.com/read/28736473/theoretical-and-empirical-comparison-of-big-data-image-processing-with-apache-hadoop-and-sun-grid-engine
#19
Shunxing Bao, Frederick D Weitendorf, Andrew J Plassard, Yuankai Huo, Aniruddha Gokhale, Bennett A Landman
The field of big data is generally concerned with the scale of processing at which traditional computational paradigms break down. In medical imaging, traditional large scale processing uses a cluster computer that combines a group of workstation nodes into a functional unit that is controlled by a job scheduler. Typically, a shared-storage network file system (NFS) is used to host imaging data. However, data transfer from storage to processing nodes can saturate network bandwidth when data is frequently uploaded/retrieved from the NFS, e...
February 11, 2017: Proceedings of SPIE
https://www.readbyqxmd.com/read/28661707/using-hadoop-mapreduce-for-parallel-genetic-algorithms-a-comparison-of-the-global-grid-and-island-models
#20
Filomena Ferrucci, Pasquale Salza, Federica Sarro
The need to improve the scalability of Genetic Algorithms (GAs) has motivated the research on Parallel Genetic Algorithms (PGAs), and different technologies and approaches have been used. Hadoop MapReduce represents one of the most mature technologies to develop parallel algorithms. Based on the fact that parallel algorithms introduce communication overhead, the aim of the present work is to understand if, and possibly when, the parallel GAs solutions using Hadoop MapReduce show better performance than sequential versions in terms of execution time...
June 29, 2017: Evolutionary Computation
