Most recent papers in the journal Algorithms for Molecular Biology : AMB

#1

JOURNAL ARTICLE

Aaron Hong, Marco Oliva, Dominik Köppl, Hideo Bannai, Christina Boucher, Travis Gagie

FM-indexes are crucial data structures in DNA alignment, but searching with them usually takes at least one random access per character in the query pattern. Ferragina and Fischer [1] observed in 2007 that word-based indexes often use fewer random accesses than character-based indexes, and thus support faster searches. Since DNA lacks natural word-boundaries, however, it is necessary to parse it somehow before applying word-based FM-indexing. In 2022, Deng et al. [2] proposed parsing genomic data by induced suffix sorting, and showed that the resulting word-based FM-indexes support faster counting queries than standard FM-indexes when patterns are a few thousand characters or longer...

38600518

April 10, 2024: Algorithms for Molecular Biology: AMB

#2

JOURNAL ARTICLE

Space-efficient computation of k-mer dictionaries for large values of k.

Diego Díaz-Domínguez, Miika Leinonen, Leena Salmela

Computing k-mer frequencies in a collection of reads is a common procedure in many genomic applications. Several state-of-the-art k-mer counters rely on hash tables to carry out this task but they are often optimised for small k as a hash table keeping keys explicitly (i.e., k-mer sequences) takes <mml:math xmlns:mml="https://www.w3.org/1998/Math/MathML"><mml:mrow><mml:mi>O</mml:mi> <mml:mo>(</mml:mo> <mml:mi>N</mml:mi> <mml:mfrac><mml:mi>k</mml:mi> <mml:mi>w</mml:mi></mml:mfrac> <mml:mo>)</mml:mo></mml:mrow> </mml:math> computer words, N being the number of distinct k-mers and w the computer word size, which is impractical for long values of k...

38581000

April 5, 2024: Algorithms for Molecular Biology: AMB

#3

JOURNAL ARTICLE

Infrared: a declarative tree decomposition-powered framework for bioinformatics.

Hua-Ting Yao, Bertrand Marchand, Sarah J Berkemer, Yann Ponty, Sebastian Will

MOTIVATION: Many bioinformatics problems can be approached as optimization or controlled sampling tasks, and solved exactly and efficiently using Dynamic Programming (DP). However, such exact methods are typically tailored towards specific settings, complex to develop, and hard to implement and adapt to problem variations. METHODS: We introduce the Infrared framework to overcome such hindrances for a large class of problems. Its underlying paradigm is tailored toward problems that can be declaratively formalized as sparse feature networks, a generalization of constraint networks...

38493130

March 16, 2024: Algorithms for Molecular Biology: AMB

#4

JOURNAL ARTICLE

Median quartet tree search algorithms using optimal subtree prune and regraft.

Shayesteh Arasti, Siavash Mirarab

Gene trees can be different from the species tree due to biological processes and inference errors. One way to obtain a species tree is to find one that maximizes some measure of similarity to a set of gene trees. The number of shared quartets between a potential species tree and gene trees provides a statistically justifiable score; if maximized properly, it could result in a statistically consistent estimator of the species tree under several statistical models of discordance. However, finding the median quartet score tree, one that maximizes this score, is NP-Hard, motivating several existing heuristic algorithms...

38481327

March 13, 2024: Algorithms for Molecular Biology: AMB

#5

JOURNAL ARTICLE

Suffix sorting via matching statistics.

Zsuzsanna Lipták, Francesco Masillo, Simon J Puglisi

We introduce a new algorithm for constructing the generalized suffix array of a collection of highly similar strings. As a first step, we construct a compressed representation of the matching statistics of the collection with respect to a reference string. We then use this data structure to distribute suffixes into a partial order, and subsequently to speed up suffix comparisons to complete the generalized suffix array. Our experimental evidence with a prototype implementation (a tool we call sacamats) shows that on string collections with highly similar strings we can construct the suffix array in time competitive with or faster than the fastest available methods...

38475889

March 12, 2024: Algorithms for Molecular Biology: AMB

#6

JOURNAL ARTICLE

Finding maximal exact matches in graphs.

Nicola Rizzo, Manuel Cáceres, Veli Mäkinen

BACKGROUND: We study the problem of finding maximal exact matches (MEMs) between a query string Q and a labeled graph G. MEMs are an important class of seeds, often used in seed-chain-extend type of practical alignment methods because of their strong connections to classical metrics. A principled way to speed up chaining is to limit the number of MEMs by considering only MEMs of length at least <mml:math xmlns:mml="https://www.w3.org/1998/Math/MathML"><mml:mi>κ</mml:mi></mml:math> ( <mml:math xmlns:mml="https://www...

38468275

March 11, 2024: Algorithms for Molecular Biology: AMB

#7

JOURNAL ARTICLE

SparseRNAfolD: optimized sparse RNA pseudoknot-free folding with dangle consideration.

Mateo Gray, Sebastian Will, Hosna Jabbari

MOTIVATION: Computational RNA secondary structure prediction by free energy minimization is indispensable for analyzing structural RNAs and their interactions. These methods find the structure with the minimum free energy (MFE) among exponentially many possible structures and have a restrictive time and space complexity ( <mml:math xmlns:mml="https://www.w3.org/1998/Math/MathML"><mml:mrow><mml:mi>O</mml:mi> <mml:mo>(</mml:mo> <mml:msup><mml:mi>n</mml:mi> <mml:mn>3</mml:mn></mml:msup> <mml:mo>)</mml:mo></mml:mrow> </mml:math> time and <mml:math xmlns:mml="https://www...

38433200

March 3, 2024: Algorithms for Molecular Biology: AMB

#8

JOURNAL ARTICLE

Recombinations, chains and caps: resolving problems with the DCJ-indel model.

Leonard Bohnenkämper

One of the most fundamental problems in genome rearrangement studies is the (genomic) distance problem. It is typically formulated as finding the minimum number of rearrangements under a model that are needed to transform one genome into the other. A powerful multi-chromosomal model is the Double Cut and Join (DCJ) model.While the DCJ model is not able to deal with some situations that occur in practice, like duplicated or lost regions, it was extended over time to handle these cases. First, it was extended to the DCJ-indel model, solving the issue of lost markers...

38414060

February 27, 2024: Algorithms for Molecular Biology: AMB

#9

JOURNAL ARTICLE

Unifying duplication episode clustering and gene-species mapping inference.

Paweł Górecki, Natalia Rutecka, Agnieszka Mykowiecka, Jarosław Paszek

We present a novel problem, called MetaEC, which aims to infer gene-species assignments in a collection of partially leaf-labeled gene trees labels by minimizing the size of duplication episode clustering (EC). This problem is particularly relevant in metagenomics, where incomplete data often poses a challenge in the accurate reconstruction of gene histories. To solve MetaEC, we propose a polynomial time dynamic programming (DP) formulation that verifies the existence of a set of duplication episodes from a predefined set of episode candidates...

38355611

February 14, 2024: Algorithms for Molecular Biology: AMB

#10

JOURNAL ARTICLE

Global exact optimisations for chloroplast structural haplotype scaffolding.

Victor Epain, Rumen Andonov

BACKGROUND: Scaffolding is an intermediate stage of fragment assembly. It consists in orienting and ordering the contigs obtained by the assembly of the sequencing reads. In the general case, the problem has been largely studied with the use of distances data between the contigs. Here we focus on a dedicated scaffolding for the chloroplast genomes. As these genomes are small, circular and with few specific repeats, numerous approaches have been proposed to assemble them. However, their specificities have not been sufficiently exploited...

38321522

February 6, 2024: Algorithms for Molecular Biology: AMB

#11

JOURNAL ARTICLE

Predicting horizontal gene transfers with perfect transfer networks.

Alitzel López Sánchez, Manuel Lafond

BACKGROUND: Horizontal gene transfer inference approaches are usually based on gene sequences: parametric methods search for patterns that deviate from a particular genomic signature, while phylogenetic methods use sequences to reconstruct the gene and species trees. However, it is well-known that sequences have difficulty identifying ancient transfers since mutations have enough time to erase all evidence of such events. In this work, we ask whether character-based methods can predict gene transfers...

38321476

February 6, 2024: Algorithms for Molecular Biology: AMB

#12

JOURNAL ARTICLE

Co-linear chaining on pangenome graphs.

Jyotshna Rajput, Ghanshyam Chandra, Chirag Jain

Pangenome reference graphs are useful in genomics because they compactly represent the genetic diversity within a species, a capability that linear references lack. However, efficiently aligning sequences to these graphs with complex topology and cycles can be challenging. The seed-chain-extend based alignment algorithms use co-linear chaining as a standard technique to identify a good cluster of exact seed matches that can be combined to form an alignment. Recent works show how the co-linear chaining problem can be efficiently solved for acyclic pangenome graphs by exploiting their small width and how incorporating gap cost in the scoring function improves alignment accuracy...

38279113

January 27, 2024: Algorithms for Molecular Biology: AMB

#13

JOURNAL ARTICLE

Fulgor: a fast and compact k-mer index for large-scale matching and color queries.

Jason Fan, Jamshed Khan, Noor Pratap Singh, Giulio Ermanno Pibiri, Rob Patro

The problem of sequence identification or matching-determining the subset of reference sequences from a given collection that are likely to contain a short, queried nucleotide sequence-is relevant for many important tasks in Computational Biology, such as metagenomics and pangenome analysis. Due to the complex nature of such analyses and the large scale of the reference collections a resource-efficient solution to this problem is of utmost importance. This poses the threefold challenge of representing the reference collection with a data structure that is efficient to query, has light memory usage, and scales well to large collections...

38254124

January 22, 2024: Algorithms for Molecular Biology: AMB

#14

JOURNAL ARTICLE

Dollo-CDP: a polynomial-time algorithm for the clade-constrained large Dollo parsimony problem.

Junyan Dai, Tobias Rubel, Yunheng Han, Erin K Molloy

The last decade of phylogenetics has seen the development of many methods that leverage constraints plus dynamic programming. The goal of this algorithmic technique is to produce a phylogeny that is optimal with respect to some objective function and that lies within a constrained version of tree space. The popular species tree estimation method ASTRAL, for example, returns a tree that (1) maximizes the quartet score computed with respect to the input gene trees and that (2) draws its branches (bipartitions) from the input constraint set...

38191515

January 8, 2024: Algorithms for Molecular Biology: AMB

#15

JOURNAL ARTICLE

Investigating the complexity of the double distance problems.

Marília D V Braga, Leonie R Brockmann, Katharina Klerx, Jens Stoye

BACKGROUND: Two genomes [Formula: see text] and [Formula: see text] over the same set of gene families form a canonical pair when each of them has exactly one gene from each family. Denote by [Formula: see text] the number of common families of [Formula: see text] and [Formula: see text]. Different distances of canonical genomes can be derived from a structure called breakpoint graph, which represents the relation between the two given genomes as a collection of cycles of even length and paths...

38178195

January 4, 2024: Algorithms for Molecular Biology: AMB

#16

JOURNAL ARTICLE

EMMA: a new method for computing multiple sequence alignments given a constraint subset alignment.

Chengze Shen, Baqiao Liu, Kelly P Williams, Tandy Warnow

BACKGROUND: Adding sequences into an existing (possibly user-provided) alignment has multiple applications, including updating a large alignment with new data, adding sequences into a constraint alignment constructed using biological knowledge, or computing alignments in the presence of sequence length heterogeneity. Although this is a natural problem, only a few tools have been developed to use this information with high fidelity. RESULTS: We present EMMA (Extending Multiple alignments using MAFFT--add) for the problem of adding a set of unaligned sequences into a multiple sequence alignment (i...

38062452

December 7, 2023: Algorithms for Molecular Biology: AMB

#17

Correction: Constructing founder sets under allelic and non-allelic homologous recombination.

Konstantinn Bonnet, Tobias Marschall, Daniel Doerr

No abstract text is available yet for this article.

38057863

December 6, 2023: Algorithms for Molecular Biology: AMB

#18

JOURNAL ARTICLE

Automated design of dynamic programming schemes for RNA folding with pseudoknots.

Bertrand Marchand, Sebastian Will, Sarah J Berkemer, Yann Ponty, Laurent Bulteau

Although RNA secondary structure prediction is a textbook application of dynamic programming (DP) and routine task in RNA structure analysis, it remains challenging whenever pseudoknots come into play. Since the prediction of pseudoknotted structures by minimizing (realistically modelled) energy is NP-hard, specialized algorithms have been proposed for restricted conformation classes that capture the most frequently observed configurations. To achieve good performance, these methods rely on specific and carefully hand-crafted DP schemes...

38041153

December 1, 2023: Algorithms for Molecular Biology: AMB

#19

JOURNAL ARTICLE

Quartets enable statistically consistent estimation of cell lineage trees under an unbiased error and missingness model.

Yunheng Han, Erin K Molloy

Cancer progression and treatment can be informed by reconstructing its evolutionary history from tumor cells. Although many methods exist to estimate evolutionary trees (called phylogenies) from molecular sequences, traditional approaches assume the input data are error-free and the output tree is fully resolved. These assumptions are challenged in tumor phylogenetics because single-cell sequencing produces sparse, error-ridden data and because tumors evolve clonally. Here, we study the theoretical utility of methods based on quartets (four-leaf, unrooted phylogenetic trees) in light of these barriers...

38041123

December 1, 2023: Algorithms for Molecular Biology: AMB

#20

JOURNAL ARTICLE

New algorithms for structure informed genome rearrangement.

Eden Ozeri, Meirav Zehavi, Michal Ziv-Ukelson

We define two new computational problems in the domain of perfect genome rearrangements, and propose three algorithms to solve them. The rearrangement scenarios modeled by the problems consider Reversal and Block Interchange operations, and a PQ-tree is utilized to guide the allowed operations and to compute their weights. In the first problem, [Formula: see text] ([Formula: see text]), we define the basic structure-informed rearrangement measure. Here, we assume that the gene order members of the gene cluster from which the PQ-tree is constructed are permutations...

38037088

December 1, 2023: Algorithms for Molecular Biology: AMB

Use the journals feature with a free QxMD account.

Algorithms for Molecular Biology : AMB

Save your favorite articles in one place with a free QxMD account.

Read

Search Tips