ntCard: A streaming algorithm for cardinality estimation in genomics data.
Authors | Hamid Mohamadi, Hamza Khan & Inanc Birol |
---|---|
Abstract | Many bioinformatics algorithms are designed for the analysis of sequences of some uniform length, conventionally referred to as k-mers. These include de Bruijn graph assembly methods and sequence alignment tools. An efficient algorithm to enumerate the number of unique k-mers or, even better, to build a histogram of k-mer frequencies, would be desirable for these tools and their downstream analysis pipelines. Among other applications, estimated frequencies can be used to predict genome sizes, measure sequencing error rates, and tune runtime parameters for analysis tools. However, calculating a k-mer histogram from large volumes of sequencing data is a challenging task. |
Journal Name and Citation | Bioinformatics. 2017 Jan 5. pii: btw832. doi: 10.1093/bioinformatics/btw832. [Epub ahead of print] |
Date of Publication | 2017/01/05 |
Publication Link | https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btw832 |
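
To make the abstract's terms concrete, here is a minimal Python sketch of an exact k-mer frequency histogram. It is not ntCard's streaming estimator: it keeps every distinct k-mer in memory, which is the cost that makes the task hard on large sequencing datasets. The function name `kmer_histogram`, the toy reads, and the decision to skip k-mers with non-ACGT characters and to ignore reverse complements are all assumptions made for illustration.

```python
from collections import Counter


def kmer_histogram(reads, k):
    """Return an exact histogram mapping multiplicity -> number of distinct k-mers.

    Illustrative only: memory grows with the number of distinct k-mers,
    and reverse complements are treated as different k-mers.
    """
    counts = Counter()
    for read in reads:
        seq = read.upper()
        # Slide a window of length k across the read.
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Assumption: skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    # Collapse per-k-mer counts into a frequency histogram.
    return Counter(counts.values())


# Hypothetical toy input, not data from the paper.
reads = ["ACGTACGT", "CGTAC"]
hist = kmer_histogram(reads, 3)
distinct_kmers = sum(hist.values())
total_kmers = sum(mult * n for mult, n in hist.items())
print(dict(hist), distinct_kmers, total_kmers)
```

Here `hist[f]` is the number of distinct k-mers seen exactly `f` times, and the sum over all `f` gives the number of unique k-mers mentioned in the abstract.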