Ultra-scalable, low-memory genomic data analysis

In the life sciences, genomic data analysis faces ever-increasing data volumes, and many users struggle to find the best way to move their existing pipelines and tools to the cloud. Tools that work well on a single machine are often entirely unable to benefit from a cloud setting, or require advanced engineering to do so. This leaves scientists in a bind, since they would rather spend their time thinking about the biology than become engineers.

At JNP Solutions, we have been attacking this problem for biological sequence analysis, such as the analysis of next-generation sequencing (NGS) data and of large chromosomes. Our open source software Discount (Distributed Counting) is capable of analysing very large amounts of genomic data on Apache Spark (tested up to 10 TB, and expected to work up to at least petabyte scale). By building on original, peer-reviewed research, we were able to reduce the memory requirement of some analyses to 1/8 of that of the best competing tool, ensure smooth scalability, and reduce cost.

Our proprietary extensions build on the Discount technology to implement scaled-up replacements for well-known open source tools such as the KMC3 k-mer counter and the Kraken 1 and Kraken 2 taxonomic classifiers. In the case of KMC3, we can create massive distributed k-mer indexes that are 1/7 to 1/9 the size of KMC3 indexes on disk. In the case of Kraken 2, we can build reference libraries around 100x larger than what you would typically use on a single machine, which greatly enhances accuracy. Because our tools are based on Apache Spark and designed to work on distributed data, we are not constrained by the memory or disk space of any one machine: we can simply add more machines as data amounts grow. This was achieved by rethinking all methods from the ground up for a cloud setting.
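
For readers new to the term, k-mer counting means tallying every substring of length k that occurs in a set of sequences. The sketch below is a deliberately naive, single-machine illustration in plain Python. It is not Discount's algorithm or API, and the function name `count_kmers` and the toy reads are made up for this example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across a collection of reads.

    Illustrative only: the whole table lives in one process's memory.
    """
    counts = Counter()
    for read in reads:
        # Slide a window of width k along the read, one base at a time.
        # Reads shorter than k contribute nothing (the range is empty).
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Hypothetical toy input; real NGS data sets contain billions of reads.
reads = ["ACGTACGT", "CGTA"]
counts = count_kmers(reads, 3)
print(counts.most_common())
```

In this naive form the in-memory table grows with the number of distinct k-mers in the data, which is the kind of single-machine limit that a distributed approach is meant to remove.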

Our tools can run out of the box on all major cloud providers, on an in-house cluster, or on a single laptop. They support both batch usage and interactive notebooks. Furthermore, because our outputs can be made backwards compatible, our command-line tools can also be used as drop-in replacements for traditional tools in existing pipelines. This provides a migration path for researchers who are not sure how to benefit from the cloud. For researchers who already use the cloud, our algorithms and cloud-native code may provide a significant speed boost.

Whether you are curious about our existing tools or need new tools developed for your use case, contact us at info@jnpsolutions.io to schedule a demo today.
