This document summarizes challenges in assembling large DNA sequence data sets and strategies to address them.
1. The cost to generate DNA sequence data is decreasing rapidly, creating data sets too large for most computers to assemble. Hundreds to thousands of such data sets are generated each year.
2. Techniques like streaming compression and low-memory probabilistic data structures allow assembly memory usage to scale linearly with the sample size rather than the total data, enabling assembly of larger datasets.
3. Benchmarking different computational platforms revealed that while some platforms have faster processors, the ability to store large amounts of data locally is also important for assembly tasks. Scaling algorithms, rather than just optimizing code, is key to addressing