DNA as the data storage medium
By 2025 the volume of data produced in the world will have reached 250 zettabytes (1 zettabyte = 10^21 bytes). Current storage media have insufficient storage capacity or suffer from obsolescence. Preserving even a fraction of this data means finding a storage device with density and durability characteristics significantly superior to those of existing systems. The European OligoArchive project, launched in October 2019 for three years, proposes to use DNA (DeoxyriboNucleic Acid) as a storage medium. Raja Appuswamy, a researcher at EURECOM, a partner in the project, explains further.
In what global context did the European OligoArchive project come about?
Raja Appuswamy: Today, everything in our society is driven by data. If data is the oil that fuels the metaphorical AI vehicle, storage technologies are the cogs that keep the wheels spinning. For decades, we wanted fast storage devices that can quickly deliver data, and optical, magnetic, and solid-state storage technologies evolved to meet this requirement. As data-driven decision-making becomes a part of our society, we are increasingly faced with a new need: cheap, long-term storage devices that can safely store the collective knowledge we generate for hundreds or even thousands of years. Imagine you have a photograph that you would like to pass down to your great-great-grandchildren. Where would you store it? How much space would it take? How much energy would it use? How much would it cost? Would your storage media still be readable two generations from now? This is the context for project OligoArchive.
What is at stake in this project?
RA: Today, tape drives are the gold standard when it comes to data archival across all disciplines, from Hollywood movie archives to particle accelerator facilities. But tape media suffer from several fundamental limitations that make them unsuitable for long-term data storage. First, the storage density of tape (the amount of data you can store per inch) is improving at a rate of 30% annually; archival data, in contrast, is growing at 60% a year. Second, if one stores 1 PB in 100 tape drives today, within five years it would be possible to store the same data in just 25 drives. While this might sound like a good thing, using tape for archival storage implies constant data migration with each new generation of tape, and such migrations cost millions of dollars.
This problem is so acute that Hollywood movie archives have openly admitted that we are living in a dead period during which the productions of several independent artists will not be saved for the future! At the rate at which we are generating data for feeding our AI machinery, enterprises will soon be at this point. Thus, the storage industry as a whole has come to the realization that a radically new storage technology is required if we are to preserve data across generations.
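The growth figures quoted above can be checked with a few lines of compound-growth arithmetic. This is only a back-of-the-envelope sketch: the function names are illustrative, and it assumes both rates stay constant and compound yearly, which the interview does not state.

```python
def tapes_needed(initial_tapes: float, density_growth: float, years: int) -> float:
    """Tapes needed to hold the same data after `years` of density improvement."""
    return initial_tapes / (1 + density_growth) ** years


def demand_to_density_ratio(data_growth: float, density_growth: float, years: int) -> float:
    """How much faster data volume has grown than tape density after `years`."""
    return ((1 + data_growth) / (1 + density_growth)) ** years


# 100 tapes today, density improving 30% per year, five years later:
print(round(tapes_needed(100, 0.30, 5)))

# Archival data growing 60% per year against 30% density growth:
print(round(demand_to_density_ratio(0.60, 0.30, 5), 1))
```

Under these assumptions the same petabyte fits in roughly 27 tapes after five years, in line with the "just 25" quoted, while the amount of data to archive outpaces tape density by a factor of nearly three over the same period.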
What will be the advantages of the technology developed by OligoArchive?
RA: Project OligoArchive undertakes the ambitious goal of retasking DNA, a biological building block, to function as a radically new digital storage medium. DNA possesses three key properties that make it relevant for digital data storage. First, it is an extremely dense three-dimensional storage medium that has the theoretical ability to store 455 exabytes in 1 gram. The sum total of all data generated worldwide (the global datasphere) is projected to be 175 zettabytes by 2025. This could be stored in just under half a kilogram of DNA. Second, DNA can last several millennia, as demonstrated by experiments that have read the DNA of ancient, extinct animal species from fossils dating back thousands of years. If we can bring the woolly mammoth back to life from its DNA, we can store data in DNA for millennia. Third, the density of DNA is fixed by nature, and we will always have the ability and the need to read DNA: everything from archeology to precision medicine depends on it. Thus, DNA is an immortal storage medium that does not have the media obsolescence problem and hence can never become outdated, unlike other storage media (remember floppy disks?).
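The "half a kilogram" figure follows directly from the two numbers quoted above. A minimal sketch of the arithmetic, assuming decimal units (1 exabyte = 10^18 bytes, 1 zettabyte = 10^21 bytes) and taking the theoretical density at face value; the constant and function names are illustrative:

```python
EXABYTE = 10**18
ZETTABYTE = 10**21

# Figures quoted in the interview.
DNA_DENSITY_BYTES_PER_GRAM = 455 * EXABYTE
GLOBAL_DATASPHERE_BYTES = 175 * ZETTABYTE


def grams_of_dna(num_bytes: int, density: int = DNA_DENSITY_BYTES_PER_GRAM) -> float:
    """Mass of DNA needed to hold `num_bytes` at the given theoretical density."""
    return num_bytes / density


grams = grams_of_dna(GLOBAL_DATASPHERE_BYTES)
print(f"{grams:.0f} g")  # roughly 385 g
```

This is a theoretical lower bound. A practical system would need more DNA once redundancy for error correction and addressing is included.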
What expertise do EURECOM researchers bring?
The Data Science department at EURECOM is contributing to several aspects of this project. First, we are building on our deep expertise in storage systems to architect various aspects of using DNA as a storage medium, like developing solutions for implementing a block abstraction over DNA, or providing random access to data stored in DNA. Second, we are combining our expertise in data management and machine learning to develop novel, structure-aware encoding and decoding algorithms that can reliably store and retrieve data in DNA, even though the underlying biological tasks of synthesis (writing) and sequencing (reading) introduce several errors.
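To give a feel for what "encoding" means here, the following toy sketch maps every two bits of a file to one of the four nucleotides (A, C, G, T). It is not the project's algorithm: the mapping and the function names are assumptions made for illustration, and the sketch has no error correction, addressing, or biochemical constraints, all of which a real encoder must handle.

```python
BASES = "ACGT"  # hypothetical mapping: 00 -> A, 01 -> C, 10 -> G, 11 -> T


def encode(data: bytes) -> str:
    """Turn each byte into four nucleotides, most significant bits first."""
    strand = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            strand.append(BASES[(byte >> shift) & 0b11])
    return "".join(strand)


def decode(strand: str) -> bytes:
    """Inverse of encode(); expects a strand length that is a multiple of four."""
    if len(strand) % 4 != 0:
        raise ValueError("strand length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)  # ValueError on a non-ACGT symbol
        out.append(byte)
    return bytes(out)


strand = encode(b"Hi")
print(strand)          # CAGACGGC
print(decode(strand))  # b'Hi'
```

A single substituted, inserted, or deleted nucleotide would corrupt the decoded bytes in this naive scheme, which is why encoders that tolerate synthesis and sequencing errors are needed.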
Who are your partners and what are their respective contributions?
The consortium brings together a truly multi-disciplinary group of people with diverse expertise across Europe. The Institute of Molecular and Cellular Pharmacology (IPMC) in Sophia Antipolis, home to the largest sequencing facility in the PACA region, is a partner that contributes its biological expertise to the project. Our partners at I3S, CNRS, are working on new compression techniques customized for DNA storage that will drastically reduce the amount of DNA needed to store digital content. Our colleagues at Imperial College London (UK) are building on our work and pushing the envelope further by using DNA not just as a storage medium but as a computational substrate, showing that some SQL database operations that run in silico (on a CPU) today can be translated efficiently into in vitro biochemical reactions directly on DNA. Finally, we also have HelixWorks, a startup from Ireland that specializes in investigating novel enzymatic synthesis techniques for reducing the cost of generating DNA, as an industrial partner.
What results are expected and ultimately what will be the applications?
The ambitious end goal of the project is to build a DNA disk: a fully working end-to-end prototype that shows that DNA can indeed function as a replacement for current archival storage technology like tape. Application-wise, archival storage is a billion-dollar industry, and we believe that DNA is a fundamentally disruptive technology that has the potential to reshape this market. But we believe that our project will have an impact on areas beyond archival storage.
First, our work on DNA computation opens up an entirely new field of research on near-molecule data processing that mirrors the current trend of moving computation closer to data to avoid time-consuming data movement. Second, most of the models and tools we develop for DNA storage are actually applicable for analyzing genetic data in other contexts. For instance, the algorithm we are developing for reading data back from DNA provides a scalable solution for sequence clustering–a classic computational genomics problem with several applications. Thus, our work will also contribute to advances in computational genomics.
Learn more about OligoArchive