Heqian Zhang a, Jiaquan Huang a, Xiaoyu Wang a, Zhizeng Gao b, Song Meng cde, Hang Li f, Shanshan Zhou g, Shang Wang h, Shan Wang i, Xunyou Yan j, Xinwei Yang k, Xiaoluo Huang *l and Zhiwei Qin *a
aCenter for Biological Science and Technology, Advanced Institute of Natural Sciences, Beijing Normal University, Zhuhai, Guangdong 519087, China. E-mail: z.qin@bnu.edu.cn
bSchool of Marine Sciences, Sun Yat-sen University, Southern Laboratory of Ocean Science and Engineering, Zhuhai, Guangdong 519000, China
cState Key Laboratory of Drug Research & Natural Products Research Center, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, 201203, China
dUniversity of Chinese Academy of Sciences, Beijing, 100049, China
eZhongshan Institute for Drug Discovery, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Zhongshan, Guangdong 528400, China
fSchool of Pharmaceutical Sciences, Sun Yat-sen University, Guangzhou, Guangdong 510006, China
gState Key Laboratory of Microbial Resources, Institute of Microbiology, Chinese Academy of Sciences, Beijing, 100101, China
hCollege of Plant Science, Jilin University, Changchun, Jilin 130062, China
iState Key Laboratory of Microbial Technology, Shandong University, Qingdao, Shandong 266237, China
jCollege of Life Science, Langfang Normal University, Langfang, Hebei 065000, China
kSchool of Pharmaceutical Sciences, Zhengzhou University, Zhengzhou, Henan 450001, China
lShenzhen Key Laboratory of Synthetic Genomics, Guangdong Provincial Key Laboratory of Synthetic Genomics, Key Laboratory of Quantitative Synthetic Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong 518055, China. E-mail: huangxl@siat.ac.cn
First published on 14th October 2024
Microorganisms are valuable resources as antibiotic producers, biocontrol agents, and symbiotic agents in various ecosystems and organisms. Over the past decades, research laboratories worldwide have identified and generated a notably increasing number of wild-type and genetically modified microbial strains. However, a substantial portion of the information about these strains remains scattered across the scientific literature. To facilitate the work of future researchers, in this perspective article we advocate the adoption of the DNA-based natural language (DBNL) algorithm standard and demonstrate it using a Streptomyces species as a proof of concept. This standard allows valuable information encoded within a particular microbial species to be extracted through genome sequencing and subsequent decoding. In addition, it keeps such information accessible for continued research and applications even if a currently cultivated microbe can no longer be cultured in the future. Embracing the DBNL algorithm standard promises to enhance the efficiency and effectiveness of microbial research, paving the way for innovative solutions and discoveries in diverse fields.
The genetic landscape of microorganisms is characterized by its dynamism, shaped by myriad evolutionary forces, environmental exigencies, and stochastic occurrences. This complicated interplay dictates the preservation, loss, and acquisition of genetic traits within microbial populations over time, presenting formidable challenges to researchers. One significant hurdle arises from the inability to culture a substantial portion of microorganisms from diverse environments in laboratory settings.1,2 This prompts us to ask a pertinent question: can the microbes that are currently cultivable continue to be cultured in the distant future, for example, a century from now? In addition, since obtaining timely first-hand information on specific model organisms, whether wild-type or genetically modified, and unique microbial resources, is paramount, the following question then arises: where and how can such information be acquired?
Generally, the predominant research outcomes in the scientific community have relied heavily on extensive academic publications. However, the dissemination of such publications typically entails a relatively long time period. In light of recent advancements in digital technologies, a thought experiment has emerged: could the most crucial characteristics of any cultivated microbe be encoded and stored as DNA sequences within their live cells, with the ability to decode this information into a natural language readable by humans? This question takes on particular significance considering that such timely acquisition can now be achieved through straightforward sequencing and analysis – a technology that is no longer laborious and costly, but generally affordable and accessible.3
DNA digital data storage is the process of encoding binary data into synthesized strands of DNA and decoding it back.4 The synthesized DNA stores this information in its host cell. Compared to canonical methods, DNA storage offers numerous advantages, including high storage density, extended storage time, minimal footprint, and low maintenance costs. Moreover, advancements in gene recombination technology enable the encryption of information within DNA, embedding it into biological cells.5 Because biological cells have the ability to replicate, knowledge stored in organisms can be passed down from one generation to another. In addition, information contained within biological cells cannot be reached through the electronic systems that computer hackers typically exploit, which makes it potentially highly secure. These properties make DNA a very promising medium for specialized data storage applications, such as the storage of scientific knowledge. However, the lack of a clear pipeline or guidelines for realistic data storage hinders its future deployment.
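The encode and decode steps described above can be illustrated with a minimal sketch. It assumes a simple mapping of two bits per nucleotide (00→A, 01→C, 10→G, 11→T); this mapping and the function names `text_to_dna` and `dna_to_text` are hypothetical choices for illustration and are not the scheme used by the Wukong encoder mentioned later in this article.

```python
# Illustrative only: a naive 2-bit-per-base mapping, not the authors' encoder.
BASES = "ACGT"  # index 0..3 corresponds to bit pairs 00, 01, 10, 11


def text_to_dna(text: str) -> str:
    """Encode UTF-8 text as a nucleotide string, four bases per byte."""
    bases = []
    for byte in text.encode("utf-8"):
        # Read the byte two bits at a time, most significant pair first.
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases)


def dna_to_text(dna: str) -> str:
    """Decode a nucleotide string produced by text_to_dna back to text."""
    if len(dna) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    data = bytearray()
    for i in range(0, len(dna), 4):
        byte = 0
        for base in dna[i:i + 4]:
            # BASES.index raises ValueError for characters outside ACGT.
            byte = (byte << 2) | BASES.index(base)
        data.append(byte)
    return data.decode("utf-8")
```

A practical encoder would additionally have to respect biological constraints such as avoiding long homopolymer runs and extreme GC content, which this sketch ignores.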
Here we introduce the “DNA-based natural language” (DBNL), which is embedded in microbial cells on the basis of DNA digital data storage. A schematic is shown in Fig. 1. Our primary objective is to ensure the preservation of the key information of these bacteria when they are intentionally or unintentionally introduced into the environment. Simultaneously, we strive to provide valuable first-hand information for future researchers that is easily accessible through straightforward sequencing and analysis. The integration of DBNL into microbial systems addresses the urgent need for long-term data preservation, particularly as certain microbial strains become unculturable over time. It not only enhances data storage capabilities but also ensures the integrity and accessibility of essential biological information. By employing error-correction techniques and redundancy, DBNL offers a promising solution to the challenges faced in microbial research, facilitating better data retrieval and analysis in future studies. We anticipate that the embedded information will serve as a valuable resource, allowing researchers to learn of the discoveries of their predecessors, particularly from a metagenomic perspective. This dimension holds substantial reference value for future researchers.
To further illustrate this perspective, we present a case study using the recently discovered rhizosphere species Streptomyces ginsengnesis G7 (see following sections). This strain is notable for its biocontrol potential, as it produces natural products that serve as both herbicides and antibiotics. Therefore, in the context of this strain, we aim to summarize pertinent information, focusing on its secondary metabolite production, antifungal activities, and herbicidal utilities.
Error correction is another critical aspect when encoding DBNL for microorganisms. Current methods include inserting error correction codes and introducing data redundancy into the encoded DNA sequences. Classic error correction codes, such as the Hamming code9 and the Reed–Solomon code,10 can address single nucleotide errors to some extent, but they struggle with the replacement or loss of large fragments. For bacteria that replicate over extended periods, relying solely on an error correction code may not suffice. Therefore, to ensure high-fidelity data storage and retrieval in microorganisms, encoded DNA should be designed with additional redundancy. Two practical approaches exist for this purpose. One is the classical computational “XOR” conversion, which can generate varying degrees of redundancy.8 Although no standard exists for the optimal level of redundancy to ensure stable data reproduction and high-precision decoding, a one-third redundancy can mitigate significant sequence loss. The other is to create multiple copies of the data within a single cell; these DNA copies can repair each other in the event of specific faults.
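As a minimal sketch of how XOR-based redundancy can recover a lost fragment, the example below groups two equal-length data blocks with one parity block (their bytewise XOR), so that one third of the stored blocks are redundant. This grouping is only one possible way to reach a one-third redundancy and is an assumption made for illustration; the helper names `xor_bytes`, `add_parity`, and `recover` are hypothetical and do not come from the cited method.

```python
from typing import List, Optional, Tuple


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def add_parity(block_a: bytes, block_b: bytes) -> List[bytes]:
    """Return [A, B, A xor B]: two data blocks plus one parity block."""
    return [block_a, block_b, xor_bytes(block_a, block_b)]


def recover(group: List[Optional[bytes]]) -> Tuple[bytes, bytes]:
    """Rebuild both data blocks when at most one of the three is None."""
    if len(group) != 3:
        raise ValueError("expected exactly three blocks")
    missing = [i for i, block in enumerate(group) if block is None]
    if len(missing) > 1:
        raise ValueError("cannot recover more than one lost block per group")
    restored = list(group)
    if missing:
        present = [block for block in group if block is not None]
        # XOR of the two surviving blocks reproduces the lost one,
        # because A xor (A xor B) == B and B xor (A xor B) == A.
        restored[missing[0]] = xor_bytes(present[0], present[1])
    return restored[0], restored[1]
```

For example, if the first data block is lost entirely, `recover([None, b, parity])` rebuilds it from the remaining two. The scheme tolerates the loss of one whole block per group but does not by itself detect point mutations inside a surviving block, which is why it complements rather than replaces an error correction code.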
Our project began with the compilation of a comprehensive summary outlining the primary functions of this biocontrol strain (Fig. 3). Using the Wukong encoder, we translated these insights into a DNA sequence (see ESI† for details). This sequence was synthesized and then integrated into the G7 chromosome using advanced protocols. Notably, this synthesized DNA sequence can be decoded into readable natural language through genome sequencing and decoding, ensuring its accessibility and comprehensibility.
To address potential concerns regarding the stability of foreign DNA, particularly in scenarios lacking selective pressure or neofunctionalization, we conducted genetic experiments in which synthetic foreign DNA was integrated into the bacterial chromosome at a defined recombination site. For this purpose, we employed a bacterial φC31 integrase vector, pIB139, containing the attP attachment site, which corresponds to the attB site in the bacterial genome. Four of the recombinant mutants were confirmed by PCR testing and subsequently sequenced. We then sub-cultured these strains, harvested their spores and mycelia, and sequenced them again. Notably, no mutational changes were observed. These findings strongly suggest that our integration strategy ensures the stability of foreign DNA within the bacterial genome.
Although it may be impractical to categorize the entirety of the boundless array of microorganisms, the significance of pioneering work becomes evident when considering the dedication of more than 2200 laboratories across 108 countries worldwide working on such efforts.23 Additionally, while the probability of mutations during DNA replication is indeed non-negligible, we believe these mutations typically result in point changes or duplications rather than the complete loss of exogenous DNA fragments. As such, this issue can be effectively managed by incorporating additional copies of the DNA sequences encoding the information. Moreover, we anticipate that these endeavours could be coupled with currently well-developed large language models, which would further contribute to error correction and increase both the tolerance to bacterial DNA mutations and the precision of the encoded information.
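The idea of managing point mutations with additional copies can be sketched as a per-position majority vote. The example assumes that the copies have already been extracted from the sequenced genome and aligned to equal length, so that only substitutions are handled; insertions, deletions, and duplications would require a proper alignment step first. The function name `consensus` is a hypothetical helper for illustration.

```python
from collections import Counter
from typing import Sequence


def consensus(copies: Sequence[str]) -> str:
    """Return the per-position majority base across aligned copies."""
    if not copies:
        raise ValueError("at least one copy is required")
    length = len(copies[0])
    if any(len(copy) != length for copy in copies):
        raise ValueError("copies must be aligned to the same length")
    result = []
    for column in zip(*copies):
        # Pick the most frequent base observed at this position.
        base, _count = Counter(column).most_common(1)[0]
        result.append(base)
    return "".join(result)
```

With three copies, a point mutation in any single copy at a given position is outvoted by the other two. If every copy disagrees at a position, the vote is a tie and the result there is not reliable, so such positions should be flagged for inspection in any real pipeline.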
While the current limitations in the amount of information that can be stored in DNA present significant challenges, they also drive innovation in this field. By condensing bespoke information and embracing new methodologies, such as automated synthesis platforms, researchers can not only reduce costs but also improve the efficiency of DNA storage solutions. Future efforts must focus on standardizing encoding techniques to facilitate interoperability across various applications and research disciplines. Such advancements will be crucial in overcoming the existing limitations and unlocking the full potential of DNA data storage in diverse scientific fields.
Finally, in light of the potential implications of employing DBNL standards, we recognize the importance of adhering to biosecurity and ethical guidelines. The integration of DBNL-encoded organisms into research and application contexts necessitates a thorough understanding of existing regulations surrounding genetically modified organisms (GMOs). It is essential that the development and use of these organisms comply with stringent biosecurity protocols to mitigate any unintended ecological or environmental consequences. Prior to deploying DBNL-encoded organisms, rigorous approval processes must be undertaken, which include comprehensive assessments conducted by relevant regulatory bodies. These processes will ensure that potential risks are adequately evaluated and managed. Moreover, we advocate for the establishment of a robust risk assessment framework that examines possible environmental impacts, particularly those related to biodiversity and ecosystem health. Addressing these considerations is crucial to fostering public trust and acceptance of GMO technologies. In addition, ethical principles should guide the development and application of DBNL in microbial systems, promoting responsible innovation and ensuring that research practices are aligned with societal values. By integrating bioethical considerations into the research and implementation phases, we can better prevent unintended consequences and enhance the sustainability of biotechnological advancements.
Taken together, these research undertakings, focused on cultivated microorganisms, underscore how imperative such initiatives are, and their importance should not be underestimated. Therefore, we call upon the community to commit to and actively engage in this field.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00251b
This journal is © The Royal Society of Chemistry 2024