Deciphering DNA nucleotide sequences and their rotation dynamics with interpretable machine learning integrated C3N nanopores†
Abstract
Solid-state nanopores combined with the quantum transport method have attracted substantial attention for DNA sequencing because they promise rapid and accurate results, with potential applications in disease diagnosis and personalized medicine. However, the intricate, multi-step nature of the experimental protocol makes precise single-nucleotide analysis a formidable challenge. Here, we report a machine learning (ML) framework combined with the quantum transport method to accelerate high-throughput single-nucleotide recognition with C3N nanopores. The optimized eXtreme Gradient Boosting Regression (XGBR) algorithm predicts the fingerprint transmission of each unknown nucleotide and its rotation dynamics with root mean square error scores as low as 0.07. Interpreting the black-box ML models with the game-theory-based SHapley Additive exPlanations (SHAP) method provides a quasi-explanation of the model's working principle and of the complex relationship between electrode–nucleotide coupling and transmission. Moreover, a comprehensive ML classification of nucleotides based on binary, ternary, and quaternary combinations achieves maximum accuracy and F1 scores of 100%. The results suggest that ML in tandem with a nanopore device can potentially alleviate the experimental hurdles associated with quantum tunneling and facilitate fast, high-precision DNA sequencing.
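For readers unfamiliar with the figures of merit quoted in the abstract, the following minimal Python sketch shows how a root mean square error (for the regression task) and accuracy and F1 scores (for the classification task) are commonly computed. It is an illustration only: the function names, the toy values, and the choice of macro-averaged F1 are assumptions made here, not details taken from the paper, and the numbers are not results from this work.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error between reference and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def accuracy(labels_true: Sequence[str], labels_pred: Sequence[str]) -> float:
    """Fraction of samples whose predicted label matches the true label."""
    if len(labels_true) != len(labels_pred) or len(labels_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(t == p for t, p in zip(labels_true, labels_pred))
    return hits / len(labels_true)


def macro_f1(labels_true: Sequence[str], labels_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 scores (macro averaging is an
    assumption here; the abstract does not state the averaging scheme)."""
    classes = sorted(set(labels_true) | set(labels_pred))
    per_class = []
    for c in classes:
        pairs = list(zip(labels_true, labels_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Hypothetical toy data, for illustration only (not from the paper).
toy_transmission_true = [0.10, 0.40, 0.80, 0.30]
toy_transmission_pred = [0.10, 0.50, 0.70, 0.30]
toy_labels_true = ["A", "T", "G", "C", "A", "T"]
toy_labels_pred = ["A", "T", "G", "C", "A", "G"]

toy_rmse = rmse(toy_transmission_true, toy_transmission_pred)
toy_accuracy = accuracy(toy_labels_true, toy_labels_pred)
toy_f1 = macro_f1(toy_labels_true, toy_labels_pred)
```

Under these definitions, accuracy and F1 scores of 100% correspond to every nucleotide in the evaluated set being assigned its correct label, with no false positives or false negatives for any class.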