Deekshant Wadhwa^a, Philipp Mensing^b, James Harden^b, Paula Branco^a, Vincent Tabard-Cossa^b and Kyle Briggs*^b
^a Department of Computer Science, University of Ottawa, Canada
^b Department of Physics, University of Ottawa, Canada. E-mail: kbriggs@uOttawa.ca
First published on 27th May 2025
As the complexity of solid-state nanopore experiments increases, analysis of the resulting electrical signals to determine biomolecular details becomes a challenge. State-of-the-art techniques for this task perform poorly when transient signal characteristics approach the bandwidth limitations of the measurement electronics. In this work, we address this challenge through an algorithm, called Nano Trees, for fitting piecewise constant functions. Nano Trees leverages machine learning algorithms to provide fits to the noisy piecewise constant data that is characteristic of nanopore ionic current signals, producing accurate fits on transients as short as twice the rise time of the measurement system. We demonstrate the performance of our algorithm on several real and synthetic datasets. These findings underscore the generalizability and accuracy of this approach in the regime of fast molecular translocations.
The problem of fitting noisy piecewise constant data is ubiquitous, appearing in many scientific fields aside from nanopore analysis. For example, the same problem appears in the analysis of anomalous network traffic14,15 and neuronal activity patterns.16,17 Effective techniques to analyze this type of signal have the potential to be generally useful beyond just nanopore science. Nanopore analysis tasks involve the categorization of signals, for example, to recognize a rare target biomolecule signature from a complex mixture. Solid-state nanopore data often suffers from issues arising from the speed of molecular translocation, causing important signal features to be distorted by the rise time of the measurement electronics.18 Present analysis methods often struggle to classify and identify molecules due to the fast kinetics of molecular passage compared to available measurement bandwidth and to their associated inability to accurately characterize fast transient signals.18,19 Currently, the need for nanopore data analysis is served by a varied patchwork of techniques, many of which are specific to a single experimental context, such as basecalling20–22 or event classification.23 The lack of a framework that standardizes the general analysis case has led to differences in statistical treatment and makes quantitative comparisons between labs challenging. There is, therefore, a need for a method that can be readily adjusted to work effectively across multiple molecular targets and nanopore types.
While numerous methods have been proposed over the years, only a few can fit an arbitrary number of piecewise constant sublevels away from the baseline within a noisy and bandwidth-limited nanopore signal, and/or categorize events by type through recognition of patterns encoded in the sublevel structure. A commonly used class of methods involves variations on the Cumulative Sum (CUSUM) algorithm.24–26 This algorithm18,19,27 assumes that the signal to be analyzed is a piecewise constant signal overlaid with Gaussian-distributed noise (though it can be generalized to other noise distributions)17 and iteratively applies a modified t-test to each new data point to determine the likelihood that the local mean has undergone a step change of known magnitude. Like most nanopore analysis frameworks, CUSUM performs poorly when fitting transients that are short compared to the response time of the measurement system:19 when transients are shorter than 4 times the rise time of the system, it miscalculates sublevel duration and blockage depth, or misses the transients entirely. Data clustering algorithms, such as DBSCAN24 (density-based spatial clustering of applications with noise), have also been used to obtain an initial guess of the number of sublevels and to detect abrupt changes; each candidate level is then iteratively compared with its adjacent levels and merged with them unless the separation exceeds a user-defined threshold.25
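To make the CUSUM idea concrete, the following is a minimal sketch of a two-sided detector for a step of known magnitude in Gaussian noise. It uses the textbook log-likelihood-ratio form, which is an assumption on our part and not necessarily the variant implemented in the cited packages; the function name `cusum_detect`, the threshold `h`, and the test signal are hypothetical choices made for illustration.

```python
import random

def cusum_detect(x, mu, sigma, delta, h):
    """Return the index at which a step of size +/-delta from mean mu is detected, else None."""
    g_pos = g_neg = 0.0
    for k, xk in enumerate(x):
        # Log-likelihood-ratio increments for an upward and a downward shift of size delta
        s_pos = (delta / sigma**2) * (xk - mu - delta / 2)
        s_neg = -(delta / sigma**2) * (xk - mu + delta / 2)
        # Cumulative sums are clamped at zero so evidence only accumulates after a change
        g_pos = max(0.0, g_pos + s_pos)
        g_neg = max(0.0, g_neg + s_neg)
        if g_pos > h or g_neg > h:
            return k
    return None

# Illustrative signal: baseline at 0, then a downward step of 6 noise standard deviations
rng = random.Random(0)
signal = [rng.gauss(0.0, 1.0) for _ in range(200)]
signal += [rng.gauss(-6.0, 1.0) for _ in range(200)]
alarm = cusum_detect(signal, mu=0.0, sigma=1.0, delta=6.0, h=20.0)
```

Because the detector needs several samples at the new level before the cumulative sum crosses `h`, a transient that ends before that point can be missed, which is consistent with the limitation described above.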
An alternative to probabilistic sublevel fitting is to use an approximation to the transfer function of the measurement system to extract the underlying signal from the distorted measurement. The Adaptive Time-Series Analysis (ADEPT) algorithm18,19,28,29 fits a linear sum of exponential step functions over the event to determine the position of individual sublevels using standard nonlinear fitting techniques. This algorithm works under the assumption that the nanopore operates as a simple RC-equivalent circuit and that the rate-limiting factor that dominates the signal distortion is the response time of the measurement electronics, an assumption that breaks down when the signal is subjected to heavy filtering that can result in the time-response of the filter dominating that of the RC response.18 While quite effective for events with a relatively small number of well-separated sublevels (typically just one or two), the use of nonlinear fitting over many parameters (3 independent parameters for each sublevel) means that the algorithm performs increasingly poorly as events get more complex and often suffers from difficult-to-debug numerical errors that result in rejection of valid events. Moreover, because an estimate of the number of sublevels needed for fitting is a required input to the algorithm, the same challenges with respect to heavily distorted sublevels exist. This approach was recently extended using more general functional forms.6,27,30–32 The approach suggested by Lucas et al.30 is a statistical approach for characterising short-lived events, but it is not clear from published data how well it generalises to multi-step events. Another algorithm, presented by Gu et al.,31 uses second-order differentials to extract blockage states from single-level events, but generalization to multi-step events is not demonstrated.
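As an illustration of what a linear sum of exponential step functions looks like, the sketch below evaluates such a model with three parameters per sublevel (amplitude, onset time, and time constant). The exact parameterization used by ADEPT may differ; the function name `step_response_model` and the example values are assumptions made for illustration only.

```python
import math

def step_response_model(t, i0, steps):
    """Evaluate i0 + sum_j a_j * (1 - exp(-(t - mu_j) / tau_j)) over steps with t > mu_j.

    steps is a list of (a_j, mu_j, tau_j): amplitude, onset time, RC time constant.
    """
    total = i0
    for a, mu, tau in steps:
        if t > mu:
            total += a * (1.0 - math.exp(-(t - mu) / tau))
    return total

# A single-level event: the current drops by 1 at t = 10 and recovers at t = 30
steps = [(-1.0, 10.0, 0.5), (1.0, 30.0, 0.5)]
trace = [step_response_model(t, 0.0, steps) for t in range(60)]
```

Each additional sublevel adds three free parameters to the nonlinear fit, which is one way to see why the fit becomes harder to condition as events grow more complex.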
CUSUM+ is widely used as the basis for event fitting in other packages and is chosen as the main point of comparison for the algorithm developed in this work. Other approaches to fitting signals to nanopore data have been developed,27,30,32 but all suffer from challenges when considering short transients approaching the response time of the measurement electronics.
In this work, we present a framework to improve fitting and characterization of nanopore signals that contain fast transient events, showing excellent fit accuracy down to twice the system response time. We also present a framework for optimizing fitting parameters, which we anticipate will assist with the standardization of statistical analysis of complex nanopore signals across different experimental contexts. Results of this optimized fitting can help with the task of categorizing nanopore events in mixed samples.33 A representative example of a nanopore signal undergoing a step change as a result of the translocation of a mock molecule with two states is depicted in Fig. 1. The conceived molecule generating such a two-state signal is illustrated in Fig. 1a; its diameter increases halfway along its contour length. The true underlying signal that we hope to extract is well-approximated by a piecewise constant function, as can be seen in the schematic in Fig. 1b. However, a truly instantaneous transition between states would require infinite bandwidth. In practice, the finite response time and bandwidth of the measurement electronics, together with any low-pass filtering applied, distort the signal. The case where the rise time is dictated by the RC response of the measurement circuit is shown in Fig. 1c. The system is also subject to several sources of electrical noise that further distort the signal; these are discussed in detail elsewhere.34,35 An example of the full raw signal (i.e., subject to noise, low-pass filtering, and bandwidth limitations) can be seen in Fig. 1d.
Fig. 1 (a) A schematic diagram of a mock molecule of varying thickness translocating a nanopore from top to bottom. (b) The signal produced by this molecule without noise and with infinite measurement bandwidth. (c) The signal produced by this molecule, considering finite bandwidth electronics, using the prediction from ref. 19. (d) The signal that would be produced by this molecule overlaid with systemic noise. (e) Examples of the level of distortion of a noiseless step function arising from bandwidth limitations as the duration of the translocation approaches the temporal resolution limits of the measurement system, with rise time set to 5 samples and event durations of 20, 4, and 1 times the rise time, respectively. Adapted with permission.18
This distortion becomes especially problematic when the duration of an important feature of the signal approaches the response time of the measurement system or the bandwidth of the recording device, which causes the signal to be attenuated as shown in Fig. 1e, and to vanish entirely into the noise for durations that are shorter than the system rise time.
Consequently, to accurately analyze nanopore data, the approach should be to first denoise the signal (for example, through Bessel filtering or wavelet filtering36), then to correct distortions arising from the finite bandwidth of the system, and finally to evaluate the physical validity of the extracted sublevel structure. With the goal of enabling such an accurate decoding of sublevels all the way down to the bandwidth- and hardware-imposed limitations of nanopore measurement, we present here a method of decoding and fitting sublevels to nanopore data, termed Nano Trees, intended for use in decoding arbitrary piecewise constant signals overlaid with noise and systemic distortion.
Each of the steps that forms the pipeline is a modular component that operates independently of the others, and can be reordered or reapplied as needed. The selection and order of passes used here were found to be effective for fitting the data discussed in this work, but are not necessarily prescribed for all nanopore data and can easily be updated as needed. Full automation of the selection of these hyperparameters is the subject of ongoing work.
The full Nano Trees pipeline, as well as a pseudocode implementation, is described in detail in ESI Section S1† and in a related master's thesis.42 In short, data is normalized and subsequently grouped into sublevels in a hierarchical manner, beginning with an overfit of the data and iteratively merging sublevels and improving the fit through increasingly fine-grained passes over the underlying data until it is deemed to be physically accurate according to a set of context-specific, user-specified hyperparameters. Fig. 2 provides a block diagram view of the process that employs different supervised machine learning algorithms for fitting nanopore data.
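The overfit-then-merge idea can be illustrated with a deliberately simplified sketch. This is not the Nano Trees implementation (which is described in ESI Section S1): the fixed-width initial chunking, the greedy merge rule, and the names `overfit_then_merge`, `chunk`, and `min_step` are all assumptions chosen to keep the example short.

```python
import random
from statistics import fmean

def overfit_then_merge(x, chunk, min_step):
    """Overfit x with short fixed-width segments, then greedily merge similar neighbours.

    Returns a list of (start, end, mean) tuples with end exclusive.
    """
    bounds = list(range(0, len(x), chunk)) + [len(x)]
    segs = list(zip(bounds[:-1], bounds[1:]))
    while len(segs) > 1:
        means = [fmean(x[a:b]) for a, b in segs]
        diffs = [abs(means[i + 1] - means[i]) for i in range(len(segs) - 1)]
        # Find the most similar adjacent pair of segments
        i = min(range(len(diffs)), key=diffs.__getitem__)
        if diffs[i] >= min_step:
            break  # every remaining boundary is a step of at least min_step
        segs[i:i + 2] = [(segs[i][0], segs[i + 1][1])]
    return [(a, b, fmean(x[a:b])) for a, b in segs]

# Illustrative data: baseline, one sublevel 6 noise standard deviations deep, baseline
rng = random.Random(1)
levels = [0.0] * 100 + [-6.0] * 100 + [0.0] * 100
data = [v + rng.gauss(0.0, 1.0) for v in levels]
fit = overfit_then_merge(data, chunk=10, min_step=3.0)
```

Here `min_step` plays the role of a user-specified hyperparameter that decides when two neighbouring sublevels are considered physically distinct; the actual pipeline uses several passes and criteria rather than this single threshold.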
Class 1. Event consisting of a single transient symmetric peaked sublevel.
Class 2. Event with transient peaked sublevel (asymmetric peak) at the beginning.
Class 3. Event with transient peaked sublevel (symmetric peak) in the middle.
Class 4. Event with asymmetric transient peaked sublevel at the end.
Class 5. Event with a transient sloped sublevel at the beginning.
Class 6. Event with a transient sloped sublevel at the end.
Class 7. Event containing a double peaked sublevel pair with transient separation.
For these classes of events, we varied the duration of the sublevel in class 1, the duration of the peaked sublevels in classes 2–4, the duration of the sloped sublevel in classes 5 and 6, and the gap between the two peaks in class 7. The true duration is varied over the inclusive range of 2–10 times the simulated rise time of the system.19 For the sake of a direct comparison with real data, the rise time is simulated to be 1 μs. The signal is overlaid with uncorrelated white noise such that the depth of the various transients is 6 times the standard deviation of the baseline current (i.e., an SNR of 6). The signal is then sampled at 5 MHz and low-pass filtered to a bandwidth of 1 MHz to represent a typical nanopore measurement using state-of-the-art electronics. In this dataset, 500 events contribute to every data point, for 4500 events per class and 31500 events overall.
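The dataset dimensions quoted above can be reproduced with a few lines of arithmetic. The sketch below also builds an idealized class 1 event with additive white noise; it assumes integer multiples of the rise time, interprets the noise level as a transient depth of 6 baseline standard deviations, and omits the low-pass filtering step entirely. The function `class1_event` and its parameters are hypothetical.

```python
import random

SAMPLE_RATE_HZ = 5e6   # sampling rate stated in the text
RISE_TIME_S = 1e-6     # simulated rise time stated in the text
SAMPLES_PER_RISE_TIME = round(SAMPLE_RATE_HZ * RISE_TIME_S)

durations = list(range(2, 11))  # 2 to 10 rise times inclusive (integer steps assumed)
EVENTS_PER_POINT = 500
N_CLASSES = 7
events_per_class = len(durations) * EVENTS_PER_POINT
events_total = events_per_class * N_CLASSES

def class1_event(duration_rt, depth=6.0, sigma=1.0, pad=50, rng=None):
    """Return (ideal, noisy) traces for a single rectangular sublevel; no filtering applied."""
    rng = rng or random.Random()
    n = duration_rt * SAMPLES_PER_RISE_TIME
    ideal = [0.0] * pad + [-depth] * n + [0.0] * pad
    noisy = [v + rng.gauss(0.0, sigma) for v in ideal]
    return ideal, noisy

ideal, noisy = class1_event(2, rng=random.Random(0))
```

At these settings the shortest transient (twice the rise time) spans only 10 samples, which gives a sense of how little data is available to the fit in the fast regime.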
The first metric is shape accuracy, which simply considers whether the sublevel structure of the fit is correct. To pass, a fit must have the correct number of sublevels in the correct order with respect to depth, have an error in the fitted blockage depth of the transient level not exceeding three standard deviations of the baseline noise, and have an error on the fitted duration of the transient level not exceeding 3 times the system rise time. The second row in Fig. 3 shows the percentage of events that passed as a function of transient duration. It is immediately clear that Nano Trees outperforms CUSUM+ significantly when transients are shorter than 4 times the system rise time, while matching performance for transients that achieve a steady state. Only the fits that had the correct shape (as defined above) were assessed for blockage and duration accuracy.
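The pass criteria for shape accuracy can be written as a short predicate. This is a sketch under assumptions: sublevels are represented as `(depth, duration)` pairs, "correct order with respect to depth" is interpreted as the depth ranking of the fitted sublevels matching that of the ground truth, and the names `depth_ranks` and `shape_is_correct` are hypothetical.

```python
def depth_ranks(levels):
    """Return sublevel indices sorted by depth, used to compare depth ordering."""
    depths = [d for d, _ in levels]
    return sorted(range(len(depths)), key=depths.__getitem__)

def shape_is_correct(fit, truth, transient_idx, noise_sigma, rise_time):
    """Apply the three pass criteria to a fit given as a list of (depth, duration)."""
    if len(fit) != len(truth):
        return False  # wrong number of sublevels
    if depth_ranks(fit) != depth_ranks(truth):
        return False  # sublevels in the wrong order with respect to depth
    fit_depth, fit_dur = fit[transient_idx]
    true_depth, true_dur = truth[transient_idx]
    return (abs(fit_depth - true_depth) <= 3 * noise_sigma
            and abs(fit_dur - true_dur) <= 3 * rise_time)

truth = [(-3.0, 20.0), (-6.0, 2.0)]
good = shape_is_correct([(-3.1, 21.0), (-5.0, 3.5)], truth, 1, 1.0, 1.0)
missed = shape_is_correct([(-3.0, 22.0)], truth, 0, 1.0, 1.0)
too_long = shape_is_correct([(-3.0, 20.0), (-6.0, 6.0)], truth, 1, 1.0, 1.0)
```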
Of the fits that have the correct shape, we also compared the error in both the duration (third row in Fig. 3) and blockage depth (bottom row in Fig. 3). Duration error is calculated as the signed difference between the fitted duration of the transient part of the signal and the known ground truth duration used to simulate the event, expressed in multiples of the system rise time. Blockage error is calculated as the signed difference between the fitted blockage level and the true blockage depth used in the simulation, expressed in multiples of the standard deviation of the baseline noise. A value of zero for either metric indicates a fit that coincides with the ground truth. As with the fraction of events for which the shape is accurate, Nano Trees outperforms CUSUM+ for short transients without sacrificing performance on longer ones. It is worth noting that both algorithms underestimate true blockage depth to increasing degrees as the transient duration approaches zero, which is to be expected given systemic distortion, though this effect is suppressed in Nano Trees compared to CUSUM+. Of note, the sloped sublevels (classes 5 and 6) display a clear directional bias: Nano Trees underestimates or overestimates the blockage depth, depending on whether the event slopes upwards or downwards, to a greater degree than CUSUM+. This is likely because the transients are themselves asymmetric, and the bias could be reduced by using more sophisticated algorithms to estimate the depth of sloped sublevels. In principle, approaches such as ADEPT19 or the approaches suggested by Lucas et al.30 or Gu et al.31 could be implemented as additional modular passes in the Nano Trees pipeline to correct these errors.
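The two error metrics reduce to normalized signed differences, as in the sketch below. The sign convention (blockage depth treated as a positive magnitude, so that a negative error means underestimation) and the function names are assumptions made for illustration.

```python
def duration_error(fit_duration, true_duration, rise_time):
    """Signed duration error in multiples of the system rise time."""
    return (fit_duration - true_duration) / rise_time

def blockage_error(fit_depth, true_depth, noise_sigma):
    """Signed blockage error in multiples of the baseline noise standard deviation."""
    return (fit_depth - true_depth) / noise_sigma

# Example: a 2 us transient fitted as 3.5 us, and a depth of 6 fitted as 5 (arbitrary units)
d_err = duration_error(3.5, 2.0, rise_time=1.0)
b_err = blockage_error(5.0, 6.0, noise_sigma=1.0)
```

With this convention, the systematic underestimation of blockage depth for very short transients appears as a negative blockage error.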
These signals provide clear insight into the limitations and types of errors that arise when using Nano Trees for fitting and highlight the improvements available over incumbent analysis methods in the regime of fast transients. The insights from this synthetic data are critical to inform evaluation of the quality of fits to real datasets, and to understand the strengths and limitations of the approach.
Dataset | Nano Trees fit accuracy | CUSUM+ fit accuracy |
---|---|---|
10 dataset | 94.55% | 59.18% |
01 dataset | 89.65% | 48.96% |
Combined | 92.12% | 54.10% |
The algorithm is versatile and tuneable through its various hyperparameters to adjust to any type of piecewise constant data with systemic distortion and additive noise, and we are actively working on full automation of the relevant parameters. When applied to real nanopore data in particularly challenging cases, Nano Trees outperforms CUSUM+ by a considerable margin, while still falling short of the accuracy required for fully unsupervised decoding of arbitrary nanopore signals. Ongoing research and improvement will focus on reducing the false-negative rate when transients approach the system's rise time, while inclusion of methods developed by others19,30,31 as additional passes in the Nano Trees pipeline may provide a means to correct for directional bias in sublevel current estimation.
In the long term, the sublevel structure extracted by this analysis framework will form the feature set used by event classification methods in a variety of contexts, including single-molecular diagnostics, protein detection, and molecular information storage.
The definition and effect of each hyperparameter on the resulting fit are discussed in ESI Section S6.† Users are advised to follow the hyperparameter descriptions and tuning procedures in ESI Section S7† to obtain a configuration that yields good fits. The hyperparameters used for the case studies in this work are given in ESI Section S8.†
Footnote
† Electronic supplementary information (ESI) available: The Nano Trees pipeline (ESI Section S1); detailed pseudo-code algorithms for updating current estimates function (ESI Section S2), merging small current steps pass (ESI Section S3), sublevel categorization procedure (ESI Section S4), and splitting sublevels pass (ESI Section S5); glossary of all hyperparameters used by Nano Trees (ESI Section S6); steps for hyperparameter tuning (ESI Section S7); hyperparameter values used by Nano Trees for fitting all datasets presented in this research (ESI Section S8). See DOI: https://doi.org/10.1039/d5dd00060b
This journal is © The Royal Society of Chemistry 2025