The examples in this tutorial were done using Nanolyzer™, Northern Nanopore's data analysis software.
1. Event fitting algorithms
Event fitting with Nanolyzer uses one of two algorithms: CUSUM+, an adaptation of the CUSUM algorithm to nanopore data, intended for long events with internal substructure, and stepfit, an adaptation of the nonlinear fitting methods originally used in MOSAIC, intended for short, relatively featureless events [1,2]. Setting up an analysis can look intimidating at first glance because of the number of input parameters, but for the most part the defaults can be used. This section covers all of the available parameters in detail, starting with the ones that need to change every run and finishing with a general reference to the rest. The end goal is a fit like the figure below, where internal sub-levels are fitted, de-noised, and characterized for downstream post-processing.
2. Event fitting setup
To get started, once you have loaded your raw data, select Data->New Analysis while on the Raw Data tab. This will cause several things to happen: first, you will be asked to select a folder to store the output of the analysis run. The program will create a folder called "output" inside the folder you specify. Following that, the baseline current statistics of the currently selected data view will be fitted, and finally, a new menu will be populated with configuration parameters for event fitting.
There are 4 sections of parameters to be considered, selected using the drop-down menu that will initially be labeled "IO Settings". As a result, this section looks very long and daunting, but not to worry: most parameters do not need to be changed between analysis runs, and the program will remember your settings and reload them between runs so that the actual work needed each time should take less than a minute once you are used to the setup. It looks like this:
2.a. IO Settings
This section contains values that pertain to reading the raw data into Nanolyzer for processing and event fitting.
Sampling frequency
The sampling frequency in Hz should be automatically detected and populated based on the type of data loaded. Check to verify that it is correct, but no change should be needed.
Chunk Size
While reading through your file to find events, Nanolyzer will intake data in segments of length in seconds equal to this parameter. Generally, this should be at minimum twice as long as the longest event you plan to detect. When going through the file Nanolyzer will calculate new local baseline statistics on each segment, so the value should be chosen to be short enough that you do not expect significant baseline variation over the timescale of a single chunk. For most applications 1 second is a good starting point, though for pores with a lot of low-frequency noise it can be beneficial to make it shorter.
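As a rough illustration of how per-chunk baseline statistics behave, the sketch below splits a trace into chunks of Chunk Size seconds and computes a mean and standard deviation for each one. This is not Nanolyzer's implementation: the function name and arguments are hypothetical, and for simplicity it takes statistics over every sample in the chunk rather than fitting the baseline.

```python
from statistics import mean, pstdev


def chunk_baseline_stats(current_pA, sampling_hz, chunk_size_s):
    """Return a (mean, standard deviation) pair for each chunk of the trace.

    Simplification: every sample in the chunk is used, events included.
    """
    samples_per_chunk = max(1, int(round(sampling_hz * chunk_size_s)))
    stats = []
    for start in range(0, len(current_pA), samples_per_chunk):
        chunk = current_pA[start:start + samples_per_chunk]
        stats.append((mean(chunk), pstdev(chunk)))
    return stats


# Toy trace sampled at 4 Hz whose baseline shifts by 10 pA after one second.
trace = [100.0] * 4 + [110.0] * 4
print(chunk_baseline_stats(trace, sampling_hz=4, chunk_size_s=1.0))
```

Comparing the means of neighbouring chunks at a few candidate chunk sizes can give a feel for how quickly the baseline wanders, and therefore how short the chunk may need to be.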
Start Time and End Time
If you uncheck the "Read Whole File" box it is possible to specify start and end times to analyze, causing Nanolyzer to ignore the rest of the file. This is useful for a clogged pore, for example, where you do not want to waste time attempting to fit the clog.
2.b. Event Segmentation Settings
This section contains parameters relating to coarsely identifying molecular translocation events in the signal for downstream fitting.
Event Direction
The vast majority of nanopore work considers blockages in current, which will be the case for the default setting of 0/unchecked here. If you seek to fit current enhancements, set this parameter to 1/checked.
Minimum and Maximum Baseline
These parameters are the signed values, in picoamperes, of the smallest and largest baseline current that you consider to be valid. Both should have been automatically estimated as part of the initial baseline fit. Verify visually on the plot that the values are reasonable and adjust if needed. The fitted bounds are indicated by the transparent green box on the plot. Note that the box will not update visually if you change a value manually, but the values entered here will be used for fitting.
Manual Baseline Override/Mean/Stdev
If preferred, you can cause Nanolyzer to forego local baseline fitting and always take a specific mean and standard deviation value that you specify here. This is generally not recommended, as the local baseline fitting allows the analysis to be sensitive to odd local behaviors that would otherwise be missed, but for pathologically poorly-behaved datasets for which automatic fitting is failing it is an option that can be explored.
Detection Threshold/Hysteresis
The beginning of an event is signaled when the current trace passes
$$m - t\,s$$
where s is the standard deviation, m the mean current of the baseline, and t is the threshold specified (written here for a blockage of a positive baseline current). The end of an event is signaled when the current returns to
$$m - (t - y)\,s$$
where y is the specified hysteresis. By default t and y are 6 and 7 respectively, meaning that an event is called when the current deviates from the mean by 6 standard deviations, and the end of the event is called when the current returns to baseline and passes beyond it by 1 standard deviation (7-6=1). The diagram below provides an illustration.
If your data has a very high signal-to-noise ratio, the threshold t can be increased to reduce the rate of false positives. Generally, setting y=t+1 is a safe rule of thumb, which allows you to avoid the problem in the bottom image in the above diagram, where a noisy spike in the current would otherwise cause the event to be called ended early.
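The threshold and hysteresis logic can be sketched in a few lines. The example below is only an illustration under stated assumptions, not Nanolyzer's code: it assumes blockages of a positive baseline, a fixed baseline mean m and standard deviation s, a start level of m - t*s and an end level of m - (t - y)*s as read from the description above, and the function name is hypothetical.

```python
def segment_events(current, m, s, t=6.0, y=7.0):
    """Return (start, end) sample index pairs for blockage events.

    An event starts when the current drops below m - t*s and ends when
    it rises above m - (t - y)*s.
    """
    start_level = m - t * s
    end_level = m - (t - y) * s
    events = []
    start = None
    for i, value in enumerate(current):
        if start is None and value < start_level:
            start = i                      # event begins
        elif start is not None and value > end_level:
            events.append((start, i))      # event ends
            start = None
    return events


# Baseline of 100 pA with 1 pA noise; the 99.5 pA sample is a noisy spike
# in the middle of a single blockage.
trace = [100.0, 100.0, 90.0, 90.0, 99.5, 90.0, 100.0, 101.5, 100.0]
print(segment_events(trace, m=100.0, s=1.0, t=6.0, y=7.0))  # [(2, 7)]
print(segment_events(trace, m=100.0, s=1.0, t=6.0, y=5.0))  # [(2, 4), (5, 6)]
```

With y=t+1 the spike is ignored and one event is found. With a hysteresis smaller than the threshold, the same spike ends the event early and splits it in two.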
Use Data Filter/Data Filter Cutoff Frequency/Data Filter Order
These parameters will by default be set equal to the filter parameters set in the visualization parameters above, and determine the filter that is applied to the data before segmenting events using the rest of the settings.
2.c. Event Fitting Settings
This section pertains to the settings used when performing detailed de-noising fits to the events identified by the Event Segmentation Settings. Parameters here will dictate how sensitive the algorithm is to changes in the current, and how closely the fitted piecewise constant signal will follow changes to the current values.
Sensitivity
This is the single most important parameter of the run, and should usually be changed every run. It sets the sensitivity of the fitting algorithm to internal sub-levels (a larger value means less sensitive) and should be set equal to the number of baseline current standard deviations by which the current must change for that change to be considered the start of a new sub-level. The diagram on the right clarifies the explanation. In this case there are three obvious sub-level changes visible. The second and third steps are smaller than the first, and are about equal, so Sensitivity should be set to the value of those steps, normalized by the baseline standard deviation:
$$\text{Sensitivity} = \frac{\Delta I}{s}$$
where ΔI is the size of the smallest step you want to detect and s is the standard deviation of the baseline current.
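The normalization described above is simple arithmetic. A minimal sketch, assuming you have read the step heights off the plot in picoamperes and know the baseline standard deviation (the function name and the numbers are purely illustrative):

```python
def sensitivity_from_steps(step_sizes_pA, baseline_std_pA):
    """Smallest step of interest, in units of baseline standard deviations."""
    if baseline_std_pA <= 0:
        raise ValueError("baseline standard deviation must be positive")
    smallest_step = min(abs(step) for step in step_sizes_pA)
    return smallest_step / baseline_std_pA


# Three visible steps; the two smaller ones are about 300 pA, and the
# baseline standard deviation is 50 pA.
print(sensitivity_from_steps([900.0, 300.0, 300.0], 50.0))  # prints 6.0
```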
Elasticity
For datasets that have events spanning a very wide range of passage times, fits will eventually have a false positive and call a sub-level even when no significant change has occurred. While most of these are dealt with by the Minimum Step Size parameter, sometimes this is insufficient.
If your event durations span multiple orders of magnitude, the elasticity parameter can be used to make the fits less sensitive when processing very long events. When this parameter is nonzero, the effective Sensitivity parameter will become dependent on event length. For the shortest 25% of your events, Sensitivity will be used without modification. For the longest 25% of events, Sensitivity+Elasticity will be used. In the middle, an intermediate value will be interpolated linearly on the base-10 log of the event duration.
As a general rule of thumb, if your event durations span more than three orders of magnitude, set Elasticity equal to Sensitivity.
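The length dependence can be pictured with a short sketch. It assumes that the interpolation runs between the 25th and 75th percentile event durations, which is one reading of the description above and not a confirmed detail of Nanolyzer. The function and argument names are hypothetical.

```python
import math


def effective_sensitivity(duration_us, q25_us, q75_us, sensitivity, elasticity):
    """Sensitivity applied to one event, given the duration quartiles."""
    if duration_us <= q25_us:
        return sensitivity                  # shortest 25% of events
    if duration_us >= q75_us:
        return sensitivity + elasticity     # longest 25% of events
    # Linear interpolation on the base-10 log of the duration in between.
    fraction = (math.log10(duration_us) - math.log10(q25_us)) / (
        math.log10(q75_us) - math.log10(q25_us)
    )
    return sensitivity + elasticity * fraction


# Quartiles at 10 us and 1000 us, with Sensitivity = Elasticity = 4.
for duration in (5.0, 100.0, 5000.0):
    print(duration, effective_sensitivity(duration, 10.0, 1000.0, 4.0, 4.0))
```

In this example an event of 100 microseconds sits halfway between the quartiles on a log scale, so it is fitted with a value halfway between Sensitivity and Sensitivity+Elasticity.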
Minimum Step Size
Sometimes very small steps can be fitted due to false positives. If a step in the current is fitted that is smaller than this parameter times the local baseline standard deviation, it will simply be ignored.
Minimum Step Separation
The algorithm used to fit events assumes a piecewise constant signal with white noise overlaid. This is not actually the case, since a bandwidth-limited signal cannot ever have a step change. If not dealt with, the algorithm will fit a staircase-like pattern to each transition between levels. This parameter specifies a time in microseconds for which the algorithm should pause detection of new sub-levels after it detects one. It should generally be set to about 5 times the rise-time for your setup, which for most solid-state nanopore systems is about 1 microsecond, but could be more or less depending on the capacitance and access resistance leading up to the pore. A more detailed discussion of rise times and bandwidth limitation on nanopore signals is available in our previous post.
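To see what a pause in detection does to a staircase of spurious steps, consider the sketch below. It is a simplification: it filters a list of candidate change-point times after the fact, whereas the pause described above happens during detection. The function name and the times are hypothetical.

```python
def apply_dead_time(change_times_us, min_separation_us):
    """Keep a change point only if it falls at least min_separation_us
    after the previously accepted one."""
    accepted = []
    for time_us in sorted(change_times_us):
        if not accepted or time_us - accepted[-1] >= min_separation_us:
            accepted.append(time_us)
    return accepted


# A slow transition produces candidate steps at 10, 11 and 13 us; with a
# 5 us separation (5 times a 1 us rise time) only the first one is kept.
print(apply_dead_time([10.0, 11.0, 13.0, 20.0, 22.0, 40.0], 5.0))
# [10.0, 20.0, 40.0]
```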
Internal Detection Threshold/Hysteresis
This parameter has no direct effect on the fitting, but can provide a useful metric for downstream post-processing. The definitions are identical to the threshold and hysteresis from the Event Segmentation Settings but instead of calling the start or end of the event, it simply increments a counter every time either threshold value is crossed. You can use this to characterize events by the number of times they enter a deep blockage state, and use it later to filter out events that have a particular characteristic shape.
If you do not need this capability, you can simply ignore it. It will not have any effect on the fits.
Baseline Wait Time
In certain nanopore systems, there can be significant distortion to the baseline current following passage of a molecule. In such cases we do not want to attempt to fit the local baseline level during that distortion. This parameter sets a number of microseconds of baseline data to ignore following the end of an event. For the vast majority of systems, this can be safely set to 0 and forgotten about.
Nonlinear Fit Threshold
Nanolyzer uses two algorithms for fitting. For short, single-level events, it can be better to use a model of the event inspired by a simple RC-filter equivalent circuit originally used in MOSAIC. The nonlinear fit threshold sets a value in microseconds such that any event shorter than this setting will be fitted using this model instead of the statistical one above. Generally, this can be set equal to 2-4x the Minimum Step Separation.
Attempt Recovery
Occasionally, the statistical fitting detailed above can fail. If that happens, checking this box will attempt to recover the failed event fit by trying the other algorithm instead.
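The way the Nonlinear Fit Threshold and Attempt Recovery interact can be sketched as follows. Everything here is illustrative: the `fitters` dictionary, its keys, and the use of `ValueError` to signal a failed fit are assumptions made for the example, not part of Nanolyzer.

```python
def fit_with_recovery(event, length_us, nonlinear_fit_threshold_us,
                      fitters, attempt_recovery=True):
    """Pick a fitter by event length and optionally retry with the other.

    `fitters` is a hypothetical dict {"stepfit": f, "cusum": g} of
    callables that return a fit or raise ValueError on failure.
    Returns (name of the fitter that succeeded, fit result).
    """
    primary = "stepfit" if length_us < nonlinear_fit_threshold_us else "cusum"
    fallback = "cusum" if primary == "stepfit" else "stepfit"
    try:
        return primary, fitters[primary](event)
    except ValueError:
        if not attempt_recovery:
            raise
        return fallback, fitters[fallback](event)


def always_fits(event):
    return "fit"


def always_fails(event):
    raise ValueError("fit failed")


# A 50 us event with a 20 us threshold goes to the statistical fit first,
# which fails here, so the nonlinear fit is tried instead.
print(fit_with_recovery([], 50.0, 20.0,
                        {"stepfit": always_fits, "cusum": always_fails}))
# ('stepfit', 'fit')
```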
Use Event Filter/Event Filter Cutoff Frequency/Event Filter Order
As with the Data Filter, this sets the filter parameters to be used when fitting the events. It will default to the value selected for viewing the data, but it can be beneficial in some cases to apply different filters at different stages of the process, so the option exists here should a different value be desired.
2.d. Event Rejection Criteria
This section is quite simple, and pertains to hard data filters that will be applied to reject events that fall outside the time limits imposed here.
Minimum/Maximum Event Length
As the name would suggest, events that fall outside of these bounds are simply rejected and no fit attempt is made at all. Minimum Event Length can reasonably be set equal to Minimum Step Separation, while Maximum Event Length should not be larger than half of the Chunk Size parameter.
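The length check and the rule of thumb relating Maximum Event Length to Chunk Size can be written out as a short sketch. It assumes event lengths are given in microseconds and Chunk Size in seconds. The function names are hypothetical, and the returned labels simply mirror the "Too short" and "Too long" errors reported by the analysis.

```python
def length_rejection(event_length_us, min_length_us, max_length_us):
    """Return a rejection label, or None if the event length is accepted."""
    if event_length_us < min_length_us:
        return "Too short"
    if event_length_us > max_length_us:
        return "Too long"
    return None


def max_length_fits_chunk(max_length_us, chunk_size_s):
    """Check that Maximum Event Length is at most half of Chunk Size."""
    return max_length_us <= 0.5 * chunk_size_s * 1e6


print(length_rejection(3.0, 5.0, 1000.0))      # Too short
print(length_rejection(200.0, 5.0, 1000.0))    # None
print(max_length_fits_chunk(600000.0, 1.0))    # False
```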
3. Running Analysis
Once the parameters above are set to your satisfaction, simply press "Save and Run Analysis". A command window will open showing progress, and a short report will be printed at the end, as shown below.
Note that failed fits are not necessarily a problem. Occasionally events are rejected for good reason. This report should be carefully evaluated for sanity as a first check to ensure that the analysis parameters were appropriate to the problem.
The first sanity check is to verify that the failed fits are in acceptable proportion and reasonable. Given the breadth of possible nanopore experiments there is no rule of thumb for determining this. The possible reasons for event rejection by CUSUM are as follows.
| Error | Meaning |
| --- | --- |
| Baseline differs | Baseline before and after the event differ |
| Too long | Event length exceeds allowed maximum |
| Too short | Event length below allowed minimum |
| Too few levels | Fitting was unable to find at least two steps |
| Cannot read data | Input file is likely corrupted |
| Cannot pad event | Two events occur too closely together |
| Fitted step too small | Deepest blockage less than allowed minimum |
| All others | A numerical error occurred |
Whether the error rate is acceptable depends on the downstream analysis. In general, rejecting events is not an issue for statistical analysis downstream as long as the reason that events are rejected is uncorrelated with the statistical metrics being studied.
An example where event rejections might be problematic would be if the target translocation rate was the desired output metric, and many events were rejected with the error "Cannot pad event". This error occurs when events happen very closely spaced in time, and is therefore going to bias the capture rate extracted toward longer inter-event times.
If mean passage time is a desired output metric and many events are being rejected as being too short or too long, this could also be problematic, and should be carefully considered to ensure that rejected events are valid.
4. Visualizing the fits
Once the analysis has finished, the rest of the tabs in Nanolyzer will be enabled as the results of the analysis are automatically loaded. You can also reload analysis results without rerunning the analysis using File->Load Analysis and selecting the output folder from a previous run.
Before we move on to exploring the resulting metadata statistics, it is possible to see visually the events that are accepted and rejected in the Raw Data viewing tab. After analysis is complete, updating the raw data trace will include transparent green overlays for successfully fitted events, and red ones for rejected events, as shown below.
Note that to visualize data after reloading, you will also need to reload the corresponding raw data as before in order to plot it and see the overlay. Changing the filter settings for viewing data to be different from that used for fitting previously may cause the overlay to appear visually inaccurate.
Finally, you can navigate to the "Event Viewer" tab and click through to see the fits and visually check that they are appropriate. If you find that sub-levels are being missed or over-fitted, adjust the parameters and rerun the analysis. Once you are happy with the fits, we move on to visualizing the event metadata that you have extracted. You will note that some events have only an orange fit line, while some have both green and orange. This depends on the algorithm being used. In both cases, the orange fit gives the fully de-noised fit from which most of the metadata is extracted.
As always, should you have difficulty getting the fits just right, reach out to us and we will be happy to assist you.
5. References
[1] J. H. Forstater et al., “MOSAIC: A modular single-molecule analysis interface for decoding multistate nanopore data,” Anal. Chem., vol. 88, no. 23, 2016, doi: https://www.doi.org/10.1021/acs.analchem.6b03725.
[2] K. Briggs, “Solid-State Nanopores: Fabrication, Application, and Analysis,” uOttawa, 2018.
Last edited: 2021-08-02