Clustering and substructure labeling



Table of Contents

  1. Clustering

  2. Clustering algorithms

  3. Gaussian mixture clustering

  4. HDBscan clustering

  5. Feature selection

  6. Feature normalization

  7. Sub-population labeling

  8. Sub-level labeling

  9. Experimentation


The examples in this tutorial were done using Nanolyzer™, Northern Nanopore's data analysis software.


Clustering

Identifying and labeling patterns and clusters in nanopore data is a critical part of solid-state nanopore research. When you first started in the field you might have been looking at the difference between folded and unfolded DNA. Researchers now look at complex translocation events from proteins, information-bearing synthetic polymers, sugars, 3D DNA origami nanostructures, and hybrid molecules, all of which contribute unique properties to the signal that Nanolyzer can fit.


The Clustering tab is used in Nanolyzer for two main purposes. The first is to identify sub-populations within your events. For example, if you have a mixed sample with multiple different molecule types, you can use Nanolyzer to separate events into categories based on their metadata with just a few clicks. The second is to identify and label substructure within an event. For example, if you have an information-bearing polymer with sub-levels encoding bits of information, you can use Nanolyzer to label those sub-levels and bin your molecules according to the information encoded inside them, again all with just a few clicks.


Clustering algorithms

There are currently two clustering methods implemented in Nanolyzer, with more on the way. The first is a Gaussian mixture model, which assumes that the population is made up of a user-specified number of sub-populations, each of which forms an N-dimensional Gaussian in the selected parameter space. The second is HDBscan, a shape-independent clustering model that attempts to detect automatically how many significant clusters there are, based on user-specified limits on the sensitivity and on the number of points in a cluster.


Gaussian mixture clustering

The parameters for Gaussian mixture clustering are the number of clusters and the stability. Number of clusters is exactly what it sounds like: if your data contain clusters that are easy to identify visually, specify how many there are and the model will usually find them without trouble. The stability parameter refits the model from a different initial guess up to the number of times specified and keeps the best fit (in the least-squares sense) over the set of all fits. This is necessary because Gaussian mixtures use random seeds for the initial guess and are therefore not deterministic in principle. It is possible to get different cluster assignments between runs of the algorithm, a problem that is mitigated in the limit of a very large stability parameter.

Gaussian mixtures should be used when you have populations of events in parameter space that are roughly N-dimensional Gaussian distributions. Note that this method assigns a cluster to every event, even outliers, so it is usually best to filter outliers before clustering if you intend to use it.
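Nanolyzer's internal implementation is not shown in this tutorial, but the behavior described above can be sketched outside the software with scikit-learn's GaussianMixture. In this sketch, n_components plays the role of the number of clusters and n_init plays a role similar to the stability parameter. Both mappings are assumptions, and scikit-learn keeps the restart with the highest likelihood bound rather than a least-squares criterion. The two synthetic populations and their values are invented for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic events (made-up values): columns are log10(passage time) and
# maximum blockage depth for two well-separated populations.
pop_a = rng.normal(loc=[-4.0, 0.3], scale=[0.1, 0.02], size=(300, 2))
pop_b = rng.normal(loc=[-3.0, 0.6], scale=[0.1, 0.02], size=(300, 2))
features = np.vstack([pop_a, pop_b])

# n_components: number of clusters requested by the user.
# n_init: number of random restarts; the best-scoring fit is kept.
gmm = GaussianMixture(n_components=2, n_init=10, random_state=0)
labels = gmm.fit_predict(features)

# Every event receives a label, including any outliers. The largest
# posterior probability can serve as a per-event confidence.
confidence = gmm.predict_proba(features).max(axis=1)

print("events per cluster:", np.bincount(labels))
print("lowest confidence:", confidence.min())
```

Fixing random_state makes the run repeatable. Leaving it unset and lowering n_init is a simple way to see how cluster assignments can change between runs.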


HDBscan clustering

HDBscan parameters are less intuitive. The first parameter, Min Cluster Size, is exactly what it sounds like: when splitting your events into clusters, the algorithm will not create a cluster containing fewer points than this. The second parameter, Min Samples, determines how conservative the algorithm is when assigning points to a cluster or to noise. A larger value results in more points being rejected as outliers. Finally, the sensitivity parameter roughly determines how often clusters are split up. A very small value will result in many small clusters (limited by the choice of Min Cluster Size and Min Samples), while a large value will result in fewer, larger clusters.

Unlike the Gaussian mixture model, HDBscan responds less predictably to its inputs, so some experimentation may be required to get good fits. However, it has two major advantages. First, it is not limited to fitting N-dimensional Gaussian sub-populations, which makes it more flexible for datasets containing sub-populations that do not fit that model. Second, it automates outlier selection, reducing the influence of bad fits and other artefacts on the labeling.


Feature selection

The clusters that are built depend on the features selected. The most common features on which to cluster are Passage Time or Sub-level Duration, Maximum Blockage Depth or Sub-level Blockage, and Number of Levels, though many other combinations are possible. If you are unable to separate populations that you believe should be separated, adding another feature (for example Sub-level Standard Deviation) will often allow better separation. Note that you can cluster on as many features as you want simultaneously, though for obvious reasons you will only be able to visualize three of them at a time.
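The effect of adding a feature can be illustrated with a small synthetic example. Here two invented populations have identical passage time and blockage depth distributions and differ only in a third feature, so they overlap completely until that feature is included. The separation score is a simple illustrative measure (difference in means in units of pooled standard deviation, combined across features), not something Nanolyzer reports, and it relies on knowing the true populations, which real data do not provide.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

def make_population(noise_mean):
    """Synthetic events: log10(passage time), blockage depth, sub-level std."""
    return np.column_stack([
        rng.normal(-3.5, 0.2, n),          # same for both populations
        rng.normal(0.5, 0.05, n),          # same for both populations
        rng.normal(noise_mean, 0.002, n),  # differs between populations
    ])

pop_a = make_population(0.010)
pop_b = make_population(0.030)

def separation(a, b):
    """Distance between population means in pooled-std units, over all columns."""
    pooled = np.sqrt(0.5 * (a.var(axis=0) + b.var(axis=0)))
    per_feature = (a.mean(axis=0) - b.mean(axis=0)) / pooled
    return float(np.sqrt(np.sum(per_feature ** 2)))

two_features = separation(pop_a[:, :2], pop_b[:, :2])
three_features = separation(pop_a, pop_b)

print("separation with two features:  ", round(two_features, 2))
print("separation with three features:", round(three_features, 2))
```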



Feature normalization

Regardless of the algorithm chosen, the choice of features on which to cluster and the normalization method applied internally to each feature will make a significant difference. Every feature added starts out with a default normalization, which depends on the column chosen. The four options are Max, Gauss, MAD, and None. Max normalization divides a feature by its maximum value over the entire dataset. Gauss normalization applies the standard transform to N(0,1), which makes clustering less sensitive overall. MAD normalization is similar to Gauss, but uses the sample median and median absolute deviation in place of the mean and standard deviation. None uses the raw values of the column. If you notice that the clustering is overly sensitive to a particular parameter, try changing the normalization method used. In order of sensitivity from least to greatest, the normalization methods are usually Max, Gauss, MAD, and None.
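The four options can be written out as a short NumPy sketch. The formulas follow the descriptions above. Details such as whether Nanolyzer applies a scale factor to the MAD or uses the absolute maximum are not stated here, so the unscaled forms below are assumptions, and the function name is hypothetical.

```python
import numpy as np

def normalize(column, method="Gauss"):
    """Rescale one feature column using one of the four options described above."""
    x = np.asarray(column, dtype=float)
    if method == "Max":
        # Divide by the largest value in the dataset.
        return x / np.max(x)
    if method == "Gauss":
        # Shift to zero mean and scale to unit standard deviation.
        return (x - np.mean(x)) / np.std(x)
    if method == "MAD":
        # Same idea as Gauss, with median and median absolute deviation.
        median = np.median(x)
        mad = np.median(np.abs(x - median))
        return (x - median) / mad
    if method == "None":
        return x
    raise ValueError(f"unknown normalization method: {method}")

# Made-up passage times with one extreme outlier.
passage_time = np.array([10.0, 12.0, 11.0, 13.0, 500.0])

for method in ("Max", "Gauss", "MAD", "None"):
    print(method, np.round(normalize(passage_time, method), 3))
```

In this example the outlier inflates the mean and standard deviation, so Gauss normalization compresses the four typical events into a narrow range, while MAD normalization keeps them spread out around zero and leaves the outlier far away.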



Sub-population labeling

To label events by sub-population, choose at least two metadata features that correspond to events only (not sub-levels), check off up to three features to be plotted, choose normalization and log-scaling options, and press "Update Event IDs". This will generate a visual representation of the clusters found, along with an integer label for each cluster that can then be used in the Statistics tab for visualization or in the Data Manager for event database filtering. The system also assigns a confidence score to each cluster assignment. Adjust the clustering parameters until you are satisfied with the assignments.
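Once events carry an integer label and a confidence score, downstream filtering amounts to a simple mask. The sketch below shows the idea on hand-written arrays. The array names and the 0.9 threshold are hypothetical, and this is not how the Data Manager is implemented.

```python
import numpy as np

# Hypothetical output of a clustering run: one label and one confidence per event.
event_ids = np.array([0, 1, 2, 3, 4, 5, 6, 7])
labels = np.array([0, 0, 1, 1, 1, 0, 1, 0])
confidence = np.array([0.99, 0.97, 0.55, 0.98, 0.91, 0.60, 0.99, 0.95])

threshold = 0.9  # arbitrary cutoff chosen for illustration

# Summarize each sub-population.
for k in np.unique(labels):
    members = labels == k
    confident = members & (confidence >= threshold)
    print(f"cluster {k}: {members.sum()} events, {confident.sum()} above threshold")

# Keep only events confidently assigned to cluster 1.
keep = (labels == 1) & (confidence >= threshold)
selected = event_ids[keep]
print("selected events:", selected.tolist())
```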


Sub-level labeling

To label sub-levels, choose at least two metadata features on which to cluster, at least one of which must correspond to a sub-level feature. As before, tick off up to three features to plot, choose normalization and log-scaling options, and press "Update Sublevels". This will plot a visual representation of the sub-level assignments. The operation calculates two new sub-level metadata features that can then be used in visualization and event database filtering operations: the label for each sub-level and the confidence with which that label has been applied. Adjust the clustering parameters until you are satisfied with the assignments.
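To make the idea of binning molecules by their encoded information concrete, the sketch below clusters synthetic sub-levels into two classes and then reads each event's sequence of labels as a bit string. It uses scikit-learn's GaussianMixture as a stand-in for the clustering step. The blockage values, the four-sub-level layout, and the mapping of deeper blockage to bit 1 are all invented for illustration.

```python
from collections import Counter

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)

# 50 synthetic events, each with 4 sub-levels encoding one bit apiece.
true_bits = rng.integers(0, 2, size=(50, 4))
blockage = np.where(true_bits == 1, 0.4, 0.2) + rng.normal(0, 0.01, true_bits.shape)
log_duration = rng.normal(-4.0, 0.05, true_bits.shape)

# One row per sub-level: (sub-level blockage, log10 sub-level duration).
features = np.column_stack([blockage.ravel(), log_duration.ravel()])

gmm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(features)
raw_labels = gmm.predict(features)
confidence = gmm.predict_proba(features).max(axis=1)

# Cluster numbering is arbitrary, so relabel by mean blockage:
# label 0 = shallower sub-level, label 1 = deeper sub-level.
order = np.argsort(gmm.means_[:, 0])
rank = np.empty_like(order)
rank[order] = np.arange(len(order))
sublevel_labels = rank[raw_labels].reshape(true_bits.shape)

# Bin events by the bit string formed from their sub-level labels.
codes = ["".join(str(int(bit)) for bit in row) for row in sublevel_labels]
counts = Counter(codes)
print(counts.most_common(3))
```

Relabeling by mean blockage matters because the integer assigned to each cluster by the fit has no physical meaning on its own.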


Experimentation

There is no general rule for the best combination of algorithm, parameters, and feature set for a given task, since your research is likely entirely unique. Experiment with different parameters to get a feel for how the clustering changes and develop methods that work for your data. As always, we are happy to help if you get stuck. Simply reach out and ask.




Last edited: 2021-08-02