# Clustering and sub-structure labeling

The examples in this tutorial were done using __Nanolyzer™__, Northern Nanopore's data analysis software.

## Clustering

Identifying and labeling patterns and clusters in nanopore data is a critical part of solid-state nanopore research. When you first started in the field you might have been looking at the difference between folded and unfolded DNA. Researchers now look at complex translocation events from proteins, information-bearing synthetic polymers, sugars, 3D DNA origami nanostructures, and hybrid molecules, all of which contribute unique properties to the signal that Nanolyzer can fit.

The Clustering tab is used in Nanolyzer for two main purposes. The first is to identify sub-populations within your events. For example, if you have a mixed sample with multiple different molecule types, you can use Nanolyzer to separate events into categories based on their __metadata__ with just a few clicks. The second is to identify and label substructure within an event. For example, if you have an information-bearing polymer with sub-levels encoding bits of information, you can use Nanolyzer to label those sub-levels and bin your molecules according to the information encoded inside them, again all with just a few clicks.

## Clustering algorithms

There are currently two clustering methods implemented in Nanolyzer, with more on the way. The first is a __Gaussian mixture__ model, which assumes that the population is made up of a user-specified number of sub-populations that form N-dimensional Gaussians in the selected parameter space. The second is called __HDBscan__, a shape-independent clustering model that attempts to automatically detect how many significant clusters there are, based on user-specified limits on the sensitivity and the number of points in a cluster.

### Gaussian Mixture Clustering

The parameters for Gaussian mixture clustering are the number of clusters and the stability. Number of clusters is exactly what it sounds like. If your data contains clusters that are easy to identify visually, simply specify the number of clusters and the algorithm will usually identify them without trouble. The stability parameter refits the model from a different initial guess up to the number of times specified, choosing the best fit (in the least-squares sense) over the set of all fits. This is necessary because Gaussian mixture fitting uses random seeds for the initial guess and is therefore not deterministic in principle. It is possible to get different cluster assignments between runs of the algorithm, a problem that is mitigated in the limit of a very large stability parameter.

Gaussian mixtures should be used when you have populations of events in parameter space that are roughly N-dimensional Gaussian distributions. Note that this method assigns a cluster to every event, even outliers, so it is usually best to __filter__ outliers prior to clustering if you plan to use it.
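To make the role of the stability parameter concrete, here is a minimal sketch of a one-dimensional Gaussian mixture fitted by expectation-maximization with random restarts, using only the Python standard library. It is not Nanolyzer's implementation: the function names (`fit_once`, `fit_with_restarts`) are hypothetical, and the sketch assumes that fits are ranked by log-likelihood, which may differ from the criterion Nanolyzer uses internally.

```python
import math
import random

def gaussian_pdf(x, mean, var):
    """Density of a 1D normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mixture_density(x, weights, means, variances):
    """Density of the whole mixture at x."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))

def fit_once(data, n_clusters, rng, n_iter=50):
    """One EM run from a random initial guess. Returns (log_likelihood, sorted means)."""
    n = len(data)
    grand_mean = sum(data) / n
    grand_var = sum((x - grand_mean) ** 2 for x in data) / n
    means = rng.sample(data, n_clusters)  # random initial guess: pick data points
    variances = [grand_var] * n_clusters
    weights = [1.0 / n_clusters] * n_clusters
    for _ in range(n_iter):
        # E-step: responsibility of each cluster for each point
        resp = []
        for x in data:
            parts = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(parts) or 1e-300  # guard against numerical underflow
            resp.append([p / total for p in parts])
        # M-step: update each cluster from its weighted points
        for k in range(n_clusters):
            nk = sum(r[k] for r in resp) + 1e-12
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var_k, 1e-6)  # keep variances strictly positive
            weights[k] = nk / n
    log_lik = sum(
        math.log(mixture_density(x, weights, means, variances) or 1e-300) for x in data
    )
    return log_lik, sorted(means)

def fit_with_restarts(data, n_clusters, stability, seed=0):
    """Refit `stability` times from different initial guesses and keep the best fit."""
    rng = random.Random(seed)
    fits = [fit_once(data, n_clusters, rng) for _ in range(stability)]
    return max(fits, key=lambda fit: fit[0])  # assumption: rank fits by log-likelihood

# Synthetic feature values (made up for illustration): two populations near 1.0 and 5.0.
data_rng = random.Random(1)
data = [data_rng.gauss(1.0, 0.2) for _ in range(100)]
data += [data_rng.gauss(5.0, 0.3) for _ in range(100)]

single_run = fit_with_restarts(data, n_clusters=2, stability=1)
best_of_ten = fit_with_restarts(data, n_clusters=2, stability=10)
print("stability=1 :", single_run)
print("stability=10:", best_of_ten)
```

A single run can stall on a poor fit when both initial guesses land in the same population. Keeping the best of several restarts makes that outcome much less likely, at the cost of a proportionally longer run time.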

### HDBscan Clustering

HDBscan parameters are less intuitive. The first parameter, Min Cluster Size, is exactly what it sounds like: when splitting your events into clusters, the algorithm will not create clusters containing fewer points than this. The second parameter, Min Samples, determines how conservative the algorithm is when assigning points to a cluster or to noise. A larger value results in more points being rejected as outliers. Finally, the sensitivity parameter roughly determines how often clusters are split up. A very small value will result in many small clusters (limited by the choice of Min Cluster Size and Min Samples), while a large value will result in fewer, larger clusters.

Unlike the Gaussian mixture model, HDBscan responds less predictably to its inputs, so some experimentation may be required to get good fits. However, it has two major advantages. First, it is not limited to fitting N-dimensional Gaussian sub-populations, which makes it more flexible for datasets containing sub-populations that do not fit that model. Second, it automates outlier selection, reducing the influence of bad fits and other artefacts on the labeling.

### Feature Selection

The clusters that are built depend on the feature selection. The most common features on which to cluster are Passage Time or Sub-level Duration, Maximum Blockage Depth or Sub-level Blockage, and Number of Levels, though many other combinations are possible. If you find that you are unable to separate populations that you believe should be separated, adding another feature (for example Sub-level Standard Deviation) will often allow better separation. Note that you can cluster on as many features as you want simultaneously, though you will only be able to visualize three of them at a time.

### Feature Normalization

Regardless of the algorithm chosen, the choice of features on which to cluster and the normalization method applied internally to each feature will make a significant difference. Every feature you add starts out with a default normalization, which depends on the column chosen. The four options are Max, Gauss, MAD, and None. Max normalization divides a feature by its maximum value over the entire dataset. Gauss normalization applies the standard transform to N(0,1) by subtracting the mean and dividing by the standard deviation, which makes clustering less sensitive overall. MAD normalization is similar to Gauss, but uses the sample median and the median absolute deviation instead of the mean and standard deviation. None uses the raw values of the column. If you notice that the clustering is overly sensitive to a particular parameter, try changing the normalization method used. In order of sensitivity from least to greatest, the normalization methods of choice are usually Max, Gauss, MAD, and None.
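The four options can be sketched in a few lines of standard-library Python. This is an illustration of the definitions above, not Nanolyzer's code: the function name `normalize` is hypothetical, and details such as using the population standard deviation and leaving the MAD unscaled are assumptions.

```python
import statistics

def normalize(values, method="Gauss"):
    """Return a normalized copy of one feature column."""
    if method == "None":
        return list(values)  # raw values, unchanged
    if method == "Max":
        peak = max(values)
        return [v / peak for v in values]
    if method == "Gauss":
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)  # assumption: population standard deviation
        return [(v - mean) / std for v in values]
    if method == "MAD":
        median = statistics.median(values)
        # assumption: plain median absolute deviation, with no consistency scaling
        mad = statistics.median([abs(v - median) for v in values])
        return [(v - median) / mad for v in values]
    raise ValueError(f"unknown normalization method: {method}")

# One feature column with a single extreme value (made-up numbers).
column = [1.0, 2.0, 3.0, 4.0, 100.0]
for method in ("Max", "Gauss", "MAD", "None"):
    print(method, normalize(column, method))
```

With an extreme value in the column, Max compresses the remaining points into a narrow range near zero, while MAD keeps them well separated because the median and MAD are barely affected by the extreme value. This is one way to see why the methods differ in how sensitive the clustering is to a given feature.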

## Sub-population labeling

To label events by sub-population, choose at least two metadata features that correspond to events only (not sub-levels), check off up to three features to be plotted, choose normalization and log-scaling options, and press "Update Event IDs". This will generate a visual representation of the clusters found, along with integer labels for each cluster that can now be used in the Statistics tab for __visualization__ or the Data Manager for __event database filtering__. The system will also assign a confidence score to each cluster assignment. Adjust the clustering parameters until you are satisfied with the assignments.

## Sub-level labeling

To label sub-levels, choose at least two metadata features on which to cluster, at least one of which must correspond to a __sub-level feature__. As before, check off up to three features to plot, choose normalization and log-scaling options, and press "Update Sublevels". This will plot a visual representation of sub-level assignments. This operation calculates two new sub-level metadata features that can now be used in __visualization__ and __event database filtering__ operations: the label for each sub-level and the confidence with which that label has been applied. Adjust the clustering parameters until you are satisfied with the assignments.

## Experimentation

There is no general rule for the best combination of algorithm, parameters, and feature set for a given task, since your research is likely entirely unique. Experiment with different parameters to get a feel for how the clustering changes and develop methods that work for your data. As always, we are happy to help if you get stuck. Simply reach out and ask.

Last edited: 2021-08-02