Robust Clustering Methods for Sequence Analysis

A Practical Guide to Consensus and Noise Clustering in R

Leonhard Unterlerchner

Matthias Studer

Introduction

Sequence analysis (SA) is a holistic method for studying trajectories. Using a range of techniques, from visualization to explanation, this approach allows researchers to describe, compare, and identify patterns or irregularities in trajectories.

One key step is to create a typology of the trajectories with cluster analysis. This typology describes the various kinds of patterns observed and can be used as a categorical variable in subsequent analyses (Liao et al. 2022). Clustering is therefore central to SA, as it strongly shapes the analyses that follow.

However, clustering also features among the main criticisms of SA, for several reasons. First, typologies created using cluster analysis might be unstable or sample dependent or, more generally, perform poorly depending on the data characteristics. This raises concerns about the reliability of the results (Roth et al. 2024). Second, these methods might perform poorly in the presence of outliers, when some observations lie between clusters, or when the data are weakly structured, i.e. when cluster separation is unclear and clusters are not homogeneous (see Figure @ref(fig:figClustStr)) (Balcan, Liang, and Gupta 2014; Martin, Schoon, and Ross 2008, FIXMEworkingpaper). Third, these methods might fail to identify uncommon subgroups, although infrequent types and outliers might be of key interest for identifying atypical or emerging behaviours (Sacchi and Meyer 2016; Unterlerchner, Studer, and Gomensoro 2023).

Two clustering approaches, noise and consensus clustering, address these limitations. This document describes these clustering algorithms and provides the R code to create typologies of trajectories using the consClust and seqclararange functions provided by the WeightedCluster R library (Studer 2013). It also presents methods to evaluate the quality of the resulting clusterings.

The document is structured as follows. We start by presenting the data and its preparation in Section @ref(secData). After briefly presenting cluster analysis in Section @ref(secClustering), we present the creation and evaluation of typologies using consensus and noise clustering in Sections @ref(secConsClust) and @ref(secNoiseClust). We conclude with the advantages of each approach in Section @ref(secConclusion).

N.B. The running time is stated below computationally intensive chunks (time \(\ge\) 1 sec.).

Data Preparation

We rely on the mvad dataset to illustrate the use of the consClust and seqclararange functions. This public dataset is distributed with the TraMineR R package. It contains the data used by McVicar and Anyadike-Danes (2002) for studying school-to-work transitions in Northern Ireland.

First, we create a state sequence object using the seqdef command (Gabadinho et al. 2011). Trajectories can be plotted using the seqIplot command, see Figure @ref(fig:figSeqMvad).

# Loading the package
library(WeightedCluster)

# Loading illustrative data
data(mvad)

# Creating state sequence object
mvad.seq <- seqdef(mvad[, 17:86], # The data containing information on trajectories
                   labels = c("Employment", "Further Education", # The states 
                              "Higher Education", "Joblessness", 
                              "School", "Training"), 
                   xtstep = 6)
# Plotting the sequences
seqIplot(mvad.seq, 
         legend.prop=0.2, 
         sortv = "from.start") # sequences are sorted by the states at the start
MVAD trajectories, sorted from start

Second, to perform cluster analysis, we compute a dissimilarity matrix comparing the trajectories using the seqdist command. We use the LCS dissimilarity measure, which captures differences in both timing and sequencing within the trajectories. Its versatility makes it a standard choice in SA. For more details and other measures, see Studer and Ritschard (2016).

# Compute LCS dissimilarities
diss <- seqdist(mvad.seq, 
                method="LCS")

Time for this code chunk to run: 0.8 seconds

Clustering Approaches

We now turn to the creation of typologies using cluster analysis, a data mining technique grouping similar observations into types. A multitude of clustering algorithms (CAs) have been proposed to fulfil different aims. A key distinction resides in the kind of typology returned by the algorithm, which can be crisp or fuzzy (Hennig et al. 2015).

Both noise and consensus clustering can produce either kind of partition. To allow an informed choice on this matter, we briefly discuss both kinds below.

Crisp Clustering

Crisp clustering partitions a dataset so that each observation belongs to exactly one cluster and no clusters overlap. This makes crisp clustering compatible with any method handling categorical data. However, this approach compresses potentially rich dissimilarity information into a single categorical assignment.

As a result, members of the same cluster may be wrongly regarded as identical or highly similar even when important differences exist. Hybrid cases, i.e. observations lying between several clusters, must still be forced into only one cluster. Crisp clustering therefore makes the identification of such observations difficult refWorkingPaper.
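
As a minimal illustration of a crisp partition, the following sketch (not part of the original analysis) cuts a Ward tree built on the mvad dissimilarities into four groups; each sequence receives exactly one label.

# Illustrative sketch only: a crisp partition via hierarchical clustering
ward.tree <- hclust(as.dist(diss), method = "ward.D")
crisp4 <- cutree(ward.tree, k = 4)
table(crisp4) # each sequence belongs to exactly one of the four clusters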

Fuzzy Clustering

Fuzzy clustering allows each data point to have graded membership in several clusters instead of being forced into exactly one group. This method is more suitable than crisp clustering when the clustering structure is weak, leading to unclear and overlapping boundaries between categories. Figure @ref(fig:figClustStr) provides an example of two clusterings diverging in their structure strength.

Clusterings with the same centers but showing Strong or Weak Structure

By assigning membership degrees between 0 and 1, fuzzy methods can reveal hybrid cases, that is, observations that genuinely share characteristics of multiple clusters rather than fitting neatly into a single class. This soft assignment also improves robustness to noise and outliers because uncertain points can be given distributed memberships rather than being wrongly forced into one cluster (Studer 2018; Ruspini, Bezdek, and Keller 2019; Helske, Helske, and Chihaya 2023).
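
For comparison, a fuzzy partition of the same dissimilarities can be sketched with the fanny algorithm of the cluster package. This is purely illustrative; fanny is not the method applied later in this vignette. Each row of the resulting membership matrix sums to one.

# Illustrative sketch only: fuzzy clustering on a dissimilarity matrix
library(cluster)
fuzzy4 <- fanny(as.dist(diss), k = 4, memb.exp = 1.5)
round(head(fuzzy4$membership), 2) # graded memberships instead of hard labels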

Consensus Clustering

The first robust clustering method presented in this vignette is consensus clustering. This technique aims to increase the robustness of the clustering results by diminishing their sample dependence or by taking advantage of several clustering rationales. In a simulation study, we found consensus clustering to be particularly versatile and robust [workingpaper FIXMEREF]. It proceeds in two steps.

First, several clusterings are computed to form an ensemble of partitions of the same data. This first step allows the ensemble of partitions to reflect the diversity of typologies that can be obtained on the same data. Monti et al. (2003) propose to generate the ensemble by clustering the same data with varying weights. To do so, we rely on Bayesian resampling, which simulates a bootstrap procedure except that all observations are always present, albeit weighted differently (Hornik and Böhm 2023).

The reweighted samples are then clustered using the computed weights and one of the specified CAs; if several CAs were specified, each is applied to an equal share of the reweighted samples. With a single CA, the aim is to reduce the typology's sample dependence. With several CAs, the aim is to achieve greater flexibility by benefiting simultaneously from several clustering rationales (Hennig et al. 2015).

Second, a consensus function searches for a consensus among these partitions to obtain a typology that synthesizes the information from the ensemble. The resulting typology is thus more robust. Depending on the method used, the consensus can be either a crisp or a fuzzy clustering.
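
To make the first step concrete, the following sketch mimics the ensemble generation using Bayesian-bootstrap weights (Dirichlet(1, ..., 1) draws, i.e. normalized Exp(1) draws) and the weighted PAM implementation wcKMedoids of WeightedCluster. This illustrates the principle only; it is not the internal implementation of consClust.

# Illustrative sketch of the ensemble-generation step: cluster the same
# dissimilarity matrix several times with Bayesian-bootstrap weights
set.seed(42)
n <- nrow(diss)
ensemble <- lapply(1:5, function(r) {
  w <- rgamma(n, shape = 1) # Exp(1) draws...
  w <- n * w / sum(w)       # ...rescaled so that the weights sum to n
  wcKMedoids(diss, k = 4, weights = w)$clustering
})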

Creating the typology

In this section, we create a typology using the consensus clustering framework proposed by Monti et al. (2003). To do so, we use the consClust command of the WeightedCluster package. The function takes a dissimilarity matrix diss as input data. The argument base.clust specifies the clustering algorithms used to create the ensemble of R partitions. When several good candidates exist for the same task, specifying several CAs allows achieving greater flexibility [see FIXME workingpaper]. The argument kvals specifies the numbers of groups the algorithm looks for, and cons.method sets the consensus function. In the following example, we rely on the SE method, which searches for the consensus minimizing the sum of Euclidean dissimilarities to the partitions of the ensemble. Please refer to Hornik and Böhm (2023) for details on the available methods. The argument membership defines whether the returned clustering takes the form of fuzzy membership matrices or crisp label vectors. The argument k.fixed prevents the consensus function from producing a typology with more groups than in the ensemble of partitions.

In the following example, the typology is computed on an ensemble of 100 partitions obtained with the PAM and Ward clustering algorithms (for details on these algorithms, see FIXMEREF unterlerchnerStuder 2026).

Setting parallel=TRUE sets up a default parallel back-end using the future framework (Bengtsson 2026). When parallel=FALSE, any parallel back-end previously defined with the plan function is used instead. The parallel protocol can then be adapted to specific environments; for instance, some High Performance Computing (HPC) servers rely on specific protocols (MPI,…). We use the latter strategy here, and any subsequent call will use this parallel back-end. Setting progressbar=TRUE shows information (including estimated computation time) on the progress of the computations.

# Setting up parallel computing
library(future)
plan(multisession)
# Creating the typology
set.seed(1234)
pamWardConsClust <- consClust(diss,
                              base.clust = c("pam", "ward.D"), 
                              R = 100, 
                              kvals = 2:15,
                              cons.method = "SE", 
                              membership = "crisp",
                              k.fixed = TRUE,
                              agg.method = "cRand",
                              keep.ensemble = TRUE,
                              parallel = FALSE, 
                              progressbar = FALSE)
## [>] Performing consensus clustering on 100 partitions, using: pam, ward.D
## [>] Elapsed time: 15.99 secs

Time for this code chunk to run: 16.44 seconds

The function returns a consClust object, containing the obtained consensus clusterings, the function call and Cluster Quality Indices (CQIs). If keep.ensemble = TRUE, the ensemble of partitions is stored in the returned object.

Evaluating and plotting the typology

To guide the user on the adequate number of groups to keep for the final typology, the CQIs can be displayed by typing the name of the returned object pamWardConsClust.

# Showing CQIs
pamWardConsClust
##            PBC   HG HGSD  ASW ASWw     CH   R2   CHsq R2sq   HC cons_cRand
## cluster2  0.66 0.80 0.80 0.45 0.45 237.88 0.25 500.72 0.41 0.10       0.83
## cluster3  0.57 0.69 0.68 0.34 0.34 189.30 0.35 401.68 0.53 0.15       0.54
## cluster4  0.50 0.65 0.64 0.31 0.32 159.17 0.40 323.92 0.58 0.18       0.59
## cluster5  0.58 0.79 0.79 0.37 0.38 171.09 0.49 425.71 0.71 0.10       0.59
## cluster6  0.57 0.80 0.79 0.37 0.37 166.36 0.54 426.83 0.75 0.10       0.65
## cluster7  0.56 0.84 0.84 0.38 0.39 161.74 0.58 447.82 0.79 0.08       0.67
## cluster8  0.56 0.86 0.86 0.39 0.39 147.21 0.59 389.65 0.79 0.08       0.67
## cluster9  0.56 0.91 0.90 0.41 0.42 151.58 0.63 468.80 0.84 0.06       0.71
## cluster10 0.55 0.90 0.90 0.38 0.39 137.04 0.64 420.98 0.84 0.06       0.69
## cluster11 0.54 0.92 0.91 0.42 0.43 132.31 0.65 419.41 0.86 0.06       0.68
## cluster12 0.50 0.90 0.90 0.39 0.40 127.47 0.67 402.59 0.86 0.07       0.67
## cluster13 0.49 0.91 0.90 0.38 0.39 122.55 0.68 392.25 0.87 0.07       0.67
## cluster14 0.48 0.91 0.90 0.37 0.38 116.27 0.68 371.49 0.87 0.06       0.65
## cluster15 0.46 0.90 0.90 0.34 0.35 114.85 0.70 361.42 0.88 0.07       0.65

Measures of agreement between the partitions used to obtain the consensus clustering are also provided. They allow the evaluation of the ensemble’s clustering stability. We propose relying on the Adjusted Rand Index (cRand). It measures the similarity between partitions. A value of 1 indicates two identical clusterings, 0 indicates similarity obtained by chance and highly dissimilar clusterings are associated with negative values (Hubert and Arabie 1985). Studer, Sadeghi, and Tochon (2024) propose the following similarity interpretation thresholds: strong (ARI \(\ge\) 0.9), good (ARI \(\ge\) 0.8) and weak (ARI \(\ge\) 0.7).
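
As a quick illustration of these values, the following toy example computes the ARI between artificial partitions (it assumes the mclust package, which is not otherwise used in this vignette):

# Toy illustration of the Adjusted Rand Index
library(mclust)
a <- rep(1:3, each = 10)        # a partition of 30 observations
b <- a; b[1:2] <- c(2, 3)       # a slightly perturbed copy
adjustedRandIndex(a, a)         # 1: identical partitions
adjustedRandIndex(a, b)         # high, but below 1
adjustedRandIndex(a, sample(a)) # close to 0 on average: chance-level agreement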

High cRand values indicate a high level of stability in the partition ensemble and, by extension, a more robust consensus typology. Low cRand values can be interpreted in two ways (Warrens and van der Hoef 2022). First, if the partitions are obtained with a single CA, a low value indicates that the partitions depend on the subsamples they are computed on, and that a single clustering of the whole sample would not be robust. Second, if the partitions are obtained from several CAs, a low cRand means that the CAs lead to different results. This can be expected when one uses consensus clustering to benefit from CAs following different rationales. However, since the first interpretation still applies in this case, the exact contribution of each dynamic to the index is unknown.

CQIs can be plotted with the plot command, see Figure @ref(fig:plotConsClustCqi). Because CH and CHsq show much higher values than the other CQIs, we normalize the CQIs using the argument norm = "zscore", which allows plotting all of them on the same figure with the argument stat = "all".

# Plotting CQIs
par(cex = 0.75)
plot(pamWardConsClust,
     legendpos = "topleft",
     stat = "all",
     norm = "zscore") # CQIs are standardized
PAM and Ward consensus clustering CQIs (normalized)

Internal CQIs (HG, CHsq and HC) indicate a nine- or eleven-cluster solution, as they are maximized (minimized for HC) for these numbers of groups; see FIXMEworkingpaper for details on the use of CQIs to select the number of groups. The cRand is maximized for nine clusters and indicates a good level of agreement within the partition ensemble (Studer, Sadeghi, and Tochon 2024). We can now plot the trajectories according to the nine-group typology (Figure @ref(fig:consClustSeqplot)), which is the more parsimonious of the two.

# Plotting the consensus typology in nine groups
par(mar = c(2,2,2,2))
seqIplot(mvad.seq,
         group = pamWardConsClust$clustering$cluster9, # Specifying the clustering to use for plotting
         main = c("Further Ed. - Higher Ed.", "Joblessness", # naming the clusters in plot
                  "Training - Employment", "Training",
                  "School - Higher Ed.", "Further Ed. - Employment", 
                  "Employment", "School - Employment", 
                  "Futher Ed."), 
         cex.legend = 0.8)
Crisp consensus clustering in nine groups (PAM and Ward)

Fuzzy Consensus clustering

We now compute the same consensus clustering in its fuzzy version. This is done by setting the argument membership = "fuzzy".

# Creating the typology
set.seed(1234)
pamWardConsClustF <- consClust(diss,
                              base.clust = c("pam", "ward.D"),
                              R = 100, 
                              kvals = 2:15, 
                              cons.method = "SE", 
                              membership = "fuzzy", 
                              k.fixed = TRUE,
                              agg.method = "cRand",
                              keep.ensemble = TRUE,
                              progressbar = FALSE)
## [>] Performing consensus clustering on 100 partitions, using: pam, ward.D
## [>] Elapsed time: 7.11 secs

Time for this code chunk to run: 7.11 seconds

The obtained typology can be plotted using the fuzzyseqplot function, see Figure @ref(fig:plotConsFuzzy). In each panel, sequences are sorted according to their membership probability. Each panel only displays sequences with a membership probability \(\ge 0.4\).

par(mar = c(2,2,2,2))
fuzzyseqplot(mvad.seq, # sequences to plot 
             group = pamWardConsClustF$clustering$cluster9, # grouping variable
              main = c("Further Ed. - Higher Ed.", "Joblessness",# naming the clusters
                       "Training - Employment", "Training",
                       "School - Higher Ed.", "Further Ed. - Employment",
                       "Employment", "School - Employment",
                       "Futher Ed."), 
             membership.threshold = 0.4,
             sortv = "membership",
             type = "I", # We plot an index plot
             cex.legend = 0.8) 
Fuzzy consensus clustering in nine groups (PAM and Ward), sorted by membership probability

The obtained fuzzy typology provides clusters similar to the crisp ones. The added value is that the clusters' diversity can be better described by looking at the panels, where the most typical sequences are shown at the top.

Noise clustering

Noise clustering is another robust clustering technique. Contrary to most clustering algorithms, it does not provide exhaustive typologies: observations are not coerced to belong to a cluster but can remain unclassified. In that case, they are labelled as noise.

This approach has two advantages. First, unclassifiable observations are not assigned to clusters in which they would fit poorly. As a result, clusters are better defined and more homogeneous. Second, by flagging them as noise, unclassifiable trajectories can be studied per se (Liao et al. 2022; Piccarreta and Struffolino 2023). Such trajectories might be of great interest in some research designs, as they often denote particularly good (or ill) situations, or might be associated with particular outcomes in later life (Sacchi and Meyer 2016; Unterlerchner, Studer, and Gomensoro 2023).

In its fuzzy variant, once the noise group is set aside, the membership degrees are no longer coerced to sum to one. Fuzzy noise clustering can thus be seen as a variant of possibilistic clustering, which provides more coherent membership degrees in the presence of noise in the data (D'Urso 2015).

Creating the typology

To create the typology, we use a fuzzy extension of the CLARA algorithm that allows labelling sequences as noise instead of assigning them to a cluster. CLARA is a medoid-based clustering method; rather than clustering the whole dataset, medoids are searched for on a subsample. The clustering is then extended to the whole dataset, and the operation is repeated to ensure the robustness of the results. CLARA can therefore be applied to large datasets. The fuzzy approach is well suited to the identification of noise, as searching for exact analytical solutions is extremely computationally intensive.
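
The following sketch illustrates the CLARA principle on our data (an illustration only, not the internals of seqclararange): medoids are searched on a subsample with PAM, the clustering is extended to all sequences, and the best of several repetitions is kept.

# Illustrative sketch of the CLARA principle
library(cluster)
set.seed(1234)
best <- NULL
for (r in 1:5) {
  sub <- sample(nrow(diss), 200)                          # draw a subsample
  med <- sub[pam(as.dist(diss[sub, sub]), k = 7)$id.med]  # medoids on the subsample
  cl <- apply(diss[, med], 1, which.min)                  # extend to all sequences
  crit <- mean(diss[cbind(seq_len(nrow(diss)), med[cl])]) # avg distance to medoid
  if (is.null(best) || crit < best$crit)
    best <- list(clustering = cl, medoids = med, crit = crit)
}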

We use the seqclarange command (with the argument method = "noise") of the WeightedCluster package to create the typology with noise. R specifies the number of times the operation is repeated. The subsample size is defined by the sample.size argument. For more details on the use of seqclarange please refer to Studer (2024).

The argument dnoise is a tuning parameter controlling the algorithm's sensitivity to noise. It sets the distance \(\delta\) to the medoids beyond which an observation is considered as not belonging to any type. Defining this parameter plays a critical role in the typology creation, as it directly affects the number of observations labelled as noise: higher \(\delta\) values label fewer trajectories as noise. We discuss the definition of \(\delta\) in detail and give examples in Section @ref(secDnoise).

Dave (1991) defines this distance from the average dissimilarity in the sample using the following formula, with \(n\) the number of sequences, \(\mathbf{x}_i\) the sequences, \(d\) the dissimilarity measure, and \(\lambda\) a user-defined coefficient.

\(\delta = \lambda \cdot \frac{2} {n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} d(\mathbf{x}_i, \mathbf{x}_j)\).

Using the above formula and setting \(\lambda\) to 0.8 leads to a \(\delta\) of 68.4. Since the LCS is equivalent to Optimal Matching with constant costs, this value can be interpreted theoretically (Studer and Ritschard 2016): with a substitution cost of 2, a distance of 68.4 means that a sequence needs to differ from every medoid during roughly 34 months in total (68.4/2) to be considered as noise.

delta08 <- mean(diss) * 0.8 # Calculating the noise distance delta
delta08
## [1] 68.39203
# Creating a typology with noise
set.seed(1234)
noiseClust08 <- seqclararange(mvad.seq, 
                              kvals = 2:15,
                              R = 50, # Number of subsamples
                              sample.size = nrow(mvad.seq), 
                              method = "noise",
                              dnoise = delta08, # noise sensitivity
                              seqdist.args = list(method = "LCS"))

Time for this code chunk to run: 48.09 seconds

Evaluating and plotting the typology

Medoid-based fuzzy CQIs are computed alongside the clustering (Studer, Sadeghi, and Tochon 2024). Applied to noise clustering, their interpretation is difficult because not all clusters are created following the same rules: the noise cluster is not constructed to be homogeneous. In this context, we recommend using CQIs only to guide the selection of the number of groups, not to compare clusterings obtained with different algorithms.

CQIs can be displayed by typing the object name, and the plot command produces a figure of the CQIs, see Figure @ref(fig:plotNoiseClustCqi).

# Showing CQIs
noiseClust08
##           Avg dist  PBM   DB   XB  AMS ARI>0.8 JC>0.8 Best iter
## cluster2     33.25 1.73 2.64 0.24 0.73      NA     NA        32
## cluster3     26.01 0.96 3.50 0.54 0.71      NA     NA         4
## cluster4     21.61 0.63 3.65 0.49 0.78      NA     NA         7
## cluster5     18.95 0.44 3.87 0.43 0.81      NA     NA        32
## cluster6     17.45 0.33 4.32 0.40 0.79      NA     NA        34
## cluster7     15.54 0.28 4.56 0.35 0.81      NA     NA         6
## cluster8     14.27 0.22 5.11 0.59 0.82      NA     NA        15
## cluster9     13.32 0.18 5.57 0.56 0.82      NA     NA        46
## cluster10    12.44 0.16 5.59 0.52 0.82      NA     NA        16
## cluster11    11.80 0.13 5.91 0.59 0.84      NA     NA        29
## cluster12    11.15 0.11 6.08 0.62 0.84      NA     NA        35
## cluster13    10.75 0.10 6.10 0.54 0.83      NA     NA        27
## cluster14    10.17 0.09 6.04 0.56 0.84      NA     NA        20
## cluster15     9.85 0.08 6.54 0.55 0.84      NA     NA         4
# Plotting CQIs
plot(noiseClust08, 
     legendpos = "topleft")
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "legendpos" is not a
## graphical parameter
Fuzzy noise clustering CQIs, lambda = 0.8

DB and XB indicate a seven-group typology. Figure @ref(fig:noiseClustSeqplot) provides a graphical representation of the typology using the fuzzyseqplot command. Additionally, sequences are sorted according to their membership strength (Studer 2018). Regarding the membership.threshold argument, we use a small value so that sequences labelled as noise, which tend to be associated with dispersed membership probabilities, can still be displayed.

## Displaying the resulting clustering with membership threshold of 0.20
par(mar = c(2,2,2,2))
fuzzyseqplot(mvad.seq, 
             group = noiseClust08$clustering$cluster7,  
             main = c("Futher Ed.", "Employment", # naming the clusters
                      "School - Higher Ed.", "Further Ed. - Higher Ed.",
                      "Joblessness", "Training - Employment",
                      "Further Ed. - Employment", "Noise Seq."), 
             membership.threshold = 0.20,
             type = "I", # We plot an index plot
             sortv = "membership", 
             cex.legend = 0.8)
Fuzzy noise clustering in seven groups, lambda = 0.8, sorted by membership probability

Visual inspection of the group of sequences labelled as noise indicates that it cannot be considered as a regular type: it is heterogeneous and features observations diverging strongly from the other types in their sequencing.

Converting the fuzzy typology to crisp

To be used with methods handling categorical data, the fuzzy clustering can be transformed into a crisp one by assigning each observation to the cluster with the highest membership probability. This can be done using the as.crisp command, see Figure @ref(fig:crispNoiseClustSeqplot).
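
For illustration, the rule applied by as.crisp can also be sketched manually, assuming the fuzzy memberships of the seven-group solution are stored as a matrix in the clustering element (as when plotting above):

# Manual equivalent of the max-membership rule (sketch)
memb7 <- noiseClust08$clustering$cluster7 # fuzzy membership matrix (assumed)
crisp7 <- apply(memb7, 1, which.max)      # cluster with the highest membership
table(crisp7)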

crispNoiseClust08 <- as.crisp(noiseClust08)
par(mar = c(2,2,2,2))
seqIplot(mvad.seq, 
         group = crispNoiseClust08$clustering$cluster7,
         main = c("Futher Ed.", "Employment", # naming the clusters
                  "School - Higher Ed.", "Further Ed. - Higher Ed.",
                  "Joblessness", "Training - Employment",
                  "Further Ed. - Employment", "Noise Seq."), 
         cex.legend = 0.8)
Crisp noise clustering in seven groups, lambda = 0.8

Defining dnoise

Defining dnoise is a critical step in performing noise clustering, as it controls the number of observations labelled as noise.

To discuss the dnoise argument, we now provide three examples of noise clustering with varying \(\lambda\) values and numbers of groups. For conciseness, we present only the crisp versions of the clusterings.

Setting \(\lambda\)

When using Dave's (1991) formula to define dnoise, the coefficient \(\lambda\) acts as a tuning parameter making the algorithm more or less sensitive to noise. Higher \(\lambda\) values lead to more conservative noise labelling.

We propose setting \(\lambda\) by visually inspecting the clusterings obtained for several values chosen around one. Dave (1991) suggested using smaller values; however, in our case, applied to high-dimensional categorical data, such values proved too restrictive. We provide two examples of noise clustering with different \(\lambda\) values to discuss their impact on the resulting typologies.

# Calculating the noise distance delta
delta06 <- mean(diss) * 0.6 
delta06
## [1] 51.29403

The \(\lambda\) parameter is now decreased to 0.6. As the resulting \(\delta\) is smaller, the algorithm will label more trajectories as noise.

# Creating a typology with noise
set.seed(1234)
noiseClust06 <- seqclararange(mvad.seq, 
                              kvals = 2:15,
                              R = 50,
                              sample.size = nrow(mvad.seq),
                              method = "noise", 
                              dnoise = delta06,
                              seqdist.args = list(method = "LCS"))

# Converting the fuzzy partition to crisp
crispNoiseClust06 <- as.crisp(noiseClust06) 

Time for this code chunk to run: 48.01 seconds

par(mar = c(2,2,2,2))
# Plotting the crisp typology
seqIplot(mvad.seq,
         group = crispNoiseClust06$clustering$cluster7, 
         main = c("Futher Ed.", "Employment", 
                  "School - Higher Ed.", "Further Ed. - Higher Ed.",
                  "Joblessness", "Training - Employment",
                  "Further Ed. - Employment", "Noise Seq."),
         cex.legend = 0.8)
Crisp noise clustering in seven groups, lambda = 0.6

As expected, lowering \(\lambda\) to 0.6 sharply increases the number of sequences identified as noise (see Figure @ref(fig:noiseClust06Seqplot)). This group being highly heterogeneous, it cannot be considered as a type. However, the seven other types are more homogeneous than before. If obtaining such homogeneous types suits the research aim, such a \(\lambda\) value would be adequate.

In the following example, \(\lambda\) is increased to 1.

delta1 <- mean(diss) * 1 # Calculating the noise distance delta
delta1
## [1] 85.49004
# Creating a typology with noise
set.seed(1234)
noiseClust <- seqclararange(mvad.seq, 
                            kvals = 2:15,
                            R = 50,
                            sample.size = nrow(mvad.seq),
                            method = "noise", 
                            dnoise = delta1,
                            seqdist.args = list(method = "LCS"))

# Converting the fuzzy partition to crisp
crispNoiseClust <- as.crisp(noiseClust)

Time for this code chunk to run: 48.22 seconds

# Plotting the crisp typology
par(mar = c(2,2,2,2))
seqIplot(mvad.seq, 
         group = crispNoiseClust$clustering$cluster7,
         main = c("Futher Ed.", "Employment", 
                  "School - Higher Ed.", "Further Ed. - Higher Ed.",
                  "Joblessness", "Training - Employment",
                  "Further Ed. - Employment", "Noise Seq."),
         cex.legend = 0.8)
Crisp noise clustering in seven groups, lambda = 1

This new \(\lambda\) decreases the algorithm's sensitivity to noise to the point that only very few sequences are labelled as such (see Figure @ref(fig:noiseClust1Seqplot)). The noise group's extremely small size impedes its use in subsequent analyses.

dnoise and number of groups

We now discuss the link between the number of groups in a typology and the sensitivity of dnoise. When the number of groups increases, the distances between the observations and the medoids diminish. In consequence, fewer sequences are labelled as noise for the same dnoise value. Figure @ref(fig:boxplotD2m) below presents boxplots of the distances to the medoids of a crisp clustering without noise.
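
The code generating Figure @ref(fig:boxplotD2m) is not shown above; the following sketch shows one way to obtain such boxplots, using a small hypothetical helper distToMedoid (not part of WeightedCluster) and a crisp clustering computed without noise (we assume the default crisp method of seqclararange).

# Hypothetical helper: distance of each sequence to its own cluster medoid
distToMedoid <- function(diss, clustering) {
  d <- rep(NA_real_, length(clustering))
  for (g in unique(clustering)) {
    members <- which(clustering == g)
    # The medoid minimizes the sum of dissimilarities to the cluster members
    medoid <- members[which.min(rowSums(diss[members, members, drop = FALSE]))]
    d[members] <- diss[members, medoid]
  }
  d
}

# Crisp clustering without noise, then distances to the medoids by number of groups
set.seed(1234)
crispClust <- seqclararange(mvad.seq, kvals = 2:15, R = 50,
                            sample.size = nrow(mvad.seq),
                            seqdist.args = list(method = "LCS"))
d2m <- sapply(2:15, function(k)
  distToMedoid(diss, crispClust$clustering[[paste0("cluster", k)]]))
boxplot(d2m, names = 2:15, xlab = "Number of clusters",
        ylab = "Distance to medoid")
abline(h = delta08, lty = 2) # delta optimized for seven groups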

Distance to medoid by number of clusters
Crisp noise clustering in ten groups, lambda = 0.8 (delta = 68.39)

In a ten-group typology, only 13 sequences are labelled as noise with dnoise = 68.4 (see Figure @ref(fig:crispNoiseTypo10)). There were 53 of them in the seven-group typology using the same dnoise. To achieve the same level of noise sensitivity, a lower dnoise is needed for a more detailed typology.

To avoid this behaviour and to derive dnoise values suitable for a greater number of groups, we propose the following strategy, sketched in the code below. First, we compute a clustering without noise and calculate the distances to the medoids in each cluster for every number of groups. Using these distances and a \(\delta\) optimized for a given number of groups (here seven), we can then calculate \(\delta\) values adapted to any number of groups, so that a comparable amount of noise is detected regardless of the number of groups.

Applying this strategy to our seven-group example leads to \(\delta\) = 62.57 for a ten-group typology. With this new \(\delta\), the number of observations labelled as noise in ten groups is close to the number labelled as such in seven groups with \(\delta\) = 68.39 (see Figure @ref(fig:ngroupDnoise)).
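
One possible implementation of this rescaling, reusing the hypothetical distToMedoid helper and the crispClust object sketched above (the exact procedure used for the original figure may differ):

# Rescale the delta optimized for seven groups by the shrinkage of the mean
# distance to the medoids when moving from seven to ten groups
d7 <- distToMedoid(diss, crispClust$clustering$cluster7)
d10 <- distToMedoid(diss, crispClust$clustering$cluster10)
delta10 <- delta08 * mean(d10) / mean(d7) # the text above reports delta = 62.57

# Noise clustering with the adapted delta; one way the crispNoiseClust10k
# object used in the next chunk could have been obtained
set.seed(1234)
noiseClust10k <- seqclararange(mvad.seq, kvals = 2:15, R = 50,
                               sample.size = nrow(mvad.seq),
                               method = "noise", dnoise = delta10,
                               seqdist.args = list(method = "LCS"))
crispNoiseClust10k <- as.crisp(noiseClust10k)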

par(mar = c(2,2,2,2))
seqIplot(mvad.seq, 
         group = crispNoiseClust10k$clustering$cluster10,
         cex.legend = 0.6, 
         main = c("Medium Futher Ed.", "Long Futher Ed.", # naming the clusters
                  "Training - Employment", "Employment", 
                  "School - Higher Ed.", "Further Ed. - Higher Ed.",
                  "Short Training - Employment","Joblessness", 
                  "Short Further Ed. - Employment", "School - Employment",
                  "Noise Seq."))
Crisp noise clustering in ten groups, delta = 62.57

Conclusion

In this vignette we presented the R code to use two robust clustering methods: consensus and noise clustering.

On the one hand, consensus clustering can be used to fulfil two aims: first, to create typologies that are little influenced by data peculiarities, and second, to benefit simultaneously from the advantages of several CAs. It is implemented in WeightedCluster through the consClust command. It should be used when the clustering structure is expected to be weak or when several CAs are suited to creating a typology [FIXME workingpaper].

On the other hand, noise clustering allows detecting unclassifiable sequences and increasing the homogeneity of the types (Liao et al. 2022; Piccarreta and Struffolino 2023). This approach might be beneficial when one is interested in rare or atypical trajectories or when the crisp clusters lack homogeneity. It is implemented in WeightedCluster through the seqclararange command.

Additionally, these two methods are available in both crisp and fuzzy versions. While crisp typologies are easily used in subsequent analyses, fuzzy ones allow a better characterization of the uncertainty of cluster assignment and the detection of observations that lie between types (Studer 2018; Helske, Helske, and Chihaya 2023).

References

Balcan, Maria-Florina, Yingyu Liang, and Pramod Gupta. 2014. “Robust Hierarchical Clustering.” Journal of Machine Learning Research 15 (118): 4011–51.
Bengtsson, Henrik. 2026. “Future: Unified Parallel and Distributed Processing in R for Everyone.”
D’Urso, Pierpaolo. 2015. “Fuzzy Clustering.” In Handbook of Cluster Analysis. Chapman and Hall/CRC.
Dave, Rajesh N. 1991. “Characterization and Detection of Noise in Clustering.” Pattern Recognition Letters 12 (11): 657–64. https://doi.org/10.1016/0167-8655(91)90002-4.
Gabadinho, Alexis, Gilbert Ritschard, Nicolas S. Müller, and Matthias Studer. 2011. “Analyzing and Visualizing State Sequences in R with TraMineR.” Journal of Statistical Software 40 (1): 1–37. https://doi.org/10.18637/jss.v040.i04.
Helske, Satu, Jouni Helske, and Guilherme K. Chihaya. 2023. “From Sequences to Variables: Rethinking the Relationship Between Sequences and Outcomes.” Sociological Methodology, June, 00811750231177026. https://doi.org/10.1177/00811750231177026.
Hennig, Christian, Marina Meila, Fionn Murtagh, and Roberto Rocci, eds. 2015. Handbook of Cluster Analysis. Boca Raton: Chapman and Hall/CRC.
Hornik, Kurt, and Walter Böhm. 2023. “Clue: Cluster Ensembles.”
Hubert, Lawrence, and Phipps Arabie. 1985. “Comparing Partitions.” Journal of Classification 2 (1): 193–218. https://doi.org/10.1007/BF01908075.
Liao, Tim F., Danilo Bolano, Christian Brzinsky-Fay, Benjamin Cornwell, Anette Eva Fasang, Satu Helske, Raffaella Piccarreta, et al. 2022. “Sequence Analysis: Its Past, Present, and Future.” Social Science Research 107 (September): 102772. https://doi.org/10.1016/j.ssresearch.2022.102772.
Martin, Peter, Ingrid Schoon, and Andy Ross. 2008. “Beyond Transitions: Applying Optimal Matching Analysis to Life Course Research.” International Journal of Social Research Methodology 11 (3): 179–99. https://doi.org/10.1080/13645570701622025.
McVicar, Duncan, and Michael Anyadike-Danes. 2002. “Predicting Successful and Unsuccessful Transitions from School to Work by Using Sequence Methods.” Journal of the Royal Statistical Society Series A: Statistics in Society 165 (2): 317–34. https://doi.org/10.1111/1467-985X.00641.
Monti, Stefano, Pablo Tamayo, Jill Mesirov, and Todd Golub. 2003. “Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data.” Machine Learning 52 (1): 91–118. https://doi.org/10.1023/A:1023949509487.
Piccarreta, Raffaella, and Emanuela Struffolino. 2023. “Identifying and Qualifying Deviant Cases in Clusters of Sequences: The Why and The How.” European Journal of Population 40 (December). https://doi.org/10.1007/s10680-023-09682-3.
Roth, Leonard, Matthias Studer, Emilie Zuercher, and Isabelle Peytremann-Bridevaux. 2024. “Robustness Assessment of Regressions Using Cluster Analysis Typologies: A Bootstrap Procedure with Application in State Sequence Analysis.” BMC Medical Research Methodology 24 (1): 303. https://doi.org/10.1186/s12874-024-02435-8.
Ruspini, Enrique H., James C. Bezdek, and James M. Keller. 2019. “Fuzzy Clustering: A Historical Perspective.” IEEE Computational Intelligence Magazine. https://doi.org/10.1109/MCI.2018.2881643.
Sacchi, Stefan, and Thomas Meyer. 2016. “Übergangslösungen Beim Eintritt in Die Schweizer Berufsbildung: Brückenschlag Oder Sackgasse?” Swiss Journal of Sociology 42 (June). https://doi.org/10.1515/sjs-2016-0002.
Studer, Matthias. 2013. “WeightedCluster Library Manual: A Practical Guide to Creating Typologies of Trajectories in the Social Sciences with R.” LIVES Working Papers 2013 (24): 1–32. https://doi.org/10.12682/lives.2296-1658.2013.24.
———. 2018. “Divisive Property-Based and Fuzzy Clustering for Sequence Analysis.” In Sequence Analysis and Related Approaches: Innovative Methods and Applications, edited by Gilbert Ritschard and Matthias Studer, 223–39. Life Course Research and Social Policies. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-95420-2_13.
———. 2024. “Seqclararange: Sequence Analysis for Large Databases.”
Studer, Matthias, and Gilbert Ritschard. 2016. “What Matters in Differences Between Life Trajectories: A Comparative Review of Sequence Dissimilarity Measures.” Journal of the Royal Statistical Society. Series A (Statistics in Society) 179 (2): 481–511. https://www.jstor.org/stable/43965553.
Studer, Matthias, Rojin Sadeghi, and Louis Tochon. 2024. “Sequence Analysis for Large Databases.” https://doi.org/10.12682/LIVES.2296-1658.2024.104.
Unterlerchner, Leonhard, Matthias Studer, and Andres Gomensoro. 2023. “Back to the Features. Investigating the Relationship Between Educational Pathways and Income Using Sequence Analysis and Feature Extraction and Selection Approach.” Swiss Journal of Sociology 49 (August): 417–46. https://doi.org/10.2478/sjs-2023-0021.
Warrens, Matthijs J., and Hanneke van der Hoef. 2022. “Understanding the Adjusted Rand Index and Other Partition Comparison Indices Based on Counting Object Pairs.” Journal of Classification 39 (3): 487–509. https://doi.org/10.1007/s00357-022-09413-z.