Sequence analysis (SA) is a holistic method for studying trajectories. Using a range of techniques, from visualization to explanation, this approach allows researchers to describe, compare, and identify patterns or irregularities in trajectories.
One key step is to create a typology of the trajectories with cluster analysis. This typology describes the various kinds of patterns observed and can be used as a categorical variable in subsequent analysis (Liao et al. 2022). This makes clustering central to SA as it strongly shapes the subsequent analyses.
However, it features among the main criticisms of SA for several reasons. First, typologies created using cluster analysis might be unstable or sample dependent, or more generally, perform poorly depending on the data characteristics. This raises concerns about the reliability of the results (Roth et al. 2024). Second, these methods might perform poorly in the presence of outliers, when some observations lie between clusters, or when the data are weakly structured—i.e. cluster separation is unclear and clusters are not homogeneous (see Figure @ref(fig:figClustStr)) (Balcan, Liang, and Gupta 2014; Martin, Schoon, and Ross 2008, FIXMEworkingpaper). Third, these methods might fail to identify uncommon subgroups. Infrequent types and outliers might be of key interest to identify atypical or emerging behaviours (Sacchi and Meyer 2016; Unterlerchner, Studer, and Gomensoro 2023).
Two clustering approaches, noise and consensus clustering, address these limitations. This document describes these clustering algorithms and provides the R code to create typologies of trajectories using the consClust and seqclararange functions provided by the WeightedCluster R library (Studer 2013). It also presents methods to evaluate the quality of the resulting clusterings.
The document is structured as follows. We start by presenting the data and its preparation in Section @ref(secData). After briefly presenting cluster analysis in Section @ref(secClustering), we present the creation and evaluation of typologies using consensus and noise clustering in Sections @ref(secConsClust) and @ref(secNoiseClust). We conclude with the advantages of each approach in Section @ref(secConclusion).
N.B. The running time is stated below computationally intensive chunks (time \(\ge\) 1 sec.).
We rely on the mvad dataset to illustrate the use of the consClust and seqclararange functions. This public dataset is distributed with the TraMineR R package. It contains the data used by McVicar and Anyadike-Danes (2002) to study school-to-work transitions in Northern Ireland.
First, we create a state sequence object using the
seqdef command (Gabadinho et al.
2011). Trajectories can be plotted using the
seqIplot command, see Figure @ref(fig:figSeqMvad).
# Loading the package
library(WeightedCluster)
# Loading illustrative data
data(mvad)
# Creating the state sequence object
mvad.seq <- seqdef(mvad[, 17:86], # The columns containing the trajectories
                   labels = c("Employment", "Further Education", # The states
                              "Higher Education", "Joblessness",
                              "School", "Training"),
                   xtstep = 6)
# Plotting the sequences
seqIplot(mvad.seq,
         legend.prop = 0.2,
         sortv = "from.start") # sequences are ordered by their first states
Second, to perform cluster analysis, we compute a dissimilarity
matrix comparing the trajectories using the seqdist
command. We use the LCS dissimilarity measure capturing
both differences in timing and sequencing within the trajectories. Its
versatility makes it the standard choice in SA. For more details and
other methods, see Studer and Ritschard
(2016).
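The code of the chunk timed below is not displayed in the source; a minimal sketch of this computation (assuming mvad.seq is the state sequence object created above; seqdist is provided by TraMineR, which is attached with WeightedCluster) might look as follows:

```r
# Computing the pairwise LCS dissimilarity matrix between all trajectories;
# the resulting object diss is used by all subsequent clustering calls
diss <- seqdist(mvad.seq, method = "LCS")
```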
Time for this code chunk to run: 0.8 seconds
We now turn to the creation of typologies using cluster analysis. It is a data mining technique grouping similar observations into types. A multitude of clustering algorithms (CAs) have been proposed to fulfil different aims. A key distinction resides in the kind of typology returned by the algorithm, which can be crisp or fuzzy (Hennig et al. 2015).
Both noise and consensus clustering can produce either type of partition. To allow an informed choice on this matter, we briefly discuss both approaches below.
Crisp clustering partitions a dataset so that each observation belongs to exactly one cluster and no clusters overlap. This makes crisp clustering compatible with any method handling categorical data. However, this approach compresses potentially rich dissimilarity information into a single categorical assignment.
Doing so, members of the same cluster may be falsely regarded as identical or highly similar even when important differences exist. Hybrid cases —i.e. observations lying in between several clusters— must still be forced into only one cluster. In consequence, the identification of such observations is made difficult by crisp clustering refWorkingPaper.
Fuzzy clustering allows each data point to have graded membership in several clusters instead of being forced into exactly one group. This method is more suitable than crisp clustering when the clustering structure is weak, leading to unclear and overlapping boundaries between categories. Figure @ref(fig:figClustStr) provides an example of two clusterings diverging in their structure strength.
By assigning membership degrees between 0 and 1, fuzzy methods can reveal hybrid cases, that is, observations that genuinely share characteristics of multiple clusters rather than fitting neatly into a single class. This soft assignment also improves robustness to noise and outliers because uncertain points can be given distributed memberships rather than being wrongly forced into one cluster (Studer 2018; Ruspini, Bezdek, and Keller 2019; Helske, Helske, and Chihaya 2023).
The first robust clustering method presented in this vignette is consensus clustering. It is a technique aiming to increase the robustness of the clustering results, by diminishing their sample dependence or by taking advantage of several clustering rationales. In a simulation study, we found consensus clustering to be particularly versatile and robust [workingpaper FIXMEREF]. It proceeds in two steps.
First, several clusterings are computed to form an ensemble of partitions of the same data. This step allows the ensemble to reflect the diversity of typologies that can be obtained on the same data. Monti et al. (2003) propose to generate the ensemble by clustering the same data with varying weights. To do so, we rely on Bayesian resampling. This simulates a bootstrap procedure, but all observations are always present, albeit weighted differently (Hornik and Böhm 2023).
The reweighted samples are then clustered using the computed weights and one of the specified CAs. If several CAs were specified, each CA is applied to an equal share of the reweighted samples. In the first case, the aim is to reduce the sample dependence of the typology. In the second, the aim is to achieve greater flexibility by benefiting simultaneously from several CAs (Hennig et al. 2015).
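The Bayesian resampling step can be sketched in a few lines of base R (an illustrative sketch, not the WeightedCluster internals): each replicate draws Dirichlet(1, ..., 1) weights by normalizing independent exponential draws, so every observation remains in the sample with a strictly positive weight.

```r
# One Bayesian-bootstrap replicate: Dirichlet(1, ..., 1) weights for n
# observations, scaled so that the weights average to 1
bayes_weights <- function(n) {
  w <- rexp(n)       # independent Exp(1) draws
  w / sum(w) * n     # normalize: all weights > 0, mean weight exactly 1
}
```

Each replicate would then be clustered with its own weight vector by a weight-aware clustering routine.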
Second, a consensus is searched for among these partitions by a consensus function to obtain a typology which synthesizes the information from the ensemble of partitions. Doing so, the resulting typology is more robust. According to the used method, the consensus can be either a crisp or a fuzzy clustering.
In this section we create a typology using the consensus clustering
framework as proposed by Monti et al.
(2003). To do so, we use the consClust command of
the WeightedCluster package. The function takes a
dissimilarity matrix diss as input data. The argument
base.clust specifies the clustering algorithms to be used
for creating the ensemble of R partitions. When several
good candidates exist for the same task, specifying several CAs allows
achieving greater flexibility [see FIXME workingpaper]. The argument
kvals specifies the numbers of groups the algorithm looks
for, while cons.method sets the consensus function to rely on. In
the following example, we rely on the SE method, which
minimizes the sum of dissimilarities using Euclidean dissimilarities.
Please refer to Hornik and Böhm (2023) for
details on available methods. The argument membership
defines whether the returned clustering takes the form of
fuzzy membership matrices or crisp label
vectors. The argument k.fixed prevents the consensus function from producing a typology with more groups than in the ensemble of partitions.
In the following example, the typology is computed on an ensemble of 100 partitions obtained with the PAM and Ward clustering algorithms (for details on these algorithms, see FIXMEREF unterlerchnerStuder 2026).
Setting parallel=TRUE sets up a default parallel back-end using the future framework (Bengtsson 2026). When parallel=FALSE, any parallel back-end previously defined with the plan function is used instead. The parallel protocol can then be adapted to specific environments; for instance, some High Performance Computing (HPC) servers rely on specific protocols (MPI,...). We use the latter strategy here, and any subsequent call will use this parallel back-end. Setting progressbar=TRUE shows information (including the estimated computation time) on the progress of the computations.
# Setting up parallel computing
library(future)
plan(multisession)
# Creating the typology
set.seed(1234)
pamWardConsClust <- consClust(diss,
                              base.clust = c("pam", "ward.D"),
                              R = 100,
                              kvals = 2:15,
                              cons.method = "SE",
                              membership = "crisp",
                              k.fixed = TRUE,
                              agg.method = "cRand",
                              keep.ensemble = TRUE,
                              parallel = FALSE,
                              progressbar = FALSE)
## [>] Performing consensus clustering on 100 partitions, using: pam, ward.D
## [>] Elapsed time: 15.99 secs
Time for this code chunk to run: 16.44 seconds
The function returns a consClust object, containing the
obtained consensus clusterings, the function call and Cluster Quality
Indices (CQIs). If keep.ensemble = TRUE, the ensemble of
partitions is stored in the returned object.
To guide the user in choosing an adequate number of groups for the final typology, the CQIs can be displayed by typing the name of the returned object, pamWardConsClust:
## PBC HG HGSD ASW ASWw CH R2 CHsq R2sq HC cons_cRand
## cluster2 0.66 0.80 0.80 0.45 0.45 237.88 0.25 500.72 0.41 0.10 0.83
## cluster3 0.57 0.69 0.68 0.34 0.34 189.30 0.35 401.68 0.53 0.15 0.54
## cluster4 0.50 0.65 0.64 0.31 0.32 159.17 0.40 323.92 0.58 0.18 0.59
## cluster5 0.58 0.79 0.79 0.37 0.38 171.09 0.49 425.71 0.71 0.10 0.59
## cluster6 0.57 0.80 0.79 0.37 0.37 166.36 0.54 426.83 0.75 0.10 0.65
## cluster7 0.56 0.84 0.84 0.38 0.39 161.74 0.58 447.82 0.79 0.08 0.67
## cluster8 0.56 0.86 0.86 0.39 0.39 147.21 0.59 389.65 0.79 0.08 0.67
## cluster9 0.56 0.91 0.90 0.41 0.42 151.58 0.63 468.80 0.84 0.06 0.71
## cluster10 0.55 0.90 0.90 0.38 0.39 137.04 0.64 420.98 0.84 0.06 0.69
## cluster11 0.54 0.92 0.91 0.42 0.43 132.31 0.65 419.41 0.86 0.06 0.68
## cluster12 0.50 0.90 0.90 0.39 0.40 127.47 0.67 402.59 0.86 0.07 0.67
## cluster13 0.49 0.91 0.90 0.38 0.39 122.55 0.68 392.25 0.87 0.07 0.67
## cluster14 0.48 0.91 0.90 0.37 0.38 116.27 0.68 371.49 0.87 0.06 0.65
## cluster15 0.46 0.90 0.90 0.34 0.35 114.85 0.70 361.42 0.88 0.07 0.65
Measures of agreement between the partitions used to obtain the consensus clustering are also provided. They allow the evaluation of the ensemble’s clustering stability. We propose relying on the Adjusted Rand Index (cRand), which measures the similarity between partitions. A value of 1 indicates two identical clusterings, 0 indicates the agreement expected by chance, and negative values indicate highly dissimilar clusterings (Hubert and Arabie 1985). Studer, Sadeghi, and Tochon (2024) propose the following similarity interpretation thresholds: strong (ARI \(\ge\) 0.9), good (ARI \(\ge\) 0.8) and weak (ARI \(\ge\) 0.7).
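To make the interpretation concrete, the Adjusted Rand Index can be computed from the contingency table of two partitions. A base-R sketch following Hubert and Arabie's (1985) formula (illustrative only; the index is computed internally by consClust):

```r
# Adjusted Rand Index between two label vectors (Hubert and Arabie 1985)
ari <- function(a, b) {
  tab <- table(a, b)
  n <- sum(tab)
  sum.pairs <- function(x) sum(choose(x, 2))          # number of pairs
  index    <- sum.pairs(tab)                          # co-clustered pairs
  expected <- sum.pairs(rowSums(tab)) * sum.pairs(colSums(tab)) / choose(n, 2)
  maximum  <- (sum.pairs(rowSums(tab)) + sum.pairs(colSums(tab))) / 2
  (index - expected) / (maximum - expected)
}
```

Identical partitions yield 1, whatever the labels: ari(c(1, 1, 2, 2), c("a", "a", "b", "b")) returns 1.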
High cRand values indicate a high level of stability in
the partition ensemble and, by extension, a more robust consensus
typology. Low cRand values can be interpreted in two ways
(Warrens and van der Hoef 2022). First, if the partitions are obtained with only one CA, it indicates that the partitions depend on the subsamples they are computed on, and that a single clustering on the whole sample would not be robust. Second, if the partitions are obtained from several CAs, a low cRand means that the CAs lead to different results. This can be expected if one uses consensus clustering to benefit from CAs following different rationales. However, since the first interpretation still applies in this case, the exact contribution of each dynamic to the index is unknown.
CQIs can be plotted with the plot command, see Figure @ref(fig:plotConsClustCqi). Since CH and CHsq show high values, we normalized the CQIs using the argument norm = "zscore" to allow plotting all CQIs on the same figure with the argument stat = "all".
# Plotting CQIs
par(cex = 0.75)
plot(pamWardConsClust,
     legendpos = "topleft",
     stat = "all",
     norm = "zscore") # CQIs are standardized
Internal CQIs (HG, CHsq and HC) indicate a nine- or eleven-cluster solution, as they are maximized (minimized for HC) for these numbers of groups; see FIXMEworkingpaper for details on the use of CQIs to select the number of groups. The cRand is maximized for nine clusters and indicates a good level of agreement in the partition ensemble (Studer, Sadeghi, and Tochon 2024). We can now plot the trajectories according to the nine-group typology (Figure @ref(fig:consClustSeqplot)), which is more parsimonious.
# Plotting the consensus typology in nine groups
par(mar = c(2, 2, 2, 2))
seqIplot(mvad.seq,
         group = pamWardConsClust$clustering$cluster9, # Specifying the cluster variable used for plotting
         main = c("Further Ed. - Higher Ed.", "Joblessness", # naming the clusters in the plot
                  "Training - Employment", "Training",
                  "School - Higher Ed.", "Further Ed. - Employment",
                  "Employment", "School - Employment",
                  "Further Ed."),
         cex.legend = 0.8)
We now compute the same consensus clustering in its fuzzy version, by using the argument membership = "fuzzy".
# Creating the typology
set.seed(1234)
pamWardConsClustF <- consClust(diss,
                               base.clust = c("pam", "ward.D"),
                               R = 100,
                               kvals = 2:15,
                               cons.method = "SE",
                               membership = "fuzzy",
                               k.fixed = TRUE,
                               agg.method = "cRand",
                               keep.ensemble = TRUE,
                               progressbar = FALSE)
## [>] Performing consensus clustering on 100 partitions, using: pam, ward.D
## [>] Elapsed time: 7.11 secs
Time for this code chunk to run: 7.11 seconds
The obtained typology can be plotted using the fuzzyseqplot function, see Figure @ref(fig:plotConsFuzzy). In each panel, sequences are sorted according to their membership probability, and only sequences with a membership probability \(\ge 0.4\) are displayed.
par(mar = c(2, 2, 2, 2))
fuzzyseqplot(mvad.seq, # sequences to plot
             group = pamWardConsClustF$clustering$cluster9, # grouping variable
             main = c("Further Ed. - Higher Ed.", "Joblessness", # naming the clusters
                      "Training - Employment", "Training",
                      "School - Higher Ed.", "Further Ed. - Employment",
                      "Employment", "School - Employment",
                      "Further Ed."),
             membership.threshold = 0.4,
             sortv = "membership",
             type = "I", # We plot an index plot
             cex.legend = 0.8)
The obtained fuzzy typology yields clusters similar to the crisp ones. The added value is that each cluster's diversity can be better described by looking at the panels, where typical sequences are shown at the top.
Noise clustering is another advanced clustering technique. Contrary to most clustering algorithms, it does not produce exhaustive typologies: observations are not coerced to belong to a cluster, but can also remain unclassified. In such cases, they are labelled as noise.
This approach has two advantages. First, unclassifiable observations are not assigned to clusters in which they would poorly fit. Doing so, clusters are better defined and more homogeneous. Second, by flagging them as noise, unclassifiable trajectories can be studied per se (Liao et al. 2022; Piccarreta and Struffolino 2023). Such trajectories might be of great interest in some research designs, as they often denote particularly good (or ill) situations, or might be associated with particular outcomes in later life (Sacchi and Meyer 2016; Unterlerchner, Studer, and Gomensoro 2023).
In its fuzzy variant, if the noise group is set aside, the membership degrees are not coerced to sum to one. Fuzzy noise clustering can thus be seen as a variant of possibilistic clustering, which provides more coherent membership degrees in the presence of noise in the data (D’Urso 2015).
To create the typology, we use a fuzzy extension of the CLARA algorithm that allows labelling sequences as noise instead of assigning them to a cluster. CLARA is a medoid-based clustering method; however, rather than clustering the whole dataset, medoids are searched for on a subsample. The clustering is then extended to the whole dataset, and the operation is repeated to ensure the robustness of the results. This makes CLARA applicable to large datasets. The fuzzy approach is well suited to the identification of noise, as looking for exact analytical solutions is extremely computationally intensive.
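The CLARA logic described above can be sketched as follows. This is a simplified crisp illustration, not the seqclararange implementation, which additionally handles weights, fuzzy memberships and noise; it relies on pam() from the cluster package for the medoid search.

```r
library(cluster)  # for pam()

# Simplified CLARA: cluster subsamples, extend each result to all
# observations, keep the run with the lowest total distance to its medoids
clara_sketch <- function(diss, k, R = 10, sample.size = 40) {
  n <- nrow(diss)
  best <- NULL
  for (r in seq_len(R)) {
    sub <- sample(n, sample.size)
    # Medoids are searched for on the subsample only
    med <- sub[pam(as.dist(diss[sub, sub]), k = k)$id.med]
    # Extension: assign every observation to its closest medoid
    assign <- apply(diss[, med, drop = FALSE], 1, which.min)
    cost <- sum(diss[cbind(seq_len(n), med[assign])])
    if (is.null(best) || cost < best$cost)
      best <- list(cost = cost, medoids = med, clustering = assign)
  }
  best
}
```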
We use the seqclararange command (with the argument method = "noise") of the WeightedCluster package to create the typology with noise. R specifies the number of times the operation is repeated. The subsample size is defined by the sample.size argument. For more details on the use of seqclararange, please refer to Studer (2024).
The argument dnoise is a tuning parameter controlling the algorithm’s sensitivity to noise. It is the distance \(\delta\) to any medoid beyond which an observation is considered as not belonging to any type. Defining this parameter plays a critical role in the typology creation, as it directly affects the number of observations labelled as noise: higher \(\delta\) values label fewer trajectories as noise. We discuss the definition of \(\delta\) in detail and give examples in Section @ref(secDnoise).
Dave (1991) defines this distance using the average distance in the sample with the following formula, where \(n\) is the number of sequences, \(\mathbf{x}_i\) the sequences, and \(\lambda\) a user-defined coefficient:
\(\delta = \lambda \cdot \frac{2} {n(n-1)} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} d(\mathbf{x}_i, \mathbf{x}_j)\).
Using the above formula and setting \(\lambda\) to 0.8 leads to a \(\delta\) of 68.4. Since we used Optimal Matching with constant costs, this value can be interpreted theoretically (Studer and Ritschard 2016): a sequence needs to differ from every medoid for 34 months in total to be considered as noise.
## [1] 68.39203
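The value reported above can be reproduced directly, since the average pairwise dissimilarity is the mean of the off-diagonal entries of the dissimilarity matrix. A sketch (diss being the dissimilarity matrix computed earlier; the helper name is ours):

```r
# Dave's (1991) delta: lambda times the average pairwise dissimilarity
dave_delta <- function(diss, lambda) {
  lambda * mean(diss[upper.tri(diss)])  # diagonal (self-distances) excluded
}
delta08 <- dave_delta(diss, lambda = 0.8)
```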
# Creating a typology with noise
set.seed(1234)
noiseClust08 <- seqclararange(mvad.seq,
                              kvals = 2:15,
                              R = 50, # Number of subsamples
                              sample.size = nrow(mvad.seq),
                              method = "noise",
                              dnoise = delta08, # noise sensitivity
                              seqdist.args = list(method = "LCS"))
Time for this code chunk to run: 48.09 seconds
Medoid-based fuzzy CQIs are computed alongside the clustering (Studer, Sadeghi, and Tochon 2024). Applied to noise clustering, their interpretation is difficult because not all clusters are created under the same rules: the noise cluster is not constructed to be homogeneous. In this context, we recommend using CQIs only to guide the selection of the number of groups, not to compare clusterings obtained with different algorithms.
CQIs can be displayed by typing the object name, and the plot command produces a figure of the CQIs, see Figure @ref(fig:plotNoiseClustCqi).
## Avg dist PBM DB XB AMS ARI>0.8 JC>0.8 Best iter
## cluster2 33.25 1.73 2.64 0.24 0.73 NA NA 32
## cluster3 26.01 0.96 3.50 0.54 0.71 NA NA 4
## cluster4 21.61 0.63 3.65 0.49 0.78 NA NA 7
## cluster5 18.95 0.44 3.87 0.43 0.81 NA NA 32
## cluster6 17.45 0.33 4.32 0.40 0.79 NA NA 34
## cluster7 15.54 0.28 4.56 0.35 0.81 NA NA 6
## cluster8 14.27 0.22 5.11 0.59 0.82 NA NA 15
## cluster9 13.32 0.18 5.57 0.56 0.82 NA NA 46
## cluster10 12.44 0.16 5.59 0.52 0.82 NA NA 16
## cluster11 11.80 0.13 5.91 0.59 0.84 NA NA 29
## cluster12 11.15 0.11 6.08 0.62 0.84 NA NA 35
## cluster13 10.75 0.10 6.10 0.54 0.83 NA NA 27
## cluster14 10.17 0.09 6.04 0.56 0.84 NA NA 20
## cluster15 9.85 0.08 6.54 0.55 0.84 NA NA 4
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "legendpos" is not a
## graphical parameter
DB and XB indicate a seven-group typology. Figure @ref(fig:noiseClustSeqplot) provides a graphical representation of the typology using the fuzzyseqplot command. Additionally, sequences are sorted according to their membership strength (Studer 2018). For the membership.threshold argument, we used a small value to be able to display sequences labelled as noise, which tend to be associated with dispersed membership probabilities.
## Displaying the resulting clustering with a membership threshold of 0.20
par(mar = c(2, 2, 2, 2))
fuzzyseqplot(mvad.seq,
             group = noiseClust08$clustering$cluster7,
             main = c("Further Ed.", "Employment", # naming the clusters
                      "School - Higher Ed.", "Further Ed. - Higher Ed.",
                      "Joblessness", "Training - Employment",
                      "Further Ed. - Employment", "Noise Seq."),
             membership.threshold = 0.20,
             type = "I", # We plot an index plot
             sortv = "membership",
             cex.legend = 0.8)
The visual inspection of the group of sequences labelled as noise indicates that it cannot be considered as a regular type: it is heterogeneous and features observations strongly diverging from the other types in their sequencing.
To be used with methods handling categorical data, the fuzzy
clustering can be transformed into a crisp one by assigning the
observation to the cluster showing the highest membership probability.
This can be done using the as.crisp command. See Figure
@ref(fig:crispNoiseClustSeqplot).
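Conceptually, the conversion amounts to a row-wise argmax over the membership matrix. A base-R sketch (memb being a hypothetical n × k membership matrix, not an object from the examples above):

```r
# Crisp assignment: each observation joins its highest-membership cluster
fuzzy_to_crisp <- function(memb) apply(memb, 1, which.max)

memb <- rbind(c(0.7, 0.2, 0.1),   # clearly cluster 1
              c(0.1, 0.3, 0.6),   # clearly cluster 3
              c(0.4, 0.4, 0.2))   # hybrid case, forced into cluster 1
fuzzy_to_crisp(memb)
```

Note how the hybrid third observation is forced into a single cluster: exactly the information loss that motivates keeping the fuzzy version for some analyses.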
# Converting the fuzzy partition to crisp
crispNoiseClust08 <- as.crisp(noiseClust08)
# Plotting the crisp typology
par(mar = c(2, 2, 2, 2))
seqIplot(mvad.seq,
         group = crispNoiseClust08$clustering$cluster7,
         main = c("Further Ed.", "Employment", # naming the clusters
                  "School - Higher Ed.", "Further Ed. - Higher Ed.",
                  "Joblessness", "Training - Employment",
                  "Further Ed. - Employment", "Noise Seq."),
         cex.legend = 0.8)
dnoise
Defining dnoise is a critical step in performing noise clustering, as it controls the number of observations labelled as noise.
To discuss the dnoise argument, we now provide three
examples of noise clustering with varying \(\lambda\) and number of groups. For
conciseness we only present the crisp version of the clusterings.
When using Dave’s (1991) formula to define dnoise, the coefficient \(\lambda\) acts as a tuning parameter, making the algorithm more or less sensitive to noise: a higher \(\lambda\) leads to more conservative noise labelling.
We propose setting \(\lambda\) by visually investigating the clusterings obtained according to several values chosen around one. Dave (1991) suggested using smaller values. However, in our case —applied to high-dimensional categorical data— such values proved to be too restrictive. We provide two examples of noise clustering with different \(\lambda\) to discuss its impact on the resulting typologies.
## [1] 51.29403
The \(\lambda\) parameter is now decreased to 0.6. The resulting \(\delta\) being smaller, the algorithm will label more trajectories as noise.
# Creating a typology with noise
set.seed(1234)
noiseClust06 <- seqclararange(mvad.seq,
                              kvals = 2:15,
                              R = 50,
                              sample.size = nrow(mvad.seq),
                              method = "noise",
                              dnoise = delta06,
                              seqdist.args = list(method = "LCS"))
# Converting the fuzzy partition to crisp
crispNoiseClust06 <- as.crisp(noiseClust06)
Time for this code chunk to run: 48.01 seconds
par(mar = c(2, 2, 2, 2))
# Plotting the crisp typology
seqIplot(mvad.seq,
         group = crispNoiseClust06$clustering$cluster7,
         main = c("Further Ed.", "Employment",
                  "School - Higher Ed.", "Further Ed. - Higher Ed.",
                  "Joblessness", "Training - Employment",
                  "Further Ed. - Employment", "Noise Seq."),
         cex.legend = 0.8)
As expected, lowering \(\lambda\) to 0.6 sharply increases the number of sequences identified as noise (see Figure @ref(fig:noiseClust06Seqplot)). This group being highly heterogeneous, it cannot be considered as a type. However, the seven other types are more homogeneous than before. If obtaining such homogeneous types suits the research aim, this \(\lambda\) value would be adequate.
In the following example, \(\lambda\) is increased to 1.
## [1] 85.49004
# Creating a typology with noise
set.seed(1234)
noiseClust <- seqclararange(mvad.seq,
                            kvals = 2:15,
                            R = 50,
                            sample.size = nrow(mvad.seq),
                            method = "noise",
                            dnoise = delta1,
                            seqdist.args = list(method = "LCS"))
# Converting the fuzzy partition to crisp
crispNoiseClust <- as.crisp(noiseClust)
Time for this code chunk to run: 48.22 seconds
# Plotting the crisp typology
par(mar = c(2, 2, 2, 2))
seqIplot(mvad.seq,
         group = crispNoiseClust$clustering$cluster7,
         main = c("Further Ed.", "Employment",
                  "School - Higher Ed.", "Further Ed. - Higher Ed.",
                  "Joblessness", "Training - Employment",
                  "Further Ed. - Employment", "Noise Seq."),
         cex.legend = 0.8)
This new \(\lambda\) decreases the algorithm’s sensitivity to noise to the point that only very few sequences are labelled as such (see Figure @ref(fig:noiseClust1Seqplot)). The extremely small size of this group impedes its use in subsequent analyses.
dnoise and number of groups
We now discuss the link between the number of groups in a typology and the sensitivity of dnoise. When increasing the number of groups, the distance between the observations and the medoids diminishes. In consequence, fewer sequences are labelled as noise for the same dnoise value. Figure @ref(fig:boxplotD2m) below presents the boxplots of the distances to the medoids of a crisp clustering without noise.
In a ten-group typology, only 13 sequences are labelled as noise with dnoise = 68.4 (see Figure @ref(fig:crispNoiseTypo10)). There were 53 in the seven-group typology using the same dnoise. To achieve the same level of noise sensitivity, a lower dnoise is needed for a more detailed typology.
To avoid this behaviour and to derive dnoise values suitable for a greater number of groups, we propose the following strategy. First, we compute a clustering without noise and calculate the distances to the medoids in each cluster for every number of groups. Using these distances and a \(\delta\) optimized for a given number of groups (here seven), we can calculate \(\delta\)'s adapted to any number of groups. Doing so, the same amount of noise will be detected for every number of groups.
Applying this strategy to our example in seven groups leads to \(\delta\) = 62.57 for a ten-group typology. With this new \(\delta\), the number of observations labelled as noise in ten groups is close to the number labelled as such in seven groups with \(\delta\) = 68.39 (see Figure @ref(fig:ngroupDnoise)).
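One possible reading of this strategy (a hypothetical helper; the exact rescaling used to obtain the 62.57 value is not detailed here) is to rescale the reference \(\delta\) proportionally to the average distance to the medoids at each number of groups:

```r
# Hypothetical rescaling of a reference delta (tuned for k.ref groups) by the
# ratio of average distances to the medoids; d2m is a named list of
# distance-to-medoid vectors, one entry per number of groups
adapt_delta <- function(delta.ref, k.ref, d2m) {
  avg <- sapply(d2m, mean)                 # average distance to medoid per k
  delta.ref * avg / avg[[as.character(k.ref)]]
}
```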
par(mar = c(2, 2, 2, 2))
seqIplot(mvad.seq,
         group = crispNoiseClust10k$clustering$cluster10,
         cex.legend = 0.6,
         main = c("Medium Further Ed.", "Long Further Ed.", # naming the clusters
                  "Training - Employment", "Employment",
                  "School - Higher Ed.", "Further Ed. - Higher Ed.",
                  "Short Training - Employment", "Joblessness",
                  "Short Further Ed. - Employment", "School - Employment",
                  "Noise Seq."))
In this vignette, we presented the R code to use two robust clustering methods: consensus and noise clustering.
robust clustering methods: consensus and noise clustering.
On the one hand, consensus clustering can be used to fulfil two aims: first, to create typologies that are little influenced by data peculiarities, and second, to benefit simultaneously from the advantages of several CAs. It is implemented in WeightedCluster in the consClust command. It should be used when the clustering structure is expected to be weak or when several CAs are suited to create a typology [FIXME workingpaper].
On the other hand, noise clustering allows detecting unclassifiable sequences and increasing the homogeneity of the types (Liao et al. 2022; Piccarreta and Struffolino 2023). This approach might be beneficial when one is interested in rare or atypical trajectories or when the crisp clusters lack homogeneity. It is implemented in WeightedCluster in the seqclararange command.
Additionally, these two methods are available in both crisp and fuzzy versions. While crisp typologies are easily used in subsequent analyses, fuzzy ones allow a better characterization of cluster-assignment uncertainty and the detection of observations that lie in between types (Studer 2018; Helske, Helske, and Chihaya 2023).