R Tutorials: Robustness Assessment of Regressions using Cluster Analysis Typologies

Leonard Roth

In a standard Sequence Analysis, similar trajectories are clustered together to create a typology of trajectories, which is then often used to evaluate the association between sequence patterns and covariates inside regression models. The sampling uncertainty, which affects both the derivation of the typology and associated regressions, is typically ignored in this analysis, an oversight that may lead to wrong statistical conclusions.

This document provides an introduction to the R code proposed to assess the robustness of regression results obtained from this standard analysis. Bootstrap samples are drawn from the data, and for each bootstrap, a new typology replicating the original one is constructed, followed by the estimation of the corresponding regression models. The bootstrap estimates are then combined using a multilevel modelling framework.

The document is structured in two parts. First, a short tutorial with a streamlined standard analysis on sequence data and a robustness assessment made with the rarcat function. Then, an extended tutorial with the same data illustration and a detailed explanation of the new functions, their arguments and their outputs. Furthermore, readers interested in the methods and the precise interpretation of the results are referred to:

You are kindly asked to cite the above reference if you use the methods presented in this document.

Let us start by setting the seed for reproducible results.

set.seed(1)

Time for this code chunk to run: 0.01 seconds

Short tutorial

Creating the state sequence object

For this example, we use the openly available mvad dataset on transitions from school to work. First, we create the state sequence object.

## Loading the TraMineR library
library(TraMineR)
## Loading the data
data(mvad)

## State properties
mvad.alphabet <- c("employment", "FE", "HE", "joblessness", "school", "training")
mvad.lab <- c("employment", "further education", "higher education", "joblessness", "school", "training")
mvad.shortlab <- c("EM","FE","HE","JL","SC","TR")

## Creating the state sequence object
mvad.seq <- seqdef(mvad, 17:86, alphabet = mvad.alphabet, states = mvad.shortlab, 
                   labels = mvad.lab, xtstep = 6)

Time for this code chunk to run: 0.06 seconds

Constructing the typology

We will now construct a typology using cluster analysis. For readers seeking more details, the WeightedCluster library manual provides an in-depth explanation of the process and the computation of cluster quality measures (Studer, 2013).

We start by computing dissimilarities with the seqdist function using the longest common subsequence distance. We then use Ward clustering to create a typology of the trajectories.

## Using fastcluster for hierarchical clustering
library(fastcluster)
## Distance computation
diss <- seqdist(mvad.seq, method="LCS")
## Hierarchical clustering
hc <- hclust(as.dist(diss), method="ward.D")

Time for this code chunk to run: 1.79 seconds

We can now compute several cluster quality indices using as.clustrange function with two to ten groups.

# Loading the WeightedCluster library
library(WeightedCluster)
# Computing cluster quality measures.
clustqual <- as.clustrange(hc, diss=diss, ncluster=10)
clustqual
##            PBC   HG HGSD  ASW ASWw     CH   R2   CHsq R2sq   HC
## cluster2  0.59 0.75 0.74 0.43 0.43 216.72 0.23 431.41 0.38 0.13
## cluster3  0.43 0.51 0.50 0.25 0.25 175.61 0.33 335.58 0.49 0.25
## cluster4  0.52 0.67 0.67 0.29 0.30 164.63 0.41 352.31 0.60 0.17
## cluster5  0.52 0.69 0.68 0.31 0.31 153.46 0.46 326.41 0.65 0.16
## cluster6  0.57 0.79 0.78 0.34 0.35 151.17 0.52 372.95 0.73 0.11
## cluster7  0.58 0.83 0.83 0.37 0.37 144.38 0.55 383.10 0.77 0.09
## cluster8  0.51 0.78 0.78 0.32 0.33 140.00 0.58 358.85 0.78 0.12
## cluster9  0.52 0.85 0.85 0.34 0.35 137.85 0.61 386.02 0.81 0.09
## cluster10 0.51 0.87 0.87 0.35 0.36 131.67 0.63 379.81 0.83 0.08

Time for this code chunk to run: 0.1 seconds

Different clustering solutions may be argued based on the information above. We are interested in a relatively detailed partition of the data, so focus hereafter on a six clusters solution (more details on selecting a clustering solution available in the extended tutorial that below). The corresponding typology is shown next:

seqdplot(mvad.seq, group=clustqual$clustering$cluster6, border=NA)

Time for this code chunk to run: 0.16 seconds

Association study

We are now interested in the relationship between this typology and the funemp and gcse5eq covariates, which represent the father’s unemployment status and the qualifications gained by the end of compulsory education, respectively (both are binary variables). Such associations can be studied with different approaches. Here, we focus on implementing separate logistic regressions for each cluster and estimating average marginal effects (AMEs) with these models. The readers are referred to the article by Roth et al. (2024) for theoretical and practical explanations for these methodological choices.

Multiple commands would normally be required to explore all possible combinations between the clusters and their related covariates. This can also be done with the function rarcat below.

# Add the clustering solution to the dataset
mvad$clustering <- clustqual$clustering$cluster6
# Formula for the association between clustering and covariates of interest
formula <- clustering ~ funemp + gcse5eq
# Run separate logistic regressions for each cluster and compute the corresponding AMEs
# with their confidence interval
rarcatout <- rarcat(formula, data=mvad, diss=diss, robust=FALSE)
rarcatout$original.analysis
##       factor cluster 1 c1 lower c1 upper cluster 2 c2 lower c2 upper cluster 3
## 1  funempyes    -0.070   -0.150    0.010    -0.019   -0.075    0.036     0.011
## 2 gcse5eqyes    -0.181   -0.244   -0.118     0.159    0.109    0.209     0.014
##   c3 lower c3 upper cluster 4 c4 lower c4 upper cluster 5 c5 lower c5 upper
## 1   -0.072    0.094     0.056   -0.027    0.140    -0.049   -0.110    0.012
## 2   -0.050    0.078    -0.202   -0.260   -0.144     0.254    0.197    0.312
##   cluster 6 c6 lower c6 upper
## 1     0.051   -0.003    0.105
## 2    -0.051   -0.082   -0.020

Time for this code chunk to run: 0.37 seconds

The results in the table above are AMEs, which measure the expected change in probability of belonging to a given cluster if the covariate takes a given value, together with their 95% confidence intervals. Thus, after adjustment for the GCSEs, father unemployment status is only marginally associated with membership to cluster 6, which contains the children unemployment trajectories (the p-value is actually 0.06).

On the other hand, there are strong associations apparent between the GCSEs obtained and the trajectory groups, with a distinctively higher probability to enter higher education (clusters 2 and 5) compared to being employed, in training or, to a lesser degree, unemployed (clusters 1, 4 and 6) if five or more GCSEs were gained. However, the standard analysis presented to this point does not properly account for the sampling uncertainty, such that the findings’ reliability remains in the balance.

Robustness assessment

The Robustness Assessment of Regressions using Cluster Analysis Typologies (RARCAT) procedure allows for evaluating the reproducibility of the analysis on resamples from the data. We focus here on a quick implementation of the procedure and refer to the extended tutorial below or the article by Roth et al. (2024) for further details. The function rarcat is run again with robust = TRUE and further arguments for the procedure that replicates the typology and associated regressions on bootstrap samples of the data, thus enabling assessing the robustness of the original estimates by comparing them to summary estimates in the form of pooled AMEs and 95% predictions intervals.

# Evaluate the validity of the original analysis and the reliability of its
# findings by applying RARCAT with 50 bootstrap replications.
# As in the original analysis, hierarchical clustering with Ward method is implemented.
# The number of clusters is fixed to 6 here.
rarcatout <- rarcat(formula, data = mvad, diss = diss, robust = TRUE, B = 50, 
                    algo = "hierarchical", method = "ward.D", 
                    fixed = TRUE, kcluster = 6)
rarcatout$robust.analysis
##       factor cluster 1 c1 lower c1 upper cluster 2 c2 lower c2 upper cluster 3
## 1  funempyes    -0.039   -0.100    0.023    -0.016   -0.078    0.046    -0.008
## 2 gcse5eqyes    -0.168   -0.266   -0.069     0.171    0.095    0.246     0.034
##   c3 lower c3 upper cluster 4 c4 lower c4 upper cluster 5 c5 lower c5 upper
## 1   -0.080    0.065     0.018   -0.050    0.087    -0.052   -0.126    0.021
## 2   -0.059    0.128    -0.164   -0.244   -0.083     0.254    0.184    0.324
##   cluster 6 c6 lower c6 upper
## 1     0.083    0.001    0.166
## 2    -0.083   -0.163   -0.004

Time for this code chunk to run: 18.75 seconds

The results stay fairly close to the ones from the original analysis (seen earlier), in particular regarding the “significant” associations. The relationship between father unemployment status and unemployment trajectories (cluster 6) seems somewhat stronger than it did on the original sample, and the relationship between GCSEs and long training spells (cluster 4) is a bit weaker. The other relevant associations are virtually unchanged. Thus, we conclude that the original analysis appear to be valid and robust to sampling variation. However, fixing the number of clusters (six here) throughout the bootstrap procedure may not always be warranted.

Parallel computing

In the example above, we have run the function rarcat with the number of bootstrap \(B=50\), which is less than the recommended number (500 in the function documentation). This was done to speed up the computing. Alternatively, parallel computing may be used. The code below shows how to do this for Windows users. The function rarcat also accepts arguments for parallel computing on different operating systems (see the function boot documentation for more details).

# # Loading the parallel library
# library(parallel)
# # Use available cores minus one
# ncpus <- detectCores() - 1
# # Create the parallel cluster
# cl <- makeCluster(ncpus)
# 
# # Parallel run
# rarcatout <- rarcat(formula, data = mvad, diss = diss, robust = TRUE, B = 500,
#                     algo = "hierarchical", method = "ward.D",
#                     fixed = TRUE, kcluster = 6,
#                     parallel = "snow", ncpus = ncpus, cl = cl)
# rarcatout$robust.analysis

Time for this code chunk to run: 0.01 seconds

The bootstrap procedure is now at least four times faster and the estimates become sufficiently accurate, in the sense that a new run gives almost exactly the same values.

Extended tutorial

First use case: varying number of clusters

Association with a covariate

We will use here the same dataset (mvad) and dissimilarity matrix (diss) as before. For sake of illustration, we start by considering the two clusters solution. These clusters are visualised below with state distribution plots.

seqdplot(mvad.seq, group=clustqual$clustering$cluster2, border=NA)

Time for this code chunk to run: 0.07 seconds

The distinction is essentially between individuals with prolonged spells of higher education (on the right, n=151) and the rest (on the left, n=561). In this example, we investigate the association between father unemployment status (funemp variable) and this typology of trajectories.

# Loading the margins library for marginal effects estimation
library(margins)
# Create cluster membership variable
mvad$clustering <- clustqual$clustering$cluster2
mvad$membership <- mvad$clustering == 2
# Run logistic regression model
mod <- glm(membership ~ funemp, mvad, family = "binomial")
# Model results (AME)
summary(margins(mod))
##     factor     AME     SE       z      p   lower   upper
##  funempyes -0.1208 0.0338 -3.5729 0.0004 -0.1871 -0.0545

Time for this code chunk to run: 0.05 seconds

The average marginal effect (AME) measures the expected change in probability of belonging to the higher education trajectory group if the father was unemployed at time of survey. The readers are referred to the article by Roth et al. (2024) for theoretical and practical explanations of the use of a logistic regression and marginal effects in this situation. Crucially, the standard analysis presented to this point does not account for the sampling uncertainty, which might impact the findings’ reliability.

Bootstrap replicates of the typology

The Robustness Assessment of Regressions using Cluster Analysis Typologies (RARCAT) procedure works as follows:

  1. A random sample with replacement (i.e, bootstrap) is drawn from the data.
  2. The bootstrap sample is clustered applying the exact same clustering procedure as the one used in the original analysis, which implies using the same dissimilarity measure, cluster algorithm, and method to determine the number of clusters.
  3. A separate logistic regression predicting membership probability in each group is estimated.
  4. The AME of each covariate on the probability to be assigned to a given type is retrieved for all sequences belonging to this type.
  5. These steps are repeated \(N\) times, with \(N\) typically large.
  6. The AMEs from step 4 are pooled using a multilevel modelling framework.

Hennig (2007) proposed to use the new partitions derived in step 2 to evaluate cluster-wise stability by measuring the quality preservation of clustering solutions across perturbed dataset through average Jaccard similarities. The corresponding code is available in the library fpc and its application is illustrated below with 500 bootstrap.

# Loading the fpc library
library(fpc)
# Cluster-wise stability assessment by bootstrap
stab <- clusterboot(diss, B = 500, distances = TRUE, clustermethod = disthclustCBI, 
                    method = "ward.D", k = 2, count = FALSE)
round(stab$bootmean, 2)
## [1] 0.96 0.89

Time for this code chunk to run: 4.45 seconds

The estimated Jaccard coefficients are above 0.85, which indicate high cluster-wise stability, meaning that most of the individuals belonging to any of the two clusters in the original partition tend to be clustered again in the bootstrap partitions, particularly for the first cluster.

We propose to go one step further and use the bootstrap partitions to estimate each time new regression models replicating the original ones. This can be achieved by running the regressboot function with the following main parameters:

  • formula: A formula object with the clustering solution on the left side and the covariates of interest.
  • data: The dataset (data frame) with column names corresponding to the information in formula. The number of individuals (row number) should match the dimension of diss.
  • diss: The numerical dissimilarity matrix used for clustering. Only a pre-computed matrix (i.e., where pairwise dissimilarities do not depend on the resample) is currently supported.
  • B: The integer number of bootstrap. Set to 500 by default to attain a satisfactory precision around the estimates as the procedure involves multiple steps..
  • algo: The clustering algorithm as a character string. Currently only “pam” (calling the function wcKMedRange) and “hierarchical” (calling the function fastcluster::hclust) are supported. By default “pam”.
  • method: A character string with the method argument of hclust, “ward.D” by default.
  • fixed: Logical. TRUE implies that the number of clusters is the same in every bootstrap. FALSE (default) implies that an optimal number of clusters is evaluated each time.
  • kcluster: Integer. Either the number of clusters in every bootstrap if fixed is TRUE or the maximum number of clusters (starting from 2) to be evaluated in each bootstrap if fixed is FALSE.
  • cqi: A character string with the cluster quality index to be evaluated for each new partition. Any column of as.clustrange is supported, “CH” (the Calinski-Harabasz index) by default. Also works with algo= “pam”.
  • parallel: A character string with the type of parallel operation to be used (if any) by the function boot:boot. Options are “no” (default), “multicore” and “snow” (for Windows).
  • ncpus: Integer. Number of processes to be used in case of parallel operation. Typically, one would chose this to be the number of available CPUs.
  • cl: A parallel cluster for use if parallel = “snow”. If not supplied, a cluster on the local machine is created for the duration of the boot call.

A typical utilisation of regressboot is shown below (caution, this uses the Windows parallel computing settings from the short tutorial):

# Bootstrap replicates of the typology and its association with the variable funemp.
# As in the original analysis, hierarchical clustering with Ward method is implemented.
# Also, an optimal clustering solution with n between 2 and 10 is evaluated each time by
# maximizing the CH index.
# bootout <- regressboot(clustering ~ funemp, data = mvad, diss = diss, B = 500,
#                        algo = "hierarchical", method = "ward.D",
#                        fixed = FALSE, kcluster = 10, cqi = "CH",
#                        parallel = "snow", ncpus = ncpus, cl = cl)
# Without parallel computing
bootout <- regressboot(clustering ~ funemp, data = mvad, diss = diss, B = 50,
                       algo = "hierarchical", method = "ward.D",
                       fixed = FALSE, kcluster = 10, cqi = "CH")
table(bootout$optimal.kcluster)
## 
##  2  3  6 
## 47  2  1

Time for this code chunk to run: 6.96 seconds

The vast majority of the bootstrap partitions have two clusters, but sometimes this optimal number varies. In these cases, corresponding varying numbers of logistic regressions are also conducted. The output of the function regressboot is a list with the following interesting components:

  • optimal.kcluster: An integer vector with the numbers of clusters for each bootstrap partition. If input parameter fixed is FALSE, this corresponds to the selected clustering solution based on the evaluation criterion. If input parameter fixed is TRUE, this can in rare cases differ from ncluster if two reference clusters have exactly the same estimated association with a covariate.
  • covar.name: A character vector with the different associations evaluated in the logistic regression model (based on input parameter formula). This corresponds to the name of the covariate for numerical variables and the name with a specific level for factors.
  • original.ame: A list with the estimated AMEs corresponding to each association between covariates of interest (as in covar.name) and the original typology, i.e., the one constructed on the original sample.
  • bootstrap.ame: A list with the estimated AMEs for all individuals and all bootstraps, corresponding to the associations between covariates of interest (as in covar.name) and the typology constructed on each bootstrap. For each covariate, the list contains a numerical matrix with the number of individuals (nrow(data)) as row number and the number of bootstrap (B) as column number.
  • std.err: A list with the estimated standard errors of the AMEs for all individuals and all bootstraps, corresponding to the associations between covariates of interest (as in bootstrap.ame) and the typology constructed on each bootstrap. For each covariate, the list contains a numerical matrix with the number of individuals (nrow(data)) as row number and the number of bootstrap (B) as column number.

Let us come back to some of these components.

# The association with father unemployment status as it appears in the regression output
bootout$covar.name
## [1] "funempyes"
# The AMEs for the association between father unemployment status and the original clustering
round(bootout$original.ame$funempyes, 4)
## [1]  0.1208 -0.1208

Time for this code chunk to run: 0.01 seconds

As the original clustering is not given as input but computed inside the function, it may be recommended to verify that the result corresponds to what is expected. Here, it is indeed the same as what was estimated earlier. There is symmetry in this setting because of the two clusters solution.

The main purpose of the regressboot function is to estimate the AME matrix for an association with a covariate of interest. We show next the distribution of these values for our use case.

# Histogram of the estimated AMEs for all individuals and all bootstraps
hist(bootout$bootstrap.ame$funempyes, main = NULL, xlab = "AME")

# Histogram for the individuals in the second cluster (high education trajectories)
clustering <- clustqual$clustering$cluster2
hist(bootout$bootstrap.ame$funempyes[clustering == 2,], main = NULL, xlab = "AME")

# Histogram for the individuals in the first cluster (the rest)
hist(bootout$bootstrap.ame$funempyes[clustering == 1,], main = NULL, xlab = "AME")

Time for this code chunk to run: 0.09 seconds

If a given individual is sampled in a given bootstrap, the entry in the AME matrix corresponds to the expected change in probability of belonging to a certain cluster in this bootstrap for a change in the level of the covariate. Otherwise, if an individual is not sampled in a bootstrap, the corresponding entry is empty. Next, we show how to aggregate and summarize the information obtained from the bootstrap procedure.

Pooling effect sizes

We remind that our objective is to assess the robustness of the regression result from the original cluster analysis by utilising the sampling variation. This is achieved with the bootpool function with the following parameters:

  • bootout: Output of the regressboot function.
  • clustering: An integer vector containing the clustering solution (one entry for each individual) from the original analysis.
  • clusnb: An integer with the cluster to be evaluated (part of the clustering solution), as the RARCAT procedure is cluster-wise by design.
  • covar: A character string with the association (covariate) of interest as specified in the component covar.name of the regressboot function output.
  • fisher_transform: Logical. TRUE means that the AMEs from the bootstrap procedure are transformed with a Fisher transformation before being imputed in the pooling model, and then transformed back for the output results. This can be recommended in case of extreme associations (close to the -1 or 1 boundaries). FALSE by default.

A typical utilisation of bootpool is presented below:

# Robustness assessment for the association between father unemployment status
# and membership to the higher education trajectory group
result <- bootpool(bootout, clustering = mvad$clustering, clusnb = 2, 
                   covar = "funempyes")
round(result$pooled.ame, 4)
## [1] -0.1046
round(result$standard.error, 4)
## [1] 0.007
round(result$bootstrap.stddev, 4)
## [1] 0.0486

Time for this code chunk to run: 0.1 seconds

The output of the function bootpool is a list with the following interesting components:

  • pooled.ame: A numeric value indicating the pooled AME, which is the mean change in cluster membership probability for a change in the level of the covariate of interest over all bootstraps and all individuals belonging to the reference cluster in the original typology.
  • standard.error: Standard error of the pooled AME, which diminishes asymptotically as the number of bootstrap increases.
  • bootstrap.stddev: The estimate for the standard deviation of the bootstrap random effect. This can be used to construct a prediction interval for the association of interest (see Roth et al. [2024] for details on how to compute this).
  • individual.stddev: The estimate for the standard deviation of the bootstrap random effect.
  • bootstrap.ranef: A vector containing the estimated random effects for each bootstrap.
  • individual.ranef: A vector containing the estimated random effects for each individual in the reference cluster.

Now we are able to interpret the output of the bootpool function run above. The pooled AME is approximately -0.11 and represents the expected change in cluster membership probability based on the typology replications for individuals whose father was unemployed at time of survey, for an average bootstrap sample and an average individual in the higher education cluster. The corresponding standard error (SE) is small, which indicates that there are probably enough bootstrap replications, and that despite the results’ randomness caused by the bootstrap sampling, variation in the pooled AME is limited. Based on the between-bootstrap random deviation, we can derive a 95% prediction interval (PI) for the association of interest: [-0.20, -0.02]. If a new sample is drawn from the same underlying distribution, and a new partition constructed from this sample, this interval gives the range of expected values for the change in cluster membership probability for father unemployment status based on this new partition for individuals assigned to the high education cluster originally. While this supports the claim from the original analysis (95% CI for the corresponding AME was [-0.19, -0.06]), the evidence for the effect of father unemployment on higher education trajectories is a bit weaker after the robustness assessment.

Lastly, the individual-specific random effects in the output of unirarcat inform on the within-cluster homogeneity. Indeed, individuals with fitted values that diverge from the other individuals in the same cluster were often assigned to different clusters in the bootstrap, thus impacting their estimated AMEs. We can see this with by looking at the distribution of these random effects.

# Histogram of the fitted random effects for individuals in the second cluster
hist(result$individual.ranef, main = NULL, breaks = 20,
     xlab = "Individual-specific random effects")

Time for this code chunk to run: 0.03 seconds

Individuals at the far right of this histogram have outlier trajectories compared to the other individuals in the cluster. This can be checked by plotting the actual trajectories. In the code below, we highlight the most common sequences in the second cluster depending on whether the fitted random effect for the corresponding individual is more than one standard deviation away from its mean (on the right, n=11) or not (on the left, n=140). The trajectories on the right feature shorter spells in higher education compared to those on the left and can thus be interpreted as borderline or atypical.

# Most frequent trajectories in the second cluster, separated by their 
# random effects estimated in RARCAT
seqfplot(mvad.seq[clustering == 2,], 
         group=abs(result$individual.ranef) > result$individual.stddev, 
         border=NA, main="Outlier")

Time for this code chunk to run: 0.07 seconds

Second use case: fixed number of clusters

It can be argued that a two clusters solution is not particularly interesting or informative, and that despite “worse” cluster quality indices, a solution with more clusters should be favoured. This is particularly true in this case as we are interested in the association between the clustering and a covariate, and the clustassoc function supports considering at least 6 groups to sufficiently account for this association (see WeightedCluster vignette Validating Sequence Analysis Typologies to be used in Subsequent Regression). Therefore, we consider a six clusters solution for further illustration of the methods and their implementation in \(R\). First, we visualise again this typology.

seqdplot(mvad.seq, group=clustqual$clustering$cluster6, border=NA)

Time for this code chunk to run: 0.14 seconds

It could be interesting to study the association between father unemployment status and membership to the unemployment or joblessness cluster (number 6 above). We repeat thus the standard analysis with this trajectory group as reference.

# Create cluster membership variable
mvad$clustering <- clustqual$clustering$cluster6
mvad$membership <- mvad$clustering == 6
# Run logistic regression model
mod <- glm(clustering ~ funemp, mvad, family = "binomial")
# Model results
summary(margins(mod))
##     factor    AME     SE      z      p   lower  upper
##  funempyes 0.0447 0.0436 1.0265 0.3047 -0.0407 0.1301

Time for this code chunk to run: 0.04 seconds

Moving directly to the new functions, we force this time the number of clusters to be fixed to six in each bootstrap to avoid choosing a solution with less clusters due to a quality index. The output of the function regressboot does not depend on a specific reference cluster so there is the possibility to evaluate all the possible associations in one run. This is achieved with the function rarcat, with the same parameters as regressboot as well as several additional ones:

# Evaluate the validity of the original analysis and the reliability of its
# findings by applying RARCAT with all given covariates and constructed types
# rarcat <- rarcat(clustering ~ funemp, mvad, diss = diss, B = 500,
#                  algo = "hierarchical", method = "ward.D",
#                  fixed = TRUE, kcluster = 6,
#                  parallel = "snow", ncpus = ncpus, cl = cl)
rarcat <- rarcat(clustering ~ funemp, mvad, diss = diss, B = 50,
                 algo = "hierarchical", method = "ward.D",
                 fixed = TRUE, kcluster = 6)
rarcat$original.analysis
##      factor cluster 1 c1 lower c1 upper cluster 2 c2 lower c2 upper cluster 3
## 1 funempyes    -0.045    -0.13    0.041    -0.041   -0.087    0.005     0.009
##   c3 lower c3 upper cluster 4 c4 lower c4 upper cluster 5 c5 lower c5 upper
## 1   -0.073    0.091     0.093    0.002    0.183     -0.08   -0.131   -0.028
##   cluster 6 c6 lower c6 upper
## 1     0.064    0.005    0.123
rarcat$robust.analysis
##      factor cluster 1 c1 lower c1 upper cluster 2 c2 lower c2 upper cluster 3
## 1 funempyes    -0.023   -0.084    0.038     -0.04   -0.094    0.013    -0.012
##   c3 lower c3 upper cluster 4 c4 lower c4 upper cluster 5 c5 lower c5 upper
## 1   -0.088    0.065     0.043   -0.027    0.113    -0.077   -0.125   -0.029
##   cluster 6 c6 lower c6 upper
## 1     0.111    0.028    0.194
# Stop the parallel cluster after computation
#stopCluster(cl)

Time for this code chunk to run: 9.46 seconds

The results here are not directly comparable to the ones from the short tutorial as only one covariate is considered (father unemployment status), such that the estimated effects are not adjusted for GCSEs obtained as it was the case then. Nevertheless, we see in this bivariate setting that in the original analysis, father unemployment status was not only associated with the unemployment trajectory group, but also with trajectories featuring long training spells (cluster 4) and, as hinted in the first use case (two clusters solution), with school + higher education trajectories (cluster 5, inverse association). However, the evidence for the association with cluster 4 was only weak and does not pass the robustness assessment (95% PI contains the null value), which indicates that this relationship is sensitive to sampling uncertainty and therefore unstable [1]. On the other hand, the association with cluster 5 stays virtually unchanged in the robustness assessment, and the association with cluster 6 becomes even more apparent. We conclude that there is relatively strong evidence of a relationship between our covariate of interest and these two specific trajectory types, at least if we ignore potential confounding factors.

[1] In fact, the estimated Jaccard coefficient for this cluster is only 0.39, compared to 0.94 for the fifth cluster and 0.63 for the sixth.

References

Hennig, C. (2007) Cluster-wise assessment of cluster stability. Computational Statistics and Data Analysis, 52, 258-271.

Roth, L., Studer, M., Zuercher, E., & Peytremann-Bridevaux, I. (2024). Robustness assessment of regressions using cluster analysis typologies: a bootstrap procedure with application in state sequence analysis. BMC medical research methodology, 24(1), 303. https://doi.org/10.1186/s12874-024-02435-8.

Studer M. WeightedCluster Library Manual A practical guide to creating typologies of trajectories in the social sciences with R. 2013.