In a standard Sequence Analysis, similar trajectories are clustered together to create a typology of trajectories, which is then often used to evaluate the association between sequence patterns and covariates in regression models. The sampling uncertainty, which affects both the derivation of the typology and the associated regressions, is typically ignored in this analysis, an oversight that may lead to incorrect statistical conclusions.
This document provides an introduction to the R
code
proposed to assess the robustness of regression results obtained from
this standard analysis. Bootstrap samples are drawn from the data, and
for each bootstrap, a new typology replicating the original one is
constructed, followed by the estimation of the corresponding regression
models. The bootstrap estimates are then combined using a multilevel
modelling framework.
The document is structured in two parts. First, a short tutorial with
a streamlined standard analysis on sequence data and a robustness
assessment made with the rarcat
function. Then, an extended
tutorial with the same data illustration and a detailed explanation of
the new functions, their arguments and their outputs. Furthermore,
readers interested in the methods and the precise interpretation of the
results are referred to Roth et al. (2024), listed in the references at the end of this document.
You are kindly asked to cite this reference if you use the methods presented in this document.
Let us start by setting the seed for reproducible results.
Time for this code chunk to run: 0.01 seconds
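The corresponding code chunk is not displayed above; a minimal sketch is shown below. The actual seed value used in the vignette is not shown, so the value here is arbitrary.
## Setting the seed for reproducible results (arbitrary value for illustration)
set.seed(1)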
For this example, we use the openly available mvad
dataset on transitions from school to work. First, we create the state
sequence object.
## Loading the TraMineR library
library(TraMineR)
## Loading the data
data(mvad)
## State properties
mvad.alphabet <- c("employment", "FE", "HE", "joblessness", "school", "training")
mvad.lab <- c("employment", "further education", "higher education", "joblessness", "school", "training")
mvad.shortlab <- c("EM","FE","HE","JL","SC","TR")
## Creating the state sequence object
mvad.seq <- seqdef(mvad, 17:86, alphabet = mvad.alphabet, states = mvad.shortlab,
labels = mvad.lab, xtstep = 6)
Time for this code chunk to run: 0.06 seconds
We will now construct a typology using cluster analysis. For readers
seeking more details, the WeightedCluster
library manual
provides an in-depth explanation of the process and the computation of
cluster quality measures (Studer, 2013).
We start by computing dissimilarities with the seqdist
function using the longest common subsequence distance. We then use Ward
clustering to create a typology of the trajectories.
## Using fastcluster for hierarchical clustering
library(fastcluster)
## Distance computation
diss <- seqdist(mvad.seq, method="LCS")
## Hierarchical clustering
hc <- hclust(as.dist(diss), method="ward.D")
Time for this code chunk to run: 1.79 seconds
We can now compute several cluster quality indices using the as.clustrange
function with two to ten groups.
# Loading the WeightedCluster library
library(WeightedCluster)
# Computing cluster quality measures.
clustqual <- as.clustrange(hc, diss=diss, ncluster=10)
clustqual
## PBC HG HGSD ASW ASWw CH R2 CHsq R2sq HC
## cluster2 0.59 0.75 0.74 0.43 0.43 216.72 0.23 431.41 0.38 0.13
## cluster3 0.43 0.51 0.50 0.25 0.25 175.61 0.33 335.58 0.49 0.25
## cluster4 0.52 0.67 0.67 0.29 0.30 164.63 0.41 352.31 0.60 0.17
## cluster5 0.52 0.69 0.68 0.31 0.31 153.46 0.46 326.41 0.65 0.16
## cluster6 0.57 0.79 0.78 0.34 0.35 151.17 0.52 372.95 0.73 0.11
## cluster7 0.58 0.83 0.83 0.37 0.37 144.38 0.55 383.10 0.77 0.09
## cluster8 0.51 0.78 0.78 0.32 0.33 140.00 0.58 358.85 0.78 0.12
## cluster9 0.52 0.85 0.85 0.34 0.35 137.85 0.61 386.02 0.81 0.09
## cluster10 0.51 0.87 0.87 0.35 0.36 131.67 0.63 379.81 0.83 0.08
Time for this code chunk to run: 0.1 seconds
Different clustering solutions may be argued for based on the information above. We are interested in a relatively detailed partition of the data, so we focus hereafter on a six-cluster solution (more details on selecting a clustering solution are available in the extended tutorial below). The corresponding typology is shown next:
Time
for this code chunk to run: 0.16 seconds
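The plotting code is not displayed above; a minimal sketch using the TraMineR function seqdplot, assuming the six-cluster solution stored in clustqual, would be:
## State distribution plots by cluster for the six-cluster solution
seqdplot(mvad.seq, group = clustqual$clustering$cluster6, border = NA)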
We are now interested in the relationship between this typology and
the funemp
and gcse5eq
covariates, which
represent the father’s unemployment status and the qualifications gained
by the end of compulsory education, respectively (both are binary
variables). Such associations can be studied with different approaches.
Here, we focus on implementing separate logistic regressions for each
cluster and estimating average marginal effects (AMEs) with these
models. Readers are referred to the article by Roth et al. (2024)
for theoretical and practical explanations for these methodological
choices.
Multiple commands would normally be required to explore all possible
combinations between the clusters and the covariates of interest. Instead, this
can be done in a single call to the rarcat function, shown below.
# Add the clustering solution to the dataset
mvad$clustering <- clustqual$clustering$cluster6
# Formula for the association between clustering and covariates of interest
formula <- clustering ~ funemp + gcse5eq
# Run separate logistic regressions for each cluster and compute the corresponding AMEs
# with their confidence interval
rarcatout <- rarcat(formula, data=mvad, diss=diss, robust=FALSE)
rarcatout$original.analysis
## factor cluster 1 c1 lower c1 upper cluster 2 c2 lower c2 upper cluster 3
## 1 funempyes -0.070 -0.150 0.010 -0.019 -0.075 0.036 0.011
## 2 gcse5eqyes -0.181 -0.244 -0.118 0.159 0.109 0.209 0.014
## c3 lower c3 upper cluster 4 c4 lower c4 upper cluster 5 c5 lower c5 upper
## 1 -0.072 0.094 0.056 -0.027 0.140 -0.049 -0.110 0.012
## 2 -0.050 0.078 -0.202 -0.260 -0.144 0.254 0.197 0.312
## cluster 6 c6 lower c6 upper
## 1 0.051 -0.003 0.105
## 2 -0.051 -0.082 -0.020
Time for this code chunk to run: 0.37 seconds
The results in the table above are AMEs, which measure the expected change in the probability of belonging to a given cluster if the covariate takes a given value, together with their 95% confidence intervals. Thus, after adjustment for the GCSEs, the father's unemployment status is only marginally associated with membership in cluster 6, which contains the children's unemployment trajectories (the p-value is actually 0.06).
On the other hand, strong associations are apparent between the GCSEs obtained and the trajectory groups: individuals who gained five or more GCSEs have a distinctly higher probability of entering higher education (clusters 2 and 5) and a lower probability of being employed, in training or, to a lesser degree, unemployed (clusters 1, 4 and 6). However, the standard analysis presented to this point does not properly account for the sampling uncertainty, so the reliability of these findings remains uncertain.
The Robustness Assessment of Regressions using Cluster Analysis
Typologies (RARCAT) procedure evaluates the reproducibility
of the analysis on resamples of the data. We focus here on a quick
implementation of the procedure and refer to the extended tutorial below
or to the article by Roth et al. (2024) for further details. The rarcat
function is run again with robust = TRUE and further arguments
controlling the procedure, which replicates the typology and the
associated regressions on bootstrap samples of the data. The robustness
of the original estimates can then be assessed by comparing them to
summary estimates in the form of pooled AMEs and 95% prediction
intervals.
# Evaluate the validity of the original analysis and the reliability of its
# findings by applying RARCAT with 50 bootstrap replications.
# As in the original analysis, hierarchical clustering with Ward method is implemented.
# The number of clusters is fixed to 6 here.
rarcatout <- rarcat(formula, data = mvad, diss = diss, robust = TRUE, B = 50,
algo = "hierarchical", method = "ward.D",
fixed = TRUE, kcluster = 6)
rarcatout$robust.analysis
## factor cluster 1 c1 lower c1 upper cluster 2 c2 lower c2 upper cluster 3
## 1 funempyes -0.039 -0.100 0.023 -0.016 -0.078 0.046 -0.008
## 2 gcse5eqyes -0.168 -0.266 -0.069 0.171 0.095 0.246 0.034
## c3 lower c3 upper cluster 4 c4 lower c4 upper cluster 5 c5 lower c5 upper
## 1 -0.080 0.065 0.018 -0.050 0.087 -0.052 -0.126 0.021
## 2 -0.059 0.128 -0.164 -0.244 -0.083 0.254 0.184 0.324
## cluster 6 c6 lower c6 upper
## 1 0.083 0.001 0.166
## 2 -0.083 -0.163 -0.004
Time for this code chunk to run: 18.75 seconds
The results stay fairly close to those from the original analysis (seen earlier), in particular regarding the "significant" associations. The relationship between the father's unemployment status and the unemployment trajectories (cluster 6) seems somewhat stronger than it did on the original sample, and the relationship between GCSEs and long training spells (cluster 4) is a bit weaker. The other relevant associations are virtually unchanged. Thus, we conclude that the original analysis appears to be valid and robust to sampling variation. However, fixing the number of clusters (six here) throughout the bootstrap procedure may not always be warranted.
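To inspect this comparison directly, both sets of estimates can be printed together; as used later in the extended tutorial, the rarcat output with robust = TRUE also contains an original.analysis component.
## Original AMEs with confidence intervals
rarcatout$original.analysis
## Pooled bootstrap AMEs with 95% prediction intervals
rarcatout$robust.analysis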
In the example above, we have run the rarcat function with the number of
bootstrap replications \(B=50\), which is less than the recommended number
(500 in the function documentation). This was done to speed up the
computation. Alternatively, parallel computing may be used. The code below
shows how to do this for Windows users. The rarcat function also accepts
arguments for parallel computing on other operating systems (see the
boot function documentation for more details).
# # Loading the parallel library
# library(parallel)
# # Use available cores minus one
# ncpus <- detectCores() - 1
# # Create the parallel cluster
# cl <- makeCluster(ncpus)
#
# # Parallel run
# rarcatout <- rarcat(formula, data = mvad, diss = diss, robust = TRUE, B = 500,
# algo = "hierarchical", method = "ward.D",
# fixed = TRUE, kcluster = 6,
# parallel = "snow", ncpus = ncpus, cl = cl)
# rarcatout$robust.analysis
Time for this code chunk to run: 0.01 seconds
With parallel computing and \(B=500\), the bootstrap procedure runs at least four times faster and the estimates become sufficiently accurate, in the sense that a new run gives almost exactly the same values.
We will use here the same dataset (mvad) and dissimilarity matrix (diss)
as before. For the sake of illustration, we start by considering the
two-cluster solution. These clusters are visualised below with state
distribution plots.
Time
for this code chunk to run: 0.07 seconds
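The plotting code is again not displayed; a sketch along the following lines, using the two-cluster solution from clustqual, would produce these state distribution plots:
## State distribution plots for the two-cluster solution
seqdplot(mvad.seq, group = clustqual$clustering$cluster2, border = NA)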
The distinction is essentially between individuals with prolonged
spells of higher education (on the right, n=151) and the rest (on the
left, n=561). In this example, we investigate the association between
father unemployment status (funemp
variable) and this
typology of trajectories.
# Loading the margins library for marginal effects estimation
library(margins)
# Create cluster membership variable
mvad$clustering <- clustqual$clustering$cluster2
mvad$membership <- mvad$clustering == 2
# Run logistic regression model
mod <- glm(membership ~ funemp, mvad, family = "binomial")
# Model results (AME)
summary(margins(mod))
## factor AME SE z p lower upper
## funempyes -0.1208 0.0338 -3.5729 0.0004 -0.1871 -0.0545
Time for this code chunk to run: 0.05 seconds
The average marginal effect (AME) measures the expected change in the probability of belonging to the higher education trajectory group if the father was unemployed at the time of the survey. Readers are referred to the article by Roth et al. (2024) for theoretical and practical explanations of the use of logistic regression and marginal effects in this situation. Crucially, the standard analysis presented to this point does not account for the sampling uncertainty, which might affect the findings' reliability.
The Robustness Assessment of Regressions using Cluster Analysis Typologies (RARCAT) procedure works as follows:
1. Draw bootstrap samples from the original data.
2. Construct, for each bootstrap sample, a new typology replicating the original one.
3. Estimate the corresponding regression models on each bootstrap sample.
4. Combine the bootstrap estimates using a multilevel modelling framework.
Hennig (2007) proposed to use the new partitions derived in step 2 to
evaluate cluster-wise stability, by measuring how well the clustering
solution is preserved across perturbed datasets through average Jaccard
similarities. The corresponding code is available in the fpc library and
its application is illustrated below with 500 bootstrap replications.
# Loading the fpc library
library(fpc)
# Cluster-wise stability assessment by bootstrap
stab <- clusterboot(diss, B = 500, distances = TRUE, clustermethod = disthclustCBI,
method = "ward.D", k = 2, count = FALSE)
round(stab$bootmean, 2)
## [1] 0.96 0.89
Time for this code chunk to run: 4.45 seconds
The estimated Jaccard coefficients are above 0.85, which indicates high cluster-wise stability: most of the individuals belonging to either of the two clusters in the original partition tend to be clustered together again in the bootstrap partitions, particularly for the first cluster.
We propose to go one step further and use the bootstrap partitions to
estimate, for each bootstrap, new regression models replicating the
original ones. This can be achieved by running the regressboot function
with the following main parameters:
formula: A formula object with the clustering solution on the left side and the covariates of interest on the right side.
data: The dataset (data frame) with column names corresponding to the information in formula. The number of individuals (rows) should match the dimension of diss.
diss: The numerical dissimilarity matrix used for clustering. Only a pre-computed matrix (i.e., where pairwise dissimilarities do not depend on the resample) is currently supported.
B: Integer. The number of bootstrap replications, set to 500 by default to attain satisfactory precision around the estimates, as the procedure involves multiple steps.
algo: The clustering algorithm as a character string. Currently only "pam" (calling the function wcKMedRange) and "hierarchical" (calling the function fastcluster::hclust) are supported. "pam" by default.
method: A character string with the method argument of hclust, "ward.D" by default.
fixed: Logical. TRUE implies that the number of clusters is the same in every bootstrap. FALSE (default) implies that an optimal number of clusters is evaluated each time.
kcluster: Integer. Either the number of clusters in every bootstrap if fixed is TRUE, or the maximum number of clusters (starting from 2) to be evaluated in each bootstrap if fixed is FALSE.
cqi: A character string with the cluster quality index to be evaluated for each new partition. Any index returned by as.clustrange is supported, "CH" (the Calinski-Harabasz index) by default. Also works with algo = "pam".
parallel: A character string with the type of parallel operation to be used (if any) by the function boot::boot. Options are "no" (default), "multicore" and "snow" (for Windows).
ncpus: Integer. Number of processes to be used in case of parallel operation. Typically, one would choose this to be the number of available CPUs.
cl: A parallel cluster for use if parallel = "snow". If not supplied, a cluster on the local machine is created for the duration of the boot call.
A typical use of regressboot is shown below (caution, the commented call uses the Windows parallel computing settings from the short tutorial):
# Bootstrap replicates of the typology and its association with the variable funemp.
# As in the original analysis, hierarchical clustering with Ward method is implemented.
# Also, an optimal clustering solution with n between 2 and 10 is evaluated each time by
# maximizing the CH index.
# bootout <- regressboot(clustering ~ funemp, data = mvad, diss = diss, B = 500,
# algo = "hierarchical", method = "ward.D",
# fixed = FALSE, kcluster = 10, cqi = "CH",
# parallel = "snow", ncpus = ncpus, cl = cl)
# Without parallel computing
bootout <- regressboot(clustering ~ funemp, data = mvad, diss = diss, B = 50,
algo = "hierarchical", method = "ward.D",
fixed = FALSE, kcluster = 10, cqi = "CH")
table(bootout$optimal.kcluster)
##
## 2 3 6
## 47 2 1
Time for this code chunk to run: 6.96 seconds
The vast majority of the bootstrap partitions have two clusters, but this
optimal number sometimes varies. In those cases, a correspondingly
different number of logistic regressions is estimated. The output of the
regressboot function is a list with the following components of interest:
optimal.kcluster: An integer vector with the number of clusters in each bootstrap partition. If the input parameter fixed is FALSE, this corresponds to the clustering solution selected with the evaluation criterion. If fixed is TRUE, this can in rare cases differ from kcluster if two reference clusters have exactly the same estimated association with a covariate.
covar.name: A character vector with the different associations evaluated in the logistic regression models (based on the input parameter formula). This corresponds to the covariate name for numerical variables and to the name combined with a specific level for factors.
original.ame: A list with the estimated AMEs corresponding to each association between the covariates of interest (as in covar.name) and the original typology, i.e., the one constructed on the original sample.
bootstrap.ame: A list with the estimated AMEs for all individuals and all bootstraps, corresponding to the associations between the covariates of interest (as in covar.name) and the typology constructed on each bootstrap sample. For each covariate, the list contains a numerical matrix with the number of individuals (nrow(data)) as rows and the number of bootstrap replications (B) as columns.
std.err: A list with the estimated standard errors of the AMEs for all individuals and all bootstraps, corresponding to the associations between the covariates of interest (as in bootstrap.ame) and the typology constructed on each bootstrap sample. For each covariate, the list contains a numerical matrix with the same dimensions as in bootstrap.ame.
Let us come back to some of these components.
# The association with father unemployment status as it appears in the regression output
bootout$covar.name
## [1] "funempyes"
# The AMEs for the association between father unemployment status and the original clustering
round(bootout$original.ame$funempyes, 4)
## [1] 0.1208 -0.1208
Time for this code chunk to run: 0.01 seconds
As the original clustering is not given as input but computed inside the function, it is recommended to verify that the result corresponds to what is expected. Here, it is indeed the same as what was estimated earlier. There is symmetry in this setting because of the two-cluster solution.
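Such a check can also be done in code, for instance by comparing with the AME from the margins output of the original model estimated above:
## Sanity check: the AME from the original logistic regression on cluster 2 membership
## should match the corresponding entry of original.ame (up to sign, given the symmetry)
summary(margins(mod))$AME
round(bootout$original.ame$funempyes, 4)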
The main purpose of the regressboot
function is to
estimate the AME matrix for an association with a covariate of interest.
We show next the distribution of these values for our use case.
# Histogram of the estimated AMEs for all individuals and all bootstraps
hist(bootout$bootstrap.ame$funempyes, main = NULL, xlab = "AME")
# Histogram for the individuals in the second cluster (high education trajectories)
clustering <- clustqual$clustering$cluster2
hist(bootout$bootstrap.ame$funempyes[clustering == 2,], main = NULL, xlab = "AME")
# Histogram for the individuals in the first cluster (the rest)
hist(bootout$bootstrap.ame$funempyes[clustering == 1,], main = NULL, xlab = "AME")
Time
for this code chunk to run: 0.09 seconds
If a given individual is sampled in a given bootstrap, the entry in the AME matrix corresponds to the expected change in probability of belonging to a certain cluster in this bootstrap for a change in the level of the covariate. Otherwise, if an individual is not sampled in a bootstrap, the corresponding entry is empty. Next, we show how to aggregate and summarize the information obtained from the bootstrap procedure.
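Before doing so, the sampling structure of this matrix can be checked directly; the code below assumes that entries for individuals not drawn in a given bootstrap are stored as NA.
## Proportion of empty entries, i.e., individual-bootstrap pairs where the individual
## was not drawn in that bootstrap sample (assumed to be stored as NA)
mean(is.na(bootout$bootstrap.ame$funempyes))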
Recall that our objective is to assess the robustness of the regression
result from the original cluster analysis by exploiting the sampling
variation. This is achieved with the bootpool function, which takes the
following parameters:
bootout: Output of the regressboot function.
clustering: An integer vector containing the clustering solution (one entry for each individual) from the original analysis.
clusnb: An integer indicating the cluster to be evaluated (part of the clustering solution), as the RARCAT procedure is cluster-wise by design.
covar: A character string with the association (covariate) of interest, as specified in the covar.name component of the regressboot output.
fisher_transform: Logical. TRUE means that the AMEs from the bootstrap procedure are transformed with a Fisher transformation before being entered into the pooling model, and then transformed back for the output results. This can be recommended in case of extreme associations (close to the -1 or 1 boundaries). FALSE by default.
A typical use of bootpool is presented below:
# Robustness assessment for the association between father unemployment status
# and membership to the higher education trajectory group
result <- bootpool(bootout, clustering = mvad$clustering, clusnb = 2,
covar = "funempyes")
round(result$pooled.ame, 4)
## [1] -0.1046
round(result$standard.error, 4)
## [1] 0.007
round(result$bootstrap.stddev, 4)
## [1] 0.0486
Time for this code chunk to run: 0.1 seconds
The output of the bootpool function is a list with the following components of interest:
pooled.ame: A numeric value indicating the pooled AME, i.e., the mean change in cluster membership probability for a change in the level of the covariate of interest, over all bootstraps and all individuals belonging to the reference cluster in the original typology.
standard.error: The standard error of the pooled AME, which diminishes asymptotically as the number of bootstrap replications increases.
bootstrap.stddev: The estimated standard deviation of the bootstrap random effect. This can be used to construct a prediction interval for the association of interest (see Roth et al. [2024] for details on how to compute this).
individual.stddev: The estimated standard deviation of the individual random effect.
bootstrap.ranef: A vector containing the estimated random effects for each bootstrap.
individual.ranef: A vector containing the estimated random effects for each individual in the reference cluster.
We can now interpret the output of the bootpool
function run above. The pooled AME is approximately -0.10 and represents
the expected change in cluster membership probability based on the
typology replications for individuals whose father was unemployed at
time of survey, for an average bootstrap sample and an average
individual in the higher education cluster. The corresponding standard
error (SE) is small, which indicates that there are probably enough
bootstrap replications, and that despite the results’ randomness caused
by the bootstrap sampling, variation in the pooled AME is limited. Based
on the between-bootstrap random deviation, we can derive a 95%
prediction interval (PI) for the association of interest: [-0.20,
-0.02]. If a new sample is drawn from the same underlying distribution,
and a new partition constructed from this sample, this interval gives
the range of expected values for the change in cluster membership
probability for father unemployment status based on this new partition
for individuals assigned to the high education cluster originally. While
this supports the claim from the original analysis (95% CI for the
corresponding AME was [-0.19, -0.06]), the evidence for the effect of
father unemployment on higher education trajectories is a bit weaker
after the robustness assessment.
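For reference, the prediction interval quoted above can be approximated from the bootpool output; this is only a normal-approximation sketch, the exact computation being described in Roth et al. (2024).
## Approximate 95% prediction interval (normal approximation; see Roth et al. 2024
## for the exact formula)
round(result$pooled.ame + c(-1, 1) * qnorm(0.975) * result$bootstrap.stddev, 2)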
Lastly, the individual-specific random effects in the output of
bootpool inform on the within-cluster homogeneity. Indeed,
individuals with fitted values that diverge from the other individuals
in the same cluster were often assigned to different clusters in the
bootstrap partitions, thus affecting their estimated AMEs. We can see this by
looking at the distribution of these random effects.
# Histogram of the fitted random effects for individuals in the second cluster
hist(result$individual.ranef, main = NULL, breaks = 20,
xlab = "Individual-specific random effects")
Time
for this code chunk to run: 0.03 seconds
Individuals at the far right of this histogram have outlier trajectories compared to the other individuals in the cluster. This can be checked by plotting the actual trajectories. In the code below, we highlight the most common sequences in the second cluster depending on whether the fitted random effect for the corresponding individual is more than one standard deviation away from its mean (on the right, n=11) or not (on the left, n=140). The trajectories on the right feature shorter spells in higher education compared to those on the left and can thus be interpreted as borderline or atypical.
# Most frequent trajectories in the second cluster, separated by their
# random effects estimated in RARCAT
seqfplot(mvad.seq[clustering == 2,],
group=abs(result$individual.ranef) > result$individual.stddev,
border=NA, main="Outlier")
Time
for this code chunk to run: 0.07 seconds
It can be argued that a two-cluster solution is not particularly
interesting or informative, and that despite "worse" cluster quality
indices, a solution with more clusters should be favoured. This is
particularly true in this case, as we are interested in the association
between the clustering and a covariate, and an analysis with the clustassoc
function supports considering at least six groups to sufficiently account
for this association (see the WeightedCluster vignette
Validating Sequence Analysis Typologies to be used in Subsequent
Regression). Therefore, we consider a six-cluster solution for
further illustration of the methods and their implementation in \(R\). First, we visualise this
typology again.
Time
for this code chunk to run: 0.14 seconds
It could be interesting to study the association between the father's unemployment status and membership in the unemployment or joblessness cluster (number 6 above). We thus repeat the standard analysis with this trajectory group as the reference.
# Create cluster membership variable
mvad$clustering <- clustqual$clustering$cluster6
mvad$membership <- mvad$clustering == 6
# Run logistic regression model
mod <- glm(membership ~ funemp, mvad, family = "binomial")
# Model results
summary(margins(mod))
## factor AME SE z p lower upper
## funempyes 0.0447 0.0436 1.0265 0.3047 -0.0407 0.1301
Time for this code chunk to run: 0.04 seconds
Moving on to the new functions, this time we force the number of clusters
to be fixed at six in each bootstrap, to avoid choosing a solution with
fewer clusters on the basis of a quality index. The output of the
regressboot function does not depend on a specific reference cluster, so
all possible associations can be evaluated in one run. This is achieved
with the rarcat function, which takes the same parameters as
regressboot as well as several additional ones:
robust: Logical. TRUE (the default) indicates that RARCAT should be performed. FALSE implies a much faster run that only outputs the original analysis, i.e., a standard regression analysis for all combinations of reference clusters and covariates.
fisher_transform: Logical. TRUE means that a Fisher transformation is applied within the rarcat function. This can be recommended in case of extreme associations (close to the -1 or 1 boundaries). FALSE by default.
conflevel: Confidence level for the confidence intervals from the original analysis and the prediction intervals from the robustness assessment. 0.05 by default.
digits: Controls the number of significant digits to print. 3 by default.
# Evaluate the validity of the original analysis and the reliability of its
# findings by applying RARCAT with all given covariates and constructed types
# rarcat <- rarcat(clustering ~ funemp, mvad, diss = diss, B = 500,
# algo = "hierarchical", method = "ward.D",
# fixed = TRUE, kcluster = 6,
# parallel = "snow", ncpus = ncpus, cl = cl)
rarcat <- rarcat(clustering ~ funemp, mvad, diss = diss, B = 50,
algo = "hierarchical", method = "ward.D",
fixed = TRUE, kcluster = 6)
rarcat$original.analysis
## factor cluster 1 c1 lower c1 upper cluster 2 c2 lower c2 upper cluster 3
## 1 funempyes -0.045 -0.13 0.041 -0.041 -0.087 0.005 0.009
## c3 lower c3 upper cluster 4 c4 lower c4 upper cluster 5 c5 lower c5 upper
## 1 -0.073 0.091 0.093 0.002 0.183 -0.08 -0.131 -0.028
## cluster 6 c6 lower c6 upper
## 1 0.064 0.005 0.123
rarcat$robust.analysis
## factor cluster 1 c1 lower c1 upper cluster 2 c2 lower c2 upper cluster 3
## 1 funempyes -0.023 -0.084 0.038 -0.04 -0.094 0.013 -0.012
## c3 lower c3 upper cluster 4 c4 lower c4 upper cluster 5 c5 lower c5 upper
## 1 -0.088 0.065 0.043 -0.027 0.113 -0.077 -0.125 -0.029
## cluster 6 c6 lower c6 upper
## 1 0.111 0.028 0.194
Time for this code chunk to run: 9.46 seconds
The results here are not directly comparable to those from the short tutorial, as only one covariate is considered (father's unemployment status), so the estimated effects are not adjusted for the GCSEs obtained as was the case there. Nevertheless, we see in this bivariate setting that, in the original analysis, the father's unemployment status was not only associated with the unemployment trajectory group, but also with trajectories featuring long training spells (cluster 4) and, as hinted in the first use case (two-cluster solution), with school + higher education trajectories (cluster 5, inverse association). However, the evidence for the association with cluster 4 was only weak and does not pass the robustness assessment (the 95% PI contains the null value), which indicates that this relationship is sensitive to sampling uncertainty and therefore unstable [1]. On the other hand, the association with cluster 5 stays virtually unchanged in the robustness assessment, and the association with cluster 6 becomes even more apparent. We conclude that there is relatively strong evidence of a relationship between our covariate of interest and these two specific trajectory types, at least if we ignore potential confounding factors.
[1] In fact, the estimated Jaccard coefficient for this cluster is only 0.39, compared to 0.94 for the fifth cluster and 0.63 for the sixth.
Hennig, C. (2007) Cluster-wise assessment of cluster stability. Computational Statistics and Data Analysis, 52, 258-271.
Roth, L., Studer, M., Zuercher, E., & Peytremann-Bridevaux, I. (2024). Robustness assessment of regressions using cluster analysis typologies: a bootstrap procedure with application in state sequence analysis. BMC Medical Research Methodology, 24(1), 303. https://doi.org/10.1186/s12874-024-02435-8
Studer, M. (2013). WeightedCluster Library Manual: A practical guide to creating typologies of trajectories in the social sciences with R. LIVES Working Papers, 24.