References:
VarSelLCM permits a full model selection (detection of the relevant features for clustering and selection of the number of clusters) in model-based clustering, according to classical information criteria (BIC, MICL or AIC).
Data to analyzed can be composed of continuous, integer and/or categorical features. Moreover, missing values are managed, without any pre-processing, by the model used to cluster with the assumption that values are missing completely at random. Thus, VarSelLCM can also be used for data imputation via mixture models.
An R-Shiny application is implemented to easily interpret the clustering results
This section performs the whole analysis of the Heart data set. Warning continuous features must be stored in numeric, integer features must be stored in integer and categorical features must be stored in factor.
library(VarSelLCM)
Attaching package: 'VarSelLCM'
The following object is masked from 'package:stats':
predict
# Data loading
data("heart")
head(heart)
Age Sex ChestPainType RestBloodPressure SerumCholestoral
1 70 1 4 130 322
2 67 0 3 115 564
3 57 1 2 124 261
4 64 1 4 128 263
5 74 0 2 120 269
6 65 1 4 120 177
FastingBloodSugar ResElectrocardiographic MaxHeartRate ExerciseInduced
1 0 2 109 0
2 0 2 160 0
3 0 0 141 0
4 0 0 105 1
5 0 2 121 1
6 0 0 140 0
Slope MajorVessels Thal Class
1 2 3 3 2
2 2 0 7 1
3 1 0 7 2
4 2 1 7 1
5 1 1 3 1
6 1 0 7 1
Clustering is performed with variable selection. Model selection is done with BIC because the number of observations is large (compared to the number of features). The number of components is between 1 and 4. Do not hesitate to use parallelisation (here only two cores are used).
# Add a missing value artificially (just to show that it works!)
heart[1,1] <- NA
# Clustering with variable selection and a number of cluster betwee 1 and 4
# Model selection is BIC (to use MICL, the option must be specified)
out <- VarSelCluster(heart[,-13], 1:4, nbcores = 2)
To get a summary of the selected model.
# Summary of the best model
summary(out)
Model:
Number of components: 2
Model selection has been performed according to the BIC criterion
Variable selection has been performed, 8 ( 66.67 % ) of the variables are relevant for clustering
All the results can be analyzed by the Shiny application…
# Start the shiny application
VarSelShiny(out)
… but this analysis can also be done on R.
Model interpretation should focus on the most discriminative variables. These variables can be found with the following plot.
# Discriminative power of the variables (here, the most discriminative variable is MaxHeartRate)
plot(out)
Interpretation of the most discriminative variable is based on its distribution per cluster.
# Boxplot for continuous (or interger) variable
plot(out, y="MaxHeartRate", type="boxplot")
We can check that the distribution used to cluster is relevant.
# Empirical and theoretical distributions (to check that clustering is pertinent)
plot(out, y="MaxHeartRate", type="cdf")
Interpretation of a categorical variable is based on its distribution per cluster.
# Summary of categorical variable
plot(out, y="Sex")
Interpretation of the cluster overlaps by using the probabilities of missclassification.
# Summary of the probabilities of missclassification
plot(out, type="probs-class")
Partition among the observations
# Estimated partition
fitted(out)
[1] 1 1 2 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 2 1 2 2 2 2 2 1 2 1 1 1 1 2 1 1
[36] 1 1 1 2 2 2 2 2 2 1 2 1 2 1 1 1 2 2 2 2 2 1 1 1 1 2 1 2 2 1 1 2 2 2 2
[71] 1 2 2 1 1 1 1 2 2 2 1 1 1 2 1 2 2 1 2 1 2 2 1 1 2 1 1 1 1 2 2 1 2 1 1
[106] 1 1 1 1 2 1 2 2 2 2 2 2 1 1 1 1 1 1 2 2 2 1 2 2 1 1 1 2 1 2 2 2 1 2 1
[141] 1 2 1 1 2 1 2 1 2 2 2 2 2 1 2 2 1 2 1 1 1 1 2 1 2 1 2 2 1 1 1 1 1 1 2
[176] 1 1 2 1 2 2 1 2 1 2 2 1 1 2 1 2 1 2 2 2 2 1 2 1 1 1 1 1 1 1 2 2 1 1 2
[211] 1 2 2 1 2 2 2 1 1 2 1 1 2 1 2 1 1 1 2 1 1 1 2 1 1 1 2 1 2 2 1 2 1 1 2
[246] 1 1 2 2 1 1 2 2 1 2 2 1 1 2 2 2 1 2 2 2 2 2 2 1 1
Probabilities of classification for the first six observations
# Estimated probabilities of classification
head(fitted(out, type="probability"))
class-1 class-2
[1,] 0.9999917 8.261741e-06
[2,] 0.6334640 3.665360e-01
[3,] 0.1755305 8.244695e-01
[4,] 1.0000000 4.443322e-08
[5,] 0.9961152 3.884823e-03
[6,] 0.9547823 4.521769e-02
# Probabilities of classification for new observations
predict(out, newdata = heart[c(8,12,65),-13])
class-1 class-2
[1,] 0.9997334 2.665880e-04
[2,] 0.9999472 5.278345e-05
[3,] 1.0000000 1.378215e-11
Missing values can be imputed.
# Imputation by posterior mean for the first observation
not.imputed <- heart[1:2,-13]
imputed <- VarSelImputation(out, heart[1:2,-13], method = "sampling")
rbind(not.imputed, imputed)
Age Sex ChestPainType RestBloodPressure SerumCholestoral
1 NA 1 4 130 322
2 67 0 3 115 564
3 54 1 4 130 322
4 67 0 3 115 564
FastingBloodSugar ResElectrocardiographic MaxHeartRate ExerciseInduced
1 0 2 109 0
2 0 2 160 0
3 0 2 109 0
4 0 2 160 0
Slope MajorVessels Thal
1 2 3 3
2 2 0 7
3 2 3 3
4 2 0 7