The introduction of high dimensional technologies in molecular
biology, which perform genome-scale measurements on DNA, RNA, or
protein, has been accompanied by more extensive use of “dimension
reduction” (DR) algorithms that promote visualization of important
aspects of the data in just two dimensions. However, it is critical to
understand the extent to which different DR algorithms distort the
properties of the data set that are present in the original high
dimensional space. The Preservation R package collects
together some metrics to aid our understanding.
We need some data in order to illustrate the preservation metrics. Here we use a subset of the CAVA CD4 T-cell single-cell CITE-Seq data set [1]. This data set was previously identified as a single coherent cluster in the much larger complete data set. It contains measurements of the expression of 51 proteins in 3096 individual cells. We also sub-clustered this subset into four parts to help with visualization.
## [1] 3096 51
## colvec
## #69DE69 #9FC7FF #C882D9 #FFA4A4
## 812 596 641 1047
Next, we applied four popular dimension reduction methods:
The first two DR algorithms (MDS and TSNE) were computed using the
Rdimtools package; UMAP, using the umap
package; and DM, using the diffusionMap package.
## MDS TSNE UMAP DM
## [1,] 3096 3096 3096 3096
## [2,] 2 2 2 2
Here is an initial plot showing the dimension reduction results.
opar <- par(mfrow = c(2,2))
for (tag in names(drs)) {
plot(drs[[tag]], col = colvec, pch = 16, main = tag, xlab = "X1", ylab = "X2")
}We first note that MDS and DM produce essentially the same plot, after recognizing that DM has switched the (arbitrary) signs of both components. By contrast, both UMAP and TSNE have tried to place the purple group and half of the blue group in a separate cluster. The TSNE algorithm is slightly more extreme, since it also seems to have “flipped” that new cluster so that the two halves of the blue group are further apart.
Some DR algorithms are designed to try to preserve the pairwise distances between points in the data set. MDS is the most explicit example of such an algorithm, since its defined goal is to preserve the average of these distances as well as possible.
The Preservation package includes six metrics to assess
how well pairwise distances are preserved. Here we apply five of these
metrics to the four different DR views of the CAVA dataset. (We omit
Spearman’s Rho, since it is somewhat slower than the others.)
hiD <- dist(prot5)
loD <- lapply(drs, dist)
suppressWarnings(
scores <- sapply(c("MilnorDistortion", "SigmaDistortion", "Stress",
"EarthMover"), function(metric) {
cat(metric, "\n", file = stderr())
FUN <- get(metric)
sapply(names(drs), function(tag) {
FUN(hiD, loD[[tag]])
})
})
)
m1d <- sapply(names(drs), function(tag) {
M1Distortion(prot5, drs[[tag]])
})
scores <- cbind(scores, M1Distortion = m1d)
round(scores, 3)## MilnorDistortion SigmaDistortion Stress EarthMover M1Distortion
## MDS 8.353 0.378 0.569 0.012 0.750
## TSNE 8.609 0.455 0.666 0.025 0.853
## UMAP 9.234 0.696 0.497 0.051 0.530
## DM 8.139 0.963 0.983 1.016 1.000
## MilnorDistortion SigmaDistortion Stress EarthMover
## 0 0 0 0
## M1Distortion
## 0
DM is the best DR method based on the Milnor distortion; UMAP is best by Stress or Spearman’s Rho (not shown); MDS is best based on the Sigma distortion, M1 distortion, or Earth Mover’s Distance. These findings probably tells us more about how well the metrics assess their desired goal than about how well the DR algorithms perform.
Method like TSNE and UMAP are inherently non-linear, and their
heuristic is to try to preserve local neighborhoods of points rather
than all pairwise distances. The Preservation package
includes several metrics to assess how well neighborhoods are preserved.
These metrics are naturally functions of the number, \(k\), of nearest neighbors defining a
neighborhood.
ajd <- AverageJaccardDistance(prot5, drs[["MDS"]], k = 50)
plot(ajd, type = "b", pch = 16, lwd = 2)
abline(h = mean(ajd))# too slow: qnx <- QNX(hiD, loD[["MDS"]])
ct <- Preservation:::compute_ct_list(prot5, drs[["MDS"]])
plot(ct[, 3], main = "Average Trustworthiness")We now want to see how well DR algorithms preserve paths. We define a path to be an ordered set of points in a data set. As an example, we computed a path in the original high (51-)dimensional space that closely tracked the first principal component. So, in the original space, the path is a close approximation to a line segment that crosses the full data set. The path is represented by 301 individual cells (indicated by their indices as rows in the data set).
## [1] 310
## [1] 310 51
Here is a plot showing the dimension reductions with the overlaid image of the PC1 path.
opar <- par(mfrow = c(2,2))
for (tag in names(drs)) {
lowD <- drs[[tag]][idex,]
plot(drs[[tag]], col = colvec, pch = 16, main = tag, xlab = "X1", ylab = "X2")
lines(lowD, lwd = 2)
}Again, the paths in the MDS and DM reductions jitter about the first component axis. In the UMAP and TSNE reductions, the path seems to do a lot of jumping between the two pseudo-clusters in the plot. In the TSNE plot, there are also numerous jumps between the two halves of the blue group.
scores <- sapply(c("LengthDistortion", "SegmentVariance", "Curvature", "SpatialSimilarity"),
function(metric) {
cat(metric, "\n", file = stderr())
FUN <- get(metric)
sapply(names(drs), function(tag) {
lowD <- drs[[tag]][idex,]
FUN(highD, lowD)
})
})
round(scores, 3)## LengthDistortion SegmentVariance Curvature SpatialSimilarity
## MDS 1.587 -0.520 -0.004 -0.649
## TSNE 1.184 -1.501 0.001 -0.810
## UMAP 0.671 -2.632 0.010 -0.493
## DM 5.128 6.557 -0.021 -1.619
## LengthDistortion SegmentVariance Curvature SpatialSimilarity
## 0 0 0 0
This computation was performed using the following software tools.
## R version 4.6.0 Patched (2026-05-01 r89993)
## Platform: x86_64-pc-linux-gnu
## Running under: Debian GNU/Linux 13 (trixie)
##
## Matrix products: default
## BLAS: /srv/R/R-patched/build.26-05-02/lib/libRblas.so
## LAPACK: /srv/R/R-patched/build.26-05-02/lib/libRlapack.so; LAPACK version 3.12.1
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Vienna
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Preservation_0.0.11
##
## loaded via a namespace (and not attached):
## [1] magic_1.6-1 tidyr_1.3.2
## [3] plotly_4.12.0 sass_0.4.10
## [5] generics_0.1.4 digest_0.6.39
## [7] magrittr_2.0.5 evaluate_1.0.5
## [9] grid_4.6.0 RColorBrewer_1.1-3
## [11] fastmap_1.2.0 jsonlite_2.0.0
## [13] DRquality_0.2.1 DatabionicSwarm_2.0.0
## [15] promises_1.5.0 httr_1.4.8
## [17] purrr_1.2.2 viridisLite_0.4.3
## [19] scales_1.4.0 GeneralizedUmatrix_1.3.1
## [21] lazyeval_0.2.3 jquerylib_0.1.4
## [23] abind_1.4-8 cli_3.6.6
## [25] shiny_1.13.0 rlang_1.2.0
## [27] emdist_0.3-3 shinythemes_1.2.0
## [29] cachem_1.1.0 yaml_2.3.12
## [31] otel_0.2.0 FNN_1.1.4.1
## [33] coRanking_0.2.5 geometry_0.5.2
## [35] tools_4.6.0 deldir_2.0-4
## [37] dplyr_1.2.1 ggplot2_4.0.3
## [39] ProjectionBasedClustering_1.2.2 httpuv_1.6.17
## [41] vctrs_0.7.3 R6_2.6.1
## [43] mime_0.13 lifecycle_1.0.5
## [45] htmlwidgets_1.6.4 shinyjs_2.1.1
## [47] pkgconfig_2.0.3 RcppParallel_5.1.11-2
## [49] pillar_1.11.1 bslib_0.10.0
## [51] later_1.4.8 gtable_0.3.6
## [53] glue_1.8.1 data.table_1.18.2.1
## [55] Rcpp_1.1.1-1.1 xfun_0.57
## [57] tibble_3.3.1 tidyselect_1.2.1
## [59] knitr_1.51 farver_2.1.2
## [61] xtable_1.8-8 htmltools_0.5.9
## [63] rmarkdown_2.31 compiler_4.6.0
## [65] S7_0.2.2