The introduction of high dimensional technologies in molecular
biology, which perform genome-scale measurements on DNA, RNA, or
protein, has been accompanied by more extensive use of “dimension
reduction” (DR) algorithms that promote visualization of important
aspects of the data in just two dimensions. However, it is critical to
understand the extent to which different DR algorithms distort the
properties of the data set that are present in the original high
dimensional space. The Preservation R package collects
together some metrics to aid our understanding.
We need some data in order to illustrate the preservation metrics. Here we use a subset of the CAVA CD4 T-cell single-cell CITE-Seq data set [1]. This data set was previously identified as a single coherent cluster in the much larger complete data set. It contains measurements of the expression of 51 proteins in 3096 individual cells. We also sub-clustered this subset into four parts to help with visualization.
## [1] 3096 51
## colvec
## #69DE69 #9FC7FF #C882D9 #FFA4A4
## 812 596 641 1047
Next, we applied four popular dimension reduction methods:
The first two DR algorithms (MDS and TSNE) were computed using the
Rdimtools package; UMAP, using the umap
package; and DM, using the diffusionMap package.
## MDS TSNE UMAP DM
## [1,] 3096 3096 3096 3096
## [2,] 2 2 2 2
Here is an initial plot showing the dimension reduction results.
opar <- par(mfrow = c(2,2))
for (tag in names(drs)) {
plot(drs[[tag]], col = colvec, pch = 16, main = tag, xlab = "X1", ylab = "X2")
}We first note that MDS and DM produce essentially the same plot, after recognizing that DM has switched the (arbitrary) signs of both components. By contrast, both UMAP and TSNE have tried to place the purple group and half of the blue group in a separate cluster. The TSNE algorithm is slightly more extreme, since it also seems to have “flipped” that new cluster so that the two halves of the blue group are further apart.
Some DR algorithms are designed to try to preserve the pairwise distances between points in the data set. MDS is the most explicit example of such an algorithm, since its defined goal is to preserve the average of these distances as well as possible.
The Preservation package includes six metrics to assess
how well pairwise distances are preserved. Here we apply five of these
metrics to the four different DR views of the CAVA dataset. (We omit
Spearman’s Rho, since it is somewhat slower than the others.)
hiD <- dist(prot5)
loD <- lapply(drs, dist)
suppressWarnings(
scores <- sapply(c("MilnorDistortion", "SigmaDistortion", "Stress",
"EarthMover"), function(metric) {
cat(metric, "\n", file = stderr())
FUN <- get(metric)
sapply(names(drs), function(tag) {
FUN(hiD, loD[[tag]])
})
})
)
m1d <- sapply(names(drs), function(tag) {
M1Distortion(prot5, drs[[tag]])
})
scores <- cbind(scores, M1Distortion = m1d)
round(scores, 3)## MilnorDistortion SigmaDistortion Stress EarthMover M1Distortion
## MDS 8.353 0.378 0.569 0.012 0.750
## TSNE 8.609 0.455 0.666 0.025 0.853
## UMAP 9.234 0.696 0.497 0.051 0.530
## DM 8.139 0.963 0.983 1.016 1.000
## MilnorDistortion SigmaDistortion Stress EarthMover
## 0 0 0 0
## M1Distortion
## 0
DM is the best DR method based on the Milnor distortion; UMAP is best by Stress or Spearman’s Rho (not shown); MDS is best based on the Sigma distortion, M1 distortion, or Earth Mover’s Distance. These findings probably tells us more about how well the metrics assess their desired goal than about how well the DR algorithms perform.
Method like TSNE and UMAP are inherently non-linear, and their
heuristic is to try to preserve local neighborhoods of points rather
than all pairwise distances. The Preservation package
includes several metrics to assess how well neighborhoods are preserved.
These metrics are naturally functions of the number, \(k\), of nearest neighbors defining a
neighborhood.
ajd <-sapply(names(drs), function(tag) {
AverageJaccardDistance(prot5, drs[[tag]], k = 50)
})
scores <- cbind(scores, ajd)
# too slow: qnx <- QNX(hiD, loD[["MDS"]])
quad <- c(MDS = "#B00068", TSNE = "#1C8356", UMAP = "#325A9B", DM = "#FEAF16" )
ct <- lapply(names(drs), function(tag) {
Preservation:::compute_ct_list(prot5, drs[[tag]])
})
avtrust <- sapply(ct, function(X) X[,3])
colnames(avtrust) <- names(drs)
avcont <- sapply(ct, function(X) X[,6])
colnames(avcont) <- names(drs)opar <- par(mfrow = c(1,2))
plot(avtrust[,1], type = "n", ylim = c(0.1, 0.3), xlab = "Neighbors", ylab = "Trustworthiness")
for (tag in names(drs)) lines(avtrust[, tag], lwd = 2, col = quad[tag])
legend("topright", names(drs), col = quad, lwd = 2)
plot(avcont[,1], type = "n", ylim = c(0, 0.25), xlab = "Neighbors", ylab = "Continuity")
for (tag in names(drs)) lines(avcont[, tag], lwd = 2, col = quad[tag])
legend("topleft", names(drs), col = quad, lwd = 2)We now want to see how well DR algorithms preserve paths. We define a path to be an ordered set of points in a data set. As an example, we computed a path in the original high (51-)dimensional space that closely tracked the first principal component. So, in the original space, the path is a close approximation to a line segment that crosses the full data set. The path is represented by 301 individual cells (indicated by their indices as rows in the data set).
## [1] 310
## [1] 310 51
Here is a plot showing the dimension reductions with the overlaid image of the PC1 path.
opar <- par(mfrow = c(2,2))
for (tag in names(drs)) {
lowD <- drs[[tag]][idex,]
plot(drs[[tag]], col = colvec, pch = 16, main = tag, xlab = "X1", ylab = "X2")
lines(lowD, lwd = 2)
}Again, the paths in the MDS and DM reductions jitter about the first component axis. In the UMAP and TSNE reductions, the path seems to do a lot of jumping between the two pseudo-clusters in the plot. In the TSNE plot, there are also numerous jumps between the two halves of the blue group.
scores <- sapply(c("LengthDistortion", "SegmentVariance", "Curvature", "SpatialSimilarity"),
function(metric) {
FUN <- get(metric)
sapply(names(drs), function(tag) {
lowD <- drs[[tag]][idex,]
FUN(highD, lowD)
})
})
round(scores, 3)## LengthDistortion SegmentVariance Curvature SpatialSimilarity
## MDS 1.587 -0.520 -0.004 -0.649
## TSNE 1.184 -1.501 0.001 -0.810
## UMAP 0.671 -2.632 0.010 -0.493
## DM 5.128 6.557 -0.021 -1.619
## LengthDistortion SegmentVariance Curvature SpatialSimilarity
## 0 0 0 0
Each of the four metrics we have looked at compares the measurements on original and reduced data sets by computing their log ratio (which is why the best score is the one nearest zero). Here are variants that use Spearman’s rank correlation instead, where te best score would be the score closest to one. here length and curvature are topped by the UMAP and TSNE methods, while spatial imilarity is better with the DM and MDS algorithms.
rscores <- sapply(c("RankLengthDistortion", "RankCurvature", "RankSpatialSimilarity"),
function(metric) {
FUN <- get(metric)
sapply(names(drs), function(tag) {
lowD <- drs[[tag]][idex,]
FUN(highD, lowD)
})
})
round(rscores, 3)## RankLengthDistortion RankCurvature RankSpatialSimilarity
## MDS 0.265 0.247 0.835
## TSNE 0.480 0.371 0.629
## UMAP 0.477 0.330 0.654
## DM 0.316 0.258 0.836
We find it surprising that UMAP and TSNE rank so highly under many of these metrics. Perhaps we need to look more broadly to find metrics that give results that match our expectations?
One idea is to look at a smooth version of the path in the reduced spaces. Here is a plot showing the dimension reductions with the overlaid image of smoothed PC1 path.
smoosh <- function( coords) {
ticks <- 1:nrow(coords) # arbitrary parametrization
## fit smooth curve to each coordinate as a function of the parameter
fitted <- apply(coords, 2, function(X) predict(loess(X ~ ticks, span = 0.1)))
delta <- fitted - coords # difference between actual path and smooth fitted curve
val <- list(coords = coords, fitted = fitted, delta = delta)
class(val) <- "smoosh"
val
}
fits <- lapply(names(drs), function(tag) smoosh(drs[[tag]][idex,]))
names(fits) <- names(drs)
plot.smoosh <- function(obj, main = "", col = "purple", ...) {
plot(obj$coords, main = main, type = "b", col = "grey80", ...)
lines(obj$fitted, col = col, lwd = 2)
invisible(main)
}
lines.smoosh <- function(obj, col = "purple") {
lines(obj$fitted, col = col, lwd = 2)
}
opar <- par(mfrow = c(2, 2))
for (tag in names(fits)) {
plot(fits[[tag]], main=tag, col = "black", xlab = "X1", ylab = "X2")
points(drs[[tag]], col = colvec, pch = ".", cex=3)
lines(fits[[tag]], col = "black")
}mscores <- sapply(c("Smoothness", "Coil", "Contact", "EndpointDisplacement"),
function(metric) {
cat(metric, "\n", file = stderr())
FUN <- get(metric)
sapply(names(drs), function(tag) {
lowD <- drs[[tag]][idex,]
FUN(highD, lowD)
})
})
round(mscores, 3)## Smoothness Coil Contact EndpointDisplacement
## MDS 1.595 -0.314 0.729 0.017
## TSNE 1.077 0.352 0.060 1.157
## UMAP 0.620 -0.470 0.174 0.318
## DM 5.142 2.908 0.855 3.250
## MDS TSNE UMAP DM
## LengthDistortion 1.587 1.184 0.671 5.128
## SegmentVariance -0.520 -1.501 -2.632 6.557
## Curvature -0.004 0.001 0.010 -0.021
## SpatialSimilarity -0.649 -0.810 -0.493 -1.619
## RankLengthDistortion 0.265 0.480 0.477 0.316
## RankCurvature 0.247 0.371 0.330 0.258
## RankSpatialSimilarity 0.835 0.629 0.654 0.836
## Smoothness 1.595 1.077 0.620 5.142
## Coil -0.314 0.352 -0.470 2.908
## Contact 0.729 0.060 0.174 0.855
## EndpointDisplacement 0.017 1.157 0.318 3.250
This computation was performed using the following software tools.
## R version 4.6.0 Patched (2026-05-01 r89993)
## Platform: x86_64-pc-linux-gnu
## Running under: Debian GNU/Linux 13 (trixie)
##
## Matrix products: default
## BLAS: /srv/R/R-patched/build.26-05-02/lib/libRblas.so
## LAPACK: /srv/R/R-patched/build.26-05-02/lib/libRlapack.so; LAPACK version 3.12.1
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Vienna
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Preservation_0.1.0
##
## loaded via a namespace (and not attached):
## [1] magic_1.6-1 tidyr_1.3.2
## [3] plotly_4.12.0 sass_0.4.10
## [5] generics_0.1.4 digest_0.6.39
## [7] magrittr_2.0.5 evaluate_1.0.5
## [9] grid_4.6.0 RColorBrewer_1.1-3
## [11] fastmap_1.2.0 jsonlite_2.0.0
## [13] DRquality_0.2.1 DatabionicSwarm_2.0.0
## [15] promises_1.5.0 httr_1.4.8
## [17] purrr_1.2.2 viridisLite_0.4.3
## [19] scales_1.4.0 GeneralizedUmatrix_1.3.1
## [21] lazyeval_0.2.3 jquerylib_0.1.4
## [23] abind_1.4-8 cli_3.6.6
## [25] shiny_1.13.0 rlang_1.2.0
## [27] emdist_0.3-3 shinythemes_1.2.0
## [29] cachem_1.1.0 yaml_2.3.12
## [31] otel_0.2.0 FNN_1.1.4.1
## [33] coRanking_0.2.5 geometry_0.5.2
## [35] tools_4.6.0 deldir_2.0-4
## [37] dplyr_1.2.1 ggplot2_4.0.3
## [39] ProjectionBasedClustering_1.2.2 httpuv_1.6.17
## [41] vctrs_0.7.3 R6_2.6.1
## [43] mime_0.13 lifecycle_1.0.5
## [45] htmlwidgets_1.6.4 shinyjs_2.1.1
## [47] pkgconfig_2.0.3 RcppParallel_5.1.11-2
## [49] pillar_1.11.1 bslib_0.10.0
## [51] later_1.4.8 gtable_0.3.6
## [53] glue_1.8.1 data.table_1.18.2.1
## [55] Rcpp_1.1.1-1.1 xfun_0.57
## [57] tibble_3.3.1 tidyselect_1.2.1
## [59] knitr_1.51 farver_2.1.2
## [61] xtable_1.8-8 htmltools_0.5.9
## [63] rmarkdown_2.31 compiler_4.6.0
## [65] S7_0.2.2