Metrics to Assess How Well Dimension Reduction Methods Preserve Data Set Properties

Kevin R. Coombes and Polina Bombina

Introduction

The introduction of high dimensional technologies in molecular biology, which perform genome-scale measurements on DNA, RNA, or protein, has been accompanied by more extensive use of “dimension reduction” (DR) algorithms that promote visualization of important aspects of the data in just two dimensions. However, it is critical to understand the extent to which different DR algorithms distort the properties of the data set that are present in the original high dimensional space. The Preservation R package collects together some metrics to aid our understanding.

Preparation

Packages

As a first step, we must load the Preservation library package.

library(Preservation)

Sample Data

We need some data in order to illustrate the preservation metrics. Here we use a subset of the CAVA CD4 T-cell single-cell CITE-Seq data set [1]. This data set was previously identified as a single coherent cluster in the much larger complete data set. It contains measurements of the expression of 51 proteins in 3096 individual cells. We also sub-clustered this subset into four parts to help with visualization.

data("prot5")
dim(prot5)
## [1] 3096   51
table(colvec)
## colvec
## #69DE69 #9FC7FF #C882D9 #FFA4A4 
##     812     596     641    1047

Next, we applied four popular dimension reduction methods:

  1. Multi-Dimensional Scaling (MDS), which is equivalent to principal components analysis (OCA) when using Euclidean distance.
  2. t-Stochastic Neighbor Embedding (t-SNE or TSNE).
  3. Uniform Manifold Approximation and Projection (UMAP).
  4. Diffusion Maps (DM).

The first two DR algorithms (MDS and TSNE) were computed using the Rdimtools package; UMAP, using the umap package; and DM, using the diffusionMap package.

data("drs")
sapply(drs, dim)
##       MDS TSNE UMAP   DM
## [1,] 3096 3096 3096 3096
## [2,]    2    2    2    2

Here is an initial plot showing the dimension reduction results.

opar <- par(mfrow = c(2,2))
for (tag in names(drs)) {
  plot(drs[[tag]], col = colvec, pch = 16, main = tag, xlab = "X1", ylab = "X2")
}
Four views of the CAVA subset after dimension reduction.
Four views of the CAVA subset after dimension reduction.
par(opar)

We first note that MDS and DM produce essentially the same plot, after recognizing that DM has switched the (arbitrary) signs of both components. By contrast, both UMAP and TSNE have tried to place the purple group and half of the blue group in a separate cluster. The TSNE algorithm is slightly more extreme, since it also seems to have “flipped” that new cluster so that the two halves of the blue group are further apart.

Preservation of Pairwise Distances

Some DR algorithms are designed to try to preserve the pairwise distances between points in the data set. MDS is the most explicit example of such an algorithm, since its defined goal is to preserve the average of these distances as well as possible.

The Preservation package includes six metrics to assess how well pairwise distances are preserved. Here we apply five of these metrics to the four different DR views of the CAVA dataset. (We omit Spearman’s Rho, since it is somewhat slower than the others.)

hiD <- dist(prot5)
loD <- lapply(drs, dist)
suppressWarnings(
  scores <- sapply(c("MilnorDistortion", "SigmaDistortion", "Stress", 
                     "EarthMover"), function(metric) {
    cat(metric, "\n", file = stderr())
    FUN <- get(metric)
    sapply(names(drs), function(tag) {
      FUN(hiD, loD[[tag]])
    })
  })
)
m1d <- sapply(names(drs), function(tag) {
  M1Distortion(prot5, drs[[tag]])
})
scores <- cbind(scores, M1Distortion = m1d)
round(scores, 3)
##      MilnorDistortion SigmaDistortion Stress EarthMover M1Distortion
## MDS             8.353           0.378  0.569      0.012        0.750
## TSNE            8.609           0.455  0.666      0.025        0.853
## UMAP            9.234           0.696  0.497      0.051        0.530
## DM              8.139           0.963  0.983      1.016        1.000
Preservation:::PWBest(colnames(scores))
## MilnorDistortion  SigmaDistortion           Stress       EarthMover 
##                0                0                0                0 
##     M1Distortion 
##                0

DM is the best DR method based on the Milnor distortion; UMAP is best by Stress or Spearman’s Rho (not shown); MDS is best based on the Sigma distortion, M1 distortion, or Earth Mover’s Distance. These findings probably tells us more about how well the metrics assess their desired goal than about how well the DR algorithms perform.

Neighborhood Preservation

Method like TSNE and UMAP are inherently non-linear, and their heuristic is to try to preserve local neighborhoods of points rather than all pairwise distances. The Preservation package includes several metrics to assess how well neighborhoods are preserved. These metrics are naturally functions of the number, \(k\), of nearest neighbors defining a neighborhood.

ajd <-sapply(names(drs), function(tag) {
  AverageJaccardDistance(prot5, drs[[tag]], k = 50)
})
scores <- cbind(scores, ajd)
# too slow: qnx <- QNX(hiD, loD[["MDS"]])
quad <- c(MDS = "#B00068", TSNE = "#1C8356", UMAP = "#325A9B", DM = "#FEAF16" )
ct <- lapply(names(drs), function(tag) {
  Preservation:::compute_ct_list(prot5, drs[[tag]])
})
avtrust <- sapply(ct, function(X) X[,3])
colnames(avtrust) <- names(drs)

avcont <- sapply(ct, function(X) X[,6])
colnames(avcont) <- names(drs)
opar <- par(mfrow = c(1,2))
plot(avtrust[,1], type = "n", ylim = c(0.1, 0.3), xlab = "Neighbors", ylab = "Trustworthiness")
for (tag in names(drs)) lines(avtrust[, tag], lwd = 2, col = quad[tag])
legend("topright", names(drs), col = quad, lwd = 2)

plot(avcont[,1], type = "n", ylim = c(0, 0.25), xlab = "Neighbors", ylab = "Continuity")
for (tag in names(drs)) lines(avcont[, tag], lwd = 2, col = quad[tag])
legend("topleft", names(drs), col = quad, lwd = 2)

par(opar)

Path Preservation

We now want to see how well DR algorithms preserve paths. We define a path to be an ordered set of points in a data set. As an example, we computed a path in the original high (51-)dimensional space that closely tracked the first principal component. So, in the original space, the path is a close approximation to a line segment that crosses the full data set. The path is represented by 301 individual cells (indicated by their indices as rows in the data set).

data("path-idex")
length(idex)
## [1] 310
highD <- prot5[idex,]
dim(highD)
## [1] 310  51

Here is a plot showing the dimension reductions with the overlaid image of the PC1 path.

opar <- par(mfrow = c(2,2))
for (tag in names(drs)) {
  lowD <- drs[[tag]][idex,]
  plot(drs[[tag]], col = colvec, pch = 16, main = tag, xlab = "X1", ylab = "X2")
  lines(lowD, lwd = 2)
}
Four views of the path of the first principal components of the CAVA subset after dimension reduction.
Four views of the path of the first principal components of the CAVA subset after dimension reduction.
par(opar)

Again, the paths in the MDS and DM reductions jitter about the first component axis. In the UMAP and TSNE reductions, the path seems to do a lot of jumping between the two pseudo-clusters in the plot. In the TSNE plot, there are also numerous jumps between the two halves of the blue group.

scores <- sapply(c("LengthDistortion", "SegmentVariance", "Curvature", "SpatialSimilarity"),
                 function(metric) {
  FUN <- get(metric)
  sapply(names(drs), function(tag) {
    lowD <- drs[[tag]][idex,]
    FUN(highD, lowD)
  })
})
round(scores, 3)
##      LengthDistortion SegmentVariance Curvature SpatialSimilarity
## MDS             1.587          -0.520    -0.004            -0.649
## TSNE            1.184          -1.501     0.001            -0.810
## UMAP            0.671          -2.632     0.010            -0.493
## DM              5.128           6.557    -0.021            -1.619
Preservation:::PathBest(colnames(scores))
##  LengthDistortion   SegmentVariance         Curvature SpatialSimilarity 
##                 0                 0                 0                 0

Each of the four metrics we have looked at compares the measurements on original and reduced data sets by computing their log ratio (which is why the best score is the one nearest zero). Here are variants that use Spearman’s rank correlation instead, where te best score would be the score closest to one. here length and curvature are topped by the UMAP and TSNE methods, while spatial imilarity is better with the DM and MDS algorithms.

rscores <- sapply(c("RankLengthDistortion", "RankCurvature", "RankSpatialSimilarity"),
                 function(metric) {
  FUN <- get(metric)
  sapply(names(drs), function(tag) {
    lowD <- drs[[tag]][idex,]
    FUN(highD, lowD)
  })
})
round(rscores, 3)
##      RankLengthDistortion RankCurvature RankSpatialSimilarity
## MDS                 0.265         0.247                 0.835
## TSNE                0.480         0.371                 0.629
## UMAP                0.477         0.330                 0.654
## DM                  0.316         0.258                 0.836
scores <- cbind(scores, rscores)

We find it surprising that UMAP and TSNE rank so highly under many of these metrics. Perhaps we need to look more broadly to find metrics that give results that match our expectations?

One idea is to look at a smooth version of the path in the reduced spaces. Here is a plot showing the dimension reductions with the overlaid image of smoothed PC1 path.

smoosh <- function( coords) {
  ticks <- 1:nrow(coords) # arbitrary parametrization
  ## fit smooth curve to each coordinate as a function of the parameter
  fitted <- apply(coords, 2, function(X) predict(loess(X ~ ticks, span = 0.1)))
  delta <- fitted - coords # difference between actual path and smooth fitted curve
  val <- list(coords = coords, fitted = fitted, delta = delta)
  class(val) <- "smoosh"
  val
}
fits <- lapply(names(drs), function(tag) smoosh(drs[[tag]][idex,]))
names(fits) <- names(drs)
plot.smoosh <- function(obj, main = "", col = "purple", ...) {
  plot(obj$coords, main = main, type = "b", col = "grey80", ...)
  lines(obj$fitted, col = col, lwd = 2)
  invisible(main)
}
lines.smoosh <- function(obj, col = "purple") {
  lines(obj$fitted, col = col, lwd = 2)
}
opar <- par(mfrow = c(2, 2))
for (tag in names(fits)) {
  plot(fits[[tag]], main=tag, col = "black", xlab = "X1", ylab = "X2")
  points(drs[[tag]], col = colvec, pch = ".", cex=3)
  lines(fits[[tag]], col = "black")
}
Four views of the smoothed PC1 path of the CAVA subset after dimension reduction.
Four views of the smoothed PC1 path of the CAVA subset after dimension reduction.
par(opar)
mscores <- sapply(c("Smoothness", "Coil", "Contact", "EndpointDisplacement"),
                 function(metric) {
  cat(metric, "\n", file = stderr())
  FUN <- get(metric)
  sapply(names(drs), function(tag) {
    lowD <- drs[[tag]][idex,]
    FUN(highD, lowD)
  })
})
round(mscores, 3)
##      Smoothness   Coil Contact EndpointDisplacement
## MDS       1.595 -0.314   0.729                0.017
## TSNE      1.077  0.352   0.060                1.157
## UMAP      0.620 -0.470   0.174                0.318
## DM        5.142  2.908   0.855                3.250
scores <- cbind(scores, mscores)
round(t(scores), 3)
##                          MDS   TSNE   UMAP     DM
## LengthDistortion       1.587  1.184  0.671  5.128
## SegmentVariance       -0.520 -1.501 -2.632  6.557
## Curvature             -0.004  0.001  0.010 -0.021
## SpatialSimilarity     -0.649 -0.810 -0.493 -1.619
## RankLengthDistortion   0.265  0.480  0.477  0.316
## RankCurvature          0.247  0.371  0.330  0.258
## RankSpatialSimilarity  0.835  0.629  0.654  0.836
## Smoothness             1.595  1.077  0.620  5.142
## Coil                  -0.314  0.352 -0.470  2.908
## Contact                0.729  0.060  0.174  0.855
## EndpointDisplacement   0.017  1.157  0.318  3.250

Appendix

This computation was performed using the following software tools.

sessionInfo()
## R version 4.6.0 Patched (2026-05-01 r89993)
## Platform: x86_64-pc-linux-gnu
## Running under: Debian GNU/Linux 13 (trixie)
## 
## Matrix products: default
## BLAS:   /srv/R/R-patched/build.26-05-02/lib/libRblas.so 
## LAPACK: /srv/R/R-patched/build.26-05-02/lib/libRlapack.so;  LAPACK version 3.12.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Europe/Vienna
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] Preservation_0.1.0
## 
## loaded via a namespace (and not attached):
##  [1] magic_1.6-1                     tidyr_1.3.2                    
##  [3] plotly_4.12.0                   sass_0.4.10                    
##  [5] generics_0.1.4                  digest_0.6.39                  
##  [7] magrittr_2.0.5                  evaluate_1.0.5                 
##  [9] grid_4.6.0                      RColorBrewer_1.1-3             
## [11] fastmap_1.2.0                   jsonlite_2.0.0                 
## [13] DRquality_0.2.1                 DatabionicSwarm_2.0.0          
## [15] promises_1.5.0                  httr_1.4.8                     
## [17] purrr_1.2.2                     viridisLite_0.4.3              
## [19] scales_1.4.0                    GeneralizedUmatrix_1.3.1       
## [21] lazyeval_0.2.3                  jquerylib_0.1.4                
## [23] abind_1.4-8                     cli_3.6.6                      
## [25] shiny_1.13.0                    rlang_1.2.0                    
## [27] emdist_0.3-3                    shinythemes_1.2.0              
## [29] cachem_1.1.0                    yaml_2.3.12                    
## [31] otel_0.2.0                      FNN_1.1.4.1                    
## [33] coRanking_0.2.5                 geometry_0.5.2                 
## [35] tools_4.6.0                     deldir_2.0-4                   
## [37] dplyr_1.2.1                     ggplot2_4.0.3                  
## [39] ProjectionBasedClustering_1.2.2 httpuv_1.6.17                  
## [41] vctrs_0.7.3                     R6_2.6.1                       
## [43] mime_0.13                       lifecycle_1.0.5                
## [45] htmlwidgets_1.6.4               shinyjs_2.1.1                  
## [47] pkgconfig_2.0.3                 RcppParallel_5.1.11-2          
## [49] pillar_1.11.1                   bslib_0.10.0                   
## [51] later_1.4.8                     gtable_0.3.6                   
## [53] glue_1.8.1                      data.table_1.18.2.1            
## [55] Rcpp_1.1.1-1.1                  xfun_0.57                      
## [57] tibble_3.3.1                    tidyselect_1.2.1               
## [59] knitr_1.51                      farver_2.1.2                   
## [61] xtable_1.8-8                    htmltools_0.5.9                
## [63] rmarkdown_2.31                  compiler_4.6.0                 
## [65] S7_0.2.2