We want to illustrate the Mender
package (Version
0.3.12) with a clinical data set. Not surprisingly, we start by loading
the package.
library(Mender)
We also load several other useful packages (some of which may eventually get incorporated into the package requirements).
suppressMessages( library(Mercator) )
library(ClassDiscovery)
library(Polychrome)
data(Dark24)
data(Light24)
suppressMessages( library(igraph) )
suppressMessages( library("ape") )
suppressPackageStartupMessages( library(circlize) )
Now we fetch the sample clinical data set that is included with the package.
data(CLL)
ls()
## [1] "Dark24" "Light24" "clinical" "daisydist" "ripdiag"
dim(clinical)
## [1] 266 29
colnames(clinical)
## [1] "AgeAtDx" "Sex" "Race"
## [4] "mutation.status" "Light.chain.subtype" "Zap70Protein"
## [7] "Rai.Stage" "CatRAI" "Serum.beta.2.microglobulin"
## [10] "LogB2M" "CatB2M" "Serum.LDH"
## [13] "LogLDH" "White.blood.count" "LogWBC"
## [16] "CatWBC" "CatCD38" "Hypogammaglobulinemia"
## [19] "Massive.Splenomegaly" "Matutes" "Hemoglobin"
## [22] "Platelets" "Prolymphocytes" "stat13"
## [25] "stat11" "stat12" "stat17"
## [28] "Dohner" "Purity"
The clinical
object is a numeric matrix containing the
clinical data (binary values have been converted to 0-1; categorical
values to integers). The daisydist
is a distance matrix
(stored as a dist
object). Coombes and colleagues [J Biomed
Inform. 2021 Jun;118:103788] showed that this is the best way to measure
distances between mixed-type clinical data. The ripdiag
object is a “Rips diagram” produced by applying the TDA
algorithm to the daisy distances.
Here are some plots of the TDA
results using tools from
the original package. (I am not sure what any of these are really good
for.)
<- ripdiag[["diagram"]]
diag <- par(mfrow = c(1,2))
opar plot(diag, barcode = TRUE, main = "Barcode")
plot(diag, main = "Rips Diagram")
Figure 1 : The Rips barcode diagram from TDA.
par(opar)
rm(opar)
Now we use our Mercator
package to view the underlying
data.
<- Mercator(daisydist, metric = "daisy", method = "hclust", K = 8)
mercury <- addVisualization(mercury, "mds")
mercury <- addVisualization(mercury, "tsne")
mercury <- addVisualization(mercury, "umap")
mercury <- addVisualization(mercury, "som") mercury
<- par(mfrow = c(3,2), cex = 1.1)
opar plot(mercury, view = "hclust")
plot(mercury, view = "mds", main = "Mult-Dimensional Scaling")
plot(mercury, view = "tsne", main = "t-SNE")
plot(mercury, view = "umap", main = "UMAP")
barplot(mercury, main = "Silhouette Width")
plot(mercury, view = "som", main = "Self-Organizing Maps")
Figure 2 : Mercator Visualizations of the distance matrix.
par(opar)
rm(opar)
Here is a picture of the “zero-cycle” data, which can also be used ultimately to cluster the points (where each point is a patient). The connected lines are similar to a single-linkage clustering structure, showing when individual points are merged together as the TDA parameter increases.
<- sum(diag[, "dimension"] == 0)
nzero <- ripdiag[["cycleLocation"]][2:nzero]
cycles <- mercury@view[["umap"]]$layout
W plot(W, main = "Connected Zero Cycles")
for (cyc in cycles) {
points(W[cyc[1], , drop = FALSE], pch = 16,col = "red")
<- c(W[cyc[1], 1], W[cyc[2],1])
X <- c(W[cyc[1], 2], W[cyc[2],2])
Y lines(X, Y)
}
Figure 3 : Hierarchical connections between zero cycles.
We can convert the 0-dimensional cycle structure into a dendrogram,
by first passing them through the igraph
package. We start
by putting all the zero-cycle data together, which can be viewed as an
“edge-list” from the igraph
perspective.
<- t(do.call(cbind, cycles)) # this creates an "edgelist"
edges <- graph_from_edgelist(edges)
G <- set_vertex_attr(G, "label", value = attr(daisydist, "Labels")) G
Note that we attached the sample names to the graph, obtaining them from the daisy distance matrix. Now we use two different algorithms to decide how to layout the graph.
set.seed(2734)
<- layout_as_tree(G)
Lt <- layout_with_fr(G) L
<- par(mfrow = c(1,2), mai = c(0.01, 0.01, 1.02, 0.01))
opar plot(G, layout = Lt, main = "As Tree")
plot(G, layout = L, main = "Fruchterman-Reingold")
Figure 4 : Two igraph depictions of the zero cycle structure.
par(opar)
Note that the Fruchterman-Reingold layout gives the most informative structure.
There are a variety of community-finding algorithms that we can apply. (Communities in grpah theory are similar to clusters in other machine learning areas of study.) “Edge-betweenness” seems to work best.
<- cluster_edge_betweenness(G) # 20
keg table(membership(keg))
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 10 21 22 11 21 23 7 8 9 9 20 8 11 26 14 11 8 15 7 5
<- Dark24[membership(keg)] pal
The first line in the next code chunk shows that we did actually produce a tree. We explore three different ways ro visualize it
is.hierarchical(keg)
## [1] TRUE
<- as.hclust(keg)
H $labels <- attr(daisydist, "Labels")
H<- 7
K <- Light24[cutree(H, k=K)]
colset <- set_vertex_attr(G, "color", value = colset)
G2 <- 0.01
e <- par(mai = c(e, e, e, e))
opar plot(G2, layout = L)
Figure 5 : Community structure, simplified.
par(opar)
<- as.phylo(H)
P <- par(mai = c(0.01, 0.01, 1.0, 0.01))
opar plot(P, type = "u", tip.color = colset, cex = 0.8, main = "Ape/Cladogram")
Figure 6 : Cladogram realization, from the ape package.
par(opar)
rm(opar)
In any of the “scatter plot views” (e.g., MDS, UMAP, t-SNE) from Mercator, we may want to overlay different colors to represent different clinical features.
<- mercury@view[["mds"]]
U <- mercury@view[["tsne"]]$Y
V <- mercury@view[["umap"]]$layout W
<- Feature(clinical[,"mutation.status"], "Mutation Status", c("cyan", "red"),
featMU c("Mutated", "Unmutated"))
<- Feature(clinical[,"CatRAI"], "Rai Stage", c("green", "magenta"), c("High", "Low"))
featRai <- par(mfrow = c(1,2))
opar plot(W, main = "UMAP; Mutation Status", xlab = "U1", ylab = "U2")
points(featMU, W, pch = 16, cex = 1.4)
plot(W, main = "UMAP; Rai Stage", xlab = "U1", ylab = "U2")
points(featRai, W, pch = 16, cex = 1.4)
Figure 7 : UMAP visualizations with clinical features.
par(opar)
rm(opar)
We have a statistical approach to deciding which of the detected cycles are statistically significant. Empirically, the persistence of 0-dimensional cycles looks like a gamma distribution, while the persistence of higher dimensional cycles looks like an exponential distribution. In both cases, we use an empirical Bayes approach, treating the observed distribution as a mixture of either gamma or exponential (as appropriate) with an unknown distribution contributing to heavier tails.
<- diag[, "Death"] - diag[, "Birth"] persistence
<- persistence[diag[, "dimension"] == 0]
d0 <- d0[d0 < 1]
d0 summary(d0)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.003652 0.054455 0.076682 0.084373 0.108333 0.227561
<- mean(d0)
mu <- median(d0)
nu <- sd(d0)
sigma <- mu^2/sigma^2
shape <- mu/sigma^2
rate <- seq(0, 0.23, length = 100)
xx <- dgamma(xx, shape = shape, rate = rate) yy
hist(d0, breaks = 123, freq = FALSE)
lines(xx, yy, col = "purple", lwd = 2)
Figure 8 : Histogram of the duration of zero cycles, with an overlaid gammm distibution.
Now we want to determine if there are significant “loops” in the data, and, if so, how many?
<- persistence[diag[, "dimension"] == 1]
d1 <- ExpoFit(d1) # should be close to log(2)/median?
ef <- EBexpo(d1, 200)
eb <- par(mfrow = c(1,3))
opar plot(ef)
hist(eb)
plot(eb, prior = 0.56)
Figure 9 : Empirical Bayes detection of significant one-cycles.
par(opar)
rm(opar)
sum(d1 > cutoff(0.8, 0.56, eb)) # posterior 80%, prior 0.56
## [1] 7
sum(d1 > 0.065) # post 90%
## [1] 1
which(d1 > 0.047)
## [1] 50 91 93 123 214 236 249
which(d1 > 0.065)
## [1] 236
Let’s pick out the most persistent 1-cycle.
<- Cycle(ripdiag, 1, 236, "forestgreen")
cyc1 @index cyc1
## [,1] [,2]
## [1,] 99 212
## [2,] 4 7
## [3,] 2 99
## [4,] 110 212
## [5,] 44 59
## [6,] 4 75
## [7,] 7 59
## [8,] 75 110
## [9,] 2 44
Each row represents an edge, by listing the IDs of the points at either end of the line segment. In this case, there are nine edges that link together to form a connected loop (or topological circle).
<- Cycle(ripdiag, 1, 123, "red")
cyc2 <- Cycle(ripdiag, 1, 214, "purple")
cyc3
<- par(mfrow = c(1, 3))
opar plot(cyc1, W, lwd = 2, pch = 16, col = "gray", xlab = "U1", ylab = "U2", main = "UMAP")
lines(cyc2, W, lwd=2)
lines(cyc3, W, lwd=2)
plot(U, pch = 16, col = "gray", main = "MDS")
lines(cyc1, U, lwd = 2)
lines(cyc2, U, lwd = 2)
lines(cyc3, U, lwd = 2)
plot(V, pch = 16, col = "gray", main = "t-SNE")
lines(cyc1, V, lwd = 2)
lines(cyc2, V, lwd = 2)
lines(cyc3, V, lwd = 2)
Figure 10 : Three views of three one-cycles.
par(opar)
rm(opar)
<- angleMeans(W, ripdiag, cyc1, clinical)
poof is.na(poof)] <- 0
poof[<- poof[, c("mutation.status", "CatB2M", "CatRAI", "CatCD38",
angle.df "Massive.Splenomegaly", "Hypogammaglobulinemia")]
<- list(c(M = "green", U = "magenta"),
colorScheme c(Hi = "cyan", Lo ="red"),
c(Hi = "blue", Lo = "yellow"),
c(Hi = "#dddddd", Lo = "#111111"),
c(No = "#dddddd", Yes = "brown"),
c(No = "#dddddd", Yes = "purple"))
<- LoopCircos(cyc1, angle.df, colorScheme)
annote image(annote)
Figure 11 : Circos plot of features varying around the most persistent cycle.
Now we want to determine if there are significant “voids” (empty interiors of spheres) in the data, and, if so, how many?
<- persistence[diag[, "dimension"] == 2]
d2 <- ExpoFit(d2) # should be close to log(2)/median?
ef <- EBexpo(d2, 200)
eb <- par(mfrow = c(1, 3))
opar plot(ef)
hist(eb)
plot(eb, prior = 0.75)
Figure 12 : Empirical Bayes detection of significant two-cycles.
par(opar)
rm(opar)
sum(d2 > cutoff(0.8, 0.75, eb)) # posterior 80%, prior 0.56
## [1] 15
sum(d2 > cutoff(0.95, 0.75, eb)) # posterior 90%, prior 0.56
## [1] 2
cutoff(0.95, 0.75, eb)
## [1] 0.03281219
sum(d2 > 0.032) # post 90%
## [1] 2
which(d2 > 0.034)
## [1] 95
<- Cycle(ripdiag, 2, 95, "purple")
vd <- cmdscale(daisydist, k = 3)
mds voidPlot(vd, mds)
voidFeature(featMU, mds, radius = 0.011) # need to increase radius when you overlay one sphere on another
::rglwidget() rgl
#htmlwidgets::saveWidget(rglwidget(), "mywidget.html")
<- Projection(vd, mds, featMU, span = 0.2)
ob <- par(mfrow = c(1,2))
opar plot(ob)
image(ob, col = colorRampPalette(c("cyan", "gray", "red"))(64))
Figure 13 : Planar projection of mutation status around void.
par(opar)
rm(opar)
rm(list = ls())