Using the Mercator Package

Kevin R. Coombes, Zachary B. Abrams, C.E. Coombes, Suli Li

2019-07-17

Introduction

Modern biological experiments are increasingly producing interesting binary matrices. These may represent the presence or absence of specific gene mutations, copy number variants, microRNAs, or other molecular or clinical phenomena. We have recently developed a tool, CytoGPS, that converts conventional karyotypes from the standard text-based notation (the International Standard for Human Cytogenetic Nomenclature; ISCN) into a binary vector with three bits (loss, gain, or fusion) per cytoband, which we call the “LGF model”.

The Mercator package is intended to facilitate the exploration of binary data sets. It implements a subset of the 76 binary distance metrics described by [Choi and colleagues], ensuring that at least one representative of each of their major clusters is included. Each resulting distance matrix can be combined with multiple visualiization techniques, providing a consistent interface to thoroughly explore the data set.

The BinaryMatrix Object Class

First we load the R library package Mercator.

library(Mercator)
## Loading required package: Thresher
## Loading required package: ClassDiscovery
## Loading required package: cluster
## Loading required package: oompaBase
## Loading required package: PCDimension
#suppressWarnings( suppressMessages( library(Thresher) ))

A Limited, Sample Dataset

We proceed with a model dataset of karyotypes from patients with Chronic Myelogenous Leukemia (CML) with 400 chromosomal features recorded over 740 patients, from the public Mitelman database. Cytogenetic abnormalities, recorded in Mitelman as text strings have been pre-processed with the CytoGPS package into binary vectors. For the sake of clarity and efficiency, we have chosen a subset of patients and features for this example.

## [1] 740 400

Generating the BinaryMatrix

The functions of the Mercator package operate on the BinaryMatrix S4 object, which forms the input of the subsequent functions and visualizations.

The BinaryMatrix object is formed from a matrix of integer or numeric class. Although this package is designed for the processing and visualizations of binary data, the BinaryMatrix object and subsequent functions accept a variety of integer and numeric values.

Row and column names can be assigned as data frames. If no row or column headings are assigned, the BinaryMatrix takes the row and column names of the inital matrix as default. Here, we wish to keep the column names associated with the parent matrix, but must create row names to build the BinaryMatrix.

Notice also that the object always includes a “history” element that tracks how it has been processed.

## An object of the 'BinaryMatrix' class, of size
## [1] 740 400
## History:
## [1] "Newly created."

We wish to cluster the whole karyotypes of each patient to identify the patterns of important chromosomal abnormalities that link them. To proceed, we must transpose the BinaryMatrix. Transposition meaningfully transposes the row and column headings of the BinaryMatrix, as well.

## An object of the 'BinaryMatrix' class, of size
## [1] 400 740
## History:
## [1] "Newly created." "transposed"

Remove Duplicate Features

The binary feature-vectors (viewed across a population of patient samples) are rarely unique. Having identical feature vectors can complicate some of the clustering and visualization routines that we want to use (often by introducing a division by zero). But they can also alter the biological implications, by automatically giving more “weight” to a single genomic event (like a trisomy or monosomy). To deal with this issue, the Mercator package includes a function to remove duplicate or redundant features.

## An object of the 'BinaryMatrix' class, of size
## [1] 400 136
## History:
## [1] "Newly created."              "transposed"                 
## [3] "Duplicate features removed."

In the case of our data, only 136 of the 740 karyotypes are unique. Some of the karyotypes are “not used”, in the sense that they contain none of the abnormalities selected for this limited subset.

## [1] 65

By contrast, many features are used but are simply redundant.

## [1] 603

Data Filtering with the Thresher Method

We create a ThreshedBinaryMatrix to implement the algorithm. In general, a delta cutoff above approximately 0.3 can be chosen as standard to indicate informative feature. Then, we subset our ThreshedBinaryMatrix to only include features above our given cutoff.

set.seed(21348)
my.binmat <- threshLGF(my.binmat, cutoff=0.3)
summary(my.binmat)
## An object of the 'BinaryMatrix' class, of size
## [1] 400 111
## History:
## [1] "Newly created."              "transposed"                 
## [3] "Duplicate features removed." "Threshed."

The red vertical line in the figure indicates the cutoff we have chosen to separate uninformative features (<0.3) from informative ones (>0.3).

Delta <- my.binmat@thresher@delta
hist(Delta, breaks=20, main="", xlab="Weight")
abline(v=0.3, col='red')
Histogram of weight vectors.

Histogram of weight vectors.

We implement the Reaper functionality of Thresher for principal component analysis. Principal components returned numerically…

my.binmat@reaper@pcdim
## [1] 2
my.binmat@reaper@nGroups
## [1] 5

… or they can be visualized with an Auer-Gervini plot

plot(my.binmat@reaper@ag)
abline(h=my.binmat@reaper@pcdim, col="blue")
abline(h=my.binmat@reaper@nGroups, col="green")
abline(h=8, col="orange")
Auer-Gervini plot.

Auer-Gervini plot.

screeplot(my.binmat@reaper)
abline(h=my.binmat@reaper@pcdim, col="blue")
abline(h=my.binmat@reaper@nGroups, col="green")
abline(h=8, col="orange")
Scree plot.

Scree plot.

All three of these methods provide different, but imperfect, means of estimating K. The blue line represents the result of the pcdim calculation of principal components. The green line represents the result of the nGroups calculation. The orange line represents an estimate of principal components from visual inspection.

From inspection of the plots, we determine an appropriate K with which to proceed as 8.

kk <- 8

Selecting a Distance Metric

The Mercator package implements 10 distance metrics: Jaccard, Sokal-Michener, Hamming, Russell-Rao, Pearson, Goodman-Kruskal, Manhattan, Canberra, Binary and Euclidean. Although some of these metrics can be used for continuous or categorical data, all are appropriate for some or all binary matrices. Mercator allows the user to easily select the most appropriate metric to represent similarity and difference in a biologically meaningful way within a given dataset.

Here, we will proceed with the Jaccard distance, because of its ease of interpretability, commoness of use, and its appropriatness of application to asymmetric binary data, such as the binary vector output of CytoGPS in this dataset.

Visualizing with 5 Methods

The Mercator Package allows visualization of data with 5 methods, including both standard techniques (hierarchical clustering, heat maps) and large-scale multi-dimensional visualizations (multidimensional scaling (MDS), T-distributed Stochastic Neighbor Embedding (t-SNE), and iGraph.)

The initial Mercator function can be implemented with any visualization, and visualizations can be added in an arbitrary order.

jacc.Vis <- Mercator(my.binmat, "jaccard", "hclust", K=kk)

We can represent all the distances between features within the dissimilarity matrix we have calculated on our data as a histogram, as a visual representation of relatedness.

hist(jacc.Vis, 
     xlab="Jaccard Distance", main="Histogram of Distances")

Mercator allows us to implement common, standard visualizations such as hierarchical clustering…

plot(jacc.Vis@view[[1]], col="black", pch=jacc.Vis@symv)

… and heat map.

jacc.Vis <- addVisualization(jacc.Vis, "heat")
plot(jacc.Vis@view[[2]], col=jacc.Vis@colv, pch=jacc.Vis@symv)

Mercator can be used to plot t-Stochastic Neighbor Embedding (t-SNE) plots for visualizing large-scale, high-dimensional data in 2-dimensional space.

par(pty="s")
jacc.Vis <- addVisualization(jacc.Vis, "tsne", 
            perplexity=5, 
            xlab="T1", ylab="T2")
plot(jacc.Vis@view[[3]]$Y, col=jacc.Vis@colv, pch=jacc.Vis@symv,
     main="t-SNE; Jaccard Distance; perplexity=5")

t-SNE parameters, such as perplexity, can be used to fine-tune the plot as the visualization is created. Using addVisualization to create a new, tuned plot of an existing type overwrites the existing plot of that type.

jacc.Vis <- addVisualization(jacc.Vis, "tsne", 
            perplexity=10,
            xlab="T1", ylab="T2")
## Warning in addVisualization(jacc.Vis, "tsne", perplexity = 10, xlab =
## "T1", : Overwriting an existing visualization:tsne
plot(jacc.Vis@view[[3]]$Y, col=jacc.Vis@colv, pch=jacc.Vis@symv,
     main="t-SNE; Jaccard Distance; perplexity=10")

par(pty="m")

Mercator allows visualization of multi-dimensional scaling (MDS) plots, as well.

jacc.Vis <- addVisualization(jacc.Vis, "mds")
plot(jacc.Vis@view[[4]], col=jacc.Vis@colv, pch=jacc.Vis@symv,
      main="MDS; Jaccard Distance", xlab="PC1", ylab="PC2")

IGraph

Mercator can be used to visualize complex networks using IGraph. To improve clarity of the visualization and computational time, we implement the downsample function to reduce the number of data points to be linked and visualized. The idea goes back to Peng Qiu’s implementation of the SPADE clustering algorithm for mass cytyometry data. The main point is to under sample the densest regions of the data space to make it more likely that rarer clusters will still be adequately sampled.

Note: The Mercator class includes a “subset” operator that tries to preserve earlier visualizations. This operator is fast for MDS or t-SNE models, but is very slow for large hierarchical clustering. (It uses the implementation in the dendextend package, which works by removing a single leaf at a time from the tree.) In the next code chunk, we first throw away the dendrogram, then subset using the downsampled data, and then compute a new dendrogram.

## Warning in addVisualization(J, "tsne", perplexity = 5): Overwriting an
## existing visualization:tsne

The densest “eyes” are heavily under-sampled.

Now we can look at the resulting graph, using three different “layouts”. BECAUSE OF A BUG IN IGRAPH, I DON’T WORK!

#Cluster Identities and Color Tuning

We can use the getClusters function to characterize each cluster and use these for further manipulation.

We can easily determine cluster size…

## my.clust
##  1  2  3  4  5  6  7  8 
## 53 16  7  5 11  8  6  5

… or the patients that comprise each cluster.

## [1] C55  C246 C312 C545 C717
## 740 Levels: C1 C10 C100 C101 C102 C103 C104 C105 C106 C107 C108 ... C99

recolor remapColors