Some cancers are associated with the presence of viruses that disrupt chromosomal DNA. Among the best known examples are cervical cancers and head-and-neck cancers that are associated with human papilloma virus (HPV). Using whole genome short-read sequencing, it is possible to identify breakpoints, including those where viral DNA is merged with human DNA. Each breakpoint defines a local instance of “structural variation” (SV) where segments of chromosomes are joined together in non-normal ways. It remains an open question whether those fragments are fully integrated into a human chromosome or just exist as circular extrachromosomal DNA. Using newer long-read sequencing technology, one can obtain information about the order of multiple breakpoints in the original DNA molecules in the tumor sample. In this vignette, we describe SVAlignR, an R package intended to use these long-read sequences to reconstruct the original molecules.
We begin by loading the SVAlignR
package, and a sample
data set.
library(SVAlignR)
data(longreads)
head(longreads[, 1:3])
## qname qlen connection
## LR1 870fe397-c2e5-4fbc-8324-506484f4c89a 64349 50-74-0-0-50-74-0-50-74-0-50
## LR2 cc6314e0-3582-4f15-8d59-44c09bfef1e3 56447 74-0-63-74-0-50-74-0-50-74-0
## LR3 52f855b6-de11-49d8-b298-6460d5274a2f 56306 50-74-0-50-74-0-50-74-0-50-74
## LR4 55becc2c-c24c-49df-b73a-6459aabd2a05 55510 74-50-74-0-50-74-0-50-74-0-50
## LR5 591c04a7-e1f5-40ca-b707-82ce2cfd0ff2 54409 74-50-74-0-50-74-0-50-74-0-50
## LR6 b89fe96f-d773-494a-a702-7d7879219581 51084 0-50-74-0-50-74-0-50-74-0
Each of the 197 rows in this spreadsheet represents a different
Oxford Nanopore long read. The qname
column is a unique
identifier; the qlen
column contains the length of the read
in bytes; and the connections
column lists the breakpoints
that the read contains, each of which has been assigned a numeric
identifier. We are interested in this column. We have two main
goals:
As with many sequence alignment problems, we start by removing any duplicate “reads”; here that means duplication of the break-point sequence summaries of the actual reads.
<- longreads$connection
LR0 names(LR0) <- rownames(longreads)
<- LR0[!duplicated(LR0)]
LR <- sapply(LR, function(lr) sum(LR0 == lr))
weights length(LR) # expect 158
## [1] 158
We want to cluster the break-point sequences.
In order to make use of existing algorithms, we need to convert the
numeric identifiers into single characters as part of an “alphabet”. The
Cipher
class takes in a character vector containing the
“words” (that is, our break-point sequences) that need to be rewritten.
By default, it assumes (as in our example) that “letters” are separated
by hyphens, identifies the unique symbols, and creates a tool to convert
back-and-forth from one “alphabet” to another.
<- Cipher(LR)
alphabet alphabet
## An object of class "Cipher"
## Slot "forward":
## 0 12 13 14 15 17 19 23 24 25 26 27 28 31 32 33 34 35 36 37 38 40
## "A" "B" "C" "D" "E" "F" "G" "H" "I" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W"
## 45 46 48 50 51 52 54 56 57 58 6 60 61 63 65 66 70 71 74 75 x
## "Y" "Z" "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
##
## Slot "reverse":
## A B C D E F G H I K L M N O P Q R
## "0" "12" "13" "14" "15" "17" "19" "23" "24" "25" "26" "27" "28" "31" "32" "33" "34"
## S T U V W Y Z a b c d e f g h i j
## "35" "36" "37" "38" "40" "45" "46" "48" "50" "51" "52" "54" "56" "57" "58" "6" "60"
## k l m n o p q r s - ?
## "61" "63" "65" "66" "70" "71" "74" "75" "x" ":" "?"
Now we are able to convert the original sequences of numbers into words in our specialized alphabet.
head(LR)
## LR1 LR2
## "50-74-0-0-50-74-0-50-74-0-50" "74-0-63-74-0-50-74-0-50-74-0"
## LR3 LR4
## "50-74-0-50-74-0-50-74-0-50-74" "74-50-74-0-50-74-0-50-74-0-50"
## LR6 LR7
## "0-50-74-0-50-74-0-50-74-0" "74-0-0-0-50-74-0"
<- encode(alphabet, LR)
pseudo head(pseudo)
## LR1 LR2 LR3 LR4 LR6 LR7
## "bqAAbqAbqAb" "qAlqAbqAbqA" "bqAbqAbqAbq" "qbqAbqAbqAb" "AbqAbqAbqA" "qAAAbqA"
The recoding step is already included in the
SequenceCluster
constructor and will happen automatically.
So, to cluster the long-read break point sequences, we just need to call
the constructor on the (named and de-duplicated) sequences. The core
algorithm is the classic Needelman-Wunsch global alignment algorithm, as
implemented in the NameNeedle
R package. As used in the
SVAlignR
package, it computes an alignment score between
any two sequences based on a match score of 2, a mismatch penalty of -6,
and a gap penalty of -2. We fill a square matrix with scores for every
pair (I, J) of sequences. The resulting scores can be either positive or
negative. For each row X, we rescale into the interval [0,1] by
computing \[Y_{IJ} = 1 - \frac{X_{IJ} -
min(X_{I.})}{max(X_{I.})-min(X_{I.})}.\] Since the scores need
not be symmetric, we force symmetry by averaging the scaled score matrix
and its transpose.
We then perform hierarchical clustering using the symmetrized, rescaled score matrix to define distances. We cut the dendrogam to cluster the sequences into 20 groups. The number 20 is arbitrary. In our data set, it should yield about 8 long-read break-point sequences per cluster. For later steps, we also want to make sure that each cluster contains at leeast 2 sequences.
<- SequenceCluster(LR0, NC = 15) seqclust
## Warning in SequenceCluster(LR0, NC = 15): Removing 39 duplicated sequences.
table(seqclust@clusters)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 9 24 14 6 7 10 7 13 14 5 9 7 18 7 8
Now we see why 20 clusters is appropriate. Each cluster is a relatively small size, but they all contain more than one sequence so that aligning those sequences will be meaningful. Figure 1 displays the resulting dendrogram, with each of the 20 clusters shown in a different color.
plot(seqclust)
Figure 1: Hierarchical clustering of break point sequences.
Figure 2 contains a simplified view of the same dendrogram, cutting off the clutter below the major splits into clusters.
plot(seqclust, type = "clipped")
Figure 2: Uncluttered clusters.
Figure 3 displays an alternate view of the dendrogram, emphasizing the relationships between clusters.
plot(seqclust, type = "unrooted")
Figure 3: Unrooted phylogeny tree.
Both Figure 2 and, more strongly, Figure 3 indicate that there are approximately four major clusters.
We can also show the dendrogram attached to a heatmap containing the distance matrix.
heat(seqclust)
Figure 4: Heatmap of the distance matrix.
Each of the clusters uses a small enough set of symbols that we can
safely rewrite them as though they are sequences of amino acids. We then
use the align
function from the SVAlignR
package to perform multiple sequence alignments, using the “ClustalW”
algorithm on each cluster. Because the components are not amino acids,
however, we do not need to use one of the PAM or BLOSUM matrices.
Instead, we define a substitution matrix that gives the same match and
mismatch scores to each pair of symbols.
<- makeSubsMatrix(match = 2, mismatch = -6)
MYSUB <- alignAllClusters(seqclust, mysub = MYSUB, gapO = 2, gapE = 0.1) consensus
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
## use user defined matrix
class(consensus)
## [1] "list"
length(consensus)
## [1] 15
Figure 5 contains an example of a multiple sequence alignment.
<- par(mai = c(1.02, 0.82, 0.82, 0.82))
opar image(consensus[[15]])
Figure 5: Example alignment of cluster 15.
par(opar)
rm(opar)
Figure 6 shows all 20 alignments.
<- par(mfrow = c(4,3))
opar for (K in 1:length(consensus)) {
image(consensus[[K]], col = SVAlignR:::myColorSet[K])
}
Figure 6: Multiple sequence alignments of clusters.
par(opar)
Figure 6: Multiple sequence alignments of clusters.
library(dendextend)
##
## ---------------------
## Welcome to dendextend version 1.17.1
## Type citation('dendextend') for how to cite the package.
##
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
##
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## You may ask questions at stackoverflow, use the r and dendextend tags:
## https://stackoverflow.com/questions/tagged/dendextend
##
## To suppress this message use: suppressPackageStartupMessages(library(dendextend))
## ---------------------
##
## Attaching package: 'dendextend'
## The following object is masked from 'package:stats':
##
## cutree
<- SVAlignR:::myColorSet
myColorSet <- seqclust@NC
NC <- mean(rev(seqclust@hc$height)[NC + -1:0])
mycut <- as.dendrogram(seqclust@hc)
hcd <- cut(hcd, h = mycut)
dend2 <- dend2$upper
DU <- sapply(consensus, function(X) {
blocks @consensus
X
})labels(DU) <- paste(" [", 1:NC, "] ", blocks, " ", sep = "")
<- par(bg = "gray90")
opar plot(ape::as.phylo(DU), type = "unrooted", rotate.tree = -20,
edge.color = "black", edge.width = 2,
lab4ut = "axial",
tip.color = myColorSet[1:NC], cex = 1.2)
Figure 7: Unrooted phylogeny tree with consensus sequences.
par(opar)
We also provide the ability to compute de Bruijn graphs from any set
of sequences. These graphs are frequently used for “de novo” sequencing,
and thus can provide an alternative source of information about
“consensus sequence” alignments. In order to work with graphs in
general, you should load the igraph
package.
library("igraph")
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
<- deBruijn(seqclust@rawSequences, 4)
DB <- DB@G
G <- layout_with_fr(G)
loci plot(G, layout = loci)
count_components(G)
## [1] 13
<- components(G)
comp sort(comp$csize)
## [1] 1 1 2 2 2 2 3 3 3 4 4 32 84
<- largest_component(G)
GC <- layout_with_gem(GC)
gloc plot(GC, layout = gloc)
sort(table(E(GC)$weight))
##
## 6 11 17 20 21 28 7 8 14 9 4 3 5 2 1
## 1 1 1 1 1 1 2 2 2 3 6 8 8 21 46
<- delete_edges(GC, E(GC)[E(GC)$weight < 6])
GE <- largest_component(GE)
GE #GE <- set_edge_attr(GE, "width", value = sqrt(edge_attr(G, "weight")))
<- layout_with_fr(GE)
gloc plot(GE, layout = gloc)
<- lapply(1:seqclust@NC, function(J) {
dbset <- seqclust@rawSequences[seqclust@clusters == J]
rs <- deBruijn(rs, 4)
db
}) <- length(dbset)
L for (I in 1:L) {
<- dbset[[I]]@G
G <- layout_with_fr(G)
lyt plot(G, layout =lyt, main = paste("Cluster", I))
}
sessionInfo()
## R version 4.3.2 Patched (2023-11-21 r85584)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 12 (bookworm)
##
## Matrix products: default
## BLAS: /srv/R/R-patched/build.23-11-23/lib/libRblas.so
## LAPACK: /srv/R/R-patched/build.23-11-23/lib/libRlapack.so; LAPACK version 3.11.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
## [4] LC_COLLATE=C LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Vienna
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] igraph_1.5.1 dendextend_1.17.1 SVAlignR_0.4.5
##
## loaded via a namespace (and not attached):
## [1] viridis_0.6.4 sass_0.4.7 utf8_1.2.4
## [4] generics_0.1.3 bitops_1.0-7 stringi_1.8.2
## [7] lattice_0.22-5 digest_0.6.33 magrittr_2.0.3
## [10] evaluate_0.23 grid_4.3.2 fastmap_1.1.1
## [13] jsonlite_1.8.7 ape_5.7-1 GenomeInfoDb_1.38.1
## [16] gridExtra_2.3 fansi_1.0.5 viridisLite_0.4.2
## [19] scales_1.2.1 Biostrings_2.70.1 jquerylib_0.1.4
## [22] cli_3.6.1 rlang_1.1.2 crayon_1.5.2
## [25] scatterplot3d_0.3-44 XVector_0.42.0 munsell_0.5.0
## [28] cachem_1.0.8 yaml_2.3.7 parallel_4.3.2
## [31] tools_4.3.2 dplyr_1.1.4 colorspace_2.1-1
## [34] ggplot2_3.4.4 GenomeInfoDbData_1.2.11 BiocGenerics_0.48.1
## [37] msa_1.34.0 vctrs_0.6.4 R6_2.5.1
## [40] stats4_4.3.2 lifecycle_1.0.4 stringr_1.5.1
## [43] zlibbioc_1.48.0 oompaBase_3.2.9 S4Vectors_0.40.1
## [46] NameNeedle_1.2.7 IRanges_2.36.0 cluster_2.1.4
## [49] pkgconfig_2.0.3 bslib_0.6.0 pillar_1.9.0
## [52] gtable_0.3.4 glue_1.6.2 Rcpp_1.0.11
## [55] highr_0.10 xfun_0.41 tibble_3.2.1
## [58] tidyselect_1.2.0 knitr_1.45 nlme_3.1-163
## [61] htmltools_0.5.7 rmarkdown_2.25 Polychrome_1.5.1
## [64] compiler_4.3.2 RCurl_1.98-1.13