
\cleardoublepage
\chapter{SPar-K}
\label{spark}
\markboth{SPar-K}{SPar-K}
\addcontentsline{toc}{chapter}{SPar-K}
This chapter describes SPar-K (Signal Partitioning with K-means), a modification of the K-means algorithm to cluster genomic regions based on their chromatin organization, defined by their sequencing profiles.
I developed, implemented and benchmarked this algorithm and produced all the figures shown in this chapter. The content of this chapter is taken and adapted from the original article \citep{groux_spar-k:_2019}.
Due to the wealth of sequencing data, it is common to analyze positional correlations between chromatin features, e.g. the position of nucleosomes (revealed by MNase-seq) relative to transcription factor binding regions (mapped by ChIP-seq), in order to shed light on their functional relationship. Several methods and software packages have been developed for discovering chromatin patterns by clustering and/or realigning read density profiles over genomic regions (see section \ref{intro_pattern_discovery}), including ChromaSig \citep{hon_chromasig:_2008}, ArchAlign \citep{lai_archalign:_2010}, CATCHProfiles \citep{nielsen_catchprofiles:_2012}, CAGT \citep{kundaje_ubiquitous_2012} and ChIPPartitioning \citep{nair_probabilistic_2014}. However, these programs have limitations: some do not perform a realignment, others are restricted to count data or, like ChIPPartitioning, lack a runtime-efficient implementation. To fill this gap, I developed SPar-K (Signal Partitioning with K-means).
\section{Algorithm}
% summary
The SPar-K algorithm (see Algorithm \ref{algo_spark}) is a modified version of the regular K-means algorithm, in which a set of $N$ regions of size $L$ is partitioned into $K$ clusters using an iterative optimization procedure. Each cluster is composed of an alignment of the region sub-parts of length $L'$ assigned to this cluster, and the cluster is summarized by a vector of length $L \geq L'$ that contains the average signal at each position of the alignment.
% input
The input data are stored as an $N \times L$ matrix $R$. The signal resolution may be single-base or a larger bin size. The regions are typically defined by positions relative to an anchor point, e.g. a ChIP-seq peak summit. If the signal is noisy, a data smoothing step can be performed to average out outlier values (see Algorithm \ref{algo_smooth_outliers}) and ease the partitioning procedure.
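The exact smoothing procedure is given in Algorithm \ref{algo_smooth_outliers}; as a rough illustration of what such a step can look like, the sketch below applies a running-median filter to one row of $R$. The window width $w$ and the choice of the median are illustrative assumptions, not SPar-K's actual parameters.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical outlier smoothing (illustration only): replace each value
// by the median of a centered window of width w, which damps isolated
// outlier values while preserving the overall profile shape.
std::vector<double> smooth_outliers(const std::vector<double>& row, std::size_t w) {
    std::vector<double> out(row.size());
    std::size_t h = w / 2;
    for (std::size_t i = 0; i < row.size(); ++i) {
        std::size_t lo = (i >= h) ? i - h : 0;              // clip window at edges
        std::size_t hi = std::min(row.size(), i + h + 1);
        std::vector<double> win(row.begin() + lo, row.begin() + hi);
        std::nth_element(win.begin(), win.begin() + win.size() / 2, win.end());
        out[i] = win[win.size() / 2];
    }
    return out;
}
```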
% objective function
SPar-K optimizes the alignments by minimizing the sum of squared errors, that is, the sum of the squared distances of each region to the cluster aggregation it is assigned to.
% distances
The distance between any two regions is computed using a modified correlation distance. Let us assume two regions $X$ and $Y$ of length $L$ and a shifting freedom $S$. $X$ and $Y$ are each sub-divided into $S$ slices. Each slice has a length $L'=L-S+1$ and starts at one of the possible offsets $s=1,2,...,S$. All $S^{2}$ pairwise comparisons between slices of $X$ and $Y$ are computed as $1-cor(X_{i},Y_{j})$, where $X_{i}$ and $Y_{j}$ are the slices starting at offsets $i,j \in \{1,2,...,S\}$. If flipping is allowed, another set of $S^{2}$ comparisons is performed by flipping $Y_{j}$ (that is, the 1st position of $Y_{j}$ becomes the last and vice-versa), resulting in $2 \times S^{2}$ comparisons. Eventually, the distance between $X$ and $Y$ is the minimum of the $S^{2}$ (without flipping) or $2 \times S^{2}$ (with flipping) values. The indices $i$ and $j$ and the flipping state of the best comparison are remembered, as they allow rebuilding the optimal alignment between $X$ and $Y$. The naive algorithm is $\Theta(S^{2} \times L')$ in time; however, I designed a faster $\Theta(S \times L')$ algorithm using a dynamic programming approach (see Algorithm \ref{algo_distance_fast}).
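A minimal sketch of the naive $\Theta(S^{2} \times L')$ distance computation makes the slicing explicit. The function names are mine, and the zero-variance fallback (correlation assumed 0, distance 1) is an assumption, not necessarily SPar-K's exact behavior; the faster dynamic programming variant is given in Algorithm \ref{algo_distance_fast}.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Pearson correlation between the slice of x starting at xo and the slice
// of y starting at yo, both of length len; y's slice is read backwards if
// flip_y is set. Returns 0 (distance 1) if either slice has zero variance.
double slice_cor(const std::vector<double>& x, std::size_t xo,
                 const std::vector<double>& y, std::size_t yo,
                 std::size_t len, bool flip_y) {
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (std::size_t k = 0; k < len; ++k) {
        double a = x[xo + k];
        double b = flip_y ? y[yo + len - 1 - k] : y[yo + k];
        sx += a; sy += b; sxx += a * a; syy += b * b; sxy += a * b;
    }
    double n   = static_cast<double>(len);
    double cov = sxy - sx * sy / n;
    double vx  = sxx - sx * sx / n;
    double vy  = syy - sy * sy / n;
    if (vx <= 0 || vy <= 0) return 0.0;
    return cov / std::sqrt(vx * vy);
}

// Naive distance: minimum of 1 - cor over all S^2 offset pairs,
// or 2*S^2 pairs if flipping is allowed.
double spark_distance(const std::vector<double>& x,
                      const std::vector<double>& y,
                      std::size_t S, bool allow_flip) {
    std::size_t Lp = x.size() - S + 1;   // slice length L' = L - S + 1
    double best = std::numeric_limits<double>::max();
    for (std::size_t i = 0; i < S; ++i)
        for (std::size_t j = 0; j < S; ++j) {
            best = std::min(best, 1.0 - slice_cor(x, i, y, j, Lp, false));
            if (allow_flip)
                best = std::min(best, 1.0 - slice_cor(x, i, y, j, Lp, true));
        }
    return best;
}
```

In the actual algorithm, the minimizing $(i, j)$ pair and flip state would also be recorded to rebuild the alignment; they are omitted here for brevity.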
% iteration walk-through
SPar-K is initialized by choosing $K$ regions to become the initial cluster aggregations of length $L$, either i) randomly (Algorithm \ref{algo_seed_random}) or ii) using the K-means++ \citep{arthur_k-means++:_2007} sampling procedure (see Algorithm \ref{algo_seed_kmeans++}). Then, each region is aligned against each cluster aggregation and assigned to the cluster to which it has the smallest distance. Once all $N$ regions have been assigned to a cluster, the cluster aggregations are updated by computing the average signal at each position of the alignments.
This procedure is repeated until i) reaching the maximum number of iterations or ii) achieving convergence, that is, when the alignments in each cluster do not change from one iteration to the next.
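The assignment step of one iteration can be sketched as follows; this is a simplified illustration, not the actual SPar-K code, and the generic distance functor stands in for the shifted/flipped correlation distance described above.

```cpp
#include <cstddef>
#include <vector>

// One assignment step: each region is compared against every cluster
// aggregation and labeled with the cluster it is closest to.
template <typename Dist>
std::vector<std::size_t> assign_clusters(
        const std::vector<std::vector<double>>& regions,
        const std::vector<std::vector<double>>& aggregations,
        Dist dist) {
    std::vector<std::size_t> labels(regions.size(), 0);
    for (std::size_t r = 0; r < regions.size(); ++r) {
        double best = dist(regions[r], aggregations[0]);
        for (std::size_t k = 1; k < aggregations.size(); ++k) {
            double d = dist(regions[r], aggregations[k]);
            if (d < best) { best = d; labels[r] = k; }
        }
    }
    return labels;
}
```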
\section{Implementation}
The SPar-K algorithm has been implemented as a stand-alone, fully multithreaded C++ program. Regarding the parallelization, the computations at each step are independent of each other, leading to an ``embarrassingly parallel'' situation. Thus, at each step, the computations are split into equal amounts and distributed over a pool of worker threads. Eventually, the program returns a table listing, for each region, the cluster assignment, the shift state and the orientation. The software distribution also includes R scripts for visualizing the data as heatmaps, as shown in Figure \ref{spark_dnase}. The source code is available from GitHub (\url{https://github.com/romaingroux/SPar-K}) and as a Docker container (\url{https://hub.docker.com/r/rgroux/spar-k}).
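This kind of work splitting can be sketched as follows (a simplified illustration, not the actual SPar-K implementation): the $n$ independent per-region computations of a step are divided into contiguous chunks, one per worker thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Split n independent tasks into contiguous chunks and run one chunk per
// worker thread; process(i) stands for any per-region computation
// (e.g. the distances of region i to every cluster aggregation).
template <typename F>
void parallel_for(std::size_t n, std::size_t n_threads, F process) {
    std::vector<std::thread> workers;
    std::size_t chunk = (n + n_threads - 1) / n_threads;  // ceil(n / n_threads)
    for (std::size_t t = 0; t < n_threads; ++t) {
        std::size_t from = t * chunk;
        std::size_t to   = std::min(n, from + chunk);
        if (from >= to) break;                 // more threads than tasks
        workers.emplace_back([=]() {
            for (std::size_t i = from; i < to; ++i) process(i);
        });
    }
    for (auto& w : workers) w.join();          // barrier before the next step
}
```

Because each task writes only to its own region's results, no locking is needed within a step; the join acts as the synchronization barrier between steps.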
\section{Benchmarking}
\begin{figure}
\centerline{\includegraphics[scale=0.4]{images/ch_spark/supplemental_figure1.png}}
\caption{Synthetic datasets : \textbf{A} The class signal densities. \textbf{B} A synthetic dataset with a mean coverage of 100 reads per region ($c$=100) and 0\% noise ($p_{s}$=1, $p_{b}$=0) and \textbf{C} one of the corresponding SPar-K partitions, with shifting and flipping. The color ribbons on the side indicate the cluster assignments. \textbf{D} A synthetic dataset with a mean coverage of 100 reads per region ($c$=100) and 90\% noise ($p_{s}$=0.1, $p_{b}$=0.9) and \textbf{E} one of the corresponding SPar-K partitions, with shifting and flipping.
Figure and legend taken and adapted from \citep{groux_spar-k:_2019}.}
\label{spark_simulated_data}
\end{figure}
% supplemental figure 2 from article
\begin{figure}
\centerline{\includegraphics[scale=0.4]{images/ch_spark/supplemental_figure2.png}}
\caption{\textbf{Clustering accuracy using random seeding :} to compare the clustering accuracies of the different methods, several simulated datasets containing 3 classes, with different coverages (10, 50 and 100 reads per region, indicated as "cov10", "cov50" and "cov100") and noise proportions (no noise, 10\% noise, 50\% noise and 90\% noise, indicated as "0.0", "0.1", "0.5" and "0.9"), were generated. Each dataset was clustered 50 times with each method and the Adjusted Rand Index (ARI) was computed for each resulting partition. The ARI values are displayed as boxplots. SPar-K and ChIPPartitioning were run allowing flipping and shifting. For SPar-K, "smooth" indicates outlier smoothing. For the regular K-means, "eucl." and "corr." refer to the Euclidean and correlation distances. "R" stands for "random" and indicates the ARI values obtained when comparing the true cluster labels with a randomly shuffled version of them, 100 times.
Figure and legend taken and adapted from \citep{groux_spar-k:_2019}.}
\label{spark_ari}
\end{figure}
% supplemental figure 4 from article
\begin{figure}
\centerline{\includegraphics[scale=0.4]{images/ch_spark/supplemental_figure4.png}}
\caption{\textbf{Median SSE :} for the simulated ChIP-seq dataset containing 3 classes, with coverage 100 and no noise, partitioned into 2 to 5 clusters. To judge whether the elbow method could be used to estimate the optimal number of clusters, this dataset was partitioned with SPar-K, allowing flipping and shifting, into 2 to 5 clusters, 50 times for each set of parameters. For each number of clusters, the median SSE is shown, +/- 1 standard deviation (bars). \textbf{A} Seeding done at random, \textbf{B} seeding done at random with outlier smoothing, \textbf{C} seeding done with the K-means++ method and \textbf{D} seeding done with the K-means++ method with outlier smoothing. In all cases, the optimal number of clusters appeared to be 3 (the expected value).
Figure and legend taken and adapted from \citep{groux_spar-k:_2019}.}
\label{spark_sse}
\end{figure}
% supplemental figure 5 from article
\begin{figure}
\centerline{\includegraphics[scale=0.4]{images/ch_spark/supplemental_figure5.png}}
\caption{\textbf{Running times :} to compare the run times of the programs, the synthetic dataset with coverage 100 and no noise was partitioned 20 times with each program. The run times (wall clock) in seconds were measured. For SPar-K and the regular K-means, the partitions were initialized using random and K-means++ seeding (indicated as "k++"). For ChIPPartitioning, only random seeding was used. The partitions were then optimized for at most 30 iterations. For SPar-K and ChIPPartitioning, a shifting of 71 bins and flipping were allowed. For SPar-K, only one thread was used and "smooth" indicates outlier smoothing. For the regular K-means, "eucl." and "corr." refer to the Euclidean and correlation distances.
Figure and legend taken and adapted from \citep{groux_spar-k:_2019}.}
\label{spark_time}
\end{figure}
First, I compared SPar-K, the regular K-means and ChIPPartitioning on synthetic datasets exhibiting properties that are plausible for ChIP-seq profiles over genomic regions.
\subsection{K-means}
For the regular K-means, the "kccaFamily" function from the "flexclust" R package \citep{leisch_toolbox_2006} was used. Calls to kccaFamily(dist=distEuclidean, cent=centMean) and kccaFamily(dist=distCor, cent=centMean) were employed to partition the data using the Euclidean distance and the correlation distance, respectively. "distEuclidean" is a function defined in the package and "distCor" a custom function computing $1 - cor(x,y)$ for any two vectors $x$ and $y$. If the correlation between $x$ and $y$ could not be computed (for instance, if the standard deviation of $x$ or $y$ is equal to 0), the correlation was assumed to be 0 (and the distance 1). The initial centers were chosen using one of the two following seeding strategies : i) a random sampling of $K$ points or ii) K-means++, a strategy aiming at sampling $K$ initial points as far as possible from each other.
\subsection{ChIPPartitioning}
The implementation was done in the R programming language. The "em\_shape", "em\_shape\_shift" and "em\_shape\_shift\_flip" functions present in the supplemental material of \citep{nair_probabilistic_2014} were taken as such and incorporated into an R wrapper (as in Chapter \ref{encode_peaks}). For this method, the partitioning could only be initialized using a random procedure, as described in \citep{nair_probabilistic_2014}.
\subsection{Data}
I generated several synthetic datasets. Each dataset contained 1000 regions of 2001bp (+/- 1kb around a central position), equally distributed over 3 classes. The signal over a region was modeled as a mixture of a class-specific signal and of a background signal. The class-specific signal was modeled by a 1902-element density vector. The background signal was modeled using a second 1902-element density vector containing a uniform density. The first class density vector contained a Gaussian density with mean 951 and standard deviation 40 (Figure \ref{spark_simulated_data}A, upper panel). The second class density was a Gaussian density with mean 950 and standard deviation 40; to create an asymmetric signal class, the values at positions 950 to 1902 (inclusive) were set to the minimal value found in the original density (Figure \ref{spark_simulated_data}A, middle panel). The last class contained a rectangular function with a step spanning elements 830 to 1070 (Figure \ref{spark_simulated_data}A, lower panel). Finally, all the densities were normalized such that the sum of each vector was 1. From these densities, the $\lambda$ values for a class $k$ were computed using the following formula:
\begin{equation}
\lambda_{k} = signal_{k} \times c \times p_{s} + background \times c \times p_{b}
\end{equation}
where $signal_{k}$ is the class characteristic signal density, $background$ a uniform density, $c$ the coverage factor, $p_{s}$ the overall signal proportion and $p_{b}$ the overall background proportion, with the constraint $p_{s} + p_{b} = 1$.
% generations
For each region, a 1902bp-long read signal was randomly sampled from Poisson distributions parametrized by the $\lambda_{k}$ values. Then, the signal vector was introduced, at a given offset and in a given orientation, into a 2001-element vector filled with 0's. The offset was randomly sampled from 1 to 100. The orientation was reversed with a probability of 0.3. Finally, the resulting 2001bp vectors were binned using a 10bp window, that is, the signal was summed up every 10 columns, leading to 201-bin-long vectors. At the end of the process, a dataset was stored as a matrix of 1000 rows and 201 columns. Two examples of synthetic datasets are shown in Figure \ref{spark_simulated_data}B and D.\\
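The core of the generation, i.e. computing the $\lambda$ vector, drawing one Poisson count per position and binning, can be sketched as follows. This is a simplified illustration of the steps above (the offset placement and orientation flip are omitted), and the function names are mine.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// lambda_k[i] = signal_k[i]*c*p_s + background[i]*c*p_b, as in the text;
// signal and background are assumed to be densities summing to 1.
std::vector<double> make_lambdas(const std::vector<double>& signal,
                                 const std::vector<double>& background,
                                 double c, double p_s, double p_b) {
    std::vector<double> lambdas(signal.size());
    for (std::size_t i = 0; i < signal.size(); ++i)
        lambdas[i] = signal[i] * c * p_s + background[i] * c * p_b;
    return lambdas;
}

// One Poisson draw per position, parametrized by the lambda values.
std::vector<int> sample_region(const std::vector<double>& lambdas,
                               std::mt19937& rng) {
    std::vector<int> reads(lambdas.size());
    for (std::size_t i = 0; i < lambdas.size(); ++i)
        reads[i] = std::poisson_distribution<int>(lambdas[i])(rng);
    return reads;
}

// Sum the counts into consecutive bins of the given width (10bp in the text).
std::vector<int> bin_signal(const std::vector<int>& reads, std::size_t width) {
    std::vector<int> bins((reads.size() + width - 1) / width, 0);
    for (std::size_t i = 0; i < reads.size(); ++i)
        bins[i / width] += reads[i];
    return bins;
}
```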
\subsection{Performances}
Performances were assessed using the Adjusted Rand Index (see Figure \ref{spark_ari} and Supplemental Figure 3 in \citep{groux_spar-k:_2019}) and the optimal number of classes was estimated using the elbow method (Figure \ref{spark_sse}). As expected, the regular K-means performed poorly. On the contrary, SPar-K was as accurate as ChIPPartitioning, except at the lowest coverage. Considering speed, SPar-K outperformed ChIPPartitioning by a factor of at least 20 (Figure \ref{spark_time}).
\section{Partition of DNase and MNase data}
% figure 1 from article
\begin{figure}
\centerline{\includegraphics[scale=0.4]{images/ch_spark/figure1.png}}
\caption{Partitioning of DNaseI hypersensitivity profiles around SP1 binding sites in K562 cells. The optimal number of clusters was determined using the elbow method. \textbf{A.} Input data based on peak summits provided by ENCODE. \textbf{B.} Same regions clustered, re-aligned and oriented by SPar-K. Clusters 1, 2 and 3 are indicated by colored bars in red, blue, and green, respectively. \textbf{C.} MNase-seq read densities for the same regions, ordered, aligned and oriented as in B. \textbf{D.} Predicted SP1 binding motifs for the same regions, ordered, aligned and oriented as in B. \textbf{E.} Proportion of binding sites within each cluster having a confirmed promoter-associated TSS within +/- 300bp. \textbf{F.} Aggregation profiles for DNase-seq (red), MNase-seq (blue), promoter TSS (green) and CAGE-seq data (violet) for cluster 2 (aligned and oriented as in B). \textbf{G.} Motifs found by MEME-ChIP and Tomtom in the narrow footprints of each cluster. (*) known SP1 interactor, (c) central enrichment. Cluster 2 left and right refer to the left and right footprints seen in \textbf{B}.
Figure and legend taken and adapted from \citep{groux_spar-k:_2019}.}
\label{spark_dnase}
\end{figure}
% supplemental figure 8 from article
\begin{figure}
\centerline{\includegraphics[scale=0.4]{images/ch_spark/supplemental_figure8.png}}
\caption{Nucleosome occupancy, determined by MNase-seq, in bins of 10bp, +/- 1000bp around 79'957 CTCF binding sites in GM12878 cells. \textbf{A} MNase-seq read density around the CTCF binding sites. ChIP-seq peak summits are aligned at position 0. The regions (rows) are ordered according to their resemblance (correlation) to the overall aggregation pattern. \textbf{B} SPar-K data partition. The number of clusters (4) was determined using the elbow method. The cluster labels are indicated by the color ribbons on the left. Within each cluster, the data have been realigned according to the shift and flip information returned by SPar-K and the regions have been ordered according to their resemblance (correlation) to the cluster aggregation pattern. Because of the realignment, ChIP-seq peak summits are no longer aligned at position 0. \textbf{C} Corresponding DNaseI hypersensitivity measured by DNase-seq at the same loci, realigned as in B. \textbf{D} CTCF motif occurrences predicted using a motif scan, at the same loci, realigned as in B. Each predicted binding site, +/- 1kb around a peak, is represented as a point. \textbf{E} Transcription start site (TSS) density at the same loci, realigned as in B. \textbf{F} Cluster 1 (red) aggregation profiles. The original peak coordinates were modified according to the shift and flip values returned by SPar-K and the read densities of the different data types were measured using ChIP-Cor \citep{ambrosini_chip-seq_2016}. For the TSSs and the transcription initiation (CAGE), only the data mapping on the negative strand were used, to monitor transcription firing towards the nucleosome array (towards the left). \textbf{G} Proportions of regions having at least one CTCF motif within +/- 1kb (same motifs as in D), for each cluster. \textbf{H} Proportions of regions having at least one TSS within +/- 1kb (same TSSs as in E), for each cluster.
Figure and legend taken and adapted from \citep{groux_spar-k:_2019}.}
\label{spark_ctcf}
\end{figure}
I applied SPar-K with $K=3$ to DNaseI accessibility profiles (2bp resolution) around 7'206 ChIP-seq SP1-binding peaks (+/-300bp relative to peak summit) in K562 cells (Figure \ref{spark_dnase}A). The results revealed the presence of clear footprints in all the clusters (Figure \ref{spark_dnase}B). To validate these footprints, I checked whether they were consistent with the location of nucleosomes (Figure \ref{spark_dnase}C) and SP1 binding motifs (Figure \ref{spark_dnase}D), which was indeed the case. A \textit{de novo} motif analysis of the narrow footprints seen in Figure \ref{spark_dnase}B with MEME-ChIP and Tomtom \citep{bailey_meme_2009} revealed SP1-related, NFYA/B and GATA motifs (Figure \ref{spark_dnase}G), the latter two being reported interaction partners of SP1. Taken together, these results suggest that SPar-K is able to precisely refocus initially misaligned DNaseI profiles around SP1 binding sites.
The partitioning of SP1 binding regions revealed distinct chromatin landscapes. Cluster 1 (red) grouped binding sites lying between two closely spaced nucleosomes. Cluster 2 (blue) showed a strong asymmetry suggestive of promoter regions, an interpretation supported by the presence of promoter-indicative TSSs and of CAGE tags (Figure \ref{spark_dnase}E and F). Finally, the symmetrical cluster 3 (green) contained binding sites located in large nucleosome-free regions reminiscent of enhancers.
As a second example, I ran the same type of analysis on nucleosome profiles around CTCF binding sites (Figure \ref{spark_ctcf}). Overall, the results confirm observations from Chapter \ref{encode_peaks} and published in \citep{kundaje_ubiquitous_2012}. Strong nucleosome arrays became visible in all classes after realignment, with three out of four showing strong asymmetry in addition.
\section{Conclusions}
SPar-K is a useful partitioning method for moderately misaligned and randomly oriented chromatin regions. Compared to existing methods, it is competitive in terms of accuracy, superior in speed, applicable to a wider range of input signals (not restricted to count data) and easy to use.
