Page MenuHomec4science

ch_smile-seq.tex
No OneTemporary

File Metadata

Created
Wed, May 8, 08:28

ch_smile-seq.tex

\cleardoublepage
\chapter{SMiLE-seq data analysis}
\label{smile_seq}
\markboth{SMiLE-seq data analysis}{SMiLE-seq data analysis}
\addcontentsline{toc}{chapter}{SMiLE-seq data analysis}
%As discussed earlier, deciphering gene regulation necessitates to understand and model TFs sequence specificity. Several assay and technologies already exist to measure TF-DNA interactions in vivo (ChIP-seq \citep{robertson_genome-wide_2007}) and in vitro (HT-SELEX \citep{roulet_high-throughput_2002,zhao_inferring_2009,jolma_multiplexed_2010}, PBM \citep{bulyk_nucleotides_2002,berger_compact_2006}). Each assay has it strengths and weaknesses \citep{stormo_determining_2010}.
%ChIP-seq data allow to potentially list all binding sites of a TF in a given cell type. However, it has two majors short comings. First, it necessitate a proper antibody to perform the ChIP step. Second, because of the presence of other factors in the chromatin, the data produced not only reflect the intrinsic sequence preferences of the TF but also i) positive interactions with other TFs, ii) competition with other TFs and iii) competition with nucleosomes.
%PBM chips are simple to design and manufacture and to get a result from. However their size is constrained, limiting the number of probes and forcing the designer to optimize the sequence of each prob to maximize the number of k-mers contained in a given chip.
%Finally, HT-SELEX is also a powerful tool. It allows to obtain, at once, hundreds to thousands of sequences to which a TF is binding. However, repeated SELEX cycles introduce a double bias : i) strong affinity binding sites are exponentially enriched for and ii) a PCR step in between two cycles inevitably replicates some sequences more than others. Additionally, the presence of PCR/sequencing primers in the DNA fragments limits the effective size of the inserts.
%To help tackle the issue of better characterizing of monomeric, homodimeric and heterodimeric TFsm binding specificity, Prof Bart Deplancke's laboratory at EPFL developed a microfluidic based technology. This assay - selective microfluidics-based ligand enrichment followed by sequence (SMiLE-seq) - was developed in order to minimize hand-on time, to allow massive parallelization of the measurements and to allow the measurement of TF binding over a wide-range of affinities \citep{isakova_smile-seq_2017}.
The following section contains work made in collaboration with Alina Isakova from Prof. Bart Deplancke research group at EPFL and published in \cite{isakova_smile-seq_2017}. I personally made the presented figures and analyses, with the exception of Figure \ref{smile_seq_pipeline}.
\section{Introduction}
\begin{figure}
\begin{center}
\includegraphics[scale=0.25]{images/ch_smile-seq/figure1.jpg}
\captionof{figure}{\textbf{SMiLE-seq pipeline :} \textbf{a} Schematic representation of the experimental setup. A snapshot of three units of the microfluidic device is shown. In vitro transcribed and translated bait TF, target double-stranded DNA, and a nonspecific competitor poly-dIdC are mixed and pipetted in one of the wells of the microfluidic device. The mixtures are then passively pumped in the device (bottom panel). Newly formed TF–DNA complexes are trapped under a flexible polydimethylsiloxane membrane, and unbound molecules as well as molecular complexes are washed away (upper panel). Left, schematic representation of three individual chambers. Right, corresponding snapshots of an individual chamber taken before and after mechanical trapping. \textbf{b} Data processing pipeline. The bound DNA is eluted from all the units of the device simultaneously and collected in one tube. Recovered DNA is amplified and sequenced. The sequencing reads are then demultiplexed, and a seed sequence is identified for each sample. This seed is then used to initialize a probability matrix representing the sequence specificity model for the given TF. The model parameters are then optimized using a Hidden Markov Model-based motif discovery pipeline.
Figure and legend taken and adapted from \citep{isakova_smile-seq_2017}.}
\label{smile_seq_pipeline}
\end{center}
\end{figure}
Deciphering TF binding specificity is key to understand the regulation of gene expression. Several technologies exist to study TF specificity in vitro such as mechanically induced trapping of molecular interactions (MITOMI \cite{maerkl_systems_2007}), protein binding-microarray (PBM \cite{berger_universal_2009}) or hight throughput systematic enrichment of ligangs by exponential enrichment (HT-SELEX \cite{zhao_inferring_2009,jolma_multiplexed_2010}).
Because TFs can bind as monomers, homodimers, heterodimers and as higher order complexes, it is critical to have suited technologies to interrogate their binding specificity. Nonetheless, because of combinatorials, this undertaking represents a staggering amount of work. In order to allow robust and easy measurement of TF monomers and dimers sequence specificity, over a wide range of binding affinities, Prof. Bart Deplancke research group at EPFL developed the selective microfluidics-based ligand enrichment followed by sequence (SMiLE-seq, \cite{isakova_smile-seq_2017}). An overview of the SMiLE-seq data production and processing procedures is shown in Figure \ref{smile_seq_pipeline}. Overall, three major conceptual features should be highlighted. First, to tackle the scalability issue, the SMILe-seq plateform core is microfluidic device that contains hundreds of individual wells allowing to run as many different TF-DNA interaction assays at the same time. Second, the SMiLE-seq assay does not perform an iterative enrichment, as HT-SELEX does. SMiLE-seq is a one round selection system. Third, because TF-DNA interactions happens over a wide range of affinities, the weakest interaction can be lost during washing. In order to prevent this, complexes are protected under a membrane during the washing step.
In order to assess the quality of this new technology and its ability to produce relevant data to model TF binding, I ran de novo motif discovery analyses on these data and benchmark their predictive performances. Our laboratory developed a Hidden Markov Model (HMM) based motif discovery method and a binding motif evaluation tool. Both are available on the laboratory web portal \url{http://ccg.vital-it.ch/pwmtools/}. My involvment in this project was to develop a scalable and efficient pipeline, at the backend of our server, to run the motif discovery and benchmark steps and to analyze the results.
\section{Hidden Markov Model Motif discovery}
\label{section_smileseq_hmm}
\begin{figure}
\begin{center}
\includegraphics[scale=0.2]{images/ch_smile-seq/figure_hmm.png}
\captionof{figure}{\textbf{Example of a Hidden Markov model :} initial HMM representation with a seed sequence 'ATGCC'. The upper Markov chain models + strand motif containing sequences, the middle one - strand motif containing sequences and the lower zero motif occurrence sequences. The FB, FE, RB and RE positions represents positions in the sequence that occur before and after the binding site on the forward and reverse strand. For these nodes, a self transition exist to allow the binding site to occur at a variable distance from the beginning and the end of the sequence. Once transiting toward the 1st position of the binding site, the next transition is forced toward the 2nd position in the binding site, and so on until the end of the binding site. The + strand and - strand Markov chains emission parameters are paired together (they have the same values), as represented by the grey dashed lines. The transition probabilities in red are not subjected to the Baum-Welch training. Finally, a binding model represented as a probability matrix is composed of the emission probabilities at the binding site positions.
Figure and legend taken and adapted from \citep{isakova_smile-seq_2017}}
\label{smile_seq_hmm}
\end{center}
\end{figure}
This motif discovery method is based on an initial work performed by Philipp Bucher and Giovanna Ambrosini for the DREAM5 TF-DNA Motif Recognition Challenge \citep{weirauch_evaluation_2013} - which was awarded the 2$^{nd}$ place in this contest, attesting of its performance.
This motif discovery method models the DNA sequences using an HMM. The input is composed of $N$ sequences of length $L$ and the motif is represented as a LPM \textbf{M} of dimensions $4 \times L'$ where $L' \leq L$ and with the constrain $\sum_{i=1}^4 m_{i,j} = 1$.
The sequences are modeled as a mixture of a set of consecutive positions belonging to the binding site (to which the TF binds) and of positions outside the binding site. Since the position of the binding site in each sequence is unknown, they have to be guessed in order to align the binding sites and compute the LPM. To this end, we used an HMM that can handle the hidden information about the position of the binding site within a longer sequence. The emission parameters are the base probabilities at each position inside the binding site.
Additionally, to account for experimental biases, such as unspecific binding, a sequence can bear zero binding site. The entire model is composed of three Markov chains representing each path. Because a motif can occur on the + or the - strand of the DNA sequence, the modeling of the binding site is done by two paired Markov chains. The modification of a parameter in one of these two paired chains is propagated to the equivalent parameters in the other paired chain. An example of HMM is displayed in Figure \ref{smile_seq_hmm}.
The parameter estimation given the data is performed using the Baum-Welch algorithm. Mamot \citep{schutz_mamot:_2008}, a dedicated computational framework for HMMs, is used to handle all the computations.
Because, the Baum-Welch algorithm performs an iterative optimization of the model parameters, it needs a starting state. The motif length and the starting emission parameters are estimated using an over-represented kmer analysis. The motif length was set to the best kmer length and the starting emission probabilities are set to 0.7 for the base present in the kmer and 0.1 for the three others.
\section{Binding motif evaluation}
\label{section_smileseq_pwmeval}
To evaluate and compare different binding models originating from different libraries - computed with different algorithms and data - with the SMiLE-seq data derived binding models, I used PWMEval-ChIP-peak (formerly named PWMEval), a program using a methodology proposed by Orenstein and Shamir \citep{orenstein_comparative_2014} that has been developed in-house and is available at \url{https://ccg.epfl.ch/pwmtools/pwmeval_chippeak.php}. In order to run massive batch analyses, I developed a dedicated pipeline that was run at the backend of our servers.
PWMEval takes as input a set $S^{pos}$ of $N'$ experimentally validated binding sites sequences of length $L''$ - extracted from ChIP-seq peaks - and a set $S^{neg}$ of $N'$ presumably non-binding-sites of length $L''$ - extracted 300bp downstream of the corresponding peak sequences - and a probability matrix \textbf{M} describing the binding motif.
Each positive set sequence $S^{pos}_{i}$ is scored using:
\begin{equation}
\begin{aligned}
score^{pos}_{i} & = \sum_{l=1}^{L''-L'+1} \prod_{k=1}^{L'} \frac{m_{b,k}}
{p_{b}} \\
\text{with } b & =
\begin{cases}
1 & \text{if $s^{pos}_{i,l+k-1} = $ A}.\\
2 & \text{if $s^{pos}_{i,l+k-1} = $ C}.\\
3 & \text{if $s^{pos}_{i,l+k-1} = $ G}.\\
4 & \text{if $s^{pos}_{i,l+k-1} = $ T}.\\
\end{cases} \\
\end{aligned}
\label{smile_seq_pwmeval_score}
\end{equation}
The same is done for each negative set sequence $S^{neg}_{i}$, resulting in the creation of two vectors of scores of length $N'$ - one for each sequence set. Both vectors are concatenated into a unique vector of scores. Two vector of size $N'$, containing the class labels ($N'$ times 0 or 1) are created and concatenated as well. Then the vector of labels is sorted according to the decreasing sorting order of the score vector.
If the binding model allows to perfectly segregate the positive from the negative sequences, then we expect the positive sequence scores to be larger than the negative sequence scores. Thus the positive labels should all be at the beginning of the label vector and the negative labels at the end. The propensity of the model to recognize true binding sites from other sequences can be measured by computing the area under the curve (AUC) of the receiver operating characteristic (ROC). Given the list of sorted labels, the true positive and true negative discovery rates can be computed and the AUC-ROC can be computed as $U/N^{2}$ where $U$ is the Mann-Whitney $U$ statistic for the true positive and true negative discovery rates.
% \SetKwProg{Fn}{}{\{}{}\SetKwFunction{Function}{float AUC-ROC}%
% \begin{algorithm}[H]
% \label{smile_seq_algo_auc}
% \Fn{\Function{LABELS}}
% { \KwData{$LABELS$ a vector of $2N$ score-ordered labels encoded as 0 for true negative and 1 for true positives.}
% \KwResult{the ROC-AUC}
% $TP =$ a vector of $2N$ 0's \;
% $TN =$ a vector of $2N$ 0's \;
% $n_{tp} =$ 0 \;
% $n_{tn} =$ 0 \;
% \For{$i=1$ to $2N$}
% { \If{$LABELS_{i}$ == 0}
% { $n_{tn} +=$ 1 \; }
% \Else
% { $n_{tp} +=$ 1 \; }
% $TP_{i} = n_{tp} / N$ \;
% $TN_{i} = n_{tn} / N$ \;
% }
% $AUC = \frac{U}{N^{2}}$ where $U$ is the Mann-Whitney $U$ statistic for the $TP$ and $TN$ vectors \;
% \Return{$AUC$}
% }
% \caption{Computes the AUC-ROC}
% \end{algorithm}
\section{Results}
\begin{figure}
\begin{center}
\includegraphics[scale=0.2]{images/ch_smile-seq/figure2b_3a.png}
\captionof{figure}{\textbf{Predictive power of SMiLE-seq :} \textbf{A} SMiLE-seq detived motifs compared to that of previously reported motifs that are retrievable from the indicated databases. For each motif, the AUC-ROC values on the 500 top peaks of the ENCODE ChIP-seq data sets for the corresponding TF was computed. The heatmap represents the AUC values computed for each method on the respective ChIP-seq data sets that were selected based on the highest mean AUC values among all five models. \textbf{B} the predictive performances of MAX and YY1 binding models were assessed using subsets of binding sites of decreasing affinities. Inside each peak list, the peaks were ranked by score and subsets of 500 peaks were selected. Peaks 1-500 have the highest affinity, then peaks 501-1000, and so on. The boxplots indicate the distribution of AUC-ROC obtained over all available peak-lists.
Figures and legends taken and adapted from \citep{isakova_smile-seq_2017}}
\label{smileseq_auc}
\end{center}
\end{figure}
In order to assess the robustness of SMiLE-seq, I derived binding models using the HMM motif discovery procedure for all TFs for which ChIP-seq peaks have been released by the ENCODE Consortium. For each ChIP-seq peak list, the 500 best peaks were selected based on their signal enrichment score (attributed by ENCODE). The peak score reflects the TF occupancy and likely the TF binding affinity to this site. Whenever several ChIP-seq peak lists were available for a given TF, the peak list achieving the highest mean AUC value over all models available for that TF was considered.
For each TF, LPMs equivalent to the SMiLE-seq derived LPM were selected from the HT-SELEX Jolma \citep{jolma_dna-binding_2013}, JASPAR 2014 \citep{mathelier_jaspar_2014} and HOCOMOCOv10 \citep{kulakovskiy_hocomoco:_2016} motif collections and were compared to the SMiLE-seq derived LPM using the AUC-ROC procedure procedure (Figure \ref{smileseq_auc} A). This analysis reveled that, in the majority of the cases, the SMiLE-seq derived binding models were doing at least as good (MAX, CEBPb, CTCF, NFkB, ZSCAN1 TCF7, JunD, RXRa) or better (YY1, ZEB1) than the available models.
% I compared the SMiLE-seq derived binding models with the equivalent TF binding models from the HT-SELEX Jolma \citep{jolma_dna-binding_2013}, JASPAR 2014 \citep{mathelier_jaspar_2014} and HOCOMOCOv10 \citep{kulakovskiy_hocomoco:_2016} motif collections using the AUC-ROC procedure (Figure \ref{smileseq_auc} A). This analysis reveled that, in the majority of cases, the SMILe-seq derived binding models were doing at least as good (MAX, CEBPb, CTCF, NFkB, ZSCAN1 TCF7, JunD, RXRa) or better (YY1, ZEB1) than available models.
To verify that SMiLE-seq was indeed able to measure binding over a wide functional range of affinities, I computed AUC-ROC using decreasing affinity subsets of ChIP-seq peaks (Figure \ref{smileseq_auc} B and Figure \ref{suppl_smileseq_auc_2}B). For high affinity peak subsets, SMiLE-seq derived models scored as well as others but tend to become better than other models for lower affinity peaks subsets.
To further emphasize the quality of the SMiLE-seq technology and support conceptual differences with HT-SELEX, binding models were trained from HT-SELEX data using the HMM motif discovery procedure. The rational was that if the model performances could be attributed to different motif discovery methods, running the HMM motif discovery one on all data should get us ride from this confounding factor. Because SMiLE-seq does not perform iterative enrichment of high affinity binders, 1st cycle HT-SELEX data were used. Here again, SMiLE-seq derived models were at least as good as HT-SELEX models (Figure \ref{suppl_smileseq_auc_2} A).
\section{Conclusions}
These results demonstrated that SMiLE-seq is a valuable plateform to run in-vitro TF-DNA binding assays. More specifically, this was demonstrated by showing that SMiLE-seq derived models are as good or better than the models derived from ChIP-seq data (such as HOCOMOCO or JASPAR models) or from in vitro data (HT-SELEX). Moreover, it was also possible to support the fact that SMiLE-seq was able to capture interaction over a wide range of affinities. This was shown through the use of differential binding affinity sites in the AUC-ROC analyses.
All together, these results support that SMiLE-seq is a valuable technology and a strong competitor for other in vitro TF-DNA interaction assays such as HT-SELEX.

Event Timeline