Page MenuHomec4science

ch_smile-seq.tex
No OneTemporary

File Metadata

Created
Mon, May 20, 14:51

ch_smile-seq.tex

\cleardoublepage
\chapter{SMiLE-seq data analysis}
\markboth{SMiLE-seq data analysis}{SMiLE-seq data analysis}
\addcontentsline{toc}{chapter}{SMiLE-seq data analysis}
%As discussed earlier, deciphering gene regulation necessitates to understand and model TFs sequence specificity. Several assay and technologies already exist to measure TF-DNA interactions in vivo (ChIP-seq \citep{robertson_genome-wide_2007}) and in vitro (HT-SELEX \citep{roulet_high-throughput_2002,zhao_inferring_2009,jolma_multiplexed_2010}, PBM \citep{bulyk_nucleotides_2002,berger_compact_2006}). Each assay has it strengths and weaknesses \citep{stormo_determining_2010}.
%ChIP-seq data allow to potentially list all binding sites of a TF in a given cell type. However, it has two majors short comings. First, it necessitate a proper antibody to perform the ChIP step. Second, because of the presence of other factors in the chromatin, the data produced not only reflect the intrinsic sequence preferences of the TF but also i) positive interactions with other TFs, ii) competition with other TFs and iii) competition with nucleosomes.
%PBM chips are simple to design and manufacture and to get a result from. However their size is constrained, limiting the number of probes and forcing the designer to optimize the sequence of each prob to maximize the number of k-mers contained in a given chip.
%Finally, HT-SELEX is also a powerful tool. It allows to obtain, at once, hundreds to thousands of sequences to which a TF is binding. However, repeated SELEX cycles introduce a double bias : i) strong affinity binding sites are exponentially enriched for and ii) a PCR step in between two cycles inevitably replicates some sequences more than others. Additionally, the presence of PCR/sequencing primers in the DNA fragments limits the effective size of the inserts.
%To help tackle the issue of better characterizing of monomeric, homodimeric and heterodimeric TFsm binding specificity, Prof Bart Deplancke's laboratory at EPFL developed a microfluidic based technology. This assay - selective microfluidics-based ligand enrichment followed by sequence (SMiLE-seq) - was developed in order to minimize hand-on time, to allow massive parallelization of the measurements and to allow the measurement of TF binding over a wide-range of affinities \citep{isakova_smile-seq_2017}.
The following section contains work made in collaboration with Alina Isakova Prof. Bart Deplancke research group at EPFL and published in \cite{isakova_smile-seq_2017}. I personally realized the presented figures and analyses, with the exception of Figure \ref{smile_seq_pipeline}.
\subsection{Introduction}
\begin{figure}
\begin{center}
\includegraphics[scale=0.25]{images/ch_smile-seq/figure1.jpg}
\captionof{figure}{\textbf{SMiLE-seq pipeline :} \textbf{a} Schematic representation of the experimental setup. A snapshot of three units of the microfluidic device is shown. In vitro transcribed and translated bait TF, target dsDNA, and a nonspecific competitor poly-dIdC are mixed and pipetted in one of the wells of the microfluidic device. The mixtures are then passively pumped in the device (bottom panel). Newly formed TF–DNA complexes are trapped under a flexible polydimethylsiloxane membrane, and unbound molecules as well as molecular complexes are washed away (upper panel). Left, schematic representation of three individual chambers. Right, corresponding snapshots of an individual chamber taken before and after mechanical trapping. \textbf{b} Data processing pipeline. The bound DNA is eluted from all the units of the device simultaneously and collected in one tube. Recovered DNA is amplified and sequenced. The sequencing reads are then demultiplexed, and a seed sequence is identified for each sample. This seed is then used to initialize a probability matrix representing the sequence specificity model for the given TF. The model parameters are then optimized using a Hidden Markov Model-based motif discovery pipeline.
Figure and legend taken and adapted from \citep{isakova_smile-seq_2017}.}
\label{smile_seq_pipeline}
\end{center}
\end{figure}
Deciphering TF binding specificity is key to understand gene regulation. Several technologies exists to study TF specificity in vitro such as mechanically induced trapping of molecular interactions (MITOMI \cite{maerkl_systems_2007}), protein binding-microarray (PBM \cite{berger_universal_2009}) or hight throughput systematic enrichment of ligangs by exponential enrichment (HT-SELEX \cite{zhao_inferring_2009,jolma_multiplexed_2010}).
Because TFs can bind as monomer, homodimer, heterodimer and as higher order complexes, it is critical to have suited technologies to interrogate their binding specificity. Nonetheless, because of combinatorials, this undertaking represent a staggering amount of work. In order to allow robust and easy measurement of TF monomers and dimers sequence specificity, over a wide range of binding affinities, Prof. Bart Deplancke research group at EPFL developed selective microfluidics-based ligand enrichment followed by sequence (SMiLE-seq, \cite{isakova_smile-seq_2017}). An overview of the SMiLE-seq data production and processing procedures is shown in Figure \ref{smile_seq_pipeline}. Overall, three major conceptual featzres should be highlighted. First, to tackle the scalability issue, the SMILe-seq plateform core is microfluidic device that contains hundreds of individual wells allowing to run as many different TF-DNA interaction assays at the same time. Second, the SMiLE-seq assay does not perform an iterative enrichment, as HT-SELEX does. SMiLE-seq is a one round selection system. Third, because TF-DNA interactions happens over a wide range of affinities, the weakest interaction can be lost during washing. In order to prevent this, complexes are protected under a membrane during the washing step.
In order to assess the quality of this new technology and its ability to produce relevant data to model TF binding, I ran de novo motif discovery analyses on these data and benchmark their predictive performances. Our laboratory developed a Hidden Markov Model (HMM) based motif discovery method and a binding motif evaluation tool. Both are available on the laboratory web portal \url{http://ccg.vital-it.ch/pwmtools/}. My involvment in this project was to develop a scalable and efficient pipeline, at the backend of our server, to run the motif discovery and benchmark steps and to analyze the results.
\subsection{Hidden Markov Model Motif discovery}
\begin{figure}
\begin{center}
\includegraphics[scale=0.2]{images/ch_smile-seq/figure_hmm.png}
\captionof{figure}{\textbf{Example of a Hidden Markov model :} initial HMM representation with a seed sequence 'ATGCC'. The upper Markov chain models + strand motif containing sequences, the middle one - strand motif containing sequences and the lower zero motif occurrence sequences. The FB, FE, RB and RE positions represents positions in the sequence that occur before and after the binding site on the forward and reverse strand. For these nodes, a self transition exist to allow the binding site to occur at a variable distance from the beginning and the end of the sequence. Once transiting toward the 1st position of the binding site, the next transition is forced toward the 2nd position in the binding site, and so on until the end of the binding site. The + strand and - strand Markov chains emission parameters are paired together (they have the same values), as represented by the grey dashed lines. The transition probabilities in red are not subjected to the Baum-Welch training. Finally, a binding model represented as a probability matrix is composed of the emission probabilities at the binding site positions.
Figure and legend taken and adapted from \citep{isakova_smile-seq_2017}}
\label{smile_seq_hmm}
\end{center}
\end{figure}
This program has been developed an implemented by Philipp Bucher and Giovanna Ambrosini - a senior scientist of the laboratory - for the DREAM5 TF-DNA Motif Recognition Challenge \citep{weirauch_evaluation_2013}. This motif discovery method was awarded the 2$^{nd}$ place in this contest, attesting of its performances.
This motif discovery method models the DNA sequences using an HMM. The sequences are composed of $N$ sequences of length $L$ and the motif is represented as a probability matrix \textbf{M} of dimensions $4 \times L'$ where $L' \leq L$ and with the constrain $\sum_{i=1}^4 m_{i,j} = 1$.
A sequence is modeled as a mixture of a set of consecutive positions belonging to the binding site (to which the TF binds) and of positions outside the binding site. Since the position of the binding site in each sequence is unknown, it is not possible to align the binding sites and to compute its base composition. Instead, we use an HMM where the hidden state is whether a position in a sequence belongs to the binding site and the emission parameters are the base probabilities at this position inside the binding site.
Additionally, to account for experimental biases, such as unspecific binding, a sequence can zero binding site. The entire model is composed of three Markov chains representing each path. Because a motif can occur on the + or the - strand of the DNA sequence, the modeling of the binding site is done by two paired Markov chains. The modification of a parameter in one of these two paired chains is propagated to the equivalent parameters in the other paired chain. An example of HMM is displayed in Figure \ref{smile_seq_hmm}.
The parameter estimation given the data is performed using the Baum-Welch algorithm. Mamot \citep{schutz_mamot:_2008}, a dedicated computational framework for HMMs, is used to handle all the computations.
Because, the Baum-Welch algorithm performs an iterative optimization of the model parameters, it needs a starting state. The motif length and the starting emission parameters are estimated using an over-represented kmer analysis. The motif length was set to the best kmer length and the starting emission probabilities are set to 0.7 for the base present in the kmer and 0.1 for the three others.
\subsection{Binding motif evaluation}
To evaluate and compare different binding models coming from different libraries - computed with different algorithms and data - with the SMiLE-seq data derived binding models, I used PWMEval-ChIP-peak (formerly named PWMEval), a program using a methodology proposed by Orenstein and Shamir \citep{orenstein_comparative_2014} that has been developed in-house and is available at \url{https://ccg.epfl.ch/pwmtools/pwmeval_chippeak.php}. In order to run massive batch analyses, I developed a dedicated pipeline that was run at the backend of our servers.
PWMEval takes as input a set $S^{pos}$ of $N$ experimentally validated binding sites sequences of length $L$ - extracted from ChIP-seq peaks - and a set $S^{neg}$ of $N$ presumably non-binding-sites of length $L$ - extracted 300bp downstream of the corresponding peak sequences - and a probability matrix \textbf{M} describing the binding motif.
Each positive set sequence $S^{pos}_{i}$ is scored using:
\begin{equation}
\begin{aligned}
score^{pos}_{i} & = \sum_{l=1}^{L-L'+1} \prod_{k=1}^{K} \frac{m_{b,k}}
{p_{b}} \\
\text{with } b & =
\begin{cases}
1 & \text{if $s^{pos}_{i,l+k-1} = $ A}.\\
2 & \text{if $s^{pos}_{i,l+k-1} = $ C}.\\
3 & \text{if $s^{pos}_{i,l+k-1} = $ G}.\\
4 & \text{if $s^{pos}_{i,l+k-1} = $ T}.\\
\end{cases} \\
\end{aligned}
\label{smile_seq_pwmeval_score}
\end{equation}
The same is done for each negative set sequence $S^{neg}_{i}$, resulting in the creation of two vectors of scores of length $N$ - one for each sequence set. Both vectors are concatenated into a unique vector of scores. Two vector of size $N$, containing the class labels ($N$ times 0 or 1) are created and concatenated as well. Then the vector of labels is sorted according to the decreasing sorting order of the score vector.
If the binding model allows to perfectly segregate the positive from the negative sequences, then we expect the positive sequence scores to be larger than the negative sequence scores. Thus the positive labels should all be at the beginning of the label vector and the negative labels at the end. The propensity of the model to recognize true binding sites from other sequences can be measured by computing the area under the curve (AUC) of the receiver operating characteristic (ROC). Given the list of sorted labels, the true positive and true negative discovery rates can be computed and the AUC-ROC can be computed using the following procedure :
\SetKwProg{Fn}{}{\{}{}\SetKwFunction{Function}{float AUC-ROC}%
\begin{algorithm}[H]
\label{smile_seq_algo_auc}
\Fn{\Function{LABELS}}
{ \KwData{$LABELS$ a vector of $2N$ score-ordered labels encoded as 0 for true negative and 1 for true positives.}
\KwResult{the ROC-AUC}
$TP =$ a vector of $2N$ 0's \;
$TN =$ a vector of $2N$ 0's \;
$n_{tp} =$ 0 \;
$n_{tn} =$ 0 \;
\For{$i=1$ to $2N$}
{ \If{$LABELS_{i}$ == 0}
{ $n_{tn} +=$ 1 \; }
\Else
{ $n_{tp} +=$ 1 \; }
$TP_{i} = n_{tp} / N$ \;
$TN_{i} = n_{tn} / N$ \;
}
$AUC = \frac{U}{N^{2}}$ where $U$ is the Mann-Whitney $U$ statistic for the $TP$ and $TN$ vectors \;
\Return{$AUC$}
}
\caption{Computes the AUC-ROC}
\end{algorithm}
\subsection{Results}
\begin{figure}
\begin{center}
\includegraphics[scale=0.2]{images/ch_smile-seq/figure2b_3a.png}
\captionof{figure}{\textbf{Predictive power of SMiLE-seq :} \textbf{A} the motifs compared to that of previously reported motifs that are retrievable from the indicated databases. For each motif, the AUC-ROC values on the 500 top peaks of the ENCODE ChIP-seq data sets for the corresponding TF was computed. The heatmap represents the AUC values computed for each method on the respective ChIP-seq data sets that were selected based on the highest mean AUC values among all five models. \textbf{B} the predictive performances of MAX and YY1 binding were assessed using subsets of binding sites of decreasing affinities. Inside each peak list, the peaks were ranked by score and subsets of 500 peaks were selected. Peaks 1-500 have the highest affinity, then peaks 501-1000, and so on. The boxplots indicate the distribution of AUC-ROC obtained over all available peak-lists.}
\label{smileseq_auc}
\end{center}
\end{figure}
In order to assess the robustness of SMiLE-seq, I derived binding models using the HMM motif discovery procedure for all TFs for which ChIP-seq peaks have been released by the ENCODE Consortium were available. For each ChIP-seq peak list, the 500 best peaks were selected based on their confidence score (attributed by ENCODE). The peak score reflects TF occupancy and likely the TF binding affinity to this site. Whenever several ChIP-seq peak lists were available for a given TF, the peak list achieving the highest mean AUC value over all models for that TF was considered.
I compared the SMiLE-seq derived binding models with the equivalent TF binding models from the HT-SELEX Jolma \citep{jolma_dna-binding_2013}, JASPAR 2014 \citep{mathelier_jaspar_2014} and HOCOMOCOv10 \citep{kulakovskiy_hocomoco:_2016} motif collections using the AUC-ROC procedure (Figure \ref{smileseq_auc} A). This analysis reveled that, in the majority of cases, the SMILe-seq derived binding models were doing at least as good (MAX, CEBPb, CTCF, NFkB, ZSCAN1 TCF7, JunD, RXRa) or better (YY1, ZEB1) than available models.
To verify that SMiLE-seq was indeed able to measure binding over a wide functional range of affinities, I computed AUC-ROC using decreasing affinity subsets of ChIP-seq peaks (Figure \ref{smileseq_auc} B and Figure \ref{suppl_smileseq_auc_2}B). For high affinity peak subsets, SMiLE-seq derived models scored as well as others but tend to become better than other models for lower affinity peaks subsets.
To further emphasize the quality of the SMiLE-seq technology and support conceptual differences with HT-SELEX, binding models were trained from HT-SELEX data using the HMM motif discovery procedure. The rational was that if the model performances could be attributed to different motif discovery methods, running the HMM motif discovery one on all data should get us ride from this confounding factor. Because SMiLE-seq does not perform iterative enrichment of high affinity binders, 1st cycle HT-SELEX data were used. Here again, SMiLE-seq derived models were at least as good as HT-SELEX models (Figure \ref{suppl_smileseq_auc_2} A).
\subsection{Conclusions}
These results demonstrated that SMiLE-seq is a valuable plateform to run in-vitro TF-DNA binding assays. More specifically, this was demonstrated by showing that SMiLE-seq derived models are as good or better than models derived from ChIP-seq data (such as HOCOMOCO or JASPAR models) or from in vitro data (HT-SELEX). Moreover, it was also possible to support the fact that SMiLE-seq was able to capture interaction over a wide range of affinities. This was shown through the use of differential binding affinity sites in the AUC-ROC analyses.
All together, these results support that SMiLE-seq is a valuable technology and a strong competitor for other in vitro TF-DNA interaction assays such as HT-SELEX.

Event Timeline