Page MenuHomec4science

ch_encode_peaks.tex
No OneTemporary

File Metadata

Created
Fri, May 17, 09:25

ch_encode_peaks.tex

\cleardoublepage
\chapter{ENCODE peaks analysis}
\label{encode_peaks}
\markboth{ENCODE peaks analysis}{ENCODE peaks analysis}
\addcontentsline{toc}{chapter}{ENCODE peaks analysis}
% Modeling a TF sequence specificity only allows to partially understand how a TF binds a region. Indeed, scanning a genome using a PWM for putative binding sites often returns tens of thousands of sites with only a subset of them being really occupied within a cell. Other elements such as chromatin organization and composition are likely to drive TF binding. Thus gaining a better understanding about the chromat
% The exact mechanisms at play remain unclear but nucleosome occupancy is thought to shelter DNA sequence - as some bases are facing the core octamer or to distort the DNA structure - impeding sequence recognition by TFs. In vivo, evidences for competition between TFs and nucleosomes have been collected. Computational simulations accounting for simultaneous multiple factor binding on DNA suggested that nucleosome occupancy and TFs binding influence each other and that TF binds nucleosome depleted regions \cite{wasson_ensemble_2009}.
As discussed in Chapter \ref{intro}, the structure of the chromatin has a deep impact on TF binding. It is now clear that nucleosome occupancy fulfills more than a packaging role. It can also acts as a barrier to impede DNA reading processes and compete with TFs for sequence occupancy. Thus gaining a better understanding of how chromatin is organized around TF binding sites is crucial to understand TF binding beyond their sequence specificity only.
In an effort to better understand how the genome is organized and how its functions are fulfilled,
the ENCODE Consortium \citep{consortium_integrated_2012} released an impressive collection of coherent data representing an unprecedented picture of the chromatin in several human cell lines.
The GM12878 cells were chosen as one of the highest priority cell line. GM12878 retained the ability to divide but show a normal karyotype - unlike HeLa cells. Additionally, their genome has been sequences. All together, these features make of GM12878 cells a good model for genomic studies.
\section{Data}
% number of peaks per dataset
\begin{figure}
\begin{center}
\includegraphics[scale=0.3]{images/ch_encode_peaks/peaklist_peaknumber_GM12878.png}
\captionof{figure}{\textbf{Number of peaks in GM12878} called by ENCODE for each TF ChIP-seq experiment. The different TFs are colored by type, as defined by \citep{cheng_understanding_2012} : sequence specific TF (TFSS), chromatin structure (ChromStr) and others. The horizontal dashed lines indicate 20'000 and 40'000.}
\label{encode_peaks_gm12878_peak_number}
\end{center}
\end{figure}
% proportion of peaks with motif per dataset
\begin{figure}
\begin{center}
\includegraphics[scale=0.3]{images/ch_encode_peaks/peaklist_proportions_GM12878.png}
\captionof{figure}{\textbf{Proportion of peaks with a motif in GM12878}, for each TF ChIP-seq experiment, in green. Assuming that a TF binds to DNA through its motif, the motif should be nearby the peak center. Thus the center of each peak was scanned using a PWM describing the TF binding specificity. Each TF was associated to a log-odd PWM contained either in JASPAR Core vertebrate 2014 \citep{mathelier_jaspar_2014}, HOCOMOCO v10 \citep{kulakovskiy_hocomoco:_2016} or Jolma \citep{jolma_dna-binding_2013} collection. If a motif instance with a score corresponding to a pvalue higher or equal to $1\cdot10^{-4}$ could be found, the peak was considered bearing a motif. The different TFs are colored by type, as defined by \citep{cheng_understanding_2012} : sequence specific TF (TFSS), chromatin structure (ChromStr) and others. The horizontal dashed line indicates 0.5.}
\label{encode_peaks_gm12878_motif_prop}
\end{center}
\end{figure}
During its production phase in 2012, the ENCODE Consortium released ChIP-seq data 53 different TFs, nucleosome occupancy data (MNase-seq, \cite{kundaje_ubiquitous_2012}) and chromatin accessiblity data (DNase-seq, \citep{thurman_accessible_2012}) that were generated with a depth of coverage in GM12878 cells. The ENCODE Consortium also released ChIP-seq peaks called using a uniform processing pipeline \citep{gerstein_architecture_2012}. These peaks account for i) technical variability as they they are called from technical replicates and ii) inter peak caller discrepancies as several peak callers results were integrated together as part of the peak calling pipeline. These peaks are thus reproducible and robust to software related biases and can be considered as an excellent standard.
All data were taken from the MGA repository. The ChIP-seq peaks can be found at \url{https://ccg.epfl.ch/mga/hg19/encode/Uniform-TFBS/Uniform-TFBS.html}, the MNase-seq data at \url{https://ccg.epfl.ch/mga/hg19/encode/GSE35586/GSE35586.html} and the DNase-seq at \url{https://ccg.epfl.ch/mga/hg19/encode/UW-DNaseI-HS/UW-DNaseI-HS.html}.
The number of peaks called for each TF was highly variable and likely reflects each factor activity in this cell line (Figure \ref{encode_peaks_gm12878_peak_number}). The most abundant factor in terms of peaks was RUNX3 followed by CTCF. This observation fits to BioGPS \citep{wu_biogps:_2016} data which indicates that both RUNX3 and CTCF have a higher expression in lymphoblast and in B cells compared to other tissues. Moreover, the propensity of each TF to bind through their motifs was also variable, with again CTCF being showing the highest values \ref{encode_peaks_gm12878_motif_prop}.
\section{ChIPPartitioning : an algorithm to identify chromatin architectures}
\label{encode_peaks_chippartitioning}
% Discovering archetypical chromatin architectures over a set of regions of interest - let's say containing a TF binding site in their middle - is a long standing problem in bioinformatics. More formerly, given a matrix $R$ of dimensions $NxL$ containing $N$ vectors of read counts $r_{1}, r_{2}, ..., r_{N}$ of length $L$, each containing the number of reads mapping at a given position in a given region, find $K \leq N$ vectors of length $L' \leq L$ that contain archetypical signals found in the $N$ regions of $R$. This can actually be solved using clustering methods which groups regions that look alike into $K$ groups. The summary of the signal inside each group - for instance the mean signal for the K-means algorithm - can then be interpreted as the archetypical chromatin architectures. Biologically, different organization may reflect different functions.
% First, the $N$ regions of interest are usually aligned with respect to a feature of interest, for instance a TF binding sites. However, he chromatin features of interest - for instance the nucleosomes - may not be aligned from one region to the next. This can originate because i) of the true binding sites being fuzzely distributed around the center of the regions, ii) the chromatin features appear at a varying distance from the region centers or iii) both. Comparing two regions then necessitate to first realign the chromatin features. Second, the regions can show a functional orientation. For instance, TF binding sites have an upstream and a downstream with respect to the bound sequence. Orienting properly the regions is also required to properly compare the chromatin organizations in two regions. Finally, the signal over some regions may be sparse because of a sub-optimal sequencing depth.
% The study of signal distribution over genomic regions has been a quite active field for bulk sequencing experiments during the last decade. Dedicated algorithms \citep{hon_chromasig:_2008,nielsen_catchprofiles:_2012,kundaje_ubiquitous_2012,nair_probabilistic_2014,groux_spar-k:_2019} have been developed to cluster genomic regions based on their distribution of reads.
% Most of these algorithms and softwares deal with some of these issues cited above. However, the algorithm developed by \citep{nair_probabilistic_2014} - which I will call ChIPPartitioning - is probably the best. ChIPPartitioning is a probabilistic partitioning method that softly clusters a sets of genomic regions based on their signal shape (as opposed to the absolute values) resemblance. To ensure proper comparisons between the regions, the algorithm allows to offset one region compare to the other to retrieve a similar signal at different offsets and to flip the signal orientation. Finally, it has been demonstrated to be really robust to sparse data.
As discussed in section \ref{intro_pattern_discovery}, pattern discovery is a long standing bioinformatic problem and several algorithms have been proposed to solve it. ChIPPartitioning \citep{nair_probabilistic_2014} is probably the best of them. It is a probabilistic partitioning algorithm that softly clusters a sets of genomic regions based on their signal shape (as opposed to the absolute values) resemblance. To ensure proper comparisons between the regions, the algorithm allows to offset one region compare to the other to retrieve a similar signal at different offsets and to flip the signal orientation. Finally, it has been demonstrated to be really robust to sparse data.
ChIPPartitioning (a graphic representation of the algorithm can be found further below in Figure \ref{atac_seq_em}) models the signal over $N$ region of length $L$ as having being sampled from a mixture of $K$ different signal models (classes), using $L$ independent Poisson distributions for each region. The number of reads sequenced over this region is then the result of this sampling process. Each class model is represented by a vector $c_{k}$ of size $L' \leq L$ that represent the expected number of reads at each position for that class. These values are thus the Poisson distribution parameters. The number of reads $r_{i,j}$ at position $j$, in a region $i$ is :
\begin{equation}
\label{encode_peaks_eq_em_data_model}
r_{i,j} = \sum_{k=1}^{K} p_{k} \times X_{i,j,k}
\end{equation}
where $p_{k}$ is the probability of the k-th class and $X_{i,j,k}$ the number of reads sampled from $Poisson(\lambda=c_{k,j})$.
In order to discover the $K$ different class models - that are the chromatin signatures to find - in the data, the algorithm proceed to a maximum likelihood estimation of the Poisson distribution parameters $c_{1}, c_{2}, ..., c_{k}$ and the class probabilities $p_{1}, p_{2}, ..., p_{k}$ using an expectation-maximization (EM) framework. During the E-step, the likelihood $P(r_{i}|c_{k})$ of each region $i$, given each class $k$ and a posterior probability $P(c_{k}|r_{i})$ are computed. The posterior probabilities are interpreted as the probability that $r_{i}$ belongs to class $k$. Eventually, during the M-step, the class models $c_{1}, c_{2}, ..., c_{k}$ are updated using :
\begin{equation}
\label{encode_peaks_eq_em_update}
c_{k,j} = \sum_{i=1}^{N} p_{k} \times r_{i,j}
\end{equation}
This procedure is actually a weighted and ungaped data alignment in which the posterior probabilities are the weights with the class models containing the average number of reads at each position of the alignment.
Since each region is computed a probability to belong to each class, it participates to the update of all the class models, with different weights.
If the length of the chromatin signature searched $L'<L$, then the algorithm slides a window along the regions and searched for this signature at each possible offset. This is how it deals with alignment issue. The signal orientation issue is tackled by also performing a searched with the flipped model.
At the end of the process, this algorithm returns a posterior probability matrix of dimensions $NxKxSx2$ with $S=L-L'+1$ corresponding to regions, classes, shift states ($S$) and flip states (forward and reverse). In other words, ChIPPartitioning computes a probability of belonging to each class, to each window (in both orientation) in each region.
Because the estimation of the class model parameters is done using an EM framework, ChIPPartitioning is a heuristic algorithm. The final parameter estimates depend on the starting state (which is set randomly) and on the number of iterations run. Finally, it is worth mentioning that this algorithm is close to the MEME algorithm \citep{bailey_fitting_1994} that models DNA sequences as being sampled from a two class mixture model that represents the DNA motif to find and the noise.
Regarding implementation details, ChIPPartitioning was implemented in R programming language, as it was proposed in the supplemental material of \citep{nair_probabilistic_2014}. Nonetheless, ChIPPartitioning turned out to be relatively slow dued to the quite heavy computations it has to carry out (logarithms, exponentials and probability computations), the intrinsic limitations of the R programming language and the lack of optimization in the implementation.
\subsection{Data realignment}
\label{encode_peaks_data_realign}
ChIPPartitioning computes a set of posterior probabilities and use them to perform the class model updates. As illustrated in Figure \ref{atac_seq_em}, this procedure is actually a weighted and ungaped data alignment in which the posterior probabilities are the weights.
It is absolutely feasible to run a partitioning on a given matrix $A$, for instance MNase-seq read counts, using ChIPPartitioning, and to subsequently use the obtained posterior probabilities to compute the class models, using another data matrix, let us say $B$ of DNase-seq reads.
This procedure allows to realign a dataset $B$ as $A$ in order to co-visualize different types of signals. The only things that should be taken care of is that matrices $A$ and $B$ should have the same dimensions.
In the following sections, this is the procedure that will be used to overlay different types of data for a given partition.
\section{Nucleosome organization around transcription factor binding sites}
% examples of partitions
\begin{figure}
\begin{center}
\includegraphics[scale=0.25]{images/ch_encode_peaks/MNase_profiles.png}
\captionof{figure}{\textbf{Chromatin pattern around TF binding sites in GM12878 :} \textbf{A} For each peaklist, nucleosome occupancy was measured +/- 1kb around each individual TFBS using 10bp bins. The TFBS were then classified into 4 classes according to their nucleosome patterns using a ChIPPartitioning, allowing the patterns to be flipped and shifted. Each TF binding site was assigned a probability to belong to each of the 4 classes with a given values of shift and flip. To assess the extent of a given TF to i) display nucleosomes arrays on its flank and ii) to have nucleosome positioned with respect to its binding sites, array density and shift probability standard deviation have been measured for each class. Classes having a mean array density above 0.4 and a shift probability standard deviation under 3.5 and other custom classes are highlighted. Classes are named using the TF, the laboratory which produced the data and the class number (from 1 to 4). \textbf{B} Examples of class patterns corresponding to some of the highlighted classes for CTCF, ATF3, YY1, EBF1 and ZNF143. MNase profiles (red) were allowed to be shifted and flipped and DNaseI (blue), TSS density (violet) and sequence conservation (green) were overlaid according to MNase classification (taking into account both shift and flip). The y-axis scale represent the proportion of the highest signal for each chromatin pattern.}
\label{encode_peaks_array_measure}
\end{center}
\end{figure}
For each dataset, the peak coordinates were reassigned to the closest TF motif, if any. However dealing with unaligned signal was still necessary. Indeed, it could not be excluded that the differents TFs would not be the anchor of the chromatin organization around them and have nucleosome arrays at variable distances from their binding sites. Furthermore, dealing with the region orientation was also needed because i) all peaks did not contain a motif indicating the directionality of the binding site (Figure \ref{encode_peaks_gm12878_motif_prop}) and ii) as before, the TF binding site may not be the main driving force of the neighboring chromatin organization. However, this pre-processing step, even if it could not resolve entirely this issue, could at least soften it.
To uncover the different nucleosome architectures around TF binding site, one partition per peaklist based on the MNase-seq signal and using ChIPPartitioning were performed. Because the time required to run the partitioning procedure is long and is a linear function of the number of classes, the choice of four classes was a compromise allowing to discover several chromatin architectures while not being computationally to intense. ChIPPartitioning was also given a freedom of shifting of 15 bins (corresponding to -70bp, -60bp, ..., 0bp, ..., +60bp, +70bp) and of flipping. A visual inspection of the results revealed that all classes, for all TFs, show a nucleosome array on at least one of the side of the TF binding site (examples are displayed in Figures \ref{suppl_encode_peaks_em_ctcf}, \ref{suppl_encode_peaks_em_nrf1}, \ref{suppl_encode_peaks_em_cfos} and \ref{suppl_encode_peaks_em_max}). Additionally, it was also possible to see an increased chromatin accessibility and sequence conservation at the level of the binding site. The enhanced chromatin accessibility is compatible with the current view of TFs binding nucleosome depleted regions \citep{kundaje_ubiquitous_2012}. However, the absence of a footprint like signal is explained by the shifting. By shifting and flipping the regions, ChIPPartitioning realigns the signal over these regions, at the cost of unphasing the binding sites.
A noticeable exception to this rule was the early Early B-cell factor 1 (EBF1) that seemed to had nucleosome arrays spanning its binding sites (Figure \ref{encode_peaks_array_measure}B).
In order to explore more carefully to what extent nucleosome arrays may be organized with respect to each TF binding sites, I used the mean array density measure developed by \citep{zhang_canonical_2014}. A class pattern showing well positioned nucleosomes is typically showing sharp regions of strong signal separated by signal depleted regions reflecting of the alternance of nucleosome presence/absence. The method developed by Zhang and colleagues basically searches for strong variations of signal. The higher the score, the most the pattern contains well positioned nucleosomes. On the other hand, the ability of a TF to act as an anchor for arrays organization was measured as the standard deviation of the shift used by ChIPPartitioning. Briefly, it is possible to compute the probability density of the usage of each shift state. Assessing how much the different shift states were used is indicative of how much the individual patterns were aligned at the beginning. A low standard deviation value indicates that the shifting tends to be the same for all binding sites and thus that the nucleosome arrays occur at a fixed - unspecified - distance from the binding site. In this case, the binding site could be the array anchor.
Both values were measured for all classes discovered, for all TFs. The results are displayed in Figure \ref{encode_peaks_array_measure}. First, it was possible to identify a sub-population of classes in which the TF binding site seemed to act as an anchor for the nucleosomes. This represented binding sites for CTCF, RAD21, SMC3, YY1 and ZNF143 (see Figure \ref{encode_peaks_array_measure}A, points 6,8,10,13,14,15,18 and 19). A closer inspection of these class patterns showed a strong DNaseI footprint and a peak of sequence conservation. A DNaseI footprint is a typical pattern - composed of a signal depletion in between two signal enriched regions - revealing a region protected against the action of DNaseI by the binding of a factor. The presence of a clear footprint indicates that the underlying binding sites were aligned, supporting the fact that the binding sites are anchors for the nucleosome organization. This was further supported by the sharp peak of sequence conservation indicating, most likely reflecting the TF motif. Nonetheless, all other classes showed a wide and fuzzy chromatin accessibility pattern, as illustrated by ATF3 in Figure\ref{encode_peaks_array_measure}B, indicating miss-aligned binding sites.
Breast cancer type 1 susceptibility protein (BRCA1) was also identified using this method. The identified class (class 3, see Figure \ref{suppl_encode_peaks_em_brca1}) indeed showed well positioned nucleosomes. However, I decided not to consider this hit for two reasons : i) there was not footprint in the nucleosome depleted region indicating that the sites are not aligned and ii) the ENCODE consortium labeled this peak list as problematic (low reproducibility read coverage).
Finally, it should be noted that noisy MNase-seq patterns were attributed high nucleosome array density scores. Because the nucleosome signal is noisy, it varies a lot and got a good score. Such classes are found in the cloud of points just above the horizontal line on the right of the plot (mostly RNAPIII peak classes). Second, some CTCF binding sites displayed strongly positioned nucleosome, confirming previous reports \citep{kundaje_ubiquitous_2012, fu_insulator_2008}.
Thus even if all classes showed at least one nucleosome array, it seems that most of the TFs are not the force driving the array organization, with the notable exceptions of CTCF, RAD21, SMC3, YY1 and ZNF143.
\section{The case of CTCF, RAD21, SMC3, YY1 and ZNF143}
\label{encode_peaks_section_ctcf_rad21_smc3_yy1_znf143}
\begin{figure}
\begin{center}
\includegraphics[scale=0.25]{images/ch_encode_peaks/colocalization_ctcf.png}
\captionof{figure} {\textbf{ Colocalization with CTCF peaks in GM12878 cells : } \textbf{A} Proportion of peaks for different TFs having a CTCF peak within 10bp, 50bp and 100bp. The colours indicate different TFs. The CTCF peaklist used as reference to assess CTCF presence was CTCF.Sydh (in red), the two RAD21 peaklists are RAD21.Haib and RAD21.Sydh respectively (in blue), the SMC3 peaklist is SMC3.Sydh (in green), the YY1 peaklist is YY1.Haib (in orange) and the ZNF143 peaklist is ZNF143.Sydh (in violet). \textbf{B} Venn diagrams showing the proportion of peaks for each TF with i) an instance of its own motif, ii) a CTCF.Sydh peak within 100bp, iii) both or iv) neither of them. RAD21 and SMC3 are not represented as there is no PWM available to describe their sequence specificity. \textbf{C} ChIPPartitioning classification with shift and flip of MNase patterns +/- 1kb of YY1.Haib peaks using 10bp bins. YY1 peaks with (upper row) and without (lower row) a CTCF peak within 100bp. Two classes were used to account for "typical" and "non-typical" looking MNase patterns. DNaseI (blue), TSS density (violet) and sequence conservation (green) were overlaid according to MNase classification (taking into account both shift and flip). The number at the upper right corner of each plot indicate the overall class probability. The number of YY1 peaks is slightly smaller than in B) because peaks showing no MNase reads were not included in the classification analysis. Peaklists are named using the TF together with the laboratory which produced the data.}
\label{encode_peaks_colocalization_ctcf}
\end{center}
\end{figure}
\begin{figure}
\begin{center}
\includegraphics[scale=0.4]{images/ch_encode_peaks/CTCF_ndr_length_rad212.png}
\captionof{figure} {\textbf{Nucleosome free region at CTCF binding sites} \textbf{a} The length are represented as boxplots. The CTCF binding sites are divided into subgroups according to additional presence of SCM3, RAD21, YY1 or ZNF143. The number of binding sites in each subgroup is indicated in red above the boxplots. The presence of SMC3 only, RAD21 only and SMC3 and RAD21 together are indicated in violet, blue and orange respectively. \textbf{B} The proportion of peaks (in green), in each subgroup, having a TSS within a 1kb.}
\label{encode_peaks_ctcf_ndr}
\end{center}
\end{figure}
Two possible alternative hypotheses can explain the presence of these strong nucleosome arrays around these TFs binding sites. First, each TF has the ability to drive the formation of well spaced nucleosome arrays in their vicinity. Second, all the classes detected contains the same set of genomic regions.
Two obsevations strongly support the second hypothesis. First CTCF is known to interact with the cohesin complex \citep{stedman_cohesins_2008} - composed of SMC1, SMC3, RAD21 and either STAG1 or STAG2 \citep{losada_cohesin_2014} -, with YY1 \citep{donohoe_identification_2007} and with ZNF143 \citep{bailey_znf143_2015}. Second, the YY1 and ZNF143 showed \url{~50}\% and \url{~10}\% of direct binding respectively (Figure \ref{encode_peaks_gm12878_motif_prop}), leaving the possibility of an indirect binding mechanism, for instance through CTCF.
To further confirm this hypothesis, I measured the extent to which CTCF and these other TF peaks co-localized. To do so, each RAD21, SMC1, YY1 and ZNF143 peak was checked for the presence of a CTCF peak. The results, shown in Figure\ref{encode_peaks_colocalization_ctcf}A, support the four already known interaction between CTCF and the cohesin complex members RAD21 and SMC3, between CTCF and YY1 and to a lesser extent and between CTCF and ZNF143. Additionally, for YY1 and ZNF143, the presence of CTCF and of a canonical motif happen at separated peak subsets, as shown in Figure \ref{encode_peaks_colocalization_ctcf}B, suggesting two different binding strategies : i) through a direct recognition of the motif or ii) through another mechanism leading to a co-localization with CTCF - most likely through binding to CTCF.
Peaks are represented by the maximum read density position, as defined by ENCODE. Thus, the effective binding site of these TF can by anywhere in the peak. As a matter of fact, ZNF143 and YY1 may bind close but without direct interaction with CTCF. If SMC3, RAD21, YY1 or ZNF143 physically interact with CTCF and bind as a complex, one prediction would be that an extended nucleosome depleted region (NDR) should be observed to allow these complexes to bind.
In order to verify this hypothesis, I set up a classification method that assigns either a "nucleosome" or a "free" label to each position, in a given region based the MNase-seq signal. Assuming that the center of the CTCF peaks is in a NDR, these positions were labeled as 'free' and from there, the neighboring positions on the left and on the right were classified until finding the first position labeled 'nucleosome' (see Figure \ref{suppl_encode_peaks_ctcf_ndr}). The size spanned by the regions labeled as 'free' were then measured for each CTCF binding site. The NDR lengths were finally grouped according to the presence of RAD21, SCM3, YY1 or ZNF143 (Figure \ref{encode_peaks_ctcf_ndr}).
First, it seems that CTCF binding sites are distributed in two functional groups of regions based on the presence of other interactors : i) promoter distant regions with both RAD21 and SMC3 (the cohesin complex), ii) promoters together with YY1 and/or ZNF143. This segregation likely reflects different functions of CTCF : i) looping related functions with the cohesin complex and ii) a regulator of transcription with other partners. The fact that promoter enriched groups show an increased NDR, can be explained by an enhanced chromatin opening to accommodate for the presence of other TFs and of the RNAPII.
Interestingly the subgroups containing the cohesin complex (in orange in Figure \ref{encode_peaks_ctcf_ndr}A) show a NDR length that is function of the number of TFs present (cohesin < cohesin + YY1/ZNF143 < cohesin + YY1 + ZNF143). Because such these sites are away from promoters, it is really likely that the increased NDR size is only caused by the binding of a larger CTCF complex. Furthermore, their reduced NDR size measured is compatible with the classes of binding sites showing strong nucleosome arrays.
Finally, in order to reveal the nucleosome organization around each subset of peaks, I performed a ChIPPartitioning classification method using two classes, with one of them set to represent a flat signal (and to act as a "waste" class). The aim was to make a clear difference between "typical" and "non-typical" nucleosome organizations. For RAD12, SMC3, YY1 and ZNF143 the results showed that strong nucleosome arrays on both sides and a clear DNaseI footprint are only present when a CTCF is also present, as illustrated for YY1 in Figure \ref{encode_peaks_colocalization_ctcf}C.
Together, these results support the hypothesis that CTCF forms a complex with YY1 and/or ZNF143, additionally than with the cohesin complex. They also support the fact that only CTCF has the property of positioning nucleosome into regular arrays in its vicinity and that any other TF showing such a behaviour is likely binding with or near CTCF. As important, the apparent seggregation in terms of regions bounds by the different CTCF complexes is consistent with the hypothesis that the different functions of CTCF depends on its interactors \citep{ong_ctcf:_2014, ghirlando_ctcf:_2016}.
\section{CTCF and JunD interactomes}
% \begin{figure}
% \begin{center}
% \includegraphics[scale=0.4]{images/ch_encode_peaks/TF_associations.png}
% \captionof{figure}{\textbf{Possible interaction scenarios between TFs} \textbf{A} Indirect co-binding. The TFs dimerize and bind together on DNA. \textbf{B} Indirect co-binding. Both TF dimerize but only one binds the DNA, the other (the blue) is the tethering factor. \textbf{C} Independent co-binding. Both TF bind in close vicinity but without forming a complex. Both TFs may not be necessarily bound at the same time. \textbf{D} Interference. Both motifs partially or totally overlap each other. Whether only one TF or both can bind at the same time is unknown.}
% \label{encode_peaks_tf_association}
% \end{center}
% \end{figure}
\begin{figure}
\begin{center}
\includegraphics[scale=0.4]{images/ch_encode_peaks/ctcf_motif_association.png}
\captionof{figure}{\textbf{CTCF motif association} measured around the binding sites of different TFs. For a each TF, its binding sites, +/- 500bp, were searched for the presence of i) the TF motif and ii) CTCF motif. For each TF, a 2x2 contingency table was created with the number of peaks having i) both motifs, ii) the TF motif only, iii) CTCF motif only and iv) no motif. \textbf{A} Odd ratio (OR) of the exact Fisher test performed on each TF contingency table. The ORs are displayed with their 95\% confidence interval (CI). ORs > 1 - that is, with 1 not part of the 95\%CI - are labeled in green and indicate an association of both motifs more frequent than expected by chance. ORs < 1 are labeled in red and indicate a repulsion of both motifs more frequence than expected by chance. The CTCF dataset ORs are too high to be represented in this plot. \textbf{B} Density of CTCF motif occurrence at the absolute distance of different TF binding sites (peak centers) which also have their own motif present (at distance 0). The rows were standardized and aggregated using the Euclidean distance. \textbf{C} Same as in (B) but for TF binding sites that does not have their own motif. The absence of CTCF motif within the first 70bp around CTCF binding sites is explained by the peak processing (see section \ref{encode_peaks_methods_data}).}
\label{encode_peaks_ctcf_association}
\end{center}
\end{figure}
\begin{table}
\begin{center}
\begin{tabular}{ |c|c|c|l|l|c|c| }
\hline
\multicolumn{7}{|c|}{Curated associations} \\
\hline
TF$_{A}$ & TF$_{B}$ & Motif ass. & Type & Binder & Reported & Validated \\
\hline
CTCF & ATF2 & pos & indep.co-bind & & no & no \\
CTCF & EBF1 & pos & indep.co-bind & & yes & no \\
CTCF & MAZ & pos & indep.co-bind & & yes & no \\
CTCF & NFYb & pos & indep.co-bind & & yes & no \\
CTCF & NFkB & pos & indep.co-bind & & yes & no \\
CTCF & PAX5 & pos & indep.co-bind & & yes & no \\
CTCF & SP1 & pos & indep.co-bind & & yes & no \\
CTCF & BATF & neg & indir.co-bind & BATF & yes & no \\
CTCF & ELF1 & neg & indir.co-bind & ELF1 & yes & no \\
CTCF & IRF4 & neg & indir.co-bind & CTCF & yes & no \\
CTCF & MEF2a & neg & indir.co-bind & both & yes & no \\
CTCF & MEF2c & neg & indir.co-bind & both & yes & no \\
CTCF & NFATc & neg & indir.co-bind & CTCF & no & no \\
CTCF & NFYa & neg & indir.co-bind & CTCF & yes & no \\
CTCF & NRF1 & neg & indir.co-bind & CTCF & yes & no \\
CTCF & NRSF & neg & indir.co-bind & CTCF & yes & no \\
CTCF & PAX5 & neg & indir.co-bind & both & yes & no \\
CTCF & POU2f & neg & indir.co-bind & POU2f & yes & no \\
CTCF & RUNX3 & neg & indir.co-bind & both & no & no \\
CTCF & SRF & neg & indir.co-bind & CTCF & yes & no \\
CTCF & USF1 & neg & indir.co-bind & both & yes & no \\
CTCF & YY1 & neg & indir.co-bind & CTCF & yes & yes\\
CTCF & ZNF143 & neg & indir.co-bind & CTCF & yes & no \\
\hline
JunD & BHLHE40 & neg & indir.co-bind & BHLHE40 & yes & no \\
JunD & CTCF & neg & indir.co-bind & CTCF & yes & no \\
JunD & EBF1 & neg & indir.co-bind & EBF1 & yes & no \\
JunD & EGR1 & neg & indir.co-bind & EGR1 & yes & yes\\
JunD & ELK1 & neg & unknown & & no & no \\
JunD & IRF4 & neg & indir.co-bind & JunD & yes & yes\\
JunD & MAZ & neg & indir.co-bind & MAZ & no & no \\
JunD & PAX5 & neg & indir.co-bind & PAX5 & yes & no \\
JunD & SP1 & neg & indir.co-bind & SP1 & yes & yes\\
JunD & USF2 & neg & indir.co-bind & USF2 & yes & no \\
JunD & YY1 & neg & indir.co-bind & & yes & yes\\
JunD & ZBTB33 & neg & unknown & & yes & no \\
\hline
\end{tabular}
\captionof{table} { \textbf{Identified associations : } Details of all the TF associations identified, as well as the possible molecular mechanisms explaining them. The columns 'TF${_A}$' and 'TF${_B}$' refer to the TF involved in the association, 'Motif.ass.' to whether both motif are associated together ('positive') or repel each other ('negative'), as measured by the Fisher test, 'Type' to the proposed interaction mechanism between both TFs, 'Binder' to the TF binding DNA in case of an indirect co-binding, the value 'both' means that both tethering complexes may exist, 'Reported' to whether this interaction has already been reported in one of the following study \cite{wang_sequence_2012, neph_expansive_2012, consortium_integrated_2012, guo_high_2012} and 'Validated' to whether this physical association is experimentally validated and reported in BioGRID v.3.4.145 \citep{chatr-aryamontri_biogrid_2017}.}
\label{encode_peaks_association_table}
\end{center}
\end{table}
The study of co-binding with CTCF showed that it was possible to detect global associations. I already detected that the cohesin complex members SMC3 and RAD21 form a complex with CTCF, as expected from literature \citep{ghirlando_ctcf:_2016}. Additionally, I detected that YY1 and ZNF143 are also frequently associated with CTCF, which has also been reported \citep{ong_ctcf:_2014}.
Thus, I decided to push forward in this direction. To this end, I set up a method based on motif co-occurrence to i) relieve the necessity of observing similar chromatin architectures, as in the previous section and ii) be able to functionally characterize the detected interactions.
As previously discussed (see section \ref{intro_tf_cobinding}), several types of functional interactions between two TFs $A$ and $B$ exist : direct co-binding, indirect co-binding, independent co-binding and interference. Because the binding mechanisms are different from each other, different observations are expected. In the case of direct co-binding, both TF motifs are expected to appear in close vicinity, more often than by chance. Moreover, a spatial constrain (both spacing and orientation) reflecting the complex structure is also expected to occur. In the case of indirect co-binding, if TF$_{A}$ is the factor binding its motif and TF$_{B}$ is the tethering factor, both motifs are expected to repel (avoid) each other at TF$_{A}$ binding sites. In the case of independent co-binding, both motif$_{A}$ and motif$_{B}$ are expected to be enriched at both TF$_{A}$ and TF$_{B}$ binding sites. However, no spatial constrain is expected between the motifs. Finally, in the case of interference, both motifs are expected to overlap. However, this may not be difficult to detect.
% Several types of functional associations can occur between a TF$_{A}$ and a TF$_{B}$. Because each one of them brings different expected patterns in the data, it should be possible to detect and disentangle them. First two TFs can dimerize and bind to DNA using both DNA binding domains (DBDs) [REFERENCE NEEDED] (Figure \ref{encode_peaks_tf_association}A). I will refer to this case as \textbf{direct co-binding}. If this happens, both TF motifs are expected to appear in close vicinity, more often than by chance. Moreover, a spatial constrain (both spacing and orientation) reflecting the complex structure is also expected to occur. Second, two TFs can dimerize and bind to DNA using only one of the DBDs. This will result in having one of the TF bound to DNA while the other one will tether DNA through its interaction with the other TF (Figure \ref{encode_peaks_tf_association}B). This case will be referred to as \textbf{indirect co-binding}. In such a case, if TF$_{A}$ is the factor binding its motif and TF$_{B}$ is the tethering factor, both motifs are expected to repel (avoid) each other at TF$_{A}$ binding sites. Third, two TFs can both bind DNA using their own DBDs, in close vicinity but without any physical interaction (Figure \ref{encode_peaks_tf_association}C). In such as case, both motif$_{A}$ and motif$_{B}$ are expected to be enriched at both TF$_{A}$ and TF$_{B}$ binding sites. However, no spatial constrain is expected between the motifs. This case will be refered to as \textbf{independent co-binding}. This can be caused by a temporal relationship between both TFs where both TFs can bind to a given region asynchronously. For instance, a first TF is recruited to its binding site and ensures - somehow - a proper chromatin environment for another TF, such as illustrated during macrophage and B cells progenitors commitment \citep{heinz_simple_2010}. Finally, in case of a partial or total motif overlap, both TFs may be observed to be bound together (Figure \ref{encode_peaks_tf_association}D). In such a case, different phenomenons may explain this observation. A first possible explanation would be that two TFs compete to bind to the same region. Observing both TFs bound together could be due to an overlap of data from different cells in which only one TF is bound at the time. A second possible explanation would be that, for some reason, only one TF is bound, never the other. However, I prefer to be cautious regarding the causal mechanisms and this case will be referred to as an \textbf{interference}.
In order to collect more evidences about functional connections between TFs, I developed a simple analysis pipeline able to detect the expected patterns of motifs described above. Briefly, given a set of binding sites for a TF$_{A}$, it is possible to construct a contingency matrix containing the number of binding site with i) motif$_{A}$ and motif$_{B}$, ii) motif$_{A}$ only, iii) motif$_{B}$ only or iv) no motif and assess whether both motifs are associated or avoid each other using an exact Fisher test. Then, for pairs of motifs showing an association, displaying the spatial distribution of the motif may help to discriminate whether or not there is a spacing constrain or a motif overlap.
I investigated the association of 47 TFs for which 53 datasetes were available in GM12878 cells with CTCF or JunD. CTCF was chosen because i) most of its binding sites have a short nucleosome depleted region and show only a peak of sequence conservation at the binding site leaving a restricted space for other motifs to co-occur (Figure \ref{suppl_encode_peaks_em_ctcf}) and ii) I already collected several observation regarding CTCF. JunD was chosen as a complementary example to CTCF in the sense that i) contrarily to CTCF, it is only a trancriptional regulator, ii) it is expected to bind to regulatory regions mostly thus to open chromatin regions where other motifs are expected to co-occur , iii) \url{~50}\% of the peaks have a motif versus \url{~80}\% to \url{~90}\% for CTCF peaklists (Figure \ref{encode_peaks_gm12878_motif_prop}).
% motif co occurence
Motif co-occurrence analysis suggested several interactions. Regarding CTCF motif (Figure \ref{encode_peaks_ctcf_association}A), 8 positive motif association (ATF2, EBF1, MAZ, NFYb, NFkB, PAX5, SP1, YY1) and 16 negative motif associations (BATF, ELF1, IRF4, MEF2a, MEF2c, NFATc, NFYa, NRF1, NRSF/REST, PAX5, POU2F2/OCT2, RUNX3, SRF, USF1, YY1 and ZNF143) with other motifs were found. Regarding JunD (Figure \ref{suppl_encode_peaks_jund_association}A), positive motif association with 2 others TF motifs (BATF, cFos) and 12 negative associations with others TF motifs (ATF2, BHLHE40, CTCF, EBF1, EGR1, ELK1, IRF4, MAZ, PAX5, SP1, USF2, YY1 and ZBTB33) were found. cFos and one of the YY1-Sydh peaklists displayed evidences of poor quality (not shown and annotated as such by the ENCODE Consortium). Additionally, ATF2 is an AP1 member which possess a 2bp spacer (TGANNTCA) while JunD is a 1bp motif space (TGANTCA). Thus the strong negative interaction may simply be due to the fact that both motifs are simply mutually exclusive. In consequence, the positive associations CTCF-YY1 and JunD-cFos and the negative association JunD-ATF2 should be ignored. Additionally, JunD and BATF motifs are the same as both these TFs belong to the AP1 family. In consequence, it is impossible to say whether BATF peaks harbour a JunD or a BATF site. Thus this association should be ignored as well, leaving no positive association left with JunD motif.
% densities
The analysis of CTCF and JunD motif occurrence densities (Figures \ref{encode_peaks_ctcf_association}B and C and Figure \ref{suppl_encode_peaks_jund_association}B and C) revealed further interesting details regarding possible association mechanisms. First, positive associations showed CTCF density patterns mostly compatible with the direct co-binding and the independent co-binding scenarios (see Figure \ref{encode_peaks_ctcf_association}B). However, making a clear distinction between both is often impossible. For instance, both EBF1 peaklists show a decrease in CTCF motif density \url{~10}bp after the peak followed by an increase which could represent the spacer between CTCF and EBF1. However this is followed by a rather wide CTCF motif presence, mostly suggesting an independent co-binding scenario. An interesting candidate for a direct co-binding with CTCF is RXRa (Figure \ref{encode_peaks_ctcf_association}B). Even though the motif association was not significant, a focused co-localization of both motif appears. Second, negative associations showed CTCF and JunD density patterns compatible with the indirect co-binding scenario where the TFs would tether through CTFC or JunD, i.e. the CTCF or JunD motifs do not show a spacing constrain with the binding sites but are rather spread over ~100bp around binding sites without their own motif (Figure \ref{encode_peaks_ctcf_association}C and Figure \ref{suppl_encode_peaks_jund_association}C). Interestingly, CTCF motif around YY1 and ZNF143 binding sites lacking their own motifs (see bottom of Figure \ref{encode_peaks_ctcf_association}C) showed really focused densities, indicating that for some reason, the CTCF motif is well localized. Even if unexpected, this observation is not incompatible with the indirect co-binding scenario and further supports the results from section \ref{encode_peaks_section_ctcf_rad21_smc3_yy1_znf143}.
% results
To summarize, the motif association statistics allowed me to identify 35 associations of TFs with either CTCF or JunD (Table \ref{encode_peaks_association_table}). The strongest negative interactions for CTCF were ZNF143 and YY1, supporting the results found in the previous sections. The analysis of CTCF and JunD motif spatial distributions around peaks and a closer examination of the contingency matrices allowed to suggest details about the interacting mechanisms, including which TF binds DNA. The only two exceptions were JunD-ELK1 and JunD-ZBTB33 for which the motif occurrence densities were uninformative. Finally, out of these 35 associations, 5 were supported by experimental evidences and 5 were not already reported in previous studies or databases \citep{wang_sequence_2012, neph_expansive_2012, consortium_integrated_2012, guo_high_2012, chatr-aryamontri_biogrid_2017}.
\section{EBF1 binds nucleosomes}
\begin{figure}
\begin{center}
\includegraphics[scale=0.4]{images/ch_encode_peaks/ebf1_haib_1.png}
\captionof{figure}{\textbf{EBF1 binding sites} stand on the edge of a nucleosome. \textbf{A} Nucleosome dyad distributions around the EBF1 binding sites (from the Haib dataset). The dyad distributions have been measured from two independent datasets : i) MNase-seq data released by the ENCODE Consortium (in red) and by Gaffney et al. (in blue) \citep{gaffney_controls_2012}. \textbf{B} Dinucleotide frequencies around the nucleosome dyads from the Gaffney dataset that have an EBF1 binding site within 100bp. \textbf{C} Motif frequency around the nucleosome dyads from the Gaffney dataset that have an EBF1 binding site within 100bp. The abrupt decrease of EBF1 motif frequency at +/- 100bp reflects the nucleosome selection process.}
\label{encode_peaks_ebf1}
\end{center}
\end{figure}
% As presented above (section \ref{encode_peaks_chippartitioning}), EBF1 binding sites does not seem to present a NDR seem to be covered by a nucleosome array. This observation suggest that EBF1 can bind to nucleosomal DNA. However, because ChIPPartitioning realigns the data, one possible explanation is that it failed to properly aligned the data and that the results do not reflect reality.
% In order to clarify this, I looked at the MNase digestion profile - more specifically, at the distribution of nucleosome dyads - at EBF1 binding sites.
EBF1 is a crucial factor for B cell development. It is necessary in the early steps, for a proper lineage commitment as well as later on during the entire B cell development \citep{boller_defining_2018}. Since many years, EBF1 has been though to be able to "pioneer early changes in the target gene chromatin necessary for transcriptional activation" and proper B cell development \citep{hagman_early_2005}. Experimental evidences supported that EBF1 could be able to bind compacted naive chromatin (without noticeable mark/modification), leading to a local chromatin opening, H3K4me2 deposition, DNA demethylation and gene activation \citep{maier_early_2004,boller_pioneering_2016}. If such features makes a lot of sense during lineage commitment, the some underlying mechanisms remained mysterious, especially how EBF1 primarily binds to closed chromatin. With regard to this, the results of section \ref{encode_peaks_chippartitioning}, suggesting that EBF1 binding sites may be covered by nucleosome arrays, rose my attention. In order to collect evidences that may shed light on this, I conducted a deeper exploration of the EBF1 binding sites.
First, the distribution of nucleosome dyads - from two independent experiments - around EBF1 binding sites revealed a landscape that is compatible with a nucleosome positioned ~70bp apart from the binding sites (Figures \ref{encode_peaks_ebf1}A). This configuration would position the EBF1 binding site at the edge of the nucleosome. The 10bp periodicity visible suggested that other positioning of the EBF1 binding site exist but always at integer numbers of helix turn, such that the EBF1 binding site would always be positioned the same compared to the nucleosome surface. Surprisingly, the distribution of EBF1 motif remained the same, whether the nucleosome was containing an EBF1 bound site or not (Figure \ref{suppl_encode_peaks_ebf1_nucl}).
Second, to support the fact that these EBF1 binding sites are indeed functional sites, I compared some of their chromatin features with the entire nucleosome pool. As expected, the presence of EBF1 binding sites was correlated with an increased accessibility (Figure \ref{suppl_encode_peaks_ebf1_chrom}A), even though the opening was spread rather than narrow. Furthermore, this increased opening was concomitant with an enriched H3K4me2 deposition (Figure \ref{suppl_encode_peaks_ebf1_chrom}B), in line with the literature. Last, it was also possible to highlight a higher sequence conservation at the nucleosome edges when they had an EBF1 binding site (Figure \ref{suppl_encode_peaks_ebf1_chrom}C), suggesting a functional difference between both nucleosome pools.
% Finally, Trifonov's motif appeared along the nucleosome, EBF1 motif was rather present at the nucleosome edges. A closer look both motifs (see Figure \ref{suppl_encode_peaks_ebf1_logo} for EBF1 logo) revealed that half of Trifonov's motif (RRRRR or YYYYY) matches one half of the EBF1 motif ({A/C}CCC{A/C} and {A/G}GGG{A/G}) at the cost of 2 or 0 missmatches.
% A further inspection of the dinucleotide base composition in the nucleosome bearing an EBF1 binding site revealed a periodic pattern that is compatible with a rotationally positioned nucleosome (Figure \ref{encode_peaks_ebf1}B), as expected from literature in \citep{ioshikhes_variety_2011,gaffney_controls_2012}.
% Finally, the occurrence of the nucleosome positioning motif - YRRRRRYYYYYR where Y is C/T and R is A/G - identified by Trifonov \citep{trifonov_cracking_2011} in these nucleosomes is antiphased with the occurrence of the EBF1 motif. If Trifonov's motif appeared along the nucleosome, EBF1 motif was rather present at the nucleosome edges. A closer look both motifs (see Figure \ref{suppl_encode_peaks_ebf1_logo} for EBF1 logo) revealed that half of Trifonov's motif (RRRRR or YYYYY) matches one half of the EBF1 motif ({A/C}CCC{A/C} and {A/G}GGG{A/G}) at the cost of 2 or 0 missmatches.
% These results suggest that EBF1 can bind nucleosomal DNA. In most cases, it seems that the EBF1 binding site is located at its edge. Incidentally, the high similarity between Trifonov and EBF1 motifs suggest that EBF1 binding sequence may have a nucleosome positioning property. Interestingly, EBF1 motif, as identified by JASPAR \ref{suppl_encode_peaks_ebf1_logo}, is 14bp wide. Consequently, it is conceivable that, wherever this motif is located along the nucleosome, at least part of remains facing outward and is thus "readable".
% Based on this observation, I hypothesize that EBF1 may be a pioneering factor or that it influence nucleosomes positioning through its binding. In the first case, EBF1 would be able to target yet inaccessible loci upon the right cellular conditions. In the second case, EBF1 would rather serve to both open and close targeted sites by leading - directly or indirectly - to the positing of a nucleosome right beside of it binding site. Both scenarios make sense. Indeed, EBF1 is known to be crucial for B-cells commitment. In such developmental processes, specific enhancers are made accessible and active at different, in a coordinated manner, during the developmental process. (AND WHAT ABOUT CLOSING???)
Third, a further inspection of the sequence composition of the nucleosomes bearing an EBF1 binding site revealed i) a periodic occurrence of antiphased WW (W=A/T) and SS (S=C/G) dinucleotides and ii) a periodic occurrence of the YRRRRRYYYYYR (R=A/G, Y=C/T) nucleosome positioning motif described by Trifonov \citep{trifonov_cracking_2011}. Together, these observations suggest that EBF1 binding sites are located on the edge of a rotationally positioned nucleosome \citep{ioshikhes_variety_2011,trifonov_cracking_2011,gaffney_controls_2012}. Interestingly, Trifonov's motif appeared in counter phased with EBF1 motif. A closer look both motifs (see Figure \ref{suppl_encode_peaks_ebf1_logo} for EBF1 logo) revealed that half of Trifonov's motif (RRRRR or YYYYY) matches one half of the EBF1 motif (\{A/C\}CCC\{A/C\} or \{A/G\}GGG\{A/G\}) at the cost of 2 or 0 missmatches.
These results suggest that EBF1 can indeed bind nucleosomal DNA. The motif bound were predominantly located at the edges of the nucleosomes. Yet, this was also the fact for nucleosome that do are not bind by EBF1. This suggests that nucleosomes are already in this position before EBF1 binding, which may be the case given the presence of favorable nucleosome positioning sequences.
The reason why the EBF1 motif is already on the edges of nucleosome, even without EBF1 binding, remains unknown. One explanation could be that such sites have a double function. The first function would be to recruit EBF1 to open up the region. The second, would be that EBF1 binding sequence (together with other positioning sequences) can act as a barrier - a potential well - avoiding the nucleosome to roll over in this direction. Such a system would have the advantage of promoting a suited chromatin structure in developmentally important regions. Constraining nucleosome movement would could serve to hide regulatory elements. At the same time, these regions would remain responsive to differentiation signals through the exposition of EBF1 sites on the periphery of nucleosomes.
\section{Discussion}
Overall, the results presented in this section overall complement and support the observations made by other research groups worldwide.
% nucleosome arrays and NDR
The systematic study of the nucleosome landscape in the viccinity of TFs binding sites highlighted that nucleosome arrays are always present on the flanking regions. However, all the TFs, with the exception of CTCF, do not act as a barrier and thus are not major determinant of the chromatin architecture. Instead, an alternative mechanism, probably involving chromatin remodelers, is likely to be responsible. Furthermore, all TFs were found to bind in NDRs with the noticeable exception of EBF1.
% EBF1
Surprisingly, a large fraction of EBF1 binding sites was found to be occupied by what seemed to be a rotationally positioned nucleosome which edges are bound by EBF1. Furthermore, it appeared that EBF1 binding motif resembles a nucleosome positioning sequence and could be involved in the positioning of the nucleosome. However, at least two alternative scenarios could explain the presence of an EBF1 binding site at the entry of a nucleosome. First, EBF1 genuinely binds to such "pre-positioned" nucleosomes, in which case I am observing EBF1 true binding mechanism. Alternatively, EBF1 binding - to either nucleosomal or naked DNA - results in the positioning of a nucleosome right beside. To my opinion, the previous results suggesting a pioneer function for EBF1 \citep{boller_pioneering_2016} makes the second hypothesis more likely. EBF1 would directly engage a nucleosome and somehow trigger its displacement such that EBF1 binding site will eventually reside at the nucleosome edge. Testing this hypothesis could be performed by assaying in vitro binding of EBF1 to assembled nucleosome arrays.
% CTCF
The study of CTCF binding sites revealed that they can be grouped in i) promoter distal and ii) promoter proximal binding sites. In each of the subset, CTCF was observed to bind with a different group of interactors, suggesting different functions. At promoter distal binding sites CTCF is associated the cohesin complex while at promoter proximal regions, CTCF seems to be associated with ZNF143 and YY1.
% interaction
Finally the study of the motif co-localization, even if simple, seemed quite powerful as it allowed to identify 35 interactions with CTCF or junD. Out of these, 25 have already been proposed but without experimental support, 5 have been proposed and experimentally validated and 5 were new. These 5 new interactions are proposed to be indirect co-binding event and thus imply a physical interaction that can be tested.
\section{Methods}
\subsection{Data and data processing}
\label{encode_peaks_methods_data}
All the GM12878 ENCODE data used were mapped against hg19 genome and can be found on the MGA repository \citep{dreos_mga_2018}.
Peaks called by the ENCODE Consortium using their uniform processing pipeline \cite{gerstein_architecture_2012} were used. These peaks can be found at \url{https://ccg.epfl.ch/mga/hg19/encode/Uniform-TFBS/Uniform-TFBS.html}. Assuming that a TF binds to DNA through motif recognition, the peak center should be localized on the motif center. Thus the center of each peak was moved to the closest motif instance within 60bp. To do so, each TF was associated to a log-odd PWM contained either in JASPAR Core vertebrate 2014 \cite{mathelier_jaspar_2014}, HOCOMOCO v10 \cite{kulakovskiy_hocomoco:_2016} or Jolma \cite{jolma_dna-binding_2013} collection. Using the corresponding log-odd PWM, peak sequences were scanned to find motif instance with a score corresponding to a pvalue higher or equal to 1e-4. If such a motif instance was found, the peak position was shifted to the center of the motif instance and mapped to the corresponding strand. Otherwise, the peak position remained unchanged without strand information.
In GM12878 cells, nucleosome occupancy was assessed using MNase-seq data released by the ENCODE Consortium (GSE35586). These data can be found at \url{https://ccg.epfl.ch/mga/hg19/encode/GSE35586/GSE35586.html}. To increase sequencing depth, all replicates available for this cell line were pooled together, resulting in ~789 mio reads, and used as a single dataset. The resulting dataset is available and has the description "GM12878|Nucleosome|all (SLOW!)". Because each read was represented as a single point coordinate corresponding to their 5' edges, these coordinates were centered by 70bp in order to indicate the nucleosome dyads. Finally, another dataset was used for one analysis only. These data were released by Gaffney and colleagues \cite{gaffney_controls_2012} and can be found at \url{https://ccg.epfl.ch/mga/hg19/gaffney12/gaffney12.html} and were not centered as the coordinates already represent the center of paired-end sequenced fragments. The dataset is labeled "All Paired-end samples - 147bp fragments".
Chromatin accessibility was assessed using DNaseI-seq data released by the ENCODE Consortium \cite{boyle_high-resolution_2008} (GSE32970). To increase sequencing depth, all replicates available for GM12878 cells were pooled together, resulting in ~144 mio reads, and used as a single dataset. The individual replicates can found at \url{https://ccg.epfl.ch/mga/hg19/encode/Duke-DNaseI-HS/Duke-DNaseI-HS.html}. The reads were represented as a single point coordinate corresponding the their 5' edges but were not centered as this correspond to the exact DNaseI nick location.
The EPDnew release 003 was used as TSS annotation \cite{dreos_eukaryotic_2017} and genome sequence conservation was assessed using Phastcons \cite{siepel_evolutionarily_2005}. Both datasets can be found at \url{https://ccg.epfl.ch/mga/hg19/epd/epd.html} and \url{https://ccg.epfl.ch/mga/hg19/phastcons/phastcons.html} respectively.
\subsection{Classification of MNase patterns}
\label{encode_peaks_em_mnase}
For each TF peaklist MNase, DNase, sequence conservation and TSS density around TF binding site were assessed independently by counting the number of read mapped from -999bp to +1000bp around each peak, using 10bp bins. For each TF, 4 matrices having one row per binding site (peak) and 199 columns were created using ChIP-extract program \citep{ambrosini_chip-seq_2016}.
Probabilistic pattern classification was achieved using the ChIPPartitioning (see section \ref{encode_peaks_chippartitioning}). The algorithm was implemented as described in the supplemental materials of \cite{nair_probabilistic_2014}.
Two different procedures were used to classified MNase patterns. Both were run for 10 iterations allowing flip and a value of shift of 15 bins.
The first procedure aimed to discover 4 different pattern classes, allowing flip and a shift of 15 bins. The procedure was initialized with 4 classes. The class patterns were initialized by assigning each peak a random probability to belong to each of the 4 classes. The patterns were then computed as the weighted average of the signal given the peak class probabilities as weights. Then the prior class probabilities were initialized as $p_{k,s,f} = 1/K*S*2$ where $k$ is the class index, $s$ is the shift value in bins (here 15), $f$ is an indicative variable for the flip state (1 for "normal", 2 for "reverse"), $K$ is the number of classes (here 4) and $S$ is the maximum allowed shift in bins. The classification was run for 10 iterations. At the end, it returned a matrix of dimensions $NxKxSx2$ containing the probabilities for each of the $N$ region to belong to each of the $K$ class, for each possible shift state $S$ and for both flip states ("normal" or "reversed").
The second procedure aimed to discriminate between 2 classes : i) the binding sites describing the "average" binding sites as opposed to ii) those differing from this. To do so, class patterns were initialized to i) the aggregation over all peaks (the average pattern) and ii) a flat pattern being the mean number of counts of the input matrix. Flip and 15 bins of shift were allowed. The prior class probabilities were initialized as $p_{k,s,f} = \mathcal{N}(s,floor(S/2)+1,1)$ where the second and third parameters are the mean and the standard deviation, giving a higher prior probability to states with shift equal to 0bp.
\subsection{Quantifying nucleosome array intensity from classification results}
Nucleosome array intensity was quantified using a method developed by Zhang and colleagues \citep{zhang_canonical_2014}. Briefly, nucleosome signal is represented in 2 dimensions as a set of signal intensities for a given set of positions. Data are structured as vector $Y$ containing the nucleosome occupancy signal (for instance an EM classification class profile) for $n$ bins (for EM class profiles, 199 bins of 10bp). First, the 1$^{st}$ order derivative $D_{1}$ of $Y$ is computed. Then the 1$^{st}$ order derivative $D_{2}$ of the absolute value of $D_{1}$ is computed. Local maxima in $D_{2}$ are searched using a windows of 15 bins (corresponding to 150bp, a nucleosome width). Maxima can be interpreted as strong drop or enrichment of signal, corresponding to a pattern expected from a well positioned nucleosome array. Finally, all $D_{2}$ maxima are joint by a line and the nucleosome array intensity at each given position is the height of the line at this position. The nucleosome array density for the first and last position of $Y$ were set to 0. The average nucleosome array intensity of $Y$ was used as the nucleosome array value of the input data.
The classification of a matrix of counts having $N$ rows (regions), with $K$ classes, allowing a maximum of $S$ shift states and two flip states ("normal" and "reverse") outputs a probability matrix $P$ of dimension [$N$, $K$, $S$, 2] containing the probability for each region to belong to each class, given a shift state and a flip state. This matrix can be used to compute a vector $D_{k}$ of length $S$ containing the probability density of the shift states for a class $k$ using :
\begin{equation}
\begin{aligned}
D_{k,s} & = \frac {\sum_{i=1}^{N} (P_{i,k,s,1} + P_{i,k,s',2})} {\sum_{i=1}^{N} \sum_{s=1}^{S} (P_{i,k,s,1} + P_{i,k,s',2})} \\
\text{with } \\
s' & = S - s + 1
\end{aligned}
\label{encode_peaks_equation_shift_density1}
\end{equation}
\citep{ambrosini_chip-seq_2016}
where $s'$ represents the index of the reverse orientation and with the constrain that all the elements of $P$ sum to 1. Given the shift probability density vector $D_{k}$ of one class, computing its standard deviation was done using :
\begin{equation}
\begin{aligned}
\sigma_{k} & = \sqrt { \sum_{i=1}^{S} (X_{i}^{2} \cdot D_{k,i}) - \mu_{k}^{2} }\\
\text{with } \\
\mu_{k} & = \sum_{i=1}^{S} (X_{i} \cdot D_{k,i})
\end{aligned}
\label{encode_peaks_equation_shift_density2}
\end{equation}
where $X$ is a vector containing the position changes in bp for every shift state, i.g. for a maximum number of shift states of 15 ($S=15$) with bins of 10bp, X would contain [-70, -60, ..., 0, ..., +60, +70].
\subsection{Peak colocalization}
To measure the extent of colocalization between CTCF, YY1, ZNF143, SMC3 and RAD21, the occurrence of YY1, ZNF143, SMC3 and RAD21 peaks around CTCF peaks was computed using ChIP-extract \citep{ambrosini_chip-seq_2016}. The CTCF peak list used as reference was "wgEncodeAwgTfbsSydhGm12878Ctcfsc15914sc20UniPk" because it was the CTCF peak list containing i) the most CTCF peaks and ii) the highest proportion of peaks with a motif. Chip-extract was run separately for YY1, ZNF143, SMC3 and RAD21 using the following parameters : from -99, to 100, window size 1. Then, the propotion of CTCF peak having at least one other peak within +/-10 bp, 50bp or 100bp was computed.
\subsection{NDR detection}
Let us consider a matrix of MNase-seq counts $R$ of dimensions $NxL$ containing N vectors of read counts $r_{1}, r_{2}, ..., r_{n}$ of length $L$. Because MNase-seq reads are a direct indication of the nucleosome occupancy, detecting NDRs is about finding low signal regions, flanked by two high signal regions.
The signal in each vector $X_i$ (region) is assumed to have been sampled from a 2 class mixture of high (nucleosome) and low (nucleosome-free) signal, using a Poisson distribution. Both classes are expected to occur with a given probability $p^{nucl}_{i}$ and $p^{free}_{i}$. The rows are considered individually to lessen technical biases such as region specific sequencing depth.
The class probabilities and their mean parameters are estimated using an EM algorithm. First, during the E-step, for each position inside a region, the posterior probability of the nucleosome given the data is computed using :
\begin{equation}
\begin{aligned}
P(nucl | r_{i,l}) = \frac{p_{i}^{nucl} \times Poisson(r_{i,l}, \lambda=m_{i}^{nucl})}
{p_{i}^{nucl} \times Poisson(r_{i,l}, \lambda=m_{i}^{nucl}) +
p_{i}^{free} \times Poisson(r_{i,l}, \lambda=m_{i}^{free})}
\end{aligned}
\end{equation}
where $r_{i,l}$ is the number of reads at position $l$ in the i-th row of $R$, $m_{i}^{nucl}$ and $m_{i}^{free}$ are the mean parameters of the nucleosome and nucleosome-free classes respectively. Obviously, the nucleosome-free class posterior probability is
\begin{equation}
\begin{aligned}
P(free | r_{i,l}) = 1 - P(nucl | r_{i,l})
\end{aligned}
\end{equation}
Then, during the M-step, the class mean parameters are updated using
\begin{equation}
\begin{aligned}
m_{i}^{nucl} = & \sum_{l=1}^{L} r_{i,l} \times P(nucl | r_{i,l}) \\
m_{i}^{free} = & \sum_{l=1}^{L} r_{i,l} \times P(free | r_{i,l})
\end{aligned}
\end{equation}
and the class probabilities :
\begin{equation}
\begin{aligned}
p_{i}^{nucl} = & \frac{1} {L} \times \sum_{l=1}^{L} P(nucl | r_{i,l}) \\
p_{i}^{free} = & 1 - p_{i}^{nucl}
\end{aligned}
\end{equation}
The EM optimization of the parameter estimates was repeated for 10 iterations. At the end of the parameter estimation process, each of the $L$ positions in a region $R_{i}$ were assigned two posterior probabilities $P(nucl | r_{i,l})$ and $P(free | r_{i,l})$ to belong to each class. In all cases, the nucleosome class was the class having the highest mean parameter and the nucleosome free class the class with the smallest ($m_{i}^{nucl} > m_{i}^{free}$).
The binding sites - located in the center of the regions, at position $s = L/2$ - were assumed to be within the NDR. From that point, the NDR was extended using the following procedure :
\SetKwProg{Fn}{}{\{}{}\SetKwFunction{Function}{float NDRextend}%
\begin{algorithm}[H]
\label{encode_peaks_algo_ndr_extend}
\Fn{\Function{}}
{ \KwData{The posterior probabilities obtained for each position of $r_{i}$.}
\KwResult{the left and right coordinates of the NDR}
\tcp{NDR only covers the central location}
$left = s$ \;
$right = s$ \;
\While{$left \ne 2$ and $right \ne L-1$}
{ $p.free.l = P(free|r_{i,left})$ \;
$p.free.r = P(free|r_{i,right})$ \;
$p.nucl.l = P(nucl|r_{i,left})$ \;
$p.nucl.r = P(nucl|r_{i,right})$ \;
\tcp{bidirectional extension}
\If{$prob.free.l > p.nucl.l$ and $p.prob.free.r > p.nucl.r$}
{ $left \minuseq 1$ \;
$right \pluseq 1$ \;
}
\tcp{extension to left}
\ElseIf{$prob.free.l > p.nucl.l$}
{ $left \minuseq 1$ \; }
\tcp{extension to right}
\ElseIf{$p.prob.free.r > p.nucl.r$}
{ $right \pluseq 1$ \; }
\tcp{no more extension possible}
\Else
{ break \; }
}
\Return{$left$, $right$}
}
\caption{Searches the coordinates of the NDR using the posterior nucleosome and nucleosome free class probabilities, for a region $R_i$, from its central position.}
\end{algorithm}
The nucleosome occupancy around CTCF binding sites was measured using ChIP-extract with "wgEncodeAwgTfbsSydhGm12878Ctcfsc15914sc20UniPk" peak list as reference - because it was the CTCF peak list with the most peaks and with the highest proportion of peaks with a CTCF motif -, the ENCODE MNase-seq data described in section \ref{encode_peaks_methods_data} as targets and the following parameters : from -999bp, to 1000bp and window size 10bp.
This matrix was subjected to a ChIPPartitioning partitioning, as described in section \ref{encode_peaks_em_mnase}, to find 4 nucleosome architectures, using shifting and flipping. The resulting posterior probabilities were used to re-orient the data. If the major shift state - that is the shift state with the highest overall probability - for a given region was the "reverse" state, then the row was reversed. The re-oriented matrix was then subjected to the NDR detection. The re-orientation was done for aesthetic purposes only. Because the NDR detection was performed starting from the center position in each region - and given that reverting a vector did not change its central position - this operation had no influence on the NDR detection.
\subsection{CTCF and JunD interactors}
% Enumerating motif instances genome-wide
To enumerate instance of CTCF and JunD motif, the hg19 genome assembly was scanned using CTCF (MA0139.1 from JASPAR Core Vertebrate 2014 \citep{mathelier_jaspar_2014}) and JunD (JUND\_HUMAN.H10MO.A from HOCOMOCOv10 \citep{kulakovskiy_hocomoco:_2016}) matrices to produce lists of potential binding sites. A limit score threshold was set as the score corresponding to a pvalue of 1e-5 for each matrix, respectively. This was done using matrix\_scan program from PWMScan \citep{ambrosini_pwmscan:_2018}. Eventually, any motif instance falling inside a region classified as being a repeated element and blacklisted by the ENCODE Consortium was filtered out using count\_filter program from the ChIP-seq tools \citep{ambrosini_chip-seq_2016-1}.
% Measuring motif instance occurence near peaks
Then, for each TF peak list independently, the number of i) the TF and ii) CTCF/JunD instances +/- 1kb of each peak was measured, in bins of 1bp, using ChIP-extract program from the ChIP-seq tools \citep{ambrosini_chip-seq_2016-1}. The association were measured as follows : using the ChIP-extract results for the given peak list versus i) the TF and ii) CTCF/Jund motif instances, the number of peaks having i) at least one TF and one CTCF/Jund motif instances, ii) only TF motif instances, iii) only CTCF/JunD motif instances or iv) no motif instance. These numbers were used to build a contingency table and a two-sided Fisher exact test for association was performed. The motif relationship was considered significant if the test OR was bigger than 1 and the 95\% CI of the OR did not contain 1 or as a significant motif exclusion if the OR was smaller than 1 and the 95\% CI of the OR did not contain 1.
% Motif density around peaks
The motif occurence densities were computed from the ChiP-extract result matrices. Out of each matrix, a vector containing the number of motif instances at each possible absolute distance was computed. This was done as follows : first each each non-null cell neighbours were incremented (+/- 5 columns on each side) to turn motif instance hits into non point-like representation. A given cell value could be incremented several times. Second for each row, the column corresponding to the same absolute distances from the peak were summed together (i.g. +1bp with -1bp, +2bp with -2bp, +999bp with -999bp). The first column of the resulting matrix should contain number of motif instances present at the peak center (distance of 0bp), the second column at an absolute distance of 1bp and so one. Eventually, the row were summed up and the resulting vector was considered as the motif density vector for the given peak list. The vectors were used to create a matrix for CTCF motif and Jund motif (a vector corresponds to a row), separately, and the matrix was displayed as a heatmap. The row values were standardized and the rows hierarchically clustered using the euclidean distance.
\subsection{EBF1 and nucleosome}
The correlation between EBF1 binding sites and nucleosome dyads was made using ChIP-cor \citep{ambrosini_chip-seq_2016-1}, from the web (\url{https://ccg.epfl.ch/chipseq/chip_cor.php}). The references were the corrected EBF1 peaks (wgEncodeAwgTfbsHaibGm12878Ebf1sc137065Pcr1xUniPk dataset, for more details see section \ref{encode_peaks_methods_data}) and the targets either i) the MNase-seq data released by Gaffney et al. \citep{gaffney_controls_2012} (hg19 / DNase FAIRE etc / Gaffney 2012 ... / All Paired-end samples - 147bp fragments) or ii) the ENCODE MNase-seq data (hg19 / ENCODE DNase FAIRE etc / GSE35586 ... / GM12878 Nucleosome all (SLOW!)). In both cases, "any" strand was selected. Because Gaffney data are paired-ended and represent the fragment midpoint (the dyad), no centering was done. The ENCODE data are single-ended and a centering of 70bp (half a nucleosome) was applied to approximate the fragment midpoint. The count cut-off was set to 1 and the range to -399 to +400bp.
To isolate nucleosomes with an EBF1 binding site, the opposite ChIP-cor analysis was run : Gaffney data as references versus EBF1 binding sites as targets with count cut-off set to 1 and the range to -399 to +400bp. In the results page the "Feature Selection Tool" was used to select dyads with at least 1 EBF1 binding site (threshold parameter) located "From" -99bp "To" 100bp. The count cut-off was set to 9999 and both "Switch to depleted feature" and "Reference feature oriented" set to "Off".
These nucleosome dyads were uploaded to OProf (\url{https://ccg.epfl.ch/ssa/oprof.php}) on the SSA server \citep{ambrosini_signal_2003}. Four individual analyses were run to measure the "WW", "SS", "YRRRRRYYYYYR" and EBF1 motif occurrences. In all cases, the 5' and 3' borders were set to -399bp and 400bp, the window shift to 1bp and the search mode to "bidirectional". For "SS" and "WW", the motif to search was entered as a "Consensus sequence", the window size was set to 2bp, the reference position to 1 and the number of allowed mismatches to 0. For "YRRRRRYYYYYR", the motif was also entered as a "Consensus sequence", the window size was set to 12bp, the reference position to 6 and the number of allowed mismatches to 4. For the EBF1 motif, the JASPAR CORE Vertebrate 2018 "EBF1 MA0154.3 (length=14)" was used with a window size of 14bp, a reference position of 7 and a p-value threshold of 1e-4.
To investigate the chromatin architecture around nucleosome dyads, ChIP-cor was used. Two references were used : i) the nucleosomes with an EBF1 binding site (see above) and ii) the entire Gaffney dataset (hg19 / DNase FAIRE etc / Gaffney 2012 ... / All Paired-end samples - 147bp fragments). For each reference, three analyses were run against different target features : i) DNase-seq data to monitor chromatin accessibility (hg19 / ENCODE DNase FAIRE etc / Boyle 2008 ... DNaseI HS - GM12878 - Rep 1) with "any" strand and no centering, ii) H3K4me2 ChIP-seq data (hg19 / ENCODE ChIP-seq / GSE29611 ... / GM12878 H3k4me2) with "any" strand and a centering of 70bp (half the nucleosome) and iii) positional sequence conservation scores (hg19 / Sequence derived / Vertebrate Conservation (phastCons46way) ... / PHASTCONS VERT46) with "any" strand an no centering. For DNase-seq and sequence conservation, the range was set to -399bp to 400bp with a window with of 1bp. For H3K3me2 data, the range was set to -3999bp to 4000bp with a window width of 10bp. For the DNase-seq and the H3K4me2 data, the count cut-off were set to 1, for the sequence conservation to 9999.

Event Timeline