% ch_introduction.tex (created Thu, May 16, 00:48)

\cleardoublepage
\chapter{Introduction}
\label{intro}
\markboth{Introduction}{Introduction}
\addcontentsline{toc}{chapter}{Introduction}
Each living organism contains DNA, which is the molecular support on which genes are encoded. Genes are the hereditary units of life and code for a set of instructions involved in all aspects of life, from the development of an organism to the functions of a specific cell type. However, since not all of these instructions are needed at the same time, gene expression needs to be regulated.
Transcription factors (TFs) are a class of nuclear proteins that can bind to specific DNA sequences and drive target gene expression. TFs are thus major regulators of gene expression.
This work reports the results of different computational studies, all focused on characterizing TFs through the exploration of their sequence specificity, their associations and the organization of chromatin around their binding sites.
\section{About chromatin}
In eukaryotes, the DNA is stored in the nucleus. Each human cell contains about two meters of DNA. In order to fit the DNA inside the nucleus, cells have to organize and compact the genome while keeping it readable. Evolution came up with an elegant solution: chromatin. Chromatin is the association of DNA with specialized proteins - the histones - around which it wraps. Other families of proteins are also found in chromatin, such as RNA polymerases, histone chaperones, helicases or TFs.
\subsection{The chromatin structure}
% structure, histones, nucleosomes, genome compaction
\begin{figure}
\begin{center}
\includegraphics[scale=0.3]{images/ch_introduction/chromatin.png}
\captionof{figure}{\textbf{A} Top view of a nucleosome core particle (NCP) displayed as a ribbon representation on the left and as a space-filling representation on the right. The NCP is made of a histone octamer, composed of four hetero-dimers, around which 146-148 bp of DNA wrap. The histone tails protrude out of the NCP and are accessible to other factors, unlike the inner part of the histone octamer. Taken and modified from \cite{mcginty_robert_k._and_tan_song_fundamentals_2014}. \textbf{B} The chromatin structure. In eukaryotes, DNA is wrapped around histone cores, forming nucleosomes. Nucleosomes can then be organized into higher-level helical-like structures, compacting the DNA. The ultimate compaction state is reached at the metaphase of mitosis, when the mitotic chromosomes are visible.}
\label{intro_chromatin}
\end{center}
\end{figure}
% histones
In humans, there are four major (canonical) histones: H2A, H2B, H3 and H4. These four histones are assembled into an octamer, composed of two H2A/H2B and two H3/H4 hetero-dimers, around which $\sim$146-148 bp of DNA wrap, forming the nucleosome core particle (which I will later simply refer to as "nucleosome", Figure \ref{intro_chromatin}A). The DNA is kept wrapped around the histone octamer by strong electrostatic interactions. Indeed, the DNA backbone, which is negatively charged in physiological conditions, shows a high affinity for the positively charged histones. As a consequence, the nucleosome is a quite stable structure.
Histone proteins are highly conserved among eukaryotes at both the sequence and the structure level. All histones share the same overall design. They are composed of an N-terminal tail, a central histone-fold domain and a C-terminal tail. Histones associate with each other through their histone-fold domains, which compose the center of the nucleosome. In contrast, the histone N-terminal tails protrude from the nucleosome and are hotspots for post-translational modifications (PTMs) \citep{kouzarides_chromatin_2007}.
For completeness, it should be mentioned that "variant histones" - also called "replacement histones", as opposed to the "canonical replicative histones" - exist and can replace canonical histones in nucleosomes, at specific genome locations, to fulfill dedicated functions \citep{henikoff_histone_2015}. However, this topic is outside the scope of this work.
% chromatin fibers
The genome is organized into a succession of nucleosomes, each separated by linker DNA, forming the 11-nm chromatin fiber. This chromatin conformation is quite relaxed and the DNA is accessible. The H1 linker histone can be recruited to nucleosomes, in which case it binds the linker DNA and makes it inaccessible. The 11-nm fiber is itself packed into a denser structure called the 30-nm fiber (Figure \ref{intro_chromatin}B). Eventually, even higher-order structures are achieved, further increasing the genome compaction level \citep{mcginty_robert_k._and_tan_song_fundamentals_2014}.
% compaction
It is now commonly accepted that the compaction of the genome comes with a trade-off. The DNA sequences found in nucleosomes are thought to be inaccessible to DNA reading processes such as TF binding, whereas the linker DNA remains accessible \citep{jolma_methods_2011-2}. Thus, storing the genome impedes its readability. Because transcribing genes is all about reading the DNA template, the state of the chromatin ultimately impacts gene expression. Consequently, the cell needs to keep only the immediately useful genomic regions readable while retaining the ability to open or close other regions on demand.
\subsection{The chromatin is dynamic}
% chromatin modification/remodelling
Because the set of genes that need to be active may vary over time, for instance because of lineage commitment, the chromatin structure needs to be adapted. Some regions need to become accessible in order to be read while others are not needed anymore. Consequently, chromatin is a highly dynamic structure that undergoes constant modifications. Two broad families of chromatin modifiers exist: ATPase chromatin remodelers and histone modifiers.
% chromatin remodelers
ATPase chromatin remodelers are a group of proteins that are able to affect chromatin packaging by acting directly on the nucleosome, at the cost of hydrolyzing ATP molecules. Chromatin remodelers can be subdivided into four sub-groups, each fulfilling a different function \citep{langst_chromatin_2015}. SWI/SNF members can slide and/or evict nucleosomes from DNA and are linked with chromatin opening. ISWI members tend to recognize unmodified H4 histones and catalyze nucleosome spacing and chromatin compaction. CHD members are less well characterized functionally but bear chromo domains that allow them to recognize histone methylation. Finally, INO80 members seem to be able to slide and evict nucleosomes and to recognize Holliday junctions and the DNA replication fork, suggesting a role in DNA repair and replication.
% histone modifiers
Histone modifiers are enzymes that can deposit PTMs on the histone tails. Different types of PTMs exist, such as acetylation or methylation. Each histone has several residues that can be modified, sometimes together. This leads to an astonishingly high number of possible combinations. So far, more than a hundred histone PTMs have been identified, each linked with different biological functions [REFERENCE]. Just as PTMs are deposited by dedicated factors (also called writers), they are also recognized by dedicated factors \citep{kouzarides_chromatin_2007,hyun_writing_2017}. This allows histone PTMs to be used to recruit specific factors at given genomic locations. For instance, H3 lysine 4 di-methylation (H3K4me2) has been shown to be enriched at the promoters of actively transcribed genes and at enhancers \citep{hyun_writing_2017,zhou_charting_2011} and to be specifically recognized by CHD1, a member of the CHD chromatin remodelers \citep{hyun_writing_2017}.
\subsection{Measuring nucleosome occupancy}
% MNase
The micrococcal nuclease (MNase) - an endo-exo nuclease - is a key tool for producing nucleosome occupancy maps. Subjecting a chromatin extract to an MNase treatment, under proper experimental conditions, releases "a ‘ladder’ of discrete DNA fragments" \citep{voong_genome-wide_2017} whose sizes correspond to mono-, di-, tri-nucleosome fragments, and so on. The MNase is able to cleave accessible linker DNA (endo-nuclease activity) and to trim the ends of the resulting fragments (exo-nuclease activity). The nucleosomal DNA is protected from digestion as the histone octamer sterically hinders the access of the MNase to its substrate \citep{voong_genome-wide_2017}.
% MNase-seq
Originally, MNase-treated DNA was selectively amplified using PCR to map individual nucleosomes precisely. The advent of microarray technologies then made it possible to interrogate the entire genome, even though the resulting maps had a relatively low resolution \citep{jiang_nucleosome_2009}. Eventually, this limitation was circumvented by subjecting the MNase-treated chromatin fragments to next generation sequencing (-seq) \citep{schones_dynamic_2008}. The advent of MNase-seq led to the creation of high resolution - down to individual nucleosomes - genome-wide nucleosome maps \citep{schones_dynamic_2008, gaffney_controls_2012, west_nucleosomal_2014, kubik_nucleosome_2015}.
% data treatment
Mapping the sequenced fragments of an MNase-seq assay against a reference genome produces a digital readout of the nucleosome density per genomic position. If single-end sequencing is used, the read positions are usually shifted by $\sim$70 bp toward the fragment center to identify the nucleosome dyad. If paired-end sequencing is performed, mono-nucleosome fragments are usually selected based on their size ($\sim$150 bp) and their central position is used, as it should indicate the nucleosome dyad.
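As a minimal sketch of these two conventions, let $s$ and $e$ denote the leftmost and rightmost genomic coordinates of a mapped paired-end fragment and $r$ the 5' end of a single-end read (symbols introduced here for illustration only), and take the 70 bp shift mentioned above as a nominal value. The inferred dyad position $d$ would then be approximately:
\begin{equation}
\begin{aligned}
d_{\text{single-end}} &= \begin{cases} r + 70 & \text{if the read maps to the forward strand} \\ r - 70 & \text{if the read maps to the reverse strand} \end{cases} \\
d_{\text{paired-end}} &= \frac{s + e}{2}
\end{aligned}
\end{equation}
For instance, a hypothetical paired-end fragment spanning positions 1000 to 1146 would place the dyad at position 1073.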
% limitations i) A/T sequence specificity ii) all nucleosomes likely not mapped
Although MNase-seq reveals nucleosome occupancy with an unprecedented resolution, it suffers from some limitations.
First, MNase has been demonstrated to exhibit a sequence preference toward A/T rich sequences, which could potentially lead to an overdigestion of nucleosome fragments in A/T rich regions \citep{voong_genome-wide_2017}. Second, some nucleosomes have been demonstrated to be "fragile" with respect to the experimental conditions. In yeast, specific nucleosomes were found to be sensitive to the MNase concentration and could only be detected with reduced MNase concentrations \citep{kubik_nucleosome_2015}. Here, the MNase sequence preference may be at play. But another case of fragile nucleosomes was found in humans, independently of the use of MNase. In this case, the fragile nucleosomes were sensitive to the regular salt concentrations used during a ChIP-seq experiment \citep{jin_h3.3/h2a.z_2009}. Thus, it is likely that MNase-seq is not able to map all the nucleosomes in a given genome.
\section{About transcription factors}
% 1) specificity models, additivity, sequence scoring given model
% 2) TF complexes
% 3) co-binding scenarios
% 4) in vivo (ChIP-seq)
% 5) in vitro (HT-SELEX, PBM, B1H)
% About TF and their structure
Transcription factors (TFs) are a special class of proteins that is crucial for the regulation of gene expression. TFs have the ability to recognize specific DNA sequences among all others. Once recruited on the DNA template, TFs can regulate transcription by promoting or repressing the activity of RNA polymerase II (RNAPII). In the first case, one speaks of (transcriptional) activators, in the second of (transcriptional) repressors. TFs share a modular architecture. Two types of domains are of particular importance for TF function: the DNA binding domain (DBD) and the activation domain (AD).
% DNA binding domain
The DBD allows TFs to bind their DNA target. Many different DBDs exist, each one being structurally different from the others. The DBD structure is typically used to classify TFs into families. This is for instance the case in the TFClass database \citep{wingender_tfclass:_2013}. In metazoans, TFs have been grouped into four distinct super-families: i) the basic domain TFs, ii) the zinc coordinated TFs, iii) the helix-turn-helix TFs and iv) the $\beta$ scaffold TFs. Each type of domain has a different structure and thus can interact with different DNA structures and sequences \citep{jolma_methods_2011-2}. Importantly, a single DBD is able to recognize different yet similar sequences. Because the sequence differences have an impact on the TF-DNA interaction interface, each sequence is bound with a different affinity. Biologically, having high and low affinity binding sites may be useful to tune the intensity of the action of a TF on a given gene.
% activation domain
In addition to their DBD, many TFs also bear an AD that is important for the regulation of transcription. ADs allow TFs to regulate gene expression directly, by interacting with the basal transcriptional machinery, or indirectly, by recruiting co-regulators. Coupled with the specific DNA recognition, this allows TFs to impact the transcription of specific regions of the genome. Whether a TF acts as an activator or as a repressor depends on the exact interactions it can establish with the transcriptional machinery and on the co-regulators it can recruit \citep{latchman_transcription_1997}.
Ultimately, the activity of a TF is regulated by controlling its access to the DNA. This can be done by sequestering it in the cytoplasm (by any means) or even by occupying its binding sites to impede the recruitment of the TF on the genome \citep{latchman_transcription_1997}.
\subsection{TF co-binding}
\label{intro_tf_cobinding}
% TF complexe, homo-dimer, hetero-dimers, independent co-binding
% Jolma and Taipale book 2011 chapter 8
% Jolman and Taipale book 2011 chapter 11.4 (nucleosome breathing / TFs cooperate to evict nucleosome and open chromatin)
\begin{figure}
\begin{center}
\includegraphics[scale=0.4]{images/ch_introduction/TF_associations.png}
\captionof{figure}{\textbf{Possible interaction scenarios between TFs} \textbf{A} Direct co-binding. The TFs dimerize and bind together on DNA. \textbf{B} Indirect co-binding. Both TFs dimerize but only one binds the DNA; the other (in blue) is tethered to the DNA through this interaction. \textbf{C} Independent co-binding. Both TFs bind in close vicinity but without forming a complex. The two TFs are not necessarily bound at the same time. \textbf{D} Interference. Both motifs partially or totally overlap each other.}
\label{intro_tf_association}
\end{center}
\end{figure}
The four above-mentioned TF super-families offer a huge variety of different TFs and thus allow a substantial complexity in terms of transcriptional regulation. Nonetheless, life further expanded the possible complexity of regulatory wiring by evolving different types of combinatorial TF co-binding \citep{jolma_methods_2011-3}. By TF co-binding, I mean a functional association of TFs that requires them to bind in close vicinity, either as a complex or individually. Furthermore, from a strictly DNA-centric point of view, the binding of each TF does not need to be synchronous. One TF may bind after the other, even after the first one has left.
% direct co-binding
First, two TFs can dimerize, forming either homo- or hetero-dimers, and bind to DNA using both DBDs (Figure \ref{intro_tf_association}A). This is for instance the case for the members of the basic domain super-family, which contains the leucine zipper and helix-loop-helix families, whose members are obligate dimers, that is, they must dimerize in order to bind DNA \citep{jolma_methods_2011-2}. This can be referred to as "direct co-binding".
% tethering
Second, two TFs can dimerize and bind to DNA using only one of the DBDs. As a result, one of the TFs binds to DNA while the other one is tethered to DNA through its interaction with the first TF (Figure \ref{intro_tf_association}B). This can be referred to as "indirect co-binding".
% independent co-binding
Third, two TFs can both bind DNA using their own DBDs, in close vicinity but without any physical interaction (Figure \ref{intro_tf_association}C). This is for instance the case at distal regulatory elements (REs), where many TFs can be found to be bound at the same time. Synergistic co-binding of several TFs has been proposed as a mechanism by which closed chromatin structures could be opened and distal REs activated \citep{jolma_methods_2011-3,heinz_selection_2015}. On the other hand, the binding of different TFs to a given region can be asynchronous. This is the case for TFs involved at different times of an activation cascade, as happens during the commitment of macrophage and B cell progenitors \citep{heinz_simple_2010}. This can be referred to as "independent co-binding".
% interference
Finally, the binding motifs of two TFs can overlap (Figure \ref{intro_tf_association}D). Different mechanisms may explain this phenomenon. A first possible explanation is that the two TFs compete to bind to the same region. This can occur in mechanisms that regulate the activity of a TF. In that case, as its binding site is occupied, the binding of the TF is sterically hindered. A second possible explanation is that, for some reason, only one of the two TFs ever binds, never the other. This can be referred to as "interference".
\subsection{Modeling sequence specificity}
% The property for a given TF to bind different binding sites with different affinites is functionally interesting. However this variability of site recognition implies that TF can bind to non functional sites.
In the nucleus, TFs face an extremely large number of potential binding sites. Most of them are non-sites and only a minority are bona fide binding sites. The ability of a TF to distinguish between both is called "specificity".
\subsubsection{Position weight matrix}
\label{intro_pwm}
\begin{figure}
\begin{center}
\includegraphics[scale=0.2]{images/ch_introduction/figure_pwm.png}
\captionof{figure} {\textbf{Position weight matrix : } \textbf{A} Human JunD (JUND\textunderscore HUMAN.H10MO) PWM from the HOCOMOCO version 10 collection \cite{kulakovskiy_hocomoco:_2016}. \textbf{B} Corresponding PWM logo. The palindromic nature of the recognized motif is explained by the fact that JunD belongs to the basic leucine zipper family. As such, it has to hetero- or homo-dimerize with another member of its family to bind DNA. Both \textbf{A} and \textbf{B} were taken from http://ccg.vital-it.ch/ssa/oprof.php.}
\label{intro_pwm_logo}
\end{center}
\end{figure}
Modeling TF specificity has been a crucial issue in biology, as it makes it possible to predict where a given TF could bind in the genome and to infer its regulatory targets. However, because a TF recognizes a set of different yet similar sequences (a property referred to as "degeneracy"), solving this problem turned out to be complicated.
Let us assume a simple binding reaction between a TF $TF$ and a sequence $S$, at equilibrium, \ch{TF + S <>[][] TF.S}. From the law of mass action, we can compute the association constant $K_{a}$, which represents the affinity of the TF for the sequence, as:
\begin{equation}
\begin{aligned}
K_{a} = \frac {[TF \cdot S]} {[TF] \times [S]}
\end{aligned}
\label{intro_ka}
\end{equation}
From this, it is possible to compute the specificity of $TF$ for $S$ as being
% spec(Xi)
\begin{equation}
\begin{aligned}
spec(S) = \frac {K_{a}(S)} {K_{a}*}
\end{aligned}
\label{intro_eq_ka}
\end{equation}
where $K_{a}*$ is the mean affinity of $TF$ for all other possible sequences. In turn, this makes it possible to estimate the probability that the sequence $S$ is bound, which solves the problem of binding site prediction. However, since measuring the affinity for every possible sequence is labor intensive at best and impossible at worst, this solution cannot be used in practice.
Stormo and Fields demonstrated that it was possible, under the assumptions of i) positional independence and ii) genome randomness, to create a matrix of binding energy contributions \textbf{W} of dimensions $L \times 4$ that maximizes the probability of binding of a set $A$ of aligned binding sites of length $L$ bp \citep{stormo_specificity_1998}. These contributions are:
\begin{equation}
\begin{aligned}
W_{j,b} = \log_{2} \frac{F_{j,b} / N}{p(b)}
\end{aligned}
\label{intro_eq_pwm}
\end{equation}
where \textbf{F} is a count matrix that contains the number of times each base $b$ appears at each position $j$ of the alignment $A$, $N$ is the number of sequences in $A$ (so that $F_{j,b}/N$ is the frequency of base $b$ at position $j$), $p(b)$ is the background probability of base $b$ and \textbf{W} is the position weight matrix (PWM, Figure \ref{intro_pwm_logo}A). Thus, from an alignment of binding sites, it is possible to derive a model that estimates the binding affinity of any DNA sequence of length $L$.
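As a worked illustration under stated assumptions, consider a single position of a hypothetical alignment of $N = 8$ binding sites in which A is observed 5 times and C, G and T once each. Assuming a uniform background $p(b) = 0.25$, and assuming that the counts are converted to frequencies before taking the log-ratio, the weights at that position (written $W_{A}$, $W_{C}$, $W_{G}$ and $W_{T}$, dropping the position index for brevity) would be:
\begin{equation}
\begin{aligned}
W_{A} &= \log_{2} \frac{5/8}{0.25} = \log_{2} 2.5 \approx 1.32 \\
W_{C} = W_{G} = W_{T} &= \log_{2} \frac{1/8}{0.25} = \log_{2} 0.5 = -1
\end{aligned}
\end{equation}
A base that is over-represented with respect to the background thus receives a positive weight, and an under-represented base a negative one. Note that a base never observed at a position would lead to $\log_{2} 0$, which is why a small pseudo-count is commonly added to the counts in practice.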
% positional independence
A pillar of the PWM theory is the hypothesis of positional independence. It assumes that the contribution of each base of the binding site to the binding is not affected by the other bases in the same binding site. Thus the PWM is a mononucleotide model. This assumption was first formulated to ease the experimental and computational workload and has been a subject of controversy \citep{man_non-independence_2001,bulyk_nucleotides_2002}. However, it is nowadays commonly accepted that, even if the assumption is violated in some cases, it still provides an accurate representation in most cases \citep{benos_additivity_2002,zhao_quantitative_2011}.
% types of matrices, conversion possible
Other matrix types, incorrectly referred to as PWMs, are used to represent TF specificity. They are all based on the sequence alignment $A$. Count matrices, such as \textbf{F}, contain the number of times a given base was found at a given position in the alignment. Probability matrices are similar to count matrices except that their values are normalized by the number of sequences in $A$. Count and probability matrices can always be converted to a PWM.
% popularity
Overall, PWMs (and related matrices) are conceptually straightforward to understand and easy to work with. For instance, they can be visualized as a sequence logo (\cite{schneider_sequence_1990}, Figure \ref{intro_pwm_logo}B). Because of this, PWMs are popular and remain the most widely used type of model to represent TF specificity.
\subsubsection{Predicting binding sites}
TF specificity models are typically used for classification problems. The problem can be stated as follows: given a sequence $S$ of length $L$ and a PWM \textbf{W} of dimensions $L \times 4$, predict whether $S$ will be bound. A common way is to define a threshold score $t$ and compute
\begin{equation}
\begin{aligned}
score(S) = \sum_{i=1}^{L} W_{i,S_{i}}
\end{aligned}
\label{intro_eq_score_pwm}
\end{equation}
where $S_{i}$ is the base found at position $i$ of $S$. If $score(S) \geq t$ then $S$ is accepted as a binding site. The non-trivial part is to define a meaningful threshold score. A conceptually similar thing can be done with a probability matrix \textbf{M}. It is possible to compute the probability of observing the data given the model, the likelihood $p(S|M)$:
\begin{equation}
\begin{aligned}
p(S|M) = \prod_{i=1}^{L} M_{i,S_{i}}
\end{aligned}
\label{intro_eq_score_prob}
\end{equation}
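As a worked illustration, consider a hypothetical probability matrix \textbf{M} of length $L = 3$ in which the preferred base at each position (A, C and G respectively) has a probability of 0.7 and each of the three other bases a probability of 0.1. Assuming a uniform background $p(b) = 0.25$ and weights computed as the log-ratio of these probabilities over the background, the likelihood and the score of the sequences $ACG$ and $TTT$ would be:
\begin{equation}
\begin{aligned}
p(ACG|M) &= 0.7 \times 0.7 \times 0.7 = 0.343 \\
score(ACG) &= 3 \times \log_{2} \frac{0.7}{0.25} \approx 3 \times 1.485 \approx 4.46 \\
p(TTT|M) &= 0.1 \times 0.1 \times 0.1 = 0.001 \\
score(TTT) &= 3 \times \log_{2} \frac{0.1}{0.25} \approx 3 \times (-1.322) \approx -3.97
\end{aligned}
\end{equation}
With an illustrative threshold of $t = 0$, $ACG$ would be accepted as a binding site and $TTT$ rejected. Both the matrix and the threshold are made up for the purpose of the example.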
\subsubsection{Aligning binding sites}
% obtaining the alignment
The above solution requires a set of aligned binding sites $A$. However, most of the time, we do not have access to this information. Rather, we have a set of longer sequences in which we know that the TF of interest binds, but not exactly where.
Many algorithms - typically called "de novo motif discovery" methods - have been developed to construct a matrix from a set of unaligned sequences. The algorithm developed by Stormo and Hartzell \citep{stormo_identifying_1989} finds an optimal alignment by maximizing a modified version of the alignment information content \citep{schneider_information_1986}. Alternatively, the algorithms developed by Hertz and Stormo \citep{hertz_identification_1990} and Lawrence and Reilly \citep{lawrence_expectation_1990} build a probability matrix that maximizes the likelihood of observing the data given the model. This type of framework has also been used to develop MEME, except that it explicitly models the data as a mixture of binding and non-binding sites \citep{bailey_fitting_1994}.
\subsection{Measuring TF binding in vivo}
% ChIP
The advent of chromatin immuno-precipitation (ChIP) has been central for the study of TFs. In essence, it consists of extracting the chromatin from the cell nuclei, shearing it either mechanically or enzymatically and adding an antibody (Ab) against a DNA binding protein of interest. The IP step pulls down the Ab, its target and the DNA fragment the target is bound to.
% history of DNA detection methods
Different methods, with varying throughput, have been used to identify the purified DNA fragments. First, specific loci of interest were assayed by PCR. Then, the growing availability of DNA microarrays drastically increased the throughput by testing a large number of pre-selected loci at once \citep{jolma_methods_2011-4}. Finally, protocols subjecting the purified DNA to high throughput sequencing (-seq) \citep{barski_high-resolution_2007,robertson_genome-wide_2007} made it possible to identify the bound loci in an agnostic way, with an unprecedented throughput.
% ChIP-seq
ChIP-seq has truly revolutionized genomics and the study of TFs. In a single assay, it is possible to obtain a genome-wide picture of TF binding. Mapping the sequenced reads to the genome of interest yields a per-position occupancy score, creating a digital readout of the TF occupancy. ChIP-seq thus lists the regions of the genome that are occupied and also provides an estimate of the binding affinity for these regions. Indeed, the higher the affinity for a given binding site, the higher the probability of binding. This should be proportionally reflected in the density of signal \citep{jothi_genome-wide_2008}. On the downside, ChIP-seq does not identify the binding sites precisely. Because the pulled-down DNA fragments are usually longer than the binding sites themselves, ChIP-seq only identifies regions of +/- 100 bp in which a TF binds. Nonetheless, it is possible to identify over-represented DNA sequence motifs in these regions using de novo motif discovery methods (see section \ref{intro_pwm}). Typically, the identified sequence motifs belong to i) the TF of interest and ii) co-binders (see section \ref{intro_tf_cobinding}).
\subsection{Measuring TF binding in vitro}
In vivo measurement of TF binding has several drawbacks. ChIP-seq makes it possible to estimate the binding specificity of a TF. However, de novo motif discovery methods mostly capture the high affinity features of the TF binding specificity \citep{stormo_determining_2010}. Additionally, in vivo, the chromatin exerts an effect on TF binding (see section \ref{intro_gene_regulation}). In light of these limitations, in vitro binding assays offer experimental solutions to investigate i) TF binding over a wider range of affinities and ii) TF intrinsic specificity, without the influence of chromatin.
% MITOMI
In recent years, many different technologies have been developed to investigate TF binding in vitro. The advent of microfluidic platforms tremendously increased the throughput. In brief, these platforms are typically composed of hundreds (if not more) of individual chambers and of the piping necessary to flow the reagents into each chamber, so that as many reactions can be run in parallel. The reaction chambers are small and allow the use of microliter reaction volumes. Maerkl and colleagues \citep{maerkl_systems_2007,geertz_massively_2012} have developed the mechanically induced trapping of molecular interactions (MITOMI). This assay is based on a microfluidic device and makes it possible to run hundreds of affinity assays for a given TF in parallel.
% HT-SELEX
High throughput systematic evolution of ligands by exponential enrichment (HT-SELEX; \citealp{zhao_inferring_2009,jolma_multiplexed_2010}) is another popular high throughput method to assay TF binding. HT-SELEX assays the specificity of a TF by allowing a binding reaction between the TF and tens of millions of different DNA sequences of typically 20-30 bp. The bound DNA molecules are purified by pulling down the TF. The purified DNA can either be sequenced using high throughput sequencing or be subjected to another cycle of selection. Repeated cycles isolate higher affinity binders, eventually only returning a few hundred of them. With a limited number of cycles, this method covers a wide dynamic range of binding affinities \citep{stormo_determining_2010} and provides a digital readout. However, the repeated cycles can introduce biases that are hard to model, which complicates the estimation of the binding affinities.
\section{About nucleosome positioning}
% statistical positioning, sequence positioning
\begin{figure}
\begin{center}
\includegraphics[scale=0.2]{images/ch_introduction/nucleosome_positioning.png}
\captionof{figure}{\textbf{Nucleosome positioning} \textbf{A} Activated gene transcription start site (TSS) region. The nucleosomes located immediately downstream of the TSS show a strong positioning. The positioning of the first nucleosome can be influenced by sequence preferences. The phasing is then propagated to neighboring nucleosomes through statistical positioning. Further away, the nucleosome array is no longer visible as the nucleosomes become fuzzily positioned among the cells. \textbf{B} Influence of the rotational positioning on the sequence accessibility. Left, a sequence (indicated by the black ‘rungs’ on the DNA helix) has its major groove facing toward the nucleosome outside and is accessible. Center, a 5bp rotation of the nucleosome hides the sequence as its major groove is now facing the histone octamer. Right, another 5bp rotation makes the sequence accessible again. Both images are taken and adapted from \citep{jiang_nucleosome_2009}.}
\label{intro_nucleosome_positioning}
\end{center}
\end{figure}
The advent of MNase-seq made it possible to draw high resolution maps of nucleosome occupancy in many species, such as yeast \citep{kubik_nucleosome_2015}, mouse \citep{west_nucleosomal_2014} or human \citep{schones_dynamic_2008, gaffney_controls_2012}.
% strongly positioned nucleosomes
The wealth of data collected revealed that nucleosomes do not cover the genome uniformly. Nucleosomes rather seem to have preferred locations. Interestingly, single nucleosomes can be visualized from batch sequencing experiments, indicating that a substantial fraction of the cells bear nucleosomes at the same positions. In these cases, the nucleosomes are said to be "phased" or "strongly positioned" (see Figure \ref{intro_nucleosome_positioning}A).
% statistical positioning
Nucleosome arrays are a striking case of strongly positioned nucleosomes. Arrays can occur throughout the human genome \citep{gaffney_controls_2012}. However, there are regions where they are enriched, for instance at the CCCTC-binding factor (CTCF) binding sites \citep{fu_insulator_2008}. It has been proposed that the arrays result from the nucleosomes organizing with respect to a barrier (or anchor). In this case, the barrier would be CTCF. The regular organization of the array has been proposed to propagate away from the anchor because the positions of the immediately flanking nucleosomes are constrained by the barrier. In turn, these nucleosomes constrain the lateral freedom of movement of the following ones, and so on, eventually forming the array. However, because the degree of constraint diminishes with each new nucleosome, the nucleosomes eventually cease to be sufficiently phased throughout the cell population. They become fuzzy and the signal blurs out at some point. This phenomenon is referred to as "statistical positioning" \citep{jiang_nucleosome_2009}.
% effect of sequence
Another important driver of nucleosome positioning is the DNA sequence. For instance, strongly positioned nucleosomes are also visible at the transcription start sites (TSSs) of activated genes. However, unlike for CTCF binding sites, the DNA sequence composition seem to be a major factor driving the nucleosome positioning \citep{dreos_influence_2016}. To wrap around the histone octamer the DNA should be curved, which requires some flexibility. WW (W=A/T) and SS (S=C/G) dinucleotides have been shown to curve the DNA by extending the major and the minor groove respectively \citep{jiang_nucleosome_2009}. However, because the major and minor grooves precess around the DNA helix axis, each groove alternatively faces the nucleosome center (the histone octamer) and the nucleosome outside (the opposite direction) every ~5bp (thus the DNA helix periodicity is ~10.4bp, see Figure \ref{intro_nucleosome_positioning}B). Consequently, dinucleotides favoring DNA flexibility are required to occur at different locations around the nucleosome, according to their effect on the DNA helix structure. For instance, stretching the major groove needs to occur when it is facing the nucleosome outside, to force the adjacent DNA segment to be curved toward (around) the nucleosome center. If a nucleosome is bound to a favorable sequence, the next most likely favorable binding sites are located 10bp upstream or downstream. These correspond to the locations at which all the dinucleotides will reacquire the same orientation with respect to the histone octamer. This is referred to as "rotational positioning" \citep{jiang_nucleosome_2009}. In 2011, Trifonov identified the YRRRRRYYYYYR (where R=A/G and Y=C/T) consensus sequence to be a nucleosome positioning sequence matching these criteria \citep{trifonov_cracking_2011}. The first and last positions indicate the cyclic nature of this pattern.
Interestingly, the exact positioning of a nucleosome has a deep impact on the accessibility of the DNA. Displacements that are not a multiple of $\sim$10bp have the potential to change the orientation of a sequence with respect to the histone core and thus its accessibility (Figure \ref{intro_nucleosome_positioning}B).
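As a rough illustration of this point, and assuming an idealized, uniform helical repeat of 10.4bp (real nucleosomal DNA deviates locally from this value), the rotation $\theta(d)$ of a sequence around the helix axis after a displacement of $d$ bp can be approximated by
\[
\theta(d) \approx \frac{360^{\circ}}{10.4} \, d \pmod{360^{\circ}} ,
\]
where $\theta$ and $d$ are notations introduced here for illustration only. This gives $\theta(5) \approx 173^{\circ}$, meaning that a 5bp shift turns a sequence from facing the histone octamer to facing almost exactly away from it, whereas $\theta(10) \approx 346^{\circ}$, that is, within about $14^{\circ}$ of the original orientation.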
In vivo, both statistical and rotational positioning occur. Additionally, chromatin remodelers constantly catalyze thermodynamically unfavorable nucleosome displacements in exchange for ATP hydrolysis. It is likely that each nucleosome is subjected to all of these phenomena. However, on a single nucleosome basis, one may predominate over the others.
\section{Gene regulation in a nutshell}
\label{intro_gene_regulation}
% regulation gene expression definition
The regulation of gene expression is a highly complex biological phenomenon which allows a proper allocation of resources to each individual gene, such that the overall gene product output fits the needs of the cell as precisely as possible. In this section, I will consider gene regulation from a DNA-centric perspective.
% summary and aim of the section
The mechanisms that regulate gene expression are diverse and operate in an intricate and highly dynamic manner. The status of a gene, at a given time, is the result of the actions of activating and repressing mechanisms that act either on i) the recruitment of the pre-initiation complex (PIC) and the assembly of a functional RNAPII complex at gene regulatory elements (REs) or ii) the activation of the RNAPII complex. This section will briefly introduce each of these aspects and provide the information necessary to understand the rest of this work.
\subsection{The chromatin barrier}
% chromatin is a barrier
As discussed above, the genome is stored as chromatin in the nucleus. Because nucleosomes are bound to the DNA, they compete with other factors for binding sites. As such, the chromatin structure is a barrier to the recruitment of the PIC. On the bright side, this is thought to limit spurious activation of the RNAPII \citep{jolma_methods_2011-1}. On the other hand, this obviously also suppresses any gene expression. In humans, the observation that TF binding is hindered by nucleosomes and that REs are nucleosome depleted suggests the existence of a mechanism that opens REs \citep{jolma_methods_2011-1}.
\subsection{TFs cooperative binding}
% cooperative binding to open nucleosomes
The cooperative binding of TFs has been demonstrated to be able to open closed chromatin. In essence, this is a step-wise process during which a first TF binds its target on an accessible linker, leading to the destabilization of the neighboring nucleosome. This in turn increases the accessibility of a second TF binding site that can be engaged, further opening the chromatin. Eventually, the nucleosome is displaced or even evicted and the chromatin is locally opened \citep{jolma_methods_2011-1}. Finally, the recruitment of ATPase chromatin remodelers and/or of histone modifiers sets up a proper chromatin environment \citep{jolma_methods_2011-1}.
The conditions for this phenomenon to ignite are not precisely known. However, several hypotheses and observations are of interest. First, compacted chromatin has been observed, in vitro, to undergo spontaneous local openings at the nucleosome entry sites. This phenomenon has been referred to as "nucleosome breathing" \citep{jolma_methods_2011-1}. This has the potential of creating windows of opportunity for TFs to engage their binding sites in nucleosome arrays. Second, it has been hypothesized that, in humans, some regions are able to promote a high nucleosome density that facilitates the exclusion of the H1 histone. The rationale is that the DNA linker between any two nucleosomes is too short for H1 to bind. Eventually, this prevents the inclusion of these regions in more condensed chromatin structures while leaving the linkers somewhat accessible \citep{jolma_methods_2011-3}. Together with nucleosome breathing, this has the potential of creating engageable - but not open - windows throughout the genome.
\subsection{Pioneer TFs}
% pioneer TFs
Alternatively, a special class of TFs named "pioneer factors" has been shown to be able to bind its targets in a closed chromatin environment and to induce chromatin opening after binding \citep{zaret_pioneer_2011,iwafuchi-doi_pioneer_2014}.
% introduce Fox1 to describe the precise mechanism of action
The case of the prototypical pioneer factor FoxA1 is enlightening regarding the mechanism of action of pioneer TFs. In the liver, FoxA1 is able to bind the inactive \textit{albumin} enhancer and prime it for activation \citep{cirillo_opening_2002}. The enhancer activation is possible because of the hybrid nature of FoxA1. It binds DNA through a DNA binding domain whose structure is similar to that of the H1 linker histone. Strikingly, FoxA1 can bind its motif directly on the nucleosome surface. Furthermore, FoxA1 possesses a C-terminal domain that directly binds the histone core, which leads to chromatin opening \citep{cirillo_opening_2002}. Alternatively, FoxA1 is also able to recruit co-regulators via its N-terminal trans-activation domain \citep{zaret_pioneer_2011}. Many other pioneer TFs have since been discovered, such as Oct4, Sox2 and Klf4 (which are 3 of the 4 Yamanaka factors) \citep{soufi_pioneer_2015} or PU.1, which has been shown to induce nucleosome remodeling at macrophage and B-cell specific enhancers \citep{heinz_simple_2010}. Interestingly, in this regard, most of the TFs discovered to drive cellular reprogramming, such as the Yamanaka factors \citep{takahashi_induction_2006}, are pioneer TFs.
\subsection{Regulatory elements}
% RE definition
Chromatin opening and the recruitment of the transcriptional machinery do not happen at random in the genome but are concentrated at REs. The specific recruitment of the regulators of the transcriptional machinery at given genomic locations concentrates the regulatory signals on specific target genes. REs can be divided into two broad classes based on their vicinity to the gene(s) they regulate: proximal REs - or promoters - and distal REs. Both classes interact with each other by means of the 3D structural organization of the genome. This section will briefly focus on each of these topics.
% promoters
Promoters are located immediately upstream of the target genes they regulate. Their function is to recruit the RNA polymerase II complex (RNAPII) and position it properly for transcription. Interestingly, two contrasting promoter groups have been identified with respect to their chromatin architectures \citep{cairns_logic_2009}. The first group includes housekeeping genes. The chromatin architecture of this group tends to be constitutively open, with a nucleosome depleted region (NDR), promoting gene expression. The second group contains highly regulated genes. Unlike the first group, these promoters tend to be constitutively covered by nucleosomes, hindering TF and RNAPII recruitment. Their activation requires an active chromatin remodeling by SWI/SNF ATPase family members. However, in both cases, an active gene requires an open chromatin. The NDR usually contains TF binding sites, including core regulatory elements (CREs) involved in the recruitment of general TFs leading to the assembly of the RNAPII \citep{lenhard_metazoan_2012}.
% enhancers
Distal REs are a crucial class of genomic regions when it comes to gene expression regulation. Distal REs are located at distances that vary from kilobases to megabases from their target gene and have the ability to influence gene expression positively, in which case they are referred to as 'enhancers', or negatively, in which case they are referred to as 'silencers'. Distal REs are enriched with closely spaced TF recognition sequences that serve for the recruitment of TFs. In turn, TFs recruit other transcriptional co-regulators such as histone modifiers \citep{heinz_selection_2015}. Through chromatin looping phenomena, the recruited TFs (and all other factors) are brought into close spatial proximity to target gene promoters. This increases the concentrations of TFs (as well as of other bound regulatory factors) at the promoter level and strengthens regulatory signals directly where the RNAPII is sitting \citep{heinz_selection_2015}.
Distal REs are not always active. Instead they are highly cell line specific and thus are important determinants of the cell identity \citep{heinz_selection_2015}. The activation of a distal RE requires the chromatin to be opened so that it becomes accessible for TFs to bind. Currently, both cooperative TF binding and pioneer TFs are thought to be involved in chromatin opening and remodeling. Upon chromatin opening, specific histone PTMs are deposited, such as H3 lysine 4 mono-methylation (H3K4me1), H3K4me2 or H3 lysine 27 acetylation (H3K27Ac) \citep{zhou_charting_2011}. For instance, during B-cell and macrophage lineage commitment, PU.1 and EBF1 are essential TFs whose action activates cell type specific enhancers, leading to the enforcement of differential genomic programs \citep{boller_pioneering_2016, heinz_simple_2010}. Failure to do so leads to lineage commitment defects \citep{hagman_early_2005,kurotaki_transcriptional_2017}.
\subsection{The genome goes 3D}
% TADs
Finally, another layer of complexity involved in the regulation of gene expression can be added: the 3D organization of the genome. It is now clear that, inside the nucleus, the spatial organization of the genome is tightly regulated and has a functional meaning \citep{bonev_organization_2016}.
As described above, enhancers and promoters physically interact with each other. These looping phenomena do not happen at random. The genome is organized into domains called topologically associating domains (TADs). A TAD can be seen as a high-level chromatin loop in which the physical interactions between loci are favored compared to interactions with loci outside of the TAD. As a matter of fact, the action of an enhancer is limited to the TAD it is located in. Thus TADs can be seen as functional regulatory genomic domains.
TADs are thought to be established and maintained by a dedicated set of structural proteins and complexes including CTCF and the cohesin complex \citep{bonev_organization_2016}. CTCF seems to have two major functions. First, it seems to facilitate promoter/enhancer interactions within TADs and thereby promote gene expression. Second, CTCF is enriched at TAD borders and seems to be important for their proper delimitation \citep{ong_ctcf:_2014}, likely through a loop extrusion mechanism \citep{ghirlando_ctcf:_2016}. This second function is compatible with the insulator function of CTCF. Because it marks the boundary between TADs, enhancer/promoter interactions across this boundary cannot happen, which would explain the insulator activity of CTCF. Finally, CTCF is often found to interact with the cohesin complex \citep{stedman_cohesins_2008}. The cohesin complex is composed of four members: SMC1, SMC3, RAD21 and either STAG1 or STAG2 \citep{losada_cohesin_2014}. Together they form a ring-like structure in which two DNA molecules are trapped and maintained together. This structure is one of the mechanisms that pinch DNA and form loops. The cohesin complex is important for both promoter/enhancer interactions and TAD maintenance \citep{losada_cohesin_2014,bonev_organization_2016}.
\section{Digital footprinting}
\label{intro_dgf}
% DGF / DNase-seq / ATAC-seq, footprint
\begin{figure}
\begin{center}
\includegraphics[scale=0.3]{images/ch_introduction/dgf.png}
\captionof{figure}{\textbf{Digital footprinting :} \textbf{A} DNase-seq uses the endonuclease DNase to cleave DNA within accessible chromatin. Endonuclease cleavage is greatly attenuated at the protein-bound loci (the red crosses denote cleavage blockade). Accessible library fragments are generated by barcoding each cleavage site independently after restriction digestion (single cut) or as proximal cleavage pairs (double cut). \textbf{B} Assay for transposase-accessible chromatin using sequencing (ATAC-seq) uses a hyperactive transposase (Tn5) to simultaneously cleave and ligate adaptors to accessible DNA. \textbf{C} The purified DNA fragments are then subjected to massively parallel sequencing and mapped to the reference genome to generate a digital readout of per-nucleotide insertion (DNaseI nick or Tn5 transposition event) genome-wide. Figure and legend taken and adapted from \citep{vierstra_genomic_2016, klemm_chromatin_2019}.}
\label{intro_dgf}
\end{center}
\end{figure}
% DGF
Digital genomic footprinting (DGF) methods are a powerful means to reveal all active REs in the genome at once. They measure genome accessibility and reveal protein occupancy, genome-wide. DGF assays rely on reagents - enzymes or chemicals (this work will only cover enzymes) - that are able to generate single- or double-stranded DNA cleavages in a chromatin-stored DNA template \citep{tsompana_chromatin_2014, vierstra_genomic_2016}.
DGF assays rely on a selective degradation of the loci stored in accessible chromatin followed by high throughput sequencing (-seq). The degradation of the accessible chromatin regions can be performed using either DNaseI (DNase-seq, \cite{neph_expansive_2012}) or a modified Tn5 transposon system (assay for transposase-accessible chromatin using sequencing, abbreviated ATAC-seq, \cite{adey_rapid_2010,buenrostro_transposition_2013}), as shown in Figure \ref{intro_dgf}.
% DNase-seq
DNaseI is an endonuclease that, in proper ionic conditions, introduces double-strand breaks in the genome based on the DNA accessibility, with a minor sequence specificity \citep{herrera_characterization_1994}. On a technical note, DNase-seq assays are quite sensitive. Achieving a proper chromatin degradation - that is, avoiding over-digestion - is not an easy task and requires careful enzyme titrations.
% ATAC-seq
ATAC-seq assays rely on a modified Tn5 transposase enzyme to selectively fragment the accessible regions of the genome \citep{adey_rapid_2010,buenrostro_transposition_2013}. The enzyme inserts small double-stranded barcodes inside the DNA wherever it is accessible, resulting in the creation of double-strand breaks. This process, known as tagmentation, both fragments the genome and inserts sequencing barcodes at once. It should be noted that the Tn5 acts as a homodimer and thus inserts two copies of the same adaptor separated from each other by 9bp \citep{adey_rapid_2010}.
% output
In both cases, the genome is chopped down into fragments wherever the chromatin is accessible. A sequencing library is then created from the fragments and the fragment ends are sequenced using high throughput sequencing technologies. Finally, the insertion sites are located by mapping the sequenced reads against the reference genome of interest. This eventually leads to the creation of a per position cut (nicks for DNase-seq, insertions for ATAC-seq) density (Figure \ref{intro_dgf}).
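One possible way to formalize this density, using a notation introduced here for illustration only, is the following. Let $r_1, \dots, r_N$ be the mapped reads and $p(r_i)$ the genomic coordinate of the cut (DNaseI nick or Tn5 insertion) inferred from the mapped read $r_i$, typically from its 5' end. The cut density at position $x$ is then simply the number of reads whose cut maps to $x$:
\[
c(x) = \left| \left\{ i \in \{1, \dots, N\} : p(r_i) = x \right\} \right| .
\]
Under this definition, an accessible region appears as a stretch of positions with high $c(x)$, whereas positions at which the enzyme cannot cut have $c(x)$ close to zero.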
% footprinting
Whenever a TF, a nucleosome or any other factor is engaged in a binding interaction with the DNA template, the steric hindrance protects the DNA from being degraded by the enzyme, leading to the creation of a typical signal diminution called "footprint" (Figure \ref{intro_dgf}). A footprint is a drop in the degradation signal over a DNA sequence that is protected from degradation because of a binding event \citep{vierstra_genomic_2016}, but I will later use the term "footprint" to refer to a drop in the degradation signal of aggregation profiles as well.
DGF assays enjoy an ever-growing popularity because of the wealth of data produced in a single experiment. Indeed, instead of running thousands of chromatin immunoprecipitation followed by sequencing (ChIP-seq) \citep{barski_high-resolution_2007} experiments - one per transcription factor (TF) - to know where each TF is binding, it is sufficient to run a single chromatin accessibility assay. However, although DGF reveals the active regulatory regions, it does not provide the information about where each individual TF is bound.
