\cleardoublepage
\chapter{Chromatin accessibility of monocytes}
\label{atac_seq}
\markboth{Chromatin accessibility of monocytes}{Chromatin accessibility of monocytes}
\addcontentsline{toc}{chapter}{Chromatin accessibility of monocytes}
This chapter contains ongoing work. I present the foundations of a computational framework to analyse chromatin organization around TF binding sites from ATAC-seq data. The results presented here are preliminary. However, in the best case, they may form the basis for other projects. % Because reporting these results, even if incomplete, is at least useless and at most useful to the scientific community, there is no reason not to present them.
\section{Monitoring TF binding}
As discussed above (see section \ref{intro_dgf}), DGF assays can highlight active regulatory elements across an entire genome at once. However, this comes at the price of an information loss. First, even if we can identify active loci likely to be bound by TFs, we have no direct information about the identity of the bound TF(s). Second, we have no idea about the function of those regions. These regions may act as transcriptional activators or repressors, an activity that is ultimately borne by the bound TFs and other complexes. Thus delineating the function of a region requires identifying the TFs bound there.
This task, even if difficult, can be undertaken by implementing dedicated strategies. First, it is possible to collect evidence about the identity of the TFs likely to bind at a given location through a motif analysis. TFs can bind DNA directly through their own DNA binding domain or indirectly, through an interaction with at least one partner TF which binds DNA directly \citep{neph_expansive_2012}. For a given TF, direct binding events can be detected by monitoring the presence of a binding motif, provided a specificity model is available. Thus a footprint bearing a motif is likely to reflect a direct binding event. However, this method has two important limitations : i) related TFs often share a common DNA specificity and ii) it does not detect indirect binding events. Nonetheless, evidence about the presence of large complexes can be collected by studying the size of the footprint, as large complexes should leave large footprints. This approach, even if limited, is able to pinpoint a handful of candidate TFs.
Second, it has been suggested that the regulatory function of a TF can be deciphered by looking at its footprint. Indeed, previous studies have shown that activator and repressor TFs tend to produce different types of footprints \citep{berest_quantification_2018}. Also, the spatial positioning of the TF motif within the footprint seems to be linked with the factor's function \citep{grossman_positional_2018}. For instance, factors associated with the regulation of transcription tend to have their motif in the middle of the footprint whereas factors known to interact with chromatin remodeling factors tend to have their motif at the edge of the footprint, in contact with the surrounding nucleosomes.
\section{The advent of single cell DGF}
Recently, the advent of single-cell (sc) sequencing technologies has been a real game changer in the field of life science. These technological advances made it possible to measure gene expression (scRNA-seq) and chromatin accessibility (scATAC-seq) at an unprecedented resolution. Whereas bulk sequencing provides an average overview of what is going on, single-cell sequencing allows monitoring what happens in each cell of a population. This advance had a profound impact on genomics for two reasons.
First, for the first time, the heterogeneity of a cell population became accessible and could be studied at the chromatin, transcriptional and protein levels. Second, the possibility of collecting high dimensionality data from tens of thousands of individual cells allowed genomics to fully enter the modern big data era, making commonly used machine learning methods applicable as the number of parameters to estimate in the models became smaller than the number of individuals, in this case cells, in the data \citep{angerer_single_2017}.
% \section{A quick overview of scATAC-seq data analysis}
% So far, most of the single cell technologies are targeted at measuring gene expression through scRNA-seq. Naturally, dedicated algorithms and computational methods have been developed to analyze these data. Currently, the most common types of analyses made are i) data projections and dimensionality reduction such as principal component analysis (PCA), t-stochastic distributed neighbours embedding (t-SNE) or uniform manifold approximation and projection (UMAP) and ii) cell population detection by clustering the cells based on the expression of genes \citep{fan_characterizing_2016, kiselev_sc3:_2017}, by reconstructing gene regulation network \citep{aibar_scenic:_2017} or by identifying cellular states based on the accessible region motif content \citep{gonzalez-blas_cistopic:_2019}. In all cases, the use of scATAC-seq data is to determined whether a region is accessible or not. The downstream analyses characterizes the accessible region using i) the number of reads mapping in these regions as a measure of the accessibility or ii) the sequence content within these accessible regions to determine regulatory topics.
% \section{Open questions}
% \begin{figure}[!htbp]
% \begin{center}
% \includegraphics[scale=0.5]{images/ch_atac-seq/pipeline.png}
% \captionof{figure}{\textbf{framework to identify chromatin organization and use them to annotate cellular state :} the scATAC-seq data available in each individual cell are aggregated and used a if it was a bulk sequencing experiment. Regions of interest are listed using peak calling on the the bulk data. The read densities in these regions (center of the peaks +/- a given offset) are measured. The regions are then clustered based on their signal shape to identify different chromatin architectures and create a catalog. These chromatin signatures can then be used to annotate each region of interest in each cell, based on the signal resemblance. The information can be stored as a matrix (M) that can be used for downstream analyses, such as sub-population identification.}
% \label{atac_seq_pipeline}
% \end{center}
% \end{figure}
% All these methods have shown good performances to identify know and new cell populations [REFERENCES]. However, some issues remains open. First, none of these methods uses DGF data to identify different types of footprints or chromatin architecture, in terms of signal shape, at the single cell level. Second, ATAC-seq measures chromatin accessibility but also provides information about the nucleosome occupancy at accessible genomic regions \citep{buenrostro_transposition_2013}. Thus counting the number of reads mapping at a given loci is, indeed, an indication of accessibility but it does use only a small fraction of the available information. Finally, to date, no study has tried to determine whether what is observed at the bulk level can also be seen at the individual cell level and whether this can be used to infer the molecular state of the cells.
% In this project, I designed and developed the basements of a computational framework to construct a catalog of prototypical chromatin architectures from single-cell data that can later on be used to annotate individual regions, in single cell. Such a method can be useful to determine cellular molecular state and to group cells accordingly. The entire pipeline is illustrate in Figure \ref{atac_seq_pipeline}.
\section{Open issues}
I identified two interesting questions with regard to ATAC-seq data. First, in the previous chapters, I studied how chromatin is organized in the vicinity of TF binding sites using a fairly standard combination of ChIP-seq, DNase-seq and MNase-seq data. However, I wanted to assess to what extent the same could be done from ATAC-seq data, which are cheaper and easier to produce. Second, I wondered to what extent single-cell data could be pooled together and used as a bulk sequencing experiment.
\section{Data}
To this end, I chose to work with a publicly available scATAC-seq dataset of 5'000 human blood monocytes from a healthy donor. These data have been produced by 10xGenomics (\url{https://www.10xgenomics.com}).
% 10xGenomics is one of the most promising and fast growing company specialized in sequencing technologies in the San Francisco Bay area \citep{hepler_10x_2018}. The core activity is to sell sequencing technologies and data analysis softwares to public and private entities. To advertise their products, 10xGenomics offer a free access to several high quality single cell datasets.
10xGenomics is a company active in the field of sequencing technologies and data analysis software. To demonstrate the capabilities of their sequencing and bioinformatics technologies, 10xGenomics offers free access to several high quality single cell datasets together with their analysis results. Pre-processing steps such as mapping, cell demultiplexing, sequencing adapter trimming and quality control checks have thus already been performed, so working with these data requires minimal handling. Additionally, some downstream analyses such as peak calling or clustering have already been performed. For these reasons, their datasets offer all the conditions to be used as a standard to develop and benchmark new analysis methods.
The hg19-mapped reads were downloaded in bam format, as well as the corresponding peaks called on the aggregated data, in bed format.
\section{Identifying over-represented signals}
The study of signal shape (distribution) has been a quite active field for bulk sequencing experiments during the last decade. Dedicated algorithms \citep{hon_chromasig:_2008, nielsen_catchprofiles:_2012, kundaje_ubiquitous_2012, nair_probabilistic_2014, groux_spar-k:_2019} have been developed to cluster genomic regions based on their distribution of reads. As discussed in section \ref{intro_pattern_discovery}, the major issues faced are that, to assess whether two regions are similar in terms of sequencing signal, i) they have to be properly aligned, ii) they have to be properly oriented and iii) there may be region-specific sequencing depth differences. Nonetheless, these algorithms allow capturing important trends in the data and collecting evidence about the underlying biological mechanisms at play.
\subsection{ChIPPartitioning algorithm}
ChIPPartitioning is an algorithm that was developed by \cite{nair_probabilistic_2014} to classify regions based on their sequencing read densities and to identify archetypical read density profiles (or models). Because the algorithm is already presented in detail in section \ref{encode_peaks_chippartitioning}, it will not be discussed further here. Nonetheless, the reader is invited to read the above mentioned section in order to properly understand the points discussed below.
% Most of the above mentioned algorithms and softwares deal with some of these issues. However, ChIPPartitioning \citep{nair_probabilistic_2014} (see section \ref{encode_peaks_chippartitioning}) is really interesting. It is a probabilistic partitioning method that softly clusters a sets of genomic regions represented as a vector of counts corresponding to the number of reads (ChIP-seq, DNase-seq) along them. The regions clustered based on their signal shape resemblance. To ensure proper comparisons between the regions, the algorithm allows to offset one region compare to the other to retrieve a similar signal at different offsets and to flip the signal orientation. Finally, it has been demonstrated to be really robust to sparse data.
% This algorithm models the signal over a region of length $L$ has having being sampled from a mixture of $K$ signal models, using $L$ independent Poisson distributions. The number of reads sequenced over this region is then the result of this sampling process. The entire set of regions is assumed to have been generated from a mixture of $K$ different signal models (classes). Each class is represented by a vector of $L' \le L$ values that represent the expected number of reads at each position for that class. These values are thus the Poisson distribution parameters.
% In order to discover the $K$ different chromatin signatures in the data, the algorithm proceed to a maximum likelihood estimation of the Poisson distribution parameters using an expectation-maximization (EM) framework. Given a set of $K$ models, the likelihoods of each region given each class is computed. A posterior probability of each class given each region can, in turn, be computed. These probabilities can be interpreted as a soft clustering. The parameters of the classes are updated using a weighted aggregation of the signal. Since each region is computed a probability to belong to each class, it participates to the update of all the classes, with different weights.
% If the length of the chromatin signature searched $L'<L$, then the algorithm slides a window along the regions and searched for this signature at each possible offset. This is how it deals with alignment issue. The signal orientation issue is tackled by also performing a searched with the flipped model. The procedure is depicted in Figure \ref{atac_seq_em}A.
\subsection{EMSequence algorithm}
\begin{figure}[!htbp]
\begin{center}
\includegraphics[scale=0.18]{images/ch_atac-seq/em.png}
\captionof{figure}{\textbf{Illustration of the expectation-maximization algorithms} \textbf{A} illustration of ChIPPartitioning, an algorithm dedicated to the discovery of over-represented chromatin patterns, as described in \citep{nair_probabilistic_2014}. \textbf{B} illustration of EMSequence, an algorithm to discover over-represented DNA motifs. The overall design is the same. Both algorithms model the data as having been sampled from a distribution and perform a maximum-likelihood estimation of the distribution parameters from the data through an iterative procedure.\\
The EMJoint algorithm is the combination of both ChIPPartitioning and EMSequence at the same time.}
\label{atac_seq_em}
\end{center}
\end{figure}
The ChIPPartitioning algorithm presents an interesting feature : it explicitly models the sequencing signal. Thus it can be adapted to search for different types of signals, for instance nucleosome architectures or footprints. But because footprints reflect the binding of TFs, it is also critical to be able to identify the motifs within them. To this end, I modified ChIPPartitioning in order to discover over-represented sequence motifs. Let us call this new algorithm EMSequence. The following modifications have been applied to ChIPPartitioning : i) how the class signal is modeled, ii) the way the data likelihoods are computed and iii) the update of the class models. This is illustrated in Figure \ref{atac_seq_em}B.
Conceptually, the proposed algorithm is close to the EM algorithm proposed by Lawrence and Reilly \citep{lawrence_expectation_1990} and to MEME \citep{bailey_fitting_1994}, which are both designed for de novo motif discovery. The difference with the algorithm of Lawrence and Reilly is that EMSequence is generalized to find several classes instead of one. The difference with MEME is that EMSequence searches for all of the different classes at once instead of using an iterative procedure.
The input is composed of an integer matrix $D$ of dimensions $N \times L$ containing $N$ DNA sequences $d_{1}, d_{2}, ..., d_{N}$ of length $L$. Each sequence $d_{i}=(d_{i1}, d_{i2}, ..., d_{iL})$ is a vector of integers encoding it (A=1, C=2, G=3, T=4).
The $K$ class profiles from which the data originate, instead of being modeled as read density profiles, are modeled as sequence motifs $M_{1}, M_{2}, ..., M_{K}$ of expected base probabilities (letter probability matrices, LPMs). A class motif $M_{j}$ is a matrix of dimensions $4 \times L'$ with the constraint $\sum_{b=1}^4 m_{j b,l} = 1$ for every position $l$.
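As an illustration of this integer encoding, a minimal Python sketch could look as follows (the helper name is hypothetical and unrelated to the C++ implementation described in the methods) :
\begin{verbatim}
import numpy as np

# integer encoding used throughout this chapter : A=1, C=2, G=3, T=4
ENCODING = {"A": 1, "C": 2, "G": 3, "T": 4}

def encode_sequences(sequences):
    """Encode a list of equal-length DNA strings into an N x L integer matrix D."""
    N, L = len(sequences), len(sequences[0])
    D = np.zeros((N, L), dtype=int)
    for i, seq in enumerate(sequences):
        D[i] = [ENCODING[base] for base in seq.upper()]
    return D

# toy example
D = encode_sequences(["ACGT", "TTGA"])   # [[1 2 3 4], [4 4 3 1]]
\end{verbatim}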
\subsubsection{without shift and flip}
For the case where $L'=L$, the original equation (1) from \citep{nair_probabilistic_2014} to compute the probability of a sequence $d_{i}$ given a class $M_{j}$ is replaced by :
\begin{equation}
\begin{aligned}
P(d_{i}|M_{j}) & = \prod_{l=1}^{L} m_{j b,l} \\
\text{where } b & = d_{il}
\end{aligned}
\label{atac_seq_emseq_likelihood}
\end{equation}
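As a toy illustration of equation \ref{atac_seq_emseq_likelihood}, with arbitrarily chosen motif values, consider the sequence $d_{i} = \text{ACG}$ (encoded as $1,2,3$) and a class motif giving probability 0.7 to A at position 1, 0.6 to C at position 2 and 0.25 to every base at position 3 :
\begin{equation*}
P(d_{i}|M_{j}) = m_{j 1,1} \times m_{j 2,2} \times m_{j 3,3} = 0.7 \times 0.6 \times 0.25 = 0.105
\end{equation*}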
Once the posterior probabilities $P(M_{j} | d_{i})$ have been computed using equation \ref{intro_eq_post_prob}, the original equation (3) in \citep{nair_probabilistic_2014}, which updates the class models, is modified as follows :
\begin{equation}
\begin{aligned}
m^{*}_{j b,l} & = \frac{\sum_{i=1}^{N} P(M_{j} | d_{i}) \times z}
{\sum_{i=1}^{N} P(M_{j} | d_{i})} \\
\text{with } z & =
\begin{cases}
1, & \text{if $b = d_{il}$}.\\
0, & \text{otherwise}.
\end{cases}
\label{atac_seq_emseq_update_model}
\end{aligned}
\end{equation}
where $M^{*}_{j}$ is the updated model of class $j$ and $b$ takes the values $1,2,3,4$ for A, C, G and T respectively.
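To make the procedure concrete, here is a minimal numpy sketch of one EMSequence iteration without shift and flip, following equations \ref{atac_seq_emseq_likelihood} and \ref{atac_seq_emseq_update_model}. Class prior handling is simplified and pseudocounts are omitted; this is an illustration, not the actual C++ implementation.
\begin{verbatim}
import numpy as np

def em_sequence_iteration(D, motifs, priors):
    """One EM iteration on an N x L integer matrix D (A=1, C=2, G=3, T=4).
    motifs : K x 4 x L array of base probabilities, priors : K class probabilities."""
    N, L = D.shape
    K = motifs.shape[0]

    # E-step : likelihood of each sequence under each class motif
    lik = np.ones((N, K))
    for i in range(N):
        for j in range(K):
            for l in range(L):
                lik[i, j] *= motifs[j, D[i, l] - 1, l]

    # posterior probability of each class given each sequence
    post = lik * priors
    post /= post.sum(axis=1, keepdims=True)

    # M-step : update motifs with posterior-weighted base counts
    new_motifs = np.zeros_like(motifs)
    for j in range(K):
        for l in range(L):
            for b in range(4):
                new_motifs[j, b, l] = post[D[:, l] == b + 1, j].sum()
        new_motifs[j] /= post[:, j].sum()
    new_priors = post.mean(axis=0)
    return new_motifs, new_priors, post
\end{verbatim}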
\subsubsection{with shift and flip}
For the sake of generality, I present the case with shift and flip because the cases with shift only or flip only are special cases of it.
For the case with shifting ($L'<L$) and flipping, the original equation (9) from \citep{nair_probabilistic_2014} to compute the probability of a sub-sequence of length $L'$ starting at offset $s$ in sequence $d_{i}$ given class $M_{j}$ is replaced by :
\begin{equation}
\begin{aligned}
P(d_{i} | M_{j}, s, inv) & = \prod_{l=1}^{L'} m_{j b,l}^{inv}\\
\text{with } b & =
\begin{cases}
d_{i,s+l-1} & \text{if $inv = 1$}.\\
4 - d_{i,s+l-1} + 1 & \text{if $inv = 2$}.
\end{cases} \\
\end{aligned}
\label{atac_seq_emseq_likelihood_shift_flip}
\end{equation}
where $inv$ is a notation indicating the orientation. If $inv=1$ we are searching in forward orientation (the forward strand) and $M_{j}^{1} = M_{j}$. If $inv=2$, we are searching in flipped orientation (reverse strand) and $M_{j}^{2}$ is the reverse complement motif of $M_{j}$. Computing the reverse complement motif $M_{j}^{2}$ of a class motif $M_{j}$ is done using :
\begin{equation}
m_{j b,l}^{2} = m_{j 4-b+1, L'-l+1}
\label{atac_seq_emseq_reverse_motif}
\end{equation}
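With a class motif stored as a $4 \times L'$ array whose rows are ordered A, C, G, T, equation \ref{atac_seq_emseq_reverse_motif} amounts to reversing both axes. As a short numpy sketch (toy values) :
\begin{verbatim}
import numpy as np

# motif : 4 x L' array with rows ordered A, C, G, T (toy values)
motif = np.array([[0.7, 0.1, 0.25],
                  [0.1, 0.6, 0.25],
                  [0.1, 0.2, 0.25],
                  [0.1, 0.1, 0.25]])

# reverse complement : flip the base axis (complement) and the position axis
motif_rc = motif[::-1, ::-1]
\end{verbatim}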
The computation of the posterior probabilities $P(M_{j}, s, inv | d_{i})$ remains the same as in \citep{nair_probabilistic_2014}. With the posterior probabilities, the model update can be undertaken. The original equation (12) in \citep{nair_probabilistic_2014} is modified such that the update of the model is made in two steps : i) an intermediate motif is created for each strand separately and then ii) both are combined, as follows :
\begin{equation}
\begin{aligned}
m^{*1}_{j b,l} & = \sum_{s=1}^{S} \sum_{i=1}^{N} P(M_{j}, s, inv=1 | d_{i}) \times z^{1} \\
\text{with } z^{1} & =
\begin{cases}
1, & \text{if $b = d_{i,l+s-1}$}.\\
0, & \text{otherwise}.
\end{cases} \\
m^{*2}_{j b,l} & = \sum_{s=1}^{S} \sum_{i=1}^{N} P(M_{j}, s, inv=2 | d_{i}) \times z^{2} \\
\text{with } z^{2} & =
\begin{cases}
1, & \text{if $4 - b + 1 = d_{i,l+s-1}$}.\\
0, & \text{otherwise}.
\end{cases} \\
m^{*}_{j b,l} & = \frac {m^{*1}_{j b,l}} {\sum_{b'=1}^{4} m^{*1}_{j b',l}} +
\frac {m^{*2}_{j 4-b+1,L'-l+1}} {\sum_{b'=1}^{4} m^{*2}_{j 4-b'+1,L'-l+1}}
\label{atac_seq_emseq_update_model_shift_flip}
\end{aligned}
\end{equation}
where $m^{*1}$ is the partial motif update for the forward strand, $m^{*2}$ is the partial motif update for the reverse strand and $b$ takes the values $1,2,3,4$ for A, C, G and T respectively.
As in the original algorithm, the optimization process is then carried on for a given number of iterations.
\subsection{EMJoint algorithm}
For completeness, I also describe a generalized EM algorithm that performs the classification of a set of regions using several different signal layers over these regions, at once. This algorithm has been implemented but was not tested as time did not permit it.
Because the ChIPPartitioning and EMSequence computations are strictly identical with the exception of the likelihood computation and the model update, it is possible to design a third algorithm, called EMJoint, that models at the same time one or more read coverage signals over a region and its sequence composition. To do so, I simply combined both previous algorithms and applied the following modifications. For the sake of simplicity, I only present the version with shift and flip, for one read coverage signal and the DNA sequence, as it is the most general.
The input is composed of two matrices of integers, $D$ and $R$, of dimensions $N \times L$, containing respectively $N$ DNA sequences $d_{1}, d_{2}, ..., d_{N}$ of length $L$ and $N$ vectors of read counts $r_{1}, r_{2}, ..., r_{N}$ of length $L$. Each DNA sequence $d_{i}=(d_{i1}, d_{i2}, ..., d_{iL})$ is a vector of integers encoding the DNA sequence (A=1, C=2, G=3, T=4) and each read count vector $r_{i}=(r_{i1}, r_{i2}, ..., r_{iL})$ contains the number of reads mapping over the corresponding sequence in $D$.
The positions represented by each cell of the matrices $R$ and $D$ must be strictly the same. That is, if $D_{10,74}$ represents the base at position 12'342'457 of chromosome 9, $R_{10,74}$ must contain the density of reads that are mapped at position 12'342'457 of chromosome 9.
Each class is modeled by a vector of expected numbers of reads $C_{j} = (c_{j1}, c_{j2}, ..., c_{jL'})$ of length $L'$ and by a sequence motif $M_{j}$ of expected base probabilities of dimensions $4 \times L'$ with the constraint $\sum_{b=1}^4 m_{j b,l} = 1$ for every position $l$.
To compute the likelihood $P(r_{i}, d_{i}, s, inv |C^{inv}_{j}, M^{inv}_{j})$ of a region, equation \ref{atac_seq_emseq_likelihood_shift_flip} is modified as follows :
\begin{equation}
\begin{aligned}
P(r_{i}, d_{i}, s, inv |C^{inv}_{j}, M^{inv}_{j}) & =
\prod_{l=1}^{L'} Poisson(r_{i,s+l-1}, \lambda=c^{inv}_{j,l})
\times
m^{inv}_{j b,l}\\
\text{with } b & =
\begin{cases}
d_{i,s+l-1} & \text{if $inv = 1$}.\\
4 - d_{i,s+l-1} + 1 & \text{if $inv = 2$}.
\end{cases} \\
\end{aligned}
\label{atac_seq_emjoint_likelihood}
\end{equation}
where $\lambda$ is the mean parameter of the $Poisson$ probability mass function.
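For illustration, a minimal Python sketch of this joint likelihood for one region, one class and one (shift, flip) state is given below. The array layouts and the way the flipped orientation is handled (reverse-complemented motif and mirrored coverage) are conventions of this sketch, not of the actual implementation.
\begin{verbatim}
import numpy as np
from scipy.stats import poisson

def joint_likelihood(reads, seq, motif, coverage, s, flip):
    """reads and seq : length-L vectors of counts and encoded bases for one region,
    motif : 4 x L' array (rows A, C, G, T), coverage : length-L' expected counts,
    s : 0-based shift offset, flip : True for the reverse orientation."""
    Lp = motif.shape[1]
    if flip:                                  # reverse complement the class models
        motif = motif[::-1, ::-1]
        coverage = coverage[::-1]
    p = 1.0
    for l in range(Lp):
        p *= poisson.pmf(reads[s + l], coverage[l])   # read count term
        p *= motif[seq[s + l] - 1, l]                 # sequence term (A=1 ... T=4)
    return p
\end{verbatim}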
The computation of the posterior probabilities $P(C_{j}, M_{j} | r_{i}, d_{i})$ remains unchanged. Once these values have been computed, it is possible to update both parts $C_{j}$ and $M_{j}$ of a class using the original equation (11) from \citep{nair_probabilistic_2014} and equation \ref{atac_seq_emseq_update_model_shift_flip} respectively.
It is possible to further generalize this algorithm in order for it to accept $Z$ different input matrices (called layers, at most 1 DNA sequence matrix and any number of read density matrices) of dimensions $N \times L$ containing different types of signal (for instance DNA sequences, TF$_{1}$ ChIP-seq, TF$_{2}$ ChIP-seq, DNase-seq, ...) for a set of $N$ regions.
This only requires adapting how the classes are modeled and extending equation \ref{atac_seq_emjoint_likelihood} to take the product over the $Z$ different layers instead of only two. Additionally, care should be taken to use equation \ref{atac_seq_emseq_likelihood_shift_flip} for the DNA sequence layer and equation (6) from \citep{nair_probabilistic_2014} for the read count layers.
\subsection{Data realignment}
% All of the above described algorithms compute a set of posterior probabilities and use them to perform the class model update. As illustrated in Figure \ref{atac_seq_em}, this procedure is actually a weighted and ungaped data alignment in which the posterior probabilities are the weights.
% It is absolutely feasible to run a partitioning on a given matrix $A$, for instance ATAC-seq read counts, using EMRead, and to subsequently use the obtained posterior probabilities to compute the class models, using another data matrix, let us say $B$ of DNA sequence. In this case, equation \ref{atac_seq_emseq_update_model_shift_flip} should be used because $B$ contains DNA sequences.
% This procedure allows to realign dataset $B$ as $A$ in order to co-visualize different types of signals. The only things that should be taken care of is that matrices $A$ and $B$ should have the same dimensions.
As for ChIPPartitioning, these algorithms compute a set of posterior probabilities and use them to perform the class model updates. Thus, each one of them can be used to partition a dataset $A$ and realign another dataset $B$, using the same procedure as described in section \ref{encode_peaks_data_realign} for ChIPPartitioning.
Furthermore, it is perfectly feasible to run a partitioning on a given matrix $A$, let us say of DNA sequences using EMSequence, and to subsequently use the obtained posterior probabilities to compute the class models from another data matrix $B$ of ATAC-seq read counts. Care should only be taken to use the appropriate model computation equation.
In the following sections, this is the procedure that will be used to overlay different types of data for a given partition.
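As a minimal illustration of this realignment (shift and flip are ignored to keep the sketch short, and the array names are arbitrary), the posterior probabilities obtained from one data matrix can be used to compute class models from another matrix of the same dimensions :
\begin{verbatim}
import numpy as np

def realign(B, post):
    """B : N x L data matrix (e.g. ATAC-seq read counts), post : N x K posterior
    probabilities obtained by partitioning another N x L matrix A (e.g. sequences).
    Returns K x L class models of B, aligned as A."""
    # posterior-weighted average of the rows of B for each class
    return (post.T @ B) / post.sum(axis=0)[:, None]
\end{verbatim}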
\subsection{Soft aggregation plots}
\label{atac_soft_ap}
Given a read density matrix $R$ containing the signal of $N$ regions of length $L$, a standard aggregation plot can be computed by averaging the signal present at each position over the rows. This results in the creation of a vector of length $L$ that represents the average signal in $R$.
In essence, ChIPPartitioning class models are aggregation plots. A data partition into $K$ classes creates $K$ models that are as many aggregation vectors. Each model contains the average expected signal of a class. It only differs from the above definition by the fact that a \textit{weighted} average is computed. For a given class, each row is assigned a weight that is the (posterior) probability that this region belongs to this class. If shifting is enabled, then the aggregation vectors have a length $L'=L-S+1$ where $S$ is the shifting freedom.
Conceptually, EMSequence does exactly the same except that aggregating DNA sequences results in the creation of an LPM instead of a vector. Obviously, at a given position, one cannot average 'A' and 'G' together. Instead one can say that there are 50\% A, 0\% C, 50\% G and 0\% T.
In the following sections, soft aggregation plots simply correspond to ChIPPartitioning/EMSequence models.
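For the sequence case, and ignoring shift and flip for brevity, a soft aggregation of an integer sequence matrix into a class LPM could be sketched as follows (names and layout are illustrative assumptions) :
\begin{verbatim}
import numpy as np

def soft_sequence_aggregation(D, weights):
    """D : N x L integer sequence matrix (A=1, C=2, G=3, T=4), weights : length-N
    posterior probabilities of one class. Returns a 4 x L letter probability matrix."""
    N, L = D.shape
    lpm = np.zeros((4, L))
    for b in range(4):
        # posterior-weighted count of base b+1 at each position
        lpm[b] = (weights[:, None] * (D == b + 1)).sum(axis=0)
    return lpm / weights.sum()
\end{verbatim}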
\section{Data processing}
Prior to undertaking the chromatin organization study, several pre-processing steps and checks had to be taken in order to ensure a proper treatment of the data.
If a TF can protect a stretch of DNA against transposition and create a footprint, so can a nucleosome. However, both cases are biologically drastically different. Nucleosomes compete with TFs to bind DNA \citep{voss_dynamic_2014}. Thus nucleosome footprints represent regions of the genome that cannot be bound by TFs, with the exception of pioneer factors \citep{cirillo_opening_2002,zaret_pioneer_2011}. Mixing nucleosome and TF footprints could bias downstream analyses.
The sequenced fragments were split into two classes based on their sizes, following the method described in \cite{buenrostro_transposition_2013} (more details are available in section \ref{suppl_atac_seq_fragment_size}). One class contained the small fragments that originate from accessible chromatin regions - later referred to as "open chromatin" - and the second class contained longer fragments mapping to individual nucleosomes - later referred to as "nucleosomes".
Then, I used the edges of the open chromatin reads to reveal chromatin accessibility and TF footprints and the middle position of the nucleosome fragments to reveal the histone octamer occupancy (more details in section \ref{suppl_atac_seq_measuring_signal}).
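For illustration, a minimal pysam-based sketch of this fragment size splitting and signal extraction is given below. The size ranges are those reported in the methods, the genomic window is arbitrary, and the exact coordinate conventions are simplifications; the actual processing was done with the in-house scripts described in the methods.
\begin{verbatim}
import pysam

# size ranges used in this chapter (see methods) : open chromatin 30-84 bp,
# mono-nucleosomes 133-266 bp, di-nucleosomes 341-500 bp
def fragment_class(fragment_length):
    if 30 <= fragment_length <= 84:
        return "open"
    if 133 <= fragment_length <= 266:
        return "nucleosome"
    if 341 <= fragment_length <= 500:
        return "dinucleosome"
    return None

# bam file name as provided by 10xGenomics ; the genomic window is arbitrary
with pysam.AlignmentFile("atac_v1_pbmc_5k_possorted_bam.bam", "rb") as bam:
    for read in bam.fetch("chr1", 1000000, 1010000):
        # visit each properly paired fragment once, through its leftmost mate
        if not read.is_proper_pair or read.template_length <= 0:
            continue
        frag_len = read.template_length
        label = fragment_class(frag_len)
        if label == "open":
            # open chromatin : the fragment edges mark the transposition events
            edges = (read.reference_start, read.reference_start + frag_len - 1)
        elif label == "nucleosome":
            # nucleosomes : the fragment center approximates the dyad position
            center = read.reference_start + frag_len // 2
        # di-nucleosome fragments are cut in two at their center in the actual
        # pipeline, each half then being treated as a mono-nucleosome fragment
\end{verbatim}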
\section{Results}
To create a catalog of chromatin architectures around TF binding sites in monocytes, it was necessary either to align the regions of interest properly (with respect to the binding sites) or to have methods able to deal with this issue.
\subsection{Aligning the binding sites}
\begin{figure}
\begin{center}
\includegraphics[scale=0.35]{images/ch_atac-seq/peaks_rmsk_sampled_sequences_23class_2.png}
\captionof{figure}{\textbf{Central parts of the extended sequence and chromatin models} found in monocytes regulatory regions. Each read density and sequence pattern is a soft aggregation plot. The displayed logos correspond to the sequence class models found by EMSequence. The corresponding chromatin accessibility (red) and nucleosome occupancy (blue) are displayed above the logos. The classes are displayed by overall decreasing probability. A zoom over the central part of each class aggregation is shown in the top right inset.}
\label{atac_seq_23class}
\end{center}
\end{figure}
The list of active regulatory regions was assumed to correspond to regions of high ATAC-seq signal. Consequently, I chose to use the peak list generated by 10xGenomics for this dataset as the list of regulatory regions of interest. The regulatory regions overlapping known repeated elements were filtered out, leaving 70'462 regulatory regions.
An assessment of ChIPPartitioning indicated that it did not seem to be able to realign regulatory elements - and reveal footprints - using their chromatin accessibility profiles at the single base resolution (see section \ref{suppl_eval_emseq_chippartitioning}). However, EMSequence performed well at retrieving sequence motifs (see section \ref{suppl_eval_emseq_chippartitioning}). Because footprints are expected to be located over TF binding motifs, I decided to use EMSequence to realign the regions based on their sequence motifs to reveal TF footprints.
% Because ChIPPartitioning did not seem to be able to realign regulatory elements - and reveal footprints - based on their chromatin accessibility profiles, I decided to use EMSequence. De novo motif discovery was not strictly speaking required. Instead, the aim was to reveal TF footprints in these regulatory regions.
In order to limit the scope of the investigation, I decided to focus on TFs known a priori to be important for monocyte biology \citep{kurotaki_transcriptional_2017,rico_comparative_2017}. These TFs were : jun, HIF1a, myc, PU.1, CEBPB, IRF2, IRF4, IRF8, LHX3, FOXH1, SOX, MEF2c, ELF5, STAT6, NFE2, AHR, E2F2, E2F3, KLF2, KLF4 and NR4A1. Additionally, CTCF was added together with the EGR, GATA, NFAT and RUNX families to widen the spectrum of TFs included in the analysis. Because TFs within a given family tend to bind the same motif (for instance IRF4 and IRF8 or E2F2 and E2F3), binding models representative of sets of TFs were selected from the JASPAR database motif clustering \citep{castro-mondragon_rsat_2017}. In total, 23 LPMs were downloaded from the JASPAR database clustering.
% In order to partition the regulatory sequences into 23 classes corresponding to the binding sites of these TFs, I used a special EMSequence setup. EMSequence model parameters were initialized using the downloaded LPMs. Then EMSequence was run for 1 iteration. The rational was that EMSequence was not required to learn the models but instead to partition and align the sequences using these known models.
In order to realign the regions and reveal the TF footprints, the 70'462 sequences of 1'001bp centered on the regulatory element mid positions were subjected to a special EMSequence partitioning setup. The EMSequence model parameters were initialized using the downloaded LPMs. Then EMSequence was run for 1 iteration. The rationale was that EMSequence was not required to learn the models but instead to partition and align the sequences using these known models. A shifting freedom of 971 was allowed, resulting in the alignment of 30bp long sub-regions, together with flipping freedom. Based on the alignment and data, the resulting 30bp ATAC-seq signal and sequence models were extended by 500bp on each side to reveal the organization of the regulatory sequences. This created 23 soft aggregation plots revealing the sequence and chromatin architecture around each TF binding site (Figures \ref{suppl_atac_seq_23class} and \ref{atac_seq_23class}).
First, from the class aggregations, footprints are clearly visible over the TF binding motifs. This is strong evidence that the region realignment worked properly. Second, the 23 different classes showed various types of footprints. For instance, CTCF shows its usual strongly positioned nucleosome arrays together with a clear chromatin opening over the motif, supporting CTCF binding. The important monocyte TF PU.1 also shows an increased chromatin accessibility at its binding sites. However, the footprint drastically differs from CTCF in the sense that a clear and wide signal drop - larger than the PU.1 motif alone - is visible. It is also concomitant with an increased nucleosome occupancy. Conversely, LHX3 shows a pattern that rather suggests a modest chromatin opening. Finally, KLF family binding sites show a strong chromatin accessibility rather than a protection of the bound sequences, which may be compatible with a transcriptional repressive activity \citep{berest_quantification_2018}.
Third, the overall class probabilities give an indication of the regulatory element content in terms of motifs. It seems that the CTCF motif is the most common one, even though this does not mean that each motif instance is bound or even functional.
\subsection{Exploring individual TF classes}
\begin{figure}
\begin{center}
\includegraphics[scale=0.30]{images/ch_atac-seq/data_classCTCF_8class.png}
\captionof{figure}{\textbf{Soft aggregation plots of CTCF sub-classes} obtained by extracting the CTCF class data and subjecting them to a ChIPPartitioning classification into 8 classes. The displayed logos correspond to each class sequence aggregation. The corresponding chromatin accessibility (red) and nucleosome occupancy (blue) are displayed above the logos. The classes are displayed by overall decreasing probability. A zoom over the central part of each class aggregation is shown in the top right inset.}
\label{atac_seq_ctcf_subclass}
\end{center}
\end{figure}
The results shown in the previous section are per TF aggregation profiles. Thus a further exploration of each class is required to investigate whether several different footprint classes can be isolated per TF. To do so, I extracted the data assigned to each class and ran ChIPPartitioning on these data. Because the regions had already been aligned such that the TF motifs were in their centers, ChIPPartitioning was allowed neither shifting nor flipping.
As expected, applying this method refined the results. For instance, the classification of the CTCF class data (Figure \ref{atac_seq_ctcf_subclass}) showed sub-classes in which CTCF motif instances were likely not bound (sub-classes 8, 6, 7, 2 and 5) as well as sub-classes in which they were likely bound (sub-classes 4, 3 and 1). In the latter group, several chromatin organizations could be revealed, with approximately 35\% of the motif instances showing the canonical CTCF chromatin organization (class 4).
The same is illustrated for the PU.1 and AP1 classes (Figures \ref{suppl_atac_seq_pu1_subclass} and \ref{suppl_atac_seq_ap1_subclass}). In both cases, it was possible to identify likely bound and unbound motif instance sub-classes. Also, for these two TFs, the nucleosomes are not visible, in line with my previous results showing that only CTCF has nucleosome arrays organized with respect to its binding sites (see Chapter \ref{encode_peaks}).
\section{Discussion}
Even though preliminary, these results showed that this computational framework can prove useful to analyse the chromatin organization around TF binding sites using ATAC-seq data.
First, and not much of a surprise, applying population level analyses to the pool of single cell data gave meaningful results.
Second, ChIPPartitioning turned out to be unable to properly phase unaligned regions based on their chromatin accessibility patterns. Instead, the newly proposed EMSequence algorithm turned out to be usable for this task, in a special setting, and was able to produce a meaningful per TF data realignment. As a reminder, short models were searched for (and thus a large shifting freedom was set). This alignment was then used to realign larger regions and revealed footprints. Also, a priori knowledge was fed in under the form of initial sequence model values taken from TF binding model databases.
Third, I presented a method to extract the data assigned to a class, from a probabilistic partition, without using any hard assignment shortcut. Running ChIPPartitioning on these data then turned out to reveal different chromatin organizations for each TF, making it possible to distinguish between likely bound and unbound motif instances.
\section{Perspectives}
In its current form, this framework is incomplete and many things could be modified or included.
As an immediate algorithm improvement, I propose to modify EMSequence. Currently, it models the DNA sequences using LPMs. LPMs have many advantages but, on the other hand, they imply a strict order of occurrence. For instance, a 50bp long matrix can represent two sub-motifs of 10bp each, separated by 30bp. Consequently, it cannot handle a case where the order of occurrence of the sub-motifs is inverted, or where the spacing is different, other than by dedicating a class to it. To circumvent this, I propose to handle the DNA sequence content using a list of k-mers. K-mers would have two immediate benefits. First, they would alleviate the ordering constraint. Any k-mer can occur before or after any other. The only thing that matters is whether the k-mers are present or not in the sequence. Second, the use of k-mers would imply "binning" the sequences. For instance, 10-mers would split the sequences into bins of 10bp. This point is of interest for EMJoint, which uses EMSequence and ChIPPartitioning jointly and currently requires bins of 1bp if a DNA sequence matrix is used. The usage of bigger bins has the following advantages : i) it will smooth the read densities for ChIPPartitioning leading to higher classification accuracy, ii) it will reduce the memory requirements (the data dimensions diminish) and thus iii) it will reduce the runtime. On the other hand, storing k-mers can be a burden. Storing all possible k-mers requires $\Theta(K \times 5^{S})$ memory (including 'N's) where $K$ is the number of classes and $S$ the k-mer length. But this can be changed to $\mathcal{O}(N \times (L-S+1))$ where $N$ is the number of sequences and $L$ the sequence length, using hashtables, which scales better for high values of $K$, $S$, $N$ and $L$.
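As a minimal sketch of the proposed hashtable-based representation (using a plain Python dict as the hashtable), only the k-mers actually observed in a sequence are stored, which gives the $\mathcal{O}(N \times (L-S+1))$ bound mentioned above :
\begin{verbatim}
from collections import defaultdict

def kmer_counts(sequence, k):
    """Count the k-mers of one sequence in a hashtable ; only observed k-mers
    are stored, i.e. at most L - k + 1 entries per sequence."""
    counts = defaultdict(int)
    for start in range(len(sequence) - k + 1):
        counts[sequence[start:start + k]] += 1
    return counts

# toy example : the 4-mers of a 10 bp sequence yield at most 7 hashtable entries
print(kmer_counts("ACGTACGTAC", 4))
\end{verbatim}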
Additionally, a method to estimate the fit of a given partition and choose the best one would be an asset. This would help choosing the appropriate number of classes to search. This could be implemented using the Akaike information criterion \citep{marsland_machine_2015-1}.
So far, most of the single cell technologies are targeted at measuring gene expression through scRNA-seq. Cell sub-population detection by clustering the expression matrix \citep{fan_characterizing_2016, kiselev_sc3:_2017}, by reconstructing gene regulatory networks \citep{aibar_scenic:_2017} or by identifying cellular states based on the accessible region motif content \citep{gonzalez-blas_cistopic:_2019} is popular. Currently, the use of ATAC-seq data at the single cell level is mostly limited to a binary open/closed classification. One can imagine using the framework described in this chapter to draw a catalog of chromatin structures from the pool of single-cell data and use it to annotate each cell. More precisely, this could be done by going back to each peak in each cell and assigning a qualitative label corresponding to the chromatin model that best matches (is the most similar to) this region in this cell. Ultimately, this would lead to the creation of a matrix (cells x regions) that could be used to run clustering methods. How the similarity should be computed and whether each cell will have a high enough coverage for the similarity computations to be meaningful remain open questions. Alternatively, one can replace single cells by different bulk experiments. In this case, the clustering would not isolate cell sub-populations but experiments (individuals, culture conditions, etc.) that are similar to each other.
\section{Methods}
\subsection{Code availability}
All the code used in this project is available on a c4science git repository (\url{https://c4science.ch/source/scATAC-seq.git}).
\subsection{Data sources}
% read, peaks, barcode
The reads mapped to hg19 genome were downloaded, in bam format, from 10xGenomics website (\url{http://s3-us-west-2.amazonaws.com/10x.files/samples/cell-atac/1.1.0/atac_v1_pbmc_5k/atac_v1_pbmc_5k_possorted_bam.bam}). The corresponding peaks called on the aggregated data were downloaded, in bed format, from \url{http://cf.10xgenomics.com/samples/cell-atac/1.1.0/atac_v1_pbmc_5k/atac_v1_pbmc_5k_peaks.bed}. A file containing cell barcode related information was downloaded from \url{http://cf.10xgenomics.com/samples/cell-atac/1.0.1/atac_v1_pbmc_5k/atac_v1_pbmc_5k_singlecell.csv} and the barcode sequences extracted using "grep -E \_cell\_[0-9]+ <file\_csv> | cut -d ',' -f 1" where <file\_csv> is the downloaded file.
%genome
The hg19 genome sequence was downloaded from the Ensembl ftp at \url{ftp://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/}. Chromosome 1, 2, ..., 22, X and Y sequences were downloaded in fasta format and concatenated together. The sequence headers were then formatted to fit a "chr<N>" format where <N> is 1, 2, ..., 22, X or Y to correspond to the sequence field values in the bed file containing the peaks.
% repeat mask
The list of repeated elements for hg19 was downloaded from the UCSC Genome Browser Table Browser (\url{http://genome.ucsc.edu/cgi-bin/hgTables}). The parameters were set as follows : i) clade to mammal, ii) genome to human, iii) assembly to hg19, iv) group to repeat, v) track to repeatMasker. Finally, all the regions that were not mapped to chromosomes 1 to 22, M, X or Y were filtered out.
\subsection{Data post-processing}
% reads
The reads that did not have a proper barcode were filtered out using "python3.6 filter\_bam.py -i <reads\_bam> --tag CB --values <barcodes\_txt> -o <output\_bam>" where <reads\_bam> is the bam file containing the reads, <barcodes\_txt> the file containing the barcodes created with the above grep command and <output\_bam> is the output bam file. filter\_bam.py is an in-house python program developed for this project.
% peaks
The peak name field was modified such that the values corresponding to a chromosome name followed a "chrN" format using "sed -E s/\^{}\textbackslash([0-9XY])/chr\textbackslash\textbackslash1/". Then only the peaks mapping to chromosome contigs were selected using "grep -E \^{}chr".
The peaks that had at least 30\% overlap with any repeated element were filtered out using "bedtools subtract -A -f 0.3 -a <peaks\_bed> -b <repeats\_bed>" from the bedtools suite \citep{quinlan_bedtools:_2010}, where <peaks\_bed> and <repeats\_bed> are the peak and repeat files in bed format. Finally, the peaks were sorted by position using "sort -k1,1V -k2,2n -k3,3n".
\subsection{Model extension}
\label{atac_seq_method_model_ext}
Let's assume that we have partitioned a read density matrix $R$ of dimensions $N \times L$ using $K$ classes, with a shifting freedom $S$ and with flipping. The posterior probability matrix $P$ has dimensions $N \times K \times S \times 2$ (region, class, shift, flip) and the $K$ models have a length of $L'=L-S+1$.
Extending the models is done by computing a larger matrix $R^{ext}$ of dimensions $N \times L''$ where $L''=L+E$ and $E$ is the extra number of columns to add. Care should be taken to construct $R^{ext}$ by adding exactly $E/2$ columns on each side of $R$ such that $R$ is contained in the central part of $R^{ext}$.
Once $R^{ext}$ has been constructed, the model update step (equation 11 in \cite{nair_probabilistic_2014}) can be applied to it using the posterior probability matrix $P$. This results in the creation of $K$ models of length $L''-S+1$. The $K$ original models will be contained in the central part of the $K$ extended models.
Extending a sequence model is done using exactly the same procedure with the exception that equation \ref{atac_seq_emseq_update_model_shift_flip} should be used to compute the models.
The read and sequence model extension procedures are implemented in C++ in the ReadModelExtender and SequenceModelExtender programs.
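A minimal numpy sketch of the read model extension is given below; flip is omitted for brevity and the array names follow the notation above, so this is an illustration of the idea rather than the actual C++ implementation.
\begin{verbatim}
import numpy as np

def extend_read_model(R_ext, post, k, S):
    """R_ext : N x (L + E) read count matrix with the original N x L matrix R in
    its centre, post : N x K x S posterior probabilities computed on R (flip is
    ignored here), k : class to extend, S : shifting freedom. Returns the extended
    model of length (L + E) - S + 1 whose central part is the original class model."""
    N, L_ext = R_ext.shape
    model_len = L_ext - S + 1
    model = np.zeros(model_len)
    norm = 0.0
    for i in range(N):
        for s in range(S):
            # the window for shift state s now starts E/2 columns further left
            model += post[i, k, s] * R_ext[i, s:s + model_len]
            norm += post[i, k, s]
    return model / norm
\end{verbatim}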
\subsection{Extracting data assigned to a class}
% The difference between hard and soft clustering (such as ChIPPartitioning) methods is that in soft clustering "the output is a membership function, so each pattern can belong to more than one group with varying degrees of membership" \citep{dalton_clustering_2009}, while in hard clustering each pattern is assigned to only one group. In the former case, isolating all regions assigned to a class $X$, creating a matrix of read density and re-running the clustering method on this matrix is straightforward and would do the trick. In the latter case, this is also possible but requires to account for the feature described above.
% Let's assume that a first matrix $R$ of dimensions $NxL$ containing $N$ regions of length $L$ has been partitioned in $K$ classes by ChIPPartitioning, with shifting freedom $S<L$ and flip. This created a probability matrix $P$ of dimensions $NxKxSx2$ (region, class, shift, flip). Computing a new matrix $R^{class}$ to extract the signal of the class of interest $X$ is actually related to the model update performed by ChIPPartitioning (see Equation 11 from \cite{nair_probabilistic_2014}). However, instead of summing (aggregating) everything in a single vector containing the class model, we want to unfold it into a matrix.
% As described in section \ref{atac_soft_ap}, ChIPPartitioning/EMSequence models are equivalent to aggregation plots. Extracting the data assigned to a given class conceptually corresponds to creating the matrix which is averaged to create to models. The computations required to do this are highly similar to those involved in the model update performed by ChIPPartitioning (see Equation 11 from \cite{nair_probabilistic_2014}). However, instead of summing (aggregating) everything in a single vector containing the class model, we want to unfold it into a matrix, that we will call .
The aim of the manipulation described in this section conceptually corresponds to creating a matrix containing only the rows (regions) assigned to a given class $X$. This is of interest to run downstream analyses on this specific data subset (in this case, further clustering). For hard clustering algorithms, this can be done quite easily : it only requires selecting the regions (rows) that have been assigned to class $X$. However, for soft partitioning, things are different and each and every region is assigned to all classes, with varying degrees (probabilities) \citep{dalton_clustering_2009}.
Let's assume that a first matrix of read densities $R$ of dimensions $N \times L$ containing $N$ regions of length $L$ has been partitioned into $K$ classes by ChIPPartitioning, with shifting freedom $S<L$ and flipping. This created a probability matrix $P$ of dimensions $N \times K \times S \times 2$ (region, class, shift, flip).
As described in section \ref{atac_soft_ap}, the ChIPPartitioning models are equivalent to aggregation plots. The aim is thus to do the reverse operation : unfold a given class $X$ model (a given aggregation vector) to create a matrix having $N$ rows and $L-S+1$ columns, where $S$ is the shifting freedom. However, if $S$ is high, then the number of columns $L-S+1$ is low and the created matrix only provides a narrow view of the regions of interest. Enlarging the matrix such that it has more than $L-S+1$ columns can be desirable.
Let us proceed in two steps.
First, let us construct an extended matrix of read densities $R^{ext}$ of dimensions $N \times L'$ where $L'=L+S$. Constructing $R^{ext}$ is equivalent to adding $S/2$ columns on each side of $R$ (as described in section \ref{atac_seq_method_model_ext}).
Second, let us construct the wanted final matrix $R^{class}$ of dimensions $N \times L$ using a modified "class model update" procedure. This procedure, instead of aggregating the signal into a single vector to create the class model, unfolds it and creates a matrix (as described in algorithm \ref{atac_seq_algo_extract_class}).
Because some regions (rows) can be assigned to class $X$ with a very low probability, the corresponding rows in $R^{class}$ will show a very weak signal or even no signal at all.
\SetKwProg{Fn}{}{\{}{}\SetKwFunction{Function}{matrix ExtractClassData}%
\begin{algorithm}[H]
\Fn{\Function{}}
{ \KwData{The matrices $R^{ext}$ and $P$ and the class of interest $X$.}
\KwResult{The class matrix $R^{class}$}
\tcp{overall class probabilies}
$class.prob =$ vector of K 0's \;
$tot = 0$ \;
\For{$i from 1 to N$}
{ \For{$j from 1 to K$}
{ \For{$k from 1 to S$}
{ \For{$l from 1 to 2$}
{ $class.prob_{j} \pluseq p_{i,j,k,l}$ \;
$tot \pluseq p_{i,j,k,l}$ \;
}
}
}
}
\For{$j from 1 to K$}
{ $class.prob_{j} \divideq tot$ \; }
\tcp{modified class model update}
\For{$i from 1 to N$}
{ \For{$s from 1 to S$}
{
\tcp{forward orientation}
$from.fw = s$ \;
$to.fw = from.fw + L - 1$ \;
$col = 1$ \;
\For{$j.fw from from.fw to to.fw$}
{
$R^{class}_{i,col} \pluseq \frac{P_{i,X,s,1} \times R^{ext}_{i,j.fw}} {class.prob_{X}}$ \;
$col \pluseq 1$ \;
}
\tcp{reverse orientation}
$col = 1$ \;
$from.rv = L' - 1 - s$ \;
$to.rv = from.rv - (L-1)$ \;
\For{$j.rv from from.rv down to to.rv$}
{ $R^{class}_{i,col} \pluseq \frac{P_{i,X,s,2} \times R^{ext}_{i,j.rv}} {class.prob_{X}}$ \;
$col \pluseq 1$ \;
}
}
}
\Return{$R^{class}$}
}
\caption{Computes a matrix containing the data assigned to a given class $X$.}
\label{atac_seq_algo_extract_class}
\end{algorithm}
The same can be done for sequence data. The read density and sequence data extraction procedures are implemented in C++ in the ClassReadDataCreator and ClassSequenceDataCreator programs, respectively.
\subsection{Programs}
In order to allow easy handling and quick treatment of the data, the algorithms and procedures described above have been implemented in C++ and some are fully multi-threaded.
Here is a list of the relevant C++ implementations :
\begin{itemize}
\item SequenceMatrixCreator : creates a sequence matrix in which each row contains the sequence of a given region. It takes as input a fasta file containing the sequences of interest, a bed file that specifies regions of interest in the fasta sequences and a from/to range that will specify the sequence boundaries around the center position of the regions listed in the bed file.
\item ReadMatrixCreator : creates a read density matrix in which each row contains the number of reads mapping at each position of a given region. It takes as input a bam file containing the sequencing reads of interest, the bam index file, a bed file that specifies regions of interest and a from/to range that will specify the region boundaries around the center position of the regions listed in the bed file. Additionally, it also takes a bin size that is used to aggregate the counts in each region. A bin size of 1 means a signal resolution at the base-pair.
\item WhichNullRows : returns the indices of the rows that have no signal in a read density matrix. It takes a read density matrix file as input.
\item EMRead : implementation of the ChIPPartitioning algorithm. It takes as input a read count data matrix, the number of classes and the shifting and flipping parameters, and returns the posterior probability matrix in binary format.
\item EMSequence : implementation of the EMSequence algorithm. It takes as input a DNA sequence data matrix, the number of classes and the shifting and flipping parameters, and returns the posterior probability matrix in binary format.
\item EMJoint : implementation of the generalized EMJoint algorithm. It takes as input any number of data matrices, the number of classes and the shifting and flipping parameters, and returns the posterior probability matrix in binary format. This program can be given 0 or 1 DNA sequence matrix and any number of read count matrices as input. If a DNA sequence matrix is given, the read density matrix/matrices given alongside must be at the single base-pair resolution (bin size). If only read density matrices are given, any resolution is acceptable.
\item ProbToModel : implementation of the data realignment procedure. It takes as input a data matrix (DNA sequence or read counts) and a matrix of posterior probabilities, and returns the corresponding (read or sequence) class models. These values can be used as they are to create soft aggregation plots.
\item ClassReadDataCreator : implementation of the class data extraction procedure for read density data. It takes as input a peak file (whose centers will be positioned in the center of each row), a bam file containing the reads of interest, the corresponding bam index file, a posterior probability matrix file returned by a partitioning program, a from/to range indicating the width of the final matrix and the number (the ID) of the class to extract.
\item ClassSequenceDataCreator : implementation of the class data extraction procedure for sequence data. It takes as input a peak file (whose centers will be positioned in the center of each row), a fasta file containing the sequences of interest, the posterior probability matrix file returned by a partitioning program, a from/to range indicating the width of the final matrix and the number (the ID) of the class to extract.
\item ReadModelExtender : implementation of the read model extension procedure. This program takes as input a bam file containing the reads, a bam file index file, a bed file containing the regions of interest, a probability matrix created by a classification program, the from/to values that were used to create the matrix that was classified, the bin size that was used to create the matrix that was classified and the total number of columns (positions) to add to the model. The result of this program is the extended model.
\item SequenceModelExtender : implementation of the sequence model extension procedure. This program takes as input a fasta file containing the genome sequence, a bed file containing the regions of interest, a probability matrix created by a classification program, the from/to values that were used to create the matrix that was classified and the total number of columns (positions) to add to the model.
\end{itemize}
\subsection{Fragment classes}
The distribution of fragment sizes was modeled as a mixture of three classes (open chromatin, mono-nucleosomes, di-nucleosomes), each following a Gaussian distribution. Each class fragment length distribution was modeled using :
\begin{equation}
\begin{aligned}
f(x) & = a \times \exp \left( \frac{-(x-m)^{2}}{2 s^{2}} \right)
\end{aligned}
\label{atac_seq_fragment_length_class}
\end{equation}
where $x$ is the fragment length, $m$ the mean fragment length for this class, $s$ the fragment length standard deviation and $a$ an amplitude factor.
The mean parameters were initialized to 50, 200 and 300bp, the standard deviation parameters to 10, 10 and 30bp and the amplitude factors to 1. The parameters were fitted to the data using the nls() function in R with the Gauss-Newton algorithm.
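The fit itself was done in R with nls(); for illustration only, an equivalent sketch in Python using scipy's curve\_fit (a swapped-in alternative, not the actual procedure; the fragment length histogram loading is not shown) could look like :
\begin{verbatim}
import numpy as np
from scipy.optimize import curve_fit

def mixture(x, a1, m1, s1, a2, m2, s2, a3, m3, s3):
    """Sum of three Gaussian-shaped components : open chromatin,
    mono-nucleosomes and di-nucleosomes."""
    return (a1 * np.exp(-(x - m1) ** 2 / (2 * s1 ** 2)) +
            a2 * np.exp(-(x - m2) ** 2 / (2 * s2 ** 2)) +
            a3 * np.exp(-(x - m3) ** 2 / (2 * s3 ** 2)))

# fragment_sizes, counts : observed fragment length histogram (loading not shown)
# starting values as in the text : means 50, 200, 300 bp ; sd 10, 10, 30 bp ;
# amplitudes 1
# popt, _ = curve_fit(mixture, fragment_sizes, counts,
#                     p0=[1, 50, 10, 1, 200, 10, 1, 300, 30])
\end{verbatim}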
The reads were extracted from the bam file using "python3.6 split\_by\_size.py -i <read\_bam> -o <out\_bam> --length <from>-<to>" where <read\_bam> is the bam file containing the reads, <out\_bam> the output bam file and split\_by\_size.py is a program developed for this project. For the open chromatin fragments, <from> and <to> were set to 30 and 84. For the mono-nucleosome fragments, <from> and <to> were set to 133 and 266. Finally, for the di-nucleosome fragments, <from> and <to> were set to 341 and 500.
In order to create a nucleosome fragment set, the di-nucleosome fragments were cut in two at their center positions using "python3.6 split\_in\_two.py -i <dinucl\_bam> -o <split\_bam>" where <dinucl\_bam> is the bam file containing the di-nucleosome fragments and <split\_bam> is the output bam file containing the reads corresponding to the fragments cut in two. split\_in\_two.py is an in-house python program developed for this project. This file was then sorted using "samtools sort <split\_bam>" from the samtools suite \citep{li_sequence_2009}. The sorted file was then merged with the mono-nucleosome fragments using "samtools merge <nucl\_bam> <mononucl\_bam> <split\_bam>" where <nucl\_bam> is the output bam file, <mononucl\_bam> is the bam file containing the mono-nucleosome fragments and <split\_bam> is the sorted bam file containing the di-nucleosome fragments split in two.
Finally, the open chromatin fragment and nucleosome fragment bam files were indexed using "samtools index <bam\_file>" where <bam\_file> is the bam file to index.
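As an illustration of the size-based split, a minimal Python sketch relying on pysam is given below. It is not the actual split\_by\_size.py implementation, and the input/output file names as well as the pair-filtering criterion are assumptions:
\begin{verbatim}
# Keep the read pairs whose fragment (template) length falls in a given range.
import pysam

def split_by_size(in_bam, out_bam, min_len, max_len):
    with pysam.AlignmentFile(in_bam, "rb") as bam_in, \
         pysam.AlignmentFile(out_bam, "wb", template=bam_in) as bam_out:
        for read in bam_in:
            tlen = abs(read.template_length)      # fragment length from the TLEN field
            if read.is_proper_pair and min_len <= tlen <= max_len:
                bam_out.write(read)

# hypothetical file names; ranges as given above
split_by_size("atac.bam", "open.bam", 30, 84)
split_by_size("atac.bam", "mono.bam", 133, 266)
split_by_size("atac.bam", "di.bam", 341, 500)
\end{verbatim}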
\subsection{Simulated sequences}
2'000 synthetic DNA sequences of 100bp were generated. Two equiprobable classes were created and each sequence was assigned to one of them. Each class was defined by an 8bp sequence motif (Figure \ref{suppl_atac_seq_emseq_best_motifs}). Each sequence had exactly one motif occurrence, placed anywhere in the sequence (with uniform probability), on either strand (equiprobable). The motif sequence was sampled from the corresponding class model. Finally, the bases outside the motif were sampled from a mono-nucleotide background model with a probability of 0.25 for each base.
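The sketch below illustrates this generation procedure in Python. The two class models are shown as uniform placeholder matrices and are not the actual motifs of Figure \ref{suppl_atac_seq_emseq_best_motifs}:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(42)
BASES = np.array(list("ACGT"))
COMP = str.maketrans("ACGT", "TGCA")

def sample_motif(lpm):
    # sample an 8bp motif instance from an 8 x 4 probability matrix
    return "".join(rng.choice(BASES, p=col) for col in lpm)

def make_sequence(lpm, length=100):
    motif = sample_motif(lpm)
    if rng.random() < 0.5:                              # equiprobable strand
        motif = motif.translate(COMP)[::-1]
    start = rng.integers(0, length - len(motif) + 1)    # uniform position
    seq = list(rng.choice(BASES, size=length))          # 0.25 per base outside the motif
    seq[start:start + len(motif)] = list(motif)
    return "".join(seq)

# placeholder class models; the actual class motifs would go here
lpm_a = np.full((8, 4), 0.25)
lpm_b = np.full((8, 4), 0.25)
sequences = [make_sequence(lpm_a if rng.random() < 0.5 else lpm_b)
             for _ in range(2000)]
\end{verbatim}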
\subsection{Binding site prediction}
\label{atac_seq_method_pwmscan}
The hg19 genome was scanned on both strands using PWMScan \citep{ambrosini_pwmscan:_2018} from its web interface (\url{https://ccg.epfl.ch/pwmtools/pwmscan.php}) to predict the binding sites of CTCF, EBF1, myc and SP1, using PWMs from the JASPAR 2018 collection \citep{khan_jaspar_2018} (MA0139.1 CTCF, MA0154.3 EBF1, MA0147.3 MYC and MA0079.3 SP1, respectively). The reference positions were set to 10, 7, 6 and 6 respectively, and the threshold was set to a p-value of $10^{-6}$ for all TFs except SP1, for which a p-value of $10^{-7}$ was used. The non-overlapping matches option was enabled.
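Conceptually, such a scan scores every window of the genome with a log-odds PWM on both strands and reports the matches above a score cut-off, as in the Python sketch below. This is only an illustration, not the PWMScan implementation, and the score cut-off corresponding to a given p-value is assumed to be pre-computed:
\begin{verbatim}
import numpy as np

BASE_IDX = {"A": 0, "C": 1, "G": 2, "T": 3}
COMP = str.maketrans("ACGT", "TGCA")

def scan(sequence, pwm, cutoff):
    """Yield (start, strand, score) for windows scoring >= cutoff.
    pwm: L x 4 log-odds matrix (A, C, G, T columns); sequence: upper-case DNA."""
    L = pwm.shape[0]
    for strand, seq in (("+", sequence), ("-", sequence.translate(COMP)[::-1])):
        for i in range(len(seq) - L + 1):
            window = seq[i:i + L]
            if any(b not in BASE_IDX for b in window):    # skip windows with N's
                continue
            score = sum(pwm[j, BASE_IDX[b]] for j, b in enumerate(window))
            if score >= cutoff:
                # report the start on the forward strand coordinate system
                start = i if strand == "+" else len(sequence) - i - L
                yield start, strand, score
\end{verbatim}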
\subsection{Realignment using JASPAR motifs}
% peaks and repeat masking
The peaks were filtered for repeated elements. The peaks that overlapped a repeated element over at least 30\% of their length were filtered out using "bedtools subtract -A -f 0.3 -a <peaks\_bed> -b <repeats\_bed>" from the bedtools suite \citep{quinlan_bedtools:_2010}, where <peaks\_bed> and <repeats\_bed> are the peak and repeat files in bed format.
% matrix creation
70'642 DNA sequences of 1'001bp, corresponding to the central position of each peak +/-500bp, were extracted and used to create a sequence matrix using "SequenceMatrixCreator --bed <peak\_rmsk\_bed> --fasta <hg19\_fasta> --from -500 --to 500" where <peak\_rmsk\_bed> is the bed file containing the repeat-filtered peaks and <hg19\_fasta> is a fasta file containing the chromosome 1, 2, ..., 22, X and Y sequences of the hg19 genome assembly.
The corresponding open chromatin and nucleosome sequencing read density matrices were created using "CorrelationMatrixCreator --bed <peak\_rmsk\_bed> --bam <open\_bam> --bai <open\_bai> --from -500 --to 500 --binSize 1 --method read\_atac" and "CorrelationMatrixCreator --bed <peak\_rmsk\_bed> --bam <nucl\_bam> --bai <nucl\_bai> --from -500 --to 500 --binSize 1 --method fragment\_center" respectively. <open\_bam> and <nucl\_bam> are the bam files containing the open chromatin and nucleosome reads (derived, as described above, from the ATAC-seq data provided by 10xGenomics) and <open\_bai> and <nucl\_bai> the corresponding bam index files. The bin size of 1bp ensured that the read density matrix regions exactly corresponded to the sequence matrix regions.
% PWMs
A total of 23 binding models were downloaded from the JASPAR motif clustering \citep{castro-mondragon_rsat_2017}. Briefly, the motif clustering is organized as a forest of trees (each tree is a cluster in which the leaves are individual TF binding models). Binding models for internal nodes are also available; they represent a consensus over multiple individual TF binding models. In order to i) obtain models representing the binding specificity of the TFs of interest and ii) widen the analysis to other TFs with a sufficiently similar specificity, I manually selected, in the different motif trees, the binding models fulfilling these requirements.
The downloaded models were:
\begin{table}[H]
\begin{center}
\begin{tabular}{ |l|l|p{90mm}|l| }
\hline
\multicolumn{4}{|c|}{Binding models downloaded} \\
\hline
Cluster ID & Node ID & TFs covered & Name \\
\hline
1 & 74 & ARID3b, LHX3 & LHX3 \\
2 & 12 & ESRRG, NR4A1, ESRRB, NR2F2 & NR4A1 \\
3 & 23 & FOSL1::JUNB, FOSL1::JUN, FOS::JUND, \newline FOSL2::JUN, FOS::JUNB, JDP2, NFE2, FOSL1, FOS, JUND, FOSL2, JUNB, JUN::JUNB, FOSL1::JUND, FOS::JUN, FOSL2::JUND, FOSB::JUNB, FOSL2::JUNB, BATF::JUN, JUN & AP1 \\
3 & 24 & NFE2L2, BACH1::MAFK, MAF::NFE2, BACH2 & NFE2 \\
4 & 22 & max::myc, MXI1, myc, mycn & myc \\
4 & 30 & ARNT, AHR::ARNT & AHR \\
4 & 31 & HIF1A, HES5, HES7 & HIF1A \\
5 & 20 & CEBPA, CEBPG, CEBPD, CEBPB, CEBPE & CEBP \\
7 & 13 & SPIC, SPI1 & PU.1 \\
7 & 17 & ELF5, ELF3, EHF, ELF1, ELF4 & ELF \\
19 & 2 & NFAT5, NFATC1, NFATC3 & NFAT \\
20 & 4 & MEF2C, MEF2B, MEF2A, MEF2D & MEF2 \\
21 & 5 & GATA3, GATA5, GATA4, GATA6, GATA1, GATA2 & GATA \\
28 & 13 & EGR2, EGR4, EGR1, EGR3 & EGR \\
28 & 14 & KLF4, KLF1, KLF9 & KLF \\
31 & 4 & IRF7, IRF9, IRF4, IRF8, IRF5 & IRF4 \\
31 & 5 & STAT1::STAT2, IRF2 & IRF2 \\
32 & STAT6 & STAT6 & STAT6 \\
33 & 1 & SOX3, SOX6 & SOX \\
38 & 3 & RUNX1, RUNX2, RUNX3 & RUNX \\
39 & 1 & E2F3, E2F2 & E2F \\
48 & CTCF & CTCF & CTCF \\
66 & 1 & FOXH1 & FOXH1 \\
\hline
\end{tabular}
\captionof{table} { \textbf{TF binding models} from the JASPAR matrix clustering. Each model can be retrieved within the JASPAR matrix clustering (\url{http://jaspar2018.genereg.net/matrix-clusters/vertebrates/?detail=true}) using the cluster and node IDs. "TFs covered" refers to all TFs whose models are children of the given node. "Name" refers to the label used for this model in the text and figures.}
\label{atac_seq_motif_table}
\end{center}
\end{table}
All binding models were downloaded as LFMs in JASPAR format and then converted into LPMs.
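The conversion simply turns each column of counts into probabilities, as sketched below in Python (the use of a pseudocount is an assumption made for this example):
\begin{verbatim}
import numpy as np

def lfm_to_lpm(lfm, pseudocount=1.0):
    """Convert a 4 x L count matrix (A, C, G, T rows, JASPAR order) into a
    4 x L probability matrix by normalizing each column."""
    counts = np.asarray(lfm, dtype=float) + pseudocount
    return counts / counts.sum(axis=0, keepdims=True)
\end{verbatim}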
% partitioning
The DNA sequence matrix was partitioned into 23 classes using EMSequence, with the class models initialized with the 23 JASPAR LPMs, using "EMSequence --seq <matrix\_seq> --class 23 --motifs <LPM1,LPM2,...,LPM23> --shift 971 --flip --iter 1 --seed <seed> --out <file\_out>" where <matrix\_seq> is a text file containing the sequence matrix created by SequenceMatrixCreator, <LPM1,LPM2,...,LPM23> is a comma-separated list of the 23 files containing the JASPAR LPMs, <seed> a randomly generated seed and <file\_out> the output file containing the probability matrix.
This resulted in 23 sequence models of 31bp, from which the corresponding read models (open chromatin and nucleosome) were computed using "ProbToModel --read <reads\_mat> --prob <prob\_mat>" where <reads\_mat> is the file containing the read density matrix created using CorrelationMatrixCreator and <prob\_mat> the file containing the probability matrix returned by EMSequence. These models were then extended by 1000bp, leading to 1031bp models that are displayed in Figures \ref{suppl_atac_seq_23class} and \ref{atac_seq_23class}.
The sequence models were extended using "SequenceModelExtender --bed <peak\_rmsk\_bed> --fasta <hg19\_fasta> --prob <prob\_mat> --from -500 --to 500 --ext 1000" where the values of the --bed, --fasta, --from and --to options were those used for the creation of the sequence matrix with SequenceMatrixCreator and <prob\_mat> is the probability matrix file returned by EMSequence.
The read models were extended using "ReadModelExtender --bed <peak\_rmsk\_bed> --bam <open\_bam> --bai <open\_bai> --prob <prob\_mat> --from -500 --to 500 --ext 1000 --binSize 1 --method <m>" where the values of the --bed, --bam, --bai, --from, --to, --binSize and --method options were those used for the creation of the read density matrices (open chromatin and nucleosome) with CorrelationMatrixCreator and <prob\_mat> is the probability matrix file returned by EMSequence.
\subsection{Per TF sub-classes}
For each of the 23 TF classes, a sequence matrix, an open chromatin read density matrix and a nucleosome read density matrix were created.
The sequence matrix was created using "ClassSequenceDataCreator --bed <peak\_rmsk\_bed> --fasta <hg19\_fasta> --prob <prob\_mat> --from -500 --to 500 --k <X> --out <matrix\_out>" where the values of the --bed, --fasta, --from and --to options were those used for the creation of the sequence matrix with SequenceMatrixCreator, <prob\_mat> is the probability matrix file returned by EMSequence, <X> the number of the class to extract and <matrix\_out> the output file.
The read density matrices were created using "ClassReadDataCreator --bed <peak\_rmsk\_bed> --bam <open\_bam> --bai <open\_bai> --prob <prob\_mat> --from -500 --to 500 --binSize 1 --method <m>" where the values of the --bed, --bam, --bai, --from, --to, --binSize and --method options were those used for the creation of the read density matrices (open chromatin and nucleosome) with CorrelationMatrixCreator and <prob\_mat> is the probability matrix file returned by EMSequence.
The rows with no signal (0) in the open chromatin read density matrix were listed using "WhichNullRows --mat <reads\_mat>" where <reads\_mat> is the matrix created using ClassReadDataCreator.
The partitioning was performed on the open chromatin signal with the ChIPPartitioning algorithm, using "EMRead --read <reads\_mat> --iter 20 --class <k> --shift 1 --filter <filter\_txt> --seed <seed> --out <prob\_mat>" where <reads\_mat> is the file containing the open chromatin read density matrix created using ClassReadDataCreator, <k> the number of classes (from 1 to 10), <filter\_txt> the file created by WhichNullRows, <seed> a randomly generated seed and <prob\_mat> the output file containing the probability matrix. No shifting nor flipping was allowed because the TF motif instances had already been aligned during the previous step. The best partition was selected by manual curation.
