\cleardoublepage
\chapter{Discussion}
\label{discussion}
\markboth{Discussion}{Discussion}
\addcontentsline{toc}{chapter}{Discussion}
% This contribution of this work to genomic bioinformatics is dual. It has a resource development component and a research component, which importances can only be appreciated properly by considering them together.
% ressource part
% The resource aspect of this work has been presented in Chapter \ref{lab_resources}. It concerned the maintenance of the MGA and EPD databases. Even though this is not purely research work, it is a necessary support to active research that allows many researchers in different fields - from wet lab to purely computational laboratories - to ask and, importantly, answer their questions.
%EPD
% Currently, the EPD database contains the most precise genome annotation of TSSs which is a crucial information in many fields of life sciences. From the creation of efficient expression vectors in bio-engineering, to the reconstruction of gene interaction networks in computational biology.
% MGA
% The benefice of the MGA database is less direct but not less important. The wealth of publicly available sequencing data is a treasure. As for each treasure, something is sitting atop, like a terrifying dragon or a tremendous overhead effort. Re-utilization of published data is an astonishingly difficult task that requires to search ill-annotated databases, to download the selected entries and map the data, hoping that the quality is acceptable. The sole task of mapping the data requires to have the proper software and genome locally stored. The MGA, as a majestic a well armored knight, allows to get rid of the overhead and to access data in utilizable format. Not only a vast amount of data from landmark studies is available on a central platform but they are efficiently searchable because of high quality, standardized and hand curated annotations. Such initiatives contribute to allow big-data like types of projects by making the burden of re-utilization acceptable.
% research part
% The research part of my thesis was presented in chapters \ref{encode_peaks}, \ref{smile_seq}, and \ref{atac_seq}. It covered several topics related to the characterization of TF binding sites.
% chromatin structure
% First, I explored the chromatin environment of TFs which binding has been assessed in GM17878 cells by the ENCODE Consortium. Using a specialized partitioning method, it was possible to detect well organized nucleosome arrays in the vicinity of all TFs. However, only CTCF has the ability to act as a barrier against which the nucleosome arrays are organized. Other TFs did not show this property, suggesting that other chromatin remodeling mechanisms are at play. Also, as expected, all TFs showed binding to a NDR with the exception of EBF1 which seemed to be binding on nucleosome arrays.
% EBF1
% Further analyses strongly supported that EBF1 binding sites tend to be located at the edges of nucleosomes. Nonetheless, these results do not allow to state whether EBF1 binds nucleosome edges or whether the consequence of its binding is a remodeling leading to a situation in which EFB1 directly flanks a nucleosome. In line with other reports suggesting a pioneer activity for EBF1, I propose a model in which EBF1 engages a nucleosome and promotes its displacement such that EBF1 is located at its entry. Because EBF1 binding motif shows properties of nucleosome positioning sequences, EBF1 could be involved in the stabilization of the nucleosome at this position. Like a stone under a wheel impedes its movement. Some parts of this model, like EBF1 ability to engage nucleosome arrays, could be tested in vitro.
% interactions
% The analysis of ENCODE data also allowed to predict 35 interactions involving CTCF and junD. Moreover, this method allowed to segregate between four functionally different types of interactions. Out of the 35 interactions, 5 were new and could be tested in vitro as they are predicted to involve a direct physical interaction between the partners.
% SMiLE-seq
% The characterization of TFs was also tackled in terms of their binding specificity. I participated in the SMiLE-seq project of the Deplancke laboratory of EPFL. The aim was to build a microfluidic based technology that would allow high throughput in vitro measurements of TF sequence specificity. I was in charge of modeling each TF specificity and to assess the suitability of SMiLE-seq for this problem. Interestingly, the SMiLE-seq technology turned out to be a really competitive method for this problem, paving the path to bigger scale studies. For instance, the specificity of zinc finger TFs and TF-dimers remain a largely unsolved problem.
% This work also tackled computational challenges linked with partitioning and classification problems in chapters \ref{spark}, \ref{pwmscan} and \ref{atac_seq}.
% SPar-K
% The partitioning of genomic regions based on their sequencing profiles is not a trivial task. Among all the algorithms that have been developed to solve this problem, ChIPPartitioning - that has been developed by our laboratory - was the most efficient in term of partitioning accuracy. However, because it involves heavy probability related computations and because of implementation issues, it turned out to be really slow. To remedy this, I developed SPar-K which is a modified version of K-means. SPar-K achieved competitive partitioning accuracy while being clearly faster.
% PWMScan
% I also contributed to the development of PWMScan that is a software that predicts TF binding sites along a genome based on a binding specificity model. Currently, this software is the fastest existing for this type of problem but is also as accurate as all other existing competitor programs. PWMScan introduced the usage of read mapper to solve this problem as an alternative to regular genome scanners (that it can also use).
% ATAC-seq
% Finally, I modified ChIPPartitioning to create a \textit{de novo} motif discovery algorithm that can partition genomic regions based on their DNA sequence. I proposed to use this new algorithm together with ChIPPartioning to study the chromatin structure at TF binding sites. The results suggested that it was possible to study the heterogeneity of TF binding sites by using a 2 step procedure that i) aligns the regions on a given TF motif occurrences and ii) partitions the data to retrieve different groups of regions based on their chromatin accessibility patterns. In their current state, these results are preliminary. Several adjustments need to be done. For instance, the regions of interest were defined using peak calling. Stricto sensu there is nothing wrong with this. However, the full potential of DGF is not exploited. Peaks represent regions of high signal whereas footprints - which most likely are inside peaks - are the precise regions of interest. Defining the regions of interest as the footprint centers instead of peak center is likely to ease the problem of finding different classes of footprints. Furthermore, I proposed to use this framework to draw a catalog of possible chromatin/motif organizations from the pooled data that could be used to annotate each single cell in order to create cell molecular states that could be later used to find cell populations. Alternatively, the individual cells can be replaced by experiments/patients and the same strategy could be applied to discover groups of experiments or patients.
% conclusions
% In conclusions, in this work, I tackled several different aspects of the bioinformatics research related to TF and chromatin biology. The results are in line with the current state of the knowledge in the field. In most cases they confirmed previous results and in some other, they complemented them. I also developed or participated to the development of softwares, resources and technologies that are already valuable assets to the research community, such as EPD or PWMScan or that have to potential to be so, such as SMiLE-seq or SPar-K.
In this chapter, I return to what I consider to be the major aspects of the previous chapters, discuss them in a broader scope, and present some related perspectives.
\section*{About the chromatin organization}
% nucleosome arrays
The systematic study of nucleosome organization around TF binding sites revealed that all TFs showed a strong array on at least one of their flanks. A possible explanation for this asymmetry is that the upstream and downstream arrays are organized with respect to different anchors. The methodology I used to display them could only phase the arrays on one side, at the price of unphasing, to different extents, the array on the other side, rendering it hardly visible in the aggregations. Thus, it is reasonable to propose, as a general rule, that nucleosomes are organized into regular arrays on both sides of all TF binding sites.
The case of CTCF arrays, literally a textbook case when it comes to nucleosome arrays, was investigated further, and the ISWI enzyme SNF2H has been shown to be necessary to maintain this typical nucleosome organization \citep{wiechens_chromatin_2016}. The same study also showed that SNF2H and SNF2L are necessary to maintain the nucleosome organization at RUNX5 and JUN binding sites.
This work shows that all TF binding sites are flanked by nucleosome arrays. The involvement of chromatin remodelers is thus likely to be general as well. Understanding how these nucleosome arrays are maintained and how exactly they are delimited is crucial. The extent to which the chromatin is open defines the boundaries of each regulatory element, and thus which TF binding sites are accessible and which remain in the closed chromatin on the flanks. It is thus unsurprising that the dysfunction or deregulation of chromatin remodelers has been linked with cancer \citep{wilson_swisnf_2011,langst_chromatin_2015}. Consequently, delineating how TFs and chromatin remodeling complexes influence each other to regulate gene expression is an important next step.
\section*{About pioneer factors}
% EBF1
This work also provided supporting evidence for the pioneer role of EBF1. The results suggest that EBF1 binding sites tend to be located at the edges of rotationally positioned nucleosomes.
To my knowledge, no direct evidence of the ability of EBF1 to engage closed chromatin has been reported so far. The pioneer function of EBF1 was rather inferred from its ability to drive cellular differentiation \citep{hagman_early_2005} and to trigger chromatin remodeling \citep{maier_early_2004,boller_pioneering_2016}. This work suggests that EBF1 also exhibits the major characteristic of pioneer TFs, that is, the ability to engage DNA in inaccessible chromatin \citep{iwafuchi-doi_pioneer_2014}. Based on this result, an \textit{in vitro} assessment of this property, as performed in \cite{soufi_pioneer_2015} for Oct4, Sox2, Klf4 and c-Myc, could help to further strengthen this observation.
Finally, having found a TF with a pioneer behavior in the ENCODE data raises a question: why could no more than one pioneer TF be found? In my opinion, the answer has two parts. First, this dataset only contained a few dozen TFs and likely included no pioneer TF other than EBF1. Second, upon binding, pioneer TFs are known to trigger chromatin opening. Thus, observing the chromatin organization at pioneer TF binding sites will likely capture a steady state: a TF bound in an open chromatin region. As a consequence, capturing TFs bound in closed chromatin seems difficult, and identifying pioneer TFs in this way unlikely. In light of this assumption, the EBF1 results are puzzling and call for a further delineation of the events triggered upon EBF1 binding.
\section*{About assaying TF specificity}
% EMSequence
Throughout this work, I also proposed several algorithms and showed their usefulness. One of them is EMSequence, a \textit{de novo} motif discovery algorithm. I proposed to use this new algorithm together with ChIPPartitioning to study the chromatin structure at TF binding sites. However, other applications could be explored. For instance, in chapter \ref{smile_seq}, SMiLE-seq has been demonstrated to be an effective method to assay TF specificity. Even though not discussed in this work, it also proved useful to assay AP1 dimer specificity \citep{isakova_smile-seq_2017}. Interestingly, the question of TF dimer specificity remains largely unsolved. Here, an adequate use of computational methods can alleviate the experimental effort necessary to produce the relevant data. Assaying the specificity of a pair of TFs implies running at least 3 independent assays: each TF alone to assay the homodimers, and both TFs together to assay the heterodimer. Because EMSequence is a partitioning algorithm, it should be able to discover several motifs at the same time. Thus, it should be possible to assay a pair of dimerizing TFs at once and to retrieve each homodimer motif (if any) and the heterodimer motif from a single experiment, using EMSequence. However, one should keep in mind that, in competition, certain dimers may be favored over others, depending on the affinity of each TF for its possible partners. It may therefore happen that one dimer never forms in competition, which would preclude the discovery of its binding specificity.
\section*{About the treatment of scATAC-seq data}
% ATAC-seq
Nowadays, technologies devised for single-cell measurements have become commonly used. scRNA-seq data remain the most frequently used. Dedicated computational methods isolate sub-populations of cells by clustering the gene expression matrix \citep{fan_characterizing_2016, kiselev_sc3:_2017}, by reconstructing gene regulatory networks \citep{aibar_scenic:_2017}, or by identifying cellular states based on the motif content of accessible regions \citep{gonzalez-blas_cistopic:_2019}.
scATAC-seq data are enjoying an ever-growing popularity. Yet, the treatment of these data currently remains quite limited. It is, for instance, not unusual to create a matrix, as for scRNA-seq, containing the number of reads mapped at a given location in a cell, and to subsequently use this matrix for downstream analyses such as the detection of cell populations. However, I think that scRNA-seq and scATAC-seq data are different by nature and thus cannot be treated in the same way.
In chapter \ref{atac_seq}, I presented a computational method to unravel footprints from ATAC-seq data. One can imagine using the framework described in that chapter to draw a catalog of chromatin structures from the pool of single-cell data and to use it to annotate each cell. More precisely, this could be done by going back to each peak in each cell and assigning a qualitative label corresponding to the chromatin model that best matches (is the most similar to) this region in this cell. Ultimately, this would lead to the creation of a matrix (cells $\times$ regions) that could be used to run clustering methods. How the similarity should be computed, and whether each cell will have a high enough coverage for similarity computations to be meaningful, remain open questions. Alternatively, one can replace single cells by different bulk experiments. In this case, the clustering would not isolate cell sub-populations but experiments (individuals, culture conditions, etc.) that are similar to each other.
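As an illustration only, the annotation step described above can be written compactly. The notation is introduced here for the sake of the example and is not taken from the previous chapters: $x_{c,r}$ denotes the signal profile of region $r$ in cell $c$, $m_1, \dots, m_K$ the $K$ chromatin models of the catalog, and $s$ a similarity function that remains to be chosen. The label assigned to region $r$ in cell $c$ would then be
\[
  A_{c,r} = \arg\max_{1 \leq k \leq K} \; s\left(x_{c,r}, m_k\right),
\]
and the labels $A_{c,r}$, for $C$ cells and $R$ regions, would form the $C \times R$ matrix given to the clustering method. Since these labels are qualitative, the clustering would presumably have to rely on a dissimilarity suited to categorical data, for instance the fraction of regions at which two cells carry different labels.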
