ch_discussion.tex
File Metadata

Created
Thu, May 9, 07:47

\cleardoublepage
\chapter{Discussion}
\label{discussion}
\markboth{Discussion}{Discussion}
\addcontentsline{toc}{chapter}{Discussion}
The contribution of this work to genomic bioinformatics is twofold. It has a resource development component and a research component, whose importance can only be properly appreciated by considering them together.
% resource part
The resource aspect of this work has been presented in Chapter~\ref{lab_resources}. It concerned the maintenance of the MGA and EPD databases. Even though this is not purely research work, it is a necessary support to active research that allows many researchers in different fields, from wet labs to purely computational laboratories, to ask and, importantly, answer their questions.
%EPD
Currently, the EPD database contains the most precise genome annotation of TSSs, which is crucial information in many fields of the life sciences, from the creation of efficient expression vectors in bioengineering to the reconstruction of gene interaction networks in computational biology.
% MGA
The benefit of the MGA database is less direct but no less important. The wealth of publicly available sequencing data is a treasure. As with every treasure, something sits atop it, be it a terrifying dragon or a tremendous overhead effort. Reusing published data is an astonishingly difficult task that requires searching ill-annotated databases, downloading the selected entries and mapping the data, all while hoping that the quality is acceptable. The mapping step alone requires having the proper software and genome stored locally. The MGA, like a majestic and well-armored knight, makes it possible to get rid of this overhead and to access data in a usable format. Not only is a vast amount of data from landmark studies available on a central platform, but these data are also efficiently searchable thanks to high-quality, standardized and hand-curated annotations. Such initiatives help make big-data-style projects possible by bringing the burden of data reuse down to an acceptable level.
% research part
The research part of my thesis was presented in Chapters~\ref{encode_peaks}, \ref{smile_seq} and \ref{atac_seq}. It covered several topics related to the characterization of TF binding sites.
% chromatin structure
First, I explored the chromatin environment of TFs whose binding has been assessed in GM12878 cells by the ENCODE Consortium. Using a specialized partitioning method, it was possible to detect well-organized nucleosome arrays in the vicinity of all TFs. However, only CTCF showed the ability to act as a barrier against which the nucleosome arrays are organized. Other TFs did not show this property, suggesting that other chromatin remodeling mechanisms are at play. Also, as expected, all TFs bound within an NDR, with the exception of EBF1, which seemed to bind within nucleosome arrays.
% EBF1
Further analyses strongly supported the notion that EBF1 binding sites tend to be located at the edges of nucleosomes. Nonetheless, these results do not make it possible to determine whether EBF1 binds nucleosome edges or whether its binding triggers a remodeling that leaves EBF1 directly flanking a nucleosome. In line with other reports suggesting a pioneer activity for EBF1, I propose a model in which EBF1 engages a nucleosome and promotes its displacement such that EBF1 ends up located at its entry. Because the EBF1 binding motif shows properties of nucleosome positioning sequences, EBF1 could be involved in stabilizing the nucleosome at this position, much like a stone under a wheel impedes its movement. Some parts of this model, such as the ability of EBF1 to engage nucleosome arrays, could be tested in vitro.
% interactions
The analysis of ENCODE data also made it possible to predict 35 interactions involving CTCF and junD. Moreover, this method made it possible to distinguish four functionally different types of interactions. Out of the 35 interactions, 5 were new and could be tested in vitro, as they are predicted to involve a direct physical interaction between the partners.
% SMiLE-seq
The characterization of TFs was also tackled in terms of their binding specificity. I participated in the SMiLE-seq project of the Deplancke laboratory at EPFL. The aim was to build a microfluidics-based technology that would allow high-throughput in vitro measurements of TF sequence specificity. I was in charge of modeling the specificity of each TF and of assessing the suitability of SMiLE-seq for this problem. Interestingly, the SMiLE-seq technology turned out to be a highly competitive method for this problem, paving the way for larger-scale studies. For instance, the specificity of zinc finger TFs and TF dimers remains a largely unsolved problem.
This work also tackled computational challenges linked with partitioning and classification problems in Chapters~\ref{spark}, \ref{pwmscan} and \ref{atac_seq}.
% SPar-K
The partitioning of genomic regions based on their sequencing profiles is not a trivial task. Among all the algorithms that have been developed to solve this problem, ChIPPartitioning, which was developed by our laboratory, was the most efficient in terms of partitioning accuracy. However, because it involves heavy probability-related computations and because of implementation issues, it turned out to be really slow. To remedy this, I developed SPar-K, which is a modified version of K-means. SPar-K achieved competitive partitioning accuracy while being clearly faster.
% PWMScan
I also contributed to the development of PWMScan, a program that predicts TF binding sites along a genome based on a binding specificity model. Currently, it is the fastest available software for this type of problem while being as accurate as all competing programs. PWMScan introduced the use of a read mapper to solve this problem, as an alternative to regular genome scanners (which it can also use).
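As a reminder of what such a binding specificity model computes, a common formulation scores a sequence window $s = s_1 \dots s_L$ with a log-odds position weight matrix. This generic formulation is given here as an illustrative assumption and is not necessarily the exact scoring scheme implemented in PWMScan:
\[
S(s) = \sum_{i=1}^{L} w_{i,s_i}, \qquad w_{i,b} = \log \frac{p_{i,b}}{q_b},
\]
where $p_{i,b}$ is the probability of base $b$ at position $i$ of the motif and $q_b$ is the background probability of base $b$. A window is reported as a predicted binding site when $S(s) \geq t$ for a user-chosen threshold $t$. The symbols $w$, $p$, $q$ and $t$ are notation introduced here for illustration only.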
% ATAC-seq
Finally, I modified ChIPPartitioning to create a de novo motif discovery algorithm that can partition genomic regions based on their DNA sequence. I proposed to use this new algorithm together with ChIPPartitioning to study the chromatin structure at TF binding sites. The results suggested that the heterogeneity of TF binding sites can be studied with a two-step procedure that i) aligns the regions on a given TF motif and ii) partitions the data to retrieve different groups of regions based on their chromatin accessibility patterns. In their current state, these results are preliminary and several adjustments are needed. For instance, the regions of interest were defined using peak calling. Stricto sensu, there is nothing wrong with this. However, the full potential of DGF is not exploited. Peaks represent regions of high signal, whereas footprints, which most likely lie inside peaks, are the precise regions of interest. Defining the regions of interest as the footprint centers instead of the peak centers is likely to ease the problem of finding different classes of footprints. Furthermore, I proposed to use this framework to draw up a catalog of possible chromatin/motif organizations from the pooled data. This catalog could be used to annotate each single cell in order to define molecular states of cells, which could later be used to find cell populations. Alternatively, the individual cells could be replaced by experiments or patients, and the same strategy applied to discover groups of experiments or patients.
% conclusions
In conclusion, in this work I tackled several different aspects of bioinformatics research related to TF and chromatin biology. The results are in line with the current state of knowledge in the field. In most cases they confirmed previous results and, in some others, they complemented them. I also developed, or participated in the development of, software, resources and technologies that are already valuable assets to the research community, such as EPD or PWMScan, or that have the potential to become so, such as SMiLE-seq or SPar-K.
