diff --git a/images/ch_encode_peaks/CTCF_ndr_length_rad211.png b/images/ch_encode_peaks/CTCF_ndr_length_rad211.png new file mode 100644 index 0000000..11d275d Binary files /dev/null and b/images/ch_encode_peaks/CTCF_ndr_length_rad211.png differ diff --git a/images/ch_encode_peaks/CTCF_ndr_length_rad212.png b/images/ch_encode_peaks/CTCF_ndr_length_rad212.png new file mode 100644 index 0000000..02cba0f Binary files /dev/null and b/images/ch_encode_peaks/CTCF_ndr_length_rad212.png differ diff --git a/images/ch_encode_peaks/ctcf_ndr.png b/images/ch_encode_peaks/ctcf_ndr.png new file mode 100644 index 0000000..fe781f9 Binary files /dev/null and b/images/ch_encode_peaks/ctcf_ndr.png differ diff --git a/images/ch_encode_peaks/wgEncodeAwgTfbsSydhGm12878Brca1a300IggmusUniPk_MNase_GM12878_allpeaks_EM_4class_15shift_flip.png b/images/ch_encode_peaks/wgEncodeAwgTfbsSydhGm12878Brca1a300IggmusUniPk_MNase_GM12878_allpeaks_EM_4class_15shift_flip.png new file mode 100644 index 0000000..d9af6b8 Binary files /dev/null and b/images/ch_encode_peaks/wgEncodeAwgTfbsSydhGm12878Brca1a300IggmusUniPk_MNase_GM12878_allpeaks_EM_4class_15shift_flip.png differ diff --git a/main/ch_atac-seq.aux b/main/ch_atac-seq.aux index 8b51c44..7f05e4d 100644 --- a/main/ch_atac-seq.aux +++ b/main/ch_atac-seq.aux @@ -1,145 +1,145 @@ \relax \providecommand\hyper@newdestlabel[2]{} \citation{vierstra_genomic_2016} \citation{neph_expansive_2012} \citation{adey_rapid_2010,buenrostro_transposition_2013} \citation{barski_high-resolution_2007} \citation{vierstra_genomic_2016} \citation{vierstra_genomic_2016} \citation{adey_rapid_2010,buenrostro_transposition_2013} \citation{adey_rapid_2010} \citation{adey_rapid_2010} -\@writefile{toc}{\contentsline {chapter}{\numberline {4}Chromatin accessibility of monocytes}{41}{chapter.4}} +\@writefile{toc}{\contentsline {chapter}{\numberline {4}Chromatin accessibility of monocytes}{43}{chapter.4}} \@writefile{lof}{\addvspace {10\p@ }} \@writefile{lot}{\addvspace {10\p@ }} 
\@writefile{loa}{\addvspace {10\p@ }} -\@writefile{chapter}{\contentsline {toc}{Chromatin accessibility of monocytes}{41}{chapter.4}} -\@writefile{toc}{\contentsline {section}{\numberline {4.1}ATAC-seq}{41}{section.4.1}} -\@writefile{lof}{\contentsline {figure}{\numberline {4.1}{\ignorespaces \textbf {ATAC-seq principle :} ATAC-seq uses a hyperactive Tn5 transposase to simultaneously cleave genomic DNA at accessible loci and ligate adaptors. These adaptors can serve as sequencing barcodes. A subsequent step of ligation allows to add sequencing adaptors. The purified DNA fragments are then subjected to massively parallel sequencing to generate a digital readout of per-nucleotide insertion (transposition event) genome-wide. Figure and legent taken and adapted from \citep {vierstra_genomic_2016}.\relax }}{42}{figure.caption.26}} -\newlabel{atac_seq_atac_seq}{{4.1}{42}{\textbf {ATAC-seq principle :} ATAC-seq uses a hyperactive Tn5 transposase to simultaneously cleave genomic DNA at accessible loci and ligate adaptors. These adaptors can serve as sequencing barcodes. A subsequent step of ligation allows to add sequencing adaptors. The purified DNA fragments are then subjected to massively parallel sequencing to generate a digital readout of per-nucleotide insertion (transposition event) genome-wide. Figure and legent taken and adapted from \citep {vierstra_genomic_2016}.\relax }{figure.caption.26}{}} +\@writefile{chapter}{\contentsline {toc}{Chromatin accessibility of monocytes}{43}{chapter.4}} +\@writefile{toc}{\contentsline {section}{\numberline {4.1}ATAC-seq}{43}{section.4.1}} +\@writefile{lof}{\contentsline {figure}{\numberline {4.1}{\ignorespaces \textbf {ATAC-seq principle :} ATAC-seq uses a hyperactive Tn5 transposase to simultaneously cleave genomic DNA at accessible loci and ligate adaptors. These adaptors can serve as sequencing barcodes. A subsequent step of ligation allows to add sequencing adaptors. 
The purified DNA fragments are then subjected to massively parallel sequencing to generate a digital readout of per-nucleotide insertion (transposition event) genome-wide. Figure and legent taken and adapted from \citep {vierstra_genomic_2016}.\relax }}{44}{figure.caption.27}} +\newlabel{atac_seq_atac_seq}{{4.1}{44}{\textbf {ATAC-seq principle :} ATAC-seq uses a hyperactive Tn5 transposase to simultaneously cleave genomic DNA at accessible loci and ligate adaptors. These adaptors can serve as sequencing barcodes. A subsequent step of ligation allows to add sequencing adaptors. The purified DNA fragments are then subjected to massively parallel sequencing to generate a digital readout of per-nucleotide insertion (transposition event) genome-wide. Figure and legent taken and adapted from \citep {vierstra_genomic_2016}.\relax }{figure.caption.27}{}} \citation{neph_expansive_2012} \citation{berest_quantification_2018} \citation{grossman_positional_2018} -\@writefile{toc}{\contentsline {section}{\numberline {4.2}Monitoring TF binding}{43}{section.4.2}} +\@writefile{toc}{\contentsline {section}{\numberline {4.2}Monitoring TF binding}{45}{section.4.2}} \citation{angerer_single_2017} \citation{fan_characterizing_2016,kiselev_sc3:_2017} \citation{aibar_scenic:_2017} \citation{gonzalez-blas_cistopic:_2019} \citation{buenrostro_transposition_2013} -\@writefile{toc}{\contentsline {section}{\numberline {4.3}The advent of single cell DGF}{44}{section.4.3}} -\@writefile{toc}{\contentsline {section}{\numberline {4.4}A quick overview of scATAC-seq data analysis}{44}{section.4.4}} -\@writefile{toc}{\contentsline {section}{\numberline {4.5}Open questions}{44}{section.4.5}} -\@writefile{lof}{\contentsline {figure}{\numberline {4.2}{\ignorespaces \textbf {framework to identify chromatin organization and use them to annotate cellular state :} the scATAC-seq data available in each individual cell are aggregated and used a if it was a bulk sequencing experiment. 
Regions of interest are listed using peak calling on the the bulk data. The read densities in these regions (center of the peaks +/- a given offset) are measured. The regions are then clustered based on their signal shape to identify different chromatin architectures and create a catalog. These chromatin signatures can then be used to annotate each region of interest in each cell, based on the signal resemblance. The information can be stored as a matrix (M) that can be used for downstream analyses, such as sub-population identification.\relax }}{45}{figure.caption.27}} -\newlabel{atac_seq_pipeline}{{4.2}{45}{\textbf {framework to identify chromatin organization and use them to annotate cellular state :} the scATAC-seq data available in each individual cell are aggregated and used a if it was a bulk sequencing experiment. Regions of interest are listed using peak calling on the the bulk data. The read densities in these regions (center of the peaks +/- a given offset) are measured. The regions are then clustered based on their signal shape to identify different chromatin architectures and create a catalog. These chromatin signatures can then be used to annotate each region of interest in each cell, based on the signal resemblance. 
The information can be stored as a matrix (M) that can be used for downstream analyses, such as sub-population identification.\relax }{figure.caption.27}{}} +\@writefile{toc}{\contentsline {section}{\numberline {4.3}The advent of single cell DGF}{46}{section.4.3}} +\@writefile{toc}{\contentsline {section}{\numberline {4.4}A quick overview of scATAC-seq data analysis}{46}{section.4.4}} +\@writefile{toc}{\contentsline {section}{\numberline {4.5}Open questions}{46}{section.4.5}} +\@writefile{lof}{\contentsline {figure}{\numberline {4.2}{\ignorespaces \textbf {framework to identify chromatin organization and use them to annotate cellular state :} the scATAC-seq data available in each individual cell are aggregated and used a if it was a bulk sequencing experiment. Regions of interest are listed using peak calling on the the bulk data. The read densities in these regions (center of the peaks +/- a given offset) are measured. The regions are then clustered based on their signal shape to identify different chromatin architectures and create a catalog. These chromatin signatures can then be used to annotate each region of interest in each cell, based on the signal resemblance. The information can be stored as a matrix (M) that can be used for downstream analyses, such as sub-population identification.\relax }}{47}{figure.caption.28}} +\newlabel{atac_seq_pipeline}{{4.2}{47}{\textbf {framework to identify chromatin organization and use them to annotate cellular state :} the scATAC-seq data available in each individual cell are aggregated and used a if it was a bulk sequencing experiment. Regions of interest are listed using peak calling on the the bulk data. The read densities in these regions (center of the peaks +/- a given offset) are measured. The regions are then clustered based on their signal shape to identify different chromatin architectures and create a catalog. 
These chromatin signatures can then be used to annotate each region of interest in each cell, based on the signal resemblance. The information can be stored as a matrix (M) that can be used for downstream analyses, such as sub-population identification.\relax }{figure.caption.28}{}} \citation{hepler_10x_2018} \citation{hon_chromasig:_2008} \citation{nielsen_catchprofiles:_2012} \citation{kundaje_ubiquitous_2012} \citation{nair_probabilistic_2014} \citation{groux_spar-k:_2019} -\@writefile{toc}{\contentsline {section}{\numberline {4.6}Data}{46}{section.4.6}} -\@writefile{toc}{\contentsline {section}{\numberline {4.7}Identification of catalog of chromatin architectures}{46}{section.4.7}} +\@writefile{toc}{\contentsline {section}{\numberline {4.6}Data}{48}{section.4.6}} +\@writefile{toc}{\contentsline {section}{\numberline {4.7}Identification of catalog of chromatin architectures}{48}{section.4.7}} \citation{nair_probabilistic_2014} \citation{nair_probabilistic_2014} \citation{nair_probabilistic_2014} -\@writefile{toc}{\contentsline {subsection}{\numberline {4.7.1}EMRead : an algorithm to identify over-represented chromatin architecture}{47}{subsection.4.7.1}} -\@writefile{lof}{\contentsline {figure}{\numberline {4.3}{\ignorespaces \textbf {Illustration of the expectation-maximization algorithms} \textbf {A} illustration of EMRead, an algorithm dedicated to the discovery of over-represented chromatin patterns, as described in \citep {nair_probabilistic_2014}. \textbf {B} illustration of EMSequence, an algorithm to discover over-represented DNA motifs. The overall design is the same. Both algorithms model the data has having being sampled from a distribution and perform a maximum-likelihood estimation of the distribution parameters from the data through an iterative procedure. 
EMJoint algorithm is the combination of both EMRead and EMSequence at the same time.\relax }}{47}{figure.caption.28}} -\newlabel{atac_seq_em}{{4.3}{47}{\textbf {Illustration of the expectation-maximization algorithms}\\ \textbf {A} illustration of EMRead, an algorithm dedicated to the discovery of over-represented chromatin patterns, as described in \citep {nair_probabilistic_2014}.\\ \textbf {B} illustration of EMSequence, an algorithm to discover over-represented DNA motifs. The overall design is the same. Both algorithms model the data has having being sampled from a distribution and perform a maximum-likelihood estimation of the distribution parameters from the data through an iterative procedure.\\ EMJoint algorithm is the combination of both EMRead and EMSequence at the same time.\relax }{figure.caption.28}{}} +\@writefile{toc}{\contentsline {subsection}{\numberline {4.7.1}EMRead : an algorithm to identify over-represented chromatin architecture}{49}{subsection.4.7.1}} +\@writefile{lof}{\contentsline {figure}{\numberline {4.3}{\ignorespaces \textbf {Illustration of the expectation-maximization algorithms} \textbf {A} illustration of EMRead, an algorithm dedicated to the discovery of over-represented chromatin patterns, as described in \citep {nair_probabilistic_2014}. \textbf {B} illustration of EMSequence, an algorithm to discover over-represented DNA motifs. The overall design is the same. Both algorithms model the data has having being sampled from a distribution and perform a maximum-likelihood estimation of the distribution parameters from the data through an iterative procedure. 
EMJoint algorithm is the combination of both EMRead and EMSequence at the same time.\relax }}{49}{figure.caption.29}} +\newlabel{atac_seq_em}{{4.3}{49}{\textbf {Illustration of the expectation-maximization algorithms}\\ \textbf {A} illustration of EMRead, an algorithm dedicated to the discovery of over-represented chromatin patterns, as described in \citep {nair_probabilistic_2014}.\\ \textbf {B} illustration of EMSequence, an algorithm to discover over-represented DNA motifs. The overall design is the same. Both algorithms model the data has having being sampled from a distribution and perform a maximum-likelihood estimation of the distribution parameters from the data through an iterative procedure.\\ EMJoint algorithm is the combination of both EMRead and EMSequence at the same time.\relax }{figure.caption.29}{}} \citation{nair_probabilistic_2014} -\@writefile{toc}{\contentsline {subsection}{\numberline {4.7.2}EMSequence : an algorithm to identify over-represented sequences}{48}{subsection.4.7.2}} +\@writefile{toc}{\contentsline {subsection}{\numberline {4.7.2}EMSequence : an algorithm to identify over-represented sequences}{50}{subsection.4.7.2}} \citation{nair_probabilistic_2014} \citation{nair_probabilistic_2014} \citation{nair_probabilistic_2014} \citation{nair_probabilistic_2014} -\@writefile{toc}{\contentsline {subsubsection}{without shift and flip}{49}{subsection.4.7.2}} -\newlabel{atac_seq_emseq_likelihood}{{4.1}{49}{without shift and flip}{equation.4.7.1}{}} -\newlabel{atac_seq_emseq_update_model}{{4.2}{49}{without shift and flip}{equation.4.7.2}{}} -\@writefile{toc}{\contentsline {subsubsection}{with shift and flip}{49}{equation.4.7.2}} -\newlabel{atac_seq_emseq_likelihood_shift_flip}{{4.3}{49}{with shift and flip}{equation.4.7.3}{}} -\newlabel{atac_seq_emseq_reverse_motif}{{4.4}{49}{with shift and flip}{equation.4.7.4}{}} -\newlabel{atac_seq_emseq_update_model_shift_flip}{{4.5}{50}{with shift and flip}{equation.4.7.5}{}} 
-\@writefile{toc}{\contentsline {subsection}{\numberline {4.7.3}EMJoint : an algorithm to identify over-represented sequences and chromatin architectures}{50}{subsection.4.7.3}} +\@writefile{toc}{\contentsline {subsubsection}{without shift and flip}{51}{subsection.4.7.2}} +\newlabel{atac_seq_emseq_likelihood}{{4.1}{51}{without shift and flip}{equation.4.7.1}{}} +\newlabel{atac_seq_emseq_update_model}{{4.2}{51}{without shift and flip}{equation.4.7.2}{}} +\@writefile{toc}{\contentsline {subsubsection}{with shift and flip}{51}{equation.4.7.2}} +\newlabel{atac_seq_emseq_likelihood_shift_flip}{{4.3}{51}{with shift and flip}{equation.4.7.3}{}} +\newlabel{atac_seq_emseq_reverse_motif}{{4.4}{51}{with shift and flip}{equation.4.7.4}{}} +\newlabel{atac_seq_emseq_update_model_shift_flip}{{4.5}{52}{with shift and flip}{equation.4.7.5}{}} +\@writefile{toc}{\contentsline {subsection}{\numberline {4.7.3}EMJoint : an algorithm to identify over-represented sequences and chromatin architectures}{52}{subsection.4.7.3}} \citation{nair_probabilistic_2014} \citation{nair_probabilistic_2014} -\newlabel{atac_seq_emjoint_likelihood}{{4.6}{51}{EMJoint : an algorithm to identify over-represented sequences and chromatin architectures}{equation.4.7.6}{}} -\@writefile{toc}{\contentsline {subsection}{\numberline {4.7.4}Data realignment}{51}{subsection.4.7.4}} +\newlabel{atac_seq_emjoint_likelihood}{{4.6}{53}{EMJoint : an algorithm to identify over-represented sequences and chromatin architectures}{equation.4.7.6}{}} +\@writefile{toc}{\contentsline {subsection}{\numberline {4.7.4}Data realignment}{53}{subsection.4.7.4}} \citation{voss_dynamic_2014} \citation{cirillo_opening_2002,zaret_pioneer_2011,soufi_pioneer_2015} \citation{buenrostro_transposition_2013} -\@writefile{toc}{\contentsline {subsection}{\numberline {4.7.5}Implementations}{52}{subsection.4.7.5}} -\@writefile{toc}{\contentsline {section}{\numberline {4.8}Results}{52}{section.4.8}} -\@writefile{toc}{\contentsline 
{subsection}{\numberline {4.8.1}Fragment size analysis}{52}{subsection.4.8.1}} -\@writefile{lof}{\contentsline {figure}{\numberline {4.4}{\ignorespaces \textbf {Fragment size analysis} \textbf {A :} sequenced fragment size density. The three peaks, from left to right, indicate i) the open chromatin fragments, ii) the mono-nucleosome fragments and iii) the di-nucleosome fragments. The 10bp oscillation reflect the DNA pitch. A mixture model composed of three Gaussian distributions was fitted to the data in order to model the fragment sizes. The class fit is shown as dashed lines : open chromatin (red), mono-nucleosomes (blue) and di-nucleosomes (green). The violet dashed line show the sum of the three classes. \textbf {B :} probability that a fragment belongs to any of the three fragment classes, given its size i) open chromatin (red), ii) mono-nucleosomes (blue) and iii) di-nucleosomes (green). The vertical dashed lines indicates, for each class, the size limit at which the class probability drops below 0.9. With these limites, the class spans are i) 30-84bp for open chromatin (red), ii) 133-266bp for mono-nucleosomes (blue) and iii) 341-500bp for di-nucleosomes (green). The upper limit of the di-nucleosome class was arbitrarily set to 500bp. \textbf {C :} final fragment classes. Each fragments which size overlapped the size range spanned by a class, was assigned to that class. This ensured a high confidence assignment for more than 134 million fragments, leaving 46 millions of ambiguous and long fragments (>500bp) unassigned.\relax }}{53}{figure.caption.29}} -\newlabel{atac_seq_fragment_size}{{4.4}{53}{\textbf {Fragment size analysis} \textbf {A :} sequenced fragment size density. The three peaks, from left to right, indicate i) the open chromatin fragments, ii) the mono-nucleosome fragments and iii) the di-nucleosome fragments. 
The 10bp oscillation reflect the DNA pitch.\\ A mixture model composed of three Gaussian distributions was fitted to the data in order to model the fragment sizes. The class fit is shown as dashed lines : open chromatin (red), mono-nucleosomes (blue) and di-nucleosomes (green). The violet dashed line show the sum of the three classes.\\ \textbf {B :} probability that a fragment belongs to any of the three fragment classes, given its size i) open chromatin (red), ii) mono-nucleosomes (blue) and iii) di-nucleosomes (green). The vertical dashed lines indicates, for each class, the size limit at which the class probability drops below 0.9. With these limites, the class spans are i) 30-84bp for open chromatin (red), ii) 133-266bp for mono-nucleosomes (blue) and iii) 341-500bp for di-nucleosomes (green). The upper limit of the di-nucleosome class was arbitrarily set to 500bp.\\ \textbf {C :} final fragment classes. Each fragments which size overlapped the size range spanned by a class, was assigned to that class. This ensured a high confidence assignment for more than 134 million fragments, leaving 46 millions of ambiguous and long fragments (>500bp) unassigned.\relax }{figure.caption.29}{}} -\@writefile{lof}{\contentsline {figure}{\numberline {4.5}{\ignorespaces \textbf {Signal around CTCF motifs : } the human genome was scanned with a CTCF PWM and different aggregated signal densities were measured for open chromatin (red lines), mono nucleosome (blue lines), di-nucleosomes (green lines) and for a pool of mono-nucleosome fragments with di-nucleosomes fragments cut in two at their center position (violet line). \textbf {Top row :} each position of the fragments, from the start of the first read to the end of the second, were used. \textbf {Middle row :} each position of the reads were used. \textbf {Bottom row :} only one position at the read edges for open chromatin fragment and the central position of nucleosome fragment were used. 
The open chromatin read edges were modified by +4bp and -5bp for +strand and -strand reads respectively. The aggregated densities were measured using bin sizes of 1 (left column), 2 (middle column) and 10bp (right column).\relax }}{54}{figure.caption.30}} -\newlabel{atac_seq_ctcf_all_data}{{4.5}{54}{\textbf {Signal around CTCF motifs : } the human genome was scanned with a CTCF PWM and different aggregated signal densities were measured for open chromatin (red lines), mono nucleosome (blue lines), di-nucleosomes (green lines) and for a pool of mono-nucleosome fragments with di-nucleosomes fragments cut in two at their center position (violet line). \textbf {Top row :} each position of the fragments, from the start of the first read to the end of the second, were used. \textbf {Middle row :} each position of the reads were used. \textbf {Bottom row :} only one position at the read edges for open chromatin fragment and the central position of nucleosome fragment were used. The open chromatin read edges were modified by +4bp and -5bp for +strand and -strand reads respectively.\\ The aggregated densities were measured using bin sizes of 1 (left column), 2 (middle column) and 10bp (right column).\relax }{figure.caption.30}{}} +\@writefile{toc}{\contentsline {subsection}{\numberline {4.7.5}Implementations}{54}{subsection.4.7.5}} +\@writefile{toc}{\contentsline {section}{\numberline {4.8}Results}{54}{section.4.8}} +\@writefile{toc}{\contentsline {subsection}{\numberline {4.8.1}Fragment size analysis}{54}{subsection.4.8.1}} +\@writefile{lof}{\contentsline {figure}{\numberline {4.4}{\ignorespaces \textbf {Fragment size analysis} \textbf {A :} sequenced fragment size density. The three peaks, from left to right, indicate i) the open chromatin fragments, ii) the mono-nucleosome fragments and iii) the di-nucleosome fragments. The 10bp oscillation reflect the DNA pitch. 
A mixture model composed of three Gaussian distributions was fitted to the data in order to model the fragment sizes. The class fit is shown as dashed lines : open chromatin (red), mono-nucleosomes (blue) and di-nucleosomes (green). The violet dashed line show the sum of the three classes. \textbf {B :} probability that a fragment belongs to any of the three fragment classes, given its size i) open chromatin (red), ii) mono-nucleosomes (blue) and iii) di-nucleosomes (green). The vertical dashed lines indicates, for each class, the size limit at which the class probability drops below 0.9. With these limites, the class spans are i) 30-84bp for open chromatin (red), ii) 133-266bp for mono-nucleosomes (blue) and iii) 341-500bp for di-nucleosomes (green). The upper limit of the di-nucleosome class was arbitrarily set to 500bp. \textbf {C :} final fragment classes. Each fragments which size overlapped the size range spanned by a class, was assigned to that class. This ensured a high confidence assignment for more than 134 million fragments, leaving 46 millions of ambiguous and long fragments (>500bp) unassigned.\relax }}{55}{figure.caption.30}} +\newlabel{atac_seq_fragment_size}{{4.4}{55}{\textbf {Fragment size analysis} \textbf {A :} sequenced fragment size density. The three peaks, from left to right, indicate i) the open chromatin fragments, ii) the mono-nucleosome fragments and iii) the di-nucleosome fragments. The 10bp oscillation reflect the DNA pitch.\\ A mixture model composed of three Gaussian distributions was fitted to the data in order to model the fragment sizes. The class fit is shown as dashed lines : open chromatin (red), mono-nucleosomes (blue) and di-nucleosomes (green). The violet dashed line show the sum of the three classes.\\ \textbf {B :} probability that a fragment belongs to any of the three fragment classes, given its size i) open chromatin (red), ii) mono-nucleosomes (blue) and iii) di-nucleosomes (green). 
The vertical dashed lines indicates, for each class, the size limit at which the class probability drops below 0.9. With these limites, the class spans are i) 30-84bp for open chromatin (red), ii) 133-266bp for mono-nucleosomes (blue) and iii) 341-500bp for di-nucleosomes (green). The upper limit of the di-nucleosome class was arbitrarily set to 500bp.\\ \textbf {C :} final fragment classes. Each fragments which size overlapped the size range spanned by a class, was assigned to that class. This ensured a high confidence assignment for more than 134 million fragments, leaving 46 millions of ambiguous and long fragments (>500bp) unassigned.\relax }{figure.caption.30}{}} +\@writefile{lof}{\contentsline {figure}{\numberline {4.5}{\ignorespaces \textbf {Signal around CTCF motifs : } the human genome was scanned with a CTCF PWM and different aggregated signal densities were measured for open chromatin (red lines), mono nucleosome (blue lines), di-nucleosomes (green lines) and for a pool of mono-nucleosome fragments with di-nucleosomes fragments cut in two at their center position (violet line). \textbf {Top row :} each position of the fragments, from the start of the first read to the end of the second, were used. \textbf {Middle row :} each position of the reads were used. \textbf {Bottom row :} only one position at the read edges for open chromatin fragment and the central position of nucleosome fragment were used. The open chromatin read edges were modified by +4bp and -5bp for +strand and -strand reads respectively. 
The aggregated densities were measured using bin sizes of 1 (left column), 2 (middle column) and 10bp (right column).\relax }}{56}{figure.caption.31}} +\newlabel{atac_seq_ctcf_all_data}{{4.5}{56}{\textbf {Signal around CTCF motifs : } the human genome was scanned with a CTCF PWM and different aggregated signal densities were measured for open chromatin (red lines), mono nucleosome (blue lines), di-nucleosomes (green lines) and for a pool of mono-nucleosome fragments with di-nucleosomes fragments cut in two at their center position (violet line). \textbf {Top row :} each position of the fragments, from the start of the first read to the end of the second, were used. \textbf {Middle row :} each position of the reads were used. \textbf {Bottom row :} only one position at the read edges for open chromatin fragment and the central position of nucleosome fragment were used. The open chromatin read edges were modified by +4bp and -5bp for +strand and -strand reads respectively.\\ The aggregated densities were measured using bin sizes of 1 (left column), 2 (middle column) and 10bp (right column).\relax }{figure.caption.31}{}} \citation{buenrostro_transposition_2013} -\@writefile{lof}{\contentsline {figure}{\numberline {4.6}{\ignorespaces \textbf {Signal around CTCF, SP1, myc and EBF1 motifs :} the human genome was scanned with using one PWM per TF. For each TF, the open chromatin architecture was measured by considering the corrected read edges (red) and the nucleosome occupancy (blue) by considering the center of the nucleosome fagments from the nucleosome fragment dataset. The motif location is indicated by the dashed lines.\relax }}{55}{figure.caption.31}} -\newlabel{atac_seq_ctcf_sp1_myc_ebf1_footprint}{{4.6}{55}{\textbf {Signal around CTCF, SP1, myc and EBF1 motifs :} the human genome was scanned with using one PWM per TF. 
For each TF, the open chromatin architecture was measured by considering the corrected read edges (red) and the nucleosome occupancy (blue) by considering the center of the nucleosome fagments from the nucleosome fragment dataset. The motif location is indicated by the dashed lines.\relax }{figure.caption.31}{}} +\@writefile{lof}{\contentsline {figure}{\numberline {4.6}{\ignorespaces \textbf {Signal around CTCF, SP1, myc and EBF1 motifs :} the human genome was scanned with using one PWM per TF. For each TF, the open chromatin architecture was measured by considering the corrected read edges (red) and the nucleosome occupancy (blue) by considering the center of the nucleosome fagments from the nucleosome fragment dataset. The motif location is indicated by the dashed lines.\relax }}{57}{figure.caption.32}} +\newlabel{atac_seq_ctcf_sp1_myc_ebf1_footprint}{{4.6}{57}{\textbf {Signal around CTCF, SP1, myc and EBF1 motifs :} the human genome was scanned with using one PWM per TF. For each TF, the open chromatin architecture was measured by considering the corrected read edges (red) and the nucleosome occupancy (blue) by considering the center of the nucleosome fagments from the nucleosome fragment dataset. 
The motif location is indicated by the dashed lines.\relax }{figure.caption.32}{}} \citation{adey_rapid_2010} \citation{buenrostro_transposition_2013,li_identification_2019} \citation{neph_expansive_2012} \citation{fu_insulator_2008} \citation{neph_expansive_2012} -\@writefile{toc}{\contentsline {subsection}{\numberline {4.8.2}Measuring open chromatin and nucleosome occupancy}{56}{subsection.4.8.2}} +\@writefile{toc}{\contentsline {subsection}{\numberline {4.8.2}Measuring open chromatin and nucleosome occupancy}{58}{subsection.4.8.2}} \citation{kundaje_ubiquitous_2012} \citation{nair_probabilistic_2014} -\@writefile{toc}{\contentsline {subsection}{\numberline {4.8.3}Evaluation of EMRead and EMSequence}{57}{subsection.4.8.3}} -\@writefile{lof}{\contentsline {figure}{\numberline {4.7}{\ignorespaces \textbf {Open chromatin classes around CTCF motifs :} EMRead was run without shifing but with flipping to identify different classes of footprints around 26'650 CTCF motifs. The aggregation signal around the 6 different classes found are shown by decreasing class probability. The open chromatin patterns are displayed in red, the nucleosomes are displayed in blue. The aggregated DNA sequence is displayed as a logo. The y-axis ranges from the minimum to the maximum signal observed. For the DNA logo, this corresponds to 0 and 2 bits respectively.\relax }}{58}{figure.caption.32}} -\newlabel{atac_seq_emread_ctcf_noshift_flip}{{4.7}{58}{\textbf {Open chromatin classes around CTCF motifs :} EMRead was run without shifing but with flipping to identify different classes of footprints around 26'650 CTCF motifs. The aggregation signal around the 6 different classes found are shown by decreasing class probability. The open chromatin patterns are displayed in red, the nucleosomes are displayed in blue. The aggregated DNA sequence is displayed as a logo. The y-axis ranges from the minimum to the maximum signal observed. 
For the DNA logo, this corresponds to 0 and 2 bits respectively.\relax }{figure.caption.32}{}} -\@writefile{lof}{\contentsline {figure}{\numberline {4.8}{\ignorespaces \textbf {Open chromatin classes around CTCF motifs :} EMRead was run with shifing but with flipping to identify different classes of footprints around 26'650 CTCF motifs. The aggregation signal around the 6 different classes found are shown by decreasing class probability. The open chromatin patterns are displayed in red, the nucleosomes are displayed in blue. The aggregated DNA sequence is displayed as a logo. The y-axis ranges from the minimum to the maximum signal observed. For the DNA logo, this corresponds to 0 and 2 bits respectively.\relax }}{58}{figure.caption.33}} -\newlabel{atac_seq_emread_ctcf_shift_flip}{{4.8}{58}{\textbf {Open chromatin classes around CTCF motifs :} EMRead was run with shifing but with flipping to identify different classes of footprints around 26'650 CTCF motifs. The aggregation signal around the 6 different classes found are shown by decreasing class probability. The open chromatin patterns are displayed in red, the nucleosomes are displayed in blue. The aggregated DNA sequence is displayed as a logo. The y-axis ranges from the minimum to the maximum signal observed. For the DNA logo, this corresponds to 0 and 2 bits respectively.\relax }{figure.caption.33}{}} -\@writefile{toc}{\contentsline {subsubsection}{EMRead}{59}{subsection.4.8.3}} -\@writefile{lof}{\contentsline {figure}{\numberline {4.9}{\ignorespaces \textbf {Classification performances on simulated data :} \textbf {Left} 50 different data partitions were run using EMSequence. The discovered models were then used to assign a class label to each sequence. These assigned labels were then compared to the true labels using the AUC under the ROC curve. The red line indicates the AUC value achieved by the true motifs. \textbf {Right} the 50 ROC curves corresponding to each partition. 
The red lines indicates the true motifs ROC curve. The curves under the diagonal are the cases where the 1st discovered class corresponded to the 2nd true class and vice-versa. For these cases, the AUC is actually the area over the curve.\relax }}{60}{figure.caption.34}} -\newlabel{atac_seq_emseq_auc_roc}{{4.9}{60}{\textbf {Classification performances on simulated data :} \textbf {Left} 50 different data partitions were run using EMSequence. The discovered models were then used to assign a class label to each sequence. These assigned labels were then compared to the true labels using the AUC under the ROC curve. The red line indicates the AUC value achieved by the true motifs. \textbf {Right} the 50 ROC curves corresponding to each partition. The red lines indicates the true motifs ROC curve. The curves under the diagonal are the cases where the 1st discovered class corresponded to the 2nd true class and vice-versa. For these cases, the AUC is actually the area over the curve.\relax }{figure.caption.34}{}} -\@writefile{toc}{\contentsline {subsubsection}{EMSequence}{60}{figure.caption.33}} -\@writefile{lof}{\contentsline {figure}{\numberline {4.10}{\ignorespaces \textbf {SP1 motifs :} partition of 15'883 801bp sequences centered on a SP1 binding site using EMSequence. The different classes are ordered by decreasing overall probability. Arrows atop of the motifs indicates tandem arrangements of SP1 motifs.\relax }}{61}{figure.caption.35}} -\newlabel{atac_seq_emseq_sp1_10class}{{4.10}{61}{\textbf {SP1 motifs :} partition of 15'883 801bp sequences centered on a SP1 binding site using EMSequence. The different classes are ordered by decreasing overall probability. 
Arrows atop of the motifs indicates tandem arrangements of SP1 motifs.\relax }{figure.caption.35}{}} +\@writefile{toc}{\contentsline {subsection}{\numberline {4.8.3}Evaluation of EMRead and EMSequence}{59}{subsection.4.8.3}} +\@writefile{lof}{\contentsline {figure}{\numberline {4.7}{\ignorespaces \textbf {Open chromatin classes around CTCF motifs :} EMRead was run without shifing but with flipping to identify different classes of footprints around 26'650 CTCF motifs. The aggregation signal around the 6 different classes found are shown by decreasing class probability. The open chromatin patterns are displayed in red, the nucleosomes are displayed in blue. The aggregated DNA sequence is displayed as a logo. The y-axis ranges from the minimum to the maximum signal observed. For the DNA logo, this corresponds to 0 and 2 bits respectively.\relax }}{60}{figure.caption.33}} +\newlabel{atac_seq_emread_ctcf_noshift_flip}{{4.7}{60}{\textbf {Open chromatin classes around CTCF motifs :} EMRead was run without shifing but with flipping to identify different classes of footprints around 26'650 CTCF motifs. The aggregation signal around the 6 different classes found are shown by decreasing class probability. The open chromatin patterns are displayed in red, the nucleosomes are displayed in blue. The aggregated DNA sequence is displayed as a logo. The y-axis ranges from the minimum to the maximum signal observed. For the DNA logo, this corresponds to 0 and 2 bits respectively.\relax }{figure.caption.33}{}} +\@writefile{lof}{\contentsline {figure}{\numberline {4.8}{\ignorespaces \textbf {Open chromatin classes around CTCF motifs :} EMRead was run with shifing but with flipping to identify different classes of footprints around 26'650 CTCF motifs. The aggregation signal around the 6 different classes found are shown by decreasing class probability. The open chromatin patterns are displayed in red, the nucleosomes are displayed in blue. 
The aggregated DNA sequence is displayed as a logo. The y-axis ranges from the minimum to the maximum signal observed. For the DNA logo, this corresponds to 0 and 2 bits respectively.\relax }}{60}{figure.caption.34}} +\newlabel{atac_seq_emread_ctcf_shift_flip}{{4.8}{60}{\textbf {Open chromatin classes around CTCF motifs :} EMRead was run with shifing but with flipping to identify different classes of footprints around 26'650 CTCF motifs. The aggregation signal around the 6 different classes found are shown by decreasing class probability. The open chromatin patterns are displayed in red, the nucleosomes are displayed in blue. The aggregated DNA sequence is displayed as a logo. The y-axis ranges from the minimum to the maximum signal observed. For the DNA logo, this corresponds to 0 and 2 bits respectively.\relax }{figure.caption.34}{}} +\@writefile{toc}{\contentsline {subsubsection}{EMRead}{61}{subsection.4.8.3}} +\@writefile{lof}{\contentsline {figure}{\numberline {4.9}{\ignorespaces \textbf {Classification performances on simulated data :} \textbf {Left} 50 different data partitions were run using EMSequence. The discovered models were then used to assign a class label to each sequence. These assigned labels were then compared to the true labels using the AUC under the ROC curve. The red line indicates the AUC value achieved by the true motifs. \textbf {Right} the 50 ROC curves corresponding to each partition. The red lines indicates the true motifs ROC curve. The curves under the diagonal are the cases where the 1st discovered class corresponded to the 2nd true class and vice-versa. For these cases, the AUC is actually the area over the curve.\relax }}{62}{figure.caption.35}} +\newlabel{atac_seq_emseq_auc_roc}{{4.9}{62}{\textbf {Classification performances on simulated data :} \textbf {Left} 50 different data partitions were run using EMSequence. The discovered models were then used to assign a class label to each sequence. 
These assigned labels were then compared to the true labels using the AUC under the ROC curve. The red line indicates the AUC value achieved by the true motifs. \textbf {Right} the 50 ROC curves corresponding to each partition. The red lines indicates the true motifs ROC curve. The curves under the diagonal are the cases where the 1st discovered class corresponded to the 2nd true class and vice-versa. For these cases, the AUC is actually the area over the curve.\relax }{figure.caption.35}{}} +\@writefile{toc}{\contentsline {subsubsection}{EMSequence}{62}{figure.caption.34}} +\@writefile{lof}{\contentsline {figure}{\numberline {4.10}{\ignorespaces \textbf {SP1 motifs :} partition of 15'883 801bp sequences centered on a SP1 binding site using EMSequence. The different classes are ordered by decreasing overall probability. Arrows atop of the motifs indicates tandem arrangements of SP1 motifs.\relax }}{63}{figure.caption.36}} +\newlabel{atac_seq_emseq_sp1_10class}{{4.10}{63}{\textbf {SP1 motifs :} partition of 15'883 801bp sequences centered on a SP1 binding site using EMSequence. The different classes are ordered by decreasing overall probability. 
Arrows atop of the motifs indicates tandem arrangements of SP1 motifs.\relax }{figure.caption.36}{}} \citation{chatr-aryamontri_biogrid_2017} \citation{castro-mondragon_rsat_2017} \@setckpt{main/ch_atac-seq}{ -\setcounter{page}{63} +\setcounter{page}{65} \setcounter{equation}{6} \setcounter{enumi}{13} \setcounter{enumii}{0} \setcounter{enumiii}{0} \setcounter{enumiv}{0} \setcounter{footnote}{0} \setcounter{mpfootnote}{0} \setcounter{part}{0} \setcounter{chapter}{4} \setcounter{section}{8} \setcounter{subsection}{3} \setcounter{subsubsection}{0} \setcounter{paragraph}{0} \setcounter{subparagraph}{0} \setcounter{figure}{10} \setcounter{table}{0} \setcounter{NAT@ctr}{0} \setcounter{FBcaption@count}{0} \setcounter{ContinuedFloat}{0} \setcounter{KVtest}{0} \setcounter{subfigure}{0} \setcounter{subfigure@save}{0} \setcounter{lofdepth}{1} \setcounter{subtable}{0} \setcounter{subtable@save}{0} \setcounter{lotdepth}{1} \setcounter{lips@count}{2} \setcounter{lstnumber}{1} \setcounter{Item}{13} \setcounter{Hfootnote}{0} \setcounter{bookmark@seq@number}{0} \setcounter{AM@survey}{0} \setcounter{ttlp@side}{0} \setcounter{myparts}{0} \setcounter{parentequation}{0} \setcounter{AlgoLine}{17} \setcounter{algocfline}{1} \setcounter{algocfproc}{1} \setcounter{algocf}{1} \setcounter{float@type}{8} \setcounter{nlinenum}{0} \setcounter{lstlisting}{0} \setcounter{section@level}{0} } diff --git a/main/ch_encode_peaks.aux b/main/ch_encode_peaks.aux index 0a1b63a..538c471 100644 --- a/main/ch_encode_peaks.aux +++ b/main/ch_encode_peaks.aux @@ -1,96 +1,102 @@ \relax \providecommand\hyper@newdestlabel[2]{} \citation{cheng_understanding_2012} \citation{cheng_understanding_2012} \citation{mathelier_jaspar_2014} \citation{kulakovskiy_hocomoco:_2016} \citation{jolma_dna-binding_2013} \citation{cheng_understanding_2012} \citation{mathelier_jaspar_2014} \citation{kulakovskiy_hocomoco:_2016} \citation{jolma_dna-binding_2013} \citation{cheng_understanding_2012} \citation{gerstein_architecture_2012} 
\citation{wu_biogps:_2016} +\citation{ghirlando_ctcf:_2016} \@writefile{toc}{\contentsline {chapter}{\numberline {2}ENCODE peaks analysis}{23}{chapter.2}} \@writefile{lof}{\addvspace {10\p@ }} \@writefile{lot}{\addvspace {10\p@ }} \@writefile{loa}{\addvspace {10\p@ }} \@writefile{toc}{\contentsline {chapter}{ENCODE peaks analysis}{23}{chapter.2}} \@writefile{toc}{\contentsline {section}{\numberline {2.1}Data}{23}{section.2.1}} \@writefile{lof}{\contentsline {figure}{\numberline {2.1}{\ignorespaces \textbf {Number of peaks in GM12878} called by ENCODE for each TF ChIP-seq experiment. The different TFs are colored by type, as defined by \citep {cheng_understanding_2012} : sequence specific TF (TFSS), non specific TF (TFNS), chromatin structure (ChromStr), chromatin modifier (ChromModif), RNAPII associated factors (Pol2), RNAPIII associated factors (Pol3) and others. The horizontal dashed lines indicate 20'000 and 40'000.\relax }}{24}{figure.caption.19}} \newlabel{encode_peaks_gm12878_peak_number}{{2.1}{24}{\textbf {Number of peaks in GM12878} called by ENCODE for each TF ChIP-seq experiment. The different TFs are colored by type, as defined by \citep {cheng_understanding_2012} : sequence specific TF (TFSS), non specific TF (TFNS), chromatin structure (ChromStr), chromatin modifier (ChromModif), RNAPII associated factors (Pol2), RNAPIII associated factors (Pol3) and others. The horizontal dashed lines indicate 20'000 and 40'000.\relax }{figure.caption.19}{}} \@writefile{lof}{\contentsline {figure}{\numberline {2.2}{\ignorespaces \textbf {Proportion of peaks with a motif in GM12878}, for each TF ChIP-seq experiment, in green. Assuming that a TF binds to DNA through its motif, the motif should be nearby the peak center. Thus the center of each peak was scanned using a PWM describing the TF binding specificity. 
Each TF was associated to a log-odd PWM contained either in JASPAR Core vertebrate 2014 \citep {mathelier_jaspar_2014}, HOCOMOCO v10 \citep {kulakovskiy_hocomoco:_2016} or Jolma \citep {jolma_dna-binding_2013} collection. If a motif instance with a score corresponding to a pvalue higher or equal to $1\cdot 10^{-4}$ could be found, the peak was considered bearing a motif. The different TFs are colored by type, as defined by \citep {cheng_understanding_2012} : sequence specific TF (TFSS), non specific TF (TFNS), chromatin structure (ChromStr), chromatin modifier (ChromModif), RNAPII associated factors (Pol2), RNAPIII associated factors (Pol3) and others. The horizontal dashed line indicates 0.5.\relax }}{24}{figure.caption.20}} \newlabel{encode_peaks_gm12878_motif_prop}{{2.2}{24}{\textbf {Proportion of peaks with a motif in GM12878}, for each TF ChIP-seq experiment, in green. Assuming that a TF binds to DNA through its motif, the motif should be nearby the peak center. Thus the center of each peak was scanned using a PWM describing the TF binding specificity. Each TF was associated to a log-odd PWM contained either in JASPAR Core vertebrate 2014 \citep {mathelier_jaspar_2014}, HOCOMOCO v10 \citep {kulakovskiy_hocomoco:_2016} or Jolma \citep {jolma_dna-binding_2013} collection. If a motif instance with a score corresponding to a pvalue higher or equal to $1\cdot 10^{-4}$ could be found, the peak was considered bearing a motif. The different TFs are colored by type, as defined by \citep {cheng_understanding_2012} : sequence specific TF (TFSS), non specific TF (TFNS), chromatin structure (ChromStr), chromatin modifier (ChromModif), RNAPII associated factors (Pol2), RNAPIII associated factors (Pol3) and others. 
The horizontal dashed line indicates 0.5.\relax }{figure.caption.20}{}} \citation{hon_chromasig:_2008,nielsen_catchprofiles:_2012,kundaje_ubiquitous_2012,nair_probabilistic_2014,groux_spar-k:_2019} \citation{nair_probabilistic_2014} \@writefile{toc}{\contentsline {section}{\numberline {2.2}ChIPPartitioning : an algorithm to identify chromatin architectures}{25}{section.2.2}} \@writefile{toc}{\contentsline {subsection}{\numberline {2.2.1}Data realignment}{26}{subsection.2.2.1}} \citation{zhang_canonical_2014} \@writefile{toc}{\contentsline {section}{\numberline {2.3}Nucleosome organization around transcription factor binding sites}{27}{section.2.3}} \@writefile{lof}{\contentsline {figure}{\numberline {2.3}{\ignorespaces \textbf {Chromatin pattern around TF binding sites in GM12878 :} \textbf {A} For each peaklist, nucleosome occupancy was measured +/- 1kb around each individual TFBS using 10bp bins. The TFBS were then classified into 4 classes according to their nucleosome patterns using a ChIPPartitioning, allowing the patterns to be flipped and shifted. Each TF binding site was assigned a probability to belong to each of the 4 classes with a given values of shift and flip. To assess the extent of a given TF to i) display nucleosomes arrays on its flank and ii) to have nucleosome positioned with respect to its binding sites, array density and shift probability standard deviation have been measured for each class. Classes having a mean array density above 0.4 and a shift probability standard deviation under 3.5 and other custom classes are highlighted. Classes are named using the TF, the laboratory which produced the data and the class number (from 1 to 4). \textbf {B} Examples of class patterns corresponding to some of the highlighted classes for CTCF, ATF3, YY1, EBF1 and ZNF143. 
MNase profiles (red) were allowed to be shifted and flipped and DNaseI (blue), TSS density (violet) and sequence conservation (green) were overlaid according to MNase classification (taking into account both shift and flip). The y-axis scale represent the proportion of the highest signal for each chromatin pattern.\relax }}{28}{figure.caption.21}} \newlabel{encode_peaks_array_measure}{{2.3}{28}{\textbf {Chromatin pattern around TF binding sites in GM12878 :} \textbf {A} For each peaklist, nucleosome occupancy was measured +/- 1kb around each individual TFBS using 10bp bins. The TFBS were then classified into 4 classes according to their nucleosome patterns using a ChIPPartitioning, allowing the patterns to be flipped and shifted. Each TF binding site was assigned a probability to belong to each of the 4 classes with a given values of shift and flip. To assess the extent of a given TF to i) display nucleosomes arrays on its flank and ii) to have nucleosome positioned with respect to its binding sites, array density and shift probability standard deviation have been measured for each class. Classes having a mean array density above 0.4 and a shift probability standard deviation under 3.5 and other custom classes are highlighted. Classes are named using the TF, the laboratory which produced the data and the class number (from 1 to 4). \textbf {B} Examples of class patterns corresponding to some of the highlighted classes for CTCF, ATF3, YY1, EBF1 and ZNF143. MNase profiles (red) were allowed to be shifted and flipped and DNaseI (blue), TSS density (violet) and sequence conservation (green) were overlaid according to MNase classification (taking into account both shift and flip). 
The y-axis scale represent the proportion of the highest signal for each chromatin pattern.\relax }{figure.caption.21}{}} \citation{kundaje_ubiquitous_2012,fu_insulator_2008} -\citation{ghirlando_ctcf:_2016} +\citation{stedman_cohesins_2008} +\citation{donohoe_identification_2007} +\citation{bailey_znf143_2015} \@writefile{toc}{\contentsline {section}{\numberline {2.4}The case of CTCF, RAD21, SMC3, YY1 and ZNF143}{29}{section.2.4}} \@writefile{lof}{\contentsline {figure}{\numberline {2.4}{\ignorespaces \textbf { Colocalization with CTCF peaks in GM12878 cells : } \textbf {A} Proportion of peaks for different TFs having a CTCF peak within 10bp, 50bp and 100bp. The colours indicate different TFs. The CTCF peaklist used as reference to assess CTCF presence was CTCF.Sydh (in red), the two RAD21 peaklists are RAD21.Haib and RAD21.Sydh respectively (in blue), the SMC3 peaklist is SMC3.Sydh (in green), the YY1 peaklist is YY1.Haib (in orange) and the ZNF143 peaklist is ZNF143.Sydh (in violet). \textbf {B} Venn diagrams showing the proportion of peaks for each TF with i) an instance of its own motif, ii) a CTCF.Sydh peak within 100bp, iii) both or iv) neither of them. RAD21 and SMC3 are not represented as there is no PWM available to describe their sequence specificity. \textbf {C} ChIPPartitioning classification with shift and flip of MNase patterns +/- 1kb of YY1.Haib peaks using 10bp bins. YY1 peaks with (upper row) and without (lower row) a CTCF peak within 100bp. Two classes were used to account for "typical" and "non-typical" looking MNase patterns. DNaseI (blue), TSS density (violet) and sequence conservation (green) were overlaid according to MNase classification (taking into account both shift and flip). The number at the upper right corner of each plot indicate the overall class probability. The number of YY1 peaks is slightly smaller than in B) because peaks showing no MNase reads were not included in the classification analysis. 
Peaklists are named using the TF together with the laboratory which produced the data.\relax }}{30}{figure.caption.22}} \newlabel{encode_peaks_colocalization_ctcf}{{2.4}{30}{\textbf { Colocalization with CTCF peaks in GM12878 cells : } \textbf {A} Proportion of peaks for different TFs having a CTCF peak within 10bp, 50bp and 100bp. The colours indicate different TFs. The CTCF peaklist used as reference to assess CTCF presence was CTCF.Sydh (in red), the two RAD21 peaklists are RAD21.Haib and RAD21.Sydh respectively (in blue), the SMC3 peaklist is SMC3.Sydh (in green), the YY1 peaklist is YY1.Haib (in orange) and the ZNF143 peaklist is ZNF143.Sydh (in violet). \textbf {B} Venn diagrams showing the proportion of peaks for each TF with i) an instance of its own motif, ii) a CTCF.Sydh peak within 100bp, iii) both or iv) neither of them. RAD21 and SMC3 are not represented as there is no PWM available to describe their sequence specificity. \textbf {C} ChIPPartitioning classification with shift and flip of MNase patterns +/- 1kb of YY1.Haib peaks using 10bp bins. YY1 peaks with (upper row) and without (lower row) a CTCF peak within 100bp. Two classes were used to account for "typical" and "non-typical" looking MNase patterns. DNaseI (blue), TSS density (violet) and sequence conservation (green) were overlaid according to MNase classification (taking into account both shift and flip). The number at the upper right corner of each plot indicate the overall class probability. The number of YY1 peaks is slightly smaller than in B) because peaks showing no MNase reads were not included in the classification analysis. Peaklists are named using the TF together with the laboratory which produced the data.\relax }{figure.caption.22}{}} +\@writefile{lof}{\contentsline {figure}{\numberline {2.5}{\ignorespaces \textbf {Nucleosome free region at CTCF binding sites} \textbf {a} The length are represented as boxplots. 
The CTCF binding sites are divided into subgroups according to additional presence of SCM3, RAD21, YY1 or ZNF143. The number of binding sites in each subgroup is indicated in red above the boxplots. The presence of SMC3 only, RAD21 only and SMC3 and RAD21 together are indicated in violet, blue and orange respectively. \textbf {B} The proportion of peaks (in green), in each subgroup, having a TSS within a 1kb.\relax }}{31}{figure.caption.23}} +\newlabel{encode_peaks_ctcf_ndr}{{2.5}{31}{\textbf {Nucleosome free region at CTCF binding sites} \textbf {a} The length are represented as boxplots. The CTCF binding sites are divided into subgroups according to additional presence of SCM3, RAD21, YY1 or ZNF143. The number of binding sites in each subgroup is indicated in red above the boxplots. The presence of SMC3 only, RAD21 only and SMC3 and RAD21 together are indicated in violet, blue and orange respectively. \textbf {B} The proportion of peaks (in green), in each subgroup, having a TSS within a 1kb.\relax }{figure.caption.23}{}} +\citation{stedman_cohesins_2008} \citation{dreos_mga_2018} \citation{gerstein_architecture_2012} \citation{mathelier_jaspar_2014} \citation{kulakovskiy_hocomoco:_2016} \citation{jolma_dna-binding_2013} \citation{gaffney_controls_2012} -\@writefile{toc}{\contentsline {section}{\numberline {2.5}Study of CTCF interactor motifs}{31}{section.2.5}} -\@writefile{toc}{\contentsline {section}{\numberline {2.6}The EBF1 case}{31}{section.2.6}} -\@writefile{toc}{\contentsline {section}{\numberline {2.7}Methods}{31}{section.2.7}} -\@writefile{toc}{\contentsline {subsection}{\numberline {2.7.1}Data and data processing}{31}{subsection.2.7.1}} +\@writefile{toc}{\contentsline {section}{\numberline {2.5}Study of CTCF interactor motifs}{33}{section.2.5}} +\@writefile{toc}{\contentsline {section}{\numberline {2.6}The EBF1 case}{33}{section.2.6}} +\@writefile{toc}{\contentsline {section}{\numberline {2.7}Methods}{33}{section.2.7}} +\@writefile{toc}{\contentsline 
{subsection}{\numberline {2.7.1}Data and data processing}{33}{subsection.2.7.1}} \citation{boyle_high-resolution_2008} \citation{dreos_eukaryotic_2017} \citation{siepel_evolutionarily_2005} \@setckpt{main/ch_encode_peaks}{ -\setcounter{page}{33} +\setcounter{page}{35} \setcounter{equation}{0} \setcounter{enumi}{13} \setcounter{enumii}{0} \setcounter{enumiii}{0} \setcounter{enumiv}{0} \setcounter{footnote}{0} \setcounter{mpfootnote}{0} \setcounter{part}{0} \setcounter{chapter}{2} \setcounter{section}{7} \setcounter{subsection}{1} \setcounter{subsubsection}{0} \setcounter{paragraph}{0} \setcounter{subparagraph}{0} -\setcounter{figure}{4} +\setcounter{figure}{5} \setcounter{table}{0} \setcounter{NAT@ctr}{0} \setcounter{FBcaption@count}{0} \setcounter{ContinuedFloat}{0} \setcounter{KVtest}{0} \setcounter{subfigure}{0} \setcounter{subfigure@save}{0} \setcounter{lofdepth}{1} \setcounter{subtable}{0} \setcounter{subtable@save}{0} \setcounter{lotdepth}{1} \setcounter{lips@count}{2} \setcounter{lstnumber}{1} \setcounter{Item}{13} \setcounter{Hfootnote}{0} \setcounter{bookmark@seq@number}{0} \setcounter{AM@survey}{0} \setcounter{ttlp@side}{0} \setcounter{myparts}{0} \setcounter{parentequation}{0} \setcounter{AlgoLine}{0} \setcounter{algocfline}{0} \setcounter{algocfproc}{0} \setcounter{algocf}{0} \setcounter{float@type}{8} \setcounter{nlinenum}{0} \setcounter{lstlisting}{0} \setcounter{section@level}{0} } diff --git a/main/ch_encode_peaks.tex b/main/ch_encode_peaks.tex index 01a5315..b66563b 100644 --- a/main/ch_encode_peaks.tex +++ b/main/ch_encode_peaks.tex @@ -1,140 +1,161 @@ \cleardoublepage \chapter{ENCODE peaks analysis} \markboth{ENCODE peaks analysis}{ENCODE peaks analysis} \addcontentsline{toc}{chapter}{ENCODE peaks analysis} % Modeling a TF sequence specificity only allows to partially understand how a TF binds a region. 
Indeed, scanning a genome using a PWM for putative binding sites often returns tens of thousands of sites with only a subset of them being really occupied within a cell. Other elements such as chromatin organization and composition are likely to drive TF binding. Thus gaining a better understanding about the chromat % The exact mechanisms at play remain unclear but nucleosome occupancy is thought to shelter DNA sequence - as some bases are facing the core octamer or to distort the DNA structure - impeding sequence recognition by TFs. In vivo, evidences for competition between TFs and nucleosomes have been collected. Computational simulations accounting for simultaneous multiple factor binding on DNA suggested that nucleosome occupancy and TFs binding influence each other and that TF binds nucleosome depleted regions \cite{wasson_ensemble_2009}. As discussed above, the organization of chromatin has a deep impact on TF binding. Nucleosomes and TFs are in competition to bind DNA. Because TFs are the ultimate forces driving gene expression, understanding how chromatin influences them, or at least how chromatin is organized around them, is crucial. It is now clear that nucleosome occupancy fulfills more than a packaging role. It can also act as a barrier that impedes DNA reading processes and competes with TFs for sequence occupancy. Thus gaining a better understanding of how chromatin is organized around TF binding sites is crucial to understand TF binding beyond sequence specificity alone. In an effort to better understand how the genome is organized and how its functions are fulfilled, the ENCODE Consortium released an impressive collection of coherent data representing an unprecedented picture of the chromatin in human cell lines. The GM12878 cells were chosen as one of the highest-priority cell lines. GM12878 cells are a widely used lymphoblastoid cell line. 
Because of their ability to divide and of their normal karyotype - unlike HeLa cells - these cells are a good model for genomic studies. \section{Data} % number of peaks per dataset \begin{figure} \begin{center} \includegraphics[scale=0.3]{images/ch_encode_peaks/peaklist_peaknumber_GM12878.png} \captionof{figure}{\textbf{Number of peaks in GM12878} called by ENCODE for each TF ChIP-seq experiment. The different TFs are colored by type, as defined by \citep{cheng_understanding_2012} : sequence specific TF (TFSS), non specific TF (TFNS), chromatin structure (ChromStr), chromatin modifier (ChromModif), RNAPII associated factors (Pol2), RNAPIII associated factors (Pol3) and others. The horizontal dashed lines indicate 20'000 and 40'000.} \label{encode_peaks_gm12878_peak_number} \end{center} \end{figure} % proportion of peaks with motif per dataset \begin{figure} \begin{center} \includegraphics[scale=0.3]{images/ch_encode_peaks/peaklist_proportions_GM12878.png} \captionof{figure}{\textbf{Proportion of peaks with a motif in GM12878}, for each TF ChIP-seq experiment, in green. Assuming that a TF binds to DNA through its motif, the motif should be nearby the peak center. Thus the center of each peak was scanned using a PWM describing the TF binding specificity. Each TF was associated to a log-odd PWM contained either in JASPAR Core vertebrate 2014 \citep{mathelier_jaspar_2014}, HOCOMOCO v10 \citep{kulakovskiy_hocomoco:_2016} or Jolma \citep{jolma_dna-binding_2013} collection. If a motif instance with a score corresponding to a pvalue higher or equal to $1\cdot10^{-4}$ could be found, the peak was considered bearing a motif. The different TFs are colored by type, as defined by \citep{cheng_understanding_2012} : sequence specific TF (TFSS), non specific TF (TFNS), chromatin structure (ChromStr), chromatin modifier (ChromModif), RNAPII associated factors (Pol2), RNAPIII associated factors (Pol3) and others. 
The horizontal dashed line indicates 0.5.} \label{encode_peaks_gm12878_motif_prop} \end{center} \end{figure} In these cells, the ENCODE Consortium released ChIP-seq data for 53 different TFs. Additionally, nucleosome occupancy (MNase-seq) and chromatin accessibility (DNaseI-seq) data were generated with a high depth of coverage. Furthermore, the ENCODE Consortium also released peaks called using their uniform processing pipeline \cite{gerstein_architecture_2012}. These peaks are interesting because i) they are called from technical replicate ChIP-seq samples and ii) several peak callers are used and the different results are integrated. These peaks are thus reproducible [REFERENCE IDR] and robust to peak caller discrepancies and can be considered an excellent standard. The number of peaks called for each TF was highly variable and likely reflects each factor's activity in this cell line (Figure \ref{encode_peaks_gm12878_peak_number}). The most abundant factor in terms of peaks was RUNX3, followed by CTCF. This observation is consistent with BioGPS \citep{wu_biogps:_2016} data, which indicate that both RUNX3 and CTCF have a higher expression in lymphoblasts and in B cells compared to other tissues. CTCF is involved in chromatin looping \citep{ghirlando_ctcf:_2016}. Because looping implies that two CTCF molecules form a homodimer due to the 3D conformation of the genome, it can potentially double the number of CTCF peaks. Moreover, the propensity of each TF to bind through its motif was also variable, with CTCF again showing the highest values (Figure \ref{encode_peaks_gm12878_motif_prop}). \section{ChIPPartitioning : an algorithm to identify chromatin architectures} Discovering archetypical chromatin architectures over a set of regions of interest - for instance regions containing a TF binding site at their center - is a long-standing problem in bioinformatics. 
More formally, given a matrix $R$ of dimensions $N \times L$ containing $N$ vectors of read counts $r_{1}, r_{2}, \ldots, r_{N}$ of length $L$, each containing the number of reads mapping at a given position in a given region, find $K \leq N$ vectors of length $L' \leq L$ that contain archetypical signals found in the $N$ regions of $R$. This can actually be solved using clustering methods, which group regions that look alike into $K$ groups. The summary of the signal inside each group - for instance the mean signal for the K-means algorithm - can then be interpreted as an archetypical chromatin architecture. Biologically, different organizations may reflect different functions. Comparing regions nonetheless raises several issues. First, the $N$ regions of interest are usually aligned with respect to a feature of interest, for instance a TF binding site. However, the chromatin features of interest - for instance the nucleosomes - may not be aligned from one region to the next. This can happen because i) the true binding sites are fuzzily distributed around the center of the regions, ii) the chromatin features appear at a varying distance from the region centers or iii) both. Comparing two regions then requires first realigning the chromatin features. Second, the regions can show a functional orientation. For instance, TF binding sites have an upstream and a downstream side with respect to the bound sequence. Orienting the regions properly is also required to compare the chromatin organizations in two regions. Finally, the signal over some regions may be sparse because of a sub-optimal sequencing depth. The study of signal distribution over genomic regions has been quite an active field for bulk sequencing experiments during the last decade. Dedicated algorithms \citep{hon_chromasig:_2008,nielsen_catchprofiles:_2012,kundaje_ubiquitous_2012,nair_probabilistic_2014,groux_spar-k:_2019} have been developed to cluster genomic regions based on their distribution of reads. 
Most of these algorithms and software tools deal with some of the issues cited above. However, the algorithm developed by \citep{nair_probabilistic_2014} - which I will call ChIPPartitioning - is probably the most complete, as it addresses all three. ChIPPartitioning is a probabilistic partitioning method that softly clusters a set of genomic regions based on the resemblance of their signal shapes (as opposed to their absolute values). To ensure proper comparisons between the regions, the algorithm allows one region to be offset relative to another, in order to retrieve a similar signal at different offsets, and allows the signal orientation to be flipped. Finally, it has been demonstrated to be robust to sparse data. This algorithm models the signal over a region of length $L$ as having been sampled from a mixture of $K$ signal models, using $L$ independent Poisson distributions. The number of reads sequenced over this region is then the result of this sampling process. The entire set of regions is assumed to have been generated from a mixture of $K$ different signal models (classes). Each class is represented by a vector of $L' \leq L$ values that represent the expected number of reads at each position for that class. These values are thus the Poisson distribution parameters. In order to discover the $K$ different chromatin signatures in the data, the algorithm performs a maximum likelihood estimation of the Poisson distribution parameters using an expectation-maximization (EM) framework. Given a set of $K$ models, the likelihood of each region given each class is computed. A posterior probability of each class given each region can, in turn, be computed. These probabilities can be interpreted as a soft clustering. The parameters of the classes are updated using a weighted aggregation of the signal. Since each region is assigned a probability to belong to each class, it contributes to the update of all the classes, with different weights. If the length of the chromatin signature searched $L'