ch_group_projects.tex
No OneTemporary
Actions

Subscribers

None

File Metadata

Created: Mon, May 20, 10:53

ch_group_projects.tex
View Options

	\cleardoublepage
	\chapter{Published laboratory projects}
	\label{lab_projects}
	\markboth{Published laboratory projects}{Published laboratory projects}
	\addcontentsline{toc}{chapter}{Published laboratory projects}

	This section contains the description of projects that have been accomplished as part of a laboratory-wide effort. Resources such as softwares and databases which have been subjected to a publication in peer-reviewed article are listed here.




	\section{Mass Genome Annotation repository}
	\label{section_mga}

	As part of an effort to enhance reproducibility and re-use of NGS data, my laboratory developed and maintain the Mass Genome Annotation repository (MGA). This repository is a resource that has been designed to store publicly available NGS data released on other primary data repository, in a highly standardized manner with a high quality data annotation (metadata). The MGA has been described in \citep{dreos_mga_2018}. This work was mostly undertaken by René Dreos, a postdoctoral fellow of the laboratory. My involvement in this project was related to data processing, curration and annotation.

	\subsection{Introduction}

	Nowadays, publishing in a peer-reviewed article requires the authors to submit their NGS data to primary repository, such as GEO \citep{barrett_ncbi_2011} or ArrayExpress \citep{rustici_arrayexpress_2013}. In both cases, the raw sequence data are deposited on these platform with a minimal, often incomplete, difficult to understand if not wrong data annotation.

	Because working with such data requires to, at least, map the data to a genome and perform some data quality controls, it is a substantial investment of work and time which is often a barrier impeding the re-use of the data and the reproducibility of published studies.

	\subsection{MGA content and organization}

	\begin{figure}[!htbp]
	\begin{center}
	\includegraphics[scale=0.8]{images/ch_group_projects/mga_figure1.jpeg}
	\captionof{figure}{\textbf{Content of the MGA repository by 2018} \textbf{A} Proportion of samples in the database grouped by type. \textbf{B} Proportion of samples grouped by organism. Assemblies belonging to the same organism are merged together. \textbf{C} Samples numbers stratified by type and organism. Dot areas are proportional to the total number of samples in that category. The corresponding numbers can be found in a weakly updated table posted on the MGA home page at \url{http://ccg.vital-it.ch/mga}.
	Figure and legend taken and adapted from \citep{dreos_mga_2018}.}
	\label{lab_projects_mga_stats}
	\end{center}
	\end{figure}

	Currently, the MGA contains more than 24'000 samples in 15 different species : human, mouse, rat, macaque, dog, chicken, zebrafish, worm, fruit fly, bee, arabidopsis, corn, plasmodium, baker's yeast and fission yeast. In all species, except in human, mouse, fruit fly and worm the data are mapped to a single genome assembly, called primary assembly. IAmong the hosted samples, landmark datasets such as the ENCODE \citep{consortium_integrated_2012}, RoadMap \citep{roadmap_epigenomics_consortium_integrative_2015} or Fantom5 \citep{lizio_gateways_2015} datasets are present. Each sample in the MGA belongs to one of the 13 mandatory data categories :

	\begin{enumerate}
	\item ChIP-seq reads : reads mapped to one of the diverse genome assemblies.
	\item ENCODE ChIP-seq read : reads produced and mapped by the ENCODE Consortium
	\item RoadMap-ChIP-seq : reads produced and mapped by the RoadMap Consortium from different tissues.
	\item ChIP-seq peaks : peaks called by the data author and released together with the reads.
	\item ENCODE ChIP-seq peaks : peaks called by the ENCODE Consortium on their ChIP-seq data.
	\item Transcription profiling : 5' end of RNA produced using methods such as CAGE or GRO-cap
	\item ENCODE transcription profiling : 5' end of RNA produced and mapped by the ENCODE Consortium.
	\item DNase FAIRE etc : chromatin accessibility and nucleosome occupancy data.
	\item ENCODE DNase FAIRE etc : ENCODE produced and mapped data for chromatin accessibility and nucleosome occupancy.
	\item RoadMap DNase FAIRE etc : RoadMap produced data for chromatin accessibility and nucleosome occupancy.
	\item DNA methylation : various methylation data such as bisulfite sequencing MBDCap and so on.
	\item Sequence derived : DNA sequence measure features such as PWM matches, natural variants and PhastCons \citep{siepel_evolutionarily_2005} and PhyloP \citep{pollard_detection_2010} conservation tracks.
	\item Genome annotation : TSSs from different databases, intro/exon boundaries.
	\end{enumerate}

	All the data available on the MGA are stored in Simple Genome Annotation (SGA, \cite{ambrosini_chip-seq_2016}). SGA is a single coordinate format. In essence, all data are represented as a single coordinate along the genome. Nonetheless, this does not compromise the accuracy of the data as each type is represented appropriately. ChIP-seq peaks and paired-end sequenced fragments are represented by their middle position, single-end sequenced reads by their 5' end, TSSs are single base coordinates anyway, and so one. Additionally, this minimizes the disk space required to store these data. However, in any case, SGA formatted data can easily be converted to BED or to GFF using dedicated conversion tools \citep{ambrosini_chip-seq_2016}.

	In order to enhance original results reproducibility as much as possible, processed data, if available on the primary repositories, are always preferred. Otherwise, the raw sequencing data are downloaded and processed using a general pipeline comprising i) read mapping using Bowtie \citep{langmead_ultrafast_2009} or Bowtie 2 \citep{langmead_fast_2012}, ii) conversion to BED using the SAMTools \citep{li_sequence_2009} and BEDTools \citep{quinlan_bedtools:_2010} suits and a final conversion to SGA using ChIP-seq server conversion tools \cite{ambrosini_chip-seq_2016}. As on GEO, data (called samples) for a given study/article belong to a same serie. Finally, the metadata are created. A full description of the data, their biological significance, their processing is available in HTML format. Two additional machine readable text files are available with i) the sample information ii) the serie information.

	Importantly, it should be highlighted that the MGA is fully interconnected in-house developed analysis tools hosted on the ChIP-seq \citep{ambrosini_chip-seq_2016} and Signal Analysis Search (SSA, \cite{ambrosini_signal_2003}) servers. These servers contains tool to perform peak-calling, correlation analyses, sequence analysis, format conversions and much more. All the data hosted can thus be readily analyzed using any of these in-house developed tools.

	\subsection{Conclusions}

	The MGA repository is an important asset to the scientific community. It allows anybody to undertake quickly a wide range of data analyses, together with the ChIP-seq and SSA servers. Additionally, all these data are readily available and can be downloaded in MGA or BED format. Using them is so convenient that I used MGA hosted data in almost all my projects.

	\clearpage

	\section{Eukaryotic Promoter Database}

	The laboratory developed and maintains the Eukaryotic Promoter Database (EPD), a database of eukaryotic RNA-polymerase II TSSs compiled from high throughput transcription profiling data. This section briefly recapitulates the results presented in \citep{dreos_eukaryotic_2017} and also present a more up-to-date content of the database. Most of the work presented in this section should be credited to René Dreos and Patrick Meylan, respectively a former and a current post-doctoral follows from the laboratory.

	\subsection{Introduction}

	\begin{figure}
	\begin{center}
	\centerline{\includegraphics[scale=0.9]{images/ch_group_projects/epd_figure1.jpeg}}
	\caption{\textbf{Schematic representation of the EPDnew pipeline} \textbf{A} Download of authoritative gene catalogs and primary TSS mapping data from public databases, data repositories and consortium websites. \textbf{B} Quality control (QC) of incoming data (e.g. read mapping efficiency, contaminations, etc.). \textbf{C} Data passing QC are reformatted and incorporated into the MGA repository. \textbf{D} Selection of a subset of TSS mapping experiments for generating a new organism-specific TSS collection. \textbf{E} Input data for a new module of EPDnew. \textbf{F} Organism-specific automatic database assembly pipeline tailored to the input data, see \citep{dreos_epd_2013} for a detailed description of the human EPDnew assembly pipeline. \textbf{G} Preliminary or final TSS collection \textbf{H} Manual sanity checks of individual randomly selected promoter entries using the corresponding entry viewer. \textbf{I} Automatic quality evaluation of the TSS collections as a whole by motif enrichment tests, see Figure \ref{lab_projects_epd_motifs} for an example. \ref{L} Feedback is collected from quality evaluation steps H and I. This may lead to the exclusion, replacement or addition of source data sets or modifications (e.g. program parameter fine-tuning) of the computational database generation pipeline. Note that the development of a final, publicly released EPDnew module typically involves several evaluation-modification cycles. Figure and legend taken and adapted from \citep{dreos_eukaryotic_2017}.}
	\label{lab_projects_epd_pipeline}
	\end{center}
	\end{figure}

	The eukaryotic promoter database is an old resource initiated by Bucher and Trifonov from the manual curration of experimental data \citep{bucher_compilation_1986}. Since its beginning, EPD was designed as a sequence annotation that indicates, for a given sequence, which positions are used to initiate transcription.

	With the advent of high throughput transcription profiling assays - such as CAGE and GRO-seq - TSS identification and mapping has drastically changed. In consequence, EPDnew - a new dedicated branch of EPD - was created several years ago \citep{dreos_epd_2013} to make full use of these new data.

	In essence, each EPDnew release is created using a semi-automated computational pipeline that identifies genomic regions showin high mRNA initiations at the beginning of annotated genes. The input data are subjected to a severe quality control checks before entering the EPDnew pipeline. Additionally, the resultes are manually verified throughout the entire process, ensuring high curration standards. The entire pipeline is depicted in Figure \ref{lab_projects_epd_pipeline}.

	EPDnew contains computational annotations of genome assemblies from publicly available high throughput 5' mRNA sequencing data. The following sections contain a description of the current state of EPD and of the recent novelties introduced.

	\subsection{EPDnew now annotates (some of) your mushrooms and vegetables}

	\begin{table}
	\begin{center}
	\begin{tabular}{ l l l }
	\hline
	Organism, release & Promoters, genes with TSS, genes & TSS libraries \\
	\hline
	\textit{H. sapiens} (6) & 29598, 16455, 17056 (96\%) & 1311 \\
	\textit{H. sapiens} nc (6) & 2339, 1894, 3496 (54\%) & 1311 \\
	\textit{M. mulata} (1) & 9575, 9026, 20593 (44\%) & 15 \\
	\textit{M. musculus} (3) & 25111, 20213, 24864 (81\%) & 965 \\
	\textit{M. musculus} nc (3) & 3077, 2938, 10184 (29\%) & 965 \\
	\textit{R. norvegicus} (1) & 12601, 12013, 21919 (55\%) & 13 \\
	\textit{G. gallus} (1) & 6127, 5632, 16837 (33\%) & 32 \\
	\textit{C. familiaris} (1) & 7545, 7321, 19971 (37\%) & 12 \\
	\textit{D. melanogaster} (5) & 16972, 13399, 13660 (98\%) & 375 \\
	\textit{A. melifera} (1) & 6493, 5712, 10727 (53\%) & 16 \\
	\textit{D. rerio} (1) & 10728, 10235, 18606 (55\%) & 21 \\
	\textit{C. elegans} (1) & 7120, 6363, 11786 (54\%) & 9 \\
	\textit{A. thaliana} (4) & 22703, 22701, 27149 (83\%) & 13 \\
	\textit{Z. mays} (1) & 17081, 15828, 26651 (59\%) & 8 \\
	\textit{S. cervisiae} (2) & 5117, 5117, 5819 (88\%) & 22 \\
	\textit{S. pombe} (2) & 4802, 4801, 5128 (94\%) & 16 \\
	\textit{P. falciparum} (1) & 5597, 4028, 4994 (80\%) & 12 \\
	\hline
	\end{tabular}
	\caption{\textbf{Current contents of EPDnew} 'Promoters' indicate the number of TSS entries in EPDnew. 'Genes' indicates the number of genes having at least one TSS annotated in EPDnew. 'Genes' indicates the number of protein coding genes contained in the genome annotation (except for nc species). 'nc' stands for non-coding and indicates the long non-coding gene annotations. For 'nc' entries, 'genes' refers to the number of long non-coding genes present in the annotation. In parenthesis are indicated the percentages of genes having a at least one TSS annotated in EPDnew.}
	\label{lab_projects_epd_stats}
	\end{center}
	\end{table}

	With years, EPDnew has substantially grown since its beginnings. It has grown from a promoter collection annotating protein coding genes in five animal model organisms (human, mouse, fruit fly, zebrafish and worm, \cite{dreos_eukaryotic_2015}), to 10 (human, mouse, fruit fly, zebrafish, worm, bee, arabidopsis, maize, brewer's yeast and fission yeast, \cite{dreos_eukaryotic_2017}) and now 16 organisms annotating protein coding and non coding genes (Table \ref{lab_projects_epd_stats}). The number of genes containing at least one annotated TSS in EPDnew is variable among species. However \textit{H. sapiens}, \textit{D. melanogaster} and \textit{S. pombe} are approaching a complete gene coverage of protein coding genes with 96\%, 98\% and 94\%.

	\subsection{Increased mapping precision in human}

	\begin{figure}
	\begin{center}
	\centerline{\includegraphics[scale=0.3]{images/ch_group_projects/epd_motifs.png}}
	\caption{\textbf{TSS Mapping precision} Occurrence of the TATA-box \textbf{A} and initiator \textbf{B} around \textit{H.sapiens} TSSs from EPDnew releases (004 and 006) and from a list of gene starts from UCSC Gene list, which was used as input for the generation of the EPDnew collection. This figure was created using Oprof from the SSA server \citep{ambrosini_signal_2003}. Detailed instructions to recreate the figure can be found below.}
	\label{lab_projects_epd_motifs}
	\end{center}
	\end{figure}

	The human annotation has been generated with >1300 experiments, containing dozens of billions of reads. It is currently and by far the largest data collection among EPDnew. Importantly, even if the number of TSSs reported increased compared to what has been reported in \citep{dreos_eukaryotic_2017} (25 503 using 1088 datasets vs 29 598 using 1311 datasets) the coverage is reaching saturation (95\% vs 96\%). Thus most of the newly discovered TSSs are alternative TSSs. Nonetheless, the overall TSS mapping precision is still increasing, as shown in Figure \ref{lab_projects_epd_motifs}. This is illustrated using the positional distributions of the TATA-box and the initiator motifs which are both core promoter elements that are expected to appear at a fixed distance from the TSS. The increased frequencies seen in EPDnew release 006 compared the other TSS annotations indicate a better alignment of the TSSs.

	\subsection{Integration of EPDnew with other resources}

	EPDnew is also hosted in the MGA repository (see section \ref{section_mga} and \citep{dreos_mga_2018}). As such, EPDnew TSSs are available together with the ChIP-seq (TF binding, histone marks), the nucleosome occupancy, chromatin accessibility, SNP and sequence conservation data present. As a consequence, EPDnew can easily be integrated into diverse genomic analyses through the tools hosted on the ChIP-seq \citep{ambrosini_chip-seq_2016} and SSA \citep{ambrosini_signal_2003} servers.

	Besides, EPD could be explored through a viewer relying on a selection of tracks to be visualized in the UCSC Genome Browser \citep{dreos_epd_2013}. Since then, a major effort to provide a customizable visualization plateform has been undertaken \citep{dreos_eukaryotic_2017}. Currently, each specie is provided with a minimal track hub \citep{raney_track_2014} containing at least 3 tracks : i) the combined TSS mapping samples used to create the EPDnew annotation for this specie, ii) the EPDnew TSSs on the + strand and iii) the EPDnew TSSs on the - strand. Other tracks are often available depending on the specie such as a gene track, a CpG island track and so one. Finally, a tool to create and upload custom MGA derived tracks (ChIP-Track available at \url{https://ccg.epfl.ch/chipseq/chip_track.php}) on the UCSC Genome Browser, in a few mouse clicks, have been developed to fully exploit the possible synergies between the UCSC Genome Browser, EPDnew and the MGA repository.

	\subsection{Conclusions}

	EPDnew is a valuable resource for the research community. It offers an unprecedented TSS mapping effort in terms of precision and species covered. Additionally, its interconnection with other resources and tools maintained by the laboratory makes it easy to be integrated in different genomic analyses. Finally, the effort made to allow the visualization of EPDnewm TSSs and of any data present on the MGA is a powerful tool that allows to create "on demand" visualizations

	\subsection{Methods}

	\subsubsection{Motif occurrence profiles}

	The motif occurrence profiles in Figure \ref{lab_projects_epd_motifs} have been generated using Oprof from the SSA server \url{https://ccg.epfl.ch/ssa/oprof.php}. To create the TATA-box profiles, the input data were H. sapiens (Feb 2009 GRCh37/hg19) / Genome Annotation / EPDnew / release 004 or 006 or UCSC, TSS for known Genes / TSS from UCSC known genes. The motifs were choosen from PWMs from libraries / Promoter Motifs and then TATA-box (length=15) or Initiator (length=8). The borders were set to -50 to +50bp, the window size to 20bp for the TATA-box and 10bp for the Initiator and forward search mode was enabled. All other parameters were left to default.

	\clearpage

	\section{PWMScan}

	This section describes PWMScan, a software that has been developed to predict TF binding sites given a binding specificity model. This work has been conducted in close collaboration with Giovana Ambrosini, a senior scientist of the laboratory.

	My contribution to this worked is substantial. Among other things, I contributed to the development of the matrix\_scan program, set up the entire benchmark procedure, run it and finally developed the necessary backend Python code for an efficient parallelization of matrix\_scan.

	The content of this section is taken and adapted - with the author permission - from the original article \citep{ambrosini_pwmscan:_2018}. I produced all the figures presented in this section.

	% Because TF regulate gene expression by binding to short specific DNA sequences of 5–20 bp, it is of prime importance to be able to predict the binding sites of a TF given its binding specificity.

	% To fulfill this task, I participated in the development a genome scanner that uses a PWM as TF specificity model to scan a genome and returns candidate binding sites.

	% A web-based version of this software has been made available to scan server-resident genomes. Additionally, the user can provide a PWM or select one among the different hosted public databases such as JASPAR \citep{khan_jaspar_2018}, HOCOMOCO \citep{kulakovskiy_hocomoco:_2018} and many others.

	\subsection{Introduction}

	% article Figure 1
	% \begin{figure}[!htbp]
	% \begin{center}
	% \includegraphics[scale=0.4]{images/ch_group_projects/pwmscan_fig1.png}
	% \captionof{figure}{Screen shot of the PWMScan graphical user interface.}
	% \label{lab_projects_gui}
	% \end{center}
	% \end{figure}

	% flowchart present in supplemental material
	\begin{figure}
	\begin{center}
	\includegraphics[scale=0.1]{images/ch_group_projects/pwmscan_flowchart.png}
	\captionof{figure}{\textbf{PWMScan workflow :} the input is composed of a PWM and a score threshold specifying the minimum score for a sequence to achieved to be considered as a match. Letter probability matrices or count matrices are also accepted and are converted into PWMs. The score threshold can also be given as a p-value or a percentage of the maximum score, in which case it is converted into a threshold score. Based on the length of the PWM, Bowtie or pwm\_scan can be used to find the matches on the genome. If Bowtie is used, the set of k-mers achieving a better score than the threshold score is computed using branch-and-bound algorithm (mba) and mapped on the genome. On the other hand, if matrix\_scan is used, the PWM is used to score every possible sub-sequence in the genome. The regions corresponding to the sequences achieving a score at least as good as the threshold score are then returned under BED format.
	Figure and legend taken and adapted from \citep{ambrosini_pwmscan:_2018}.}
	\label{lab_projects_pwmscan_pipeline}
	\end{center}
	\end{figure}

	Knowing where transcription factors (TFs) bind to the genome is the key to understanding gene regulation. The binding specificity of a TF is commonly represented by a numerical matrix, either as a position weight matrix (PWM), a position frequency matrix (PFM) or a letter probability matrix (LPM). The three representations are information-wise equivalent and inter-convertible. A PWM contains weights for each base at each motif position. By summing up weights at corresponding positions, a binding score can be computed for any base sequence of the same length as the PWM. Large collections of TF specificity matrices are nowadays available frpublishedom public libraries such as JASPAR \citep{khan_jaspar_2018} or HOCOMOCO \citep{kulakovskiy_hocomoco:_2018}.

	PWMScan is a web-server for rapid scanning of large genomes for high-scoring matches to a user-supplied or server-resident PWM. Compared to other web-based PWM scanning tools, PWMScan is unique in that it scans server-resident whole genomes rather than user-uploaded DNA sequences. Other key features are: (i) menu-driven access to genomes of >30 model organisms; (ii) menu-driven access to >300 public PWM libraries; (iii) support of various PWM representations and formats; (iv) cut-off values can be specified as match scores or P-values; (v) output in BEDdetail format with match scores and P-values; (vi) links to UCSC genome browser for visualization of results; and (vii) action buttons to transfer match lists to analysis tools.

	A short description of the PWMScan server follows. Technical details about algorithms, programs and data are provided under Supplementary Material in the original article \citep{ambrosini_pwmscan:_2018}.

	\subsection{Data and methods}

	Genome sequences were downloaded from the NCBI in FASTA format. Indexed versions for rapid scanning were generated for Bowtie \citep{langmead_ultrafast_2009}. The motif databases offered by PWMScan have been downloaded from the MEME Suite website \citep{bailey_meme_2009}. LPMs have been converted to integer PWMs (see Supplementary Material in \citep{ambrosini_pwmscan:_2018}).

	% The input form of the PWMScan server is shown in Figure \ref{lab_projects_pwmscan_gui}. The user chooses a genome assembly from a menu. Optionally, a BED file may be uploaded to restrict the search to genomic regions of particular interest, e.g. open chromatin regions. The right side of the form offers several ways to specify a DNA motif. PWMs from a server-resident database are chosen from a pull-down menu. Alternatively, matrices can be entered into a text area or uploaded. Accepted motif types are: PFMs, LPMs, real or integer PWMs and IUPAC consensus sequences. PFMs can be entered in several formats, including TRANSFAC and JASPAR.

	All motif types have to be converted into integer PWMs for input to the genome search engines (see Supplementary Material in \citep{ambrosini_pwmscan:_2018}). Default conversion parameters are proposed and can be changed by the user.

	For instance, real PWMs can be rescaled on input by a multiplication factor to ensure sufficient resolution after integer conversion. IUPAC consensus sequences are converted into binary matrices consisting of 0 and 1.

	For all matrix formats, the cut-off value can be specified as PWM score, as P-value or as percentage of the score range (0\% = minimal score and 100\% = maximal score). For IUPAC consensus sequences, the cut-off value is specified as a maximal number of mismatches allowed.

	The P-value of a PWM score x is defined as the probability that a random k-mer sequence of the length of the PWM has a binding score $\geq$ x given the base composition of the genome.

	The whole genome scan takes as input an integer PWM and a corresponding cut-off. The output is a list of sequence regions that match the PWM with a match score higher or equal to the cut-off value. Depending on the length of the PWM and the cut-off, one of the following search strategies is chosen: (i) Bowtie, a fast memory-efficient short read aligner using indexed genomes and (ii) matrix\_scan, a C program developed by our group using a conventional search algorithm, as shown in Figure \ref{lab_projects_pwmscan_pipeline}.

	The first strategy is more efficient for short PWMs and high cut-off values. It requires as a first step the generation of a list of all k-mers that match the PWM with the given cut-off. The list of k-mers is then mapped to the genome using Bowtie.

	The second strategy takes genome sequences in FASTA format as input. Individual chromosomes are processed in parallel and distributed to multiple cores by a Python script. We empirically found that this approach becomes more efficient if the number of k-mers exceeds 10 5 sequences. matrix\_scan was benchmarked for speed together with five other matrix scanners and was found to be the fastest (see Supplementary Material in \citep{ambrosini_pwmscan:_2018}).

	The basic search step outputs a list of PWM matches, including the genomic coordinates, the DNA sequence and the match score. Post-processing of this list involves computation of the corresponding P-values, addition of the matrix name and, optionally, elimination of overlapping matches. The final match list is provided in BEDdetail format. The output page further shows the total number of PWM matches and a sequence logo reflecting the letter-probabilities of the input matrix. Action buttons are provided for: (i) sending the match lists to analysis tools of the ChIP-Seq and SSA servers \citep{ambrosini_chip-seq_2016}, (ii) extracting DNA sequences around the matches, (iii) sending the output to the UCSC genome browser for visualization and (iv) liftover of the match coordinates to other assemblies of the same or related species.

	PWMScan is meant to support many types of genomic data analysis and designed to be interoperable with other tools from our group and elsewhere. An example of a typical workflow involving ChIP-seq data is presented in Supplementary Material in \citep{ambrosini_pwmscan:_2018}.

	PWMScan is also available as a command-line software package from SourceForge, including a master script scheduling all computational steps running during a web job.

	\subsection{Benchmark}

	% supplemental figure 3 in article
	\begin{figure}
	\begin{center}
	\includegraphics[scale=0.2]{images/ch_group_projects/pwmscan_figure_s1.png}
	\captionof{figure}{\textbf{Benchmark :} PWMScan speed performances were measured and compared with 6 other well known genome scanners. In all cases, the h19 genome sequence was scanned with a 19bp CTCF matrix and a 11bp STAT1 matrix, 10 times. The run times are represented as boxplots. For PWMScan, both pwm\_scan and Bowtie strategies were run.
	Figure and legend taken and adapted from \citep{ambrosini_pwmscan:_2018}.}
	\label{lab_projects_pwmscan_benchmark}
	\end{center}
	\end{figure}

	\begin{table}
	\begin{center}
	\begin{tabular}{\| c \| c \| c \| c \| c \| c \| c \|}
	\hline
	Program & \multicolumn{3}{\|c\|}{CTCF} & \multicolumn{3}{\|c\|}{STAT1} \\
	\hline
	& \thead{Nb of\\matches} & \thead{Perc.\\overlap} & \thead{Score corr.} & \thead{Nb of\\matches} & \thead{Perc.\\overlap} & \thead{Score\\corr}. \\
	\hline
	matrix\_scan & 70607 & 100 & 1 & 138531 & 100 & 1 \\
	\hline
	Patser & 70607 & 100 & 1 & 138531 & 100 & 1 \\
	\hline
	PossumSearch & 70607 & 100 & 1 & 138531 & 100 & 1 \\
	\hline
	FIMO & 70131 & 99.16 & 0.9989 & 134901 & 100 & 1 \\
	\hline
	RSAT & 70701 & 99.85 & 0.9999 & 134901 & 100 & 1 \\
	\hline
	HOMER & 70701 & 99.95 & 0.9999 & 134901 & 100 & 1 \\
	\hline
	STORM & 70704 & 99.83 & 0.9999 & 134901 & 100 & 0.9999 \\
	\hline
	\end{tabular}
	\caption{\textbf{Motif scanning software comparison}. The performances of matrix\_scan were assessed by comparing how many of the regions listed by matrix\_scan were also returned by other programs and if the region scores were comparable. For the percentage of overlap with the match list returned by matrix\_scan, the shorter of the two lists always serves as the reference (100\%). For the score correlations with matrix\_scan scores, the Spearman correlation was used.
	Table and legend taken and adapted from \citep{ambrosini_pwmscan:_2018}.}
	\label{lab_projects_pwmscan_benchmark_table}
	\end{center}
	\end{table}

	% article Table 1
	% \begin{center}
	% \captionof{table}{Speed is expressed in seconds. The benchmarking tests have been run on a Linux/CentOS7/x86\_64 workstation with 48 CPU-cores and 256 GB of DRAM.
	% The performance measurements for matrix\_scan were achieved using 1 (left value) and 10 (right value) CPU-cores in parallel.}
	% \begin{tabular}{ c c c }
	% PWM/P-value & Bowtie speed & Matrix\_scan speed \\
	% \hline
	% STAT1(len = 11 bp)/$10^{-5}$ & 3 & 30/5 \\
	% STAT1(len = 11 bp)/$10^{-4}$ & 8 & 40/8 \\
	% STAT1(len = 11 bp)/$10^{-3}$ & 60 & 65/30 \\
	% CTCF(len = 19 bp)/$10^{-5}$ & 12 & 40/6 \\
	% CTCF(len = 19 bp)/$10^{-4}$ & 90 & 50/10 \\
	% CTCF(len = 19 bp)/$10^{-3}$ & 720 & 90/35 \\
	% \label{lab_projects_pwmscan_table}
	% \end{tabular}
	% \end{center}

	The problem of pattern-matching using a matrix is considered an important problem and many algorithms and program have been developed. The challenge comes from the fact that this problem should be solved in a time and memory efficient manner but also because it poses a more fundamental problem : most of the prediction turn to be false positives.

	The most straight forward approach to this problem is the sliding window approach. This way of doing is implemented by matrix\_scan, Patser \citep{hertz_identification_1990}, PossuMSearch \citep{beckstette_fast_2006}, RSAT matrix-scan (later on simply referred to as RSAT, \citep{turatsinze_using_2008}), HOMER \citep{heinz_simple_2010} or FIMO \citep{grant_fimo:_2011}. matrix\_scan implements a permutated lookahead (LA) search scheme \citep{pizzi_fast_2008}, allowing to eliminate sub-sequences that do not match the matrix faster than a brute force scoring method. PoSSuMSearch implements the use of LA together with the use of enhanced suffix arrays (ESA). This is a two step method. First the genome is indexed, to stored it using ESA and dumped to file. Then, the ESA are searched for matches using a matrix, allowing a sub-linear search time. However, PoSSuMsearch also allows a simple window sliding scoring using LA.

	These programs also generally compute a p-value associated with each score $s$. The p-value is the probability that a random sequence get at least a score as good as $s$. FIMO also perform an expensive False Discovery Rate (FDR) estimate in order to correct the pvalues for multiple testing. On its side, STORM \citep{schones_statistical_2007} uses a database of regulatory sequences to count the expected number of occurrences of kmers to compute a p-value. STORM creates a so called ($g$,$k$) table in order to store the number of occurrences of all possible kmers of size $k$ with a maximum gap of size $g$ present in a sequence database.

	Because both statistical and run-time efficiency aspects of algorithms have been subject of intense work, we decided to assessed the performances of PWMScan both in term of speed than in terms of scoring similarity with other well established methods.

	Also, for completness, it is worth mentioning that other programs have been released, such as MotifScanner \citep{aerts_toucan:_2003}, MotifViz \citep{fu_motifviz:_2004} or TRED \citep{zhao_tred:_2005}. However they could not be included in the benchmark for diverse reasons. MotifScanner could only be used through a GUI client connected to a (offline?) server, making it unsable for batch analyses. MotifViz relied on deprecated 32bits libraries and TRED input sequence size was limited to 10kb.

	The runtimes of PWMScan and of other competitor programs were measured by scanning the human genome (UCSC assembly hg19) with two different PWMs from JASPAR, STAT1 with length 11 bp and CTCF with length 19 bp, and different cut-off values expressed as P-values. Results are shown in Figure \ref{lab_projects_pwmscan_benchmark} and and Table \ref{lab_projects_pwmscan_benchmark_table}. PoSSuMSearch was used with its sliding window strategy because to use of ESA, on an already indexed genome, turned out to be slower (not shown). I hypothesize that it is due to intense I/O operations needed to search the ESA stored on the hard-drive rather than in memory. Also, note that for longer motifs and higher P-values, the Bowtie-based approach becomes inefficient, whereas matrix\_scan remains reasonably fast (not shown).

	\subsection{Conclusions}

	We developed PWMScan, a program that searches matches inside a genome using a PWM. We implemented two alternative search strategies : i) a traditional sliding window approach and ii) an innovative matching sub-sequences mapping using Bowtie. Both strategies are competitive in terms of speed and score similarity compared to other existing programs. Furthermore, our window sliding approach turned out to be fastest of all.

	Additionally, PWMScan is available as a standalone distribution as well as through a dedicated web-server with access to dozens of species genomes and thousand of PWMs from the most popular databases.

	To conclude, PWMScan is an asset for the research community because of its simplicity of use and of its high performances.

	\clearpage

	\section{SPar-K}

	This section describes SPar-K (Signal Partitioning with K-means), a modification of the K-means algorithm to cluster genomic regions based on their chromatin organization, defined by by their sequencing profiles.

	My contribution to this work is major. I developed the algorithm, implemented it and run all the benchmark and other analyses.

	The content of this section is taken an adapted from the original article \citep{groux_spar-k:_2019}. I produced all the figures presented in this section.

	\subsection{Introduction}
	The sequence-specific binding of transcription factors (TFs) to target sites and the associated changes in chromatin state are considered key events in gene regulatory processes that turn on or off genes. The advent of powerful assays to characterize chromatin features genome-wide (e.g. ChIP-seq, DNase-seq, MNase-seq) and the large-scale application of these technologies to many cell types by consortia such as ENCODE \citep{consortium_integrated_2012} have created a wealth of data. To gain knowledge from such data, it is common to analyze positional correlations between chromatin features, e.g. the position of nucleosomes (revealed by MNase-seq) relative to transcription factor binding regions (mapped by ChIP-seq). However, as noted in \citep{kundaje_ubiquitous_2012}, chromatin patterns tend to be heterogeneous and often asymmetric. The latter is a problem if the orientation of a given functional region is not defined by experiments. Furthermore, limited mapping precision leading to moderate misalignment between functionally equivalent regions can also obscure a chromatin pattern.\\
	Several methods and software tools have been developed for discovering chromatin pattern by clustering and/or realignment of signal profiles for genomic regions, including ChromaSig, CATCHProfiles, CAGT and ChIP-Partitioning, see \citep{nair_probabilistic_2014} and references therein. All of these programs have some limitations. Some do not realign, others are restricted to count data or lack an efficient implementation. To fill this gap, we developed SPar-K (Signal Partitioning with K-means).

	% figure 1 from article
	\begin{figure}
	\centerline{\includegraphics[scale=0.4]{images/ch_group_projects/spark_figure1.pdf}}
	\caption{Partitioning of DNaseI hypersensitivity profiles around SP1 binding sites in K562 cells. The optimal number of clusters was determined using the elbow method (Figure \ref{fig_s07}). \textbf{A.} Input data based on peak summits provided by ENCODE. \textbf{B.} Same regions clustered, re-aligned and oriented by SPar-K. Clusters 1, 2 and 3 are indicated by colored bars in red, blue, and green, respectively. \textbf{C.} MNase-seq read densities for the same regions, ordered, aligned and oriented as in B. \textbf{D.} Predicted SP1 binding motifs for the same regions, ordered, aligned and oriented as in B. \textbf{E.} Proportion of binding sites within each cluster having a confirmed promoter-associated TSS within +/- 300bp. \textbf{F.} Aggregations profiles for DNase-seq (red), MNase-seq (blue), promoter TSS (green) and CAGE-seq data (violet) for cluster 2 (aligned and oriented as in B). \textbf{G.} Motifs found by MEME-ChIP and Tomtom in the narrow footprints of each cluster. (*) known SP1 interactor, (c) central enrichment. Cluster 2 left and right refer to the left and right footprints seen in \textbf{B}.
	Figure and legend taken and adapted from \citep{groux_spar-k:_2019}.}
	\label{lab_projects_spark_dnase}
	\end{figure}


	\subsection{Methods}
	SPar-K is a modified K-means algorithm which iteratively partitions, aligns and orients a set of genomic regions based on their signal density profiles obtained by a chromatin profiling assay (or in some other way). SPar-K is implemented as a stand-alone, fully multithreaded, C++ program. A brief description of its overall design follows. Details can be found in the Supplementary Material where all the algorithms are described.\\
	Input data are the signal over $N$ regions of length $L$ stored as a $N$ rows and $L$ columns matrix. The signal resolution may be at single-base or at a larger bin size. The regions are typically defined by relative positions to an anchor point, e.g. a ChIP-seq peak summit.\\
	The user sets via command-line options the number of clusters to find $K$, the shifting freedom $S$, whether flipping is allowed, the maximum number of iterations, whether outlier smoothing should be performed, the seeding strategy, and optionally a seed for the random number generator. The shifting freedom is specified as the maximal number of positions two regions may be shifted relative to each other ($S$), or pattern width ($W$=$L$-$S-1$). The partitioning is initialized by choosing $K$ rows from the input matrix according to the specified seeding strategy. These regions define the initial class centers (signal profiles). During iterative optimization, each region is compared to the current class centers in all possible alignments (shifting states) and orientations and assigned to the closest one in terms of correlation distance. The class centers are then updated by aggregation of the optimally aligned regions of its members. The process stops after a convergence criterion or the maximal number of iterations is reached.\\
	The program returns a table listing for each region the cluster assignment, the shift state and the orientation. The software distribution includes R scripts for visualizing the data as heatmaps as shown in Figure \ref{lab_projects_spark_dnase}.

	% supplemental figure 2 from article
	\begin{figure}
	\centerline{\includegraphics[scale=0.4]{images/ch_group_projects/spark_supplemental_figure2.pdf}}
	\caption{\textbf{Clustering accuracy using random seeding :} to compare the clustering accuracies of the different methods, several simulated dataset containing 3 classes, different coverages (10, 50 and 100 reads per region indicated as "cov10", "cov50" and "cov100") and noise proportions (no noise, 10\% noise, 50\% noise and 90\% noise indicated as "0.0", "0.1", "0.5" and "0.9") were generated. Each dataset was clustered 50 times with each method. The Adjusted Rand Index (ARI) was computed for each partition. The ARI values are displayed as boxplots. SPar-K and ChIPPartitioning were run allowing flipping and shifting. The ARI was measured on each of the resulting data partitions. For SPar-K, "smooth" indicates outlier smoothing. For the regular K-means, "eucl." and "corr." refer to the euclidean and correlation distances. "R" stands for "random" and indicates the ARI values obtained when comparing the true cluster labels with a randomly shuffled version of it, 100 times.
	Figure and legend taken and adapted from \citep{groux_spar-k:_2019}.}
	\label{lab_projects_spark_ari}
	\end{figure}

	% supplemental figure 4 from article
	\begin{figure}
	\centerline{\includegraphics[scale=0.4]{images/ch_group_projects/spark_supplemental_figure4.pdf}}
	\caption{\textbf{Median SSE :} for the simulated ChIP-seq dataset containing 3 classes, with coverage 100 and no noise, partitioned into 2 to 5 clusters. To judge whether the elbow method could be used to estimate the optimal number of clusters, this dataset was partitioned with SPar-K, allowing flip and shifting, into 2 to 5 clusters, 50 times for each set of parameters. For each number of clusters, the median SSE is shown, +/- 1 standard deviation (bars). \textbf{A} Seeding done at random, \textbf{B} seeding done at random and outlier smoothing \textbf{C} seeding done with the K-means++ method \textbf{D} seeding done with the K-means++ method and outlier smoothing. In all cases, the optimal number of clusters seemed to be 3 (which was the expected value).
	Figure and legend taken and adapted from \citep{groux_spar-k:_2019}.}
	\label{lab_projects_spark_sse}
	\end{figure}

	% supplemental figure 4 from article
	\begin{figure}
	\centerline{\includegraphics[scale=0.4]{images/ch_group_projects/spark_supplemental_figure5.pdf}}
	\caption{\textbf{Running times :} to compare the run times of each program, the synthetic dataset with coverage 100 and no noise was partitioned 20 times with each program. The run times (wall clock) in second were measured. For all SPar-K and the regular K-means, the partitions were initialized using a random and K-means++ (indicated as "k++"). For ChIPPartitioning, only a random seeding was used. The partitions were then optimized for 30 iterations at most. For SPar-K and ChIPPartitioning, a shifting of 71 bins and flipping were allowed. For SPar-K, only one thread was used and "smooth" indicates outlier smoothing. For the regular K-means, "eucl." and "corr." refer to the euclidean and correlation distances.
	Figure and legend taken and adapted from \citep{groux_spar-k:_2019}.}
	\label{lab_projects_spark_time}
	\end{figure}


	\subsection{Results}
	First we compared SPar-K, regular K-means and ChIP-partitioning on synthetic datasets (see Supplemental Figure 1 in \citep{groux_spar-k:_2019}) exhibiting properties that are plausible for ChIP-seq profiles for genomic regions: count vectors, 3 classes with different shapes, limited misalignment, random orientation and various amounts of background noise. Performance was assessed by the Adjusted Rand index (see Figure \ref{lab_projects_spark_ari} and Supplemental Figure 3 in \citep{groux_spar-k:_2019}) and the optimal number of classes was estimated by the elbow method (Figure \ref{lab_projects_spark_sse}). As expected, regular K-means performed poorly. On the contrary, SPar-K was equally accurate as ChIP-Partitioning except for the lowest coverage class. Considering speed, Spar-K outperformed ChIP-partitioning by a factor of at least 20 (Figure \ref{lab_projects_spark_time}).\\
	We applied SPar-K with $K=3$ to DNaseI accessibility profiles (2bp resolution) around 7'206 ChIP-seq SP1-binding peaks (+/-300bp relative to peak summit) in K562 cells (Figure \ref{lab_projects_spark_dnase}A). The results revealed the presence of clear footprints in all the clusters (Figure \ref{lab_projects_spark_dnase}B). To validate these footprints, we checked whether they are consistent with to location of nucleosomes (Figure \ref{lab_projects_spark_dnase}C) and SP1 binding motifs (Figure \ref{lab_projects_spark_dnase}D), which was indeed the case. De novo motif analysis of the narrow footprints seen in Figure \ref{lab_projects_spark_dnase}B with MEME-ChIP and Tomtom \citep{bailey_meme_2009} discovered SP1-related, NFYA/B and GATA motifs (Figure \ref{lab_projects_spark_dnase}G) the latter two reportedly being interaction partners of SP1. Taken together, these results suggest that SPar-K is able to precisely refocus initially misaligned DNaseI profiles around SP1 binding sites.\\
	The partitioning of SP1 binding regions reveals distinct chromatin landscapes. Cluster 1 (red) groups binding sites lying between two closely spaced nucleosomes. Cluster 2 (blue) shows strong asymmetry suggestive of promoter regions, an interpretation supported by the presence of promoter-associated transcription starts sites (TSSs) and CAGE tags (Figure \ref{lab_projects_spark_dnase}E and F). Finally, the symmetrical cluster 3 (green) contains binding sites located on large nucleosome-free regions reminiscent of enhancer regions.\\
	As a second example, we ran the same type of analysis on nucleosome profiles around CTCF binding sites (see Supplemental Figure 8 in \cite{groux_spar-k:_2019}). Overall, the results confirm observations published by \citep{kundaje_ubiquitous_2012}. Strong nucleosome arrays became visible in all classes after realignment, with three out of four showing strong asymmetry in addition.


	\subsection{Conclusion}
	SPar-K is a useful partitioning method for moderately misaligned and randomly oriented chromatin regions. Compared to existing methods, it is competitive in terms of accuracy, superior in speed, applicable to a wider range of input signals (not restricted to count data) and easier to use.

ch_group_projects.texNo OneTemporaryActions

File Metadata

ch_group_projects.texView Options

Event Timeline

ch_group_projects.tex
No OneTemporary
Actions

ch_group_projects.tex
View Options