Page MenuHomec4science

ch_pwmscan.tex
No OneTemporary

File Metadata

Created
Sun, Apr 28, 02:51

ch_pwmscan.tex

\cleardoublepage
\chapter{PWMScan}
\label{pwmscan}
\markboth{PWMScan}{PWMScan}
\addcontentsline{chapter}{toc}{PWMScan}
In this chapter, I describe PWMScan, a software that has been developed to predict TF binding sites in a genome, given a binding specificity model. The problem of pattern-matching using a specificity model is considered an important problem and many algorithms and programs have been developed. The challenge comes from the fact that this problem should be solved in a time and memory efficient manner in order to allow large eukaryotic genomes to be scanned with complex patterns.
PWMScan is a web-server for rapid scanning of large genomes for high-scoring matches to a user-supplied or server-resident PWM. Compared to other web-based PWM scanning tools, PWMScan is unique in that it scans server-resident whole genomes rather than user-uploaded DNA sequences. Other key features are: i) menu-driven access to genomes of >30 model organisms, ii) menu-driven access to >300 public PWM libraries, iii) support of various PWM representations and formats, iv) cut-off values can be specified as match scores or P-values, v) output in BEDdetail format with match scores and P-values, vi) links to UCSC genome browser for visualization of results, and vii) action buttons to transfer match lists to analysis tools.
This work has been conducted in close collaboration with Giovana Ambrosini, a senior scientist of the laboratory. Some parts of this chapter have been taken and adapted - with the author permission - from the original article that presents this work \citep{ambrosini_pwmscan:_2018}. For this project, I contributed to the development of the matrix\_scan program, the development of the necessary Python code for an efficient parallelization of matrix\_scan on the server backend and I set up and executed the entire benchmarking procedure. Furthermore, I produced all the figures presented in this chapter.
% article Figure 1
% \begin{figure}[!htbp]
% \begin{center}
% \includegraphics[scale=0.4]{images/ch_lab_resources/pwmscan_fig1.png}
% \captionof{figure}{Screen shot of the PWMScan graphical user interface.}
% \label{lab_resources_gui}
% \end{center}
% \end{figure}
\section{Algorithms}
From an algorithmic perspective, PWMScan is dual. It implements a regular genome scanner and also implements a novel approach that uses short read alignment to find PWM matches.
\subsection{Scanner algorithm}
The program matrix\_scan implements a scanner (or sliding window) searching algorithm. It takes as input a DNA sequence of length $L$, a PWM of dimensions $L'\times4$ where $L' \leq L$ and a threshold score $t$. For each possible offset $i = 1, 2, ..., L-L'+1$, the sub-sequence $S_{i} = S_{i,i+1,...,i+L'-1}$ is given a score $score(S_{i})$ using equation \ref{intro_eq_score_pwm}. If $score(S_{i}) \geq t$ then the sub-sequence $S_{i}$ is predicted to be a binding site. This naive approach is $\theta((L-L'+1)*L')$ in time. Even though this is not bad in terms of asymptotics analysis, many unnecessary computations are performed. Indeed, during the score computations for a sub-sequence $S_{i}$, it is possible to define a point after which the current score has become so bad that it will automatically turn to be $<t$. From that moment, the computations for the current sub-sequence $S_{i}$ can be dropped and the next sub-sequence $S_{i+1}$ can be tested.
This only requires to modify the PWM such that the maximum possible score achieved at each position is 0. To do so, the values on each column should be substracted the maximum for that column. The threshold score $t$ should modified into $t'$ by subtracting the sum of the column maxima. This scheme is called 'lookahead' (LA, \cite{pizzi_fast_2008}). LA can further be improved by redefining the position scoring order. The optimal position scoring order is to start first by the position in the PWM which can achieve the worst score, then by the PWM position achieving the second worst score and so one until the last PWM position that achieves the best possible score. By doing this, the score can drop under $t'$ faster. This scheme is called permutated LA (PLA, \cite{pizzi_fast_2008}). Even though PLA does not modify the algorithm complexity, in average this algorithm is faster than the naive one as it can avoid unnecessary computations.
% bowtie
\subsection{Matches enumeration and mapping}
The second approach used by PWMScan is drastically different. It also takes as input a DNA sequence $S$ of length $L$, a PWM of dimensions $L'\times4$ where $L' \leq L$ and a threshold score $t$. Then, all the k-mer of length $L'$ achieving a score higher than $t$ are enumerated and mapped against $S$ using a short read mapper - in this case Bowtie \citep{langmead_ultrafast_2009}.
Enumerating all possible L-mer using a naive algorithm is $\theta(4^{L'})$ in time, which really quickly becomes cumbersome to handle. However, here again, many unnecessary computations can be avoided.
Let us assume a LA scheme as before. When scoring a sub-sequence $S_{i}$ from left to right, the score monotonically decreases. If at some position $j$ the score falls under the threshold score $t'$ then the sub-sequence $S_{i} = S_{i,i+1,...,i+L'-1}$ is guaranteed to be rejected. By extension, this means that any sub-sequence that starts with a prefix $S_{i,i+1,...,i+j-1}$ are also guaranteed to be rejected. If the k-mers are stored in a trie, then it would be sufficient to ignore the sub-tree of $S_{i,i+1,...,i+j-1}$.
The program mba (for branch and bound) implements thisk-mer enumeration strategy. The k-mers are enumerated by alphabetical order which is the same as exploring the k-mer trie in breadth-first order. If a sub-sequence $S_{i}$ is rejected by dropping the computations at step $j$, the program jumps to the next sub-sequence $S_{i'}$ that does not start with the same prefix and continues the enumeration here. This is equivalent to ignore the entire prefix sub-tree in the trie.
Eventually, the enumerated k-mers are all the possible different matches that can be achieved with the given PWM and threshold score. Their locations in the genome are identified by a read mapping procedure using Bowtie, with $S$ as the reference genome.
\section{PMWScan architecture}
% flowchart present in supplemental material
\begin{figure}
\begin{center}
\includegraphics[scale=0.1]{images/ch_lab_resources/pwmscan_flowchart.png}
\captionof{figure}{\textbf{PWMScan workflow :} the input is composed of a PWM and a score threshold specifying the minimum score for a sequence to achieve to be considered as a match. LPMs or LFMs are also accepted and are converted into PWMs. The score threshold can also be given as a p-value or a percentage of the maximum score, in which case it is converted into a threshold score. Based on the length of the PWM, Bowtie or pwm\_scan can be used to find the matches on the genome. If Bowtie is used, the set of k-mers achieving a better score than the threshold score is computed using branch-and-bound algorithm (mba) and mapped on the genome. On the other hand, if matrix\_scan is used, the PWM is used to score every possible sub-sequence in the genome. The regions corresponding to the sequences achieving a score at least as good as the threshold score are then returned under BED format.
Figure and legend taken and adapted from \citep{ambrosini_pwmscan:_2018}.}
\label{lab_resources_pwmscan_pipeline}
\end{center}
\end{figure}
PWMScan architecture and design is presented in Figure \ref{lab_resources_pwmscan_pipeline}.
% genomes
Several genomes are supported by PWMScan and are offered to the user on the web interface. The different specie genome sequences were downloaded from the NCBI in FASTA format and indexed for rapid scanning using Bowtie \citep{langmead_ultrafast_2009}.
% matrix collections
PWMScan web interface offers several PWM collections to scan the available genomes. Each collection was downloaded from the MEME Suite website \citep{bailey_meme_2009} and all matrices, of any type, were converted to integer PWMs (see Supplementary Material in \citep{ambrosini_pwmscan:_2018}).
% score
For all matrix formats, the cut-off value can be specified as a PWM score, as a percentage of the score range (0\% = minimal score and 100\% = maximal score) or as a p-value. The p-value of a PWM score $X$ is defined as the probability that a random k-mer sequence of the same length as the PWM has a binding score $\geq X$ given the base composition of the genome. In the two latter cases, the corresponding cut-off value is computed.
It should be noted that IUPAC consensus sequences are also accepted instead of a matrix. In that case, they are converted into binary matrices consisting of 0 and 1. However, in this case, the threshold score is specified as the maximal number of mismatches allowed.
Once the appropriate format conversions have been applied to the inputs, the integer PWM and the corresponding cut-off score are given to the search engine. The choice between the scanner approach and the read mapping approach depends on the length of the PWM and the cut-off. As a matter of fact, the read mapping strategy is more efficient for short PWMs and high cut-off values. Indeed, in this case, the k-mers enumeration step is really fast. Empirically, we found that the limit was around $10^{5}$ kmers. Above this limit, the scanner strategy is used. In this case, the individual chromosomes are processed in parallel with each chromosome sequence being distributed to a core by a Python script.
% output
Eventually, the output is a list of sequence regions that match the PWM with a score higher or equal to the cut-off value. Post-processing of this list involves computation of the corresponding p-values, addition of the matrix name and, optionally, elimination of overlapping matches. The final match list is provided in BEDdetail format.
On the web interface, PWMScan is interconnected with i) the ChIP-seq \citep{ambrosini_chip-seq_2016} and ii) the SSA \citep{ambrosini_signal_2003} servers which allows to proceed to downstream analyses. Additionally, different action buttons are present on the result page and allows to i) extract the DNA sequences around the matches, ii) send the match coordinates to UCSC genome browser for visualization and iii) liftover the match coordinates to other assemblies of the same or related species.
PWMScan is also available as a standalone software from SourceForge. This includes a master script scheduling all computational steps running during a web job.
\section{Benchmark}
% supplemental figure 3 in article
\begin{figure}
\begin{center}
\includegraphics[scale=0.2]{images/ch_lab_resources/pwmscan_figure_s1.png}
\captionof{figure}{\textbf{Benchmark :} PWMScan speed performances were measured and compared with 6 other well known genome scanners. In all cases, the h19 genome sequence was scanned with a 19bp CTCF matrix and a 11bp STAT1 matrix, 10 times. The run times are represented as boxplots. For PWMScan, both pwm\_scan and Bowtie strategies were run.
Figure and legend taken and adapted from \citep{ambrosini_pwmscan:_2018}.}
\label{lab_resources_pwmscan_benchmark}
\end{center}
\end{figure}
\begin{table}
\begin{center}
\begin{tabular}{| c | c | c | c | c | c | c |}
\hline
Program & \multicolumn{3}{|c|}{CTCF} & \multicolumn{3}{|c|}{STAT1} \\
\hline
& \thead{Nb of\\matches} & \thead{Perc.\\overlap} & \thead{Score corr.} & \thead{Nb of\\matches} & \thead{Perc.\\overlap} & \thead{Score\\corr}. \\
\hline
matrix\_scan & 70607 & 100 & 1 & 138531 & 100 & 1 \\
\hline
Patser & 70607 & 100 & 1 & 138531 & 100 & 1 \\
\hline
PossumSearch & 70607 & 100 & 1 & 138531 & 100 & 1 \\
\hline
FIMO & 70131 & 99.16 & 0.9989 & 134901 & 100 & 1 \\
\hline
RSAT & 70701 & 99.85 & 0.9999 & 134901 & 100 & 1 \\
\hline
HOMER & 70701 & 99.95 & 0.9999 & 134901 & 100 & 1 \\
\hline
STORM & 70704 & 99.83 & 0.9999 & 134901 & 100 & 0.9999 \\
\hline
\end{tabular}
\caption{\textbf{Motif scanning software comparison}. The performances of matrix\_scan were assessed by comparing how many of the regions returned by matrix\_scan were also returned by other programs and if the region scores were comparable. For the percentage of overlap with the match list returned by matrix\_scan, the shortest of the two lists always served as the reference (100\%). For the score correlations with matrix\_scan scores, the Spearman correlation was used.
Table and legend taken and adapted from \citep{ambrosini_pwmscan:_2018}.}
\label{lab_resources_pwmscan_benchmark_table}
\end{center}
\end{table}
% article Table 1
% \begin{center}
% \captionof{table}{Speed is expressed in seconds. The benchmarking tests have been run on a Linux/CentOS7/x86\_64 workstation with 48 CPU-cores and 256 GB of DRAM.
% The performance measurements for matrix\_scan were achieved using 1 (left value) and 10 (right value) CPU-cores in parallel.}
% \begin{tabular}{ c c c }
% PWM/P-value & Bowtie speed & Matrix\_scan speed \\
% \hline
% STAT1(len = 11 bp)/$10^{-5}$ & 3 & 30/5 \\
% STAT1(len = 11 bp)/$10^{-4}$ & 8 & 40/8 \\
% STAT1(len = 11 bp)/$10^{-3}$ & 60 & 65/30 \\
% CTCF(len = 19 bp)/$10^{-5}$ & 12 & 40/6 \\
% CTCF(len = 19 bp)/$10^{-4}$ & 90 & 50/10 \\
% CTCF(len = 19 bp)/$10^{-3}$ & 720 & 90/35 \\
% \label{lab_resources_pwmscan_table}
% \end{tabular}
% \end{center}
% The problem of pattern-matching using a matrix is considered an important problem and many algorithms and program have been developed. The challenge comes from the fact that this problem should be solved in a time and memory efficient manner but also because it poses a more fundamental problem : most of the prediction turn to be false positives.
Many algorithms have been developed to find matches in genome using matrices. The most straight forward approach to this problem is the scanner approach. This way of doing is implemented by matrix\_scan, Patser \citep{hertz_identification_1990}, PossuMSearch \citep{beckstette_fast_2006}, RSAT matrix-scan (later on simply referred to as RSAT, \citep{turatsinze_using_2008}), HOMER \citep{heinz_simple_2010} or FIMO \citep{grant_fimo:_2011}. PoSSuMSearch implements both the use of LA scheme and the use of enhanced suffix arrays (ESA) to store the indexed genome and search matches in sub-linear time.
These programs also generally compute a p-value associated with each score. FIMO also performs an expensive false discovery rate (FDR) estimate in order to correct the p-values for multiple testing. On its side, STORM \citep{schones_statistical_2007} uses a database of regulatory sequences to count the expected number of occurrences of kmers to compute a p-value. STORM creates a so called ($g$,$k$) table in order to store the number of occurrences of all possible kmers of size $k$ with a maximum gap of size $g$ present in a sequence database. Thus, it can handled gaped matches.
Also, for completness, it is worth mentioning that other programs have been released, such as MotifScanner \citep{aerts_toucan:_2003}, MotifViz \citep{fu_motifviz:_2004} or TRED \citep{zhao_tred:_2005}. However they could not be included in the benchmark for diverse reasons. MotifScanner could only be used through a GUI client connected to a (offline?) server, making it unsable for batch analyses. MotifViz relied on deprecated 32bits libraries and TRED input sequence size was limited to 10kb.
To assess how these programs perform compared to each other, I considered the run-time efficiency aspect but also, as important, the match consistency between programs.
In order to perform meaningful comparisons, each program options were set such that each program returned a number of matches as close as possible to PWMScan. A match was considered common between two programs if the starting positions reported were equal. FIMO, Patser, RSAT and HOMER positions were modified by '-1' to account for a constitutive shift with PWMScan reported positions. The proportion of overlap was then computed by dividing the number of common match by the length of the longest match list from either programs. Then,
a Spearman correlation was computed between the scores of the matches common to both programs to assess the scoring consistency.
The runtimes of PWMScan and of other competitor programs were measured by scanning the human genome (UCSC assembly hg19) with two different PWMs from JASPAR, STAT1 with length 11 bp and CTCF with length 19 bp, and different cut-off values expressed as p-values. Results are shown in Figure \ref{lab_resources_pwmscan_benchmark} and and Table \ref{lab_resources_pwmscan_benchmark_table}. PoSSuMSearch was used with its sliding window strategy. To our surprise, PWMScan was the fastest of the method. On the other hand, STORM was by far the slowest. This may be caused by STORM's ability to handle gaped matches. PWMScan also achieved a high similarity in terms of matches with the other programs both in terms of overlap between matches than in terms of score similarity. The differences observed cannot be explained with certainty. They could be explained, for instance, by differences between integer and double arithmetic or by differences in the handling of corner cases between different programs.
\section{Conclusions}
We developed PWMScan, a program that searches matches inside a genome using a PWM. We implemented two alternative search strategies : i) a traditional sliding window approach and ii) an innovative match mapping using Bowtie. Both strategies are competitive in terms of speed and score similarity compared to other existing programs. Furthermore, our window sliding approach turned out to be fastest of all.
PWMScan is meant to support many types of genomic data analysis and designed to be interoperable with other tools from our group and elsewhere. An example of a typical workflow involving ChIP-seq data is presented in Supplementary Material in \citep{ambrosini_pwmscan:_2018}. Additionally, PWMScan is available as a standalone distribution.
In short, PWMScan is an asset for the research community because of its simplicity of use and of its high performances.

Event Timeline