The Computational Cancer Genomics (CCG) laboratory has developed and maintains two important in-house databases.
The first is the Mass Genome Annotation (MGA) repository, a collection of publicly available NGS datasets. Because publishing in a peer-reviewed journal now generally requires releasing the underlying data, authors have to deposit them in a primary repository such as GEO \citep{barrett_ncbi_2013} or ArrayExpress \citep{athar_arrayexpress_2019}. As a result, tremendous amounts of data are released. As part of an effort to enhance the reproducibility and reusability of these data, the CCG laboratory developed and maintains the MGA repository, which has been designed to store publicly available NGS data in a highly standardized manner, with high-quality annotation (metadata).
The second database is the Eukaryotic Promoter Database (EPD). EPD is a long-standing resource containing a catalog of curated eukaryotic RNAPII TSSs. EPD was initiated by Bucher and Trifonov from the manual curation of experimental data \citep{bucher_compilation_1986}. From its beginning, EPD was designed as a sequence annotation that indicates, for a given sequence, which positions are used to initiate transcription. With the advent of high-throughput transcription profiling assays, such as CAGE and GRO-seq, TSS identification and mapping have changed drastically. Consequently, EPDnew, a new dedicated branch of EPD, was created several years ago \citep{dreos_epd_2013} to make full use of these new data.
\section{Mass Genome Annotation repository}
\label{section_mga}
This section describes the organization and content of the MGA repository, which has been described previously \citep{dreos_mga_2018}. This work was mostly undertaken by René Dreos, a postdoctoral fellow of the CCG laboratory. My involvement in this project was related to the processing, curation and annotation of some datasets, such as the zinc finger ChIP-seq dataset released by Imbeault and colleagues \citep{imbeault_krab_2017}.
The content of this section has been taken and adapted, with the authors' permission, from \citep{dreos_mga_2018}.
\captionof{figure}{\textbf{Content of the MGA repository by 2018} \textbf{A} Proportion of samples in the database grouped by type. \textbf{B} Proportion of samples grouped by organism. Assemblies belonging to the same organism are merged together. \textbf{C} Sample numbers stratified by type and organism. Dot areas are proportional to the total number of samples in that category. The corresponding numbers can be found in a weekly updated table posted on the MGA home page at \url{http://ccg.vital-it.ch/mga}.
Figure and legend taken and adapted from \citep{dreos_mga_2018}.}
\label{lab_resources_mga_stats}
\end{center}
\end{figure}
Currently, the MGA contains more than 24'000 samples from 15 different species: human, mouse, rat, macaque, dog, chicken, zebrafish, worm, fruit fly, bee, Arabidopsis, corn, Plasmodium, baker's yeast and fission yeast. In all species except human, mouse, fruit fly and worm, the data are mapped to a single genome assembly, called the primary assembly. The hosted samples include landmark datasets such as ENCODE \citep{consortium_integrated_2012}, RoadMap \citep{roadmap_epigenomics_consortium_integrative_2015} and FANTOM5 \citep{lizio_gateways_2015}. Each sample in the MGA belongs to one of the 13 mandatory data categories:
\begin{enumerate}
\item ChIP-seq: raw data (read mapping coordinates) from classical ChIP-seq experiments targeting transcription factors, protein-DNA interactions, histone variants and modifications, etc.
\item ChIP-seq-invitro: raw data (read mapping coordinates) from in vitro ChIP-seq experiments such as DAP-seq.
\item ChIP-seq-peak: peak regions provided by the authors of the data.
\item Transcript Profiling: raw data from experiments aimed at profiling transcript initiation such as CAGE, GRO-cap, GRO-seq, PEAT, etc.
\item DNase FAIRE etc.: raw data from chromatin and chromatin accessibility studies such as MNase-seq, DNase-seq, DNase-hypersensitivity, etc.
\item DNA methylation: raw data from methylation studies.
\item Sequence derived: PWM matches, Natural Variants, Conservation scores from PhastCons \citep{siepel_evolutionarily_2005} and PhyloP \citep{pollard_detection_2010}, etc.
\end{enumerate}
All the data available on the MGA are stored in the Simple Genome Annotation (SGA) format \citep{ambrosini_chip-seq_2016}. SGA is a single-coordinate format: every feature is represented by a single position along the genome. ChIP-seq peaks and paired-end sequenced fragments are represented by their middle position, single-end sequenced reads by their 5' end, TSSs are single-base coordinates anyway, and so on. This representation also minimizes the disk space required to store the data. In any case, SGA-formatted data can easily be converted to BED or GFF using dedicated conversion tools \citep{ambrosini_chip-seq_2016}.
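The reduction of an interval to a single coordinate can be illustrated with a minimal Python sketch. It is not the MGA conversion code: the \texttt{Interval} type, the \texttt{single\_coordinate} function and the \texttt{kind} labels are hypothetical names introduced here, and the sketch assumes BED-style input (0-based, half-open) and 1-based output positions.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A genomic interval using BED conventions (assumed for this sketch)."""
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    strand: str  # "+", "-" or "." when unoriented


def single_coordinate(iv: Interval, kind: str) -> int:
    """Return one 1-based position summarizing the interval.

    kind == "read": 5' end of a single-end read (depends on the strand).
    kind == "fragment" or "peak": middle position of the interval.
    """
    if kind == "read":
        if iv.strand == "+":
            return iv.start + 1  # first base of the read
        if iv.strand == "-":
            return iv.end        # 5' end lies at the rightmost base
        raise ValueError("a read needs a '+' or '-' strand")
    if kind in ("fragment", "peak"):
        # Middle of the 1-based inclusive interval [start + 1, end].
        return (iv.start + 1 + iv.end) // 2
    raise ValueError(f"unknown kind: {kind}")


read_fwd = Interval("chr1", 100, 136, "+")
read_rev = Interval("chr1", 100, 136, "-")
peak = Interval("chr1", 100, 200, ".")
positions = (
    single_coordinate(read_fwd, "read"),
    single_coordinate(read_rev, "read"),
    single_coordinate(peak, "peak"),
)
```

Whatever the original feature length, each record reduces to one position, which is why the representation is compact.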
To maximize the reproducibility of the original results, read alignment files in BED or BAM format are always preferred when they are available in the primary repositories. Otherwise, the raw sequencing data are downloaded and processed using a general pipeline comprising i) read mapping using Bowtie \citep{langmead_ultrafast_2009} or Bowtie 2 \citep{langmead_fast_2012}, ii) conversion to BED using the SAMtools \citep{li_sequence_2009} and BEDTools \citep{quinlan_bedtools:_2010} suites and iii) a final conversion to SGA using the ChIP-seq server conversion tools \citep{ambrosini_chip-seq_2016}. As in GEO, the data (called samples) for a given study or article belong to the same series. Finally, the metadata are created. A full description of the data, their biological significance and their processing is available in HTML format. Two additional machine-readable text files are available with i) the sample information and ii) the series information.
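The first two steps of such a pipeline could look like the following Python sketch. It only assembles the command lines and does not run them. The options actually used for the MGA are not stated here, so the file names, the \texttt{build\_commands} helper and the choice of Bowtie 2 with single-end reads are assumptions. The final BED to SGA conversion is left out because its invocation is not described in this section.

```python
from pathlib import Path
from typing import List


def build_commands(fastq: str, index_prefix: str, workdir: str) -> List[List[str]]:
    """Assemble (without executing) mapping and BED conversion commands.

    Hypothetical helper: paths and file names are illustrative only.
    """
    out = Path(workdir)
    sam = str(out / "sample.sam")
    bam = str(out / "sample.bam")
    return [
        # i) map single-end reads with Bowtie 2, writing SAM output
        ["bowtie2", "-x", index_prefix, "-U", fastq, "-S", sam],
        # ii) convert SAM to BAM with SAMtools ...
        ["samtools", "view", "-b", "-o", bam, sam],
        # ... then BAM to BED with BEDTools (BED is written to stdout)
        ["bedtools", "bamtobed", "-i", bam],
    ]


commands = build_commands("reads.fastq", "genome_index", "work")
```

Each command list could be passed to \texttt{subprocess.run}, redirecting the output of the last step to a BED file.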
Importantly, the MGA is fully interconnected with the in-house analysis tools hosted on the ChIP-seq \citep{ambrosini_chip-seq_2016} and Signal Search Analysis (SSA) \citep{ambrosini_signal_2003} servers. These servers contain tools to perform peak calling, correlation analyses, sequence analyses, format conversions and much more. All the hosted data can thus be readily analyzed using any of these tools.
\subsection{Conclusions}
The MGA repository is an important asset to the scientific community. Together with the ChIP-seq and SSA servers, it allows anybody to quickly undertake a wide range of data analyses. Additionally, all these data are readily available and can be downloaded in SGA or BED format. Furthermore, on-demand visualization tracks can be created for all the datasets hosted on the MGA. These tracks can then easily be uploaded to the UCSC Genome Browser. Finally, I used MGA-hosted data in the projects described in chapters \ref{encode_peaks}, \ref{smile_seq}, \ref{pwmscan} and \ref{spark}.
\caption{\textbf{Schematic representation of the EPDnew pipeline} \textbf{A} Download of authoritative gene catalogs and primary TSS mapping data from public databases, data repositories and consortium websites. \textbf{B} Quality control (QC) of incoming data (e.g. read mapping efficiency, contaminations, etc.). \textbf{C} Data passing QC are reformatted and incorporated into the MGA repository. \textbf{D} Selection of a subset of TSS mapping experiments for generating a new organism-specific TSS collection. \textbf{E} Input data for a new module of EPDnew. \textbf{F} Organism-specific automatic database assembly pipeline tailored to the input data, see \citep{dreos_epd_2013} for a detailed description of the human EPDnew assembly pipeline. \textbf{G} Preliminary or final TSS collection \textbf{H} Manual sanity checks of individual randomly selected promoter entries using the corresponding entry viewer. \textbf{I} Automatic quality evaluation of the TSS collections as a whole by motif enrichment tests, see Figure \ref{lab_resources_epd_motifs} for an example. \textbf{L} Feedback is collected from quality evaluation steps H and I. This may lead to the exclusion, replacement or addition of source data sets or modifications (e.g. program parameter fine-tuning) of the computational database generation pipeline. Note that the development of a final, publicly released EPDnew module typically involves several evaluation-modification cycles. Figure and legend taken and adapted from \citep{dreos_eukaryotic_2017}.}
\label{lab_resources_epd_pipeline}
\end{center}
\end{figure}
This section recapitulates some of the results published in \citep{dreos_eukaryotic_2017} and in \citep{meylan_epd_2020}. Most of the work presented in this section should be credited to René Dreos and Patrick Meylan, two former post-doctoral fellows of the CCG laboratory.
In essence, each EPDnew release is created using a semi-automated computational pipeline that identifies genomic regions showing high levels of mRNA initiation at the beginning of annotated genes. The input data are subjected to stringent quality control checks before entering the EPDnew pipeline. Additionally, the results are manually verified throughout the entire process, ensuring high curation standards. The entire pipeline is depicted in Figure \ref{lab_resources_epd_pipeline}.
The EPDnew database is dedicated to providing accurate TSS mapping to the research community. EPDnew initially focused on the annotation of animal genomes \citep{dreos_eukaryotic_2015}. However, with the increasing availability and diversity of relevant datasets, the database could be extended to less common, often neglected, species. EPDnew currently includes plant and fungal species. Even if not fully in line with the rest of this work, these species are fully part of EPDnew and should be presented as such.
EPDnew contains computational annotations of genome assemblies from publicly available high throughput 5' mRNA sequencing data. The following sections contain a description of the current state of EPD and of the recent novelties introduced.
The content of this section was taken and adapted, with the authors' permission, from \cite{dreos_eukaryotic_2017} and \cite{meylan_epd_2020}.
\subsection{EPDnew now annotates (some of) your mushrooms and vegetables}
\caption{\textbf{Current contents of EPDnew} 'Promoters' indicates the number of TSS entries in EPDnew. 'Genes' indicates the number of genes having at least one TSS annotated in EPDnew. 'Genes' indicates the number of protein coding genes contained in the genome annotation (except for nc species). 'nc' stands for non-coding and indicates the long non-coding gene annotations. For 'nc' entries, 'genes' refers to the number of long non-coding genes present in the annotation. The percentages of genes having at least one TSS annotated in EPDnew are indicated in parentheses.}
\label{lab_resources_epd_stats}
\end{center}
\end{table}
Over the years, EPDnew has grown substantially, from a promoter collection annotating protein coding genes in five animal model organisms (human, mouse, fruit fly, zebrafish and worm, \cite{dreos_eukaryotic_2015}), to 10 (human, mouse, fruit fly, zebrafish, worm, bee, Arabidopsis, maize, brewer's yeast and fission yeast, \cite{dreos_eukaryotic_2017}) and now 16 organisms, annotating both protein coding and non-coding genes (Table \ref{lab_resources_epd_stats}). The number of genes containing at least one annotated TSS in EPDnew varies among species. However, \textit{H. sapiens}, \textit{D. melanogaster} and \textit{S. pombe} are approaching complete coverage of protein coding genes, with 96\%, 98\% and 94\%, respectively.
\caption{\textbf{TSS mapping precision} Occurrence of the TATA-box \textbf{A} and initiator \textbf{B} motifs around \textit{H. sapiens} TSSs from EPDnew releases 004 and 006 and from the list of gene starts from the UCSC known genes, which was used as input for the generation of the EPDnew collection. This figure was created using Oprof from the SSA server \citep{ambrosini_signal_2003}. Detailed instructions to recreate the figure can be found in section \ref{lab_resources_epd_methods_oprof}.}
\label{lab_resources_epd_motifs}
\end{center}
\end{figure}
The human annotation has been generated from more than 1300 experiments, containing tens of billions of reads. It is currently by far the largest data collection in EPDnew. Importantly, even though the number of reported TSSs has increased compared to what was reported in \citep{dreos_eukaryotic_2017} (25 503 using 1088 datasets vs 29 598 using 1311 datasets), the gene coverage is reaching saturation (95\% vs 96\%). Thus, most of the newly discovered TSSs are alternative TSSs. Nonetheless, the overall TSS mapping precision is still increasing, as shown in Figure \ref{lab_resources_epd_motifs}. This is illustrated using the positional distributions of the TATA-box and initiator motifs, two core promoter elements that are expected to appear at a fixed distance from the TSS. The increased frequencies seen in EPDnew release 006 compared to the other TSS annotations indicate a better alignment of the TSSs.
\subsection{Integration of EPDnew with other resources}
EPDnew is also hosted in the MGA repository (see section \ref{section_mga} and \citep{dreos_mga_2018}). As such, EPDnew TSSs are available together with the ChIP-seq (TF binding, histone marks), nucleosome occupancy, chromatin accessibility, SNP and sequence conservation data present in the repository. As a consequence, EPDnew can easily be integrated into diverse genomic analyses through the tools hosted on the ChIP-seq \citep{ambrosini_chip-seq_2016} and SSA \citep{ambrosini_signal_2003} servers.
Initially, EPD could be explored through a viewer relying on a selection of tracks to be visualized in the UCSC Genome Browser \citep{dreos_epd_2013}. Since then, a major effort to provide a customizable visualization platform has been undertaken \citep{dreos_eukaryotic_2017}. Currently, each species is provided with a minimal track hub \citep{raney_track_2014} containing at least 3 tracks: i) the combined TSS mapping samples used to create the EPDnew annotation for this species, ii) the EPDnew TSSs on the + strand and iii) the EPDnew TSSs on the - strand. Other tracks, such as a gene track or a CpG island track, are often available depending on the species. Finally, a tool to create custom MGA-derived tracks and upload them to the UCSC Genome Browser in a few mouse clicks (ChIP-Track, available at \url{https://ccg.epfl.ch/chipseq/chip_track.php}) has been developed to fully exploit the possible synergies between the UCSC Genome Browser, EPDnew and the MGA repository.
\subsection{Conclusions}
EPDnew is a valuable resource for the research community. It offers an unprecedented TSS mapping effort in terms of precision and species covered. To facilitate integrative analyses, EPD is fully interconnected with the resources and tools from the ChIP-seq \citep{ambrosini_chip-seq_2016} and SSA \citep{ambrosini_signal_2003} servers maintained by the laboratory. Currently, EPDnew tracks are available on the UCSC Genome Browser. Furthermore, EPDnew is fully interconnected with the MGA, which makes it easy to integrate into different genomic analyses. Additionally, it is possible to create "on demand" visualization tracks for any dataset hosted on the MGA to complement the tracks already available on the UCSC Genome Browser.
Finally, it should be noted that EPDnew genome annotations have been used in the projects described in chapters \ref{encode_peaks} and \ref{spark}.
\subsection{Methods}
\subsubsection{Motif occurrence profiles}
\label{lab_resources_epd_methods_oprof}
The motif occurrence profiles in Figure \ref{lab_resources_epd_motifs} were generated using Oprof from the SSA server (\url{https://ccg.epfl.ch/ssa/oprof.php}). To create the profiles, the input data were \textit{H. sapiens} (Feb 2009 GRCh37/hg19) / Genome Annotation / EPDnew / release 004 or 006, or UCSC, TSS for known Genes / TSS from UCSC known genes. The motifs were chosen from PWMs from libraries / Promoter Motifs and then TATA-box (length=15) or Initiator (length=8). The borders were set to -50 to +50bp, the window size to 20bp for the TATA-box and 10bp for the Initiator, and forward search mode was enabled. All other parameters were left at their default values.
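The principle of a motif occurrence profile can be sketched in a few lines of Python. This is an illustration under simplifying assumptions and not the Oprof implementation: the motif is an exact string rather than a PWM, only the forward strand is scanned, and the names \texttt{motif\_starts}, \texttt{occurrence\_profile} and \texttt{tss\_index} are hypothetical. For each offset relative to the TSS, the sketch reports the fraction of sequences with a motif start inside a window centred on that offset.

```python
from typing import List


def motif_starts(seq: str, motif: str) -> List[int]:
    """Return the 0-based start of every (overlapping) exact motif match."""
    hits = []
    pos = seq.find(motif)
    while pos != -1:
        hits.append(pos)
        pos = seq.find(motif, pos + 1)
    return hits


def occurrence_profile(seqs: List[str], motif: str, tss_index: int,
                       window: int, first: int, last: int) -> List[float]:
    """Fraction of sequences with a motif start in a window around each offset.

    tss_index is the 0-based index of the TSS in every sequence; offsets
    run from first to last (inclusive) relative to the TSS.
    """
    if not seqs:
        raise ValueError("at least one sequence is required")
    half = window // 2
    all_hits = [motif_starts(s, motif) for s in seqs]  # scan each sequence once
    profile = []
    for offset in range(first, last + 1):
        lo = tss_index + offset - half
        hi = tss_index + offset + half
        n = sum(1 for hits in all_hits if any(lo <= h <= hi for h in hits))
        profile.append(n / len(seqs))
    return profile


# Toy example: one of two sequences carries TATAAA starting 5 bp upstream.
toy = ["CCCCC" + "TATAAA" + "CCCCCCCCC", "C" * 20]
toy_profile = occurrence_profile(toy, "TATAAA", tss_index=10,
                                 window=2, first=-8, last=0)
```

A sharper peak at the expected distance from the TSS, as seen for release 006, indicates that the annotated TSSs are better aligned with the true initiation sites.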