data_processing.tex
Created
Sat, Nov 16, 11:58

\chapter{Data Extraction and Processing}
The core of this project is the extraction of the information contained in the participant lists of UNFCCC meetings.
In this chapter, we explain how we extracted and processed the data from the PDF lists.
\section{Data Extraction}
The first step consisted of transforming the available PDF files into text files. It is important to keep some information
about the structure of the text to be able to find the relevant data in the resulting text file.
\subsection{Raw Dataset}
We download the participant lists from the documents webpage of the UNFCCC secretariat~\cite{UNFCCC_docs}.
The analyzed lists cover all COP meetings and almost all SB meetings. Note that an SB meeting is always held in parallel
with a COP, and that no separate participant list exists for it. \\
A participant list has the following general structure: participants are listed under the affiliation they belong to. A member of the
Swiss government is, for example, listed under the affiliation ``Switzerland''. Each participant has a salutation that contains
at least ``Mr.'' or ``Ms.'', but may also contain titles such as ``H.E.'' (i.e.\ ``His Excellency'' or ``Her Excellency''). Some participants,
but not all of them, also have a description that explains their role within the delegation, for example ``Minister of Foreign Affairs''.
Affiliations are sorted first by affiliation category and then alphabetically. The possible affiliation categories are:
\begin{itemize}
\item Parties
\item Observer States
\item United Nations Secretariat units and bodies
\item Specialized agencies and related organizations
\item Intergovernmental organizations
\item Non-governmental organizations
\end{itemize}
The category ``Media'' exists in newer participant lists, but the corresponding participants are not listed. We therefore exclude this category.
\\
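The structure described above can be sketched as a small record type together with a helper that separates the salutation from the rest of a participant line. This is only an illustration under stated assumptions: the names \texttt{Participant} and \texttt{split\_salutation} are hypothetical, the set of salutation tokens is limited to the three examples mentioned in the text, and we assume that the salutation always appears at the start of the line.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumption: only the tokens named in the text; real lists may use more.
_SALUTATION_RE = re.compile(r"^((?:(?:H\.E\.|Mr\.|Ms\.)\s+)+)(.*)$")


@dataclass
class Participant:
    """Hypothetical record for one entry of a participant list."""
    salutation: str
    name: str
    affiliation: str
    category: str
    description: Optional[str] = None  # not every participant has one


def split_salutation(line: str) -> Optional[Tuple[str, str]]:
    """Split 'H.E. Mr. John Doe' into ('H.E. Mr.', 'John Doe').

    Returns None if the line does not start with a known salutation,
    e.g. because it is a description or an affiliation name.
    """
    match = _SALUTATION_RE.match(line.strip())
    if match is None:
        return None
    salutation = " ".join(match.group(1).split())
    return salutation, match.group(2).strip()
```

A line for which \texttt{split\_salutation} returns \texttt{None} is not a participant line under these assumptions and would have to be classified by other rules.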
The format of the participant lists varies over time. For the first meetings, the participant lists are paper scans, which means that
we need to convert images to text. Furthermore, the manner in which affiliations are indicated varies: in the first meetings they are
always written in uppercase letters, which is no longer the case in later meetings.
\\
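For the early lists, the uppercase convention suggests a simple heuristic for spotting affiliation lines. The following sketch is an assumption about how such a check could look, not the rule used in our scripts; the function name and the minimum number of letters are hypothetical.

```python
def looks_like_uppercase_affiliation(line: str, min_letters: int = 3) -> bool:
    """Return True if every letter in the line is uppercase.

    min_letters is a hypothetical guard so that short tokens such as
    'H.E.' are not mistaken for an affiliation.
    """
    letters = [char for char in line if char.isalpha()]
    return len(letters) >= min_letters and all(char.isupper() for char in letters)
```

Such a check only applies to the early lists and is sensitive to OCR errors, so it can serve at most as one cue among several.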
We choose the version of the participant lists that is published during the last days of a meeting.
We exclude the corrigenda, i.e.\ documents that are published later for some participant lists and contain corrections to them,
because their format varies considerably and many of the listed corrections are minor
(change of the order of participants within an affiliation, change of descriptions).
\subsection{Optical Character Recognition}
To extract the data from the scanned lists, we use Optical Character Recognition (OCR), more precisely Python-tesseract (pytesseract)~\cite{pytesseract}.
Python-tesseract is a wrapper for the OCR engine Tesseract, which is open-source, available under the Apache 2.0 license,
and whose development has been sponsored by Google since 2006. \\
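A minimal sketch of how scanned pages could be passed to pytesseract is given below. The function \texttt{ocr\_pages} is hypothetical, and the language and page segmentation settings are placeholders rather than the options used in this project. The OCR call can be replaced by any other function, which makes it possible to try the surrounding code without Tesseract installed.

```python
from typing import Callable, Iterable, List, Optional


def ocr_pages(
    page_images: Iterable[object],
    ocr: Optional[Callable[[object], str]] = None,
    lang: str = "eng",
    config: str = "--psm 6",  # placeholder option, not a tuned value
) -> List[str]:
    """Return the recognized text of each page image, in input order.

    If no OCR function is given, pytesseract is used. This requires the
    pytesseract package and a Tesseract installation.
    """
    if ocr is None:
        import pytesseract  # third-party wrapper around Tesseract

        def ocr(image: object) -> str:
            return pytesseract.image_to_string(image, lang=lang, config=config)

    return [ocr(image) for image in page_images]
```

With pytesseract, the page images would typically be \texttt{PIL.Image} objects, one per scanned page.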
% TODO first check what options I finally use, then describe how tesseract works.
% TODO change title
\subsection{Well-formatted PDF Extraction}
To extract the data from the well-formatted PDF files, we use a PDF processing package called pdfminer.six~\cite{pdfminer.six}.
This Python package is community-maintained and open-source.
Again, the main difficulty is to extract the text of the list in the correct order, which is especially hard for documents with
three columns. For this reason, we adapted pdfminer.six by rewriting one of its classes, the \texttt{PDFPageAggregator}. \\
First, we briefly explain how pdfminer.six extracts text from PDF files. The package performs a layout analysis on every page before
extracting the text. This analysis is done in three stages:
\begin{itemize}
\item Group characters into words and lines
\item Group lines into text boxes
\item Group text boxes hierarchically
\end{itemize}
The output of the layout analysis is visualized in Figure~\ref{fig:pdfminer}.\\
\begin{figure}[ht]
\caption{Output of the layout analysis of pdfminer.six}
\centering
\includegraphics[width=0.9\textwidth]{pdfminer.png}
\label{fig:pdfminer}
\end{figure}
The class we modify, \texttt{PDFPageAggregator}, is responsible for outputting the text lines of a page in the determined order.
To be able to sort the text lines according to our own rules later, we modify the method \texttt{receive\_layout} such that it outputs,
for each \texttt{LTTextLine}, its x and y positions within the page. In our extraction script, we then define rules to
determine in which column a text line is located. \\
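The idea of assigning text lines to columns by their position can be sketched as follows. This is an illustration only: \texttt{TextLine} is a hypothetical stand-in for the positions and text read from an \texttt{LTTextLine}, and the rule (equal-width columns, decided by the left edge of a line) is an assumption rather than the exact rule of our script. In pdfminer.six, the origin of the page coordinates is the bottom-left corner, so a larger y value means that a line is higher up on the page.

```python
from typing import List, NamedTuple


class TextLine(NamedTuple):
    """Hypothetical stand-in for the bounding box and text of one line."""
    x0: float  # left edge
    y0: float  # bottom edge
    x1: float  # right edge
    y1: float  # top edge
    text: str


def column_index(line: TextLine, page_width: float, n_columns: int) -> int:
    """Assign a line to a column, assuming columns of equal width."""
    column_width = page_width / n_columns
    index = int(line.x0 // column_width)
    return min(max(index, 0), n_columns - 1)


def reading_order(lines: List[TextLine], page_width: float, n_columns: int) -> List[TextLine]:
    """Sort column by column (left to right), top to bottom within a column."""
    return sorted(
        lines,
        key=lambda line: (column_index(line, page_width, n_columns), -line.y1),
    )
```

Sorting by the column index first and by the negated top edge second yields the order in which a reader would go through a multi-column page.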
Affiliation category titles are a special case for the page layout: they break the column system in the middle of a page. We therefore
need to recognize them by their content and introduce special rules for pages that contain affiliation category titles.
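One possible way to handle such pages is to split the page into horizontal bands at each category title and to apply the column ordering within each band separately. The sketch below illustrates this idea under several assumptions: lines are given as hypothetical \texttt{(x0, y1, text)} tuples, titles are recognized by an exact, case-insensitive match against the category names listed earlier, and columns have equal width. It does not reproduce the exact rules of our script.

```python
from typing import List, Tuple

Line = Tuple[float, float, str]  # (left edge x0, top edge y1, text)

# Category names as listed in this chapter; exact matching is an assumption.
CATEGORY_TITLES = (
    "Parties",
    "Observer States",
    "United Nations Secretariat units and bodies",
    "Specialized agencies and related organizations",
    "Intergovernmental organizations",
    "Non-governmental organizations",
)
_TITLES_LOWER = {title.lower() for title in CATEGORY_TITLES}


def is_category_title(text: str) -> bool:
    return text.strip().lower() in _TITLES_LOWER


def order_with_titles(lines: List[Line], page_width: float, n_columns: int) -> List[Line]:
    """Order lines band by band, where category titles separate the bands."""
    column_width = page_width / n_columns

    def column_key(line: Line) -> Tuple[int, float]:
        x0, y1, _ = line
        column = min(max(int(x0 // column_width), 0), n_columns - 1)
        return column, -y1  # larger y means higher up on the page

    titles = sorted((l for l in lines if is_category_title(l[2])), key=lambda l: -l[1])
    body = [l for l in lines if not is_category_title(l[2])]

    ordered: List[Line] = []
    upper = float("inf")
    for title in titles:
        band = [l for l in body if title[1] < l[1] <= upper]
        ordered.extend(sorted(band, key=column_key))
        ordered.append(title)
        upper = title[1]
    ordered.extend(sorted((l for l in body if l[1] <= upper), key=column_key))
    return ordered
```

On a page without any category title, the loop is skipped and the function reduces to plain column-wise ordering.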
\section{Data Processing}
