\chapter{Data Extraction and Processing}
The core of this project is the extraction of the information contained in the participant lists of UNFCCC meetings.
This chapter explains how we extracted and processed the data from these PDF lists.
\section{Data Extraction}
The first step consisted of transforming the available PDFs into text files.
\subsection{Raw Dataset}
We downloaded the participant lists from the document webpage of the UNFCCC secretariat. % TODO reference
The analyzed lists cover all COP meetings and almost all SB meetings. Note that during a COP, an SB meeting is always
held in parallel, for which there is no separate participant list. \\
A participant list has the following general structure. Participants are listed under the affiliation they belong to; a member of the
Swiss government is, for example, listed under the affiliation ``Switzerland''. Each participant has a salutation that contains
at least ``Mr.'' or ``Ms.'', but may also contain titles such as ``H.E.'' (i.e. ``His/Her Excellency''). Some participants, but not all of them,
also have a description that explains their role within the delegation, for example ``Minister of Foreign Affairs''.
Affiliations are sorted by affiliation category and then alphabetically. The possible affiliation categories are:
\begin{enumerate}
\item Parties
\item Observer States
\item United Nations Secretariat units and bodies
\item Specialized agencies and related organizations
\item Intergovernmental organizations
\item Non-governmental organizations
\end{enumerate}
The category ``Media'' exists in newer participant lists, but the corresponding participants are not listed. We therefore exclude this category.
\\
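The structure described above can be captured in a small record type. The following Python sketch is only an illustration and not the code used in this project: the class, the function, the regular expression and the example name are hypothetical, and we assume that optional titles such as ``H.E.'' precede the mandatory ``Mr.'' or ``Ms.''.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Affiliation categories as enumerated in the text ("Media" is excluded).
CATEGORIES = (
    "Parties",
    "Observer States",
    "United Nations Secretariat units and bodies",
    "Specialized agencies and related organizations",
    "Intergovernmental organizations",
    "Non-governmental organizations",
)

# Assumption: an optional "H.E." comes before the mandatory "Mr."/"Ms.".
SALUTATION = re.compile(r"^((?:H\.E\.\s+)?(?:Mr|Ms)\.)\s+(.*)$")


@dataclass
class Participant:
    affiliation: str
    category: str
    salutation: str
    name: str
    description: Optional[str] = None  # not every participant has one


def parse_participant(line, affiliation, category, description=None):
    """Split a participant line into salutation and name.

    Returns None if the line does not start with a known salutation.
    """
    match = SALUTATION.match(line.strip())
    if match is None:
        return None
    return Participant(
        affiliation=affiliation,
        category=category,
        salutation=match.group(1),
        name=match.group(2).strip(),
        description=description,
    )


# Made-up example entry, not taken from a real participant list.
example = parse_participant(
    "H.E. Ms. Jane Doe", "Switzerland", "Parties", "Minister of Foreign Affairs"
)
```

Lines that do not begin with a recognized salutation yield no record in this sketch, so a caller can treat them as candidates for affiliation names or descriptions.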
The format of the participant lists varies over time. For the first meetings, the participant lists are paper scans, which means that
we need to convert images to text. Furthermore, the way in which affiliations are indicated varies: in the first meetings they are
always written in uppercase letters, a convention that changed in later meetings.
\\
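For the early lists, the uppercase convention suggests a simple heuristic for locating affiliation names in the extracted text. The Python sketch below is a minimal illustration under stated assumptions, not the procedure used in this project: the function names and the minimum of three letters are hypothetical, the input is assumed to be one text line per list entry, and the example names are made up.

```python
def is_uppercase_heading(line, min_letters=3):
    """Heuristic: a line is an affiliation heading if all its letters are uppercase.

    min_letters is an assumed threshold that keeps very short fragments
    (for example a stray "H.E.") from being mistaken for a heading.
    """
    letters = [ch for ch in line if ch.isalpha()]
    return len(letters) >= min_letters and all(ch.isupper() for ch in letters)


def group_by_affiliation(lines):
    """Group the lines of an early list under the preceding uppercase heading."""
    groups = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        if is_uppercase_heading(line):
            current = line
            groups.setdefault(current, [])
        elif current is not None:
            groups[current].append(line)
    return groups


# Made-up example input, not taken from a real participant list.
sample = ["SWITZERLAND", "Mr. John Doe", "", "UNITED KINGDOM", "Ms. Jane Doe"]
grouped = group_by_affiliation(sample)
```

Such a heuristic would not carry over to later lists, where affiliations are no longer written in uppercase, and it is sensitive to character recognition errors in the scanned documents.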
% TODO mention corrigenda
\subsection{Optical Character Recognition}
% TODO change title
\subsection{Well-formatted PDF Extraction}
\section{Data Processing}
