We download the participant lists from the document webpage of the UNFCCC secretariat~\cite{UNFCCC_docs}.
The lists we process contain all the COP meetings and almost all the SB meetings. Note that during a COP, there is usually an SB meeting
held in parallel for which there is no separate participant list. \\
A participant list has the following general structure: participants are listed under the affiliation they belong to. A member of the
Swiss government, for example, is listed under the affiliation “Switzerland”. Each participant is given a salutation that contains
at least “Mr.” or “Ms.”, but may also contain titles such as “H.E.” (i.e., “His/Her Excellency”). Some participants, but not all of them, are
also given a description that explains their role within the delegation, for example “Minister of Foreign Affairs”.
Affiliations are sorted first by affiliation category and then alphabetically. The affiliation categories are:
\begin{itemize}
\item Parties
\item Observer States
\item United Nations Secretariat units and bodies
\item Specialized agencies and related organizations
\item Intergovernmental organizations
\item Non-governmental organizations
\end{itemize}
Furthermore, most of the lists contain an index that states the total number of participants per category.
The category “Media” exists in this index for newer participant lists, but the corresponding participants are not listed. We therefore exclude this category.
\\
The format of the participant lists varies over time. For the first meetings, the participant lists are paper scans, which means that
we need to convert images to text. Furthermore, the manner in which affiliations are indicated varies: in the first meetings they are
always written in uppercase letters, a convention that was dropped in later meetings. Figures~\ref{fig:raw_scan} and~\ref{fig:raw_well} show the first page of the
participant lists of COP 3 and COP 25, respectively, the one for COP 3 being a scan.
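\begin{figure}
\centering
\begin{minipage}[ht]{.5\textwidth}
\captionsetup{width=.8\linewidth}
\captionof{figure}{Example page of participant list of COP 3}
\label{fig:raw_scan}
\end{minipage}%
\begin{minipage}[ht]{.5\textwidth}
\captionsetup{width=.8\linewidth}
\captionof{figure}{Example page of participant list of COP 25}
\label{fig:raw_well}
\end{minipage}
\end{figure}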
We use the version of each participant list that is published during the last days of a meeting.
We exclude the corrigenda, i.e., documents that are published later for some participant lists and contain corrections of the lists,
because their format varies considerably
and many of the listed corrections are minor (e.g., a changed order of participants within an affiliation or a changed description).
\subsubsection{Optical Character Recognition}
To extract the data from the scanned lists, we use Optical Character Recognition (OCR), more precisely Python-Tesseract (pytesseract)~\cite{pytesseract}.
Python-Tesseract is a wrapper for the OCR engine Tesseract, whose development Google began sponsoring in 2006. \\
Tesseract works as follows. First, it analyzes the dense regions of the image to find connected components, which are then organized into text lines.
This first step determines the layout of the extracted page.
Then, a two-pass recognition process is applied.
In the first pass, the program tries to recognize each word. Each word that is recognized satisfactorily is used as training data for
the words that follow. To make use of all the training data, the second pass revisits the words that were not recognized well enough in the first pass~\cite{tesseract_expl}. \\
The version of Tesseract that we use additionally includes a recognition engine based on LSTM neural networks.
In the dataset of this project, the Tesseract OCR engine fails on some pages that contain only sparsely distributed participants
without descriptions: it returns the names in the wrong order. To help Tesseract detect more accurate connected components, we insert half-transparent boxes on the pages
affected by this problem (see Figure~\ref{fig:boxes}). This ensures the correct order of names in the resulting text file.
\begin{figure}[ht]
\caption{Page with an inserted half-transparent box before OCR}
\label{fig:boxes}
\end{figure}
We define a data sample $x_i$ as the participation of a party at a meeting. Note that we consider only parties and no other
affiliations, as only parties are able to make interventions in the official negotiations. Each data sample has the following
features.
\begin{center}
\begin{tabularx}{\textwidth}{|c|c|X|}
\hline
Name & Value Range & Description \\
\hline\hline
year & $0 - 24$ & The year the meeting took place minus 1995 \\
\hline
number\_of\_delegates & $1 - 1589$ & Number of participants of this delegation \\
\hline
meeting\_type & 0 or 1 & 0 if the meeting is a COP, 1 if it's an SB \\
\hline
government\_rate & $0 - 1$ & Proportion of delegates with role “Government” \\
\hline
diplomacy\_rate & $0 - 1$ & Proportion of delegates with role “Diplomacy” \\
\hline
security\_rate & $0 - 1$ & Proportion of delegates with role “Security” \\
\hline
press\_rate & $0 - 1$ & Proportion of delegates with role “Press” \\
\hline
university\_rate & $0 - 1$ & Proportion of delegates with role “Universities” \\
\hline
no\_description\_rate & $0 - 1$ & Proportion of delegates with no description \\
\hline
no\_keyword\_rate & $0 - 1$ & Proportion of delegates with no detected keyword \\
\hline
nb\_fossil\_fuel\_industry\_associations & $0 - 26$ & Absolute number of delegates with association to the fossil fuel industry \\
\hline
woman\_proportion & $0 - 1$ & The proportion of female participants in the delegation \\
\hline
experience\_score\_cop & $0 - 18$ & The experience score on previous COPs of the delegation \\
\hline
experience\_score\_sb & $0 - 17$ & The experience score on previous SBs of the delegation \\
\hline
experience\_score\_parties\_rate & $0 - 1$ & The proportion of the total experience score that has been acquired in the category “Parties” \\
\hline
is\_Afghanistan & 0 or 1 & 1 if the delegation is Afghanistan \\
\hline
is\_Albania & 0 or 1 & 1 if the delegation is Albania \\
\hline
$\vdots$ & $\vdots$ & $\vdots$ \\
\hline
is\_Zimbabwe & 0 or 1 & 1 if the delegation is Zimbabwe \\
\hline
is\_unrecognized\_country & 0 or 1 & 1 if no party has been detected \\
\hline
\end{tabularx}
\end{center}
There is a total of $213$ features. The attribute \textit{year} is the year in which the meeting took place minus 1995, the year of the first meeting (SB 1),
which keeps the values close to zero.
The attributes \textit{government\_rate} to \textit{no\_keyword\_rate} correspond to the proportion of each role that we assign
(see \ref{roles}). For the experience score, we provide the COP and SB experience as absolute numbers; together they sum to the total
experience score of an affiliation. The attribute \textit{experience\_score\_parties\_rate} denotes the share of the total experience score
that has been acquired in the category “Parties” (see \ref{experience}).
The party information is converted into 198 binary attributes, one for each of the 197 Parties to the Convention
and one for an invalid or unrecognized country. \\
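The encoding of the year and of the party can be illustrated with a minimal Python sketch. The function name \texttt{encode\_sample}, the constant \texttt{FIRST\_YEAR} and the shortened party list in the usage example are hypothetical and serve only as an illustration; the actual implementation may differ.

```python
FIRST_YEAR = 1995  # year of the first meeting, subtracted to keep values near zero


def encode_sample(year, party, known_parties):
    """Encode the year and party of one sample (hypothetical helper).

    Returns the year offset and a one-hot list with one slot per known
    party plus a final slot for an unrecognized party.
    """
    one_hot = [0] * (len(known_parties) + 1)
    if party in known_parties:
        one_hot[known_parties.index(party)] = 1
    else:
        one_hot[-1] = 1  # corresponds to is_unrecognized_country
    return year - FIRST_YEAR, one_hot


# Illustrative usage with a shortened party list (the real list has 197 entries).
parties = ["Afghanistan", "Albania", "Zimbabwe"]
year_offset, party_vector = encode_sample(2019, "Albania", parties)
```

With the full list of 197 parties, the one-hot vector has 198 entries, matching the number of binary attributes stated above.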
In total, we have 9218 data samples. We randomly pick about 80\% of these samples, i.e., 7400 samples, as our training set.
The remaining 1818 samples form our test set.
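A random split of this size could be drawn as in the following sketch. The function \texttt{split\_indices} and the fixed seed are assumptions made for illustration; the text does not state how the random selection was implemented.

```python
import random


def split_indices(n_samples, n_train, seed=0):
    """Shuffle the sample indices and split them into train and test parts.

    The seed is a hypothetical choice that makes the split reproducible.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]


# Sample counts taken from the text: 9218 samples, 7400 of them for training.
train_idx, test_idx = split_indices(9218, 7400)
```

Fixing the seed keeps the same samples in the test set across runs, so that the models are evaluated on identical data.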
\subsubsection{Models}
% baseline models
We first build two \textbf{baseline models} against which we can later compare our models.
The first baseline model always predicts zero interventions, as this is the most common label.
The second baseline model computes the average number of interventions a party made over all included meetings
and always predicts this average.
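Both baselines are simple enough to sketch in a few lines of Python. All function names are hypothetical, and the fallback value for a party that does not occur in the training data is an assumption, since the text does not specify this case.

```python
from collections import defaultdict


def predict_zero(parties):
    """First baseline: predict zero interventions for every sample."""
    return [0.0 for _ in parties]


def fit_party_means(train_parties, train_labels):
    """Compute the average number of interventions per party."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for party, label in zip(train_parties, train_labels):
        totals[party] += label
        counts[party] += 1
    return {party: totals[party] / counts[party] for party in totals}


def predict_party_mean(parties, party_means, default=0.0):
    """Second baseline: predict each party's average.

    `default` is an assumed fallback for parties unseen during fitting.
    """
    return [party_means.get(party, default) for party in parties]


# Toy data for illustration only (not taken from the dataset).
means = fit_party_means(["A", "A", "B"], [2, 4, 0])
predictions = predict_party_mean(["A", "B", "C"], means)
```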
% linear model
TODO embed in linear model
For this reason, we normalize the attributes before training the model. For an attribute $x_{i,j}$ we compute
\begin{equation}
x_{i,j} ' = \frac{x_{i,j} - \mu}{|x_{i,j}|}
\end{equation}
% linear model with logarithmic transformation
TODO log transformation \\
% mixed model
Another approach is to handle the large number of zero interventions better.
The large share of zero labels makes it hard for linear models to succeed.
We therefore introduce a \textbf{two-step model} that works as follows: % TODO cite inspiration
\begin{enumerate}
\item Predict for each sample if the number of interventions will be zero or non-zero.
\item For the non-zero samples, apply a second model to predict the label.
\end{enumerate}
For the first step, we use a logistic regressor with regularization.
For the second step, we use a Poisson regressor with regularization.
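How the two steps are combined into one prediction can be sketched as follows. The function \texttt{two\_step\_predict} is hypothetical, and the two models are passed in as plain callables; in the setting described here they would be the fitted logistic regressor and the fitted Poisson regressor, for which the toy functions in the usage example are mere stand-ins.

```python
def two_step_predict(samples, is_nonzero, predict_count):
    """Combine a zero/non-zero classifier with a count regressor.

    is_nonzero:    callable returning True if a sample is classified as non-zero
    predict_count: callable returning the predicted number of interventions
    Samples classified as zero are assigned a prediction of 0.
    """
    predictions = []
    for sample in samples:
        if is_nonzero(sample):
            predictions.append(float(predict_count(sample)))
        else:
            predictions.append(0.0)
    return predictions


# Toy stand-ins for the two fitted models (illustration only).
toy_predictions = two_step_predict(
    [1, 5, 20],
    is_nonzero=lambda x: x > 3,
    predict_count=lambda x: 0.5 * x,
)
```

Note that the second model is only ever applied to samples that the first step classifies as non-zero, so errors of the classifier propagate into the final prediction.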
\subsubsection{Results}
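We compare our models by the root-mean-square error (RMSE) between the predicted numbers of interventions $\hat{y}_i$ and the true values $y_i$.
The root-mean-square error is defined as the square root of the mean squared error (MSE), i.e., for $n$ samples,
\begin{equation}
\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2} .
\end{equation}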
First, we consider the \textbf{baseline models}. When always predicting zero interventions, we obtain $\mathrm{RMSE} = 9.54$ on the test data.
When we always predict the average number of interventions of the party in question over all samples in the training data,
we obtain $\mathrm{RMSE} = 5.02$ on the test data. This shows that the identity of the party alone already carries a lot of information about its behavior
during meetings. \\
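The RMSE used for this comparison can be computed as in the following minimal sketch. The function name \texttt{rmse} and the toy values are assumptions for illustration and are not taken from the dataset.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between predictions and true values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy example: always predicting zero for the true labels 3 and 4.
toy_error = rmse([0.0, 0.0], [3.0, 4.0])
```

Because the errors are squared before averaging, a few samples with many interventions dominate the RMSE, which is why always predicting zero performs poorly even though zero is the most common label.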
The \textbf{ridge regression} model with all features yields $\mathrm{RMSE} = 5.01$.
The optimal regularizer, found by cross-validation, is $\lambda = 0.0101$.
% TODO insert the same notation W
We can now analyze which attributes have the strongest influence on the prediction.
The bias term of the model is $ w_0 = 3.281 $. The attributes with the strongest influence on the predictions are the parties, as expected given that the second
baseline model performs well. The strongest tendency towards many interventions per meeting is shown by the European Union ($ + 74.7 $), the United States ($ + 53.3 $) and China ($ + 48.0 $).
C\^ote d'Ivoire ($ - 2.58 $), San Marino ($ - 2.56 $) and Greece ($ - 2.48 $) are the parties that bias the prediction most towards few interventions.
Among the non-party attributes, those that raise the predicted number of interventions the most are \textit{press\_rate} ($ + 2.51 $), \textit{university\_rate} ($ + 1.11 $)
and \textit{experience\_score\_parties\_rate} ($ + 0.70 $).
The non-party attributes that lower the predicted number of interventions the most are \textit{no\_description\_rate} ($ - 1.67 $), \textit{diplomacy\_rate} ($ - 0.82 $)
and \textit{no\_keyword\_rate} ($ - 0.43 $).
Interestingly, the year and the number of delegates are the attributes with the weakest influence on the prediction. Apparently, time and delegation size
have only a small influence on the activity of a party. \\
% TODO log transf
The \textbf{two-step model} improves the prediction only slightly.
The first step correctly classifies 79.2\% of the test samples as zero or non-zero, with an optimal regularizer of $\lambda = 1.035$.
The second step predicts the number of interventions for the samples that the logistic regressor has classified as non-zero.
On these samples alone, i.e., the ones that have been predicted to be non-zero, it yields $\mathrm{RMSE} = 9.89$.
Over all test samples, the final prediction of the two-step model yields $\mathrm{RMSE} = 4.94$,
a slight improvement over the previous models.