diff --git a/report/data_processing.tex b/report/data_processing.tex index 29b1557..d0e71fe 100644 --- a/report/data_processing.tex +++ b/report/data_processing.tex @@ -1,436 +1,436 @@
\section{Data Extraction and Processing}
The main part of this project is to extract the information contained in the participant lists of UNFCCC meetings. We explain in this section how we extract and process the data of the PDF lists.
\subsection{Data Extraction}
This section describes the first problem of the project, i.e., how we extract the data from the PDF participant lists. The first step consists of transforming the available PDF files into text files. It is important to keep some information about the structure of the text to be able to find the relevant data in the resulting text file. The second step consists of transforming the text files into comma-separated values (CSV) files. The result of this task is a CSV file for each processed participant list that contains the entries \textit{affiliation category, affiliation, name, description}.
\subsubsection{Raw Dataset} \label{dataset}
We download the participant lists from the document webpage of the UNFCCC secretariat. \cite{UNFCCC_docs} The lists we process cover all the COP meetings and almost all the SB meetings. Note that during a COP, there is usually an SB meeting held in parallel for which there is no separate participant list. \\
A participant list has the following general structure: Participants are listed under the affiliation they belong to. A member of the Swiss government is for example listed under the affiliation “Switzerland”. A participant is attributed a salutation that contains at least “Mr.” or “Ms.”, but may also contain a title such as “H.E.” (i.e., “Her Excellency”). Some participants, but not all of them, are attributed a description that explains their role within the delegation. This description could for example be “Minister of Foreign Affairs”. Affiliations are sorted according to their affiliation category and then alphabetically. The affiliation categories are: \begin{itemize} \item Parties \item Observer States \item United Nations Secretariat units and bodies \item Specialized agencies and related organizations \item Intergovernmental organizations \item Non-governmental organizations \end{itemize}
Furthermore, most of the lists contain an index that states the total number of participants per category. The category “Media” exists in this index for newer participant lists, but the corresponding participants are not listed. We therefore exclude this category. \\
The format of the participant lists varies over time. For the first meetings, the participant lists are paper scans, which means that we need to convert images to text. Furthermore, the manner in which affiliations are indicated varies: in the first meetings they are always written in all uppercase letters, which was changed in later meetings. Figures \ref{fig:raw_scan} and \ref{fig:raw_well} show the first page of the participant lists of COP 3 and COP 25 respectively, the one for COP 3 being a scan.
\begin{figure} \centering \begin{minipage}[ht]{.5\textwidth} \centering \includegraphics[width=0.9\textwidth]{raw_scan.PNG} \captionsetup{width=.8\linewidth} \captionof{figure}{Example page of the participant list of COP 3} \label{fig:raw_scan} \end{minipage}% \begin{minipage}[ht]{.5\textwidth} \centering \includegraphics[width=0.9\textwidth]{raw_well_formatted.PNG} \captionsetup{width=.8\linewidth} \captionof{figure}{Example page of the participant list of COP 25} \label{fig:raw_well} \end{minipage} \end{figure}
We choose the version of the participant lists that is published during the last days of a meeting. We exclude the corrigenda, documents that are published later for some participant lists and contain corrections of the lists, because their format varies a lot and many of the listed corrections are rather small (changes in the order of participants within an affiliation, changes of descriptions).
\subsubsection{Optical Character Recognition}
To extract the data from the scanned lists, we use Optical Character Recognition (OCR), more precisely Python-Tesseract (pytesseract \cite{pytesseract}). Python-Tesseract is a wrapper for the OCR engine Tesseract, whose development has been led by Google since 2006. \\
Tesseract works as follows. First, it looks for regions in the image that contain dense elements to find connected components, which are then organized as text lines. This first step determines the format of the extracted page. Then, a two-pass recognition process is applied. In the first pass, the program tries to recognize each word. If a word is recognized satisfactorily, it is used as training data for every word that follows. To make use of all the training data, the second pass goes over all unrecognized words a second time. \cite{tesseract_expl} \\
In the dataset of this project, the Tesseract OCR engine fails on some specific pages that contain only sparsely distributed participants without descriptions and returns the names in the wrong order. To help Tesseract find more accurate connected components, we insert half-transparent boxes on pages that encounter this problem (see Figure \ref{fig:boxes}). This ensures the correct order of names in the resulting text file.
\begin{figure}[ht] \centering \includegraphics[width=0.4\textwidth]{boxes_tesseract.png} \caption{Page with an inserted half-transparent box before OCR} \label{fig:boxes} \end{figure}
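A minimal sketch of this OCR step is shown below. It assumes that the scanned pages are first rendered as images (here with the pdf2image package); the file names are illustrative.
\begin{verbatim}
from pdf2image import convert_from_path
import pytesseract

# Render each page of the scanned participant list as an image,
# then let Tesseract recognize the text of every page.
pages = convert_from_path("cop03_participants.pdf", dpi=300)
text = "\n".join(pytesseract.image_to_string(page, lang="eng")
                 for page in pages)

with open("cop03_participants.txt", "w", encoding="utf-8") as f:
    f.write(text)
\end{verbatim}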
\subsubsection{Well-formatted PDF Extraction}
To extract the data from the well-formatted PDF files, we use a PDF processing package called Pdfminer.six. \cite{pdfminer.six} Again, the main difficulty is to extract the text of the list in the correct order. Especially for documents with three columns, this becomes a difficult task. For this reason, we adapt the use of Pdfminer.six by rewriting one of its classes, the \texttt{PDFPageAggregator}. \\
First, we briefly explain how Pdfminer.six extracts text from PDF files. Pdfminer.six performs a layout analysis on every page before extracting the text. This analysis is done in three stages: \begin{itemize} \item Group characters into words and lines \item Group lines into text boxes \item Group text boxes hierarchically \end{itemize}
The output of the layout analysis is visualized in Figure \ref{fig:pdfminer}.\\
\begin{figure}[ht] \centering \includegraphics[width=0.9\textwidth]{pdfminer.png} \caption{Output of the layout analysis of pdfminer.six} \label{fig:pdfminer} \end{figure}
The class we want to modify, \texttt{PDFPageAggregator}, is responsible for outputting the text lines of a page in the determined order. To be able to sort the text lines according to our rules later, we modify the function \texttt{receive\_layout} such that it outputs for each LTTextLine its $x$ and $y$ positions within the page. In our script that performs the extraction, we then define rules to determine in which column a text line is situated. \\
-A special case for the page layout are affiliation category titles. They break the column system in the middle of a page. We therefore
+Special cases of the page layout are affiliation category titles. They break the column system in the middle of a page. We therefore
need to recognize them by their text and introduce special rules for pages that contain affiliation category titles. Another difficulty is the recognition of new affiliations. Pdfminer.six does not give us information about font style at the text line level, so the only way to detect new affiliations is through line breaks and the fact that names always start with a salutation. As line breaks are automatically preserved with pdfminer, we encounter problems only in special situations: when a new affiliation is at the top of a column and longer than two lines, we cannot distinguish it from the description of a previous participant that is split across two columns.
\subsubsection{Extraction from Text Files}
We now need to extract the information from the generated text files. We do this with the following procedure (a simplified sketch in code follows the list): \begin{enumerate} \item Remove unnecessary elements from the text file, e.g., page numbers and page headers. \item Iterate through the lines of the text file and for each line: \begin{enumerate} \item Check if the current line is the beginning of a new affiliation category. We do this by keyword checking. \item Check if the current line is a new affiliation. We look for format cues like a line in uppercase letters (for early meetings) or lines that are positioned after a double line break and do not start with a salutation. \item Check if the current line is a new name for the current affiliation by detecting a salutation. \item If none of the above is the case, add the line to the description of the last participant. \end{enumerate} \item Store the data structure in a CSV file. \end{enumerate}
Note that this algorithm fails for participants whose entry does not start with a salutation. But as this case only happens a few times in all the processed lists, we can neglect this error.
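A simplified sketch of this parsing loop is given below. The salutation pattern, the category keywords, and the column names are abridged or illustrative; the actual implementation additionally uses blank lines (double line breaks) and other cues to detect new affiliations.
\begin{verbatim}
import csv
import re

# Abridged cues; the real lists of salutations and categories are longer.
SALUTATION = re.compile(r"^(Mr\.|Ms\.|Sr\.|Sra\.)\s")
CATEGORIES = {"Parties", "Observer States",
              "Non-governmental organizations"}

def parse_participant_list(lines):
    rows, category, affiliation, current = [], None, None, None
    for line in (l.strip() for l in lines):
        if not line:
            continue
        if line in CATEGORIES:            # new affiliation category
            category = line
        elif SALUTATION.match(line):      # new participant name
            current = [category, affiliation, line, ""]
            rows.append(current)
        elif line.isupper():              # new affiliation (early lists)
            affiliation = line
        elif current is not None:         # continuation of a description
            current[3] = (current[3] + " " + line).strip()
    return rows

with open("cop03_participants.txt", encoding="utf-8") as f:
    rows = parse_participant_list(f)
with open("cop03_participants.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["affiliation_category", "affiliation",
                     "name", "description"])
    writer.writerows(rows)
\end{verbatim}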
\subsection{Data Processing}
This section describes the second problem of the project, i.e., how to pull insightful features out of the extracted data. The goal is to post-process the CSV files and bring all the meetings together into one dataset that contains more features per participant.
\subsubsection{Unification of Meetings}
In order to make our complete dataset as consistent as possible, we need the same affiliation to be named identically throughout all the meetings. For some earlier meetings, e.g. COP 2, the English version of the participant list was not available. We therefore process the French versions of their participant lists. With the help of a dictionary, we translate the names of all the parties to English. Once all the country names are in English, we still need to unify them to a single denotation per country. For example, the party Venezuela is named “Venezuela” in the participant list of COP 6, but “Venezuela (Bolivarian Republic of)” in COP 25. To unify the English country names, we use the Python package country-converter. \cite{coco} We use it to change every country name to its “short” name. “Venezuela (Bolivarian Republic of)” then becomes “Venezuela”. This package has limitations when misspellings occur. For example, an error in the OCR process caused Iran to be spelled “Tran” in COP 1, so country-converter does not recognize it correctly. \\
Note that we apply translation and unification only to the affiliation names of the parties. The unification would be more difficult and more error-prone for other affiliations due to the larger number of possible names. Similarly, we do not modify descriptions of participants, even if they are sometimes written in another language.
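A minimal sketch of this unification step with country-converter is shown below (the translation dictionary used for the French lists is omitted).
\begin{verbatim}
import country_converter as coco

cc = coco.CountryConverter()

# Different denotations of the same country collapse to one short name;
# OCR misspellings such as "Tran" are not recognized and are flagged.
names = ["Venezuela (Bolivarian Republic of)", "Venezuela", "Tran"]
short_names = cc.convert(names=names, to="name_short")
\end{verbatim}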
\subsubsection{Gender and Title}
The easiest additional features to extract are the gender and title of participants. This is due to the very static structure of names in the UNFCCC participant lists: Each name starts with a salutation (“Mr.”, “Ms.”, “Sr.”, “Sra.”, etc.) that is either male or female. By simply checking this salutation, we can extract the gender of each participant. Optionally, the salutation contains a title such as “H.E.” (“Her Excellency”), “Dr.” or “Prof.”. We set a binary attribute \textit{has\_title} to 1 if a participant is listed with such a title, and to 0 otherwise.
\subsubsection{Roles} \label{roles}
The description of a participant contains more information about them, but in a very inconsistent format. Every affiliation can decide what to provide as descriptions of their participants. We try to categorize descriptions by defining roles. These roles capture the function of a participant within their affiliation. \\
We assign a role to a participant by looking for keywords in its description (a sketch in code follows the list). The following list contains the
-roles that we look for and some corresponding keywords in order of decreasing priority. If a description contains
+roles that we look for and some of their corresponding keywords in order of decreasing priority. If a description contains
keywords from more than one role, it is assigned the one with the highest priority. \begin{itemize} \item Security (Security Officer, Security Service) \item Diplomacy (Ambassador, Embassy, Diplomatic) \item Government (Ministry, Minister, Government, Parliament, Agency, Department of, European Commission, Presidential Office) \item Press (Journalist, Reporter, Radio, Press) \item Universities (Professor, Researcher, Student, University) \end{itemize}
The reason for “Security” having the highest priority is that security services are often provided to people with other roles. With our priority rule, the description “Security Officer of the Minister” would be assigned the role “Security” and not “Government”, which is the correct choice. On the other hand, we avoid the keyword “Security” for this role to prevent a description like “Minister for Politics, Law, and Security Affairs” from being assigned the role “Security”.
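A condensed sketch of this priority-based keyword matching is shown below (keyword lists abridged).
\begin{verbatim}
# Roles in decreasing order of priority with (abridged) keyword lists.
ROLE_KEYWORDS = [
    ("Security",     ["security officer", "security service"]),
    ("Diplomacy",    ["ambassador", "embassy", "diplomatic"]),
    ("Government",   ["ministry", "minister", "government", "parliament"]),
    ("Press",        ["journalist", "reporter", "radio", "press"]),
    ("Universities", ["professor", "researcher", "student", "university"]),
]

def assign_role(description):
    if not description:
        return "no description"
    text = description.lower()
    for role, keywords in ROLE_KEYWORDS:   # first match wins (priority)
        if any(keyword in text for keyword in keywords):
            return role
    return "no keyword found"

assign_role("Security Officer of the Minister")   # -> "Security"
\end{verbatim}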
\subsubsection{Association to Fossil Fuel Industry}
We also use keywords to determine whether or not a participant is associated with the fossil fuel industry. Examples of our keywords are “Petroleum”, “Oil”, “Gas”, “BP”, “Total”, etc. We separate this from the roles because we check for the keywords not only in the description, but also in the affiliation name. For example, we want to detect all participants of the NGO “Canadian Association of Petroleum Producers” as associated with the fossil fuel industry, even if they do not have any description. \\
Another advantage is that a participant who is associated with the fossil fuel industry can still have a role. For example, Saudi Arabia has a Ministry of Petroleum. The corresponding minister is assigned the role “Government” but is still associated with the fossil fuel industry.
\subsubsection{Experience} \label{experience}
When bringing together the data of all the different meetings, we are interested in the experience of participants. We define experience as the number of earlier UNFCCC meetings that the participant has attended. We distinguish between experience in SB meetings and COP meetings, as they have quite different characteristics. Furthermore, we distinguish between experience within a delegation of a Party to the Convention (i.e., affiliation category “Parties”) and experience in another affiliation category. \\
To determine the experience, we have to compare names across different meetings.
-There are some situations where a plain text comparison would fail, even if it's the same
-person.
+There are some situations where a plain text comparison will fail, even if it's the same
+person:
\begin{itemize} \item Different spellings of the name, e.g., the simplification of a special character (e instead of é) \item Long names that span over more than one column are not entirely detected in the newer PDFs, which have three columns; hence, only a part of the name is detected. \item The order of names is swapped (e.g. “Obama Barack” instead of “Barack Obama”) \end{itemize}
We decided to handle these cases in the following manner: \begin{itemize} \item Allow an edit distance of 1 (see below). \item Consider two names as the same when one starts with the other (“Alexander Van der Bellen” and “Alexander Van der” are considered to be the same person). We exclude names with fewer than 15 characters from this rule to guarantee that a line break is involved. \item If the sets of words of two names are equal, the persons are considered to be the same. \end{itemize}
We compute the \textbf{edit distance} between names. There exist several types of edit distances. All of them count the minimum number of operations needed to get from one string to the other. We need to keep the accepted distance very small to keep the error rate low: with over 130,000 distinct participants, the occurrence of very similar names is probable. To get the property that we want, we need substitution to be allowed, such that a missed special character or a typo can simply be replaced by the correct character. We compare the performance of two edit distances. \\
The \textbf{Hamming distance} only allows substitution, hence the compared strings need to have the same length. \cite{hamming} It is equal to the number of positions at which the symbols of the two strings differ. The \textbf{Levenshtein distance} allows substitution, insertion and deletion. \cite{levenshtein} It is equal to the minimum number of single-character edits required to change one string into the other. Mathematically,
\begin{equation} \label{levenshtein_eq} \mathrm{lev}(a,b) = \begin{cases} \lvert a \rvert & \text{if } \lvert b \rvert = 0 \\ \lvert b \rvert & \text{if } \lvert a \rvert = 0 \\ \mathrm{lev}(\mathrm{tail}(a), \mathrm{tail}(b)) & \text{if } a[0] = b[0] \\ 1 + \min \begin{cases} \mathrm{lev}(\mathrm{tail}(a), b) \\ \mathrm{lev}(a, \mathrm{tail}(b)) \\ \mathrm{lev}(\mathrm{tail}(a), \mathrm{tail}(b)) \end{cases} & \text{otherwise} \end{cases} \end{equation}
where, for a string $x$, $\mathrm{tail}(x)$ is the string without its first character and $\lvert x \rvert$ is the length of the string. \\
When comparing the results of the Levenshtein distance and the Hamming distance on our data, the samples that were additionally found to be the same person by the Levenshtein distance were mostly correct ones. One common case is a forgotten apostrophe (e.g. “yaara peretz” and “ya'ara peretz”) or a missing space (e.g. “yong chul cho” and “yongchul cho”). Some false positives are introduced, but these are mostly due to common names (e.g. “yan jia” and “yuan jia”). Based on the results of this comparison, we choose the Levenshtein distance in the final implementation.
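A compact sketch of the resulting name matching is given below; the Levenshtein distance follows Equation \ref{levenshtein_eq} directly (a dynamic programming version or a dedicated library would be faster).
\begin{verbatim}
from functools import lru_cache

@lru_cache(maxsize=None)
def lev(a, b):
    """Levenshtein distance, directly following the recursive definition."""
    if not b:
        return len(a)
    if not a:
        return len(b)
    if a[0] == b[0]:
        return lev(a[1:], b[1:])
    return 1 + min(lev(a[1:], b),        # delete a[0]
                   lev(a, b[1:]),        # insert b[0]
                   lev(a[1:], b[1:]))    # substitute a[0] by b[0]

def same_person(name_a, name_b):
    a, b = name_a.lower(), name_b.lower()
    if lev(a, b) <= 1:                   # typos, simplified special characters
        return True
    if min(len(a), len(b)) >= 15 and (a.startswith(b) or b.startswith(a)):
        return True                      # names truncated by a line break
    return set(a.split()) == set(b.split())   # swapped word order
\end{verbatim}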
To mark false positives, we add another attribute to the dataset that is set to 1 if the name of a participant has been detected twice in one of the earlier meetings. When this flag is set, the experience features contain an error. \\
In addition to the experience features that we add to our dataset, we obtain for each participant the information of which meetings they attended and within which affiliation. \\
Note that a delegation is one instance of an affiliation: each affiliation comes to a new meeting with a new delegation. To be able to compare delegations with respect to the experience of their participants, we need to define a metric for the experience of a delegation. We call this the \textbf{experience score} of an affiliation and define it as follows:
\begin{equation} \mathrm{ExperienceScore}(\text{delegation}) = \mathrm{avg}(\text{total experience of the 10 most experienced participants}) \end{equation}
The reason for only choosing the top 10 is that delegations are sometimes very large, with only a few participants actively involved in the negotiation process.
\subsection{Results}
We process the participant lists of 54 UNFCCC meetings, 26 COPs and 28 SBs. We find in total 271,434 participants, among which we identify 138,940 different persons. On average, we find 8,353 participants per COP meeting and 1,949 participants per SB meeting. We show in Figures \ref{fig:cop_overall} and \ref{fig:sb_overall} the total numbers of extracted participants for all the COP and SB meetings respectively. The category with the most participants is “Parties”, followed by “Non-governmental organizations”. The absence of the category “Non-governmental organizations” in COP 2 and SB 4 is due to a formatting error of the OCR process. In general, the number of participants has increased over time. The peak in the number of participants occurred at the COP 15 meeting in Copenhagen in 2009, followed by COP 21 in Paris in 2015. COP 15 was originally expected to produce a major agreement, but this failed. The major agreement, the Paris Agreement, was then reached at COP 21, which explains the second peak. Similarly, the SB meetings held earlier in the years 2009 and 2015 had more participants than the other SB meetings. \\
For the latest meetings, there is a gap between the number of detected participants and the number of participants stated in the index of the list. This gap is sometimes quite large, the maximum being 7,287 unlisted participants for COP 21. The UNFCCC explains this difference by the fact that only the participants who took part in the negotiation process are included in the list; the participants complementing the delegations are only included in the count but not in the list.
\begin{figure} \centering \begin{minipage}[ht]{.5\textwidth} \centering \includegraphics[width=0.9\textwidth]{participants_per_cop.png} \captionsetup{width=.8\linewidth} \captionof{figure}{Overview of the extracted participants of COP meetings} \label{fig:cop_overall} \end{minipage}% \begin{minipage}[ht]{.5\textwidth} \centering \includegraphics[width=0.9\textwidth]{participants_per_sb.png} \captionsetup{width=.8\linewidth} \captionof{figure}{Overview of the extracted participants of SB meetings} \label{fig:sb_overall} \end{minipage} \end{figure}
\subsubsection{Gender and Title}
The proportion of women has steadily increased since the first meetings. Starting at a rate of 21.4\% at the first SB meeting in 1995, it reached its peak so far of 47.3\% at SB 50 in 2019.
-Figure \ref{fig:gender} shows the continuously increasing trend of this measure, with a slight higher rate
+Figure \ref{fig:gender} shows the continuously increasing trend of this measure, with a slightly higher rate
of women at the SB meetings compared to the COP meetings.
\begin{figure}[ht] \centering \includegraphics[width=0.8\textwidth]{gender.png} \caption{Proportion of female participants per meeting} \label{fig:gender} \end{figure}
The UNFCCC secretariat publishes gender composition reports, as its goal is to reach gender balance at its meetings, which may lead to more gender-sensitive climate policies. These reports show that even if the numbers are approaching 50\% at the latest meetings, equality has not yet been reached. The proportion of women is lower when only looking at the Parties to the Convention, and it is also significantly lower when considering the heads of delegations. For example, for COP 24 we find an overall proportion of women of 42.8\%, while the gender composition report
-states a percentage of 38\% for party delegates and a percentage of 27\% in the heads of delegation \cite{UNFCCC_genderreport} \\
+states a percentage of 38\% for party delegates and a percentage of 27\% among the heads of delegation. \cite{UNFCCC_genderreport} \\
The number of participants with a title is generally rather low. For COP meetings, the average proportion of participants with a title is 3.9\%; for SB meetings, the average is 1.8\%.
\subsubsection{Roles}
The assigned roles are mainly of interest for parties, as the descriptions are the most exhaustive for their delegates and also contain more keywords. Figures \ref{fig:cop_roles} and \ref{fig:sb_roles} show the proportions of the assigned roles among party delegates for COP and SB meetings respectively. The main role is “Government”, with a usual proportion of 40-60\% of all party delegates being assigned this role. The proportion of governmental participants is higher at SB meetings, being almost always over 60\% after SB 16. “Security” is the second most common role, with usually more than 10\% of the party participants at COPs being assigned this role and about 5\% at SB meetings. The role “Diplomacy” is more present at COP meetings, which makes sense considering that SB meetings are mainly focused on negotiation and that ambassadors often just represent their country officially.
The role “no keyword found” in the plots contains the participants that did not match any keyword; the role “no description” contains the participants that did not have a description in the source document.
\begin{figure} \centering \begin{minipage}{.5\textwidth} \centering \includegraphics[width=1\linewidth]{roles_cop.png} \captionof{figure}{Assigned roles for COP meetings} \label{fig:cop_roles} \end{minipage}% \begin{minipage}{.5\textwidth} \centering \includegraphics[width=1\linewidth]{roles_sb.png} \captionof{figure}{Assigned roles for SB meetings} \label{fig:sb_roles} \end{minipage} \end{figure}
\subsubsection{Association to Fossil Fuel Industry}
The number of detected participants that openly represent the fossil fuel industry varies a lot from meeting to meeting. Figures \ref{fig:cop_fossil} and \ref{fig:sb_fossil} show the rate and absolute numbers of detected fossil fuel industry representatives for COP and SB meetings respectively. The average rate of participants with a fossil fuel industry association is 1.7\% for COP meetings and 2.7\% for SB meetings. These rates have decreased over the years as the number of participants has increased.
\begin{figure} \centering \begin{minipage}[ht]{.5\textwidth} \centering \includegraphics[width=1\linewidth]{ff_cop.png} \captionsetup{width=.8\linewidth} \captionof{figure}{Participants with fossil fuel industry association (COP)} \label{fig:cop_fossil} \end{minipage}% \begin{minipage}[ht]{.5\textwidth} \centering \includegraphics[width=1\linewidth]{ff_sb.png} \captionsetup{width=.8\linewidth} \captionof{figure}{Participants with fossil fuel industry association (SB)} \label{fig:sb_fossil} \end{minipage} \end{figure}
\subsubsection{Experience}
Over all meetings, we find 138,940 distinct participants. We identify 193 persons who have participated in at least half of the 54 processed meetings. The most experienced participants and their affiliations at COP 25 are the following: \begin{enumerate}
- \item Helmut Hojesky: Austria (26 COP, 27 SB) % TODO victor add more information?
+ \item Helmut Hojesky: Austria (26 COP, 27 SB)
\item Norine Kennedy: United States Council for International Business (25 COP, 28 SB) \item Manfred Treber: Germanwatch (26 COP, 26 SB) \end{enumerate}
We can investigate the different affiliations that participants have over time. Figure \ref{fig:exp_flow} shows an undirected graph in which nodes are affiliations and edges are participants changing from one affiliation to another between two meetings, tracked from COP 10 onwards. The weight of an edge increases by one for every detected change of affiliation of a participant, regardless of the direction. We only show the 20 edges with maximum weight that are between a party and another affiliation. The maximum weight edge is between “South Korea” and the NGO “Korea Chamber of Commerce and Industry”, with a total of 49 participant exchanges. The smallest edges shown in this graph have weight 12. \\
Other interesting connections that are not shown in this graph are for example exchanges between different NGOs. There is for example a strong exchange between the large NGOs “International Emissions Trading Association”, “World Business Council for Sustainable Development” and “International Chamber of Commerce”, which all represent the interests of business and are often linked with big companies operating in the fossil fuel industry.
\begin{figure}[ht] \centering \includegraphics[width=1\textwidth]{partflow_bipartite_biggestedges.png} \caption{Largest flows of participants between parties and other organizations after COP 10} \label{fig:exp_flow} \end{figure}
We use the Experience Score defined above to compare affiliations according to their experience. Figure \ref{fig:expscore_overview} shows the average Experience Score over all affiliations per meeting. The separation of the bars shows whether the experience was gained mostly at COP or at SB meetings. Affiliations at SB meetings have a higher Experience Score than at COP meetings, which can be explained by the fact that inexperienced participants attend SB meetings less often, as these meetings are more technical and attract less public attention.
\begin{figure}[ht] \centering \includegraphics[width=1\textwidth]{experiencescore_overview.png} \caption{Average Experience Score over time} \label{fig:expscore_overview} \end{figure}
diff --git a/report/introduction.tex b/report/introduction.tex index dcc81bc..aa14ffd 100644 --- a/report/introduction.tex +++ b/report/introduction.tex @@ -1,51 +1,51 @@
\section{Introduction}
\subsection{International Climate Negotiations}
-For decades, anthropogenic climate change is scientific consent, as well as the urge to introduce more political measures to fight
-its principal cause, greenhouse gas emissions. In short, the Intergovernmental Panel on Climate Change (IPCC) states in its special report
+For decades, anthropogenic climate change has been the scientific consensus, as has the urgency of introducing more political measures to fight
+its principal cause: greenhouse gas emissions. In short, the Intergovernmental Panel on Climate Change (IPCC) states in its special report
in 2018 that human activities have already caused a global warming of approximately 1 degree Celsius. To reduce the risks to natural and human systems, the warming should optimally be limited to 1.5 degrees Celsius, which would still require rapid and far-reaching transitions in many human-controlled systems. \cite{ipcc:2018} \\
The United Nations Framework Convention on Climate Change (UNFCCC) was opened for signature in 1992 at the UN Conference on Environment and Development in Rio de Janeiro. It entered into force in 1994 and is today signed by 196 countries and the European Union. The ultimate objective of the Convention is "the stabilization of greenhouse gas concentrations in the atmosphere at a level that would prevent dangerous anthropogenic interference with the climate system". \cite{UNFCCC} This goal is rather vague, but it is to be understood as a political reaction to the first IPCC assessment report, published in 1990. The Convention is meant to provide a framework for reaching international agreements on concrete policies against climate change. The first major agreement that resulted from the Convention was the Kyoto Protocol in 1997, which commits industrialized countries to measuring and limiting their greenhouse gas emissions according to individual targets. The second major agreement, the Paris Agreement, was reached in 2015. It is a legally binding treaty that aims to limit global warming to well below 2, preferably to 1.5 degrees Celsius. \cite{UNFCCC_process, evolution_UNFCCC} \\
The UNFCCC establishes different institutional arrangements for the negotiation process. The most impactful ones are the governing Supreme Bodies, to which the Conference of the Parties (COP) belongs.
At COP meetings, the Parties to the Convention meet to make decisions about the implementation of the Convention and other adopted legal instruments. Furthermore, there are Subsidiary Bodies (SB) that assist the governing bodies in their decision-making process. The Subsidiary Body for Scientific and Technological Advice (SBSTA) provides the latest research results on scientific and technological matters. The Subsidiary Body for Implementation (SBI) assists the governing bodies in questions related to the implementation of the Convention and the agreements. SB meetings are held twice a year, once at the same time as the COPs. \cite{UNFCCC_process} As there is less public attention on SB meetings, the actual negotiations play a more important role there. Other bodies of the UNFCCC exist, e.g. the process management bodies and the secretariat, but they are not relevant for our project.
\subsection{Project}
\subsubsection{Larger Project} \label{tatiana}
This semester project is part of a larger research project that aims to study country delegation characteristics and patterns of international cooperation. It aims to quantify and qualify the gap between international promises concerning climate change and national implementation. We collaborate with the political scientists Marlene Kammerer (University of Bern) and Paula Castro (University of Zurich). \cite{larger_project} \\
In 2020, Victor Kristof and Tatiana Cogne processed data of the Earth Negotiation Bulletin. Their raw dataset included detailed summaries of international climate negotiation meetings organized by the UNFCCC. They extracted data about interventions and interactions of parties and coalitions at the UNFCCC meetings. An intervention of a party is when this party speaks during a meeting. An interaction is for example one party supporting another party, agreeing with another party or opposing another party. \cite{proj_tatiana}
\subsubsection{Our Project}
For many negotiation meetings, the UNFCCC secretariat publishes a list of participants. These lists are PDF files and therefore not easy to process. Considering the quantity of participants per meeting, we want to extract the information contained in the original participant lists and bring it into a more convenient format. More information about the dataset is provided in section \ref{dataset}. Afterwards, we would like to process this information to derive as many insightful features as possible. The problems we try to solve can therefore be stated as follows: \begin{enumerate} \item Extract the data of the participant lists (PDF files) and convert it to a convenient format. \item Process and analyze the dataset to extract insightful delegation features. \item Create a predictive model for the existing intervention data (see section \ref{tatiana}). \end{enumerate}
Due to time constraints, we were not yet able to put as much effort as needed into the last problem.
\ No newline at end of file
diff --git a/report/predictive_modelling.tex b/report/predictive_modelling.tex index 07492c9..4792637 100644 --- a/report/predictive_modelling.tex +++ b/report/predictive_modelling.tex @@ -1,246 +1,252 @@
\section{Predictive Modelling} \label{predictive_modelling}
Having extracted and processed the data contained in the participant lists, we use it to build predictive models for other data. First, we build linear models for the data on interventions at UNFCCC meetings collected by Tatiana Cogne and Victor Kristof (see \ref{tatiana}).
Note that we can't go further on this topic due to time constraints, but there is more potential for creating models with our data, especially for the interaction dataset also collected by Tatiana Cogne and Victor Kristof.
\subsection{Predict Interventions}
The intervention data lists, for different UNFCCC meetings, how many times each party intervened at that meeting. We build a model that predicts, for a given party and meeting, the number of interventions of that party at that meeting. Figure \ref{fig:interv_distr} plots the distribution of the interventions, i.e., the distribution of the labels of the complete dataset.
-Most parties don't have any intervention or only one, while some parties intervene a lot more.
+Most parties have no intervention or only one, while some parties intervene a lot more.
\begin{figure}[ht] \centering \includegraphics[width=0.7\textwidth]{distr_interventions.png} \caption{Distribution of the intervention labels} \label{fig:interv_distr} \end{figure}
\subsubsection{Data Samples}
We define a data sample $\boldsymbol{x}_i$ as the participation of a party at a meeting. Note that we only consider parties and no other affiliations, as only parties can make interventions in the official negotiations. We define the following features for a data sample:
\begin{center} \begin{tabularx}{\textwidth}{|c|c|X|} \hline Name & Value Range & Description \\ \hline\hline year & $\{0, 1, \dots, 24\}$ & The year the meeting took place minus 1995 \\ \hline number\_of\_delegates & $\{1, 2, \dots, 1589\}$ & Number of participants of this delegation \\ \hline meeting\_type & $\{0, 1\}$ & 0 if the meeting is a COP, 1 if it's an SB \\ \hline government\_rate & $[0,1]$ & Proportion of delegates with role "Government" \\ \hline diplomacy\_rate & $[0,1]$ & Proportion of delegates with role "Diplomacy" \\ \hline security\_rate & $[0,1]$ & Proportion of delegates with role "Security" \\ \hline press\_rate & $[0,1]$ & Proportion of delegates with role "Press" \\ \hline university\_rate & $[0,1]$ & Proportion of delegates with role "Universities" \\ \hline no\_description\_rate & $[0,1]$ & Proportion of delegates with no description \\ \hline no\_keyword\_rate & $[0,1]$ & Proportion of delegates with no detected keyword \\ \hline nb\_fossil\_fuel\_industry\_associations & $\{0, 1, \dots, 26\}$ & Absolute number of delegates with an association to the fossil fuel industry \\ \hline woman\_proportion & $[0,1]$ & The proportion of female participants in the delegation \\ \hline experience\_score\_cop & $[0,18]$ & The experience score on previous COPs of the delegation \\ \hline experience\_score\_sb & $[0,17]$ & The experience score on previous SBs of the delegation \\ \hline experience\_score\_parties\_rate & $[0,1]$ & The proportion of the total experience score that has been acquired in the category "Parties" \\ \hline is\_Afghanistan & $\{0, 1\}$ & 1 if the delegation is Afghanistan \\ \hline is\_Albania & $\{0, 1\}$ & 1 if the delegation is Albania \\ \hline $\vdots$ & $\vdots$ & $\vdots$ \\ \hline is\_Zimbabwe & $\{0, 1\}$ & 1 if the delegation is Zimbabwe \\ \hline is\_unrecognized\_country & $\{0, 1\}$ & 1 if no party has been detected \\ \hline \end{tabularx} \end{center}
There is a total of $213$ features. The feature \textit{year} is the year the meeting took place minus 1995, the year of the first meeting (SB 1), to obtain values closer to zero.
The features \textit{government\_rate} to \textit{no\_keyword\_rate} correspond to the proportion of each role that we assign (see \ref{roles}). For the experience score, we provide COP and SB experience in total numbers; they sum up to the total experience score of an affiliation. The \textit{experience\_score\_parties\_rate} denotes the rate of the total experience score that has been acquired in parties (see \ref{experience}). The information about the parties is converted into 198 binary features, one for each of the 197 Parties to the Convention and one for an invalid or unrecognized country. \\
In total, we have 9218 data samples. We randomly pick about 80\% of these samples, i.e. 7400 samples, as our training set. The remaining samples form our test set.
\subsubsection{Models}
% baseline models
We first build two \textbf{baseline models}, such that we are later able to compare our models to these simple models.
-The first baseline model consists simply of always predicting zero interventions, as this is the most common label. Hence,
+The first baseline model is a majority-vote predictor which always predicts zero interventions, as this is the most common label. Hence,
\begin{equation} \hat{y_i} = 0 \text{,} \end{equation}
where $\hat{y_i}$ is the label that we predict for sample $\boldsymbol{x}_i$. The second, slightly more sophisticated baseline model consists in computing the average number of interventions a party did over all meetings in the training data
-and always predict this average. Formally,
+and always predicting this average. Formally,
\begin{equation} \hat{y_i} = \frac{1}{N_{p(i)}} \sum_{k=1}^{N_{p(i)}} y_k^{p(i)} , \end{equation}
where $p(i)$ is the party of sample $\boldsymbol{x}_i$ and $\{y_1^{p(i)}, y_2^{p(i)}, \dots, y_{N_{p(i)}}^{p(i)}\}$ are the numbers of interventions of party $p(i)$ in the training samples. \\
Let $N = 7400$ be the number of training samples and let $D = 213$ be the number of features. Let
\begin{equation}
- \boldsymbol{x}_i' =
+ \tilde{\boldsymbol{x}}_i =
\begin{bmatrix} \boldsymbol{x}_i \\ 1 \end{bmatrix} \in \mathbb{R}^{D + 1} \end{equation}
be a feature vector augmented with 1 to include the global bias. Let
\begin{equation} \boldsymbol{X} = \begin{bmatrix}
- \boldsymbol{x}_1'^T \\
- \boldsymbol{x}_2'^T \\
+ \tilde{\boldsymbol{x}}_1^T \\
+ \tilde{\boldsymbol{x}}_2^T \\
\vdots \\
- \boldsymbol{x}_N'^T
+ \tilde{\boldsymbol{x}}_N^T
\end{bmatrix} \end{equation}
be a matrix that contains all the augmented feature vectors. And let
\begin{equation} \boldsymbol{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \in \mathbb{R}^N \end{equation}
be the vector of training labels. \\
% linear model
Next, we introduce a \textbf{ridge regression} model. For this model, we normalize the features before training. For a feature $x_{i,j}$ we compute
\begin{equation} x_{i,j}' = \frac{x_{i,j} - \mu}{l}, \end{equation}
-where $\mu$ is the mean and $l$ is the l2-norm of all the features $x_{k,j} \enspace \forall k \in \{1, 2, \hdots, N\}$.
-To fit this model, we compute the optimal parameters $\boldsymbol{w}^* \in \mathbb{R}^{D + 1}$
+where $\mu \in \mathbb{R}$ is the mean of all the features $x_{k,j}$ and $l \in \mathbb{R}_{\ge 0}$ is the l2-norm of all the features $x_{k,j} \enspace \forall k \in \{1, 2, \hdots, N\}$.
+To fit this model, we compute the optimal parameters $\boldsymbol{w}^* \in \mathbb{R}^{D + 1}$ as
\begin{equation}
- \boldsymbol{w}^* = (\boldsymbol{X}^T \boldsymbol{X} + \lambda \boldsymbol{I}_N)^{-1} \boldsymbol{X}^T \boldsymbol{y} \text{.}
+ \boldsymbol{w}^* = (\boldsymbol{X}^T \boldsymbol{X} + \alpha \boldsymbol{I}_{D+1})^{-1} \boldsymbol{X}^T \boldsymbol{y} \text{,}
\end{equation}
+where $\alpha \geq 0$ is a regularizer.
The predictions are then computed as
\begin{equation}
- \hat{y_i} = \boldsymbol{x}_i'^T \boldsymbol{w}^* \text{.}
+ \hat{y_i} = \tilde{\boldsymbol{x}}_i^T \boldsymbol{w}^* \text{.}
\end{equation}\\
% linear model with logarithmic transformation
As our labels seem to be distributed exponentially, we try another approach where we perform a \textbf{logarithmic transformation} on our labels before applying the same linear model as before. We transform each label $y$ such that $y' = \log(c + y)$ with $c > 0$. We arbitrarily choose $c = 1$ so that $y' = 0$ when $y = 0$. The fitting of the model and the predictions are then made exactly as for the ridge regression model. \\
% mixed model
A next approach tries to handle the large number of zero interventions better.
-Similar to others before us, we introduce a \textbf{two-step model} to better handle the massive count of zero labels. \cite{ridout1998models, fletcher2005modelling, martin2005modelling}
+Similar to others before us \cite{ridout1998models, fletcher2005modelling, martin2005modelling},
+we introduce a \textbf{two-step model} to better handle the massive count of zero labels.
Our two-step model works as follows: \begin{enumerate}
- \item Predict for each sample if the number of interventions will be zero or non-zero.
- \item For the non-zero sample, apply a second model to predict the label.
+ \item Classify each sample as zero interventions or non-zero.
+ \item For the non-zero samples, predict the number of interventions (at least 1).
\end{enumerate}
In the first step, we use a logistic regressor with regularization. In the second step, we use a Poisson regressor with regularization. Mathematically, the prediction is
\begin{equation} \hat{y_i} = \begin{cases} 0 & \text{if } \sigma (\tilde{\boldsymbol{x}}_i^T \boldsymbol{w}^*) < 0.5 \\ \exp(\tilde{\boldsymbol{x}}_i^T \boldsymbol{z}^*) & \text{otherwise} \end{cases} \text{,} \end{equation}
where $\sigma$ is the logistic function, $\boldsymbol{w}^*$ is learned as the parameters of a logistic regression, and $\boldsymbol{z}^*$ is learned as the parameters of a Poisson regression on those training samples $\boldsymbol{x}_i$ for which $y_i > 0$. Formally, the fitting of the Poisson regressor is done as follows. We model the count data $ y_i $ as
\begin{equation} P(y_i | \tilde{\boldsymbol{x}}_i, \boldsymbol{z}^*) = \frac{\lambda^{y_i} e^{-\lambda}}{y_i !} \text{,} \end{equation}
-where $ \lambda = \exp(\boldsymbol{x}_i^T \boldsymbol{z}^*) $ is the rate of number of interventions.
+where $ \lambda = \exp(\tilde{\boldsymbol{x}}_i^T \boldsymbol{z}^*) $ is the rate of interventions.
Let $ \boldsymbol{X}' = [\tilde{\boldsymbol{x}}_i \mid y_i > 0] \in \mathbb{R}^{N' \times (D + 1)} $ be the training samples for which the number of interventions is not zero.
Then the log-likelihood of this model is
\begin{equation}
- \Likeh (\boldsymbol{z}^* | \boldsymbol{X}', \boldsymbol{y}) = \sum_{i=1}^{N'} [y_i \log(\lambda) - \lambda - \log(y_i!)]
+ \Likeh (\boldsymbol{z} | \boldsymbol{X}', \boldsymbol{y}) = \sum_{i=1}^{N'} [y_i \log(\lambda_i) - \lambda_i - \log(y_i!)],
\end{equation}
where $\lambda_i = \exp(\tilde{\boldsymbol{x}}_i^T \boldsymbol{z})$, and the optimal parameters are obtained as
\begin{equation}
- \boldsymbol{z}^* = \underset{\boldsymbol{z}^*}{\mathrm{argmin}} [\Likeh (\boldsymbol{z}^* | \boldsymbol{X}', \boldsymbol{y}) + \alpha \| \boldsymbol{z}^* \|] \text{,}
+ \boldsymbol{z}^* = \underset{\boldsymbol{z}}{\mathrm{argmin}} [-\Likeh (\boldsymbol{z} | \boldsymbol{X}', \boldsymbol{y}) + \alpha \| \boldsymbol{z} \|] \text{,}
\end{equation}
-where $\alpha \in \mathbb{R}$ is the regularizor.
+where $\alpha \geq 0$ is the regularizer.
+
+
+
+
\subsubsection{Results}
We will compare our models by the root-mean-square error (RMSE) between the predicted number of interventions $\hat{y_i}$ and the true values $y_i$.
-The root-mean-square error is defined as the root of the mean squared error (MSE), i.e. for $n$ samples,
+The root-mean-square error for $n$ samples is computed as
\begin{equation}
- RMSE = \sqrt{MSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y_i} - y_i)^2}
+ RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y_i} - y_i)^2}
\end{equation}
Figure \ref{fig:RMSE_results} shows the RMSE of the different models.
\begin{figure}[ht] \centering \includegraphics[width=0.9\textwidth]{resultsRMSE.png} \caption{Resulting RMSE of the different models} \label{fig:RMSE_results} \end{figure}
First, we consider the \textbf{baseline models}. When always predicting zero interventions, the test data yields $ RMSE = 9.54 $. When we always predict the average number of interventions of the party in question over all samples in the training data, the test data yields $ RMSE = 5.02 $. This shows that the identity of the party alone already gives a lot of information about its behavior during meetings. \\
The \textbf{ridge regression} model with all features yields an $ RMSE = 5.01 $.
-The optimal solution was found with cross-validation at regularizer $\lambda = 0.0101 $.
+The optimal solution was found with cross-validation at regularizer $\alpha = 0.0101 $.
We can analyze the features with the strongest influence on the prediction. The bias of the whole dataset is $ w^*_0 = 3.281 $. The features with the strongest influence on the predictions are the parties, as we expect given that the second baseline model already works quite well. The highest tendency towards many interventions per meeting is shown by the European Union ($ + 74.7 $), the United States ($ + 53.3 $) and China ($ + 48.0 $). Cote d'Ivoire ($ - 2.58 $), San Marino ($ - 2.56 $) and Greece ($ - 2.48 $) are the parties that bias the predictions the most towards few interventions. When considering only non-party features, the ones contributing most towards more interventions are \textit{press\_rate} ($ + 2.51 $), \textit{university\_rate} ($ + 1.11 $) and \textit{experience\_score\_parties\_rate} ($ + 0.70 $). The non-party features that lower the predicted number of interventions the most are \textit{no\_description\_rate} ($ - 1.67 $), \textit{diplomacy\_rate} ($ - 0.82 $) and \textit{no\_keyword\_rate} ($ - 0.43 $). Interestingly, the year and the number of delegates are the features with the weakest influence on the prediction. Apparently, time and delegation size have only a rather small influence on the activity of a party.
\\ The \textbf{logarithmic transformation} unfortunately does not help the model to make better predictions.
-It yields an $ RMSE = 5.76 $. The problem is that even with a logarithmic transformation, the label still don't follow a Gaussian distribution
+It yields an $ RMSE = 5.76 $. The problem is that even with a logarithmic transformation, the labels still don't follow a Gaussian distribution
at all, as Figure \ref{fig:interv_distr_logtransf} shows. \\
\begin{figure}[ht] \centering \includegraphics[width=0.7\textwidth]{distr_interventions_logtransf.png} \caption{Distribution of the intervention labels after the logarithmic transformation} \label{fig:interv_distr_logtransf} \end{figure}
The \textbf{two-step model} only slightly improves the prediction.
-The first step correctly classifies 79.2\% of the test samples into zero or non-zero, with an optimal regularizer of $\lambda = 1.035 $.
+The first step correctly classifies 79.2\% of the test samples into zero or non-zero, with an optimal regularizer of $\alpha = 1.035 $.
The second step predicts the number of interventions on the samples that have been classified as non-zero by the logistic regressor. For only those samples, i.e., the ones that have been predicted to be non-zero, it yields an $ RMSE = 9.89 $. When looking at the final prediction of all the test samples, the two-step model yields an $ RMSE = 4.94 $. This is a slight improvement compared to the previous models.
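To make the two-step model concrete, the following is a minimal sketch using scikit-learn's \texttt{LogisticRegression} and \texttt{PoissonRegressor}. It is an assumed implementation rather than the exact code behind the reported results; the regularization strengths would be chosen by cross-validation as described above.
\begin{verbatim}
import numpy as np
from sklearn.linear_model import LogisticRegression, PoissonRegressor

def fit_two_step(X_train, y_train):
    # Step 1: classify samples into zero vs. non-zero interventions
    # (L2-regularized logistic regression).
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train > 0)
    # Step 2: regularized Poisson regression on the non-zero samples only.
    nonzero = y_train > 0
    reg = PoissonRegressor(alpha=1.0)
    reg.fit(X_train[nonzero], y_train[nonzero])
    return clf, reg

def predict_two_step(clf, reg, X):
    y_hat = np.zeros(X.shape[0])
    mask = clf.predict(X)               # True where non-zero is predicted
    y_hat[mask] = reg.predict(X[mask])  # exp(x^T z) for those samples
    return y_hat
\end{verbatim}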