Recent Commits
Commit | Author | Details | Committed
---|---|---|---
5f2fedab201e | Jan Linder | Get rid of typo in example usage | Feb 26 2021
b26ea4bf1b49 | Jan Linder | Move the readme to base and add example for extract_participants.py | Feb 26 2021
f3a3b0f515a6 | Jan Linder | Add a description/documentation on top of every script | Feb 26 2021
400d505940ea | Jan Linder | Fix the formatting error on c4science | Feb 26 2021
5304499a84a6 | Jan Linder | Add level 2 header overview | Feb 26 2021
080703aa39de | Jan Linder | Add one line for formatting reasons. | Feb 26 2021
41b5d3c78eeb | Jan Linder | Correct format issue in readme | Feb 26 2021
e90bd05eb2bc | Jan Linder | Add readme for the library | Feb 26 2021
eb127c4fc1b8 | Jan Linder | Improve the readme file(s) | Feb 26 2021
4171a26f38e6 | Jan Linder | Remove unused hamming experience file | Feb 26 2021
6b1b73715ac6 | Jan Linder | Remove todo | Feb 26 2021
54ebea072ef3 | Jan Linder | Add file to sort the columns of dataset | Feb 4 2021
1b65f23eeae2 | Jan Linder | Add final presentation | Jan 25 2021
c622d6fd5582 | Jan Linder | Fix experience distribution plots | Jan 25 2021
7b75ded24cce | Jan Linder | Add numbers to the result plot | Jan 24 2021
readme.md
Bachelor project, Jan Linder & Victor Kristof
This is a project from the INDY lab at EPFL that is part of a larger research project on international climate negotiations. (More information about the larger project can be found here) This repository contains the code to extract information from the participant lists of UNFCCC meetings. The code directory is organized into the following subdirectories:
- data: contains the raw data, i.e., the PDF participant lists (either in the directory data/COP or data/SB, depending on whether the meeting is a Conference of the Parties (COP) or a Subsidiary Bodies (SB) meeting). Note that every new meeting to process must also be added to the data/meetings_metadata.csv file.
- lib: contains the library to extract the data from the PDF files to .txt and then to CSV files. To use this, go to the scripts folder.
- results: contains all the results once generated.
- scripts: contains all the important scripts that use the lib code to extract the data.
The repository also contains the report and the presentations of the project.
Important scripts & how to use them
In the following, we explain how to use the most important scripts in this repository. They give access to the library and its functionality.
extract_participants.py
extract_participants.py <meetingLabel> (<intermediateFilename> <outputFilename>)
- @meetingLabel (str): which meeting to handle. Example: "cop1" or "sb40".
- @intermediateFilename (str), optional: the .txt filename where the extracted text is stored. If it already exists, the PDF-to-text conversion is not performed again.
- @outputFilename (str), optional: the output filename (.csv).
Extracts the data from the raw data in two steps (the PDF participant list must be provided in the data folder, either COP or SB). The first step converts the PDF to a text file; the second step converts the text file to a CSV file. If a text file already exists for this meeting (in results/participants-txt), the PDF-to-text extraction is skipped. The result is a CSV file for _one_ meeting.
Example usage: python extract_participants.py cop25 cop25.txt cop25.csv or simply python extract_participants.py cop25.
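The "skip the first step if the text file already exists" behaviour can be sketched as follows. This is a minimal illustration, not the code in lib: the function name `extract_text_cached` and the `pdf_to_text` callback are hypothetical, and the actual PDF extraction (pdfminer.six or OCR) is deliberately left to the caller.

```python
from pathlib import Path
from typing import Callable


def extract_text_cached(
    pdf_path: Path,
    txt_path: Path,
    pdf_to_text: Callable[[Path], str],  # hypothetical callback doing the real extraction
) -> str:
    """Return the text of pdf_path, reusing txt_path if it already exists."""
    if txt_path.exists():
        # Intermediate file found: skip the expensive PDF-to-text step.
        return txt_path.read_text(encoding="utf-8")
    text = pdf_to_text(pdf_path)
    # Store the intermediate result so that later runs can reuse it.
    txt_path.parent.mkdir(parents=True, exist_ok=True)
    txt_path.write_text(text, encoding="utf-8")
    return text
```

With such a helper, only the text-to-CSV step is repeated on a second run, which matters most for the scanned lists, where the first step is slow.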
do_all.py
Calls the script extract_participants.py for all the meetings listed in the metadata file. Ideally, you should run generate_complete_dataset.py afterwards to update the complete dataset.
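A loop of this kind could look roughly like the sketch below. It is an assumption-laden illustration rather than the content of do_all.py: the column name `label` in data/meetings_metadata.csv and the helper `build_commands` are hypothetical.

```python
import csv
import sys


def build_commands(metadata_file, script="extract_participants.py"):
    """Return one command (as an argument list) per meeting in the metadata.

    metadata_file is an open text file or any iterable of CSV lines.
    The column name "label" is an assumption about the metadata layout.
    """
    reader = csv.DictReader(metadata_file)
    return [[sys.executable, script, row["label"]] for row in reader]
```

Each returned command can then be executed with `subprocess.run(cmd, check=True)`, so that a failure on one meeting stops the batch instead of passing unnoticed.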
generate_complete_dataset.py
Generates one complete dataset of all meetings specified in data/meetings_metadata.csv with more features. The CSV files for all the meetings must be provided in the results/participants-csv directory. Generates a CSV file called complete_dataset.csv.
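As a rough sketch of the merging step only (the additional features computed by the real script are not reproduced here), per-meeting CSV files can be concatenated with the standard library. The function name, the `<label>.csv` file naming and the added `meeting` column are assumptions made for this example.

```python
import csv
from pathlib import Path


def merge_meeting_csvs(csv_dir: Path, labels: list[str], output: Path) -> int:
    """Concatenate <label>.csv files into one CSV and return the row count.

    A hypothetical "meeting" column records which file each row came from.
    """
    rows = []
    fieldnames = ["meeting"]
    for label in labels:
        with open(csv_dir / f"{label}.csv", newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            # Collect the union of all column names, in first-seen order.
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            for row in reader:
                rows.append({"meeting": label, **row})
    with open(output, "w", newline="", encoding="utf-8") as f:
        # Columns missing from a given meeting are written as empty strings.
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```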
sort_final_dataset.py
Takes the complete dataset and sorts the columns to a specified order.
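One possible way to compute such a column order is sketched below. The helper name and the policy of appending unlisted columns at the end are assumptions; the script may handle unlisted columns differently.

```python
def ordered_fieldnames(existing, preferred):
    """Put the preferred columns first and keep the remaining ones in their original order."""
    head = [col for col in preferred if col in existing]
    tail = [col for col in existing if col not in head]
    return head + tail
```

The resulting list can be passed as `fieldnames` to `csv.DictWriter` when rewriting the dataset.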
find_experience.py
This script computes the experience features, i.e., it links the different occurrences of the same person across the participant lists. It requires the file complete_dataset.csv to exist. The script is long-running and can take about 10 hours to complete. More information about the different features can be found in the folder results.
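To illustrate what linking occurrences of the same person involves, here is a small sketch based on fuzzy name matching with the standard library. It is not the method used in find_experience.py: the normalization, the similarity measure (`difflib.SequenceMatcher`) and the 0.9 threshold are all assumptions chosen for the example.

```python
import difflib
import unicodedata


def normalize(name: str) -> str:
    """Lower-case, strip accents and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def same_person(a: str, b: str, threshold: float = 0.9) -> bool:
    """Treat two names as the same person if they are similar enough (assumed threshold)."""
    ratio = difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold


def count_previous_meetings(lists, order):
    """For each (meeting, name), count the earlier meetings that list a matching name.

    lists maps a meeting label to its participant names; order gives the
    meeting labels in chronological order.
    """
    experience = {}
    for i, meeting in enumerate(order):
        for name in lists[meeting]:
            experience[(meeting, name)] = sum(
                any(same_person(name, other) for other in lists[earlier])
                for earlier in order[:i]
            )
    return experience
```

Comparing every name against every name in all earlier lists grows quadratically with the number of participants, which gives an idea of why a script of this kind can run for hours on the full dataset.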
generate_plots.py
Generates plots with matplotlib, using the code in scripts/plots.
prepare_intervention_data.py
To predict the intervention data collected by Tatiana Cogne and Victor Kristof, their output must first be prepared for our prediction model. This script looks for the data (interventions.csv and list_meetings.csv) in the folder code/data/data_tatiana and processes them to create the files dataset_interventions, interventions_prepared.csv and interventions_aff.csv. Note that the script needs to be rerun every time the complete dataset or the intervention data changes. The prepared data is then used in the notebook predict_interventions.ipynb.
predict_interventions.ipynb
Jupyter notebook containing models that try to predict the number of interventions based on our complete_dataset.csv.
Information about the original lists
The participant lists were taken from the following official website in September 2020: https://unfccc.int/process/bodies/supreme-bodies/conference-of-the-parties-cop
Please note the following issues in the source:
The lists for COPs 1-4, 7 and 8 are scans and are extracted with Optical Character Recognition using the package _pytesseract_, a wrapper for Google's tesseract-ocr engine (version 5.0.0). The results for those lists are expected to contain typos.
- COP 2: The whole list is written in French.
- COP 3: Officially stated 710 "overflow participants" that are not in the list.
- COP 4: Very bad scan quality on p. 86 and p. 92. At least one participant unreadable.
- COP 8: Generally bad scan quality. Might contain more errors.
The other lists are extracted using the package _pdfminer.six_.
- COP 5: Problem with special characters in the source. For example consider the names Sr. Ra™l CASTELLINI or Sra. MarÌa Fernanda CA‹¡S.
- COP 6: Two COP 6 sessions were held, half a year apart. This is why the meeting label "cop6b" exists.
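Meeting labels such as "cop1", "sb40" or "cop6b" carry the meeting type, its number and an optional suffix. A hedged sketch of how such a label could be parsed and mapped to the data/COP or data/SB directory is shown below; the regular expression and the function names are assumptions for this example, not the library's API.

```python
import re
from pathlib import Path

# Assumed label format: "cop" or "sb", a number, and an optional letter (e.g. "cop6b").
LABEL_RE = re.compile(r"^(cop|sb)(\d+)([a-z]?)$")


def parse_label(label: str) -> tuple[str, int, str]:
    """Split a meeting label into (kind, number, suffix)."""
    match = LABEL_RE.match(label.lower())
    if match is None:
        raise ValueError(f"unrecognized meeting label: {label!r}")
    kind, number, suffix = match.groups()
    return kind, int(number), suffix


def data_dir(label: str, root: Path = Path("data")) -> Path:
    """Return the raw-data directory for a label: data/COP or data/SB."""
    kind, _, _ = parse_label(label)
    return root / kind.upper()
```

Rejecting unknown labels early gives a clear error before any PDF is opened.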