## Bachelor project, Jan Linder


### extract_participants_copX.py : How to use
`extract_participants.py <meetingLabel> (<intermediateFilename> <outputFilename>)`

- `@meetingLabel` (str): which meeting to handle. Example: "cop1" or "sb40"
- `@intermediateFilename` (str), optional: name of the .txt file in which the extracted text is stored. If the file already exists, the PDF-to-text conversion is not run again.
- `@outputFilename` (str), optional: the output filename (.csv)
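
The argument handling described above could be sketched roughly as follows. This is not the script's actual code: the function names `parse_args` and `load_text`, the default filenames `<meetingLabel>.txt` and `<meetingLabel>.csv`, and the injected `convert` callable are all assumptions made for illustration.

```python
import argparse
import os


def parse_args(argv):
    """Parse one required and two optional positional arguments."""
    parser = argparse.ArgumentParser(prog="extract_participants.py")
    parser.add_argument("meetingLabel", help='meeting to handle, e.g. "cop1" or "sb40"')
    parser.add_argument("intermediateFilename", nargs="?", default=None,
                        help=".txt file holding the extracted text")
    parser.add_argument("outputFilename", nargs="?", default=None,
                        help=".csv file for the participant list")
    args = parser.parse_args(argv)
    # Hypothetical defaults: the real script may derive these names differently.
    if args.intermediateFilename is None:
        args.intermediateFilename = args.meetingLabel + ".txt"
    if args.outputFilename is None:
        args.outputFilename = args.meetingLabel + ".csv"
    return args


def load_text(intermediate_filename, convert):
    """Return the extracted text, calling convert() only if no cached file exists."""
    if os.path.exists(intermediate_filename):
        with open(intermediate_filename, encoding="utf-8") as handle:
            return handle.read()
    text = convert()  # stands in for the PDF-to-text step (OCR or pdfminer.six)
    with open(intermediate_filename, "w", encoding="utf-8") as handle:
        handle.write(text)
    return text
```

Passing the conversion step in as a callable keeps the caching logic independent of whether a list is a scan or a text-based PDF.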


### How it works

#### Analyzer
The Analyzer is responsible for parsing the txt file into a list of participants / a dataframe. There are three categories of analyzers:
- COP 1 - COP 5: The titles of the affiliations are written in uppercase letters.
- COP 7 and COP 8: These lists are scans, so the txt file has a less consistent structure. We use the list of affiliations pulled from all the other COPs to find the affiliations here.
- The rest (COPnewer): PDFs extracted with pdftotxt, no scans.
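
As a rough illustration of the three categories, the sketch below picks a category from the meeting label, detects uppercase affiliation titles, and matches noisy OCR lines against known affiliations. It is not the project's Analyzer: all function names, the category strings, the label pattern `cop<number>`, and the fuzzy-matching approach with `difflib` and a cutoff of 0.8 are assumptions.

```python
import difflib
import re


def analyzer_category(meeting_label):
    """Map a meeting label to one of the three analyzer categories."""
    match = re.fullmatch(r"cop(\d+)", meeting_label.lower())
    if match:
        number = int(match.group(1))
        if 1 <= number <= 5:
            return "uppercase_titles"
        if number in (7, 8):
            return "known_affiliations"
    # Assumption: every other label (e.g. "sb40") uses the newer analyzer.
    return "newer"


def is_uppercase_title(line):
    """True if the line contains letters and none of them is lowercase."""
    stripped = line.strip()
    return any(ch.isalpha() for ch in stripped) and stripped == stripped.upper()


def match_affiliation(line, known_affiliations, cutoff=0.8):
    """Return the known affiliation closest to a (possibly misread) line, or None."""
    by_key = {name.casefold(): name for name in known_affiliations}
    hits = difflib.get_close_matches(
        line.strip().casefold(), list(by_key), n=1, cutoff=cutoff
    )
    return by_key[hits[0]] if hits else None
```

Fuzzy matching is used here because OCR output tends to contain single-character misreads; an exact lookup would miss a line such as "Switzer1and".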


### Information about the original lists
The participant lists were taken from the following official website in September 2020: https://unfccc.int/process/bodies/supreme-bodies/conference-of-the-parties-cop
Please note the following issues in the source:
COPs 1 - 4, 7 and 8 are scans and are extracted with **Optical Character Recognition** using the package _**pytesseract**_, a wrapper for Google's tesseract-ocr engine (version 5.0.0). The results for those lists are expected to contain typos.
- COP 2: The whole list is written in French.
- COP 3: The source officially mentions 710 "overflow participants" who are not in the list.
- COP 4: Very bad scan quality on p. 86 and p. 92. At least one participant unreadable.
- COP 8: Generally bad scan quality. Might contain more errors.

The other lists are extracted using the package _**pdfminer.six**_.
- COP 5: Problem with special characters in the source. For example consider the names Sr. Ra™l CASTELLINI or Sra. MarÌa Fernanda CA‹¡S.
- COP 6: Note that two COP 6 sessions were held, separated by half a year. To extract the data from the second COP 6, please use TODO.