## Bachelor project, Jan Linder

### extract_participants_copX.py : How to use

`extract_participants.py ( )`

`@meetingLabel` (str): which meeting to handle. Example: "cop1" or "sb40"

`@intermediateFilename` (str), optional: the filename (.txt) where the extracted text is stored. If it already exists, the PDF-to-text conversion is not performed again.

`@outputFilename` (str), optional: the output filename (.csv)

### How it works

#### Analyzer

The Analyzer is responsible for parsing the txt file into a list of participants / a dataframe. There are three categories of analyzers:

- COP 1 - COP 5: The titles of the affiliations are written in uppercase letters.
- COP 7 and COP 8: These lists are scans, which gives the txt file a less consistent structure. We use the list of affiliations pulled from all the other COPs to find the affiliations here.
- The rest (COPnewer): PDFs extracted with pdftotxt, no scans.

### Information about the original lists

The participant lists were taken from the following official website in September 2020: https://unfccc.int/process/bodies/supreme-bodies/conference-of-the-parties-cop

Please note the following issues in the source:

COPs 1 - 4, 7 and 8 are scans and are extracted with **Optical Character Recognition** using the package _**pytesseract**_, a wrapper for Google's tesseract-ocr engine (version 5.0.0). The results for those lists are expected to contain typos.

- COP 2: The whole list is written in French.
- COP 3: Officially stated 710 "overflow participants" that are not in the list.
- COP 4: Very bad scan quality on p. 86 and p. 92. At least one participant is unreadable.
- COP 8: Generally bad scan quality. Might contain more errors.

The other lists are extracted using the package _**pdfminer.six**_.

- COP 5: Problem with special characters in the source. For example, consider the names Sr. Ra™l CASTELLINI or Sra. MarÌa Fernanda CA‹¡S.
- COP6: Note that two COP6 were held, separated by half a year.
To extract the data from the second COP6, please use TODO.
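
#### Sketch: choosing an analyzer and reusing the intermediate file

The analyzer categories and the reuse of the intermediate .txt file described above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the script's actual code: the names `analyzer_category`, `load_or_extract_text`, and the three category constants are hypothetical, and the real extraction step (pytesseract or pdfminer.six) is represented only by a callable passed in by the caller.

```python
import re
from pathlib import Path
from typing import Callable

# Hypothetical category names; the real script may organise its analyzers differently.
UPPERCASE = "uppercase-titles"          # COP 1 to COP 5
SCAN = "scan-with-known-affiliations"   # COP 7 and COP 8
NEWER = "newer"                         # every other meeting


def analyzer_category(meeting_label: str) -> str:
    """Map a meeting label such as "cop1" or "sb40" to an analyzer category."""
    match = re.fullmatch(r"([a-z]+)(\d+)", meeting_label.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised meeting label: {meeting_label!r}")
    body, number = match.group(1), int(match.group(2))
    if body == "cop" and 1 <= number <= 5:
        return UPPERCASE
    if body == "cop" and number in (7, 8):
        return SCAN
    return NEWER


def load_or_extract_text(intermediate: Path, extract: Callable[[], str]) -> str:
    """Return the cached text if the intermediate file exists.

    Otherwise run the (possibly slow) extraction once and store its result,
    so that a second run skips the PDF-to-text step.
    """
    if intermediate.exists():
        return intermediate.read_text(encoding="utf-8")
    text = extract()
    intermediate.write_text(text, encoding="utf-8")
    return text
```

With these assumptions, `analyzer_category("cop7")` selects the scan category, while labels such as "cop6" or "sb40" fall through to the newer category. Passing the extraction step as a callable keeps the caching logic independent of whether OCR or direct PDF text extraction is used.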