Participant list processing library
This library extracts information from lists of participants at UNFCCC meetings (provided as PDF files). The extraction is done in two steps:
- First, convert the PDF file to a text file (.txt)
- Second, convert the text file to a CSV file with four columns: Affiliation category, affiliation, name, description.
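As a rough illustration of the target format of the second step, the sketch below writes records to a CSV file with the four columns listed above. The header spellings, the function name `write_participants_csv`, and the example record are assumptions made for illustration, not the library's actual output or API.

```python
import csv
import io

# Column names follow the description above; the exact header
# spelling used by the library is an assumption.
COLUMNS = ["affiliation_category", "affiliation", "name", "description"]


def write_participants_csv(rows, out):
    """Write (category, affiliation, name, description) tuples as CSV.

    Hypothetical helper: `rows` is any iterable of 4-tuples and `out`
    is a text file object opened for writing.
    """
    writer = csv.writer(out)
    writer.writerow(COLUMNS)
    for row in rows:
        if len(row) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields, got {len(row)}")
        writer.writerow(row)


# Invented example record, for illustration only.
example_rows = [
    ("Parties", "EXAMPLELAND", "Ms. Jane Doe",
     "Head of Delegation, Ministry of Environment"),
]

buffer = io.StringIO()
write_participants_csv(example_rows, buffer)
csv_text = buffer.getvalue()
```

Using the `csv` module rather than joining strings by hand means that fields containing commas, such as the description above, are quoted correctly.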
The first step is done by the files named *Extractor*. There are two extractors: OcrExtractor for lists that are scans and therefore need OCR to extract the data, and DigitalPdfExtractor for well-formatted PDFs. The second step is done by the files named *MeetingAnalyzer*. There are three different analyzers:
- UppercaseAffiliationMeetingAnalyzer: The lists of mostly early meetings print each affiliation in uppercase letters, and this analyzer uses that to detect them.
- DigitalMeetingAnalyzer: Works for all meetings that have NOT been extracted via OCR.
- AffiliationListMeetingAnalyzer: Use this for OCR-extracted lists that don't have uppercase affiliations (e.g. cop7 and cop8). It uses a list of possible affiliations (provided by the script extract_affiliations.py) to find new affiliations. This leads to more errors than the other analyzers.
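The two detection strategies described above (uppercase headings and a list of known affiliations) can be sketched as follows. This is a minimal illustration of the idea under simplifying assumptions, not the code of the analyzers themselves: all function names, the `min_letters` threshold, and the sample lines are hypothetical.

```python
def is_uppercase_affiliation(line, min_letters=3):
    """Return True if every letter in the line is uppercase.

    Lines with fewer than `min_letters` letters are rejected so that
    stray OCR fragments are not mistaken for headings (assumed threshold).
    """
    letters = [ch for ch in line if ch.isalpha()]
    if len(letters) < min_letters:
        return False
    return all(ch.isupper() for ch in letters)


def is_known_affiliation(line, known_affiliations):
    """Return True if the normalized line is in a set of known names.

    `known_affiliations` is assumed to hold casefolded names with
    single spaces between words.
    """
    return " ".join(line.split()).casefold() in known_affiliations


def group_by_affiliation(lines, is_heading):
    """Attach each non-heading line to the most recent heading line."""
    groups = []
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if is_heading(line):
            current = (line, [])
            groups.append(current)
        elif current is not None:
            current[1].append(line)
    return groups


# Invented sample text, for illustration only.
uppercase_sample = ["EXAMPLELAND", "Ms. Jane Doe", "Head of Delegation",
                    "", "OTHER STATE", "Mr. John Roe"]
uppercase_groups = group_by_affiliation(uppercase_sample,
                                        is_uppercase_affiliation)

known = {"exampleland", "other state"}
mixed_case_sample = ["Exampleland", "Ms. Jane Doe",
                     "Other State", "Mr. John Roe"]
list_groups = group_by_affiliation(
    mixed_case_sample, lambda line: is_known_affiliation(line, known))
```

The list-based variant only recognizes affiliations that appear in the known set and only when the OCR output matches a known name exactly after normalization, which is one plausible reason it is more error-prone than the uppercase heuristic.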