**TODO**

week2:
- modularize ocr script
- adapt cop1-script to be stand alone (page start - page end)
- cop 2 - 4: explore how to handle

week3:
- continue modularizing
- explore tesseract documentation -> try to optimize result
- explore textract documentation, try different parameters
- try tesseract on newer pdfs

week4:
- use pdfminer on all the other cops
- understand tesseract better for report/presentation and project

week5:
- tesseract for cop7 and cop8 (bold) or use list of other cops
- detect "Parties", "NGO", ...
- barplot for accuracy, ...
- corrigendum
- peewee -> database (think about structure first)

week6:
- mail Paula and Marlene with lists
- one big .csv (with gender, title, role) + language
- cop 6b
- other meetings: SB

Questions Paula and Marlene:
- Corrigenda -> could be nice
- What metadata to extract (gender, title, ...) -> governmental part of
- error rate (especially cop7 and cop8) -> fine for older data
- COP 6 and 6b -> extract it separately
- where does the difference of participants come from? -> Marlene contacts UNFCCC
- did someone try to get better data directly from UNFCCC? -> No

Summary meeting Paula and Marlene:
- metadata: government, press, NGO
- experience of the participants
- extract other meetings (SB meetings)

week7:
- mail Paula/Marlene for missing SB participant lists
- government description: limit to parties, look at what "director" and "president" imply
- give the code repo a structure
- gender, title, ...

week8:
- inspect corrigendum changes
- rename the module files
- gender, title, government
- extract all SBs
- slides midterm presentation (first: structure in bullet points -> send to victor)

week9:
- finalize slides
- continue on experience of participants -> edit distances
- continue on roles (fossil fuels)
- normalize country names (python library)

week11:
- continue on experience of participants
- fossil fuels -> separate column
- have a look at tatiana's data & brainstorm about what to predict

week12:
- different plots for experience (boxplots, ...)
- plot absolute number for fossil fuel industry
- experience: clearly define why I choose what -> write it down somewhere
- Marlene & Paula: separate experience COP & SB? What exactly do they want?
- linear regression for interventions

Meeting 2 Marlene & Paula:
- Experience: extract the data more fine-grained (cop/sb, party/NGO)
- Experience score: Only consider top '10' for the score
- Chair of G77, AOSIS, etc.
- Role: maybe add a role "NGO"
- Role: distinguish between "no description" and "no keyword found"

week13:
- do the linear regression model
- rest of meeting with Paula and Marlene
- not found keywords -> word frequency
- Hamming vs. Levenshtein

week14:
- plot distribution of number of interventions
- analyze parameters (maybe normalize)
- rest of meeting with Paula and Marlene
- not found keywords -> word frequency
- Hamming vs. Levenshtein
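
Note on the open Hamming vs. Levenshtein item: a minimal sketch of how the two distances differ when matching participant names across lists. The function names and the example names are hypothetical and not taken from the project code; the point is only that Hamming distance needs equal-length strings, while Levenshtein distance also handles a dropped or inserted character, which is a plausible OCR error.

```python
def hamming(a: str, b: str) -> int:
    """Count positions where the two strings differ (equal length only)."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


# Hypothetical names: one substituted character, then one dropped character
print(hamming("Meier", "Meyer"))         # same length, both measures work
print(levenshtein("Meier", "Meyer"))
print(levenshtein("Schmidt", "Schmid"))  # Hamming would raise here
```

If names in the lists can lose or gain characters, Hamming distance cannot be applied directly, so Levenshtein is probably the safer default for the experience matching.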