**TODO**

week2:
- modularize ocr script
- adapt cop1-script to be standalone (page start - page end)
- cop 2 - 4: explore how to handle

week3:
- continue modularizing
- explore tesseract documentation -> try to optimize result
- explore textract documentation, try different parameters
- try tesseract on newer pdfs

week4:
- use pdfminer on all the other cops
- understand tesseract better for report/presentation and project

week5:
- tesseract for cop7 and cop8 (bold) or use list of other cops
- detect "Parties", "NGO", ...
- barplot for accuracy, ...
- corrigendum
- peewee -> database (think about structure first)

week6:
- mail Paula and Marlene with lists
- one big .csv (with gender, title, role) + language
- cop 6b
- other meetings: SB

Questions Paula and Marlene:
- Corrigenda -> could be nice
- What metadata to extract (gender, title, ...) -> governmental part of
- error rate (especially cop7 and cop8) -> fine for older data
- COP 6 and 6b -> extract it separately
- where does the difference in participants come from? -> Marlene contacts UNFCCC
- did someone try to get better data directly from UNFCCC? No

Summary meeting Paula and Marlene:
- metadata: government, press, NGO
- experience of the participants
- extract other meetings (SB meetings)

week7:
- mail Paula/Marlene for missing SB participant lists
- government description: limit to parties, look at what "director" and "president" imply
- give the code repo a structure
- gender, title, ...

week8:
- inspect corrigendum changes
- rename the module files
- gender, title, government
- extract all SBs
- slides midterm presentation (first: structure in bullet points -> send to victor)

week9:
- finalize slides
- continue on experience of participants -> edit distances
- continue on roles (fossil fuels)
- normalize country names (python library)