**TODO**

week2:
- modularize ocr script
- adapt cop1-script to be standalone (page start - page end)
- cop 2 - 4: explore how to handle

week3:
- continue modularizing
- explore tesseract documentation -> try to optimize result
- explore textract documentation, try different parameters
- try tesseract on newer pdfs

week4:
- use pdfminer on all the other cops
- understand tesseract better for report/presentation and project

week5:
- tesseract for cop7 and cop8 (bold) or use list of other cops
- detect "Parties", "NGO", ...
- bar plot for accuracy, ...
- corrigendum
- peewee -> database (think about structure first)

week6:
- mail Paula and Marlene with lists
- one big .csv (with gender, title, role) + language
- cop 6b
- other meetings: SB

Questions Paula and Marlene:
- Corrigenda -> could be nice
- What metadata to extract (gender, title, ...) -> governmental part of 
- error rate (especially cop7 and cop8) -> fine for older data
- COP 6 and 6b -> extract them separately
- where does the difference of participants come from? -> Marlene contacts UNFCCC
- did someone try to get better data directly from UNFCCC? No

Summary meeting Paula and Marlene:
- metadata: government, press, NGO
- experience of the participants
- extract other meetings (SB meetings)

week7:
- mail Paula/Marlene for missing SB participant lists
- government description: limit to parties, check what "director" and "president" imply
- give the code repo a structure
- gender, title, ...

week8:
- inspect corrigendum changes
- rename the module files
- gender, title, government
- extract all SBs
- slides midterm presentation (first: structure in bullet points -> send to Victor)

week9:
- finalize slides
- continue on experience of participants -> edit distances
- continue on roles (fossil fuels)
- normalize country names (python library)
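
A minimal sketch for the "experience of participants -> edit distances" item, assuming experience is measured by matching the same person's name across participant lists despite OCR or spelling differences. The function names, the normalization step, and the threshold `max_distance=2` are all hypothetical choices for illustration, not something already decided in the project.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the Levenshtein edit distance between two strings."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def same_person(name_a: str, name_b: str, max_distance: int = 2) -> bool:
    """Guess whether two extracted names refer to the same participant.

    Assumption: lowercasing and collapsing whitespace is enough
    normalization, and a distance of at most 2 counts as a match.
    """
    def normalize(name: str) -> str:
        return " ".join(name.lower().split())

    return levenshtein(normalize(name_a), normalize(name_b)) <= max_distance


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(same_person("Maria  Schmidt", "maria schmid"))
```

The threshold would need tuning against real lists: short names collide easily at distance 2, so a length-relative cutoff may work better.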