code

COP_Analyzer.py
COP_Extractor.py
extract_participants_copX.py
files/
index_analysis_cop1.py
ocr-list.py
process_copX.py
raw_cop1.txt
raw_cop25.txt
raw_cop3.txt
readme.md
test_indexCOP1.txt

readme.md

*Bachelor project, Jan Linder*

**process_copX.py**

*How to use:* `process_copX <numberOfCop> <intermediateFilename> <outputFilename> (<startPage> <endPage>)`

- @numberOfCop (int): which COP to handle
- @intermediateFilename (str): the .txt file in which the extracted text is stored. If it already exists, the PDF-to-text step is not run again.
- @outputFilename (str): the output filename (.csv)
- (optional) @startPage (int): start page for the PDF-to-text step (only used if that step still has to run); the first page is page 1
- (optional) @endPage (int): end page for the PDF-to-text step (only used if that step still has to run), inclusive
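The argument list above could be parsed roughly as follows. This is only a sketch: the readme does not say how `process_copX.py` reads its arguments, so the use of `argparse` and the helper name `build_parser` are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented arguments (not the script's actual code)."""
    parser = argparse.ArgumentParser(prog="process_copX.py")
    parser.add_argument("numberOfCop", type=int, help="which COP to handle")
    parser.add_argument("intermediateFilename", help=".txt file holding the extracted text")
    parser.add_argument("outputFilename", help="output .csv file")
    # The two page arguments are optional positionals; None means "not given".
    parser.add_argument("startPage", type=int, nargs="?", default=None,
                        help="first page to extract (1-based)")
    parser.add_argument("endPage", type=int, nargs="?", default=None,
                        help="last page to extract (inclusive)")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.numberOfCop, args.intermediateFilename, args.outputFilename,
          args.startPage, args.endPage)
```

With this sketch, a call such as `process_copX.py 1 raw_cop1.txt cop1.csv 3 10` would yield `startPage=3` and `endPage=10`, while leaving out the last two values leaves both as `None`.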

*How it is done:*

- COP 1-4: These PDFs are scans, so the text is extracted by OCR with tesseract.
- COP 5+: The text is extracted with PyPDF2, which does not work perfectly and will probably be replaced later (as of 27.09.20).
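The decision logic described here can be sketched as below. The function names are hypothetical and the actual tesseract and PyPDF2 calls are deliberately left out; the sketch only shows which extraction method applies to which COP, when the PDF-to-text step can be skipped, and how a 1-based inclusive page range maps to 0-based page indices.

```python
import os


def choose_extraction_method(number_of_cop):
    """Return "ocr" for the scanned COP 1-4 and "pypdf2" for COP 5 and later."""
    if number_of_cop < 1:
        raise ValueError("COP numbers start at 1")
    return "ocr" if number_of_cop <= 4 else "pypdf2"


def needs_extraction(intermediate_filename):
    """The PDF-to-text step is only needed if the intermediate .txt file is missing."""
    return not os.path.exists(intermediate_filename)


def page_indices(start_page, end_page, total_pages):
    """Convert a 1-based, inclusive page range into 0-based page indices.

    None for either bound means "from the first page" / "to the last page".
    """
    first = 1 if start_page is None else start_page
    last = total_pages if end_page is None else end_page
    return list(range(first - 1, last))
```

For example, `choose_extraction_method(3)` selects OCR, `choose_extraction_method(25)` selects PyPDF2, and `page_indices(1, 3, 10)` covers the first three pages.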
