*Bachelor project, Jan Linder*
process_copX.py

*How to use:*

`process_copX <numberOfCop> <intermediateFilename> <outputFilename> (<startPage> <endPage>)`

- @numberOfCop (int): which COP to handle
- @intermediateFilename (str): the filename (.txt) where the extracted text is stored. If it already exists, the PDF-to-text step is not performed again.
- @outputFilename (str): the output filename (.csv)
- (optional) @startPage (int): start page for the PDF-to-text step (if this step is still to be done); the first page is page 1
- (optional) @endPage (int): end page for the PDF-to-text step (if this step is still to be done), inclusive
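The argument list above could be parsed roughly as follows. This is only a sketch with `argparse`, not the script's actual parsing code: the helper names `build_parser` and `parse_args` are hypothetical, and requiring `startPage` and `endPage` to be given together is an assumption based on the two being grouped in one pair of parentheses.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented positional arguments."""
    parser = argparse.ArgumentParser(prog="process_copX")
    parser.add_argument("numberOfCop", type=int, help="which COP to handle")
    parser.add_argument("intermediateFilename",
                        help=".txt file holding the extracted text")
    parser.add_argument("outputFilename", help="output .csv file")
    # nargs="?" makes a positional argument optional.
    parser.add_argument("startPage", type=int, nargs="?", default=None,
                        help="first page to extract (1-based)")
    parser.add_argument("endPage", type=int, nargs="?", default=None,
                        help="last page to extract (inclusive)")
    return parser


def parse_args(argv):
    """Parse argv (without the program name) and check the page pair."""
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: the page range is either fully given or fully omitted.
    if (args.startPage is None) != (args.endPage is None):
        parser.error("startPage and endPage must be given together")
    return args
```

With made-up file names, `parse_args(["3", "cop3.txt", "cop3.csv", "1", "50"])` yields `numberOfCop == 3`, `startPage == 1` and `endPage == 50`; leaving out the last two values leaves both page fields as `None`.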
*How it is done:*

- COP 1-4: These PDFs are scans, so the text is extracted by OCR with tesseract.
- COP 5+: The text is extracted with PyPDF2, which does not work perfectly and will probably be replaced later (as of 27.09.20).
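The flow described here (choose the extraction method by COP number, skip extraction when the intermediate file already exists, treat the page range as 1-based and inclusive) might look like the sketch below. All function names are hypothetical and this is not the project's actual code. The PyPDF2 calls use the `PdfReader` / `extract_text` names from recent PyPDF2 releases, which is an assumption about the installed version; the OCR path is left out.

```python
from pathlib import Path
from typing import Callable, Optional


def choose_method(number_of_cop: int) -> str:
    """COP 1-4 are scans and need OCR; later COPs go through PyPDF2."""
    return "ocr" if number_of_cop <= 4 else "pypdf2"


def page_indices(n_pages: int, start_page: Optional[int] = None,
                 end_page: Optional[int] = None) -> range:
    """Turn a 1-based, inclusive page range into 0-based indices."""
    first = 1 if start_page is None else max(start_page, 1)
    last = n_pages if end_page is None else min(end_page, n_pages)
    return range(first - 1, last)


def extract_with_pypdf2(pdf_path: str, start_page: Optional[int] = None,
                        end_page: Optional[int] = None) -> str:
    """Extract text page by page (assumes a recent PyPDF2 release)."""
    # Imported lazily so the other helpers work without PyPDF2 installed.
    # Older PyPDF2 versions used PdfFileReader / extractText instead.
    from PyPDF2 import PdfReader
    pages = PdfReader(str(pdf_path)).pages
    wanted = page_indices(len(pages), start_page, end_page)
    return "\n".join(pages[i].extract_text() or "" for i in wanted)


def load_or_extract(intermediate: Path, extract: Callable[[], str]) -> str:
    """Reuse the intermediate .txt if present, else extract and store it."""
    if intermediate.exists():
        return intermediate.read_text(encoding="utf-8")
    text = extract()
    intermediate.write_text(text, encoding="utf-8")
    return text
```

Passing the extractor as a callable keeps the caching step independent of whether the text comes from OCR or from PyPDF2, so replacing PyPDF2 later would only touch one function.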