How To Extract Information From SVG Text Files
Extract Information from SVG text files to xml
Repository: nietzsche-python
Process files:
$ python3 svgscripts/process_files.py <PDFDIR> <TEXT_SVG_DIR>
<PDFDIR> is the directory containing the single-page PDF files; <TEXT_SVG_DIR> is the directory containing the converted SVG text files.
The script creates or updates:
- a page-xml file for each SVG text file, written to the directory 'xml' (created if it does not exist). Use option --xml-target-dir=xml-target-dir to specify a different xml directory. Naming scheme: TITLE_page[0-9][0-9][0-9].xml
- a manuscript-xml file that contains information about the manuscript. Naming scheme: TITLE.xml
- an SVG path file for each PDF file, suitable for display on the web, written to the directory 'svg' (created if it does not exist). Use option --svg-target-dir=svg-target-dir to specify a different svg directory.
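The naming scheme and the two directory options above can be sketched in Python as follows. This is only an illustration and not part of nietzsche-python: the helper names page_xml_name and build_process_command are hypothetical, the title used in the usage note is made up, and placing the options before the positional arguments is an assumption about how the script parses its command line.

```python
import re
from typing import List, Optional

# Pattern derived from the naming scheme TITLE_page[0-9][0-9][0-9].xml.
PAGE_XML_PATTERN = re.compile(r"^(?P<title>.+)_page(?P<number>[0-9]{3})\.xml$")


def page_xml_name(title: str, page_number: int) -> str:
    """Return the page-xml file name expected for a title and page number.

    Hypothetical helper: it assumes the three digits are the zero-padded
    page number.
    """
    if not 0 <= page_number <= 999:
        raise ValueError("page number must fit in three digits")
    return f"{title}_page{page_number:03d}.xml"


def build_process_command(
    pdf_dir: str,
    text_svg_dir: str,
    xml_target_dir: Optional[str] = None,
    svg_target_dir: Optional[str] = None,
) -> List[str]:
    """Build the argument list for svgscripts/process_files.py.

    Hypothetical helper: option order relative to the positional
    arguments is an assumption.
    """
    command = ["python3", "svgscripts/process_files.py"]
    if xml_target_dir is not None:
        command.append(f"--xml-target-dir={xml_target_dir}")
    if svg_target_dir is not None:
        command.append(f"--svg-target-dir={svg_target_dir}")
    command += [pdf_dir, text_svg_dir]
    return command
```

For example, page_xml_name("Mp_XIV_1", 7) yields "Mp_XIV_1_page007.xml", and the list returned by build_process_command can be passed to subprocess.run from the repository root.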
Fix missing glyphs (if needed):
$ python3 svgscripts/fix_missing_glyphs.py <MANUSCRIPT.xml>
<MANUSCRIPT.xml> is a manuscript-xml file.
Convert the word positions to HTML, SVG, PDF or TEXT for testing purposes:
$ python3 svgscripts/convert_wordPositions.py <PAGE.xml>
<PAGE.xml> is a page-xml file. By default, the script shows the HTML output in a browser.
Use one of the options --HTML, --SVG, --PDF or --TEXT to specify a format, or --output=outputFile to specify both the format and the output file name.
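The option handling described above can be sketched as a small command builder. This is a hedged illustration only: build_convert_command is a hypothetical helper, not part of nietzsche-python, and both the placement of options before <PAGE.xml> and the decision to treat a format flag and --output as mutually exclusive are assumptions of this sketch.

```python
from typing import List, Optional

# Format flags named in the text above.
FORMAT_FLAGS = ("--HTML", "--SVG", "--PDF", "--TEXT")


def build_convert_command(
    page_xml: str,
    format_flag: Optional[str] = None,
    output_file: Optional[str] = None,
) -> List[str]:
    """Build the argument list for svgscripts/convert_wordPositions.py.

    Hypothetical helper. With neither option set, the script's default
    behaviour (HTML output in a browser) applies.
    """
    if format_flag is not None and format_flag not in FORMAT_FLAGS:
        raise ValueError(f"unknown format flag: {format_flag}")
    if format_flag is not None and output_file is not None:
        # Assumption of this sketch: --output already determines the format.
        raise ValueError("use either a format flag or an output file, not both")
    command = ["python3", "svgscripts/convert_wordPositions.py"]
    if format_flag is not None:
        command.append(format_flag)
    if output_file is not None:
        command.append(f"--output={output_file}")
    command.append(page_xml)
    return command
```

The returned list can be handed to subprocess.run; for instance, build_convert_command("page.xml", format_flag="--TEXT") requests the text output for a page-xml file.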
Last Author: steinech
Last Edited: Oct 18 2019, 17:22