
How To Extract Information From SVG Text Files

Extract information from SVG text files to XML

Repository: nietzsche-python

Process files:

$ python3 svgscripts/process_files.py <PDFDIR> <TEXT_SVG_DIR>

<PDFDIR> is the directory containing the single-page PDF files, and <TEXT_SVG_DIR> is the directory containing the converted SVG text files.
The script creates or updates:

  • a page-xml file for each SVG text file, in the (newly created) directory 'xml'. Use the option --xml-target-dir=xml-target-dir to specify a different XML directory. Naming scheme: TITLE_page[0-9][0-9][0-9].xml
  • a manuscript-xml file that contains information about the manuscript. Naming scheme: TITLE.xml
  • an SVG path file for each PDF file, which can be displayed on the web, in the (newly created) directory 'svg'. Use the option --svg-target-dir=svg-target-dir to specify a different SVG directory.
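The page-xml naming scheme can be checked or generated with a few lines of Python. This is only a sketch: the helper names page_xml_name and split_page_xml_name are hypothetical and not part of nietzsche-python, and it assumes that the three digits in the scheme are a zero-padded page number.

```python
import re
from pathlib import Path

# Regular expression for the documented scheme TITLE_page[0-9][0-9][0-9].xml.
# Assumption: the three digits are a zero-padded page number.
PAGE_XML_PATTERN = re.compile(r"^(?P<title>.+)_page(?P<page>[0-9]{3})\.xml$")


def page_xml_name(title: str, page_number: int) -> str:
    """Return the page-xml file name for a title and a page number (0-999)."""
    if not 0 <= page_number <= 999:
        raise ValueError("the naming scheme allows exactly three digits")
    return f"{title}_page{page_number:03d}.xml"


def split_page_xml_name(file_name: str):
    """Return (title, page_number), or None if the name does not match the scheme."""
    match = PAGE_XML_PATTERN.match(Path(file_name).name)
    if match is None:
        return None
    return match.group("title"), int(match.group("page"))
```

A manuscript-xml name such as TITLE.xml does not match the pattern, so split_page_xml_name can also be used to tell the two kinds of file apart when both are in the same directory.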

Fix missing glyphs (if needed):

$ python3 svgscripts/fix_missing_glyphs.py <MANUSCRIPT.xml>

<MANUSCRIPT.xml> is a manuscript-xml file.
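The processing and glyph-fixing steps can also be chained from Python. The following is a minimal sketch, not part of the repository: the function names are hypothetical, it assumes the scripts are called from the repository root, and it assumes that options may be placed before the positional arguments. Only the script paths and the two target-dir options are taken from this page.

```python
import subprocess


def build_process_command(pdf_dir, text_svg_dir,
                          xml_target_dir=None, svg_target_dir=None):
    """Build the argument list for svgscripts/process_files.py."""
    command = ["python3", "svgscripts/process_files.py"]
    # Assumption: options are accepted before the positional arguments.
    if xml_target_dir is not None:
        command.append(f"--xml-target-dir={xml_target_dir}")
    if svg_target_dir is not None:
        command.append(f"--svg-target-dir={svg_target_dir}")
    command += [pdf_dir, text_svg_dir]
    return command


def build_fix_glyphs_command(manuscript_xml):
    """Build the argument list for svgscripts/fix_missing_glyphs.py."""
    return ["python3", "svgscripts/fix_missing_glyphs.py", manuscript_xml]


def run_steps(commands, runner=subprocess.run):
    """Run the commands in order and stop at the first one that fails."""
    for command in commands:
        runner(command, check=True)
```

Building the commands as lists instead of shell strings avoids quoting problems with directory names that contain spaces. The runner parameter exists only so that the sequence can be tried out without executing anything.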

Convert the word positions to HTML, SVG, PDF or TEXT for testing purposes:

$ python3 svgscripts/convert_wordPositions.py <PAGE.xml>

<PAGE.xml> is a page-xml file. By default, the script shows the HTML output in a browser.
Use one of the options --HTML, --SVG, --PDF or --TEXT to specify a format, or --output=outputFile to specify both the format and the output file name.
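To test several pages in one go, the call can be assembled in Python. This is a sketch under assumptions: build_convert_command is a hypothetical helper, the script is assumed to be called from the repository root, and options are assumed to be accepted before <PAGE.xml>. Because this page presents the format options and --output as alternatives, the sketch passes at most one of them.

```python
# The format options listed on this page.
FORMAT_FLAGS = ("--HTML", "--SVG", "--PDF", "--TEXT")


def build_convert_command(page_xml, format_flag=None, output_file=None):
    """Build the argument list for svgscripts/convert_wordPositions.py."""
    if format_flag is not None and output_file is not None:
        # The page describes these as alternatives, so the sketch
        # does not combine them.
        raise ValueError("pass either a format flag or an output file")
    command = ["python3", "svgscripts/convert_wordPositions.py"]
    if format_flag is not None:
        if format_flag not in FORMAT_FLAGS:
            raise ValueError(f"unknown format flag: {format_flag}")
        command.append(format_flag)
    if output_file is not None:
        command.append(f"--output={output_file}")
    # Assumption: options may precede the positional page-xml argument.
    command.append(page_xml)
    return command
```

The returned list can be passed to subprocess.run(command, check=True), for example in a loop over all page-xml files of a manuscript.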

Last Author: steinech
Last Edited: Oct 18 2019, 17:22
