Index: Notiz4Dominique.txt
===================================================================
--- Notiz4Dominique.txt (revision 87)
+++ Notiz4Dominique.txt (revision 88)
@@ -1,8 +0,0 @@
-I have created a data folder 'DATA' in the shared SwitchDrive folder. It contains the following files:
-
-pdf
-svg
-text_svg
-xml
-
-You can link these into this folder in order to use the scripts with the current data.
Index: TODO.md
===================================================================
--- TODO.md (revision 87)
+++ TODO.md (revision 88)
@@ -1,111 +1,111 @@
# TASKS
-+ Data
+- Data
- Frontend
-- Funding
++ Funding
- Organization
# Word search:
- The word search should be weighted by the topological proximity of the words to one another.
- Word paths, i.e. sequences of words, should be avoided, since they cannot be generated automatically and
are highly error-prone.
- For the same reason, word insertions should not be used to record alternative courses of the text.
# TODO
## Faksimile data input
- word boxes on faksimile by drawing rects with inkscape [IN PROGRESS, see "Leitfaden.pdf"]
- naming word boxes by using title of rects [IN PROGRESS, see "Leitfaden\_Kontrolle\_und\_Beschriftung\_der\_Wortrahmen.pdf"]
- correcting faksimile svg or transkription xml if words do not correspond
## Processing
### faksimile data input, i.e. svg-file resulting from drawing boxes etc. with inkscape
- process faksimile words:
- join\_faksimileAndTranskription.py [DONE]
- create faksimile position of line [TODO]
- create a data input task for words that do not correspond [DONE]
### transkription, i.e. svg-file resulting from pdf-file ->created with InDesign
- fix:
- xml/W\_II\_1\_page131.xml:
- eigentlichste: [TODO]
has two parts (['eigentlich', 'ste']), both are deleted, one part ('ste') has a box ('e')
-> expected result: earlier_version = eigentlichste, earlier_version.earlier_version = eigentliche
- xml/N\_VII\_1\_page138.xml:
- AufBau [DONE]
- Verschiedenes [DONE]
- process text field:
- Word [DONE]
- SpecialWord
- MarkForeignHands [DONE]
- TextConnectionMark [DONE]
- WordInsertionMark [DONE]
- all paths -> page.categorize\_paths [TODO]
- word-deletion -> Path [DONE]
- make parts of word if only parts of a word are deleted, also introduce earlier version of word [DONE]
- correction concerning punctuations in words that are deleted, script does not recognize parts of deleted
words as deleted if they consist of punctuation marks. [TODO]
- word-undeletion (e.g. N VII 1, 18,6 -> "mit")
- underline
- text-area-deletion
- text-connection-lines
- boxes
- process footnotes:
- Return footnotes with standoff [DONE]
- TextConnectionMark [DONE]
- TextConnection with uncertainty [TODO]
- "Fortsetzung [0-9]+,[0-9]+?"
- "Fortsetzung von [0-9]+,[0-9]+?"
- concerning Word:
- uncertain transcription: "?" / may have bold word parts
- atypical writing: "¿" and bold word parts
- clarification corrections ("Verdeutlichungskorrekturen"): "Vk" and bold word parts
- correction: "word>" and ">?" (with uncertainty)
- concerning word deletion:
- atypical writing: "¿" and "Durchstreichung" (see N VII 1, 11,2)
- process margins:
- MarkForeignHands [DONE]
- ForeignHandTextAreaDeletion [TODO]
- boxes: make earlier version of a word [TODO]
- TextConnection [TODO]
- from: ([0-9]+,)*[0-9]+ -)
- to: -) ([0-9]+,)*[0-9]+
## Datatypes
- make datatypes:
- Page [ok] --> page orientation!!!
- SimpleWord
- SpecialWord
- MarkForeignHands ("Zeichen für Fremde Hand") [DONE]
- TextConnectionMark ("Anschlußzeichen") [DONE]
- has a Reference
- Word [ok] --> deal with non-horizontal text [DONE]
--> hyphenation [TODO]
--> add style info to word: font { German, Latin } [DONE]
--> pen color [DONE]
--> connect style with character glyph-id from svg path file
--> has parts [DONE]
--> versions: later version of earlier version [DONE]
- WritingProcess >>>> use only in connection with earlier versions of word
- correlates with font size:
- biggest font to biggest-1 font: stage 0
- font in between: stage 1
- smallest font to smallest+1 font: stage 2
- Style [DONE]
- WordPosition [ok]
- TranskriptionPosition [ok]
- FaksimilePosition [ok]
- LineNumber [reDo]
- change to Line
- Reference [TODO]+
- TextConnection
- needs change of LineNumber to Line
- ForeignHandTextAreaDeletion [TODO]
- Freehand:
- Deletion [DONE]
- make parts of word if only parts of a word are deleted, also introduce earlier version of word [DONE]
- WordInsertionMark [reDO]
- Underline [TODO]
Index: svgscripts/datatypes/manuscript.py
===================================================================
--- svgscripts/datatypes/manuscript.py (revision 87)
+++ svgscripts/datatypes/manuscript.py (revision 88)
@@ -1,137 +1,141 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
""" This class can be used to represent an archival unity of manuscript pages, i.e. workbooks, notebooks, folders of handwritten pages.
"""
# Copyright (C) University of Basel 2019 {{{1
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/> 1}}}
__author__ = "Christian Steiner"
__maintainer__ = __author__
__copyright__ = 'University of Basel'
__email__ = "christian.steiner@unibas.ch"
__status__ = "Development"
__license__ = "GPL v3"
__version__ = "0.0.1"
from lxml import etree as ET
from os.path import isfile
import sys
from .page import Page, FILE_TYPE_XML_MANUSCRIPT, FILE_TYPE_SVG_WORD_POSITION
from .color import Color
sys.path.append('py2ttl')
from class_spec import SemanticClass
sys.path.append('shared_util')
from myxmlwriter import parse_xml_of_type, write_pretty, xml_has_type
class ArchivalManuscriptUnity(SemanticClass):
"""
This class represents an archival unity of manuscript pages (workbooks, notebooks and portfolios of handwritten pages).
@label archival unity of manuscript pages
Args:
title title of archival unity
manuscript_type type of manuscript: 'Arbeitsheft', 'Notizheft', 'Mappe'
manuscript_tree lxml.ElementTree
"""
XML_TAG = 'manuscript'
XML_COLORS_TAG = 'colors'
+ TYPE_DICTIONARY = { 'Mp': 'Mappe', 'N': 'Notizheft', 'W': 'Arbeitsheft' }
UNITTESTING = False
def __init__(self, title='', manuscript_type='', manuscript_tree=None):
self.colors = []
self.manuscript_tree = manuscript_tree
self.manuscript_type = manuscript_type
self.pages = []
self.styles = []
self.title = title
+ if self.manuscript_type == '' and self.title != ''\
+ and self.title.split(' ')[0] in self.TYPE_DICTIONARY.keys():
+ self.manuscript_type = self.TYPE_DICTIONARY[self.title.split(' ')[0]]
def get_name_and_id(self):
"""Return an identification for object as 2-tuple.
"""
return '', self.title.replace(' ', '_')
@classmethod
def create_cls(cls, xml_manuscript_file, page_status_list=None, page_xpath='', update_page_styles=False):
"""Create an instance of ArchivalManuscriptUnity from a xml file of type FILE_TYPE_XML_MANUSCRIPT.
:return: ArchivalManuscriptUnity
"""
manuscript_tree = parse_xml_of_type(xml_manuscript_file, FILE_TYPE_XML_MANUSCRIPT)
title = manuscript_tree.getroot().get('title') if bool(manuscript_tree.getroot().get('title')) else ''
manuscript_type = manuscript_tree.getroot().get('type') if bool(manuscript_tree.getroot().get('type')) else ''
manuscript = cls(title=title, manuscript_type=manuscript_type, manuscript_tree=manuscript_tree)
manuscript.colors = [ Color.create_cls(node=color_node) for color_node in manuscript_tree.xpath('.//' + cls.XML_COLORS_TAG + '/' + Color.XML_TAG) ]
if page_xpath == '':
page_status = ''
if page_status_list is not None\
and type(page_status_list) is list\
and len(page_status_list) > 0:
page_status = '[' + ' and '.join([ f'contains(@status, "{status}")' for status in page_status_list ]) + ']'
page_xpath = f'//pages/page{page_status}/@output'
manuscript.pages = [ Page(page_source)\
for page_source in manuscript_tree.xpath(page_xpath)\
if isfile(page_source) and xml_has_type(FILE_TYPE_SVG_WORD_POSITION, xml_source_file=page_source) ]
if update_page_styles:
for page in manuscript.pages: page.update_styles(manuscript=manuscript, add_to_parents=True)
return manuscript
def get_color(self, hex_color) -> Color:
"""Return color if it exists or None.
"""
if hex_color in [ color.hex_color for color in self.colors ]:
return [ color for color in self.colors if color.hex_color == hex_color ][0]
return None
@classmethod
def get_semantic_dictionary(cls):
""" Creates a semantic dictionary as specified by SemanticClass.
"""
dictionary = {}
class_dict = cls.get_class_dictionary()
properties = {}
properties.update(cls.create_semantic_property_dictionary('title', str, 1))
properties.update(cls.create_semantic_property_dictionary('manuscript_type', str, 1))
properties.update(cls.create_semantic_property_dictionary('styles', list))
properties.update(cls.create_semantic_property_dictionary('pages', list))
dictionary.update({cls.CLASS_KEY: class_dict})
dictionary.update({cls.PROPERTIES_KEY: properties})
return cls.return_dictionary_after_updating_super_classes(dictionary)
def update_colors(self, color):
"""Update manuscript colors if color is not contained.
"""
if self.get_color(color.hex_color) is None:
self.colors.append(color)
if self.manuscript_tree is not None:
if len(self.manuscript_tree.xpath('.//' + self.XML_COLORS_TAG)) > 0:
self.manuscript_tree.xpath('.//' + self.XML_COLORS_TAG)[0].getparent().remove(self.manuscript_tree.xpath('.//' + self.XML_COLORS_TAG)[0])
colors_node = ET.SubElement(self.manuscript_tree.getroot(), self.XML_COLORS_TAG)
for color in self.colors:
color.attach_object_to_tree(colors_node)
if not self.UNITTESTING:
write_pretty(xml_element_tree=self.manuscript_tree, file_name=self.manuscript_tree.docinfo.URL,\
script_name=__file__, backup=True,\
file_type=FILE_TYPE_XML_MANUSCRIPT)
def update_styles(self, *styles):
"""Update manuscript styles.
"""
for style in styles:
if style not in self.styles:
self.styles.append(style)
Index: svgscripts/datatypes/super_page.py
===================================================================
--- svgscripts/datatypes/super_page.py (revision 87)
+++ svgscripts/datatypes/super_page.py (revision 88)
@@ -1,289 +1,290 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
""" This class can be used to represent a super page.
"""
# Copyright (C) University of Basel 2019 {{{1
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/> 1}}}
__author__ = "Christian Steiner"
__maintainer__ = __author__
__copyright__ = 'University of Basel'
__email__ = "christian.steiner@unibas.ch"
__status__ = "Development"
__license__ = "GPL v3"
__version__ = "0.0.1"
from lxml import etree as ET
from os.path import isfile, basename, dirname
from progress.bar import Bar
from svgpathtools import svg2paths2, svg_to_paths
from svgpathtools.parser import parse_path
import sys
import warnings
from .image import Image, SVGImage
from .faksimile_image import FaksimileImage
from .mark_foreign_hands import MarkForeignHands
from .text_connection_mark import TextConnectionMark
from .text_field import TextField
from .writing_process import WritingProcess
class SuperPage:
"""
This super class represents a page.
Args:
xml_source_file (str): name of the xml file to be instantiated.
xml_target_file (str): name of the xml file to which page info will be written.
"""
FILE_TYPE_SVG_WORD_POSITION = 'svgWordPosition'
FILE_TYPE_XML_MANUSCRIPT = 'xmlManuscriptFile'
PAGE_RECTO = 'recto'
PAGE_VERSO = 'verso'
STATUS_MERGED_OK = 'faksimile merged'
STATUS_POSTMERGED_OK = 'words processed'
UNITTESTING = False
+ XML_TAG = 'page'
def __init__(self, xml_file, title=None, page_number='', orientation='North', page_type=PAGE_VERSO, should_xml_file_exist=False):
self.properties_dictionary = {\
'faksimile_image': (FaksimileImage.XML_TAG, None, FaksimileImage),\
'faksimile_svgFile': ('data-source/@file', None, str),\
'number': ('page/@number', str(page_number), str),\
'orientation': ('page/@orientation', orientation, str),\
'page_type': ('page/@pageType', page_type, str),\
'pdfFile': ('pdf/@file', None, str),\
'source': ('page/@source', None, str),\
'svg_file': ('svg/@file', None, str),\
'svg_image': (SVGImage.XML_TAG, None, SVGImage),\
'text_field': (FaksimileImage.XML_TAG + '/' + TextField.XML_TAG, None, TextField),\
'title': ('page/@title', title, str),\
}
self.online_properties = []
self.line_numbers = []
self.lines = []
self.mark_foreign_hands = []
self.page_tree = None
self.sonderzeichen_list = []
self.style_dict = {}
self.text_connection_marks = []
self.word_deletion_paths = []
self.word_insertion_marks = []
self.words = []
self.writing_processes = []
self.xml_file = xml_file
if not self.is_page_source_xml_file():
msg = f'ERROR: xml_source_file {self.xml_file} is not of type "{self.FILE_TYPE_SVG_WORD_POSITION}"'
raise Exception(msg)
self._init_tree(should_xml_file_exist=should_xml_file_exist)
def add_style(self, sonderzeichen_list=[], letterspacing_list=[], style_dict={}, style_node=None):
"""Adds a list of classes that are sonderzeichen and a style dictionary to page.
"""
self.sonderzeichen_list = sonderzeichen_list
self.letterspacing_list = letterspacing_list
self.style_dict = style_dict
if style_node is not None:
self.style_dict = { item.get('name'): { key: value for key, value in item.attrib.items() if key != 'name' } for item in style_node.findall('.//class') }
self.sonderzeichen_list = [ item.get('name') for item in style_node.findall('.//class')\
if bool(item.get('font-family')) and 'Sonderzeichen' in item.get('font-family') ]
self.letterspacing_list = [ item.get('name') for item in style_node.findall('.//class')\
if bool(item.get('letterspacing-list')) ]
elif bool(self.style_dict):
style_node = ET.SubElement(self.page_tree.getroot(), 'style')
if len(self.sonderzeichen_list) > 0:
style_node.set('Sonderzeichen', ' '.join(self.sonderzeichen_list))
if len(self.letterspacing_list) > 0:
style_node.set('letterspacing-list', ' '.join(self.letterspacing_list))
for key in self.style_dict.keys():
self.style_dict[key]['name'] = key
ET.SubElement(style_node, 'class', attrib=self.style_dict[key])
fontsize_dict = { key: float(value.get('font-size').replace('px','')) for key, value in self.style_dict.items() if 'font-size' in value }
fontsizes = sorted(fontsize_dict.values(), reverse=True)
# create a mapping between fontsizes and word stages
self.fontsizekey2stage_mapping = {}
for fontsize_key, value in fontsize_dict.items():
if value >= fontsizes[0]-1:
self.fontsizekey2stage_mapping.update({ fontsize_key: WritingProcess.FIRST_VERSION })
elif value <= fontsizes[len(fontsizes)-1]+1:
self.fontsizekey2stage_mapping.update({ fontsize_key: WritingProcess.LATER_INSERTION_AND_ADDITION })
else:
self.fontsizekey2stage_mapping.update({ fontsize_key: WritingProcess.INSERTION_AND_ADDITION })
def get_biggest_fontSize4styles(self, style_set={}):
"""Returns biggest font size from style_dict for a set of style class names.
[:returns:] (float) biggest font size OR 1 if style_dict is empty
"""
if bool(self.style_dict):
sorted_font_sizes = sorted( (float(self.style_dict[key]['font-size'].replace('px','')) for key in style_set if bool(self.style_dict[key].get('font-size'))), reverse=True)
return sorted_font_sizes[0] if len(sorted_font_sizes) > 0 else 1
else:
return 1
def get_line_number(self, y):
"""Returns line number id for element at y.
[:return:] (int) line number id or -1
"""
if len(self.line_numbers) > 0:
result_list = [ line_number.id for line_number in self.line_numbers if y >= line_number.top and y <= line_number.bottom ]
return result_list[0] if len(result_list) > 0 else -1
else:
return -1
def init_all_properties(self, overwrite=False):
"""Initialize all properties.
"""
for property_key in self.properties_dictionary.keys():
if property_key not in self.online_properties:
self.init_property(property_key, overwrite=overwrite)
def init_property(self, property_key, value=None, overwrite=False):
"""Initialize a single property.
Args:
property_key: key of property in self.__dict__
value: new value to set to property
overwrite: whether or not to update values from xml_file (default: read only)
"""
if value is None:
if property_key not in self.online_properties:
xpath, value, cls = self.properties_dictionary.get(property_key)
if len(self.page_tree.xpath('//' + xpath)) > 0:
value = self.page_tree.xpath('//' + xpath)[0]
if value is not None:
if cls.__module__ == 'builtins':
self.update_tree(value, xpath)
self.__dict__.update({property_key: cls(value)})
else:
value = cls(node=value)\
if type(value) != cls\
else value
self.__dict__.update({property_key: value})
self.__dict__.get(property_key).attach_object_to_tree(self.page_tree)
else:
self.__dict__.update({property_key: value})
self.online_properties.append(property_key)
elif overwrite or property_key not in self.online_properties:
xpath, default_value, cls = self.properties_dictionary.get(property_key)
if cls.__module__ == 'builtins':
self.__dict__.update({property_key: cls(value)})
self.update_tree(value, xpath)
else:
self.__dict__.update({property_key: value})
self.__dict__.get(property_key).attach_object_to_tree(self.page_tree)
self.online_properties.append(property_key)
def is_locked(self):
"""Return true if page is locked.
"""
return len(self.page_tree.xpath('//metadata/lock')) > 0
def is_page_source_xml_file(self, source_tree=None):
"""Return true if xml_file is of type FILE_TYPE_SVG_WORD_POSITION.
"""
if not isfile(self.xml_file):
return True
if source_tree is None:
source_tree = ET.parse(self.xml_file)
return source_tree.getroot().find('metadata/type').text == self.FILE_TYPE_SVG_WORD_POSITION
def lock(self, reference_file, message=''):
"""Lock tree such that ids of words etc. correspond to ids
in reference_file, optionally add a message that will be shown.
"""
if not self.is_locked():
metadata = self.page_tree.xpath('./metadata')[0]\
if len(self.page_tree.xpath('./metadata')) > 0\
else ET.SubElement(self.page_tree.getroot(), 'metadata')
lock = ET.SubElement(metadata, 'lock')
ET.SubElement(lock, 'reference-file').text = reference_file
if message != '':
ET.SubElement(lock, 'message').text = message
def unlock(self):
"""Unlock tree, i.e. remove the lock node that was added by lock()
from the metadata of the page tree.
"""
if self.is_locked():
lock = self.page_tree.xpath('//metadata/lock')[0]
lock.getparent().remove(lock)
def update_and_attach_words2tree(self, update_function_on_word=None, include_special_words_of_type=[]):
"""Update word ids and attach them to page.page_tree.
"""
if not self.is_locked():
update_function_on_word = [ update_function_on_word ]\
if type(update_function_on_word) != list\
else update_function_on_word
for node in self.page_tree.xpath('.//word|.//' + MarkForeignHands.XML_TAG + '|.//' + TextConnectionMark.XML_TAG):
node.getparent().remove(node)
for index, word in enumerate(self.words):
word.id = index
for func in update_function_on_word:
if callable(func):
func(word)
word.attach_word_to_tree(self.page_tree)
for index, mark_foreign_hands in enumerate(self.mark_foreign_hands):
mark_foreign_hands.id = index
if MarkForeignHands in include_special_words_of_type:
for func in update_function_on_word:
if callable(func):
func(mark_foreign_hands)
mark_foreign_hands.attach_word_to_tree(self.page_tree)
for index, text_connection_mark in enumerate(self.text_connection_marks):
text_connection_mark.id = index
if TextConnectionMark in include_special_words_of_type:
for func in update_function_on_word:
if callable(func):
func(text_connection_mark)
text_connection_mark.attach_word_to_tree(self.page_tree)
else:
print('locked')
def update_property_dictionary(self, property_key, default_value):
"""Update properties_dictionary.
"""
content = self.properties_dictionary.get(property_key)
if content is not None:
self.properties_dictionary.update({property_key: (content[0], default_value, content[2])})
else:
msg = f'ERROR: properties_dictionary does not contain a key {property_key}!'
raise Exception(msg)
def update_tree(self, value, xpath):
"""Update tree.
"""
node_name = dirname(xpath)
node = self.page_tree.xpath('//' + node_name)[0]\
if len(self.page_tree.xpath('//' + node_name)) > 0\
else ET.SubElement(self.page_tree.getroot(), node_name)
node.set(basename(xpath).replace('@', ''), str(value))
def _init_tree(self, should_xml_file_exist=False):
"""Initialize page_tree from xml_file if it exists.
"""
if isfile(self.xml_file):
parser = ET.XMLParser(remove_blank_text=True)
self.page_tree = ET.parse(self.xml_file, parser)
elif not should_xml_file_exist:
self.page_tree = ET.ElementTree(ET.Element('page'))
self.page_tree.docinfo.URL = self.xml_file
else:
msg = f'ERROR: xml_source_file {self.xml_file} does not exist!'
raise FileNotFoundError(msg)
Index: svgscripts/create_manuscript.py
===================================================================
--- svgscripts/create_manuscript.py (revision 0)
+++ svgscripts/create_manuscript.py (revision 88)
@@ -0,0 +1,204 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+""" This program can be used to create an ArchivalManuscriptUnity.
+"""
+# Copyright (C) University of Basel 2020 {{{1
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program. If not, see <https://www.gnu.org/licenses/> 1}}}
+
+__author__ = "Christian Steiner"
+__maintainer__ = __author__
+__copyright__ = 'University of Basel'
+__email__ = "christian.steiner@unibas.ch"
+__status__ = "Development"
+__license__ = "GPL v3"
+__version__ = "0.0.1"
+
+from colorama import Fore, Style
+import getopt
+import re
+import sys
+from os import listdir, sep, path
+from os.path import isfile, isdir, dirname, basename
+import lxml.etree as ET
+
+if dirname(__file__) not in sys.path:
+    sys.path.append(dirname(__file__))
+
+from datatypes.manuscript import ArchivalManuscriptUnity
+from datatypes.super_page import SuperPage
+
+sys.path.append('shared_util')
+from myxmlwriter import parse_xml_of_type, write_pretty, xml_has_type, FILE_TYPE_SVG_WORD_POSITION, FILE_TYPE_XML_MANUSCRIPT
+
+
+
+UNITTESTING = False
+
+class ManuscriptCreator:
+    """This class can be used to create an ArchivalManuscriptUnity.
+    """
+
+    def __init__(self, xml_target_dir):
+        self.xml_target_dir = xml_target_dir
+
+    def _get_or_create_element(self, node, xpath, create_id=False) ->ET.Element:
+        """Return the first element matching xpath, or create it if it does not exist.
+        """
+        elements = node.xpath(xpath)
+        if len(elements) > 0:
+            return elements[0]
+        else:
+            if re.match(r'[a-z]+\[@[a-z-]+=', xpath):
+                element_name = re.match(r'(.+?)\[@[a-z]+.*', xpath).group(1)
+                num_elements = len(node.xpath(element_name))
+                element = ET.SubElement(node, element_name)
+                element_attribute = re.match(r'[a-z]+\[@(.+?)=.*', xpath).group(1)
+                element_value = re.match(r'[a-z]+\[@[a-z-]+="(.+?)"]', xpath).group(1)
+                element.set(element_attribute, element_value)
+                if create_id:
+                    element.set('id', str(num_elements))
+                return element
+            else:
+                num_elements = len(node.xpath(xpath))
+                element = ET.SubElement(node, xpath)
+                if create_id:
+                    element.set('id', str(num_elements))
+                return element
+
+    def _create_or_update_pages(self, pages_node, manuscript_page_url_mapping):
+        """Create or update pages.
+        """
+        for page_number, url in manuscript_page_url_mapping.items():
+            xpath = SuperPage.XML_TAG + f'[@number="{page_number}"]'
+            page_node = self._get_or_create_element(pages_node, xpath, create_id=True)
+            if not bool(page_node.get('alias')):
+                page_node.set('alias', basename(url))
+
+    def create_or_update_manuscripts(self, manuscript_files, page_url_mapping):
+        """Create or update manuscripts.
+        """
+        for key in page_url_mapping:
+            relevant_files = [ manuscript_file for manuscript_file in manuscript_files\
+                    if basename(manuscript_file) == key.replace(' ', '_') + '.xml']
+            if len(relevant_files) == 0:
+                manuscript_files.append(key.replace(' ', '_') + '.xml')
+        for manuscript_file in manuscript_files:
+            target_file = self.xml_target_dir + sep + manuscript_file\
+                    if dirname(manuscript_file) == ''\
+                    else manuscript_file
+            title = basename(target_file).replace('.xml', '').replace('_', ' ')
+            manuscript = ArchivalManuscriptUnity(title=title)
+            if isfile(target_file):
+                manuscript = ArchivalManuscriptUnity.create_cls(target_file)
+            else:
+                manuscript.manuscript_tree = ET.ElementTree(ET.Element(ArchivalManuscriptUnity.XML_TAG))
+                manuscript.manuscript_tree.docinfo.URL = target_file
+                manuscript.manuscript_tree.getroot().set('title', manuscript.title)
+                manuscript.manuscript_tree.getroot().set('type', manuscript.manuscript_type)
+            if title in page_url_mapping.keys():
+                pages_node = self._get_or_create_element(manuscript.manuscript_tree.getroot(), 'pages')
+                self._create_or_update_pages(pages_node, page_url_mapping[title])
+            if not UNITTESTING:
+                write_pretty(xml_element_tree=manuscript.manuscript_tree, file_name=target_file,\
+                        script_name=__file__, file_type=FILE_TYPE_XML_MANUSCRIPT)
+
+def create_page_url_mapping(input_file, mapping_dictionary, default_title=''):
+    """Create a page to url mapping from input file.
+
+    File content (one page per pair of lines, title and 'S.' are optional):
+        [TITLE, ][S. ]PAGENUMBER
+        URL
+
+    See: 'tests_svgscripts/test_data/content.txt'
+    """
+    lines = []
+    with open(input_file, 'r') as f:
+        lines = f.readlines()
+    key = None
+    url = None
+    current_key = default_title
+    for content in lines:
+        if content.startswith('http')\
+                or content.startswith('www'):
+            url = content.replace('\n', '')\
+                    if content.startswith('http')\
+                    else 'http://' + content.replace('\n', '')
+            if current_key not in mapping_dictionary.keys():
+                mapping_dictionary.update({current_key: {}})
+            mapping_dictionary[current_key].update({key: url})
+        else:
+            key_parts = [ part.strip() for part in content.replace('\n', '').replace('S.', '').split(',') ]
+            key_index = 0
+            if len(key_parts) > 1:
+                title = key_parts[0]
+                current_key = title  # always switch to the title of the current line, even if it is already known
+                if current_key not in mapping_dictionary.keys():
+                    mapping_dictionary.update({current_key: {}})
+                key_index = 1
+            key = key_parts[key_index]
+
+def usage():
+    """prints information on how to use the script
+    """
+    print(main.__doc__)
+
+def main(argv):
+    """This program can be used to create or update one or more manuscripts.
+
+
+    svgscripts/create_manuscript.py [OPTIONS] [, ...] [, ...]
+
+    One or more files mapping pages to faksimile URLs, with 'txt'-suffix
+    manuscript file(s) (~ArchivalManuscriptUnity).
+
+    OPTIONS:
+    -h|--help: show help
+    -t|--title=title manuscript's title, e.g. "Mp XV".
+    -x|--xml-target-dir directory containing xmlManuscriptFile, default "./xml"
+
+    :return: exit code (int)
+    """
+    title = ''
+    xml_target_dir = ".{}xml".format(sep)
+    page_url_mapping = {}
+
+    try:
+        opts, args = getopt.getopt(argv, "ht:x:", ["help", "title=", "xml-target-dir="])
+    except getopt.GetoptError:
+        usage()
+        return 2
+
+    for opt, arg in opts:
+        if opt in ('-h', '--help'):
+            usage()
+            return 0
+        elif opt in ('-t', '--title'):
+            title = arg
+        elif opt in ('-x', '--xml-target-dir'):
+            xml_target_dir = arg
+
+    manuscript_files = [ arg for arg in args if arg.endswith('.xml')\
+            and '_page' not in arg ]
+    input_files = [ arg for arg in args if arg.endswith('.txt')\
+            and isfile(arg)]
+    for input_file in input_files:
+        create_page_url_mapping(input_file, page_url_mapping, default_title=title)
+    creator = ManuscriptCreator(xml_target_dir=xml_target_dir)
+    creator.create_or_update_manuscripts(manuscript_files, page_url_mapping)
+    return 0
+
+if __name__ == "__main__":
+    sys.exit(main(sys.argv[1:]))
Index: tests_svgscripts/test_create_manuscript.py
===================================================================
--- tests_svgscripts/test_create_manuscript.py (revision 0)
+++ tests_svgscripts/test_create_manuscript.py (revision 88)
@@ -0,0 +1,50 @@
+import unittest
+from os import sep, path, remove
+from os.path import isfile
+import lxml.etree as ET
+import warnings
+import sys
+
+sys.path.append('svgscripts')
+import create_manuscript
+from datatypes.manuscript import ArchivalManuscriptUnity
+
+class TestCreateManuscript(unittest.TestCase):
+
+    def setUp(self):
+        create_manuscript.UNITTESTING = True
+        DATADIR = path.dirname(__file__) + sep + 'test_data'
+        self.content_file = DATADIR + sep + 'content.txt'
+
+    def test_create_page_url_mapping(self):
+        mapping = {}
+        create_manuscript.create_page_url_mapping(self.content_file, mapping)
+        self.assertTrue('Mp XV' in mapping.keys())
+        #print(mapping)
+        #mapping = {}
+        #create_manuscript.create_page_url_mapping('content.txt', mapping, default_title='Mp XV')
+        #print(mapping)
+        creator = create_manuscript.ManuscriptCreator('')
+        pages_node = ET.Element('pages')
+        #creator._create_or_update_pages(pages_node, mapping['Mp XV'])
+        #print(ET.dump(pages_node))
+
+    def test_get_or_create_element(self):
+        creator = create_manuscript.ManuscriptCreator('')
+        manuscript_tree = ET.ElementTree(ET.Element(ArchivalManuscriptUnity.XML_TAG))
+        self.assertEqual(len(manuscript_tree.xpath('test')), 0)
+        node = creator._get_or_create_element(manuscript_tree.getroot(), 'test', create_id=True)
+        self.assertEqual(len(manuscript_tree.xpath('test')), 1)
+        node = creator._get_or_create_element(manuscript_tree.getroot(), 'test[@id="0"]')
+        self.assertEqual(len(manuscript_tree.xpath('test')), 1)
+        node = creator._get_or_create_element(manuscript_tree.getroot(), 'page[@number="10"]')
+        self.assertEqual(node.get('number'), '10')
+        node = creator._get_or_create_element(manuscript_tree.getroot(), 'page[@number="0"]', create_id=True)
+        self.assertEqual(node.get('id'), '1')
+        self.assertEqual(node.get('number'), '0')
+
+    def test_main(self):
+        create_manuscript.main(['-x', 'xml', '-t', 'Mp XV', self.content_file])
+
+if __name__ == "__main__":
+    unittest.main()
Index: tests_svgscripts/test_data/content.txt
===================================================================
--- tests_svgscripts/test_data/content.txt (revision 0)
+++ tests_svgscripts/test_data/content.txt (revision 88)
@@ -0,0 +1,76 @@
+Mp XV, S. 74r
+www.nietzschesource.org/DFGA/Mp-XV-2c,1
+Mp XV, S. 74v
+http://www.nietzschesource.org/DFGA/Mp-XV-2c,2
+Mp XV, S. 75r
+www.nietzschesource.org/DFGA/Mp-XV-2c,3
+Mp XV, S. 75v
+http://www.nietzschesource.org/DFGA/Mp-XV-2c,4
+Mp XV, S. 76r
+www.nietzschesource.org/DFGA/Mp-XV-2c,5
+Mp XV, S. 77r
+www.nietzschesource.org/DFGA/Mp-XV-2c,7
+Mp XV, S. 78r
+www.nietzschesource.org/DFGA/Mp-XV-2d,1
+Mp XV, S. 78v
+http://www.nietzschesource.org/DFGA/Mp-XV-2d,2
+Mp XV, S. 79r
+www.nietzschesource.org/DFGA/Mp-XV-2d,3
+Mp XV, S. 79v
+http://www.nietzschesource.org/DFGA/Mp-XV-2d,4
+Mp XV, S. 80r
+www.nietzschesource.org/DFGA/Mp-XV-2d,5
+Mp XV, S. 80v
+http://www.nietzschesource.org/DFGA/Mp-XV-2d,6
+Mp XV, S. 81r
+www.nietzschesource.org/DFGA/Mp-XV-2d,7
+Mp XV, S. 81v
+http://www.nietzschesource.org/DFGA/Mp-XV-2d,8
+Mp XV, S. 82r
+www.nietzschesource.org/DFGA/Mp-XV-2d,9
+Mp XV, S. 82v
+http://www.nietzschesource.org/DFGA/Mp-XV-2d,10
+Mp XV, S. 83r
+www.nietzschesource.org/DFGA/Mp-XV-2d,11
+Mp XV, S. 83v
+http://www.nietzschesource.org/DFGA/Mp-XV-2d,12
+Mp XV, S. 84r
+www.nietzschesource.org/DFGA/Mp-XV-2d,13
+Mp XV, S. 85v
+http://www.nietzschesource.org/DFGA/Mp-XV-2d,16
+Mp XV, S. 86r
+www.nietzschesource.org/DFGA/Mp-XV-2d,17
+Mp XV, S. 86v
+http://www.nietzschesource.org/DFGA/Mp-XV-2d,18
+Mp XV, S. 87r
+www.nietzschesource.org/DFGA/Mp-XV-2d,19
+Mp XV, S. 87v
+http://www.nietzschesource.org/DFGA/Mp-XV-2d,20
+Mp XV, S. 88r
+www.nietzschesource.org/DFGA/Mp-XV-2d,21
+Mp XV, S. 89r
+www.nietzschesource.org/DFGA/Mp-XV-2d,23
+Mp XV, S. 90r
+www.nietzschesource.org/DFGA/Mp-XV-2d,25
+Mp XV, S. 92r
+www.nietzschesource.org/DFGA/Mp-XV-2d,29
+Mp XV, S. 92v
+http://www.nietzschesource.org/DFGA/Mp-XV-2d,30
+Mp XV, S. 94r
+http://www.nietzschesource.org/DFGA/Mp-XV-2e,1
+Mp XV, S. 94v
+http://www.nietzschesource.org/DFGA/Mp-XV-2e,2
+Mp XV, S. 95r
+http://www.nietzschesource.org/DFGA/Mp-XV-2e,3
+Mp XV, S. 96r
+http://www.nietzschesource.org/DFGA/Mp-XV-2e,5
+Mp XV, S. 97r
+http://www.nietzschesource.org/DFGA/Mp-XV-2e,7
+Mp XV, S. 98v
+http://www.nietzschesource.org/DFGA/Mp-XV-2e,10
+Mp XV, S. 99r
+http://www.nietzschesource.org/DFGA/Mp-XV-2e,11
+Mp XV, S. 100r
+http://www.nietzschesource.org/DFGA/Mp-XV-2e,13
+Mp XV, S. 113r
+www.nietzschesource.org/DFGA/Mp-XV-3c,1
Index: shared_util/myxmlwriter.py
===================================================================
--- shared_util/myxmlwriter.py (revision 87)
+++ shared_util/myxmlwriter.py (revision 88)
@@ -1,203 +1,203 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
""" This program can be used to pretty-write a xml string to a xml file.
"""
# Copyright (C) University of Basel 2019 {{{1
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>. 1}}}
import inspect
import xml.dom.minidom as MD
import xml.etree.ElementTree as ET
import lxml.etree as LET
from datetime import datetime
from rdflib import URIRef
from os import makedirs
-from os.path import sep, basename, dirname
+from os.path import sep, basename, dirname, isfile
import sys
import warnings
sys.path.append('svgscripts')
from datatypes.page import FILE_TYPE_SVG_WORD_POSITION, FILE_TYPE_XML_MANUSCRIPT
__author__ = "Christian Steiner"
__maintainer__ = __author__
__copyright__ = 'University of Basel'
__email__ = "christian.steiner@unibas.ch"
__status__ = "Development"
__license__ = "GPL v3"
__version__ = "0.0.1"
FILE_TYPE_SVG_WORD_POSITION = FILE_TYPE_SVG_WORD_POSITION
FILE_TYPE_XML_MANUSCRIPT = FILE_TYPE_XML_MANUSCRIPT
FILE_TYPE_XML_DICT = 'xml-dictionary'
def attach_dict_to_xml_node(dictionary, xml_node):
"""Create a xml tree from a dictionary.
"""
for key in dictionary.keys():
elem_type = type(dictionary[key])
if elem_type != dict:
node = LET.SubElement(xml_node, key, attrib={'type': elem_type.__name__})
node.text = str(dictionary[key])
else:
attach_dict_to_xml_node(dictionary[key], LET.SubElement(xml_node, key))
def dict2xml(dictionary, target_file_name):
"""Write dict 2 xml.
"""
xml_tree = LET.ElementTree(LET.Element('root'))
attach_dict_to_xml_node(dictionary, LET.SubElement(xml_tree.getroot(), 'dict'))
write_pretty(xml_element_tree=xml_tree, file_name=target_file_name,\
script_name=inspect.currentframe().f_code.co_name, file_type=FILE_TYPE_XML_DICT)
def get_dictionary_from_node(node):
"""Return dictionary from node.
:return: dict
"""
new_dict = {}
if len(node.getchildren()) > 0:
new_dict.update({ node.tag : {} })
for child_node in node.getchildren():
new_dict.get(node.tag).update(get_dictionary_from_node(child_node))
else:
elem_cls = eval(node.get('type')) if bool(node.get('type')) else str
value = elem_cls(node.text) if bool(node.text) else None
new_dict.update({ node.tag: value })
return new_dict
def lock_xml_tree(xml_element_tree, **locker_dict):
"""Lock xml_element_tree.
"""
if xml_element_tree is not None and not test_lock(xml_element_tree, silent=True):
message = locker_dict.get('message') if bool(locker_dict.get('message')) else ''
reference_file = locker_dict.get('reference_file') if bool(locker_dict.get('reference_file')) else ''
metadata = xml_element_tree.xpath('./metadata')[0]\
if len(xml_element_tree.xpath('./metadata')) > 0\
else LET.SubElement(xml_element_tree.getroot(), 'metadata')
lock = LET.SubElement(metadata, 'lock')
LET.SubElement(lock, 'reference-file').text = reference_file
if message != '':
LET.SubElement(lock, 'message').text = message
def parse_xml_of_type(xml_source_file, file_type):
"""Return a xml_tree from xml_source_file if file is of type file_type.
"""
parser = LET.XMLParser(remove_blank_text=True)
xml_tree = LET.parse(xml_source_file, parser)
if not xml_has_type(file_type, xml_tree=xml_tree):
msg = 'File {} is not of type {}!'.format(xml_source_file, file_type)
raise Exception(msg)
return xml_tree
def test_lock(xml_element_tree=None, silent=False):
"""Test if xml_element_tree is locked and print a message.
:return: True if locked
"""
if xml_element_tree is None:
return False
if len(xml_element_tree.findall('./metadata/lock')) > 0:
reference_file = xml_element_tree.findall('./metadata/lock/reference-file')
message = xml_element_tree.findall('./metadata/lock/message')
if not silent:
warning_msg = 'File {0} is locked!'.format(xml_element_tree.docinfo.URL)
if len(reference_file) > 0:
warning_msg = warning_msg.replace('!', ' ') + 'on {0}.'.format(reference_file[0].text)
if len(message) > 0:
warning_msg = warning_msg + '\n{0}'.format(message[0].text)
warnings.warn(warning_msg)
return True
return False
def update_metadata(xml_element_tree, script_name, file_type=None):
"""Updates metadata of xml tree.
"""
if len(xml_element_tree.getroot().findall('./metadata')) > 0:
if len(xml_element_tree.getroot().find('./metadata').findall('./modifiedBy[@script="{}"]'.format(script_name))) == 0:
LET.SubElement(xml_element_tree.getroot().find('./metadata'), 'modifiedBy', attrib={'script': script_name})
xml_element_tree.getroot().find('./metadata').findall('./modifiedBy[@script="{}"]'.format(script_name))[0].text = \
datetime.now().strftime('%Y-%m-%d %H:%M:%S')
else:
metadata = LET.SubElement(xml_element_tree.getroot(), 'metadata')
if file_type is not None:
LET.SubElement(metadata, 'type').text = file_type
createdBy = LET.SubElement(metadata, 'createdBy')
LET.SubElement(createdBy, 'script').text = script_name
LET.SubElement(createdBy, 'date').text = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
def write_backup(xml_element_tree: LET.ElementTree, file_type=None, bak_dir='./bak') -> str:
"""Back up a xml_source_file.
:return: target_file_name
"""
date_string = datetime.now().strftime('%Y-%m-%d_%H:%M:%S')
makedirs(bak_dir, exist_ok=True)
target_file_name = bak_dir + sep + basename(xml_element_tree.docinfo.URL) + '_' + date_string
reference_file = xml_element_tree.docinfo.URL
write_pretty(xml_element_tree=xml_element_tree, file_name=target_file_name,\
script_name=__file__ + '({0},{1})'.format(inspect.currentframe().f_code.co_name, reference_file),\
file_type=file_type)
return target_file_name
def write_pretty(xml_string=None, xml_element_tree=None, file_name=None, script_name=None, backup=False, file_type=None, **locker_dict):
"""Writes a xml string pretty to a file.
"""
if not bool(xml_string) and not bool(xml_element_tree):
raise Exception("write_pretty needs a string or a xml.ElementTree!")
if not test_lock(xml_element_tree):
if len(locker_dict) > 0 and bool(locker_dict.get('reference_file')):
lock_xml_tree(xml_element_tree, **locker_dict)
if script_name is not None and xml_element_tree is not None:
update_metadata(xml_element_tree, script_name, file_type=file_type)
if file_name is None and xml_element_tree is not None\
and xml_element_tree.docinfo is not None and xml_element_tree.docinfo.URL is not None:
file_name = xml_element_tree.docinfo.URL
if file_name is None:
raise Exception("write_pretty needs a file_name or a xml.ElementTree with a docinfo.URL!")
if backup and xml_element_tree is not None:
write_backup(xml_element_tree, file_type=file_type)
dom = MD.parseString(xml_string) if(bool(xml_string)) else MD.parseString(ET.tostring(xml_element_tree.getroot()))
f = open(file_name, "w")
dom.writexml(f, addindent="\t", newl='\n', encoding='utf-8')
f.close()
def xml2dict(xml_source_file):
"""Create dict from xml_source_file of Type FILE_TYPE_XML_DICT.
:return: dict
"""
new_dict = {}
xml_tree = LET.parse(xml_source_file)
if xml_has_type(FILE_TYPE_XML_DICT, xml_tree=xml_tree)\
and len(xml_tree.xpath('/root/dict')) > 0:
for node in xml_tree.xpath('/root/dict')[0].getchildren():
new_dict.update(get_dictionary_from_node(node))
else:
msg = 'File {} is not of type {}!'.format(xml_source_file, FILE_TYPE_XML_DICT)
raise Exception(msg)
return new_dict
def xml_has_type(file_type, xml_source_file=None, xml_tree=None):
"""Return true if xml_source_file/xml_tree has file type == file_type.
"""
if xml_tree is None and xml_source_file is None:
return False
- if xml_tree is None:
+ if xml_tree is None and isfile(xml_source_file):
xml_tree = LET.parse(xml_source_file)
+ if xml_tree is None:
+ return False
if len(xml_tree.xpath('//metadata/type/text()')) < 1:
return False
return xml_tree.xpath('//metadata/type/text()')[0] == file_type