Index: Notiz4Dominique.txt
===================================================================
--- Notiz4Dominique.txt	(revision 87)
+++ Notiz4Dominique.txt	(revision 88)
@@ -1,8 +0,0 @@
-I have created a data folder 'DATA' in the shared SwitchDrive folder. It contains the following files:
-
-pdf
-svg
-text_svg
-xml
-
-You can link these into this folder in order to use the scripts with the current data.
Index: TODO.md
===================================================================
--- TODO.md	(revision 87)
+++ TODO.md	(revision 88)
@@ -1,111 +1,111 @@
 # TASKS
-+ Data
+- Data
 - Frontend
-- Funding
++ Funding
 - Organization
 
 # Word search:
 
 - The word search should be weighted by the topological proximity of the words to one another.
 - Word paths, i.e. sequences of words, should be avoided, since they cannot be generated automatically and
    are highly error-prone.
 - Word insertions should therefore not be used to record alternative courses of the text either.
 
 
 # TODO
 ## Faksimile data input
 - word boxes on faksimile by drawing rects with inkscape [IN PROGRESS, see "Leitfaden.pdf"]
 - naming word boxes by using title of rects [IN PROGRESS, see "Leitfaden\_Kontrolle\_und\_Beschriftung\_der\_Wortrahmen.pdf"]
 - correcting faksimile svg or transkription xml if words do not correspond 
 
 ## Processing
 ### faksimile data input, i.e. svg-file resulting from drawing boxes etc. with inkscape 
 - process faksimile words:
    - join\_faksimileAndTranskription.py [DONE]
    - create faksimile position of line [TODO]
    - create a data input task for words that do not correspond [DONE] 
 
 ### transkription, i.e. svg-file resulting from pdf-file ->created with InDesign
 - fix:
    - xml/W\_II\_1\_page131.xml:
       - eigentlichste: [TODO]
          has two parts (['eigentlich', 'ste']), both are deleted, one part ('ste') has a box ('e')
          -> expected result: earlier_version = eigentlichste, earlier_version.earlier_version = eigentliche
    - xml/N\_VII\_1\_page138.xml:
       - AufBau          [DONE]
       - Verschiedenes   [DONE]
 - process text field:
    - Word [DONE]
    - SpecialWord
       - MarkForeignHands [DONE]
       - TextConnectionMark [DONE]
    - WordInsertionMark [DONE]
    - all paths -> page.categorize\_paths [TODO]
       - word-deletion -> Path [DONE]
          - make parts of word if only parts of a word are deleted, also introduce earlier version of word [DONE]
          - correction concerning punctuations in words that are deleted, script does not recognize parts of deleted
             words as deleted if they consist of punctuation marks. [TODO]
       - word-undeletion (e.g. N VII 1, 18,6 -> "mit")
       - underline
       - text-area-deletion
       - text-connection-lines
       - boxes
    
 - process footnotes:
    - Return footnotes with standoff [DONE]
    - TextConnectionMark [DONE]
    - TextConnection with uncertainty [TODO]
       - "Fortsetzung [0-9]+,[0-9]+?"
       - "Fortsetzung von [0-9]+,[0-9]+?"
    - concerning Word:
       - uncertain transcription: "?" / may have bold word parts
       - atypical writing: "¿" and bold word parts
       - clarification corrections ("Verdeutlichungskorrekturen"): "Vk" and bold word parts
       - correction: "word>" and ">?" (with uncertainty)
    - concerning word deletion:
       - atypical writing: "¿" and "Durchstreichung" (see N VII 1, 11,2)
 
 - process margins:
    - MarkForeignHands [DONE]
    - ForeignHandTextAreaDeletion [TODO]
    - boxes: make earlier version of a word [TODO]
    - TextConnection [TODO]
       - from: ([0-9]+,)*[0-9]+ -)
       - to: -) ([0-9]+,)*[0-9]+  
 
 ## Datatypes
 - make datatypes:
    - Page [ok] --> page orientation!!!
    - SimpleWord
       - SpecialWord
          - MarkForeignHands ("Zeichen für Fremde Hand") [DONE]
          - TextConnectionMark ("Anschlußzeichen") [DONE]
             - has a Reference
       - Word [ok]  --> deal with non-horizontal text [DONE]
                    --> hyphenation [TODO]
                    --> add style info to word: font { German, Latin } [DONE]
                    --> pen color [DONE]
                    --> connect style with character glyph-id from svg path file 
                    --> has parts [DONE]
                    --> versions: later version of earlier version [DONE]
    - WritingProcess >>>> use only in connection with earlier versions of word
       - correlates with font size:
          - biggest font to biggest-1 font: stage 0
          - font in between: stage 1
          - smallest font to smallest+1 font: stage 2
    - Style [DONE]
    - WordPosition [ok]
       - TranskriptionPosition [ok] 
       - FaksimilePosition [ok]
    - LineNumber [reDo]
       - change to Line
    - Reference [TODO]+
    - TextConnection
       - needs change of LineNumber to Line
    - ForeignHandTextAreaDeletion [TODO]
    - Freehand: 
       - Deletion [DONE]
          - make parts of word if only parts of a word are deleted, also introduce earlier version of word [DONE]
       - WordInsertionMark [reDO]
       - Underline [TODO]
 
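Note on the open "TextConnection with uncertainty" item in TODO.md: the two footnote patterns listed there could be recognised roughly as sketched below. This is only an illustration, not the project's parser. It assumes that the trailing "?" in the TODO patterns is a literal uncertainty marker rather than a regex quantifier, and the names CONTINUATION and parse_continuation are hypothetical.

```python
import re

# Assumption: the trailing "?" of the TODO patterns is a literal question mark
# that flags an uncertain connection, so it is escaped and made optional.
CONTINUATION = re.compile(
    r'^Fortsetzung (?P<von>von )?(?P<first>[0-9]+),(?P<second>[0-9]+)(?P<uncertain>\?)?$'
)

def parse_continuation(text):
    """Return a dict describing a continuation footnote, or None if text does not match."""
    match = CONTINUATION.match(text.strip())
    if match is None:
        return None
    return {
        'has_von': match.group('von') is not None,
        'reference': (int(match.group('first')), int(match.group('second'))),
        'uncertain': match.group('uncertain') is not None,
    }

if __name__ == '__main__':
    print(parse_continuation('Fortsetzung von 18,6?'))
    print(parse_continuation('Fortsetzung 11,2'))
```

Keeping the uncertainty flag separate from the reference numbers would let a later TextConnection datatype store both without reparsing the footnote text.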
Index: svgscripts/datatypes/manuscript.py
===================================================================
--- svgscripts/datatypes/manuscript.py	(revision 87)
+++ svgscripts/datatypes/manuscript.py	(revision 88)
@@ -1,137 +1,141 @@
 #!/usr/bin/env python3
 # -*- coding: utf-8 -*-
 
 """   This class can be used to represent an archival unity of manuscript pages, i.e. workbooks, notebooks, folders of handwritten pages.
 """
 #    Copyright (C) University of Basel 2019  {{{1
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
 #    the Free Software Foundation, either version 3 of the License, or
 #    (at your option) any later version.
 #
 #    This program is distributed in the hope that it will be useful,
 #    but WITHOUT ANY WARRANTY; without even the implied warranty of
 #    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 #    GNU General Public License for more details.
 #
 #    You should have received a copy of the GNU General Public License
 #    along with this program.  If not, see <https://www.gnu.org/licenses/> 1}}}
 
 __author__ = "Christian Steiner"
 __maintainer__ = __author__
 __copyright__ = 'University of Basel'
 __email__ = "christian.steiner@unibas.ch"
 __status__ = "Development"
 __license__ = "GPL v3"
 __version__ = "0.0.1"
 
 from lxml import etree as ET
 from os.path import isfile
 import sys
 
 from .page import Page, FILE_TYPE_XML_MANUSCRIPT, FILE_TYPE_SVG_WORD_POSITION
 from .color import Color
 
 sys.path.append('py2ttl')
 from class_spec import SemanticClass
 
 sys.path.append('shared_util')
 from myxmlwriter import parse_xml_of_type, write_pretty, xml_has_type
 
 class ArchivalManuscriptUnity(SemanticClass):
     """
     This class represents an archival unity of manuscript pages (workbooks, notebooks and portfolios of handwritten pages).
     @label archival unity of manuscript pages
 
     Args:
         title               title of archival unity
         manuscript_type     type of manuscript: 'Arbeitsheft', 'Notizheft', 'Mappe' 
         manuscript_tree     lxml.ElementTree
     """
     XML_TAG = 'manuscript'
     XML_COLORS_TAG = 'colors'
+    TYPE_DICTIONARY = { 'Mp': 'Mappe', 'N': 'Notizheft', 'W': 'Arbeitsheft' }
     UNITTESTING = False
 
     def __init__(self, title='', manuscript_type='', manuscript_tree=None):
         self.colors = []
         self.manuscript_tree = manuscript_tree
         self.manuscript_type = manuscript_type
         self.pages = []
         self.styles = []
         self.title = title
+        if self.manuscript_type == '' and self.title != ''\
+                and self.title.split(' ')[0] in self.TYPE_DICTIONARY.keys():
+            self.manuscript_type = self.TYPE_DICTIONARY[self.title.split(' ')[0]]
 
     def get_name_and_id(self):
         """Return an identification for object as 2-tuple.
         """
         return '', self.title.replace(' ', '_')
 
     @classmethod
     def create_cls(cls, xml_manuscript_file, page_status_list=None, page_xpath='', update_page_styles=False):
         """Create an instance of ArchivalManuscriptUnity from a xml file of type FILE_TYPE_XML_MANUSCRIPT.
 
             :return: ArchivalManuscriptUnity
         """
         manuscript_tree = parse_xml_of_type(xml_manuscript_file, FILE_TYPE_XML_MANUSCRIPT)
         title = manuscript_tree.getroot().get('title') if bool(manuscript_tree.getroot().get('title')) else ''
         manuscript_type = manuscript_tree.getroot().get('type') if bool(manuscript_tree.getroot().get('type')) else ''
         manuscript = cls(title=title, manuscript_type=manuscript_type, manuscript_tree=manuscript_tree)
         manuscript.colors = [ Color.create_cls(node=color_node) for color_node in manuscript_tree.xpath('.//' + cls.XML_COLORS_TAG + '/' + Color.XML_TAG) ]
         if page_xpath == '':
             page_status = '' 
             if page_status_list is not None\
                     and type(page_status_list) is list\
                     and len(page_status_list) > 0:
                 page_status = '[' + ' and '.join([ f'contains(@status, "{status}")' for status in page_status_list ]) + ']'
             page_xpath = f'//pages/page{page_status}/@output'
         manuscript.pages = [ Page(page_source)\
                 for page_source in manuscript_tree.xpath(page_xpath)\
                 if isfile(page_source) and xml_has_type(FILE_TYPE_SVG_WORD_POSITION, xml_source_file=page_source) ]
         if update_page_styles:
             for page in manuscript.pages: page.update_styles(manuscript=manuscript, add_to_parents=True)
         return manuscript
 
     def get_color(self, hex_color) -> Color:
         """Return color if it exists or None.
         """
         if hex_color in [ color.hex_color for color in self.colors ]: 
             return [ color for color in self.colors if color.hex_color == hex_color ][0]
         return None
 
     @classmethod
     def get_semantic_dictionary(cls):
         """ Creates a semantic dictionary as specified by SemanticClass.
         """
         dictionary = {} 
         class_dict = cls.get_class_dictionary()
         properties = {}
         properties.update(cls.create_semantic_property_dictionary('title', str, 1))
         properties.update(cls.create_semantic_property_dictionary('manuscript_type', str, 1))
         properties.update(cls.create_semantic_property_dictionary('styles', list))
         properties.update(cls.create_semantic_property_dictionary('pages', list))
         dictionary.update({cls.CLASS_KEY: class_dict})
         dictionary.update({cls.PROPERTIES_KEY: properties})
         return cls.return_dictionary_after_updating_super_classes(dictionary)
 
     def update_colors(self, color):
         """Update manuscript colors if color is not contained.
         """
         if self.get_color(color.hex_color) is None:
             self.colors.append(color)
             if self.manuscript_tree is not None:
                 if len(self.manuscript_tree.xpath('.//' + self.XML_COLORS_TAG)) > 0:
                     self.manuscript_tree.xpath('.//' + self.XML_COLORS_TAG)[0].getparent().remove(self.manuscript_tree.xpath('.//' + self.XML_COLORS_TAG)[0])
                 colors_node = ET.SubElement(self.manuscript_tree.getroot(), self.XML_COLORS_TAG)
                 for color in self.colors:
                     color.attach_object_to_tree(colors_node)
                 if not self.UNITTESTING:
                     write_pretty(xml_element_tree=self.manuscript_tree, file_name=self.manuscript_tree.docinfo.URL,\
                     script_name=__file__, backup=True,\
                     file_type=FILE_TYPE_XML_MANUSCRIPT)
 
     def update_styles(self, *styles):
         """Update manuscript styles.
         """
         for style in styles:
             if style not in self.styles:
                 self.styles.append(style)
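Note on the TYPE_DICTIONARY lookup added to ArchivalManuscriptUnity.__init__ in this revision: the fallback can be read as a small pure function, sketched here outside the class for illustration. The function name infer_manuscript_type is hypothetical; the dictionary is copied from the diff above.

```python
# Copied from ArchivalManuscriptUnity.TYPE_DICTIONARY in this revision.
TYPE_DICTIONARY = {'Mp': 'Mappe', 'N': 'Notizheft', 'W': 'Arbeitsheft'}

def infer_manuscript_type(title, manuscript_type=''):
    """Return manuscript_type, falling back to the type implied by the title's first token."""
    # An explicit type always wins; an empty title gives nothing to infer from.
    if manuscript_type != '' or title == '':
        return manuscript_type
    prefix = title.split(' ')[0]
    # Unknown prefixes leave the type empty, as in the constructor.
    return TYPE_DICTIONARY.get(prefix, '')

if __name__ == '__main__':
    for title in ('Mp XV', 'N VII 1', 'W II 1', 'X 1'):
        print(repr(title), '->', repr(infer_manuscript_type(title)))
```

The lookup only fires when no type was passed in, so a type read from an existing manuscript file is never overwritten by the title prefix.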
Index: svgscripts/datatypes/super_page.py
===================================================================
--- svgscripts/datatypes/super_page.py	(revision 87)
+++ svgscripts/datatypes/super_page.py	(revision 88)
@@ -1,289 +1,290 @@
 #!/usr/bin/env python3
 # -*- coding: utf-8 -*-
 
 """   This class can be used to represent a super page.
 """
 #    Copyright (C) University of Basel 2019  {{{1
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
 #    the Free Software Foundation, either version 3 of the License, or
 #    (at your option) any later version.
 #
 #    This program is distributed in the hope that it will be useful,
 #    but WITHOUT ANY WARRANTY; without even the implied warranty of
 #    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 #    GNU General Public License for more details.
 #
 #    You should have received a copy of the GNU General Public License
 #    along with this program.  If not, see <https://www.gnu.org/licenses/> 1}}}
 
 __author__ = "Christian Steiner"
 __maintainer__ = __author__
 __copyright__ = 'University of Basel'
 __email__ = "christian.steiner@unibas.ch"
 __status__ = "Development"
 __license__ = "GPL v3"
 __version__ = "0.0.1"
 
 from lxml import etree as ET
 from os.path import isfile, basename, dirname
 from progress.bar import Bar
 from svgpathtools import svg2paths2, svg_to_paths
 from svgpathtools.parser import parse_path
 import sys
 import warnings
 
 from .image import Image, SVGImage
 from .faksimile_image import FaksimileImage
 from .mark_foreign_hands import MarkForeignHands
 from .text_connection_mark import TextConnectionMark
 from .text_field import TextField
 from .writing_process import WritingProcess
 
 class SuperPage:
     """
     This super class represents a page.
 
     Args:
         xml_source_file (str): name of the xml file to be instantiated.
         xml_target_file (str): name of the xml file to which page info will be written.
 
     """
     FILE_TYPE_SVG_WORD_POSITION = 'svgWordPosition'
     FILE_TYPE_XML_MANUSCRIPT = 'xmlManuscriptFile'
     PAGE_RECTO = 'recto'
     PAGE_VERSO = 'verso'
     STATUS_MERGED_OK = 'faksimile merged'
     STATUS_POSTMERGED_OK = 'words processed'
     UNITTESTING = False
+    XML_TAG = 'page'
 
     def __init__(self, xml_file, title=None, page_number='', orientation='North', page_type=PAGE_VERSO, should_xml_file_exist=False):
         self.properties_dictionary = {\
             'faksimile_image': (FaksimileImage.XML_TAG, None, FaksimileImage),\
             'faksimile_svgFile': ('data-source/@file', None, str),\
             'number': ('page/@number', str(page_number), str),\
             'orientation': ('page/@orientation', orientation, str),\
             'page_type': ('page/@pageType', page_type, str),\
             'pdfFile': ('pdf/@file', None, str),\
             'source': ('page/@source', None, str),\
             'svg_file': ('svg/@file', None, str),\
             'svg_image': (SVGImage.XML_TAG, None, SVGImage),\
             'text_field': (FaksimileImage.XML_TAG + '/' + TextField.XML_TAG, None, TextField),\
             'title': ('page/@title', title, str),\
         }
         self.online_properties = []
         self.line_numbers = []
         self.lines = []
         self.mark_foreign_hands = []
         self.page_tree = None 
         self.sonderzeichen_list = []
         self.style_dict = {}
         self.text_connection_marks = []
         self.word_deletion_paths = []
         self.word_insertion_marks = []
         self.words = []
         self.writing_processes = []
         self.xml_file = xml_file
         if not self.is_page_source_xml_file():
             msg = f'ERROR: xml_source_file {self.xml_file} is not of type "{self.FILE_TYPE_SVG_WORD_POSITION}"'
             raise Exception(msg)
         self._init_tree(should_xml_file_exist=should_xml_file_exist)
 
     def add_style(self, sonderzeichen_list=[], letterspacing_list=[], style_dict={}, style_node=None):
         """Adds a list of classes that are sonderzeichen and a style dictionary to page.
         """
         self.sonderzeichen_list = sonderzeichen_list
         self.letterspacing_list = letterspacing_list
         self.style_dict = style_dict
         if style_node is not None:
             self.style_dict = { item.get('name'): { key: value for key, value in item.attrib.items() if key != 'name' } for item in style_node.findall('.//class') }
             self.sonderzeichen_list = [ item.get('name') for item in style_node.findall('.//class')\
                     if bool(item.get('font-family')) and 'Sonderzeichen' in item.get('font-family') ]
             self.letterspacing_list = [ item.get('name') for item in style_node.findall('.//class')\
                     if bool(item.get('letterspacing-list')) ]
         elif bool(self.style_dict):
             style_node = ET.SubElement(self.page_tree.getroot(), 'style')
             if len(self.sonderzeichen_list) > 0:
                 style_node.set('Sonderzeichen', ' '.join(self.sonderzeichen_list))
             if len(self.letterspacing_list) > 0:
                 style_node.set('letterspacing-list', ' '.join(self.letterspacing_list))
             for key in self.style_dict.keys():
                 self.style_dict[key]['name'] = key
                 ET.SubElement(style_node, 'class', attrib=self.style_dict[key])
         fontsize_dict = { key: float(value.get('font-size').replace('px','')) for key, value in self.style_dict.items() if 'font-size' in value }
         fontsizes = sorted(fontsize_dict.values(), reverse=True)
         # create a mapping between fontsizes and word stages 
         self.fontsizekey2stage_mapping = {}
         for fontsize_key, value in fontsize_dict.items():
             if value >= fontsizes[0]-1:
                 self.fontsizekey2stage_mapping.update({ fontsize_key: WritingProcess.FIRST_VERSION })
             elif value <= fontsizes[len(fontsizes)-1]+1:
                 self.fontsizekey2stage_mapping.update({ fontsize_key: WritingProcess.LATER_INSERTION_AND_ADDITION })
             else:
                 self.fontsizekey2stage_mapping.update({ fontsize_key: WritingProcess.INSERTION_AND_ADDITION })
 
     def get_biggest_fontSize4styles(self, style_set={}):
         """Returns biggest font size from style_dict for a set of style class names.
 
             [:returns:] (float) biggest font size OR 1 if style_dict is empty
         """
         if bool(self.style_dict):
             sorted_font_sizes = sorted( (float(self.style_dict[key]['font-size'].replace('px','')) for key in style_set if bool(self.style_dict[key].get('font-size'))), reverse=True)
             return sorted_font_sizes[0] if len(sorted_font_sizes) > 0 else 1
         else:
             return 1
 
     def get_line_number(self, y):
         """Returns line number id for element at y.
 
             [:return:] (int) line number id or -1
         """
         if len(self.line_numbers) > 0:
             result_list = [ line_number.id for line_number in self.line_numbers if y >= line_number.top and y <= line_number.bottom ]
             return result_list[0] if len(result_list) > 0 else -1
         else:
             return -1
 
     def init_all_properties(self, overwrite=False):
         """Initialize all properties.
         """
         for property_key in self.properties_dictionary.keys():
             if property_key not in self.online_properties:
                 self.init_property(property_key, overwrite=overwrite)
     
     def init_property(self, property_key, value=None, overwrite=False):
         """Initialize a single property.
 
             Args:
                 property_key: key of property in self.__dict__
                 value:        new value to set to property
                 overwrite:    whether or not to update values from xml_file (default: read only)
         """
         if value is None:
             if property_key not in self.online_properties:
                 xpath, value, cls = self.properties_dictionary.get(property_key)
                 if len(self.page_tree.xpath('//' + xpath)) > 0:
                     value = self.page_tree.xpath('//' + xpath)[0]
                 if value is not None:
                     if cls.__module__ == 'builtins':
                         self.update_tree(value, xpath)
                         self.__dict__.update({property_key: cls(value)})
                     else:
                         value = cls(node=value)\
                                 if type(value) != cls\
                                 else value
                         self.__dict__.update({property_key: value})
                         self.__dict__.get(property_key).attach_object_to_tree(self.page_tree)
                 else:
                     self.__dict__.update({property_key: value})
                 self.online_properties.append(property_key)
         elif overwrite or property_key not in self.online_properties:
             xpath, default_value, cls = self.properties_dictionary.get(property_key)
             if cls.__module__ == 'builtins':
                 self.__dict__.update({property_key: cls(value)})
                 self.update_tree(value, xpath)
             else:
                 self.__dict__.update({property_key: value})
                 self.__dict__.get(property_key).attach_object_to_tree(self.page_tree)
             self.online_properties.append(property_key)
 
     def is_locked(self):
         """Return true if page is locked.
         """
         return len(self.page_tree.xpath('//metadata/lock')) > 0
 
     def is_page_source_xml_file(self, source_tree=None):
         """Return true if xml_file is of type FILE_TYPE_SVG_WORD_POSITION.
         """
         if not isfile(self.xml_file):
             return True
         if source_tree is None:
             source_tree = ET.parse(self.xml_file)
         return source_tree.getroot().find('metadata/type').text == self.FILE_TYPE_SVG_WORD_POSITION
 
     def lock(self, reference_file, message=''):
         """Lock tree such that ids of words etc. correspond to ids 
             in reference_file, optionally add a message that will be shown.
         """
         if not self.is_locked():
             metadata = self.page_tree.xpath('./metadata')[0]\
                 if len(self.page_tree.xpath('./metadata')) > 0\
                 else ET.SubElement(self.page_tree.getroot(), 'metadata')
             lock = ET.SubElement(metadata, 'lock')
             ET.SubElement(lock, 'reference-file').text = reference_file
             if message != '':
                 ET.SubElement(lock, 'message').text = message
 
     def unlock(self):
         """Unlock tree by removing the lock node from its metadata,
             so that word ids may be updated and attached again.
         """
         if self.is_locked():
             lock = self.page_tree.xpath('//metadata/lock')[0]
             lock.getparent().remove(lock) 
             
     def update_and_attach_words2tree(self, update_function_on_word=None, include_special_words_of_type=[]):
         """Update word ids and attach them to page.page_tree.
         """
         if not self.is_locked():
             update_function_on_word = [ update_function_on_word ]\
                     if type(update_function_on_word) != list\
                     else update_function_on_word
             for node in self.page_tree.xpath('.//word|.//' + MarkForeignHands.XML_TAG + '|.//' + TextConnectionMark.XML_TAG): 
                 node.getparent().remove(node)
             for index, word in enumerate(self.words):
                 word.id = index
                 for func in update_function_on_word:
                     if callable(func):
                         func(word)
                 word.attach_word_to_tree(self.page_tree)
             for index, mark_foreign_hands in enumerate(self.mark_foreign_hands):
                 mark_foreign_hands.id = index
                 if MarkForeignHands in include_special_words_of_type:
                     for func in update_function_on_word:
                         if callable(func):
                             func(mark_foreign_hands)
                 mark_foreign_hands.attach_word_to_tree(self.page_tree)
             for index, text_connection_mark in enumerate(self.text_connection_marks):
                 text_connection_mark.id = index
                 if TextConnectionMark in include_special_words_of_type:
                     for func in update_function_on_word:
                         if callable(func):
                             func(text_connection_mark)
                 text_connection_mark.attach_word_to_tree(self.page_tree)
         else:
             print('locked') 
     
     def update_property_dictionary(self, property_key, default_value):
         """Update properties_dictionary.
         """
         content = self.properties_dictionary.get(property_key)
         if content is not None:
             self.properties_dictionary.update({property_key: (content[0], default_value, content[2])})
         else:
             msg = f'ERROR: properties_dictionary does not contain a key {property_key}!'
             raise Exception(msg)
 
     def update_tree(self, value, xpath):
         """Update tree.
         """
         node_name = dirname(xpath)
         node = self.page_tree.xpath('//' + node_name)[0]\
                 if len(self.page_tree.xpath('//' + node_name)) > 0\
                 else ET.SubElement(self.page_tree.getroot(), node_name)
         node.set(basename(xpath).replace('@', ''), str(value))
 
     def _init_tree(self, should_xml_file_exist=False):
         """Initialize page_tree from xml_file if it exists.
         """
         if isfile(self.xml_file):
             parser = ET.XMLParser(remove_blank_text=True)
             self.page_tree = ET.parse(self.xml_file, parser)
         elif not should_xml_file_exist:
             self.page_tree = ET.ElementTree(ET.Element('page'))
             self.page_tree.docinfo.URL = self.xml_file
         else:
             msg = f'ERROR: xml_source_file {self.xml_file} does not exist!'
             raise FileNotFoundError(msg)
 
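Note on SuperPage.add_style: the font-size to writing-stage mapping at the end of that method can be sketched as a standalone function. The stage values 0, 1 and 2 are an assumption taken from the stage numbering in TODO.md; the real constants belong to WritingProcess and are not part of this diff. The function name map_fontsizes_to_stages is hypothetical.

```python
# Assumed stage values (see TODO.md: stage 0, stage 1, stage 2); the real
# constants are defined on WritingProcess, which is not shown in this diff.
FIRST_VERSION = 0
INSERTION_AND_ADDITION = 1
LATER_INSERTION_AND_ADDITION = 2

def map_fontsizes_to_stages(style_dict):
    """Map each style class that has a font-size to a writing stage."""
    fontsize_dict = {key: float(value['font-size'].replace('px', ''))
                     for key, value in style_dict.items() if 'font-size' in value}
    if not fontsize_dict:
        return {}
    biggest = max(fontsize_dict.values())
    smallest = min(fontsize_dict.values())
    mapping = {}
    for key, size in fontsize_dict.items():
        if size >= biggest - 1:
            # within one unit of the biggest font: first version
            mapping[key] = FIRST_VERSION
        elif size <= smallest + 1:
            # within one unit of the smallest font: latest stage
            mapping[key] = LATER_INSERTION_AND_ADDITION
        else:
            mapping[key] = INSERTION_AND_ADDITION
    return mapping

if __name__ == '__main__':
    styles = {'st1': {'font-size': '12px'}, 'st2': {'font-size': '9px'}, 'st3': {'font-size': '7px'}}
    print(map_fontsizes_to_stages(styles))
```

Because the "biggest" test is checked first, a page whose font sizes all lie within one unit of each other maps every class to the first version.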
Index: svgscripts/create_manuscript.py
===================================================================
--- svgscripts/create_manuscript.py	(revision 0)
+++ svgscripts/create_manuscript.py	(revision 88)
@@ -0,0 +1,204 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+"""   This program can be used to create an ArchivalManuscriptUnity.
+"""
+#    Copyright (C) University of Basel 2020  {{{1
+#
+#    This program is free software: you can redistribute it and/or modify
+#    it under the terms of the GNU General Public License as published by
+#    the Free Software Foundation, either version 3 of the License, or
+#    (at your option) any later version.
+#
+#    This program is distributed in the hope that it will be useful,
+#    but WITHOUT ANY WARRANTY; without even the implied warranty of
+#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#    GNU General Public License for more details.
+#
+#    You should have received a copy of the GNU General Public License
+#    along with this program.  If not, see <https://www.gnu.org/licenses/> 1}}}
+
+__author__ = "Christian Steiner"
+__maintainer__ = __author__
+__copyright__ = 'University of Basel'
+__email__ = "christian.steiner@unibas.ch"
+__status__ = "Development"
+__license__ = "GPL v3"
+__version__ = "0.0.1"
+
+from colorama import Fore, Style
+import getopt
+import re
+import sys
+from os import listdir, sep, path
+from os.path import isfile, isdir, dirname, basename
+import lxml.etree as ET
+
+if dirname(__file__) not in sys.path:
+    sys.path.append(dirname(__file__))
+
+from datatypes.manuscript import ArchivalManuscriptUnity
+from datatypes.super_page import SuperPage
+
+sys.path.append('shared_util')
+from myxmlwriter import parse_xml_of_type, write_pretty, xml_has_type, FILE_TYPE_SVG_WORD_POSITION, FILE_TYPE_XML_MANUSCRIPT
+
+
+
+UNITTESTING = False
+
+class ManuscriptCreator:
+    """This class can be used to create an ArchivalManuscriptUnity.
+    """
+
+    def __init__(self, xml_target_dir):
+        self.xml_target_dir = xml_target_dir
+
+    def _get_or_create_element(self, node, xpath, create_id=False) ->ET.Element:
+        """Return the first element matching xpath, or create it if it does not exist.
+        """
+        elements = node.xpath(xpath)
+        if len(elements) > 0:
+            return elements[0]
+        else:
+            if re.match(r'[a-z]+\[@[a-z-]+=', xpath):
+                element_name = re.match(r'(.+?)\[@[a-z]+.*', xpath).group(1)
+                num_elements = len(node.xpath(element_name))
+                element = ET.SubElement(node, element_name)
+                element_attribute = re.match(r'[a-z]+\[@(.+?)=.*', xpath).group(1)
+                element_value = re.match(r'[a-z]+\[@[a-z-]+="(.+?)"]', xpath).group(1)
+                element.set(element_attribute, element_value)
+                if create_id:
+                    element.set('id', str(num_elements))
+                return element
+            else:
+                num_elements = len(node.xpath(xpath))
+                element = ET.SubElement(node, xpath)
+                if create_id:
+                    element.set('id', str(num_elements))
+                return element
+
+    def _create_or_update_pages(self, pages_node, manuscript_page_url_mapping):
+        """Create or update pages.
+        """
+        for page_number, url in manuscript_page_url_mapping.items():
+            xpath = SuperPage.XML_TAG + f'[@number="{page_number}"]'
+            page_node = self._get_or_create_element(pages_node, xpath, create_id=True)
+            if not bool(page_node.get('alias')):
+                page_node.set('alias', basename(url))
+
+    def create_or_update_manuscripts(self, manuscript_files, page_url_mapping):
+        """Create or update manuscripts.
+        """
+        for key in page_url_mapping:
+            relevant_files = [ manuscript_file for manuscript_file in manuscript_files\
+                    if basename(manuscript_file) == key.replace(' ', '_') + '.xml']
+            if len(relevant_files) == 0:
+                manuscript_files.append(key.replace(' ', '_') + '.xml')
+        for manuscript_file in manuscript_files:
+            target_file = self.xml_target_dir + sep + manuscript_file\
+                    if dirname(manuscript_file) == ''\
+                    else manuscript_file
+            title = basename(target_file).replace('.xml', '').replace('_', ' ')
+            manuscript = ArchivalManuscriptUnity(title=title)
+            if isfile(target_file):
+                manuscript = ArchivalManuscriptUnity.create_cls(target_file)
+            else:
+                manuscript.manuscript_tree = ET.ElementTree(ET.Element(ArchivalManuscriptUnity.XML_TAG))
+                manuscript.manuscript_tree.docinfo.URL = target_file
+                manuscript.manuscript_tree.getroot().set('title', manuscript.title)
+                manuscript.manuscript_tree.getroot().set('type', manuscript.manuscript_type)
+            if title in page_url_mapping.keys():
+                pages_node = self._get_or_create_element(manuscript.manuscript_tree.getroot(), 'pages')
+                self._create_or_update_pages(pages_node, page_url_mapping[title])
+            if not UNITTESTING:
+                write_pretty(xml_element_tree=manuscript.manuscript_tree, file_name=target_file,\
+                    script_name=__file__, file_type=FILE_TYPE_XML_MANUSCRIPT)
+
+def create_page_url_mapping(input_file, mapping_dictionary, default_title=''):
+    """Create a page to url mapping from input file.
+
+        File content:
+
+            TITLE, S. PAGENUMBER
+            URL
+            See: 'tests_svgscripts/test_data/content.txt'
+    """
+    lines = []
+    with open(input_file, 'r') as f:
+        lines = f.readlines()
+    key = None
+    url = None
+    current_key = default_title
+    for content in lines:
+        if content.startswith('http')\
+                or content.startswith('www'):
+            url = content.replace('\n', '')\
+                    if content.startswith('http')\
+                    else 'http://' + content.replace('\n', '')
+            if current_key not in mapping_dictionary.keys():
+                mapping_dictionary.update({current_key: {}})
+            mapping_dictionary[current_key].update({key: url})
+        else:
+            key_parts = [ part.strip() for part in content.replace('\n', '').replace('S.', '').split(',') ]
+            key_index = 0
+            if len(key_parts) > 1:
+                title = key_parts[0]
+                current_key = title
+                if current_key not in mapping_dictionary.keys():
+                    mapping_dictionary.update({current_key: {}})
+                key_index = 1
+            key = key_parts[key_index]
+
+def usage():
+    """prints information on how to use the script
+    """
+    print(main.__doc__)
+
+def main(argv):
+    """This program can be used to create or update one or more manuscripts.
+ 
+
+    svgscripts/create_manuscript.py [OPTIONS] [<input_fileA.txt>, ...] [<xmlManuscriptFile>, ...]
+
+        <input_fileA.txt>           One or more files mapping pages to faksimile URLs, with 'txt'-suffix
+        <xmlManuscriptFile>         manuscript file(s) (~ArchivalManuscriptUnity).
+
+        OPTIONS:
+        -h|--help:                  show help
+        -t|--title=title            manuscript's title, e.g. "Mp XV".
+        -x|--xml-target-dir=dir     directory containing xmlManuscriptFile, default "./xml"
+
+        :return: exit code (int)
+    """
+    title = ''
+    xml_target_dir = ".{}xml".format(sep)
+    page_url_mapping = {}
+
+    try:
+        opts, args = getopt.getopt(argv, "ht:x:", ["help", "title=", "xml-target-dir="])
+    except getopt.GetoptError:
+        usage()
+        return 2
+
+    for opt, arg in opts:
+        if opt in ('-h', '--help'):
+            usage()
+            return 0
+        elif opt in ('-t', '--title'):
+            title = arg
+        elif opt in ('-x', '--xml-target-dir'):
+            xml_target_dir = arg
+
+    manuscript_files = [ arg for arg in args if arg.endswith('.xml')\
+            and '_page' not in arg ]
+    input_files = [ arg for arg in args if arg.endswith('.txt')\
+            and isfile(arg)]
+    for input_file in input_files:
+        create_page_url_mapping(input_file, page_url_mapping, default_title=title)
+    creator = ManuscriptCreator(xml_target_dir=xml_target_dir)
+    creator.create_or_update_manuscripts(manuscript_files, page_url_mapping)
+    return 0
+
+if __name__ == "__main__":
+    sys.exit(main(sys.argv[1:]))
Index: tests_svgscripts/test_create_manuscript.py
===================================================================
--- tests_svgscripts/test_create_manuscript.py	(revision 0)
+++ tests_svgscripts/test_create_manuscript.py	(revision 88)
@@ -0,0 +1,50 @@
+import unittest
+from os import sep, path, remove
+from os.path import isfile
+import lxml.etree as ET
+import warnings
+import sys
+
+sys.path.append('svgscripts')
+import create_manuscript
+from datatypes.manuscript import ArchivalManuscriptUnity
+
+class TestCreateManuscript(unittest.TestCase):
+
+    def setUp(self):
+        create_manuscript.UNITTESTING = True
+        DATADIR = path.dirname(__file__) + sep + 'test_data'  
+        self.content_file = DATADIR + sep + 'content.txt'
+
+    def test_create_page_url_mapping(self):
+        mapping = {}
+        create_manuscript.create_page_url_mapping(self.content_file, mapping)
+        self.assertIn('Mp XV', mapping)
+        #print(mapping)
+        #mapping = {}
+        #create_manuscript.create_page_url_mapping('content.txt', mapping, default_title='Mp XV')
+        #print(mapping)
+        creator = create_manuscript.ManuscriptCreator('')
+        pages_node = ET.Element('pages')
+        #creator._create_or_update_pages(pages_node, mapping['Mp XV'])
+        #print(ET.dump(pages_node))
+
+    def test_get_or_create_element(self):
+        creator = create_manuscript.ManuscriptCreator('')
+        manuscript_tree = ET.ElementTree(ET.Element(ArchivalManuscriptUnity.XML_TAG))
+        self.assertEqual(len(manuscript_tree.xpath('test')), 0)
+        node = creator._get_or_create_element(manuscript_tree.getroot(), 'test', create_id=True)
+        self.assertEqual(len(manuscript_tree.xpath('test')), 1)
+        node = creator._get_or_create_element(manuscript_tree.getroot(), 'test[@id="0"]')
+        self.assertEqual(len(manuscript_tree.xpath('test')), 1)
+        node = creator._get_or_create_element(manuscript_tree.getroot(), 'page[@number="10"]')
+        self.assertEqual(node.get('number'), '10')
+        node = creator._get_or_create_element(manuscript_tree.getroot(), 'page[@number="0"]', create_id=True)
+        self.assertEqual(node.get('id'), '1')
+        self.assertEqual(node.get('number'), '0')
+
+    def test_main(self):
+        self.assertEqual(create_manuscript.main(['-x', 'xml', '-t', 'Mp XV', self.content_file]), 0)
+
+if __name__ == "__main__":
+    unittest.main()
Index: tests_svgscripts/test_data/content.txt
===================================================================
--- tests_svgscripts/test_data/content.txt	(revision 0)
+++ tests_svgscripts/test_data/content.txt	(revision 88)
@@ -0,0 +1,76 @@
+Mp XV, S. 74r
+www.nietzschesource.org/DFGA/Mp-XV-2c,1
+Mp XV, S. 74v
+http://www.nietzschesource.org/DFGA/Mp-XV-2c,2
+Mp XV, S. 75r
+www.nietzschesource.org/DFGA/Mp-XV-2c,3
+Mp XV, S. 75v
+http://www.nietzschesource.org/DFGA/Mp-XV-2c,4
+Mp XV, S. 76r
+www.nietzschesource.org/DFGA/Mp-XV-2c,5
+Mp XV, S. 77r
+www.nietzschesource.org/DFGA/Mp-XV-2c,7
+Mp XV, S. 78r
+www.nietzschesource.org/DFGA/Mp-XV-2d,1
+Mp XV, S. 78v
+http://www.nietzschesource.org/DFGA/Mp-XV-2d,2
+Mp XV, S. 79r
+www.nietzschesource.org/DFGA/Mp-XV-2d,3
+Mp XV, S. 79v
+http://www.nietzschesource.org/DFGA/Mp-XV-2d,4
+Mp XV, S. 80r
+www.nietzschesource.org/DFGA/Mp-XV-2d,5
+Mp XV, S. 80v
+http://www.nietzschesource.org/DFGA/Mp-XV-2d,6
+Mp XV, S. 81r
+www.nietzschesource.org/DFGA/Mp-XV-2d,7
+Mp XV, S. 81v
+http://www.nietzschesource.org/DFGA/Mp-XV-2d,8
+Mp XV, S. 82r
+www.nietzschesource.org/DFGA/Mp-XV-2d,9
+Mp XV, S. 82v
+http://www.nietzschesource.org/DFGA/Mp-XV-2d,10
+Mp XV, S. 83r
+www.nietzschesource.org/DFGA/Mp-XV-2d,11
+Mp XV, S. 83v
+http://www.nietzschesource.org/DFGA/Mp-XV-2d,12
+Mp XV, S. 84r
+www.nietzschesource.org/DFGA/Mp-XV-2d,13
+Mp XV, S. 85v
+http://www.nietzschesource.org/DFGA/Mp-XV-2d,16
+Mp XV, S. 86r
+www.nietzschesource.org/DFGA/Mp-XV-2d,17
+Mp XV, S. 86v
+http://www.nietzschesource.org/DFGA/Mp-XV-2d,18
+Mp XV, S. 87r
+www.nietzschesource.org/DFGA/Mp-XV-2d,19
+Mp XV, S. 87v
+http://www.nietzschesource.org/DFGA/Mp-XV-2d,20
+Mp XV, S. 88r
+www.nietzschesource.org/DFGA/Mp-XV-2d,21
+Mp XV, S. 89r
+www.nietzschesource.org/DFGA/Mp-XV-2d,23
+Mp XV, S. 90r
+www.nietzschesource.org/DFGA/Mp-XV-2d,25
+Mp XV, S. 92r
+www.nietzschesource.org/DFGA/Mp-XV-2d,29
+Mp XV, S. 92v
+http://www.nietzschesource.org/DFGA/Mp-XV-2d,30
+Mp XV, S. 94r
+http://www.nietzschesource.org/DFGA/Mp-XV-2e,1
+Mp XV, S. 94v
+http://www.nietzschesource.org/DFGA/Mp-XV-2e,2
+Mp XV, S. 95r
+http://www.nietzschesource.org/DFGA/Mp-XV-2e,3
+Mp XV, S. 96r
+http://www.nietzschesource.org/DFGA/Mp-XV-2e,5
+Mp XV, S. 97r
+http://www.nietzschesource.org/DFGA/Mp-XV-2e,7
+Mp XV, S. 98v
+http://www.nietzschesource.org/DFGA/Mp-XV-2e,10
+Mp XV, S. 99r
+http://www.nietzschesource.org/DFGA/Mp-XV-2e,11
+Mp XV, S. 100r
+http://www.nietzschesource.org/DFGA/Mp-XV-2e,13
+Mp XV, S. 113r
+www.nietzschesource.org/DFGA/Mp-XV-3c,1
Index: shared_util/myxmlwriter.py
===================================================================
--- shared_util/myxmlwriter.py	(revision 87)
+++ shared_util/myxmlwriter.py	(revision 88)
@@ -1,203 +1,203 @@
 #!/usr/bin/env python3
 # -*- coding: utf-8 -*-
 
 """   This program can be used to pretty-write a xml string to a xml file.
 """
 #    Copyright (C) University of Basel 2019  {{{1
 #
 #    This program is free software: you can redistribute it and/or modify
 #    it under the terms of the GNU General Public License as published by
 #    the Free Software Foundation, either version 3 of the License, or
 #    (at your option) any later version.
 #
 #    This program is distributed in the hope that it will be useful,
 #    but WITHOUT ANY WARRANTY; without even the implied warranty of
 #    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 #    GNU General Public License for more details.
 #
 #    You should have received a copy of the GNU General Public License
 #    along with this program.  If not, see <https://www.gnu.org/licenses/> 1}}}
 
 import inspect
 import xml.dom.minidom as MD
 import xml.etree.ElementTree as ET
 import lxml.etree as LET
 from datetime import datetime
 from rdflib import URIRef
 from os import  makedirs
-from os.path import sep, basename, dirname
+from os.path import sep, basename, dirname, isfile
 import sys
 import warnings
 
 sys.path.append('svgscripts')
 from datatypes.page import FILE_TYPE_SVG_WORD_POSITION, FILE_TYPE_XML_MANUSCRIPT
 
 __author__ = "Christian Steiner"
 __maintainer__ = __author__
 __copyright__ = 'University of Basel'
 __email__ = "christian.steiner@unibas.ch"
 __status__ = "Development"
 __license__ = "GPL v3"
 __version__ = "0.0.1"
 
 FILE_TYPE_SVG_WORD_POSITION = FILE_TYPE_SVG_WORD_POSITION
 FILE_TYPE_XML_MANUSCRIPT = FILE_TYPE_XML_MANUSCRIPT
 FILE_TYPE_XML_DICT = 'xml-dictionary'
 
 def attach_dict_to_xml_node(dictionary, xml_node):
     """Create a xml tree from a dictionary.
     """
     for key in dictionary.keys():
         elem_type = type(dictionary[key])
         if elem_type != dict:
             node = LET.SubElement(xml_node, key, attrib={'type': elem_type.__name__})
             node.text = str(dictionary[key])
         else:
             attach_dict_to_xml_node(dictionary[key], LET.SubElement(xml_node, key))
 
 def dict2xml(dictionary, target_file_name):
     """Write dict 2 xml.
     """
     xml_tree = LET.ElementTree(LET.Element('root'))
     attach_dict_to_xml_node(dictionary, LET.SubElement(xml_tree.getroot(), 'dict'))
     write_pretty(xml_element_tree=xml_tree, file_name=target_file_name,\
             script_name=inspect.currentframe().f_code.co_name, file_type=FILE_TYPE_XML_DICT)
 
 def get_dictionary_from_node(node):
     """Return dictionary from node.
 
         :return: dict
     """
     new_dict = {}
     if len(node.getchildren()) > 0:
         new_dict.update({ node.tag : {} })
         for child_node in node.getchildren():
             new_dict.get(node.tag).update(get_dictionary_from_node(child_node))
     else:
         elem_cls = eval(node.get('type')) if bool(node.get('type')) else str
         value = elem_cls(node.text) if bool(node.text) else None
         new_dict.update({ node.tag: value })
     return new_dict
 
 def lock_xml_tree(xml_element_tree, **locker_dict):
     """Lock xml_element_tree.
     """
     if xml_element_tree is not None and not test_lock(xml_element_tree, silent=True):
         message = locker_dict.get('message') if bool(locker_dict.get('message')) else ''
         reference_file = locker_dict.get('reference_file') if bool(locker_dict.get('reference_file')) else ''
         metadata = xml_element_tree.xpath('./metadata')[0]\
                 if len(xml_element_tree.xpath('./metadata')) > 0\
                 else LET.SubElement(xml_element_tree.getroot(), 'metadata')
         lock = LET.SubElement(metadata, 'lock')
         LET.SubElement(lock, 'reference-file').text = reference_file
         if message != '':
             LET.SubElement(lock, 'message').text = message
 
 def parse_xml_of_type(xml_source_file, file_type):
     """Return a xml_tree from xml_source_file is file is of type file_type.
     """
     parser = LET.XMLParser(remove_blank_text=True)
     xml_tree = LET.parse(xml_source_file, parser)
     if not xml_has_type(file_type, xml_tree=xml_tree):
         msg = 'File {} is not of type {}!'.format(xml_source_file, file_type)
         raise Exception(msg)
     return xml_tree
 
 def test_lock(xml_element_tree=None, silent=False):
     """Test if xml_element_tree is locked and print a message.
 
         :return: True if locked
     """
     if xml_element_tree is None:
         return False
     if len(xml_element_tree.findall('./metadata/lock')) > 0:
         reference_file = xml_element_tree.findall('./metadata/lock/reference-file')
         message = xml_element_tree.findall('./metadata/lock/message')
         if not silent:
             warning_msg = 'File {0} is locked!'.format(xml_element_tree.docinfo.URL)
             if len(reference_file) > 0:
                 warning_msg = warning_msg.replace('!', ' ') + 'on {0}.'.format(reference_file[0].text)
             if len(message) > 0:
                 warning_msg = warning_msg + '\n{0}'.format(message[0].text)
             warnings.warn(warning_msg)
         return True
     return False
 
 def update_metadata(xml_element_tree, script_name, file_type=None):
     """Updates metadata of xml tree.
     """
     if len(xml_element_tree.getroot().findall('./metadata')) > 0:
         if len(xml_element_tree.getroot().find('./metadata').findall('./modifiedBy[@script="{}"]'.format(script_name))) == 0: 
             LET.SubElement(xml_element_tree.getroot().find('./metadata'), 'modifiedBy', attrib={'script': script_name})
         xml_element_tree.getroot().find('./metadata').findall('./modifiedBy[@script="{}"]'.format(script_name))[0].text = \
                 datetime.now().strftime('%Y-%m-%d %H:%M:%S')
     else:
         metadata = LET.SubElement(xml_element_tree.getroot(), 'metadata')
         if file_type is not None:
             LET.SubElement(metadata, 'type').text = file_type 
         createdBy = LET.SubElement(metadata, 'createdBy')
         LET.SubElement(createdBy, 'script').text = script_name
         LET.SubElement(createdBy, 'date').text = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
 
 def write_backup(xml_element_tree: LET.ElementTree, file_type=None, bak_dir='./bak') -> str:
     """Back up a xml_source_file.
 
         :return: target_file_name
     """
     date_string = datetime.now().strftime('%Y-%m-%d_%H:%M:%S')
     makedirs(bak_dir, exist_ok=True)
     target_file_name = bak_dir + sep + basename(xml_element_tree.docinfo.URL) + '_' + date_string
     reference_file = xml_element_tree.docinfo.URL
     write_pretty(xml_element_tree=xml_element_tree, file_name=target_file_name,\
                 script_name=__file__ + '({0},{1})'.format(inspect.currentframe().f_code.co_name, reference_file),\
                 file_type=file_type)
     return target_file_name
 
 def write_pretty(xml_string=None, xml_element_tree=None, file_name=None, script_name=None, backup=False, file_type=None, **locker_dict):
     """Writes a xml string pretty to a file.
     """
     if not bool(xml_string) and not bool(xml_element_tree):
         raise Exception("write_pretty needs a string or a xml.ElementTree!")
     if not test_lock(xml_element_tree):
         if len(locker_dict) > 0 and bool(locker_dict.get('reference_file')):
             lock_xml_tree(xml_element_tree, **locker_dict)
         if script_name is not None and xml_element_tree is not None:
             update_metadata(xml_element_tree, script_name, file_type=file_type)
         if file_name is None and xml_element_tree is not None\
                 and xml_element_tree.docinfo is not None and xml_element_tree.docinfo.URL is not None:
             file_name = xml_element_tree.docinfo.URL
         if file_name is None:
             raise Exception("write_pretty needs a file_name or a xml.ElementTree with a docinfo.URL!")
         if backup and xml_element_tree is not None:
             write_backup(xml_element_tree, file_type=file_type)
         dom = MD.parseString(xml_string) if(bool(xml_string)) else MD.parseString(ET.tostring(xml_element_tree.getroot()))
         f = open(file_name, "w")
         dom.writexml(f, addindent="\t", newl='\n', encoding='utf-8')
         f.close()
 
 def xml2dict(xml_source_file):
     """Create dict from xml_source_file of Type FILE_TYPE_XML_DICT.
 
         :return: dict
     """
     new_dict = {}
     xml_tree = LET.parse(xml_source_file)
     if xml_has_type(FILE_TYPE_XML_DICT, xml_tree=xml_tree)\
         and len(xml_tree.xpath('/root/dict')) > 0:
         for node in xml_tree.xpath('/root/dict')[0].getchildren():
             new_dict.update(get_dictionary_from_node(node))
     else:
         msg = 'File {} is not of type {}!'.format(xml_source_file, FILE_TYPE_XML_DICT)
         raise Exception(msg)
     return new_dict
 
 def xml_has_type(file_type, xml_source_file=None, xml_tree=None):
     """Return true if xml_source_file/xml_tree has file type == file_type.
     """
     if xml_tree is None and xml_source_file is None:
         return False
-    if xml_tree is None:
+    if xml_tree is None and isfile(xml_source_file):
         xml_tree = LET.parse(xml_source_file)
     if len(xml_tree.xpath('//metadata/type/text()')) < 1:
         return False
     return xml_tree.xpath('//metadata/type/text()')[0] == file_type