Index: Notiz4Dominique.txt =================================================================== --- Notiz4Dominique.txt (revision 87) +++ Notiz4Dominique.txt (revision 88) @@ -1,8 +0,0 @@ -I created a data folder 'DATA' in the shared SwitchDrive folder. It contains the following files: - -pdf -svg -text_svg -xml - -You can link these into this folder in order to use the scripts with the current data. Index: TODO.md =================================================================== --- TODO.md (revision 87) +++ TODO.md (revision 88) @@ -1,111 +1,111 @@ # TASKS -+ Data +- Data - Frontend -- Funding ++ Funding - Organization # Word search: - The word search should be weighted by the topological proximity of the words to one another. - Word paths, i.e. sequences of words, should be avoided, since they cannot be generated automatically and are highly error-prone. - Therefore, word insertions should also not be used to record alternative text sequences. # TODO ## Faksimile data input - word boxes on faksimile by drawing rects with inkscape [IN PROGRESS, see "Leitfaden.pdf"] - naming word boxes by using title of rects [IN PROGRESS, see "Leitfaden\_Kontrolle\_und\_Beschriftung\_der\_Wortrahmen.pdf"] - correcting faksimile svg or transkription xml if words do not correspond ## Processing ### faksimile data input, i.e. svg-file resulting from drawing boxes etc. with inkscape - process faksimile words: - join\_faksimileAndTranskription.py [DONE] - create faksimile position of line [TODO] - create a data input task for words that do not correspond [DONE] ### transkription, i.e.
svg-file resulting from pdf-file ->created with InDesign - fix: - xml/W\_II\_1\_page131.xml: - eigentlichste: [TODO] has two parts (['eigentlich', 'ste']), both are deleted, one part ('ste') has a box ('e') -> expected result: earlier_version = eigentlichste, earlier_version.earlier_version = eigentliche - xml/N\_VII\_1\_page138.xml: - AufBau [DONE] - Verschiedenes [DONE] - process text field: - Word [DONE] - SpecialWord - MarkForeignHands [DONE] - TextConnectionMark [DONE] - WordInsertionMark [DONE] - all paths -> page.categorize\_paths [TODO] - word-deletion -> Path [DONE] - make parts of word if only parts of a word are deleted, also introduce earlier version of word [DONE] - correction concerning punctuation in words that are deleted: the script does not recognize parts of deleted words as deleted if they consist of punctuation marks. [TODO] - word-undeletion (e.g. N VII 1, 18,6 -> "mit") - underline - text-area-deletion - text-connection-lines - boxes - process footnotes: - Return footnotes with standoff [DONE] - TextConnectionMark [DONE] - TextConnection with uncertainty [TODO] - "Fortsetzung [0-9]+,[0-9]+?" - "Fortsetzung von [0-9]+,[0-9]+?" - concerning Word: - uncertain transcription: "?" / may have bold word parts - atypical writing: "¿" and bold word parts - clarification corrections ("Verdeutlichungskorrekturen"): "Vk" and bold word parts - correction: "word>" and ">?" (with uncertainty) - concerning word deletion: - atypical writing: "¿" and "Durchstreichung" (see N VII 1, 11,2) - process margins: - MarkForeignHands [DONE] - ForeignHandTextAreaDeletion [TODO] - boxes: make earlier version of a word [TODO] - TextConnection [TODO] - from: ([0-9]+,)*[0-9]+ -) - to: -) ([0-9]+,)*[0-9]+ ## Datatypes - make datatypes: - Page [ok] --> page orientation!!!
- SimpleWord - SpecialWord - MarkForeignHands ("Zeichen für Fremde Hand") [DONE] - TextConnectionMark ("Anschlußzeichen") [DONE] - has a Reference - Word [ok] --> deal with non-horizontal text [DONE] --> hyphenation [TODO] --> add style info to word: font { German, Latin } [DONE] --> pen color [DONE] --> connect style with character glyph-id from svg path file --> has parts [DONE] --> versions: later version of earlier version [DONE] - WritingProcess >>>> use only in connection with earlier versions of word - correlates with font size: - biggest font to biggest-1 font: stage 0 - font in between: stage 1 - smallest font to smallest+1 font: stage 2 - Style [DONE] - WordPosition [ok] - TranskriptionPosition [ok] - FaksimilePosition [ok] - LineNumber [reDo] - change to Line - Reference [TODO]+ - TextConnection - needs change of LineNumber to Line - ForeignHandTextAreaDeletion [TODO] - Freehand: - Deletion [DONE] - make parts of word if only parts of a word are deleted, also introduce earlier version of word [DONE] - WordInsertionMark [reDO] - Underline [TODO] Index: svgscripts/datatypes/manuscript.py =================================================================== --- svgscripts/datatypes/manuscript.py (revision 87) +++ svgscripts/datatypes/manuscript.py (revision 88) @@ -1,137 +1,141 @@ #!/usr/bin/env python3 # -*- coding: utf-8 -*- """ This class can be used to represent an archival unity of manuscript pages, i.e. workbooks, notebooks, folders of handwritten pages. """ # Copyright (C) University of Basel 2019 {{{1 # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program. If not, see 1}}} __author__ = "Christian Steiner" __maintainer__ = __author__ __copyright__ = 'University of Basel' __email__ = "christian.steiner@unibas.ch" __status__ = "Development" __license__ = "GPL v3" __version__ = "0.0.1" from lxml import etree as ET from os.path import isfile import sys from .page import Page, FILE_TYPE_XML_MANUSCRIPT, FILE_TYPE_SVG_WORD_POSITION from .color import Color sys.path.append('py2ttl') from class_spec import SemanticClass sys.path.append('shared_util') from myxmlwriter import parse_xml_of_type, write_pretty, xml_has_type class ArchivalManuscriptUnity(SemanticClass): """ This class represents an archival unity of manuscript pages (workbooks, notebooks and portfolios of handwritten pages). @label archival unity of manuscript pages Args: title title of archival unity manuscript_type type of manuscript: 'Arbeitsheft', 'Notizheft', 'Mappe' manuscript_tree lxml.ElementTree """ XML_TAG = 'manuscript' XML_COLORS_TAG = 'colors' + TYPE_DICTIONARY = { 'Mp': 'Mappe', 'N': 'Notizheft', 'W': 'Arbeitsheft' } UNITTESTING = False def __init__(self, title='', manuscript_type='', manuscript_tree=None): self.colors = [] self.manuscript_tree = manuscript_tree self.manuscript_type = manuscript_type self.pages = [] self.styles = [] self.title = title + if self.manuscript_type == '' and self.title != ''\ + and self.title.split(' ')[0] in self.TYPE_DICTIONARY.keys(): + self.manuscript_type = self.TYPE_DICTIONARY[self.title.split(' ')[0]] def get_name_and_id(self): """Return an identification for object as 2-tuple. """ return '', self.title.replace(' ', '_') @classmethod def create_cls(cls, xml_manuscript_file, page_status_list=None, page_xpath='', update_page_styles=False): """Create an instance of ArchivalManuscriptUnity from a xml file of type FILE_TYPE_XML_MANUSCRIPT. 
:return: ArchivalManuscriptUnity """ manuscript_tree = parse_xml_of_type(xml_manuscript_file, FILE_TYPE_XML_MANUSCRIPT) title = manuscript_tree.getroot().get('title') if bool(manuscript_tree.getroot().get('title')) else '' manuscript_type = manuscript_tree.getroot().get('type') if bool(manuscript_tree.getroot().get('type')) else '' manuscript = cls(title=title, manuscript_type=manuscript_type, manuscript_tree=manuscript_tree) manuscript.colors = [ Color.create_cls(node=color_node) for color_node in manuscript_tree.xpath('.//' + cls.XML_COLORS_TAG + '/' + Color.XML_TAG) ] if page_xpath == '': page_status = '' if page_status_list is not None\ and type(page_status_list) is list\ and len(page_status_list) > 0: page_status = '[' + ' and '.join([ f'contains(@status, "{status}")' for status in page_status_list ]) + ']' page_xpath = f'//pages/page{page_status}/@output' manuscript.pages = [ Page(page_source)\ for page_source in manuscript_tree.xpath(page_xpath)\ if isfile(page_source) and xml_has_type(FILE_TYPE_SVG_WORD_POSITION, xml_source_file=page_source) ] if update_page_styles: for page in manuscript.pages: page.update_styles(manuscript=manuscript, add_to_parents=True) return manuscript def get_color(self, hex_color) -> Color: """Return color if it exists or None. """ if hex_color in [ color.hex_color for color in self.colors ]: return [ color for color in self.colors if color.hex_color == hex_color ][0] return None @classmethod def get_semantic_dictionary(cls): """ Creates a semantic dictionary as specified by SemanticClass. 
""" dictionary = {} class_dict = cls.get_class_dictionary() properties = {} properties.update(cls.create_semantic_property_dictionary('title', str, 1)) properties.update(cls.create_semantic_property_dictionary('manuscript_type', str, 1)) properties.update(cls.create_semantic_property_dictionary('styles', list)) properties.update(cls.create_semantic_property_dictionary('pages', list)) dictionary.update({cls.CLASS_KEY: class_dict}) dictionary.update({cls.PROPERTIES_KEY: properties}) return cls.return_dictionary_after_updating_super_classes(dictionary) def update_colors(self, color): """Update manuscript colors if color is not contained. """ if self.get_color(color.hex_color) is None: self.colors.append(color) if self.manuscript_tree is not None: if len(self.manuscript_tree.xpath('.//' + self.XML_COLORS_TAG)) > 0: self.manuscript_tree.xpath('.//' + self.XML_COLORS_TAG)[0].getparent().remove(self.manuscript_tree.xpath('.//' + self.XML_COLORS_TAG)[0]) colors_node = ET.SubElement(self.manuscript_tree.getroot(), self.XML_COLORS_TAG) for color in self.colors: color.attach_object_to_tree(colors_node) if not self.UNITTESTING: write_pretty(xml_element_tree=self.manuscript_tree, file_name=self.manuscript_tree.docinfo.URL,\ script_name=__file__, backup=True,\ file_type=FILE_TYPE_XML_MANUSCRIPT) def update_styles(self, *styles): """Update manuscript styles. """ for style in styles: if style not in self.styles: self.styles.append(style) Index: svgscripts/datatypes/super_page.py =================================================================== --- svgscripts/datatypes/super_page.py (revision 87) +++ svgscripts/datatypes/super_page.py (revision 88) @@ -1,289 +1,290 @@ #!/usr/bin/env python3 # -*- coding: utf-8 -*- """ This class can be used to represent a super page. 
""" # Copyright (C) University of Basel 2019 {{{1 # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program. If not, see 1}}} __author__ = "Christian Steiner" __maintainer__ = __author__ __copyright__ = 'University of Basel' __email__ = "christian.steiner@unibas.ch" __status__ = "Development" __license__ = "GPL v3" __version__ = "0.0.1" from lxml import etree as ET from os.path import isfile, basename, dirname from progress.bar import Bar from svgpathtools import svg2paths2, svg_to_paths from svgpathtools.parser import parse_path import sys import warnings from .image import Image, SVGImage from .faksimile_image import FaksimileImage from .mark_foreign_hands import MarkForeignHands from .text_connection_mark import TextConnectionMark from .text_field import TextField from .writing_process import WritingProcess class SuperPage: """ This super class represents a page. Args: xml_source_file (str): name of the xml file to be instantiated. xml_target_file (str): name of the xml file to which page info will be written. 
""" FILE_TYPE_SVG_WORD_POSITION = 'svgWordPosition' FILE_TYPE_XML_MANUSCRIPT = 'xmlManuscriptFile' PAGE_RECTO = 'recto' PAGE_VERSO = 'verso' STATUS_MERGED_OK = 'faksimile merged' STATUS_POSTMERGED_OK = 'words processed' UNITTESTING = False + XML_TAG = 'page' def __init__(self, xml_file, title=None, page_number='', orientation='North', page_type=PAGE_VERSO, should_xml_file_exist=False): self.properties_dictionary = {\ 'faksimile_image': (FaksimileImage.XML_TAG, None, FaksimileImage),\ 'faksimile_svgFile': ('data-source/@file', None, str),\ 'number': ('page/@number', str(page_number), str),\ 'orientation': ('page/@orientation', orientation, str),\ 'page_type': ('page/@pageType', page_type, str),\ 'pdfFile': ('pdf/@file', None, str),\ 'source': ('page/@source', None, str),\ 'svg_file': ('svg/@file', None, str),\ 'svg_image': (SVGImage.XML_TAG, None, SVGImage),\ 'text_field': (FaksimileImage.XML_TAG + '/' + TextField.XML_TAG, None, TextField),\ 'title': ('page/@title', title, str),\ } self.online_properties = [] self.line_numbers = [] self.lines = [] self.mark_foreign_hands = [] self.page_tree = None self.sonderzeichen_list = [] self.style_dict = {} self.text_connection_marks = [] self.word_deletion_paths = [] self.word_insertion_marks = [] self.words = [] self.writing_processes = [] self.xml_file = xml_file if not self.is_page_source_xml_file(): msg = f'ERROR: xml_source_file {self.xml_file} is not of type "{FILE_TYPE_SVG_WORD_POSITION}"' raise Exception(msg) self._init_tree(should_xml_file_exist=should_xml_file_exist) def add_style(self, sonderzeichen_list=[], letterspacing_list=[], style_dict={}, style_node=None): """Adds a list of classes that are sonderzeichen and a style dictionary to page. 
""" self.sonderzeichen_list = sonderzeichen_list self.letterspacing_list = letterspacing_list self.style_dict = style_dict if style_node is not None: self.style_dict = { item.get('name'): { key: value for key, value in item.attrib.items() if key != 'name' } for item in style_node.findall('.//class') } self.sonderzeichen_list = [ item.get('name') for item in style_node.findall('.//class')\ if bool(item.get('font-family')) and 'Sonderzeichen' in item.get('font-family') ] self.letterspacing_list = [ item.get('name') for item in style_node.findall('.//class')\ if bool(item.get('letterspacing-list')) ] elif bool(self.style_dict): style_node = ET.SubElement(self.page_tree.getroot(), 'style') if len(self.sonderzeichen_list) > 0: style_node.set('Sonderzeichen', ' '.join(self.sonderzeichen_list)) if len(self.letterspacing_list) > 0: style_node.set('letterspacing-list', ' '.join(self.letterspacing_list)) for key in self.style_dict.keys(): self.style_dict[key]['name'] = key ET.SubElement(style_node, 'class', attrib=self.style_dict[key]) fontsize_dict = { key: float(value.get('font-size').replace('px','')) for key, value in self.style_dict.items() if 'font-size' in value } fontsizes = sorted(fontsize_dict.values(), reverse=True) # create a mapping between fontsizes and word stages self.fontsizekey2stage_mapping = {} for fontsize_key, value in fontsize_dict.items(): if value >= fontsizes[0]-1: self.fontsizekey2stage_mapping.update({ fontsize_key: WritingProcess.FIRST_VERSION }) elif value <= fontsizes[len(fontsizes)-1]+1: self.fontsizekey2stage_mapping.update({ fontsize_key: WritingProcess.LATER_INSERTION_AND_ADDITION }) else: self.fontsizekey2stage_mapping.update({ fontsize_key: WritingProcess.INSERTION_AND_ADDITION }) def get_biggest_fontSize4styles(self, style_set={}): """Returns biggest font size from style_dict for a set of style class names. 
[:returns:] (float) biggest font size OR 1 if style_dict is empty """ if bool(self.style_dict): sorted_font_sizes = sorted( (float(self.style_dict[key]['font-size'].replace('px','')) for key in style_set if bool(self.style_dict[key].get('font-size'))), reverse=True) return sorted_font_sizes[0] if len(sorted_font_sizes) > 0 else 1 else: return 1 def get_line_number(self, y): """Returns line number id for element at y. [:return:] (int) line number id or -1 """ if len(self.line_numbers) > 0: result_list = [ line_number.id for line_number in self.line_numbers if y >= line_number.top and y <= line_number.bottom ] return result_list[0] if len(result_list) > 0 else -1 else: return -1 def init_all_properties(self, overwrite=False): """Initialize all properties. """ for property_key in self.properties_dictionary.keys(): if property_key not in self.online_properties: self.init_property(property_key, overwrite=overwrite) def init_property(self, property_key, value=None, overwrite=False): """Initialize a single property.
Args: property_key: key of property in self.__dict__ value: new value to set to property overwrite: whether or not to update values from xml_file (default: read only) """ if value is None: if property_key not in self.online_properties: xpath, value, cls = self.properties_dictionary.get(property_key) if len(self.page_tree.xpath('//' + xpath)) > 0: value = self.page_tree.xpath('//' + xpath)[0] if value is not None: if cls.__module__ == 'builtins': self.update_tree(value, xpath) self.__dict__.update({property_key: cls(value)}) else: value = cls(node=value)\ if type(value) != cls\ else value self.__dict__.update({property_key: value}) self.__dict__.get(property_key).attach_object_to_tree(self.page_tree) else: self.__dict__.update({property_key: value}) self.online_properties.append(property_key) elif overwrite or property_key not in self.online_properties: xpath, default_value, cls = self.properties_dictionary.get(property_key) if cls.__module__ == 'builtins': self.__dict__.update({property_key: cls(value)}) self.update_tree(value, xpath) else: self.__dict__.update({property_key: value}) self.__dict__.get(property_key).attach_object_to_tree(self.page_tree) self.online_properties.append(property_key) def is_locked(self): """Return true if page is locked. """ return len(self.page_tree.xpath('//metadata/lock')) > 0 def is_page_source_xml_file(self, source_tree=None): """Return true if xml_file is of type FILE_TYPE_SVG_WORD_POSITION. """ if not isfile(self.xml_file): return True if source_tree is None: source_tree = ET.parse(self.xml_file) return source_tree.getroot().find('metadata/type').text == self.FILE_TYPE_SVG_WORD_POSITION def lock(self, reference_file, message=''): """Lock tree such that ids of words etc. correspond to ids in reference_file, optionally add a message that will be shown. 
""" if not self.is_locked(): metadata = self.page_tree.xpath('./metadata')[0]\ if len(self.page_tree.xpath('./metadata')) > 0\ else ET.SubElement(self.page_tree.getroot(), 'metadata') lock = ET.SubElement(metadata, 'lock') ET.SubElement(lock, 'reference-file').text = reference_file if message != '': ET.SubElement(lock, 'message').text = message def unlock(self): """Unlock tree by removing the lock node from its metadata. """ if self.is_locked(): lock = self.page_tree.xpath('//metadata/lock')[0] lock.getparent().remove(lock) def update_and_attach_words2tree(self, update_function_on_word=None, include_special_words_of_type=[]): """Update word ids and attach them to page.page_tree. """ if not self.is_locked(): update_function_on_word = [ update_function_on_word ]\ if type(update_function_on_word) != list\ else update_function_on_word for node in self.page_tree.xpath('.//word|.//' + MarkForeignHands.XML_TAG + '|.//' + TextConnectionMark.XML_TAG): node.getparent().remove(node) for index, word in enumerate(self.words): word.id = index for func in update_function_on_word: if callable(func): func(word) word.attach_word_to_tree(self.page_tree) for index, mark_foreign_hands in enumerate(self.mark_foreign_hands): mark_foreign_hands.id = index if MarkForeignHands in include_special_words_of_type: for func in update_function_on_word: if callable(func): func(mark_foreign_hands) mark_foreign_hands.attach_word_to_tree(self.page_tree) for index, text_connection_mark in enumerate(self.text_connection_marks): text_connection_mark.id = index if TextConnectionMark in include_special_words_of_type: for func in update_function_on_word: if callable(func): func(text_connection_mark) text_connection_mark.attach_word_to_tree(self.page_tree) else: print('locked') def update_property_dictionary(self, property_key, default_value): """Update properties_dictionary.
""" content = self.properties_dictionary.get(property_key) if content is not None: self.properties_dictionary.update({property_key: (content[0], default_value, content[2])}) else: msg = f'ERROR: properties_dictionary does not contain a key {property_key}!' raise Exception(msg) def update_tree(self, value, xpath): """Update tree. """ node_name = dirname(xpath) node = self.page_tree.xpath('//' + node_name)[0]\ if len(self.page_tree.xpath('//' + node_name)) > 0\ else ET.SubElement(self.page_tree.getroot(), node_name) node.set(basename(xpath).replace('@', ''), str(value)) def _init_tree(self, should_xml_file_exist=False): """Initialize page_tree from xml_file if it exists. """ if isfile(self.xml_file): parser = ET.XMLParser(remove_blank_text=True) self.page_tree = ET.parse(self.xml_file, parser) elif not should_xml_file_exist: self.page_tree = ET.ElementTree(ET.Element('page')) self.page_tree.docinfo.URL = self.xml_file else: msg = f'ERROR: xml_source_file {self.xml_file} does not exist!' raise FileNotFoundError(msg) Index: svgscripts/create_manuscript.py =================================================================== --- svgscripts/create_manuscript.py (revision 0) +++ svgscripts/create_manuscript.py (revision 88) @@ -0,0 +1,204 @@ +#!/usr/bin/env python3 +# -*- coding: utf-8 -*- + +""" This program can be used to create a ArchivalManuscriptUnity. +""" +# Copyright (C) University of Basel 2020 {{{1 +# +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. 
+# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see 1}}} + +__author__ = "Christian Steiner" +__maintainer__ = __author__ +__copyright__ = 'University of Basel' +__email__ = "christian.steiner@unibas.ch" +__status__ = "Development" +__license__ = "GPL v3" +__version__ = "0.0.1" + +from colorama import Fore, Style +import getopt +import re +import sys +from os import listdir, sep, path +from os.path import isfile, isdir, dirname, basename +import lxml.etree as ET + +if dirname(__file__) not in sys.path: + sys.path.append(dirname(__file__)) + +from datatypes.manuscript import ArchivalManuscriptUnity +from datatypes.super_page import SuperPage + +sys.path.append('shared_util') +from myxmlwriter import parse_xml_of_type, write_pretty, xml_has_type, FILE_TYPE_SVG_WORD_POSITION, FILE_TYPE_XML_MANUSCRIPT + + + +UNITTESTING = False + +class ManuscriptCreator: + """This class can be used to create a ArchivalManuscriptUnity. + """ + + def __init__(self, xml_target_dir): + self.xml_target_dir = xml_target_dir + + def _get_or_create_element(self, node, xpath, create_id=False) ->ET.Element: + """Return a element with name == element_name, or create if it does not exist. 
+ """ + elements = node.xpath(xpath) + if len(elements) > 0: + return elements[0] + else: + if re.match(r'[a-z]+\[@[a-z-]+=', xpath): + element_name = re.match(r'(.+?)\[@[a-z]+.*', xpath).group(1) + num_elements = len(node.xpath(element_name)) + element = ET.SubElement(node, element_name) + element_attribute = re.match(r'[a-z]+\[@(.+?)=.*', xpath).group(1) + element_value = re.match(r'[a-z]+\[@[a-z-]+="(.+?)"]', xpath).group(1) + element.set(element_attribute, element_value) + if create_id: + element.set('id', str(num_elements)) + return element + else: + num_elements = len(node.xpath(xpath)) + element = ET.SubElement(node, xpath) + if create_id: + element.set('id', str(num_elements)) + return element + + def _create_or_update_pages(self, pages_node, manuscript_page_url_mapping): + """Create or update pages. + """ + for page_number, url in manuscript_page_url_mapping.items(): + xpath = SuperPage.XML_TAG + f'[@number="{page_number}"]' + page_node = self._get_or_create_element(pages_node, xpath, create_id=True) + if not bool(page_node.get('alias')): + page_node.set('alias', basename(url)) + + def create_or_update_manuscripts(self, manuscript_files, page_url_mapping): + """Create or update manuscripts. 
+ """ + for key in page_url_mapping: + relevant_files = [ manuscript_file for manuscript_file in manuscript_files\ + if basename(manuscript_file) == key.replace(' ', '_') + '.xml'] + if len(relevant_files) == 0: + manuscript_files.append(key.replace(' ', '_') + '.xml') + for manuscript_file in manuscript_files: + target_file = self.xml_target_dir + sep + manuscript_file\ + if dirname(manuscript_file) == ''\ + else manuscript_file + title = basename(target_file).replace('.xml', '').replace('_', ' ') + manuscript = ArchivalManuscriptUnity(title=title) + if isfile(target_file): + manuscript = ArchivalManuscriptUnity.create_cls(target_file) + else: + manuscript.manuscript_tree = ET.ElementTree(ET.Element(ArchivalManuscriptUnity.XML_TAG)) + manuscript.manuscript_tree.docinfo.URL = target_file + manuscript.manuscript_tree.getroot().set('title', manuscript.title) + manuscript.manuscript_tree.getroot().set('type', manuscript.manuscript_type) + if title in page_url_mapping.keys(): + pages_node = self._get_or_create_element(manuscript.manuscript_tree.getroot(), 'pages') + self._create_or_update_pages(pages_node, page_url_mapping[title]) + if not UNITTESTING: + write_pretty(xml_element_tree=manuscript.manuscript_tree, file_name=target_file,\ + script_name=__file__, file_type=FILE_TYPE_XML_MANUSCRIPT) + +def create_page_url_mapping(input_file, mapping_dictionary, default_title=''): + """Create a page to url mapping from input file. 
+ + File content: + + TITLE PAGENUMBER\nURL + + See: 'tests_svgscripts/test_data/content.txt' + """ + lines = [] + with open(input_file, 'r') as f: + lines = f.readlines() + key = None + url = None + current_key = default_title + for content in lines: + if content.startswith('http')\ + or content.startswith('www'): + url = content.replace('\n', '')\ + if content.startswith('http')\ + else 'http://' + content.replace('\n', '') + if current_key not in mapping_dictionary.keys(): + mapping_dictionary.update({current_key: {}}) + mapping_dictionary[current_key].update({key: url}) + else: + key_parts = [ part.strip() for part in content.replace('\n', '').replace('S.', '').split(',') ] + key_index = 0 + if len(key_parts) > 1: + title = key_parts[0] + if title not in mapping_dictionary.keys(): + current_key = title + mapping_dictionary.update({current_key: {}}) + key_index = 1 + key = key_parts[key_index] + +def usage(): + """prints information on how to use the script + """ + print(main.__doc__) + +def main(argv): + """This program can be used to create or update one or more manuscripts. + + + svgscripts/create_manuscript.py [OPTIONS] [, ...] [, ...] + + One or more files mapping pages to faksimile URLs, with 'txt'-suffix + manuscript file(s) (~ArchivalManuscriptUnity). + + OPTIONS: + -h|--help: show help + -t|--title=title manuscript's title, e.g. "Mp XV". 
+ -x|--xml-target-dir directory containing xmlManuscriptFile, default "./xml" + + :return: exit code (int) + """ + title = '' + xml_target_dir = ".{}xml".format(sep) + page_url_mapping = {} + + try: + opts, args = getopt.getopt(argv, "ht:x:", ["help", "title=", "xml-target-dir="]) + except getopt.GetoptError: + usage() + return 2 + + for opt, arg in opts: + if opt in ('-h', '--help'): + usage() + return 0 + elif opt in ('-t', '--title'): + title = arg + elif opt in ('-x', '--xml-target-dir'): + xml_target_dir = arg + + manuscript_files = [ arg for arg in args if arg.endswith('.xml')\ + and '_page' not in arg ] + input_files = [ arg for arg in args if arg.endswith('.txt')\ + and isfile(arg)] + for input_file in input_files: + create_page_url_mapping(input_file, page_url_mapping, default_title=title) + creator = ManuscriptCreator(xml_target_dir=xml_target_dir) + creator.create_or_update_manuscripts(manuscript_files, page_url_mapping) + return 0 + +if __name__ == "__main__": + sys.exit(main(sys.argv[1:])) Index: tests_svgscripts/test_create_manuscript.py =================================================================== --- tests_svgscripts/test_create_manuscript.py (revision 0) +++ tests_svgscripts/test_create_manuscript.py (revision 88) @@ -0,0 +1,50 @@ +import unittest +from os import sep, path, remove +from os.path import isfile +import lxml.etree as ET +import warnings +import sys + +sys.path.append('svgscripts') +import create_manuscript +from datatypes.manuscript import ArchivalManuscriptUnity + +class TestCreateManuscript(unittest.TestCase): + + def setUp(self): + create_manuscript.UNITTESTING = True + DATADIR = path.dirname(__file__) + sep + 'test_data' + self.content_file = DATADIR + sep + 'content.txt' + + def test_create_page_url_mapping(self): + mapping = {} + create_manuscript.create_page_url_mapping(self.content_file, mapping) + self.assertTrue('Mp XV' in mapping.keys()) + #print(mapping) + #mapping = {} + 
#create_manuscript.create_page_url_mapping('content.txt', mapping, default_title='Mp XV') + #print(mapping) + creator = create_manuscript.ManuscriptCreator('') + pages_node = ET.Element('pages') + #creator._create_or_update_pages(pages_node, mapping['Mp XV']) + #print(ET.dump(pages_node)) + + def test_get_or_create_element(self): + creator = create_manuscript.ManuscriptCreator('') + manuscript_tree = ET.ElementTree(ET.Element(ArchivalManuscriptUnity.XML_TAG)) + self.assertEqual(len(manuscript_tree.xpath('test')), 0) + node = creator._get_or_create_element(manuscript_tree.getroot(), 'test', create_id=True) + self.assertEqual(len(manuscript_tree.xpath('test')), 1) + node = creator._get_or_create_element(manuscript_tree.getroot(), 'test[@id="0"]') + self.assertEqual(len(manuscript_tree.xpath('test')), 1) + node = creator._get_or_create_element(manuscript_tree.getroot(), 'page[@number="10"]') + self.assertEqual(node.get('number'), '10') + node = creator._get_or_create_element(manuscript_tree.getroot(), 'page[@number="0"]', create_id=True) + self.assertEqual(node.get('id'), '1') + self.assertEqual(node.get('number'), '0') + + def test_main(self): + create_manuscript.main(['-x', 'xml', '-t', 'Mp XV', self.content_file]) + +if __name__ == "__main__": + unittest.main() Index: tests_svgscripts/test_data/content.txt =================================================================== --- tests_svgscripts/test_data/content.txt (revision 0) +++ tests_svgscripts/test_data/content.txt (revision 88) @@ -0,0 +1,76 @@ +Mp XV, S. 74r +www.nietzschesource.org/DFGA/Mp-XV-2c,1 +Mp XV, S. 74v +http://www.nietzschesource.org/DFGA/Mp-XV-2c,2 +Mp XV, S. 75r +www.nietzschesource.org/DFGA/Mp-XV-2c,3 +Mp XV, S. 75v +http://www.nietzschesource.org/DFGA/Mp-XV-2c,4 +Mp XV, S. 76r +www.nietzschesource.org/DFGA/Mp-XV-2c,5 +Mp XV, S. 77r +www.nietzschesource.org/DFGA/Mp-XV-2c,7 +Mp XV, S. 78r +www.nietzschesource.org/DFGA/Mp-XV-2d,1 +Mp XV, S. 
78v +http://www.nietzschesource.org/DFGA/Mp-XV-2d,2 +Mp XV, S. 79r +www.nietzschesource.org/DFGA/Mp-XV-2d,3 +Mp XV, S. 79v +http://www.nietzschesource.org/DFGA/Mp-XV-2d,4 +Mp XV, S. 80r +www.nietzschesource.org/DFGA/Mp-XV-2d,5 +Mp XV, S. 80v +http://www.nietzschesource.org/DFGA/Mp-XV-2d,6 +Mp XV, S. 81r +www.nietzschesource.org/DFGA/Mp-XV-2d,7 +Mp XV, S. 81v +http://www.nietzschesource.org/DFGA/Mp-XV-2d,8 +Mp XV, S. 82r +www.nietzschesource.org/DFGA/Mp-XV-2d,9 +Mp XV, S. 82v +http://www.nietzschesource.org/DFGA/Mp-XV-2d,10 +Mp XV, S. 83r +www.nietzschesource.org/DFGA/Mp-XV-2d,11 +Mp XV, S. 83v +http://www.nietzschesource.org/DFGA/Mp-XV-2d,12 +Mp XV, S. 84r +www.nietzschesource.org/DFGA/Mp-XV-2d,13 +Mp XV, S. 85v +http://www.nietzschesource.org/DFGA/Mp-XV-2d,16 +Mp XV, S. 86r +www.nietzschesource.org/DFGA/Mp-XV-2d,17 +Mp XV, S. 86v +http://www.nietzschesource.org/DFGA/Mp-XV-2d,18 +Mp XV, S. 87r +www.nietzschesource.org/DFGA/Mp-XV-2d,19 +Mp XV, S. 87v +http://www.nietzschesource.org/DFGA/Mp-XV-2d,20 +Mp XV, S. 88r +www.nietzschesource.org/DFGA/Mp-XV-2d,21 +Mp XV, S. 89r +www.nietzschesource.org/DFGA/Mp-XV-2d,23 +Mp XV, S. 90r +www.nietzschesource.org/DFGA/Mp-XV-2d,25 +Mp XV, S. 92r +www.nietzschesource.org/DFGA/Mp-XV-2d,29 +Mp XV, S. 92v +http://www.nietzschesource.org/DFGA/Mp-XV-2d,30 +Mp XV, S. 94r +http://www.nietzschesource.org/DFGA/Mp-XV-2e,1 +Mp XV, S. 94v +http://www.nietzschesource.org/DFGA/Mp-XV-2e,2 +Mp XV, S. 95r +http://www.nietzschesource.org/DFGA/Mp-XV-2e,3 +Mp XV, S. 96r +http://www.nietzschesource.org/DFGA/Mp-XV-2e,5 +Mp XV, S. 97r +http://www.nietzschesource.org/DFGA/Mp-XV-2e,7 +Mp XV, S. 98v +http://www.nietzschesource.org/DFGA/Mp-XV-2e,10 +Mp XV, S. 99r +http://www.nietzschesource.org/DFGA/Mp-XV-2e,11 +Mp XV, S. 100r +http://www.nietzschesource.org/DFGA/Mp-XV-2e,13 +Mp XV, S. 
113r +www.nietzschesource.org/DFGA/Mp-XV-3c,1 Index: shared_util/myxmlwriter.py =================================================================== --- shared_util/myxmlwriter.py (revision 87) +++ shared_util/myxmlwriter.py (revision 88) @@ -1,203 +1,203 @@ #!/usr/bin/env python3 # -*- coding: utf-8 -*- """ This program can be used to pretty-write a xml string to a xml file. """ # Copyright (C) University of Basel 2019 {{{1 # # This program is free software: you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program. If not, see 1}}} import inspect import xml.dom.minidom as MD import xml.etree.ElementTree as ET import lxml.etree as LET from datetime import datetime from rdflib import URIRef from os import makedirs -from os.path import sep, basename, dirname +from os.path import sep, basename, dirname, isfile import sys import warnings sys.path.append('svgscripts') from datatypes.page import FILE_TYPE_SVG_WORD_POSITION, FILE_TYPE_XML_MANUSCRIPT __author__ = "Christian Steiner" __maintainer__ = __author__ __copyright__ = 'University of Basel' __email__ = "christian.steiner@unibas.ch" __status__ = "Development" __license__ = "GPL v3" __version__ = "0.0.1" FILE_TYPE_SVG_WORD_POSITION = FILE_TYPE_SVG_WORD_POSITION FILE_TYPE_XML_MANUSCRIPT = FILE_TYPE_XML_MANUSCRIPT FILE_TYPE_XML_DICT = 'xml-dictionary' def attach_dict_to_xml_node(dictionary, xml_node): """Create a xml tree from a dictionary. 
""" for key in dictionary.keys(): elem_type = type(dictionary[key]) if elem_type != dict: node = LET.SubElement(xml_node, key, attrib={'type': elem_type.__name__}) node.text = str(dictionary[key]) else: attach_dict_to_xml_node(dictionary[key], LET.SubElement(xml_node, key)) def dict2xml(dictionary, target_file_name): """Write dict 2 xml. """ xml_tree = LET.ElementTree(LET.Element('root')) attach_dict_to_xml_node(dictionary, LET.SubElement(xml_tree.getroot(), 'dict')) write_pretty(xml_element_tree=xml_tree, file_name=target_file_name,\ script_name=inspect.currentframe().f_code.co_name, file_type=FILE_TYPE_XML_DICT) def get_dictionary_from_node(node): """Return dictionary from node. :return: dict """ new_dict = {} if len(node.getchildren()) > 0: new_dict.update({ node.tag : {} }) for child_node in node.getchildren(): new_dict.get(node.tag).update(get_dictionary_from_node(child_node)) else: elem_cls = eval(node.get('type')) if bool(node.get('type')) else str value = elem_cls(node.text) if bool(node.text) else None new_dict.update({ node.tag: value }) return new_dict def lock_xml_tree(xml_element_tree, **locker_dict): """Lock xml_element_tree. """ if xml_element_tree is not None and not test_lock(xml_element_tree, silent=True): message = locker_dict.get('message') if bool(locker_dict.get('message')) else '' reference_file = locker_dict.get('reference_file') if bool(locker_dict.get('reference_file')) else '' metadata = xml_element_tree.xpath('./metadata')[0]\ if len(xml_element_tree.xpath('./metadata')) > 0\ else LET.SubElement(xml_element_tree.getroot(), 'metadata') lock = LET.SubElement(metadata, 'lock') LET.SubElement(lock, 'reference-file').text = reference_file if message != '': LET.SubElement(lock, 'message').text = message def parse_xml_of_type(xml_source_file, file_type): """Return a xml_tree from xml_source_file is file is of type file_type. 
""" parser = LET.XMLParser(remove_blank_text=True) xml_tree = LET.parse(xml_source_file, parser) if not xml_has_type(file_type, xml_tree=xml_tree): msg = 'File {} is not of type {}!'.format(xml_source_file, file_type) raise Exception(msg) return xml_tree def test_lock(xml_element_tree=None, silent=False): """Test if xml_element_tree is locked and print a message. :return: True if locked """ if xml_element_tree is None: return False if len(xml_element_tree.findall('./metadata/lock')) > 0: reference_file = xml_element_tree.findall('./metadata/lock/reference-file') message = xml_element_tree.findall('./metadata/lock/message') if not silent: warning_msg = 'File {0} is locked!'.format(xml_element_tree.docinfo.URL) if len(reference_file) > 0: warning_msg = warning_msg.replace('!', ' ') + 'on {0}.'.format(reference_file[0].text) if len(message) > 0: warning_msg = warning_msg + '\n{0}'.format(message[0].text) warnings.warn(warning_msg) return True return False def update_metadata(xml_element_tree, script_name, file_type=None): """Updates metadata of xml tree. """ if len(xml_element_tree.getroot().findall('./metadata')) > 0: if len(xml_element_tree.getroot().find('./metadata').findall('./modifiedBy[@script="{}"]'.format(script_name))) == 0: LET.SubElement(xml_element_tree.getroot().find('./metadata'), 'modifiedBy', attrib={'script': script_name}) xml_element_tree.getroot().find('./metadata').findall('./modifiedBy[@script="{}"]'.format(script_name))[0].text = \ datetime.now().strftime('%Y-%m-%d %H:%M:%S') else: metadata = LET.SubElement(xml_element_tree.getroot(), 'metadata') if file_type is not None: LET.SubElement(metadata, 'type').text = file_type createdBy = LET.SubElement(metadata, 'createdBy') LET.SubElement(createdBy, 'script').text = script_name LET.SubElement(createdBy, 'date').text = datetime.now().strftime('%Y-%m-%d %H:%M:%S') def write_backup(xml_element_tree: LET.ElementTree, file_type=None, bak_dir='./bak') -> str: """Back up a xml_source_file. 
:return: target_file_name """ date_string = datetime.now().strftime('%Y-%m-%d_%H:%M:%S') makedirs(bak_dir, exist_ok=True) target_file_name = bak_dir + sep + basename(xml_element_tree.docinfo.URL) + '_' + date_string reference_file = xml_element_tree.docinfo.URL write_pretty(xml_element_tree=xml_element_tree, file_name=target_file_name,\ script_name=__file__ + '({0},{1})'.format(inspect.currentframe().f_code.co_name, reference_file),\ file_type=file_type) return target_file_name def write_pretty(xml_string=None, xml_element_tree=None, file_name=None, script_name=None, backup=False, file_type=None, **locker_dict): """Writes a xml string pretty to a file. """ if not bool(xml_string) and not bool(xml_element_tree): raise Exception("write_pretty needs a string or a xml.ElementTree!") if not test_lock(xml_element_tree): if len(locker_dict) > 0 and bool(locker_dict.get('reference_file')): lock_xml_tree(xml_element_tree, **locker_dict) if script_name is not None and xml_element_tree is not None: update_metadata(xml_element_tree, script_name, file_type=file_type) if file_name is None and xml_element_tree is not None\ and xml_element_tree.docinfo is not None and xml_element_tree.docinfo.URL is not None: file_name = xml_element_tree.docinfo.URL if file_name is None: raise Exception("write_pretty needs a file_name or a xml.ElementTree with a docinfo.URL!") if backup and xml_element_tree is not None: write_backup(xml_element_tree, file_type=file_type) dom = MD.parseString(xml_string) if(bool(xml_string)) else MD.parseString(ET.tostring(xml_element_tree.getroot())) f = open(file_name, "w") dom.writexml(f, addindent="\t", newl='\n', encoding='utf-8') f.close() def xml2dict(xml_source_file): """Create dict from xml_source_file of Type FILE_TYPE_XML_DICT. 
     :return: dict
     """
     new_dict = {}
     xml_tree = LET.parse(xml_source_file)
     if xml_has_type(FILE_TYPE_XML_DICT, xml_tree=xml_tree)\
             and len(xml_tree.xpath('/root/dict')) > 0:
         for node in xml_tree.xpath('/root/dict')[0].getchildren():
             new_dict.update(get_dictionary_from_node(node))
     else:
         msg = 'File {} is not of type {}!'.format(xml_source_file, FILE_TYPE_XML_DICT)
         raise Exception(msg)
     return new_dict
 
 def xml_has_type(file_type, xml_source_file=None, xml_tree=None):
     """Return true if xml_source_file/xml_tree has file type == file_type.
     """
     if xml_tree is None and xml_source_file is None:
         return False
-    if xml_tree is None:
+    if xml_tree is None and isfile(xml_source_file):
         xml_tree = LET.parse(xml_source_file)
     if len(xml_tree.xpath('//metadata/type/text()')) < 1:
         return False
     return xml_tree.xpath('//metadata/type/text()')[0] == file_type
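
Note on the `xml_has_type` change in revision 88: the new `isfile(xml_source_file)` guard skips parsing when the file does not exist, but `xml_tree` then stays `None`, so the following `xml_tree.xpath(...)` call would presumably raise an `AttributeError` rather than return `False`. The sketch below shows one possible way to close that gap. It is not the project's code: the name `xml_has_type_safe` is hypothetical, and it uses the standard library's `xml.etree.ElementTree` instead of `lxml` only so that it runs without extra dependencies. It also assumes that `metadata` sits below the root element, as `update_metadata` creates it.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from os.path import isfile


def xml_has_type_safe(file_type, xml_source_file=None, xml_tree=None):
    """Hypothetical variant of xml_has_type: return False instead of failing
    when no tree is given and the source file is missing."""
    if xml_tree is None:
        # Covers both "no file name given" and "file name points to nothing".
        if xml_source_file is None or not isfile(xml_source_file):
            return False
        xml_tree = ET.parse(xml_source_file)
    # Find the first <type> inside a <metadata> element below the root.
    type_node = xml_tree.getroot().find('.//metadata/type')
    if type_node is None or type_node.text is None:
        return False
    return type_node.text == file_type


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp_dir:
        xml_file = os.path.join(tmp_dir, 'dict.xml')
        with open(xml_file, 'w', encoding='utf-8') as f:
            f.write('<root><metadata><type>xml-dictionary</type></metadata></root>')
        missing_file = os.path.join(tmp_dir, 'missing.xml')
        print(xml_has_type_safe('xml-dictionary', xml_source_file=xml_file))
        print(xml_has_type_safe('xml-dictionary', xml_source_file=missing_file))
```

Whether a missing file should yield `False` or an explicit error is a design choice. Returning `False` keeps callers such as `parse_xml_of_type` simple, while raising would make a wrong path easier to spot.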